Digital Data Gathering vs Data Processing in Enterprise

Enterprise Digital Transformation

Recent years have seen an explosion in digital transformation across numerous industries, challenging traditional approaches to enterprise processes and functions. These approaches are being replaced by digital systems, processes and functions. In many cases they are being reimagined entirely as new greenfield digital systems across the enterprise.

An integral part of this transformation is the digitization of existing processes, practices and functions across the enterprise. This includes the collection, flow, and processing of data, information, and documents, since every enterprise process involves gathering and processing data, information, and documents.

As part of the digital transformation of the enterprise, it is vital that we distinguish two separate digitization efforts:

  • Data Gathering Digitization  

  • Data Processing Digitization  

We discuss in more detail what each of these efforts entails, potential solutions for each, and how they differ. For a successful digital transformation of a process, both aspects should be carefully considered, and the solutions provided for each should be complete and compatible with each other.

-- For the digital transformation revolution to reach its full potential, enterprises need to address both data gathering digitization and data processing digitization --

Data Gathering Digitization  

Data gathering is the process of collecting data, information and documents from external and internal sources and systems. This information can generally be categorized along two dimensions: (i) machine readable vs. non-machine readable and (ii) standard (templated) vs. non-standard (non-templated). In the following we go over these categories.

Machine readable  

Data, information and documents provided in a format which a computer can process. Examples include Word documents, Excel sheets, JSON files, XML files, and so on. These can be readily read and “understood” by a computer and can be subsequently processed.

Machine readable sources can be divided into two general format categories:

  • Standard (templated)  

A data source which follows a predetermined template. The template is generally defined through a set data structure and data fields, and possibly a data model governing the relationships between the fields. The most common forms of this category include JSON files, XML files, and so on.

For digitization purposes this is the best form in which data, information and documents can be gathered, as the computer can easily “read” and “understand” the data fields and knows what each data field represents (“means”).

Further, this form allows standard tests to be performed on the data as soon as it is received. These tests serve both to (i) validate that the data entry conforms to the format specified for that field, and to (ii) verify that the data field complies with the predetermined rules imposed on it.

For example, it is possible to immediately (i) validate that the “date” data field is in the correct format and (ii) verify that the date falls within the past 90 days of today's date. Any violation of the validation or verification rules can then be flagged immediately upon receiving the data, which prevents “incorrect” data from propagating down the processing pipeline.
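
The two checks on the “date” field can be sketched in a few lines of Python. This is a minimal illustration rather than a prescribed implementation: the function name `check_date_field`, the `YYYY-MM-DD` format, and the sample record are all assumptions made for the example.

```python
from datetime import date, datetime, timedelta
from typing import List, Optional


def check_date_field(raw: str, today: Optional[date] = None, max_age_days: int = 90) -> List[str]:
    """Return the problems found in a templated "date" field (empty list if none)."""
    today = today or date.today()

    # (i) Validation: does the entry conform to the expected format?
    # The YYYY-MM-DD format is an assumption for this sketch.
    try:
        value = datetime.strptime(raw, "%Y-%m-%d").date()
    except ValueError:
        return [f"validation failed: {raw!r} is not in YYYY-MM-DD format"]

    # (ii) Verification: does the value comply with the rule imposed on the field?
    problems = []
    if value > today:
        problems.append("verification failed: date is in the future")
    elif today - value > timedelta(days=max_age_days):
        problems.append(f"verification failed: date is more than {max_age_days} days old")
    return problems


# Hypothetical templated record, checked as soon as it is received.
record = {"invoice_id": "INV-001", "date": "2024-03-01"}
problems = check_date_field(record["date"], today=date(2024, 3, 15))
print(problems or "date field accepted")
```

Passing `today` explicitly keeps the check repeatable; in a live pipeline it would default to the current date.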

  • Non-standard (non-templated)

Data sources which do not follow a predetermined template. Free-flowing text or a collection of unstructured data are the most common forms of this category, generally provided as Word documents, CSV files, and so on.

For digitization purposes, in this category, the computer can “read” the data fields but does not necessarily “understand” the data, and as a result does not necessarily know what each data field represents (“means”).

These data sources require “learning” (inferring) what the data “means”. This can generally be accomplished through predetermined rules provided externally, or through learning techniques which identify the values of interest (the “meaning” of the data).

In this form, once the values of interest have been identified from the input data, the validation and verification steps can be performed on the data before data processing starts.
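
As a rough sketch of the “predetermined rules provided externally” route, the Python below applies a small set of regular-expression rules to free text to pull out the values of interest. The rule set, the field names, and the sample sentence are hypothetical; a learning-based approach would replace the hand-written `RULES` table.

```python
import re
from typing import Dict, Optional

# Hypothetical externally supplied rules: field name -> pattern with one capture group.
RULES = {
    "invoice_number": re.compile(r"invoice\s+(?:no\.?|number)\s*[:#]?\s*([A-Z0-9-]+)", re.IGNORECASE),
    "invoice_date": re.compile(r"dated\s+(\d{4}-\d{2}-\d{2})", re.IGNORECASE),
    "amount_due": re.compile(r"amount\s+due\s*(?:is|of|:)?\s*\$?([\d,]+\.\d{2})", re.IGNORECASE),
}


def extract_fields(text: str) -> Dict[str, Optional[str]]:
    """Apply each rule to the free text; a field the rules cannot find is None."""
    found = {}
    for name, pattern in RULES.items():
        match = pattern.search(text)
        found[name] = match.group(1) if match else None
    return found


note = "Please find attached invoice no. INV-2024-017 dated 2024-03-01. The amount due is $1,250.00."
fields = extract_fields(note)
print(fields)
```

Once the values have been identified this way, they can go through the same validation and verification steps as templated data.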

Non-machine readable

Data, information and documents provided in a format which a computer cannot process as text. Examples include images, video, and audio files. Importantly, hand-written text and scanned PDF files are primary means of storing data in enterprise environments.

Digitization of each of these sources is an important and active area of interest in the digital transformation efforts of enterprises and the technology start-up community. While there has been significant progress in each of these areas, none is yet at a highly reliable, commercially viable, production-ready state. While there are manual and hybrid manual-automated approaches to digitization, a fully automated approach remains a technical challenge.

As a result, non-machine readable sources of data, information and documents remain a major challenge to the full digitization of existing enterprise processes.

Non-machine readable sources can be divided into two general format categories:

  • Standard (templated [Ref1, Ref2])

A data source which follows a predetermined template, generally defined as a template for the document which contains the data. The template typically places various pieces of data and information in specific formats at specific locations on the document. Examples include tax forms, payslips, and utility bills.

For digitization purposes, OCR technology does very well at scanning and “interpreting” the data contained in these sources (documents). This is primarily because a “map” of the document can be provided to the OCR process prior to scanning, i.e. the location and the type of data to expect at that location. That way the OCR process looks for a set of letters, digits, and characters in the specified location on the map and “reads” the data.

A simple but very effective example is Apple Pay, which scans your physical credit card, extracts the card number, expiry date, and similar details, and stores them in a machine readable format. All you need to do is hold your camera above the card at a height at which the edges of the card fit into the camera's view. The software “knows” exactly where to look for the card number on the image the camera sees, and captures it. This is possible because the software has a “map” of the physical card.

Following that, the computer can easily “read” and “understand” the data fields and knows what each data field represents (“means”).

Further, this allows standard tests to be performed on the data as soon as it is read. These tests serve both to (i) validate that the data entry conforms to the format specified for that field, and to (ii) verify that the data field complies with the predetermined rules imposed on it.
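
The “map” idea can be illustrated without committing to any particular OCR product. The sketch below assumes an OCR step has already produced recognized words with bounding boxes; the `Word` type, the `PAYSLIP_TEMPLATE` regions, and the sample coordinates are hypothetical stand-ins. Each word is assigned to the template region that contains its center.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom) as fractions of the page


@dataclass
class Word:
    """One recognized word and its position; a stand-in for real OCR engine output."""
    text: str
    box: Box


# Hypothetical template "map": where each field sits on the form.
PAYSLIP_TEMPLATE: Dict[str, Box] = {
    "employee_name": (0.10, 0.10, 0.60, 0.15),
    "pay_date": (0.70, 0.10, 0.95, 0.15),
    "net_pay": (0.70, 0.80, 0.95, 0.85),
}


def center(box: Box) -> Tuple[float, float]:
    left, top, right, bottom = box
    return ((left + right) / 2, (top + bottom) / 2)


def read_fields(words: List[Word], template: Dict[str, Box]) -> Dict[str, str]:
    """Assign each word to the region containing its center, top-to-bottom then left-to-right."""
    fields = {name: [] for name in template}
    for word in sorted(words, key=lambda w: (w.box[1], w.box[0])):
        x, y = center(word.box)
        for name, (left, top, right, bottom) in template.items():
            if left <= x <= right and top <= y <= bottom:
                fields[name].append(word.text)
                break
    return {name: " ".join(parts) for name, parts in fields.items()}


words = [
    Word("Doe", (0.22, 0.11, 0.30, 0.14)),
    Word("Jane", (0.11, 0.11, 0.20, 0.14)),
    Word("2024-03-01", (0.72, 0.11, 0.90, 0.14)),
    Word("PAYSLIP", (0.40, 0.02, 0.60, 0.06)),  # outside every region, so ignored
    Word("3,200.00", (0.75, 0.81, 0.90, 0.84)),
]
result = read_fields(words, PAYSLIP_TEMPLATE)
print(result)
```

Because every extracted value arrives already labelled with its field name, the validation and verification tests can run on it immediately.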

On a separate note, effort should be made to digitize the source of the information. Given that the data is already formatted as a template, providing it in digitized format will be relatively straightforward, as the fields and their formats are already known.

  • Non-standard (non-templated [Ref3])

This is the most difficult format in which data can be provided. To the extent that OCR technology can extract the data in machine readable format, this data can be digitized.

OCR technology is being actively developed and enhanced to provide increasingly higher levels of reliability in reading this type of data. However, it is still an evolving technology which is yet to reach its full potential.

It might be possible to teach the OCR system to look for certain fields, values, words, or numbers. This would help transition the problem from attempting to extract the entire data from a document to simply looking for certain specific data.
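
One way to picture “simply looking for certain specific data” is to search the OCR output only for a handful of labels, tolerating small misreads. The sketch below is an assumption-laden illustration: the target labels, the “label: value” line layout, and the sample lines are hypothetical, and the fuzzy matching uses Python's standard `difflib` rather than any OCR vendor feature.

```python
import difflib
import re
from typing import Dict, List, Optional

TARGET_LABELS = ["total", "due date"]  # hypothetical fields of interest


def find_values(ocr_lines: List[str], labels: List[str], cutoff: float = 0.75) -> Dict[str, Optional[str]]:
    """Keep only "label: value" lines whose label approximately matches a target label."""
    found: Dict[str, Optional[str]] = {label: None for label in labels}
    for line in ocr_lines:
        match = re.match(r"\s*([^:]+):\s*(.+)", line)
        if not match:
            continue  # not a "label: value" line, so skip it
        seen_label = match.group(1).strip().lower()
        value = match.group(2).strip()
        # Tolerate small OCR misreads in the label, e.g. "Tota1" for "Total".
        close = difflib.get_close_matches(seen_label, labels, n=1, cutoff=cutoff)
        if close and found[close[0]] is None:
            found[close[0]] = value
    return found


ocr_lines = [
    "Dear customer, thank you for your business.",
    "Tota1: 1,250.00",
    "Due dale: 2024-03-31",
    "Ref: 998877",
]
values = find_values(ocr_lines, TARGET_LABELS)
print(values)
```

The rest of the document is ignored, which narrows the task from full extraction to a targeted lookup.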

On a separate note, effort should be made to digitize the source of the information.

Summary of the four categories:

Standardized, Machine Readable
  • Ideal for data processing

Standardized, Non-Machine Readable
  • Technologies such as OCR could be very effective at transforming data into machine readable format

Non-Standardized, Machine Readable
  • Attempts should be made to standardize the source (provider) of the data
  • Rules based (knowledge graphs) or machine learning techniques can be used to provide data structure and standardization
  • In certain cases it might be possible to process the data directly through machine learning techniques

Non-Standardized, Non-Machine Readable
  • Attempts should be made to standardize the source (provider) of the data
  • Attempts should be made to make the source (provider) of the data machine readable
  • Technologies such as OCR could be relatively effective at transforming data into machine readable format

-- Data sources should migrate towards fully machine readable and fully standardized (templated) formats. This is an essential step towards full digital transformation of enterprises ... --

Data Processing Digitization

Once the data is in machine readable and standard format, it can be processed in a number of different ways: via (i) a predetermined set of rules (business logic, business rules), (ii) statistical analysis of the data, or (iii) machine learning. Each of these encompasses a huge area of enterprise data processing. A detailed view of the data processing approaches and techniques is beyond the scope of this presentation.

Digitization of data gathering considerably improves the efficiency of data processing, and opens the door to numerous new and exciting techniques and approaches to data processing which are not possible without standardized, machine readable data gathering.

-- Data processing is vastly enhanced and improved by fully machine readable and fully standardized (templated) data formats --

Implementation in Enterprise Environments

Enterprises need to focus on and invest heavily in standardized and machine readable sources of data. This effort reaches far beyond any specific technology used for processing the data, and it will make enterprises ready for the digital transformation and processing efforts of today, tomorrow and beyond.

On the digitization front there has been considerable effort and improvement in recent years in several areas, including OCR and voice transcription.

On creating structured data, numerous enterprise transformations are under way to digitize existing and new sources of data across enterprises and, in the process, to structure the data and information these sources provide.

Further, there are artificial intelligence approaches to providing “meaning” to unstructured data, that is, to “learning” the structure of the data as opposed to explicitly modelling it. The explicit modelling of data in a structured way and the implicit learning of the data structure are two complementary approaches, each suited to specific use cases. The contrast between explicitly structuring the data and implicitly learning its structure is part of the broader field of “Symbolic Reasoning” vs. “Machine Learning”.
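
The difference between the two approaches can be shown on a toy scale. In the Python sketch below, the `Payment` dataclass declares the structure explicitly, while `infer_structure` guesses field types from sample rows. The guessing is a simple majority-vote heuristic standing in for a real learning technique, and the field names and sample rows are invented for the illustration.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date, datetime
from typing import Dict, List


# Explicit modelling: the structure is declared up front.
@dataclass
class Payment:
    payer: str
    amount: float
    paid_on: date


def guess_type(value: str) -> str:
    """Guess a type for one raw string by trying a date parse, then a number parse."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return "date"
    except ValueError:
        pass
    try:
        float(value)
        return "number"
    except ValueError:
        return "text"


def infer_structure(rows: List[Dict[str, str]]) -> Dict[str, str]:
    """Implicit structure: take the most common guessed type for each field."""
    votes: Dict[str, Counter] = {}
    for row in rows:
        for field, value in row.items():
            votes.setdefault(field, Counter())[guess_type(value)] += 1
    return {field: counts.most_common(1)[0][0] for field, counts in votes.items()}


explicit = Payment(payer="Acme Ltd", amount=1250.00, paid_on=date(2024, 3, 1))

# Rows with uninformative column names, as might come from an unlabelled export.
rows = [
    {"a": "Acme Ltd", "b": "1250.00", "c": "2024-03-01"},
    {"a": "Globex", "b": "980.50", "c": "2024-03-04"},
    {"a": "Initech", "b": "n/a", "c": "2024-03-09"},
]
inferred = infer_structure(rows)
print(inferred)
```

The explicit model fails fast when data does not fit; the inferred one adapts to whatever arrives but can be wrong, which is why the two are complementary.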

-- To the extent possible, sources of data and information in the enterprise need to migrate towards machine readability and standardization.

Beyond that, a focus on technologies such as OCR, voice transcription, and deriving implicit structure through learning from the data itself could lead to major leaps forward towards the goals of enterprise digital transformation --

Implications for Enterprise Blockchain Technology

Enterprises need to focus on and invest heavily in standardized and machine readable sources of data. While this is important for the implementation of Blockchain technology, the effort goes far beyond any specific technology used for processing the data.

Blockchain technology is primarily a data processing technology which, in its various implementations, receives information from different sources, validates and verifies the data and information, processes it, triggers the next steps, and stores the data and metadata associated with the process.

The digitization (machine readability) of the sources of information, data and documents associated with a process or network sits outside Blockchain's core focus. Blockchain is effectively the recipient of these digitized sources, which it subsequently processes.

Blockchain technology is rather flexible in terms of the exact digitization and formatting of the data and information it receives, as long as the data is presented to the Blockchain in a templated, machine readable format. This can range from advanced API connectivity to Excel sheets uploaded into a folder. As long as the data is templated and machine readable, Blockchain will do the rest.
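
To make the point about input flexibility concrete, the sketch below normalizes the same record arriving two ways, as a JSON payload and as an uploaded sheet, into one agreed template. Everything here is an assumption for illustration: the template fields are hypothetical, a CSV export stands in for an Excel sheet, and the hand-off to a Blockchain platform is deliberately left out.

```python
import csv
import io
import json
from typing import Dict, List

TEMPLATE_FIELDS = ("shipment_id", "quantity", "shipped_on")  # hypothetical template


def to_template(raw: Dict[str, object]) -> Dict[str, object]:
    """Coerce one raw record into the agreed template, failing loudly on missing fields."""
    missing = [name for name in TEMPLATE_FIELDS if name not in raw]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return {
        "shipment_id": str(raw["shipment_id"]),
        "quantity": int(raw["quantity"]),
        "shipped_on": str(raw["shipped_on"]),
    }


def from_api(payload: str) -> List[Dict[str, object]]:
    """Records arriving as a JSON array, e.g. the body of an API call."""
    return [to_template(item) for item in json.loads(payload)]


def from_sheet(csv_text: str) -> List[Dict[str, object]]:
    """Records arriving as a sheet exported to CSV and dropped into a folder."""
    return [to_template(row) for row in csv.DictReader(io.StringIO(csv_text))]


api_payload = '[{"shipment_id": "S-1", "quantity": 40, "shipped_on": "2024-03-01"}]'
sheet_text = "shipment_id,quantity,shipped_on\nS-1,40,2024-03-01\n"

same = from_api(api_payload) == from_sheet(sheet_text)
print(same)
```

Whichever channel delivers the data, the downstream process sees identical templated records.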

-- Blockchain tech is primarily a data processing technology, and is effectively the recipient of digitized data sources, which it subsequently processes.

Blockchain tech is rather flexible in terms of the exact digitization and formatting of the data it receives, as long as it receives the data in a templated machine readable format --

Conclusion

For enterprise digital transformation efforts to bear fruit and reach their full potential, it is vital that data gathering digitization be considered an integral part of planning any new digital application. Attention needs to be given to making data available in machine readable and standardized formats, so that data processing is reliable and efficient.

-- For enterprise digital transformation to reach its full potential it is vital that data gathering digitization be addressed and implemented --

-- For reliable and efficient data processing, data sources need to provide data in machine readable and standardized formats --


[Ref1] For a more formal treatment of the subject please see: https://en.wikipedia.org/wiki/Semi-structured_data

[Ref2] For a more formal treatment of the subject please see: https://en.wikipedia.org/wiki/Data_model

[Ref3] For a more formal treatment of the subject please see: https://en.wikipedia.org/wiki/Unstructured_data

Hossein Kakavand