The Dark Side of Machine Learning OCRs in Data Protection

Does GDPR Or HIPAA Ring A Bell?

They are just two of several Data Protection regulations that impose specific compliance frameworks on organizations, governing how they manage and protect their data.

These regulations have a significant impact on software systems since most data is currently handled in digital format.

In particular, since most sensitive data is exchanged in the form of documents, these regulations strongly impact document management processes and OCR implementations.

At first glance, both ECM (enterprise content management) and OCR (optical character recognition) systems are essential tools to help organizations implement compliant processes.

These technologies help organizations understand the meaning of each document exchanged, while also providing the means to store data securely and manage access to a digital archive.

The transparency and reliability provided by such systems are essential to reduce compliance risks and ensure that proper audit tools are available.

But is everything positive? No! This article discusses a significant challenge behind the use of modern OCR technologies for handling and processing data and how it may impact your Data Protection compliance.

Modern OCR Technologies or OCR 2.0

In recent years, there has been a significant change in the OCR landscape. More and more vendors are introducing Machine Learning and Deep Learning as core technologies in their products.

In principle, Machine Learning techniques can significantly change the value proposition of standard OCR technologies.

This is not only because they might improve data capture accuracy but also because they might reduce the need for complex and manual configurations that, nowadays, have a significant impact on the entry costs of these technologies.

Let’s take a step back and understand the basis of Deep Learning techniques.

Deep learning is a machine learning technique that teaches computers to do what comes naturally to humans: learn by example.

Deep learning models depend on training, usually on historical data (often called a training dataset).

In practice, companies may use their document history as input for a Deep Learning model to configure their OCR data capture engine. By doing so, they may avoid implementing static, hard-coded rules (templates) that often change from layout to layout.

Moreover, Deep Learning models may take into consideration a large number of extraction variables, many of which are inferred by the model itself. These models can process enormous batches of data in a few minutes and improve extraction quality over time without the need for human intervention.

Given enough training data, deep learning models can achieve state-of-the-art accuracy, sometimes exceeding human-level performance.
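To make the contrast concrete, here is a minimal sketch of what static, per-layout rules tend to look like. The layout names, patterns, and the `extract_invoice_number` helper are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical per-layout templates: the same field needs a separate
# hand-written pattern for every document layout.
TEMPLATES = {
    "vendor_a": re.compile(r"Invoice No\.\s*(\d+)"),
    "vendor_b": re.compile(r"Ref:\s*INV-(\d+)"),
}


def extract_invoice_number(layout, text):
    """Return the invoice number for a known layout, or None."""
    pattern = TEMPLATES.get(layout)
    if pattern is None:
        # Unseen layout: someone has to write and maintain a new rule.
        return None
    match = pattern.search(text)
    return match.group(1) if match else None


print(extract_invoice_number("vendor_a", "Invoice No. 1042"))
print(extract_invoice_number("vendor_c", "Invoice #9"))
```

Every new layout adds another entry to maintain, which is the configuration cost a trained model aims to reduce.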

Cloud Computing and Machine Learning are at the core of the next generation OCR technologies.

So What Is The Impact Of Machine Learning-Based OCR On Data Protection Regulation?

Most modern OCR technologies have security and privacy as a top priority. Vendors provide a wide range of security assurance policies covering data protection regulations and compliance frameworks.

That holds for relatively simple matters such as data geolocation or document removal, but researchers have had limited success in removing data that has already been absorbed into Machine Learning models.

When trained with historical data, OCR and Data Capture models infer by themselves which information is relevant and which is not, and then build an intricate internal knowledge base that cannot be easily decoded by humans.

That’s no good in an age where individuals can request their personal data to be removed from company databases under privacy measures like Europe’s GDPR policies.

So, how do you remove a person’s sensitive information from a machine learning model that has already been trained?

The reality is that it’s not possible to simply pull arbitrary parts out of the knowledge base.

“Deletion is difficult because most machine learning models are complex black boxes, so it is not clear how a data point or a set of data points is really being used,” as stated by James Zou, an assistant professor of biomedical data science at Stanford University.

One approach would be to retrain the model from scratch, leaving out the data that needs to be removed. But such a solution is often not feasible, both because it requires retaining all the raw training data and because it requires significant processing power, which costs time and money.

The reality is that there are no clear-cut answers to this problem. Technology is advancing at a faster pace than regulation, raising issues that are not trivial to understand or address.
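A minimal sketch of the retrain-from-scratch approach, assuming a toy token-to-label counter in place of a real OCR model. The `(subject_id, token, label)` record layout and the function names are illustrative assumptions, not any vendor's API.

```python
from collections import Counter, defaultdict


def train(records):
    """Fit a toy field classifier by counting labels seen for each token.

    The counts stand in for the opaque parameters of a real model.
    """
    counts = defaultdict(Counter)
    for _subject_id, token, label in records:
        counts[token][label] += 1
    return counts


def predict(model, token):
    """Return the most frequent label seen for a token, or None."""
    if token not in model:
        return None
    return model[token].most_common(1)[0][0]


def retrain_without(records, subject_id):
    """Honor a deletion request: drop the subject's records and refit."""
    kept = [r for r in records if r[0] != subject_id]
    return kept, train(kept)


# The raw records must still be available at deletion time.
records = [
    ("alice", "IBAN", "bank_account"),
    ("bob", "IBAN", "bank_account"),
    ("alice", "Smith-Lee", "surname"),
]
model = train(records)
records, model = retrain_without(records, "alice")
print(predict(model, "Smith-Lee"))  # nothing learned from alice remains
```

Note that `retrain_without` reprocesses every remaining record, so its cost grows with the whole dataset rather than with the size of the deletion request. That is the cost problem described above.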

Data deletion in such models is not entirely impossible, but there are still no standard tools to support it.

At DocDigitizer, we view these technology trends as great opportunities. Thus, we continuously assess their potential for the development of next-generation data capture engines.

Our proprietary technology stack leverages several of these technologies, as we are entirely confident in their potential to change the way the world exchanges and digests information.

At the same time, we make security and privacy a priority, following industry best practices to protect our customers’ data.

Balancing technology advances and compliance is at the core of our product strategy.

As a result, we run a continuous prototype pipeline in which we evaluate the potential and impact of different data capture approaches. Along the way, we have collected many lessons about the good, the bad, and the ugly of using Machine Learning within a Data Capture service.

One significant insight from this experience is the impact of model opacity not only on data deletion but also on system audits. Black box models impose severe restrictions on data management and on the explainability of their output.

We have gradually been moving away from deploying such models within our main data flow, focusing instead on more transparent and open models.

This approach ensures that we have an end-to-end trace of the data and of the decisions made on it, and that we can perform several steps of anonymization without losing the ability to support continuous learning.

Tracking this data flow also allows us to easily segment training data and remove or recover critical data on request without impacting the overall training.
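One way to picture segmented training data is a sharded setup, where each segment trains its own small sub-model and a deletion request only refits the affected segment. The sketch below is an illustrative assumption, not a description of DocDigitizer's implementation. The `ShardedModel` class, the shard assignment by CRC32, and the toy label counters are all hypothetical.

```python
import zlib
from collections import Counter


def shard_of(subject_id, n_shards):
    """Deterministically map a data subject to one shard."""
    return zlib.crc32(subject_id.encode("utf-8")) % n_shards


def fit(records):
    """Toy sub-model: count labels per token for one shard's records."""
    counts = {}
    for _subject_id, token, label in records:
        counts.setdefault(token, Counter())[label] += 1
    return counts


class ShardedModel:
    def __init__(self, n_shards):
        self.n_shards = n_shards
        self.data = [[] for _ in range(n_shards)]
        self.models = [{} for _ in range(n_shards)]
        self.retrained = []  # audit trail of which shards were refit

    def _refit(self, idx):
        self.models[idx] = fit(self.data[idx])
        self.retrained.append(idx)

    def add(self, records):
        """Store records in their shards and refit only touched shards."""
        touched = set()
        for record in records:
            idx = shard_of(record[0], self.n_shards)
            self.data[idx].append(record)
            touched.add(idx)
        for idx in sorted(touched):
            self._refit(idx)

    def forget(self, subject_id):
        """Remove one subject's records and refit only their shard."""
        idx = shard_of(subject_id, self.n_shards)
        self.data[idx] = [r for r in self.data[idx] if r[0] != subject_id]
        self._refit(idx)
        return idx

    def predict(self, token):
        """Sum label votes across all sub-models."""
        votes = Counter()
        for model in self.models:
            votes.update(model.get(token, Counter()))
        return votes.most_common(1)[0][0] if votes else None


model = ShardedModel(n_shards=4)
model.add([
    ("alice", "IBAN", "bank_account"),
    ("bob", "IBAN", "bank_account"),
    ("alice", "Smith-Lee", "surname"),
])
model.forget("alice")
print(model.predict("Smith-Lee"), model.predict("IBAN"))
```

Because every subject maps to exactly one shard, `forget` refits a single sub-model and leaves the others untouched, and the `retrained` list doubles as a simple audit trail.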

We don’t see our approach as a commercial advantage. Instead, we see it as a sincere contribution to the development of a more sustainable, transparent, and secure machine learning system compliant with present and future regulation.

If you have any insight into similar or alternative strategies to deal with this challenge, please drop us a line or comment below.
