Privacy For Big Data: Monster Or Myth?

It is the year 2020, and enterprises are investing heavily in privacy programs. The International Association of Privacy Professionals (IAPP) defines information privacy as “the right to have some control over how your personal information is collected and used.” Motivated by regulatory pressure (for example, the European Union’s General Data Protection Regulation and the California Consumer Privacy Act), chief privacy officers (CPOs) are leading initiatives related to data discovery, data protection, privacy enforcement and compliance reporting.

Working as the chief strategy officer for a company that conducts security assessments and penetration testing, I’ve found that one fundamental issue in making enterprises privacy-aware is that data is fluid — it continuously moves across the enterprise, and new data is introduced all the time. Unlike the days when enterprise data existed in neatly organized relational databases, we are now in a world where massive data platforms house large volumes of unstructured or loosely structured data that then feeds complex analytics and artificial intelligence (AI) systems.

How can an enterprise ensure the protection of personal and sensitive information in a dynamic environment?

Many organizations hold a lot of sensitive data in one place, with many different access points. Mitigating controls include using the data only as disclosed and only for the purposes for which it was collected, as well as implementing access controls. One simple model that applies to many enterprise data scenarios treats the privacy risk at any given point in an enterprise as directly proportional to both the concentration of sensitive data at that point and the level of access available there, and inversely proportional to the level of data protection operating at that same point.

Plainly stated, the area of least risk is where there is little sensitive data, limited access and lots of protection. Conversely, there is a dramatically higher risk where there is a lot of sensitive data, a high level of access and little data protection. The latter describes the big data world for enterprises. 
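To make the model concrete, here is a minimal sketch in Python. The class name, the field names, the 0-to-1 scoring scale and the example scores are all illustrative assumptions; the model as described only states the proportionality, not how each factor should be measured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataLocation:
    """A point in the enterprise where data lives (hypothetical structure)."""
    name: str
    sensitive_concentration: float  # assumed scale: 0 (none) to 1 (all sensitive)
    access_level: float             # assumed scale: 0 (no access) to 1 (open to all)
    protection_level: float         # assumed scale: above 0 (weak) to 1 (strong)


def privacy_risk(location: DataLocation) -> float:
    """Relative risk: proportional to concentration and access,
    inversely proportional to protection. The result is unitless and is
    only meaningful for comparing locations against each other."""
    if location.protection_level <= 0:
        raise ValueError("protection_level must be greater than zero")
    return (
        location.sensitive_concentration
        * location.access_level
        / location.protection_level
    )


# Illustrative scores only, not measurements from any real system.
core_database = DataLocation("core database", 0.9, 0.2, 0.9)
big_data_index = DataLocation("big data index", 0.9, 0.8, 0.2)

if __name__ == "__main__":
    # Rank locations from highest to lowest relative risk.
    for loc in sorted([core_database, big_data_index], key=privacy_risk, reverse=True):
        print(f"{loc.name}: {privacy_risk(loc):.2f}")
```

With these made-up scores, the big data index ranks far above the core database even though both hold the same concentration of sensitive data, because access is broad and protection is thin. The value of a sketch like this lies in the ranking, not in the absolute numbers.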

Solutions (and millions of dollars in investment) that focus on data discovery, classification, masking, tokenization and access control can help to reduce an organization’s risk profile, but they are simply not enough. The reality that many enterprises face is that while sensitive data in core systems is often well protected, that same sensitive data ends up in big data platforms where it is indexed and aggregated to support an almost infinite number of use cases: customer support, data analytics, artificial intelligence, security analytics, sales and marketing, revenue optimization, research and development, etc. 

Big data indexes cannot be encrypted or masked because encryption and masking break the indexing process. Data that’s in use in these systems remains in clear text as it is sliced and diced to support key business goals. Enterprises defend this data by using best-in-class access control measures, but these can fall short.

Administrators have unfettered access to sensitive information in big data indexes. While they are some of the most skilled and trusted professionals in the technology industry, even the most experienced admin is prone to natural human error. Inadvertent misconfigurations have resulted in the loss of billions of sensitive data records over the past few years. Finally, a significant amount of enterprise analytics is now conducted in the cloud via software as a service (SaaS) platforms. Regardless of granular access controls put in place by an enterprise, cloud providers will necessarily have access to sensitive data flowing through their back-end systems. 

With critical business operations now being powered by big data — and, at the same time, with privacy being top of mind for both end customers and enterprises — the time has come for innovation. How can such platforms be made privacy-aware while retaining the critical functions that they perform? Or are we willing to accept that the very platforms that are powering the AI and behavior analytics that monitor our other critical systems should themselves be vulnerable to privacy violations and data breaches?

The industry needs a data solution that is secure and private by default, even in memory, even in search results, even when aggregated, even when shared between different applications.