Data volumes continue to grow at accelerating rates, and with that, finding what we’re looking for is getting even harder. Artificial intelligence (AI) and machine learning (ML) are promoted as solutions, but they lie at the heart of the problem.
We see the onrushing tsunami of data. In November 2018, IDC predicted the “Global Datasphere” would grow from 33 zettabytes (ZB) in 2018 to 175 ZB by 2025.
Like a deer caught in the headlights, we instinctively reach for more storage. We’ve taken our physical world pack-rat mentality and digitized it. Not willing to part with it, we box it up and put it in storage. “I’ll sort that out later; it’ll be worth something in the future.” But out of sight is out of mind, and slowly, invisibly, the rot sets in.
We know we have a problem. Our solution has been search. Google indexed the internet because it grew too big for Yahoo to catalog. Microsoft has long promoted search for finding old emails over the limitation of the folder structure.
But how is search working out for us? How often do we find ourselves refining our entry in the search box? How often do we reach out to a colleague to see if they remember that email from two years ago?
The growth figures embed an inherent assumption, as though we’d digitized the Declaration of Independence: that all data is created equal, imbued with the same potential for future value. Both are fallacies.
Situations like the current pandemic or the 2008 recession only add to the volumes of zombie data. Storing data without a data management approach in the first place exacerbates the problem. The current imperative is speed. “Just have IT back up that workstation or archive that mailbox, and we’ll sort through it later.” But we never do.
Maybe it sits there until your archive policy deletes it, or perhaps you don’t have an archive policy and you pay ongoing storage costs. Perhaps you’ve bought into using the cloud for backup but, not knowing what has value and what does not, you default to keeping everything.
We see this every day in our digitizing projects for clients. Boxes that have sat in a warehouse for years, labeled as holding relevant data, often contain nothing more than the desk clearings of long-since-departed employees: the markers of careers swept into a box and shelved. The objective of clearing the desk is achieved, and the work of separating the diamonds from the soil is deferred to a future that never arrives.
Data ages at different rates. Often, technology marches on faster than the data fed in as input or generated as output. Technical debt has a habit of orphaning its child data.
I recall one project from my past where we successfully archived the data and developed a shiny new front end to help the users search it. We missed one minor element: We didn’t archive the code of the software that processed the inputs and generated the outputs. When the time came to show the auditors that the results came from the inputs, we couldn’t.
Sometimes new technology and new algorithms can bring new results from old data. Sometimes that adds value, and sometimes it doesn’t. The question is, how do we know which datasets will have future value and which will not?
We’ve all seen how data-hungry AI is. The more data we can train with, the better the AI becomes. This fosters the temptation, once again, to keep everything, just in case. We need to ask ourselves whether we can actually use the derivative insights.
I once discovered an analytics project underway to determine whether widget A was better than widget B. A good deal of time and effort was being spent on a program that was history masquerading as the future. It shouldn’t have taken me to point out that neither widget A nor widget B was still being produced. Their manufacturers had used their own analytics to advance to widget A+ and widget B+, with enough differences to invalidate any comparison of their predecessors. Add in the cost of replacing all our B’s with A’s, the retraining of the workforce and the remaining life of the widgets, and we’d never see a return on the investment in switching.
The solution to these issues lies in the discipline of data management.
Effective data management will index your data in context. Instead of searching for data, you’ll be able to stop searching and start finding. By identifying data as you create it, you can create effective policies around its retention. You can identify the inputs and trace the lineage of derived information back to the origin data. You can better infer the future value of any given dataset and make decisions on the cost-benefit of storage over deletion.
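To make the idea concrete, the mechanics described above can be sketched as a tiny data catalog: each dataset gets metadata at creation time (a retention policy and its lineage), so retention decisions and origin tracing become simple lookups rather than searches. This is a minimal illustrative sketch; the class and field names (`CatalogEntry`, `retention_days`, `parents`) are assumptions, not any particular product's schema.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta

@dataclass
class CatalogEntry:
    """Metadata recorded when a dataset is created (names are illustrative)."""
    dataset_id: str
    created: date
    retention_days: int                           # policy set at creation time
    parents: list = field(default_factory=list)   # lineage: source dataset ids

def is_expired(entry: CatalogEntry, today: date) -> bool:
    """A dataset becomes a deletion candidate once its retention window lapses."""
    return today > entry.created + timedelta(days=entry.retention_days)

def lineage(entry_id: str, catalog: dict) -> list:
    """Trace derived data back to its origin datasets (those with no parents)."""
    seen, stack, origins = set(), [entry_id], []
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        parents = catalog[current].parents
        if not parents:
            origins.append(current)
        else:
            stack.extend(parents)
    return origins

# Example: a long-retention report derived from two short-retention raw feeds.
catalog = {
    "raw_sales": CatalogEntry("raw_sales", date(2019, 1, 1), 365),
    "raw_crm": CatalogEntry("raw_crm", date(2019, 2, 1), 365),
    "q1_report": CatalogEntry("q1_report", date(2019, 4, 1), 2555,
                              parents=["raw_sales", "raw_crm"]),
}
print(sorted(lineage("q1_report", catalog)))            # ['raw_crm', 'raw_sales']
print(is_expired(catalog["raw_sales"], date(2021, 1, 1)))  # True
```

The point of the sketch is that the retention and lineage questions are answered from metadata captured up front, which is exactly what “sort it out later” forfeits.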
Effective data management allows you to store data with intention and escape the digital pack-rat habit. As the volumes of data in storage continue their rapid ascent, can you afford to ignore data management?