Data Management In The Age Of AI

How many times have organizations been told that data is their most important and strategic asset? It’s the black gold, the lifeblood, the new oil and, most importantly, the new source code that keeps the wheels of business turning in the digital era.

Data is the doorway to new business models, faster time-to-market and competitive differentiation. But the next evolution set to shift the dynamics of data and intelligence is accelerated DataOps: the effective and efficient management of data for descriptive, predictive, prescriptive and, now, cognitive analytics that leverage artificial intelligence (AI). It determines how organizations derive actionable intelligence and operationalize the data pipelines that shape their success in the digital economy.

In my role at WekaIO, I see this evolution across verticals, whether in autonomous vehicles, genomics and drug discovery pipelines, or fraud and cybersecurity analytics pipelines.

What is accelerated DataOps?

Gartner defines “DataOps” as “A collaborative data management practice focused on improving the communication, integration and automation of data flows between data managers and data consumers across an organization. The goal of DataOps is to deliver value faster by creating predictable delivery and change management of data, data models and related artifacts.”

“Accelerated DataOps” puts Gartner’s definition into operation. Many organizations can claim to practice DataOps; doing it effectively and efficiently is what matters. Accelerated DataOps needs to:

  1. Provide actionable intelligence for business intelligence and artificial intelligence pipelines alike, while catering to a multitude of diverse I/O requirements.
  2. Provide operational agility for continuous integration/continuous delivery (CI/CD) pipelines, whether on-premises or in the cloud.
  3. Provide end-to-end governance and security for data in-flight and data at rest.

In essence, the enterprise needs to enable accelerated DataOps by solving challenges around storage, workflow and architecture. And while AI and machine learning are maturing, there are several challenges to overcome.

1. Multi-workload convergence

AI is increasingly converging the traditional high-performance computing and high-performance data analytics pipelines, resulting in multi-workload convergence. Data analytics, training and inference are now being run on the same accelerated computing platform. Increasingly, the accelerated compute layer isn’t limited to GPUs; it now involves FPGAs, graph processors and specialized accelerators.

Use cases are moving from computer vision to multi-modal and conversational AI. Recommendation engines now use deep learning, while low-latency inference powers personalization on LinkedIn, translation on Google and video recommendations on YouTube.

Convolutional neural networks (CNNs) are being used for everything from annotation and labeling to transfer learning. Learning approaches are moving toward federated learning and active learning, while deep neural networks (DNNs) are becoming even more complex, with billions of parameters. The result of these transitions is a set of distinct stages within the AI data pipeline, each with its own storage and I/O requirements.
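
To make the stage-by-stage distinction concrete, here is a minimal sketch. The stage names, access patterns and the low-latency flag are illustrative assumptions, not a published taxonomy; real pipelines vary.

```python
# Illustrative sketch: each stage of an AI data pipeline has a distinct
# storage and I/O profile. All stage names and profile values below are
# hypothetical examples, not measured characteristics.
from dataclasses import dataclass

@dataclass
class StageIOProfile:
    stage: str
    access_pattern: str    # dominant access pattern, e.g. sequential writes
    io_size: str           # dominant request size
    latency_sensitive: bool

PIPELINE = [
    StageIOProfile("ingest",    "sequential writes", "large",  False),
    StageIOProfile("etl",       "mixed read/write",  "medium", False),
    StageIOProfile("training",  "random reads",      "small",  True),
    StageIOProfile("inference", "random reads",      "small",  True),
]

def stages_needing_low_latency(pipeline):
    """Return the stages whose I/O path must be latency-optimized."""
    return [p.stage for p in pipeline if p.latency_sensitive]
```

A storage platform serving the whole pipeline has to satisfy all of these profiles at once, which is what drives the multi-workload convergence described above.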

2. Need for trusted and explainable AI

Explainable AI (XAI) is becoming very important to production use cases. In addition to versioning, lineage and source control for the models themselves, the datasets those models are trained on need to be retained and versioned over the long term.
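
One common way to version a training set is to content-address it: derive a stable identifier from the bytes themselves so a model can always be traced back to the exact data it saw. The sketch below assumes a file-based dataset and a simple JSON lineage record; both are illustrative choices, not a prescribed format.

```python
# Minimal sketch of dataset versioning for explainable AI: identify each
# training set by a content hash, then record which version a model was
# trained on. The record layout is a hypothetical example.
import hashlib
import json

def dataset_version(paths):
    """Content-address a dataset: hash file bytes in a stable (sorted) order."""
    digest = hashlib.sha256()
    for path in sorted(paths):
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
    return digest.hexdigest()

def lineage_record(model_id, dataset_paths):
    """Record which dataset version a model was trained on."""
    return json.dumps({
        "model": model_id,
        "dataset_version": dataset_version(dataset_paths),
    })
```

Because the identifier is derived from content rather than a file name, any change to the data yields a new version, which is exactly the property long-term retention for XAI audits depends on.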

3. Data anywhere with EdgeAI

The internet of things (IoT), 5G and low-powered devices have introduced AI at the edge, known as EdgeAI, which is expected to be even bigger than the cloud. The “edge” includes anything from the autonomous vehicle to the IP camera—from the magnificent to the mundane—yet, every point on the edge needs infrastructure capable of handling everything from core to cloud data pipelines.

Architectures have to cater to complex DNNs and performance at scale, and storage cannot be limited to traditional storage stacks, as these cannot deliver insights at the scale these new workloads demand.

Embracing accelerated DataOps

The approach needs to be built around breaking down silos and enabling high-performance data lakes for business intelligence (BI) and AI, alongside operational agility and governance, risk and compliance (GRC).

1. Breaking organizational silos

As organizations increasingly adopt AI/ML, there is a need for greater collaboration between line-of-business and IT infrastructure teams. New overlay roles, such as chief data officer or chief analytics officer, are increasingly used to bridge the gap.

2. Next-generation data lakes

High-performance data lakes need exascale capacity and the parallelism to keep pace with accelerated compute, combined with the ease of use of POSIX (portable operating system interface). Storage platforms need to provide transparency, reproducibility of experiments, end-to-end security and, consequently, explainability. Increasingly important are built-in data protection, immutability and support for hybrid workflows for test and development use cases.
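
The ease-of-use point is worth making concrete: a data lake that exposes a POSIX namespace lets existing tools and training code use ordinary file APIs, with no rewrite against an object-store SDK. A minimal sketch, in which the `.tfrecord` shard suffix and directory layout are assumptions for illustration:

```python
# Sketch of why POSIX ease of use matters: enumerating dataset shards
# with nothing but standard file-system calls. The ".tfrecord" suffix
# is an assumed convention, not a requirement.
import os

def list_training_shards(root):
    """Walk a POSIX directory tree and return sorted shard paths."""
    return sorted(
        os.path.join(dirpath, name)
        for dirpath, _, names in os.walk(root)
        for name in names
        if name.endswith(".tfrecord")
    )
```

The same code runs unchanged against a local disk, a parallel file system or any storage platform that presents a POSIX mount, which is what makes hybrid test and development workflows practical.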

The challenge is that organizations need to revisit how their storage stacks are architected, ensuring that today’s purchasing decisions can support accelerated DataOps in the future while minimizing the problems in their architecture now.

3. Leveraging reference architecture and SDKs

By investing in production-ready storage solutions that ship with reference architectures and software development kits (SDKs), enterprises can embrace the potential of accelerated DataOps without the limitations of legacy systems and the challenges those limitations introduce.

When achieved, a business can build a production-ready solution where the entire AI data pipeline workflow runs on the same storage substrate. This can be done both on-premises and in the cloud and, with the right toolkits, with relative ease.

It will also allow enterprises to create reference architectures that are easy to consume, deploy and operationalize and that tick the boxes of faster time-to-market, competitive differentiation and data monetization.