The AI revolution is in full swing as firms in virtually all industries are using AI techniques to boost their bottom lines. An MIT Sloan Management Review survey-based report found that enterprises increased AI spending by 62 percent in 2019, and more organizations are expected to invest in AI in the coming years.
However, non-technical teams often view techniques such as predictive analytics as “rocket science,” and that perception leaves money on the table. In spite of massive investment, many AI initiatives fail. In fact, MIT Sloan Management Review also reports that “seven out of 10 companies surveyed report minimal or no impact from AI so far.” AI is difficult to understand and even harder to implement.
My goal in this article is to demystify predictive analytics. In Part 2, I’ll explore how nontechnical teams can implement predictive analytics.
An Intro to Predictive Analytics
Predictive analytics is, as the term implies, a way to predict future outcomes based on historical data. Data relevant to the problem at hand serves as the input from which the model learns.
For instance, an HR manager may be interested in applying predictive analytics to employee attrition because talent is the most valuable asset of any company. If you were to guess whether an employee would leave your company, you might look at things like an employee’s job satisfaction, performance reports, how many days they take off, or even how far they live from the office.
A predictive analytics model would do the same thing using a mathematical function, as opposed to a gut feeling. There are many types of algorithms used in predictive modeling, but a common one for tasks such as this would be a decision tree.
You’ve used decision trees in your own life, even if you haven’t realized it. A decision tree is simply a set of sequential, hierarchical decisions that lead to some final result. For example, you might be deciding whether to go to the park or to the cinema. It might depend on whether or not it’s sunny, whether your friends are available to meet, whether you want to meet with your friends, what movies are showing, and so on.
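The park-or-cinema decision above can be sketched as a hand-written decision tree: a chain of simple questions leading to a final result. The factors and rules here are illustrative, not from any real model:

```python
def choose_activity(is_sunny, friends_available, good_movie_showing):
    """Follow a hierarchy of yes/no decisions to a final choice."""
    if is_sunny:
        if friends_available:
            return "park"       # sunny day and company -> go to the park
        return "stay home"      # sunny, but nobody to meet
    if good_movie_showing:
        return "cinema"         # rainy day, good movie on -> cinema
    return "stay home"          # rainy day, nothing worth watching

print(choose_activity(is_sunny=False, friends_available=True,
                      good_movie_showing=True))  # cinema
```

A trained decision-tree model has exactly this shape; the difference is that the algorithm chooses the questions and their order from data rather than from intuition.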
Predictive analytics uses historical data to make predictions. To build a decision-tree model, a data scientist feeds in historical “training data,” which is simply the data relevant to the problem at hand (such as employees’ job satisfaction). This data set contains “labels” — the KPI(s) you’re interested in (in the case of employee attrition, whether each employee quit).
The decision tree is created as the training data is divided by various factors (for example, possibly splitting employees between high and low job satisfaction and then dividing each group by length of commute). The tree shape is created from the sequence of these factors and their relationship to the label (e.g., what percentage of low satisfaction/long commute employees quit compared to other possible combinations of factors).
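In miniature, the splitting described above amounts to grouping the training data by factors and comparing the label (quit or stayed) across groups. A minimal pure-Python sketch, using made-up employee records:

```python
from collections import defaultdict

# Hypothetical training data: (job satisfaction, commute length, quit?)
employees = [
    ("low",  "long",  True),
    ("low",  "long",  True),
    ("low",  "short", False),
    ("high", "long",  False),
    ("high", "short", False),
    ("high", "short", False),
]

# Split the records by the two factors and measure the quit rate in
# each group -- tree-building algorithms use comparisons like this to
# decide which factor to split on at each level.
groups = defaultdict(list)
for satisfaction, commute, quit in employees:
    groups[(satisfaction, commute)].append(quit)

for factors, labels in sorted(groups.items()):
    rate = sum(labels) / len(labels)
    print(f"{factors}: {rate:.0%} quit")
```

In this toy data set, the low-satisfaction/long-commute group has a 100 percent quit rate while every other combination has 0 percent, so a tree built from it would split on exactly those factors.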
The same principles apply to other predictive analytics use cases, such as analyzing churn to increase customer lifetime value (CLV). For example, a telecom company interested in reducing churn might use a decision tree that relies on data such as the customer’s tenure, whether they have multiple telephone lines, their age, and their type of contract.
Using a Predictive Model
After a data scientist creates a predictive model, typically using a programming language such as Python or R, they then deploy it so a user can make predictions.
This can be done with a mix of complex tools such as Kubernetes and Google Cloud Platform, which would each require its own series of articles to explain. Suffice it to say, once you have a deployed model, you can enter data and receive a prediction in return.
Suppose you have a predictive employee attrition model. A manager could enter a current employee’s job satisfaction, performance report data, how many days they take off, and so on, and the model will calculate the probability that the employee will quit.
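A minimal sketch of that workflow, using scikit-learn's `DecisionTreeClassifier` with invented feature names and toy data (a real model would be trained on far more records and features):

```python
from sklearn.tree import DecisionTreeClassifier

# Toy training data: [job satisfaction (1-5), days off, commute in km]
# and whether each past employee quit (1) or stayed (0).
X_train = [
    [1, 20, 40], [2, 15, 35], [1, 18, 10],   # employees who quit
    [4,  8,  5], [5,  5, 12], [4, 10,  8],   # employees who stayed
]
y_train = [1, 1, 1, 0, 0, 0]

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# A manager enters a current employee's data...
current_employee = [[2, 16, 30]]

# ...and the model returns the probability that the employee will quit.
prob_quit = model.predict_proba(current_employee)[0][1]
print(f"Probability of quitting: {prob_quit:.0%}")
```

In production, the `fit` step happens once when the data scientist builds the model, and the deployed service only runs the `predict_proba` step on whatever data the user enters.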
The manager could also make an aggregate prediction and use data from all employees to estimate recruitment costs for the next year.
Again, the same principles apply for any use case. Consider another example: a telecom company that wants to predict churn would enter data from a current customer, such as their tenure, age, and contract type. The telecom could also make aggregate predictions using data from all customers to estimate overall churn and profit.
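The aggregate step is simple arithmetic over the individual predictions. A sketch with hypothetical numbers, assuming a deployed model has already produced a churn probability for each customer:

```python
# Hypothetical per-customer churn probabilities returned by a deployed
# model, paired with each customer's monthly revenue.
customers = [
    {"churn_prob": 0.80, "monthly_revenue": 30.0},
    {"churn_prob": 0.10, "monthly_revenue": 55.0},
    {"churn_prob": 0.45, "monthly_revenue": 25.0},
    {"churn_prob": 0.05, "monthly_revenue": 70.0},
]

# The expected number of churners is the sum of the individual
# probabilities (linearity of expectation), and the expected revenue
# at risk weights each customer's revenue by their churn probability.
expected_churners = sum(c["churn_prob"] for c in customers)
revenue_at_risk = sum(c["churn_prob"] * c["monthly_revenue"]
                      for c in customers)

print(f"Expected churners: {expected_churners:.1f}")       # 1.4
print(f"Monthly revenue at risk: ${revenue_at_risk:.2f}")  # $44.25
```

The same arithmetic underlies the HR example: summing each employee's quit probability gives an expected head count to replace, which can then be multiplied by an average recruitment cost.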
Where Predictive Analytics Falls Short
Although it’s clear that the ability to predict the future is useful for any industry, there are times when predictive analytics falls short.
Because predictive analytics relies on past data, we run into trouble when the data is inaccurate, biased, or of generally low quality.
For instance, if a start-up wants to predict employee attrition, but hasn’t conducted many performance reports or surveys, then there isn’t much past data to base a prediction on, and it will be difficult to build an accurate model.
Further, an organization may have biased data, which would lead to a biased predictive model. One infamous example is a model built by Amazon that scored job candidates to accelerate hiring. Because the tech industry, including Amazon, has historically been male-dominated, the training data taught the algorithm that male candidates were preferable.
Just as you would struggle to predict the weather without being able to look at the sky, or in a location you know nothing about, predictive models have a hard time making accurate predictions without complete, relevant data.
Learning the tools, such as Python, R, Kubernetes, and GCP, can take years, which is why many companies hire specialized data scientists to handle predictive modeling.