The trajectory of data science has been stunning. As a profession, it went from nothing to “the sexiest job of the 21st century” within a decade. Computers have been used to analyze data since their inception. So why did it take more than half a century for such a field to emerge? In this post, I want to explore the tensions that had built up before that fateful day when DJ Patil and Jeff Hammerbacher coined the name that took off.
Data science broke into public consciousness after several companies, like LinkedIn and Facebook, touted it as their secret sauce. Following in their footsteps, other companies tried to build up their data competence, with varying levels of success. Many of them failed because data science sits at the top of a ladder of analytic capability that each company must climb from the bottom.
Let’s start by defining this ladder of analytic capability. I use the word “ladder” to emphasize the ordering of the levels, and I borrow the rungs’ names from Davenport’s Competing on Analytics (source):
- Descriptive: What is happening, or what has already happened? This capability enables alerting, exploring, and reporting.
- Predictive: Given what we are seeing today, what is going to happen in the future?
- Prescriptive: If we take a particular course of action, what would happen?
- Automated: Without human intervention, what course of action should we take?
An organization that attempts to forecast sales (predictive) when their data hasn’t yet been properly cleaned, stored, and made accessible (descriptive) will find only frustration.
Before data science, there was business intelligence.
In the ’80s and ’90s, there was already a field in tech focused on extracting meaning from data: business intelligence. Here is one definition of business intelligence (BI) from an ACM review article:
“Business intelligence software is a collection of decision support technologies for the enterprise aimed at enabling knowledge workers such as executives, managers, and analysts to make better and faster decisions.” (source)
And here is a summary of what BI technology should enable:
" … BI technology is used in manufacturing for order shipment and customer support, in retail for user profiling to target grocery coupons during checkout, in financial services for claims analysis and fraud detection, in transportation for fleet management, in telecommunications for identifying reasons for customer churn, in utilities for power usage analysis, and health care for outcomes analysis." (source)
The field was going to revolutionize decision making by replacing gut instinct with facts. With the organization’s data collected, processed and stored in a data warehouse, decisions were supposed to be as simple as finding the right patterns to exploit.
Baked into BI was an implicit assumption: that more data is always better. So a lot of the work in BI was tapping into various live transactional systems, like customer databases and inventory management systems, and piping their data to a central location. Once centralized, this data was used to answer queries like, “What is the average age of a hand lotion customer?”
The high-water mark of a BI organization was the deployment of an online analytical processing (OLAP) cube. The idea was to store aggregates, like sums, by their context, and be able to perform complex exploratory operations. These systems let analysts interactively explore data with operators like roll-up, drill-down, and slice-and-dice.
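To make these operations concrete, here is a minimal sketch using pandas (the table, columns, and numbers are invented for illustration, not taken from any real OLAP product): slicing restricts the data to one product, rolling up aggregates revenue by region and month, and drilling down recovers the underlying rows. The same frame also answers the descriptive query about the average age of a hand lotion customer.

```python
import pandas as pd

# Hypothetical transactional data, already piped into a central store.
sales = pd.DataFrame({
    "region":   ["East", "East", "West", "West", "West"],
    "month":    ["Jan", "Feb", "Jan", "Jan", "Feb"],
    "product":  ["lotion", "lotion", "lotion", "soap", "soap"],
    "revenue":  [120.0, 80.0, 200.0, 50.0, 75.0],
    "cust_age": [34, 41, 29, 52, 45],
})

# "Slice": restrict the cube to a single product.
lotion = sales[sales["product"] == "lotion"]

# A simple descriptive query: average age of a hand lotion customer.
print(lotion["cust_age"].mean())

# "Roll up": aggregate revenue over region and month.
print(lotion.groupby(["region", "month"])["revenue"].sum())

# "Drill down": recover the underlying rows behind one aggregated cell.
print(lotion[(lotion["region"] == "West") & (lotion["month"] == "Jan")])
```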
At first, these technologies must have paid large dividends. In one anecdote, Eugene Wei talks about how his entire job at Amazon was compiling a paper copy of the company-wide analytics dashboard (source). By the late 1990s, we had all the technology for capturing, storing, and exploring data. In addition, computing and storage costs were still dropping exponentially. If BI, whose aim sounds an awful lot like data science, had delivered on its promise, then shouldn’t the field be everywhere now? Why is BI not the sexiest job of the 21st century?
To understand this, we need to take a detour into the world of psychology.
A crisis of replication
You may have heard the term “replication crisis” thrown around over the last decade. That’s because some fields of science, like psychology and sociology, are questioning some of their most foundational findings. This crisis was kicked off by Bem’s 2011 publication, “Feeling the Future: Experimental Evidence for Anomalous Retroactive Influences on Cognition and Affect”.
Bem’s paper concluded that the ability to predict the future exists. Now, nothing is unusual about outrageous claims. There are many eccentric individuals out there. What made Bem’s paper remarkable is that the methodology was airtight. He used standard protocols, widely accepted statistical methods, and exceptional sample sizes to prove a claim that is clearly bunk.
If Bem could use the most widely accepted methodologies to demonstrate a clearly nonexistent phenomenon, then those same methods could be used to support any position whatsoever! So you can see why the field was having a bit of an existential crisis. (source)
At the root of this crisis is a lack of replicability. That is, if another trained scientist carried out the same experiment as a check, they would fail to find the published phenomenon. The trio of Simmons, Nelson, and Simonsohn set out to simulate what happens when someone makes the same kinds of methodological choices psychologists routinely make. Since their work was done on a computer, the team could ensure that there was really nothing to find. Unfortunately for psychologists, the trio found that accepted methods could be manipulated to say something was there, even when it wasn’t, up to 60% of the time! (source)
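To get a feel for how that happens, here is a rough sketch of that kind of simulation. The specific “researcher degrees of freedom” below, a second interchangeable outcome measure and the option to collect more data after peeking, are my own simplified assumptions rather than the trio’s exact protocol, but the mechanism is the same: every individual t-test is valid at the 5% level, yet the combined false-positive rate climbs well above 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def significant(a, b):
    """True if either of two correlated outcome measures differs between
    groups at p < .05 (outcome 2 is just a noisier copy of outcome 1)."""
    for noise in (0.0, 1.0):
        a2 = a + noise * rng.normal(size=a.size)
        b2 = b + noise * rng.normal(size=b.size)
        if stats.ttest_ind(a2, b2).pvalue < 0.05:
            return True
    return False

def one_study(n=20, extra=10):
    # Both groups come from the SAME distribution: there is nothing to find.
    a, b = rng.normal(size=n), rng.normal(size=n)
    if significant(a, b):
        return True
    # Peek and top up: if not significant, collect more subjects and retest.
    a = np.concatenate([a, rng.normal(size=extra)])
    b = np.concatenate([b, rng.normal(size=extra)])
    return significant(a, b)

rate = sum(one_study() for _ in range(2000)) / 2000
print(f"False-positive rate with flexible analysis: {rate:.0%}")
# Each t-test alone is honest at the 5% level, but a few flexible choices
# combined push the overall false-positive rate well above the nominal 5%.
```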
It is this sifting through data for statistical significance that leads us astray. Humans are predisposed to narration; we want to find a signal even when there is none. In the case of scientists, the sifting occurs over many experiments and parameter tweaks. They have to work harder to fool themselves, because experimentation, in a way, protects against this tendency. Even then, with enough determined effort, you can prove that ESP exists.
This problem is exacerbated in purely observational data, exactly the kind that BI deals with. But BI never had a replication crisis, because business is much more secretive than science. Failures are swept under the rug, and successes are trumpeted from the mountaintops. However, after years of decreasing returns on investment, most companies did eventually realize that they had hit a wall. That’s why BI plateaued.
Taking science as a role model
It should be obvious by now that technology is not the first field to wrestle with using data to draw conclusions. The scientific community has been dealing with such issues for a very long time: hundreds, if not thousands, of years. In a recent take, Jeff Leek and Roger Peng describe exactly this problem of mistaking the level of analysis and drawing unmerited conclusions. (source)
Their paper has a nice table that names jumps between specific levels of the analytics ladder. They use slightly different terminology, but I will map them to Davenport’s ladder for our discussion.
- (1 -> 2) Descriptive to Predictive = “Data Dredging”, “Overfitting”, “N of 1 analysis”
- (2 -> 3) Predictive to Prescriptive (causal) = “Correlation is not causation”
If you have even a cursory background in science, you’ll know that these are not nice things to say. The key, according to Leek and Peng, is correctly matching the question we are trying to answer with the level of analysis. We should not expect a descriptive analysis (rung 1) to meaningfully estimate the effect of an intervention (rung 3).
Imagine that an analyst carries out an exploratory analysis. She comes back and reports that more ice cream is eaten on summer days. An executive reading the report goes on to mandate that more ice cream be consumed, in hopes of better weather. Obviously, he is going to be sorely disappointed. This example may seem ridiculous, but I see decisions like this made all the time.
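As a toy illustration of the gap between rung 1 and rung 3 (the numbers are entirely made up): warm weather drives ice cream sales in the simulated data below, a descriptive analysis duly finds a strong correlation, and nothing in that correlation says which way the causal arrow points.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical daily data: temperature drives ice cream sales, not the reverse.
temperature = rng.normal(20, 8, size=365)                   # degrees C
ice_cream = 2.0 * temperature + rng.normal(0, 5, size=365)  # cones sold per day

# Rung 1, descriptive: the analyst correctly reports a strong association.
print("correlation:", round(float(np.corrcoef(temperature, ice_cream)[0, 1]), 2))

# Rung 3, causal: this dataset cannot say which way the arrow points. The
# correlation would look exactly the same if ice cream caused warm weather,
# which is why the executive's mandate is doomed. Only an intervention, an
# actual experiment, can settle the direction.
```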
The rise of data science
Picture the world in 2008. The great financial crisis left our economy in tatters. Tech was still heavily out of favour, a lingering result of the dot-com crash. Industry adoption of traditional BI tools had plateaued because of diminishing returns. This was the world that data science was born into.
There are two things that data science can do that BI cannot, and they account for its meteoric rise. The first is experimentation: the ability to run A/B tests at scale. When a test is running, different users on a site can, simultaneously, have very different experiences. The second is automated decisions, also called data products. Today, data products are so ubiquitous that it is hard to remember a world before them. Experimentation is not as widely known to the average consumer, but it was pivotal in achieving automated decision making.
I will first describe data products. You have probably interacted with several data products already just today. These features include LinkedIn’s People You May Know, Facebook’s feed, Spotify’s daily mixes, YouTube’s up-next video, and Amazon’s “people who bought X also bought Y.” On the modern web, we actually expect the product to tell us what to do next. This is a complete 180 from the way things were.
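At its simplest, “people who bought X also bought Y” is just item co-occurrence counting. Production recommenders are far more sophisticated, but a toy sketch (with made-up baskets) shows the shape of the idea:

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical purchase histories, one basket per customer.
baskets = [
    {"hand lotion", "soap", "shampoo"},
    {"hand lotion", "soap"},
    {"soap", "toothpaste"},
    {"hand lotion", "shampoo"},
]

# Count how often each pair of items is bought together.
co_counts = defaultdict(Counter)
for basket in baskets:
    for x, y in combinations(sorted(basket), 2):
        co_counts[x][y] += 1
        co_counts[y][x] += 1

def also_bought(item, k=3):
    """People who bought `item` also bought ... (top-k co-purchased items)."""
    return [other for other, _ in co_counts[item].most_common(k)]

print(also_bought("hand lotion"))  # e.g. ['shampoo', 'soap']
```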
In the old days, you had to go searching for what to do next. Let’s call this the pull model. Google is a widely recognized pioneer of the pull model: you describe what you want, and it tries to return some relevant options. You get to pull information. Data products invert this. Instead of you describing what you want, the product infers from your previous behaviour what you would like next. The information gets pushed to you. Instead of turning on the faucet when you feel thirsty, there is now a hose that constantly sprays information at you.
As you can probably guess, data products are tough to build. When you design a single experience, it’s easy to know whether the product is great: you go there and experience it yourself. However, if you’re building a personalized product like Facebook, then every single person’s feed is different; it’s determined by their social graph. So how do you go about optimizing a million different experiences? Well, you do it through experimentation. This is why our ladder puts the prescriptive rung before the automated one.
What makes experimentation so powerful is that it removes the root cause of the replication crisis described earlier. We are no longer looking at observational data. Instead, we are looking at what actually happens when we intervene. We have moved from predictive to prescriptive.
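For a sense of what analyzing such a test involves, here is a minimal sketch of a two-proportion z-test on hypothetical A/B results (the counts below are invented). Because users are randomly assigned to control or treatment, the measured lift can be read causally rather than as a mere correlation.

```python
import numpy as np
from scipy import stats

# Hypothetical A/B test: users randomly assigned to the old experience
# (control) or the new one (treatment); we compare conversion rates.
control_n, control_conv = 10_000, 1_020
treatment_n, treatment_conv = 10_000, 1_150

p_control = control_conv / control_n
p_treatment = treatment_conv / treatment_n

# Two-proportion z-test under the pooled null hypothesis of "no difference".
p_pool = (control_conv + treatment_conv) / (control_n + treatment_n)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / control_n + 1 / treatment_n))
z = (p_treatment - p_control) / se
p_value = 2 * stats.norm.sf(abs(z))

print(f"lift: {p_treatment - p_control:+.3%}, z = {z:.2f}, p = {p_value:.4f}")
```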
So, in summary, data science rose on the back of a few very prominent success stories. These companies succeeded because they managed to break through the barrier of predictive analytics and proceed all the way to automated analytics. This was such a competitive advantage that they rose to dominate the industry, and now, arguably, the economy.
Closing thoughts
I think one of the biggest mistakes companies make is trying to skip rungs on the ladder. Companies see the success of Facebook or Spotify and want to replicate it for themselves. A cursory look reveals that data products are a key driver of that success, so they go out and hire a bunch of PhDs to try and “figure it out.”
Recently, we’ve seen many articles about the underperformance of data science organizations. These failures are not because those PhDs were duds, or because data science isn’t real. They are likely, at least partially, due to trying to skip steps on the ladder of analytic capabilities. It takes more than a single person, or even a small motivated team, to shift a company’s culture. It requires buy-in all the way through an organization.
So there you have it: data science shouldn’t be seen as a separate, distinct capability, but as a rung at the top of a ladder. This means that a company can’t just hire talent at the top rung and expect to reap the rewards. To get a return, companies must build the entire analytics value chain.
If you liked this post, consider following me on Twitter for more musings.