I’ve been reading Nate Silver’s The Signal and the Noise. It’s not the sort of book I would normally
read, but since Nate kept me from jumping off a tall building during the last
election, I felt I owed him the $27.95.
Given Nate’s record predicting election outcomes you might think this is
a book that reveals the hidden secrets of the black art of predicting
things. But it’s not. It’s about how hard it is to make accurate
predictions even when we have mountains of data from which to do it. And it’s causing me to look differently at
the issue of Big Data and predictive analytics.
Nate spends a lot of pages on some of those things for which we have lots of data but still
aren’t good at predicting – the weather, earthquakes, economic growth,
etc. Consider economic growth. We all have a sense of just how much economic
data there is and how long the time series are.
(Nate estimates around 4 million variables.) But forecasts of growth are all over the
map, and even “consensus forecasts” are routinely just plain wrong.
Nate argues that predictions
fail because we fall victim to two common errors. The first is overfitting: tuning the prediction
model into something that looks very sophisticated and plausible but either
ignores important variables or simply fails to capture the underlying
structure of the data. Machine learning
is especially susceptible to overfitting. The second is the classic error of
interpreting correlation as causation. A
good example is the Super Bowl indicator, which says that the direction of the stock market can be
predicted based on who wins the Super Bowl.
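For anyone who wants to see how easily something like the Super Bowl indicator can turn up by accident, here is a minimal sketch in Python. It is my own toy illustration, not anything from Nate’s book: the function names (`accuracy`, `dredge`), the 1,000 indicators, and the 20-year windows are all made-up assumptions, and both the “market” and the “indicators” are nothing but coin flips. The idea is to pick the indicator with the best track record over the first 20 years and then check how it does over the next 20.

```python
import random


def accuracy(indicator, market):
    """Fraction of years in which the indicator matched the market direction."""
    hits = sum(1 for i, m in zip(indicator, market) if i == m)
    return hits / len(market)


def dredge(n_indicators=1000, n_train=20, n_test=20, seed=0):
    """Search random indicators for the best in-sample 'predictor'.

    All numbers here are hypothetical and chosen only for illustration.
    Returns (in_sample_accuracy, out_of_sample_accuracy) for the winner.
    """
    rng = random.Random(seed)
    n_years = n_train + n_test

    # Market direction per year: True = up, False = down (pure coin flips).
    market = [rng.random() < 0.5 for _ in range(n_years)]

    # Each indicator is also a run of coin flips, unrelated to the market.
    indicators = [
        [rng.random() < 0.5 for _ in range(n_years)]
        for _ in range(n_indicators)
    ]

    # Pick the indicator with the best record on the first n_train years only.
    best = max(
        indicators,
        key=lambda ind: accuracy(ind[:n_train], market[:n_train]),
    )

    in_sample = accuracy(best[:n_train], market[:n_train])
    out_of_sample = accuracy(best[n_train:], market[n_train:])
    return in_sample, out_of_sample


if __name__ == "__main__":
    in_sample, out_of_sample = dredge()
    print(f"Best indicator, years it was chosen on: {in_sample:.0%}")
    print(f"Same indicator, following years:        {out_of_sample:.0%}")
```

With enough candidates to choose from, some indicator will almost certainly look impressive over the years it was selected on, even though nothing connects it to the market. Its record over the following years has no reason to be better than a coin toss, which is the trap in reading a strong historical correlation as a cause.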
Ultimately, we
need to be able to make good decisions about which data are important. And we need to be able to look at what a
model is saying, why it’s saying it, and judge whether it makes sense. Finally, we need to understand the
uncertainty in the prediction and communicate it. That sounds a lot like MR, except for that
last part about uncertainty.
Right now the
possibility of a future world of petabytes, MPP architectures, neural networks
and naïve Bayes is scaring the pants off a lot of people in the MR industry. It
may well be very bad news for MR companies but maybe not so bad for the MR
profession. There always will be demand
for people who understand data, consumers and the competitive challenges that
client companies face in the marketplace.
Or, as Nate
writes, “Data-driven predictions can succeed—and they can fail. It is when we
deny our role in the process that the odds of failure rise.”