Why FiveThirtyEight doesn’t beat prediction markets

During the last presidential primaries, the US-based magazine CAFE employed a pundit, Carl Diggler, with ‘30 years’ experience of political journalism’, to make predictions for each of the US state primaries. Diggler used only his ‘gut feeling and experience’. He certainly knew his stuff, calling 20 out of 22 of the Super Tuesday contests correctly. As his predictions continued to pan out, he challenged Nate Silver to a head-to-head prediction battle. Nate didn’t respond, but Diggler persisted nonetheless. By the end of the primaries, Diggler had the same level of accuracy, with 89 per cent correct predictions, as FiveThirtyEight. Not only that, he had called the result of twice as many contests as Nate’s site. Carl Diggler was the prediction champion of the US primaries.

Carl Diggler’s predictions were real, but he isn’t. He is a fictional character. Two journalists, Felix Biederman and Virgil Texas, who used their own intuition to produce the predictions, wrote his column. Their original idea was to make fun of political pundits who resemble Diggler in their pompous certainty, but when the journalists started making successful predictions, they turned on Nate. After the election, Virgil wrote an opinion piece in the Washington Post criticising the misleading nature of FiveThirtyEight. He accused Silver of making predictions that were not ‘falsifiable’ because they couldn’t be tested, and criticised the way FiveThirtyEight appeared to hedge its bets using probabilities.

Virgil’s criticism of Nate’s methods on the basis of Diggler’s success is misplaced. It is simply untrue that Nate’s model is not falsifiable – it can be tested. By calculating a measure known as the Brier score, which rewards both accurate and brave predictions, my colleague Alex Szorkovszky and I have shown that FiveThirtyEight’s 2016 state-by-state predictions do beat Virgil’s, albeit narrowly.
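
To make the Brier score concrete, here is a minimal sketch in Python. It assumes the standard definition (the mean squared difference between each forecast probability and the 0-or-1 outcome, so lower is better). The forecasts below are invented for illustration; they are not the FiveThirtyEight or Diggler predictions.

```python
def brier_score(forecasts, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes.

    Lower is better: 0.0 is a perfect score, and always saying 50 per cent
    scores 0.25 whatever happens.
    """
    if len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must have the same length")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


# Hypothetical example: four contests, 1 = the named candidate won.
outcomes = [1, 1, 0, 1]

brave = [0.9, 0.9, 0.1, 0.9]      # right direction, stated with confidence
cautious = [0.6, 0.6, 0.4, 0.6]   # right direction, heavily hedged
coin_flip = [0.5, 0.5, 0.5, 0.5]  # no information at all

for name, forecast in [("brave", brave), ("cautious", cautious), ("coin flip", coin_flip)]:
    print(name, round(brier_score(forecast, outcomes), 3))
```

All three forecasters lean the right way or sit on the fence, but the brave one gets the lowest (best) score. Had the brave forecaster been confidently wrong instead, the squared penalty would have pushed the score above the coin flip's 0.25, which is the sense in which the measure rewards predictions that are both accurate and brave.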

But FiveThirtyEight has another problem, a more serious one. Over the last three US elections, Nate and his team have not reliably beaten prediction markets, such as PredictIt and Intrade. Moreover, while both FiveThirtyEight and Intrade make reasonable predictions, they are not independent.

How the predicted probability of a Clinton victory changed in the months leading up to the 2016 US presidential election for FiveThirtyEight (solid line) and PredictIt (dotted line).

The figure above shows that the PredictIt and FiveThirtyEight predictions lay very close together during the US presidential race of 2016. This can be partly explained by PredictIt users exploiting FiveThirtyEight. Indeed, in the Superforecasters’ forum discussions, the single most frequent source of information was Nate Silver’s website. However, the whole point of a prediction market is that it brings together different pieces of information, weighing them in proportion to their quality. So while FiveThirtyEight would be considered a high-quality source, it is unlikely that it was the sole source of information for the PredictIt market. And there is certainly no evidence of PredictIt forecasts following the ups and downs of FiveThirtyEight with a delay.
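
One simple way to look for such a delay is to correlate day-to-day changes in one series with changes in the other a few days later. The sketch below is only an illustration of that idea, not the analysis behind the figure: the function names and the two short series are hypothetical, and the ‘market’ series is deliberately constructed as a one-day-late copy of the ‘model’ series so that the lag is visible.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def lagged_correlation(leader, follower, lag):
    """Correlate daily changes in `leader` with changes in `follower` `lag` days later."""
    d_lead = [b - a for a, b in zip(leader, leader[1:])]
    d_follow = [b - a for a, b in zip(follower, follower[1:])]
    if lag > 0:
        # Drop the unmatched ends so each pair is offset by `lag` days.
        d_lead, d_follow = d_lead[:-lag], d_follow[lag:]
    return pearson(d_lead, d_follow)


# Invented daily win probabilities (not real FiveThirtyEight or PredictIt data).
model = [0.70, 0.72, 0.68, 0.75, 0.71, 0.66, 0.69, 0.74]
market = [0.70] + model[:-1]  # assumption: the market copies the model one day late

same_day = lagged_correlation(model, market, lag=0)
one_day_later = lagged_correlation(model, market, lag=1)
print(round(same_day, 2), round(one_day_later, 2))
```

In this constructed case the correlation is far higher at a lag of one day than on the same day, which is the signature of one series trailing the other. The claim in the text is that the real PredictIt series shows no such pattern relative to FiveThirtyEight.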

FiveThirtyEight doesn’t explicitly use betting market data in its model. However, Silver, a former professional gambler, understands very well that prediction markets and bookmakers’ odds give a better reflection of the probability an event will happen than the polls themselves. He could see that the markets were not very certain about a Clinton victory. Other models, at the New York Times and the Huffington Post, which were based purely on polls, were predicting win probabilities of 91 per cent and 99 per cent, respectively, for the Democratic candidate. The FiveThirtyEight team applied an adjustment to the polls in order to reflect uncertainty about the outcome, bringing its forecast nearer to the market odds.

While this adjustment turned out to be justified in terms of getting the election less wrong than FiveThirtyEight’s competitors, the fine-tuning raises an issue about the basis of Silver’s approach. Former FiveThirtyEight writer Mona Chalabi told me that Nate’s team would use phrases such as ‘we just have to be extra cautious’, to express a shared understanding within their newsroom that the model shouldn’t give too strong predictions for Clinton. They were aware that they would be judged after the election in the same black-and-white terms that humans always judge predictions: they would either be winners or losers.

Mona, who is now a data editor at the Guardian US, told me: ‘The ultimate flaw in FiveThirtyEight, and all election forecasting, is the belief that there is a method to correct for all the limitations in the polls. There isn’t.’ Academic research has shown that polls are typically less accurate than prediction markets. As a result, FiveThirtyEight has to find a way of improving its predictions. There is no rigorous statistical methodology for making these improvements; they depend much more on the skill of the individual modeller in understanding what factors are likely to be important in the election. It is data alchemy: combining the statistics from the polls with an intuition for what is going on in the campaign.

Mona made this point very strongly when I talked to her: ‘The polls are the essential ingredient to prediction and the polls are wrong. So if you take away the polls, how exactly are they going to predict the election?’

FiveThirtyEight is an almost entirely white newsroom. They are, for the most part, American, Democrats and male. They have followed the same courses in statistics, and share the same world view. This background and training means they have very little insight into the mind of the voter. They don’t talk directly to people to get a sense of the feelings and emotions involved, an approach that would be considered subjective. Instead, Mona described to me a culture where colleagues judged each other on how advanced their mathematical techniques were. They believed there was a direct trade-off between the quality of statistical results and the ease with which they can be communicated.

If FiveThirtyEight offered a purely statistical model of the polls then the socio-economic background of their statisticians wouldn’t be relevant. But they don’t offer a purely statistical model. Such a model would have come out strongly for Clinton. Instead, they use a combination of their skills as forecasters and the underlying numbers. Work environments consisting of people with the same background and ideas typically perform less well on difficult tasks, such as academic research and running a successful business. It is difficult for a bunch of people who all have the same background to identify all of the complex factors involved in predicting the future. Prediction markets, by contrast, enable people with different backgrounds to contribute to better forecasts.

We are outnumbered by statistical experts like Nate Silver because we believe that they have a better answer than we do. They don’t. They might be better than chimpanzees with darts and they might narrowly beat a (pretend) pundit like Carl Diggler, but they don’t beat our collective wisdom in prediction markets.

If you are interested in finding out more about the intricacies of creating models, then I thoroughly recommend the FiveThirtyEight pages. If you are just checking the headline number for the upcoming Senate election, you are wasting your time. Use the bookmakers’ odds instead.

This is an extract from Outnumbered, Bloomsbury, 2018.