Why algorithms are no better than humans at predicting exam results, goals in football, musical taste or criminal reoffending

It is very likely that at least some of the people who suggested using an algorithm to predict A-level results thought that they were being scientific and rational. They imagined that their algorithm would be neutral, remove bias and do an overall better job than the teachers, who are too close to the students to remain clear-headed.

The irony is that it is exactly this thinking that is unscientific, irrational and biased. Let me explain why, starting with a metric from football called ‘expected goals’. Expected goals are calculated by feeding a lot of data about historical chances (shot location, whether the shot was made with foot or head, etc.) into a statistical model. In doing so we create an algorithm that can rank the quality of goal chances in future matches.
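To make ‘feeding data into a statistical model’ a little more concrete, here is a minimal sketch, assuming a logistic regression on just two features: shot distance and whether the shot was a header. The shot data is synthetic, and the function names (`make_synthetic_shots`, `fit_xg_model`, `expected_goals`) and the numbers used to generate the data are invented for this illustration. Real expected goals models are trained on real historical shots and use many more features.

```python
import math
import random


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def make_synthetic_shots(n, seed=0):
    """Generate made-up shots as (distance_m, is_header, scored) tuples."""
    rng = random.Random(seed)
    shots = []
    for _ in range(n):
        distance = rng.uniform(2.0, 35.0)
        is_header = 1.0 if rng.random() < 0.2 else 0.0
        # Invented scoring probability: close shots and foot shots go in more often.
        p = sigmoid(1.0 - 0.15 * distance - 0.8 * is_header)
        scored = 1.0 if rng.random() < p else 0.0
        shots.append((distance, is_header, scored))
    return shots


def features(distance, is_header):
    """Feature vector: constant term, distance in units of 10 m, header flag."""
    return (1.0, distance / 10.0, is_header)


def fit_xg_model(shots, lr=0.5, epochs=1000):
    """Fit a logistic regression by batch gradient descent; returns the weights."""
    w = [0.0, 0.0, 0.0]
    n = len(shots)
    for _ in range(epochs):
        grad = [0.0, 0.0, 0.0]
        for distance, is_header, scored in shots:
            x = features(distance, is_header)
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - scored
            for j in range(3):
                grad[j] += err * x[j]
        w = [wi - lr * g / n for wi, g in zip(w, grad)]
    return w


def expected_goals(w, distance, is_header):
    """Estimated probability that a shot with these features is scored."""
    x = features(distance, is_header)
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))


model = fit_xg_model(make_synthetic_shots(1000))
print("xG, foot shot from 6 m: ", round(expected_goals(model, 6.0, 0.0), 2))
print("xG, foot shot from 25 m:", round(expected_goals(model, 25.0, 0.0), 2))
```

Summing these per-shot probabilities over a match gives a team's expected goals for that match, which is the quantity compared against real goals in later games.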

This algorithm is useful: teams with more expected goals in their previous matches tend to score more real goals in their subsequent ones.

But…

There is another reliable way of measuring chances in football: simply ask humans who watched the match what they think. Many sports performance companies use this method. A trained human operator looks carefully at every shot and labels it. If they think it wasn’t much of a chance they label it ‘not a big chance’; if they think it was a big chance they label it ‘big chance’. Simple as that. Well, not quite as simple: some companies label their chances on a scale of 1 to 6. But the principle is the same.

So, which method do you think best predicts the future performance of teams?

Researcher Garry Gelade (who sadly passed away earlier this summer) looked at this question in 2017 and found that the expected goals model was unable to outperform the operators labelling big chances. It might initially sound impressive that we have an algorithm for measuring performance in football, but the method doesn’t outperform an educated football fan (operators are typically recruited from football enthusiasts) making a note each time a team generates a goal-scoring chance.

Picture credit: Mohamed Mahmoud Hassan

This result is not limited to football. In fact, in my book Outnumbered I argue that it is a fundamental limit of algorithmic prediction: algorithms don’t outperform humans at these types of tasks.

One of the most powerful demonstrations of this principle was Julia Dressel’s work on the criminal sentencing algorithm COMPAS. In her study, the researchers paid Mechanical Turk workers, all of whom were based in the USA, $1 to evaluate 50 different defendant descriptions. After seeing each description, they were asked, ‘Do you think this person will commit another crime within two years?’, to which they answered either ‘yes’ or ‘no’. On average, the participants were correct at a level comparable to the commercial software used by judges, suggesting very little advantage to the recommendation algorithm. Again, humans and algorithms are equally accurate.

Talking to practitioners in data science gives a similar picture. A few years ago, for an article for the Economist’s 1843 magazine, I spoke to Glenn McDonald, who developed Spotify’s music recommendation algorithm. Before the interview, I was a bit nervous about revealing my own opinion of Spotify’s suggestions. I had used the ‘Discover Weekly’ service now and again to find new music, but was often frustrated. I tend to like melancholy songs, but when I listened to the tunes suggested by Spotify they didn’t have the same emotional effect as my own sad favourites. In fact, the suggested songs tended to be quite boring. Many Spotify users complain of the same problem: the songs it recommends are watered-down versions of their true favourites.

When I told Glenn that I often found myself flipping through song after song without settling on any of the recommendations, I expected him to be slightly disappointed. But he was happy to admit his algorithm’s limitations. ‘We can’t expect to capture how you personally attach to a song,’ he told me.

Glenn told me that the process of making recommendations is far from a pure science: ‘half of my job is trying to work out which computer-generated responses make sense’. When Glenn chose his job title, he asked to be called ‘data alchemist’ instead of ‘data scientist’. He sees his job not as searching for abstract truths about musical styles, but as providing classifications that make sense to people.

Glenn’s attitude to life is admirable, but raises a big question about where we can and can’t use algorithms. While you might not mind an alchemist picking your music, would you be happy to allow them to decide your A-level results?

Algorithms have the advantage over humans that they can be scaled up to serve millions of people at the same time. This is why they are so powerful when used by Amazon, Facebook and Spotify. But in the case of A-levels there is an army of teachers ready to make the decision on students whom they know and whose work they have observed. The experts are there. So, if the government really wanted to be scientific in the way it assesses A-levels, then it should have trusted those teachers from the start. Anything else is pure alchemy.

Professor of Applied Mathematics. Books: The Ten Equations (2020); Outnumbered (2018); Soccermatics (2016) and Collective Animal Behavior (2010).