Introducing Wordalisations: automated explanations of data
With the release of TwelveGPT Open Source, I explain why wordalisations are going to be an important use of large language models in the future.
Nearly two years after the release of ChatGPT, there are two major challenges facing commercial applications of large language models.
1. Language models still say a lot of things which are not true.
2. Building viable products based on this technology has proven difficult.
Let’s imagine for a minute that we don’t believe Sam Altman that GPT-5 will be a massive step up from GPT-4 or that the abilities of large language models will continue to scale up with computing power as the team who created Claude believe. Let’s consider the possibility that the solution to 1 and 2 lies, not in computing power and giving these companies lots of money, but in our own human creative ability to understand the world…
It is this belief which underlies how I have approached using LLMs in football (soccer) during the last year. Instead of intensively training a model on more and more data, we (at Twelve football) have built our own mathematical models of the game of football (using our experience working with clubs), then used existing language models to explain what these models tell us. This is a very different approach to solving challenges 1 and 2, not only in football but in a whole range of applications.
In this article and the accompanying GitHub repository, I give an overview of how to create what I have started to call wordalisations: factual representations of data in words. I derived the term wordalisation from visualisation: just as data science benefited from being able to visualise patterns in data more clearly, by wordalising data we bring a clearer understanding of what the numbers tell us.
To get us started with this idea, let’s have a look at some examples. Here is a summary of Arsenal winger Bukayo Saka from one of Twelve’s automated match reports (see the full report here). This is from the professional version of TwelveGPT.
All the text in this wordalisation is generated automatically. The headline identifies his weakness last season in finishing (converting chances into goals), while the text below explains why he can be considered versatile and explosive. The visualisation shows (in two different ways) how he compares to other players in certain measurements. The wordalisation explains how we should interpret these measurements.
Wordalisations are built to contain only statements that are factually correct. To see how we do this, consider how we measure a player’s box threat (the ability to threaten the opponent’s penalty area). We start by looking at underlying data on how often a player enters the box with the ball, how often he or she receives the ball in the box, and other metrics (these are detailed in the full report). We then rank all players (using a weighted average) on these metrics to build the box threat quality. From there, we perform three steps. The first is to describe the overall quality and the metrics in words (such as poor, average, good or outstanding), the second is to create a small written training set which explains what the metrics mean in footballing terms, and the third is to provide examples of the types of reports we would like to see. The outputs of all of these steps are then fed into an existing large language model in order to get the final wordalisation.
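To make those steps concrete, here is a minimal sketch in Python. It is not the TwelveGPT code: the function names, the percentile cut-offs (25, 75, 95) and the prompt wording are all assumptions I have made up for illustration, and the real ranking and prompts are more involved. It shows a weighted average of metrics, a rank against other players, the translation of that rank into words, and the assembly of the text that would be handed to a language model.

```python
from bisect import bisect_right

# Hypothetical labels and percentile cut-offs, chosen only for illustration.
LABELS = ["poor", "average", "good", "outstanding"]
CUTOFFS = [25, 75, 95]


def weighted_quality(metrics, weights):
    """Combine a player's metrics into one quality score (weighted average)."""
    total = sum(weights.values())
    return sum(metrics[name] * w for name, w in weights.items()) / total


def percentile_rank(value, population):
    """Percentage of values in the population strictly below this value."""
    below = sum(1 for v in population if v < value)
    return 100.0 * below / len(population)


def describe(percentile):
    """Step 1: turn a percentile into a word such as 'good'."""
    return LABELS[bisect_right(CUTOFFS, percentile)]


def build_prompt(player, quality, quality_pct, metric_pcts, glossary, examples):
    """Combine the worded facts, the metric explanations (step 2) and the
    example reports (step 3) into one prompt string for a language model."""
    lines = [f"{player} is {describe(quality_pct)} in {quality}."]
    for name, pct in metric_pcts.items():
        # Each fact is the worded metric plus its footballing explanation.
        lines.append(f"- {name}: {describe(pct)}. {glossary.get(name, '')}".rstrip())
    facts = "\n".join(lines)
    shots = "\n\n".join(examples)
    return (
        "Write a short scouting summary using only the facts below.\n\n"
        f"Facts:\n{facts}\n\nExample reports:\n{shots}"
    )
```

The point of the design is that the language model never sees raw numbers it could misread. It only receives statements we have already decided are true, and its job is to phrase them well.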
There are lots of details I am skipping over here (and to really get into this you will have to start working with our open source release) but the main point I want to emphasise (with respect to challenges 1 and 2 above) is that none of this process involves building bigger language models. It involves using our own modelling skills and knowledge of a particular subject area (in this case football) to utilise already existing large language models (here we use ChatGPT, but Claude, Llama-2 or Gemini would work just as well). Human creativity is the key!
Whatever can be visualised can also be wordalised. We can also describe patterns in space and time, as these two examples from one of our match reports show.
These texts describe different aspects of Liverpool’s win over Manchester United in a recent Premier League match. They capture important nuances in the game: that Liverpool won because of their defensive work and that they were most dangerous in the middle and on the right side of the pitch. Even if the reader does not fully understand concepts such as xG (used in the figure), if they are “football literate”, they will understand the conclusions drawn from the data. The reader might not fully agree (many of the pundits focussed on two mistakes by Manchester United player Casemiro), but they will better understand what the data “tells us” about the game. A wordalisation is a subjective opinion (in the sense that we decided what was important to put into the model) based on objective data (Liverpool did make the passes into the final third shown in the figure above).
The possibilities of wordalisations are endless. What I have shown here is just a small fraction of the tools we have built during the past year at Twelve to explain football data in words. And there are many more application areas. My research group at Uppsala University are looking into wordalisations of everything from personality tests to socioeconomic data. When the commercial applications come it will be interesting to see how they impact the workplace. Management consultants, at places like McKinsey or Price Waterhouse, currently command high fees for “human-made” wordalisations of financial or business data. Psychometrics companies charge clients to “interpret” personality tests in words. My guess is that much of what these specialists do can be automated using the wordalisation approach.
In order to help others get into wordalisations, we are releasing TwelveGPT Open Source. The screenshot below shows how it assesses Peter Crouch in the 2017/18 season using a model of strikers we built as part of the Soccermatics course.
It certainly captures Crouch’s unique skillset in words!
Now it is up to you. If you want to work with these methods, the starting point is the README file, which gives an overview of TwelveGPT Open Source. But if you want to go deeper more quickly, I would recommend that you subscribe to Twelve Community, where you will get access to my Masterclass videos explaining not only how to build scouting reports like this, but also match reports and transfer models.
During November, Twelve Community will run a series of hands-on workshops, where we work together on these methods. It would be great if you could join us.
I really look forward to seeing others sharing their football and other wordalisations online. Have fun!