Regretting Regression: Arsenal, Spurs and the Limits of Regression Analysis in Football
As the season approaches the home stretch, the table looks pretty much as expected, and as expected goals predicted it would. But ignoring the journey to highlight the destination misses the point.
I write A LOT about analytics. Obviously. There’s a word, however, that I avoid like the plague. I don’t use it in my writing. If I can edit it out of other writers’ work without too much pain, I put the proverbial line through it. On podcasts (especially of the non-analytics bent) I do all sorts of linguistic gymnastics to talk around it. That word is regression. And that goes double for its nerdier, more technically accurate cousin, regression to the mean. Unless I’m absolutely cornered and there’s no way to avoid it, I simply won’t type, or tweet, or say those words.
There’s no problem with the concept of regression. Two separate flavors of it generally operate in the football world. First, team performances generally regress to their resource levels. That is, spending eventually equates to points. It’s not a perfect relationship (no relationship ever is): spending fluctuates, and its effect can arrive with a time lag. But, as a rule, it’s a pretty good one. The more a team spends, the more points they’re likely to have.
Second, teams’ goals scored and goals allowed generally regress to their expected goals levels. That’s what makes expected goals the gold standard of predictive stats. It works. When it comes to predicting the future, expected goals is a better indicator of where things are going than anything else, including shots or goals themselves. That’s what it was designed to do and that’s what it does quite well.
So, why don’t I use it? Well, the first issue is a technical one. Regression, which is a true thing that happens, all too easily slips into the gambler’s fallacy, which isn’t. A team that’s getting lucky isn’t due to get unlucky to balance things out. They’re just due to stop getting lucky. That’s a small but crucial difference in how to best analyze these things. Without very careful argumentation, it’s easy to slip into the language of the gambler’s fallacy, of the world evening things out, when what’s closer to the truth is that the world doesn’t really care.
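For anyone who wants to see that difference in numbers, here is a minimal sketch in Python. Everything in it is an assumption made up for illustration: every team takes 12 shots a match, every shot is worth 0.1 xG, and a "hot" team is one that scores 17 or more in a ten-match stretch where xG says 12. None of these figures come from real Premier League data, and `simulate` and `goals_in_stretch` are hypothetical helper names.

```python
import random

# Illustrative assumptions, not real data: every team is identical.
SHOTS_PER_MATCH = 12   # hypothetical shot volume per match
XG_PER_SHOT = 0.1      # hypothetical chance that any one shot is scored
MATCHES = 10           # length of each stretch of the season
EXPECTED_GOALS = SHOTS_PER_MATCH * XG_PER_SHOT * MATCHES  # about 12 per stretch


def goals_in_stretch(rng):
    """Goals scored over one stretch, treating each shot as an independent coin flip."""
    shots = SHOTS_PER_MATCH * MATCHES
    return sum(rng.random() < XG_PER_SHOT for _ in range(shots))


def simulate(n_teams=5000, hot_threshold=17, seed=1):
    """Return (number of hot teams, their average goals in the first stretch,
    their average goals in the second stretch)."""
    rng = random.Random(seed)
    hot_first, hot_second = [], []
    for _ in range(n_teams):
        first = goals_in_stretch(rng)
        second = goals_in_stretch(rng)
        # Keep only the teams that ran well ahead of their xG early on.
        if first >= hot_threshold:
            hot_first.append(first)
            hot_second.append(second)
    n = len(hot_first)
    if n == 0:
        return 0, 0.0, 0.0
    return n, sum(hot_first) / n, sum(hot_second) / n


if __name__ == "__main__":
    n, first_avg, second_avg = simulate()
    print(f"hot teams: {n}")
    print(f"xG per stretch: {EXPECTED_GOALS:.1f}")
    print(f"first stretch goals (hot teams): {first_avg:.2f}")
    print(f"second stretch goals (hot teams): {second_avg:.2f}")
```

Under these assumptions, the hot teams' second-stretch average should land close to the 12 that xG predicts, not below it. The gambler's fallacy would expect a cold spell to pay back the hot one; the simulation has no mechanism for that, because each shot knows nothing about the shots before it.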
But there’s another, more fundamental reason I avoid regression, and that’s the way it reduces discussions to a single variable. That’s the power of expected goals: it reduces everything to a single number, which is what makes it useful. It also hides everything that’s going on under the hood. And that’s the interesting stuff.
The reason expected goals, and by extension regression in general, works is that there are a million factors that go into the equation. Things become unsustainable precisely because it takes any number of things lining up just right to break loose of the underlying numbers, even for a limited time. Talking about regression makes it seem like an issue of one thing changing, when really it’s about weighing which of the 17 different things currently going right is likely to go wrong.
Which brings us to this season. The table looks pretty normal. The top six are the top six. Manchester City, as expected, are odds-on to win the title with Liverpool second. The battle for fourth is alive and well. There are, from a big picture xG perspective, relatively few surprises. What that misses, though, is all the ways that teams might have done something unexpected, even though they ultimately didn’t.
At every point this season there have been things that have been tantalizingly close to becoming interesting and defying expectations. The way the world works is that most of those things won’t come to pass. This season maybe none of them will. That doesn’t mean they’re impossible.
Take Arsenal, for example. The beginning of their season was a classic case of regression to the mean waiting to happen. The combination of a ridiculous winning streak with mediocre underlying numbers screamed unsustainable. There was a shot, however, that under a new manager that mean might improve, that Unai Emery might implement his new system and improve the team so that, as their luck was running out, their numbers would improve. Such possibilities are the things from which xG-defying seasons are born. It didn’t quite happen.
What did happen is that Arsenal suffered a rash of injuries right when it appeared things might be coming together and that’s all it took. The beauty of using regression as a tool is that you can be fairly confident that eventually something will go wrong, without having to predict exactly what. The beauty of sports is in the exactly what.
Across North London something similar is happening for Spurs. Spurs under Pochettino are the rare side that can claim to regularly beat their xG numbers. But there’s beating your xG and then there’s what Spurs were doing at the height of this season. It seemed like absolutely nothing was going to slow that train down. Don’t have a midfield? No problem. Star player after star player gets hurt? No big deal. Even losing Harry Kane didn’t slow them down. And then, right as everybody got healthy, and a great season looked nailed on, bang, they’ve taken exactly one point in the last month, a four-game stretch which included losses against relegation candidates Burnley and Southampton.
While the Arsenal story is easy to tell, the Spurs one is harder. What exactly changed? Maybe they were just getting lucky all along and that luck ran out at precisely the wrong moment. Maybe their short squad was fatigued and, as the miles piled up, they lost whatever little edge their cult leader of a manager had managed to instill. For Arsenal, the story of regression obscures the story of exactly what happened this season. For Spurs, it obscures a still unanswered question (a question which may not have an answer beyond the whims of the gods).
Understanding the concept of regression to the mean is important. It’s the landscape on which the story of the season is set. But the work of analysis, of watching football and breaking it down, and mining the ins and outs in a search for the why of it all, that’s what brings the story to life. Understanding the underlying math makes it clear that most of the things that could, in theory, happen over the course of a season will not. But some will. And relying on regression closes off avenues of investigating what, if anything, causes those departures. Understanding why unsustainable things happen is as important as understanding that they’re unsustainable. And that’s why, while regression is important, its best use is in the background. The Premier League season so far has been an eventful and fascinating journey, but all that a typical regression will point out is that the destination seems pretty dull.