Since Hudl Statsbomb Data debuted in 2018, our expected goals (or xG) model has always been a little bit different. The vision with Hudl Statsbomb Data was to transition football data from the world of proxies into a more accurate reflection of what is actually happening on the pitch. Right out of the gate, we added the locations of the goalkeeper and defenders around a shot, on every shot, in every league that we collect. This seemingly small upgrade delivered substantial improvements in xG accuracy in densely packed penalty areas, and especially when the GK is out of position.
Combine better xG numbers with pressures, pass footedness, a host of other USPs, and an overarching emphasis on quality and accuracy in the data, and it’s easy to see why Hudl Statsbomb Data has become the default choice for smart teams, federations, and gamblers everywhere. Our data is more accurate not only about where events occur on the pitch and in what order, but also about when they occur. That means that Hudl Statsbomb Data is easier to merge effectively with tracking data than any other event data on the market.
However, sports data is a competitive space and every year we strive to get better. Last season we made two major upgrades to our shot information:
- All freeze frames around shots are collected by computer vision. This gives us better location information for the ball, attackers, and defenders than ever before.
- We added something called Shot Impact Height (SIH) to shot data. Simply put, Shot Impact Height is the height (or z coordinate) of the ball when a shot is struck.
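To make the two upgrades concrete, here is a minimal sketch of what a shot record carrying both pieces of information might look like. The class names, field names, coordinates, and units below are all assumptions for illustration; they are not the Hudl Statsbomb data specification.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FreezeFramePlayer:
    # Hypothetical freeze-frame entry: one player's position when the shot is struck.
    x: float
    y: float
    teammate: bool
    is_goalkeeper: bool = False


@dataclass
class Shot:
    # Hypothetical shot record; field names and units are assumptions.
    x: float
    y: float
    impact_height: float  # z coordinate of the ball when struck (metres assumed)
    freeze_frame: List[FreezeFramePlayer] = field(default_factory=list)

    def is_on_ground(self) -> bool:
        # A shot impact height of zero means the ball was on the ground.
        return self.impact_height == 0.0


# Made-up example values, purely to show the shape of the record.
shot = Shot(
    x=114.0,
    y=40.0,
    impact_height=0.0,
    freeze_frame=[FreezeFramePlayer(x=119.0, y=40.0, teammate=False, is_goalkeeper=True)],
)
print(shot.is_on_ground())
```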
Why collect Shot Impact Height? Because looking at the game from a football perspective, it felt like there might be a difference between a header struck at head height vs one 40cm higher that skims off the top of someone’s head. (Or uh… a header struck while totally on the ground? Actually skip that, that data is totally biased.) Or a volley just off the ground vs one at chest height.
What’s interesting is that we didn’t know any of this when we started collecting the data. We simply added the features to data collection and then revisited them this summer to investigate further.
What Have We Learned?
The differences for the majority of shots are zero or very small. This is because of another feature already incorporated in the xG model - pass height. We have always collected information on whether a pass was on the ground or not, and this info was passed into our xG model. This meant that there was already some information on potential shot height baked into the model, which is why most xG numbers don’t move.
However… for some shots, shot impact height made a big difference. These shots are largely ones where the model no longer knows the height of the ball because the shots occur after some event that is not a pass (like a rebound, player control, deflection, second-phase set piece, whatever). And for those shots, having shot impact height does move the needle.
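The distinction can be sketched in a few lines of Python. This is an illustration only: the `Event` record, its fields, and the `shots_without_pass_height` helper are hypothetical and are not the Hudl Statsbomb schema or model code. It simply flags shots whose immediately preceding event is not a pass, which are the shots where the model previously had no height information to lean on.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Event:
    # Hypothetical, simplified event record (not the real data spec).
    event_type: str                        # e.g. "pass", "shot", "rebound"
    pass_on_ground: Optional[bool] = None  # only meaningful for passes


def shots_without_pass_height(events: List[Event]) -> List[int]:
    """Return indices of shots whose immediately preceding event is not a pass."""
    flagged = []
    for i, ev in enumerate(events):
        if ev.event_type != "shot":
            continue
        prev = events[i - 1] if i > 0 else None
        if prev is None or prev.event_type != "pass":
            flagged.append(i)
    return flagged


events = [
    Event("pass", pass_on_ground=False),
    Event("shot"),     # follows a pass: pass height already informs the model
    Event("rebound"),
    Event("shot"),     # follows a rebound: no pass height available
]
print(shots_without_pass_height(events))
```

In this toy sequence only the second shot is flagged, since it follows a rebound rather than a pass.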
Here is the writeup from our CTO, Thom Lawrence:
Our hope was that explicitly including shot impact height would improve the model in some edge cases, especially those requiring particularly majestic leaps or difficult bodily contortions to connect with. While this addition doesn’t provide a huge leap forward in accuracy overall, there certainly are many shots whose values far better pass the eye test now when compared to video.
For example, in this shot by Ajaccio against Caen, we previously had around 0.6 xG. While the shot is crowded by defenders, it’s a kicked shot within the six yard box, with the keeper completely out of the way, which normally scores highly.
However, watching the video, we can see that the shot itself is actually at about waist height:
This means the technique required is much harder to execute (the shooter manages to get a shot off but it’s as much a toe-poked half-volley as anything else). The model identifies shot impact height as a significant factor in its prediction here, and the new score is in the region of 0.3 xG.
Interestingly, shot impact height doesn’t just have a negative effect on the xG of difficult shots. It’s also informative for some shots in positive ways. For example, a shot impact height of zero, i.e. the ball is on the ground, can sometimes be very useful for detecting great chances as well, and we see improvement on those chances across the data set.
Further Examples
Old xG: 0.65
xG with SIH: 0.35
Old xG: 0.40
xG with SIH: 0.20
Old xG: 0.54
xG with SIH: 0.29
Old xG: 0.67
xG with SIH: 0.47
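For a quick sense of scale, the four before-and-after pairs above can be turned into absolute and relative changes. The snippet below is just arithmetic on the numbers quoted in this post; the `xg_changes` helper is a hypothetical name and nothing here is model code.

```python
# (old xG, xG with SIH) pairs copied from the four examples above.
examples = [(0.65, 0.35), (0.40, 0.20), (0.54, 0.29), (0.67, 0.47)]


def xg_changes(pairs):
    """Return (absolute change, relative change) per example, rounded to 2 places."""
    out = []
    for old, new in pairs:
        diff = round(new - old, 2)
        rel = round((new - old) / old, 2)
        out.append((diff, rel))
    return out


for (old, new), (diff, rel) in zip(examples, xg_changes(examples)):
    print(f"{old:.2f} -> {new:.2f}: {diff:+.2f} xG ({rel:+.0%})")
```

Each of the four shots drops by between 0.20 and 0.30 xG once shot impact height is included.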
This is not dissimilar to having GK information on shots. For the vast majority of shots where the GK is in a normal position, having GK position is of nominal value. BUT for the times when the GK is way out of position in the goal, the difference between having their position and not having their position in the model is massive.
GK Position Examples
No GK Position: 0.48
Hudl Statsbomb xG: 0.82
No GK Position: 0.27
Hudl Statsbomb xG: 0.74
Like GK position, having SIH in the model more accurately reflects expected goals on a per shot basis and across the data set. Each upgrade in data collection brings Hudl Statsbomb Data just a little bit closer to fully reflecting what is happening on the pitch.
Conclusions
- Ground shots are even more valuable than we thought before.
- Shot Impact Height has a sizeable impact on a subset of shots that don’t already have pass heights attached to them.
- GK Position, Pass Height, and Shot Impact Height combine to form a significant improvement in expected goals models and reduce xG model errors on outliers.
- Having richer data with better quality actually does matter.
If you want to know more about what Hudl Statsbomb can do for you, please get in touch.