A couple of weeks ago, I promised on Twitter to look at the data at Football-Data.co.uk and compare it to the publicly available Opta data found at places like EPL Index. One of the largest problems for amateur soccer analysts is getting hold of useful data. Unlike North American sports, which somehow have reams of publicly available data to crunch (mostly as a post-Sabermetrics legacy, it seems), soccer data is dominated by collection companies like Opta, Prozone, and Infostrada. While these companies may collect rafts of data for later sale to teams, betting companies, and rich sports bettors, they make very little of that information available to the general public. This policy is a source of constant frustration in the analytics community, and almost certainly holds back the discipline. (I’m not judging: data collection is clearly expensive, and they need to pay for it. That said, it would be really useful to get a lot more top-level data released, and doing so would likely impact their ability to sell data to interested parties by only a fractional amount.)
That said, there IS a source of publicly available data for a number of leagues and countries at the Football Data site. The data contained there covers 11 countries, multiple leagues in each country, and some of it goes back for a decade. It’s not as complex as the stuff you would get from the professional companies, but it IS data, and that in itself is laudable and potentially useful.
One of the issues with this data, however, is that its definition of a “Shot on Target” is completely different from the one Opta uses. (You can get more info on the differences in this forum thread.) Since that sort of thing matters for a number of models, I have publicly stated concerns about using it for data crunching instead of the professional scraps available elsewhere. That said, James Grayson uses it for most/all of his research, so I promised to take a quick look at it after my holiday and see what conclusions I could draw.
The Good News
Shot data, which is pretty simple to track, is close enough to make it usable. I looked at current season totals for Arsenal, Aston Villa, and Chelsea, and all of them end up fairly close to the aggregate info I have been tracking from EPL Index. The Football-Data info undercounts both Shots and Shots Against vs. what Opta tracks, but it’s within 10% of the aggregate totals, and since the differences are fairly consistent on both sides, TSR (Total Shots Ratio) stays close to the same.
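To see why a consistent undercount barely moves TSR, here is a quick sketch. It assumes TSR is defined the usual way, as shots for divided by shots for plus shots against. The season totals and undercount percentages are made up for illustration; they are not the actual Arsenal, Aston Villa, or Chelsea figures.

```python
def total_shots_ratio(shots_for, shots_against):
    """Share of all shots in a team's matches that the team itself took."""
    total = shots_for + shots_against
    if total == 0:
        raise ValueError("no shots recorded")
    return shots_for / total


# Hypothetical season totals, invented for illustration only.
opta_for, opta_against = 500, 400

# Case 1: the second source undercounts both sides by the same 8%.
even_for, even_against = opta_for * 0.92, opta_against * 0.92

# Case 2: it undercounts shots for by 10% but shots against by only 6%.
uneven_for, uneven_against = opta_for * 0.90, opta_against * 0.94

tsr_opta = total_shots_ratio(opta_for, opta_against)
tsr_even = total_shots_ratio(even_for, even_against)
tsr_uneven = total_shots_ratio(uneven_for, uneven_against)

print(f"Baseline TSR:          {tsr_opta:.4f}")
print(f"Even undercount TSR:   {tsr_even:.4f}")
print(f"Uneven undercount TSR: {tsr_uneven:.4f}")
```

When both sides are scaled by the same factor, the factor cancels out of the ratio and TSR is unchanged. When the undercount is lopsided, TSR drifts, but only by about a point in this made-up case, which is why "fairly consistent on both sides" is good enough.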
The Bad News
Shots on Target info is completely different.
Whereas the shot tracking data is within 10%, Shots on Target has huge differences due to the different definitions used by the tracking companies. If your model uses Shots on Target modifications to adjust goal expectations, it won’t translate at all between the big companies and FD data.
One thing you might be able to do is crunch data across years for the SoT information and come up with a modifier for the Football Data info. Eyeballing these three clubs suggests that the FD numbers are between 63% and 70% higher than what Opta says. If that ratio is fairly constant, you can simply plug in the modifier and reverse engineer numbers close to Opta’s for all the years Football Data covers. It will be an imperfect solution regardless, but it might get you close enough not to matter. I don’t have the time to do this right now, so if you do it (and crunch it across the lower leagues and the different countries) and find something useful, please let me know.
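A rough sketch of what that modifier approach could look like is below. The function names are my own, and the paired totals are invented placeholders sitting inside the 63-70% range, not real club data. The idea is to pool totals for team-seasons where you have both sources, check how much the per-team ratio wobbles, and divide FD numbers by the pooled ratio.

```python
def sot_ratios(paired_totals):
    """Per-team ratio of FD shots on target to Opta shots on target."""
    return [fd / opta for fd, opta in paired_totals]


def estimate_modifier(paired_totals):
    """Pooled FD-to-Opta ratio across every paired team-season."""
    fd_sum = sum(fd for fd, _ in paired_totals)
    opta_sum = sum(opta for _, opta in paired_totals)
    if opta_sum == 0:
        raise ValueError("no Opta shots on target to compare against")
    return fd_sum / opta_sum


def to_opta_scale(fd_sot, modifier):
    """Scale an FD shots-on-target count down to an Opta-style estimate."""
    return fd_sot / modifier


# Hypothetical (fd_sot, opta_sot) pairs, invented for illustration only.
paired_totals = [(163, 100), (170, 100), (165, 100)]

ratios = sot_ratios(paired_totals)
modifier = estimate_modifier(paired_totals)
spread = max(ratios) - min(ratios)

print(f"Pooled modifier: {modifier:.3f}")
print(f"Spread of per-team ratios: {spread:.3f}")
print(f"FD count of 166 on the Opta scale: {to_opta_scale(166, modifier):.1f}")
```

The spread is the thing to watch. If it stays small across seasons, leagues, and countries, a single modifier is probably workable; if it jumps around, you would need separate modifiers or a different approach entirely.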
So there’s good news and bad news here. The good news is that shot data is fairly consistent between the two sources, which to me means you can use all the extra years FD has with some confidence. On the other hand, if you need Shots on Target info, you’ll need to put a lot more thought and work into the subject before you can come to any real conclusions. For now I’m going to stick with using whatever Opta info I have available, but this summer I might delve into Football League data from Football-Data.co.uk a bit more and see how I feel after that.
Best of luck!
@mixedknuts on Twitter