## The beginner’s guide to reading, writing and pitching about football analytics

Do you find yourself with time on your hands these days? Suddenly staying in on a Saturday night for the good of humanity? And just to top it all off, you have to seclude yourself with no sports to watch. Separately, have you noticed an explosion of numbers in football? A sudden rash of xGs springing up all over the place? Suddenly everybody seems to be spouting off about stats and you’ve got only the vaguest notion of what they’re on about?

Well, you’re in luck. StatsBomb copy-editor and general woman about town Kirsten Schlewitz is just like you! While she’s an expert at correcting the incredibly sloppy copy you’ve all come to know and love from me, she also came in as a relative novice at this whole stat thing. So, I roped her into asking me every question she could think of that she might have otherwise been afraid to ask. So, let’s get started.

K: First of all, let’s get the elephant out of the room. What is xG?

M: xG is short for expected goals. It’s a statistic that attempts to measure how likely any given shot is to become a goal. It’s really good at predicting the future. That is, xG is better at telling you which teams will score and concede goals going forward than any other statistic we have. That’s the most basic barebones definition I can think of.

K: If you already have xG to predict who will score and who won’t, why are so many other numbers needed? I see a great deal of figures and maps when I edit pieces, and sometimes I don’t understand what their purpose is. For example, when comparing two players, you can’t rely on xG for a team. So what numbers would be used there?

M: The answer to your first question is that knowing who is more likely to score goals going forward isn’t a particularly interesting thing to know (unless all you care about is betting, which, fair enough). The interesting questions are the hows and the whys. A single number like xG doesn’t help you very much with that. I like to think of xG as being a statistic that makes sure the conversation starts in the right place, as opposed to one that tells us anything remotely close to what we need to know.

So after the conversation gets started that’s why we need all the other stuff, to examine how teams play, what individual players are doing, basically what’s going on on the pitch that leads to the xG number at the end. And when those numbers come to particular players things can get very complicated very quickly. That’s because while xG works fine for players (specifically it can tell us when a player is on a hot or cold streak that’s unlikely to continue), shots are only a relatively small part of what’s happening. And, quite frankly, the further we get away from the actual shot, the less definitive our numbers become about what’s good and what’s bad, and the more we rely on them to try and accurately describe the game, as opposed to predict outcomes.

K: So you’re saying there are more or less two sets of numbers that a StatsBomb article could use: ones to predict which team will play better going forward, and ones that tell us what happened in a previous game (games?) in a way that dives deeper than simple match reports. If someone wants to write an article about, say, how they think the Champions League would have panned out this year, would they only use the prediction numbers, or would they also examine numbers that show what happened in previous matches?

Or am I way off base here and all the StatsBomb stats are used in conjunction with one another, rather than existing as two separate sets that focus on past and future?

M: So this is exactly right conceptually. The problem is that the numbers often overlap in ways which make the divide not particularly clear cut. For example, xG is an excellent stat for predicting the future, but it’s also a pretty ok one for explaining what happened. We know more about a match if we say that Arsenal had 1.5 xG than if we said Arsenal had 15 shots. Using the xG from a single game is kind of a quick and dirty way to describe what happened, albeit one with plenty of faults.

The best use of numbers though will always combine prediction and explanation. If I wanted to look at upcoming, now cancelled, Champions League matches, I would use general xG numbers as a starting place and say, “Here’s what I think will happen based on these numbers” and then use everything else to say, “And here’s why.” Now that also doesn’t mean xG is perfect. Doing good work in stats means trying to understand the limitations of the numbers as well so that we can understand when they might be missing something. So, in theory, it might be possible to analyze all the whys and hows and decide beforehand that even though a team like Liverpool might seem much better based on xG, they would struggle against Atléti (that’s not a conclusion I would have come to, but it’s not like completely beyond the pale to suggest).

K: We keep talking about “the rest of these numbers.” For someone who’s completely intimidated by stats, to the point they’re afraid to even click on a StatsBomb link, much less pitch you an idea, what other types of numbers would you anticipate they’d need to understand?

M: From a writing perspective, understanding the numbers is somewhat less important than understanding the game. If a writer is making accurate assertions about the game then those claims are going to be reflected in the numbers and in the editing process we can work together so that your friendly neighborhood StatsBomb editor (me) can help give you the appropriate statistical support you need.

So, if a writer wanted to write about how a team relied on a midfielder for a lot of their buildup play, they wouldn’t need to know the ins and outs of StatsBomb’s numbers. But I’d be able to call upon stats of ours like “deep progressions” to look at how frequently they move the ball up the field, or at passing percentages when they’re pressured and not pressured to explain how they’re cool in the face of a defense, or information and graphics on pass length, etc. etc. etc.

Now, if the numbers don’t match a writer’s argument that makes for an interesting challenge. The question of why a writer perceives the game a certain way while the numbers don’t capture it is generally a really exciting place to do analysis. Figuring out why there’s a disconnect between what the numbers capture and what the eye might see is usually an interesting endeavor for everybody involved.

K: I’m here editing and writing articles, and I fully admit I don’t comprehend exactly what half of these numbers mean. But if I wanted to submit an article that showed I do understand a few of the statistics, which do you think would be most important to understand?

M: You do need to understand the basic mechanics of xG and why it works so well. It’s important to understand that a player who has scored more goals than xG says he “should” have is likely to start scoring less. Beyond that I’m looking less for knowledge of a specific stat than for a way of thinking about questions. Questions like, “Do you have a statistic that measures XYZ” are good, questions like, “How do you go about measuring ABC” are even better.

K: From xG and its variations (non-penalty xG, open play xG etc), it’s relatively easy to assess the offensive strength, or lack thereof, of a side, even if you’re new to stats — and I can attest to this, believing I had no ability to comprehend sports statistics before I took this job. But what still tends to confuse me is the defensive measurements . . . I see the maps and figures, but even those don’t help me quite get it.

M: Yeah. Defense is hard. We can look at xG conceded, or shots conceded, or any number of other things, but those are still fundamentally measurements about what the other team’s attack is doing. And that makes sense, because on some level all defense is is preventing the other side from attacking. But it’s also unsatisfying because defenders are obviously doing SOMETHING and it would be nice to describe what those things are.

The traditional measures are things like tackles, interceptions and blocks, and while those are useful numbers, they have some major problems. The biggest is that you can’t commit those defensive actions while you have the ball, so players on bad teams tend to have more defensive actions than players on good ones that keep the ball all the time. One thing we do is adjust all of those numbers for possession, to try and give a better picture of what’s going on.

On top of that we track pressures. That is, we track every time a defender is close to an attacker with the ball and impacting him in some way. This gives us a lot of information: adding pressures into the mix demonstrates where on the field a team is making defensive actions. That gives us the ability to look at a heatmap of a team’s activity and really get a picture of where on the pitch they like to defend (the redder the square, the further above average the number of defensive actions is in the zone; the bluer the square, the further below). Manchester City defend basically in their opponents’ penalty area, for example.

All of that’s a long-winded way of saying that it’s really really hard to evaluate defenses!

K: So we know how offense is evaluated, and we know how defense is judged — somewhat, anyway. With these two necessary halves of the game described, I have one final question: What would you like to see a writer be able to demonstrate with the numbers, keeping in mind that the StatsBomb blog is there to both educate readers and show potential purchasers what they can do with the data?

M: The major thing I want to see isn’t a specific proficiency with data, but rather a framework for thinking about issues. Think about a question you want to answer, and how can you use data to answer that question. That’s what we’re all trying to do, whether it’s determining if a potential signing will be worth it, or why a player is having a career year, or if a keeper’s yips will pass, everybody is fundamentally doing the same thing. Whether it’s analysts with teams, or fans in the stands, or writers for StatsBomb, they’re looking at the game, developing a question and then trying to answer it.

## The Most Unpredictable League In The World?

### Just How Unpredictable Is The English Premier League?

With just over a month remaining in the 2013/14 season there is still all to play for in the Premier League. The league title, European qualification and the relegation battle all look like going right down to the wire. Many commentators are calling this the most unpredictable season ever and we often hear the Premier League referred to as “the most unpredictable league in the world”. Never being one to take a commentator’s word for something, I wanted to discover if this is really the case.

Just how ‘unpredictable’ is the Premier League?

What do we even mean by ‘unpredictable’? Can we measure it?

Furthermore, is there an ideal level of ‘unpredictability’ or ‘competitiveness’ for a league?

### How Can We Measure Unpredictability?

Fortunately there are companies whose job it is to accurately predict sporting events: bookmakers. The Football Data website records match statistics and pre-match bookmaker odds for thousands of football matches across Europe every season.

### How Accurate Are Bookmaker Predictions?

The website Kaggle runs competitions for predictive modelling of many scenarios including sporting events. Recently they ran a competition to predict the outcomes of US College Basketball matches during March Madness. Kaggle evaluated entries using the Binomial Deviance method and I will use the same scoring system here. Hopefully this isn’t as complicated as it sounds. ‘Binomial’ just describes the way matches are evaluated on a scale from 0 to 1 (1 for a home win, 0 for an away win) and ‘deviance’ just means we will measure by how much our predicted outcome deviates from the actual match outcome. The difference between the forecast outcome and the actual outcome is measured in terms of the log-loss between the two. The smaller the log-loss the more accurate the predictions are considered to be. The idea here is that a very confident prediction that is incorrect is ‘punished’ more than a less confident pick would be. This is perhaps best shown with an example:

Example: Liverpool vs Tottenham Hotspur (30th March 2014)

Liverpool were strongly favoured to win this match. The average bookmaker odds were:

• Home Win – 1.45
• Draw – 4.65
• Away Win – 6.76

Bookmakers’ odds imply the percentage chance that each game will end in a home win, draw or away win, so they can be easily converted to the 0 to 1 scale (a drawn match is scored as 0.50).
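
The conversion and scoring described above can be sketched in a few lines of Python. This is a minimal sketch rather than the exact calculation used for this article: it assumes the bookmaker margin is removed by rescaling the three implied probabilities so that they sum to 1, that a draw counts as half a home win, and that the log-loss uses natural logarithms. The function names are illustrative only.

```python
import math


def match_score_from_odds(home_odds, draw_odds, away_odds):
    """Convert decimal odds into an expected 'match score' between 0 and 1."""
    # Raw implied probabilities add up to more than 1 because of the bookmaker margin.
    raw = [1.0 / home_odds, 1.0 / draw_odds, 1.0 / away_odds]
    total = sum(raw)
    # ASSUMPTION: remove the margin by rescaling so the probabilities sum to 1.
    p_home, p_draw, _p_away = (r / total for r in raw)
    # A home win scores 1, a draw 0.5 and an away win 0.
    return p_home + 0.5 * p_draw


def log_loss(predicted, actual):
    """Log-loss between a predicted score and the actual score (1, 0.5 or 0)."""
    # ASSUMPTION: natural logarithm; predicted must lie strictly between 0 and 1.
    return -(actual * math.log(predicted)
             + (1.0 - actual) * math.log(1.0 - predicted))


def mean_log_loss(matches):
    """Average log-loss over an iterable of ((home, draw, away) odds, actual score)."""
    losses = [log_loss(match_score_from_odds(*odds), actual)
              for odds, actual in matches]
    return sum(losses) / len(losses)


# Liverpool vs Tottenham Hotspur, average bookmaker odds from the example.
score = match_score_from_odds(1.45, 4.65, 6.76)
print(f"expected match score: {score:.3f}")
for label, actual in [("home win", 1.0), ("draw", 0.5), ("away win", 0.0)]:
    print(f"{label}: log-loss {log_loss(score, actual):.3f}")
```

Under those assumptions the output agrees with the figures this article gives for the match to three decimal places. `mean_log_loss` applies the same scoring to a list of matches, which is the kind of per-match average used when comparing whole seasons or leagues.
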
The expected ‘score’ for this match from the bookmakers’ odds is therefore:

Expected ‘match score’: 0.757 [Please see comments section below for a full explanation of this calculation]

• Liverpool did win as expected (actual ‘match score’ of 1.000) so the resultant log-loss was small: 0.278
• If the match had been drawn (‘match score’ 0.500) the log-loss would have been larger: 0.847
• If Spurs had pulled off a shock win (‘match score’ 0.000) the log-loss would have been very large: 1.416

### How (In)Accurate Are Bookmaker Forecasts?

Now that we have a method for evaluating predictions we can produce the following chart:

[All data correct up to and including 1st April 2014]

This chart shows the average per match log-loss of pre-match bookmaker odds for the last 5 seasons of the EPL (remember the smaller the number the more accurate the predictions). It actually seems that the ‘predictability’ of the Premier League has remained pretty consistent over this period. If anything, this season has actually been the 2nd ‘easiest’ to predict in the last five years. Further details are below:

| Season | Average log-loss per match | Biggest upset | Log-loss of upset |
| --- | --- | --- | --- |
| 2013/14 | 0.591 | Man Utd 1-2 West Brom | 1.724 |
| 2012/13 | 0.603 | Chelsea 0-1 QPR | 1.945 |
| 2011/12 | 0.623 | Man Utd 2-3 Blackburn | 2.290 |
| 2010/11 | 0.635 | Arsenal 2-3 West Brom | 1.948 |
| 2009/10 | 0.583 | Tottenham 0-1 Wolves | 1.770 |

### What Is Happening Here?

Technically our scoring system is a measure of how ‘inaccurate’ the bookmaker predictions are. The smallest log-loss scores result from very confident predictions that prove to be correct (i.e. heavy favourites that go on to win their matches). Although the 13/14 title race remains unpredictable, in reality there have actually been very few genuine ‘upsets’ this season. The top teams have all been very consistent and have largely beaten the teams they are expected to. The biggest upsets have been Manchester United losing at home to West Brom (log-loss 1.724), Everton losing at home to Sunderland (1.588) and Chelsea losing away at Crystal Palace (1.525).

Towards the end of the recent Liverpool against Sunderland match the Sky Sports co-commentator Alan Smith described Sunderland’s pretty disappointing (and ultimately unsuccessful) second half comeback as something along the lines of “What makes this league so great”. Is this really the ideal level of unpredictability for a league?

### How Does The Premier League Compare To Other Leagues?

This table represents the same measure for the current 13/14 season for every league that is covered by Football Data (again, the smaller the number the more ‘predictable’ the league).

[All data correct up to and including 1st April 2014]

This table suggests that the Premier League is actually one of the more ‘predictable’ leagues around Europe. What might be causing this?

Is it possible that it is actually easier for bookmakers to set odds on some leagues than it is on others? It is certainly possible that there is some truth in this. Several of the leagues with the most accurate odds are also those that are the most covered in the media (EPL, Serie A, La Liga) and have the most information available. In contrast, I don’t think there are too many odds compilers who specialise in the Scottish lower leagues.

Does this mean we should all start betting on the Bundesliga Two? I won’t be rushing to do so just yet. I think any differences here are still very small and that this method should rather be considered as an interesting way to highlight differences in the competitive shape and balance of competitions.

For many of the leagues studied there appears to be an inverse relationship between how predictable the matches are and how competitive the league is. For example the leagues with the lowest average log loss include the SPL and Scottish Division One where Celtic and Rangers have already clinched the respective titles with a month to spare. The most predictable league is the Greek Superleague which has been won by the same team for the last 4 seasons. This method is still the best we have for evaluating competition ‘predictability’. If we consider this a useful measure of predictability then it is surely also a useful measure of the ‘competitiveness’ of a competition.

Why might the Premier League have a lower score than the Bundesliga? Although Bayern Munich has romped clear in Germany, below them the league has been very competitive. As mentioned, in the Premier League the top 4 teams have all been consistently excellent (the top 5 have only 4 home defeats between them all season). The title race remains open but it is widely accepted that it will probably be decided by the two games Liverpool play against Manchester City and Chelsea.

Does this mean commentators should be more careful what they describe as unpredictable? For the EPL it seems fair to say the title race is unpredictable but in general it is not actually one of the more unpredictable leagues. Is the Premier League actually not competitive enough?

### Is There An Ideal Level Of Predictability For A League?

The question of how competitive we might want the league to be is an important one and has implications for a wide range of decisions, in particular with regard to revenue distribution from the league’s lucrative media contracts. Many of the leagues that we have seen to be the most ‘predictable’ are also those that have very uneven financial structures. In contrast, the major US sports leagues such as the NFL and MLB openly engineer greater competition through the use of salary caps and draft systems. Yet is it really desirable to have a league where ‘anyone can beat anyone’? Does this mean every team is as good as each other? Or does this just mean every team is as bad as each other?

Before we get too excited and start speculating about revenue redistribution it is important to remember that the best Premier League clubs are also those that represent English football in UEFA competitions such as the Champions League. This is not a consideration for any of the major US sports as they do not have to compete with other leagues overseas. This season only 2 English teams have made the quarter finals and neither is favourite to progress. Interestingly, the favourites to win the Champions League (Bayern Munich, Barcelona, Real Madrid, PSG) are all sides who compete in seemingly lop-sided domestic competitions (see above). Is there an optimal balance to be sought between the competitiveness of a league competition and the opportunity it affords its best teams to build squads to rival the best in Europe?

### Conclusions

I admit my premise was a little facetious – I do not actually think the EPL is too predictable and actually think this has been the most interesting Premier League season for a long time. I am sure plenty of football fans in other leagues are envious of such a close finish in prospect. Also, I noted that only two of our sides are in the quarter finals but Manchester City and Arsenal didn’t exactly disgrace themselves – coming up against the 2 best sides in Europe and some unfortunate refereeing decisions.

Yet I do think there are some important issues to look at in terms of what it actually means to have a competitive league. Should competitiveness be ‘engineered’? What if this is to be at the expense of the performance of our sides in Europe? If this season is representative of the future then I think the current balance between the league and European performance is about right but this doesn’t mean we should be complacent. And it definitely doesn’t mean the Premier League is ‘the most unpredictable league in the world’.

## Part Two: Has Britain Got Talent? Is A Lack Of Data Holding Back British Football Clubs?

In Part One Oliver Page looked at what statistical data is available to domestic clubs outside of the Premier League and how clubs might be able to use this to increase their efficiency in the transfer market. In Part Two he investigates further the transfer market (under?)performance of these leagues and whether a way forward can be identified.

### Compare The Market: Is The Transfer Market Efficient?

How do domestic divisions perform in the transfer market in comparison to other leagues around Europe and the World?

In Part One I wrote about the value of using statistical player comparisons to make better informed transfer decisions. Similarly, I want to use league comparisons to look more closely at the apparent decline in transfer market performance of domestic divisions outside of the Premier League. Comparing anything across different leagues can obviously be problematic as inherent differences exist in the relative standard of those leagues. If player A has performed well in his league and player B has done the same in a different league can we really compare them? If league A generated £x million transfer revenue what does that mean? Is it just a reflection of the quality of that league?

To attempt to address this I conducted an on-line survey asking users to rate the relative strengths of 25 different leagues from across the world. The methodology is inspired by this article by Jay Ulfelder which also explains the scoring system. A sample of the current scores (as of March 10, 2014) is as follows:

• England – Premier League (95 out of 100)
• Germany – Bundesliga (91)
• Spain – Primera Division (87)
• Italy – Serie A (84)
• Netherlands – Eredivisie (70)
• Russia – Premier League (61)
• England – Championship (50)
• Scotland – Premier League (25)
• England – League One (23)
• England – League Two (6)

These results are based on 1,251 votes so far and the full results can be seen here. Obviously league standards can fluctuate over time (e.g. Glasgow Rangers’ demotion has weakened the SPL) but to my eye the ratings appear reasonable and I consider them a useful tool for comparison. I took a selection of these world leagues and plotted their ‘ratings score’ against their respective transfer revenues for the last 8 seasons. Please note from here on I am combining pairs of seasons (e.g. 2006/07 and 2007/08 combined) as otherwise a single transfer in one season can sometimes distort results. Of particular interest was the comparison between the 06/07 & 07/08 period and the 12/13 & 13/14 period which can be seen below along with the full table of results. [see notes at end of article for further details of methodology]

From 06/07 to 07/08 the English Championship generated more transfer revenue than the German Bundesliga.

For the period 06/07 to 07/08 both the Championship and SPL were generating considerably more revenue from player sales than many leagues of a similar, and even higher, rating. Since then however, they have both been overtaken by the ‘better leagues’ and caught up by many of the ‘worse’ leagues. There may be lots of different factors at play here (e.g. the most recent Russian Premier League revenue is skewed by the collapse of Anzhi Makhachkala) but the most recent chart does show evidence of a growing relationship between transfer revenue and rating score.

Worryingly for these leagues, the data suggests that the Championship and Scottish Premier League were actually OVERPERFORMING in the transfer market in previous years. Have they now just found their ‘true level’?

It is also interesting to note, however, that a number of these leagues that have shown an increase in revenue are also those that have built strong relationships with data providers. In Part One we saw how the level of detail with which data companies such as Opta, Wyscout and Prozone cover competitions can vary from league to league. For example, the Bundesliga, Eredivisie and Russian Premier League have all had the full-detail level of Opta data available for at least 4 full seasons now. For the Championship this data only became available during 2013/14 and for the SPL, League One and League Two it remains unavailable.

Obviously we should be careful not to draw sweeping conclusions – correlation does not imply causation – but it is difficult not to be intrigued by the possible existence of this additional relationship.

### Where Do Championship Clubs Sell Players To?

The data we have seen so far only shows total transfer revenue and a league could generate revenue just from buying and selling in-division and between its own members. Focusing now on the English Championship, where do its clubs sell their players to the most? In particular, what changed between the 06/07 to 07/08 period and the 12/13 to 13/14 period? [For an explanation of Superior 7 and Threatened 13 see this article by Infostrada Sports’ Head of Analysis Simon Gleave]

Firstly, Championship clubs appear to have next to no market for their players outside of the top two English divisions. The majority of transfer revenue has always been generated by sales to either teams in the Threatened 13 or the Championship. Interestingly, a similar pattern exists in terms of where Championship clubs buy players from too. For example, for the period 06/07 and 07/08, Championship transfer expenditure was £203.8M. £76.5M of this went on players from the Championship and £45.1M went on players from the Threatened 13.

Historically the Championship and Threatened 13 clubs have been locked in a cycle of selling and buying the same players to and from each other.

All three of the Championship’s main ‘customers’ are declining however. For example, sales within division are down from £89.7M to £35.1M and sales to Threatened 13 clubs are down from £95.4M to £58.3M.
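
To put those falls on a common scale, the two pairs of figures quoted above can be turned into percentage changes. This is only a small arithmetic sketch using the numbers in the text; the helper name `pct_change` is illustrative.

```python
def pct_change(before, after):
    """Percentage change from `before` to `after` (a negative value is a fall)."""
    return (after - before) / before * 100.0


# Championship sales revenue in £M: 06/07 to 07/08 period versus 12/13 to 13/14 period.
sales = {
    "within the Championship": (89.7, 35.1),
    "to Threatened 13 clubs": (95.4, 58.3),
}

changes = {buyer: pct_change(before, after)
           for buyer, (before, after) in sales.items()}

for buyer, change in changes.items():
    print(f"Sales {buyer}: {change:.1f}%")
```

On these figures, in-division sales fell by roughly 61% and sales to Threatened 13 clubs by roughly 39%.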

Is this due to lack of data to evaluate Championship players? Is the data available but just not being used for recruitment purposes?

Or, is it the more worrying scenario that the data is available, it is being used for recruitment, and top clubs are just choosing to eschew an overpriced and overrated market?

### Where Do Premier League Clubs Now Buy Players From?

Championship transfer revenue is down but the Premier League recently signed another record broadcasting rights deal and is continuing to spend as much as ever. Where is this Premier League money now going? Again, I will focus here on the changes between 2006-08 and 2012-14. As we have already seen, the historically inefficient domestic ‘loop market’ between the Threatened 13 and the Championship has been greatly reduced in value. The leagues that are the greatest beneficiaries of this include Spain, France, Netherlands and Italy.

This cannot just be dismissed as the inevitable consequence of Bosman – this ruling celebrates its 19th birthday this year. In 2012-14 Premier League clubs actually signed players from FEWER different overseas leagues than in 2006-08.

It appears that there could be a trend towards Premier League clubs concentrating recruitment on certain specific leagues around Europe and the World. Several factors could be causing this. Firstly, the leagues which have seen the largest increases are also those grouped in or around the top of the world league ranking seen earlier. Empowered by the new television deal, even Premier League clubs outside the Superior 7 can now shop for players in these leagues. For example, a club like Southampton can now buy players from a club like AS Roma. Secondly, the most successful international side of recent years is Spain, and the most revered club side in world football is FC Barcelona. The unique style of football with which these teams have achieved their success has inevitably led in part to some Premier League clubs trying to replicate this style and to increase their signings from the Spanish domestic leagues.

But it is also interesting to note again that the leagues which have seen the largest increases are also those who have been amongst the first to adopt detailed statistical coverage.

Are we witnessing a more data-driven approach to recruitment making the transfer market more efficient?

Championship clubs currently have next to no market for their “goods” outside of the UK and the Superior 7 are increasingly willing and able to find a more efficient market overseas. There is also evidence of this trend making its way down the ladder to Threatened 13 and Championship clubs. For example Newcastle now makes most of its signings from French Ligue One.

Are clubs simply concluding that domestic leagues offer poor value? That is, a high cost player of a quality that, even if you can measure and benchmark it, is inferior?

If the data shortage is a concern for young footballers’ attempts to get scouted, it is of even greater concern to the football clubs who have previously relied on revenue from an inefficient transfer market to survive.

### Unknown Unknowns

(Paraag Marathe, President San Francisco 49ers)

The above quote is my favourite from the recent Sloan Sports Analytics Conference. Without wishing to go over all of the recent pro- versus anti-statistics in sports arguments, I think it is worth remembering that nobody is saying that statistical analysis is the be-all and end-all and the answer to all clubs’ problems. What I believe it can offer is a way to add context to decision making that would otherwise be made on the basis of such things as instinct or experience. Perhaps it can tell you how a midfielder’s attributes compare to similar players elsewhere around Europe? Or perhaps it can provide you with an objective way to draw up a short-list of young talents outside of your own division.

My background is in sports betting where everyone understands that a shift from 48% to 52% could be the difference between winning and losing in the long run. Unfortunately such long term and probabilistic thinking is rarely a luxury afforded to football clubs. Football, and indeed sport in general, is a game of opinions and almost everyone has one. Go to a stadium, watch a match in the pub or follow the game on Twitter and almost everyone has an opinion and everyone is an expert.

I do not know if this is a trait unique to sports but it isn’t often you hear someone admit ‘You know what, I am not sure about that’ or ‘I haven’t really seen that player play much actually’. When looking at the results for my on-line quiz it was noticeable how few responses were given as ‘I don’t know’ or ‘I don’t know enough about that to vote’. It often seems people within sport are afraid to admit they don’t know something. So here goes…

I have basically watched sport for a living for the past 8 years but will happily admit there is a LOT that I still do not know about it.

There, I said it. One way I like to get more information to help me make decisions and form opinions is to use statistics.

“Sports analytics doesn’t take the fun out of sports, it mostly takes the dumb out of sports”

(paraphrasing Edward Tufte, Sloan Sports Analytics Conference 2014)

My version of the above quote would be something like ‘it mostly takes the bravado out of sports’.

What Is The Way Forward? As we saw in Part One, we may not know for sure exactly what the situation is ‘on the ground’ – clubs and data companies are secretive – but we do have increasing evidence of a trend towards analytical recruitment in football. Data analytics is not ‘taking over‘ but it is an invaluable tool for assisting in decision making processes. The top clubs are doing it and before long everyone else will follow.

It is no longer a choice of whether or not to embrace statistical analytics but WHEN and HOW.

Teams outside of the Superior 7 need to recognise that they operate in a world market now and can no longer rely on the domestic market for transfer revenue. They will need to become more analytically ‘savvy’ and use every new technique at their disposal to compete in this increasingly competitive market.

But who will pay for it?

It is understandably difficult to know details of the funding for data collection and analysis. We do not really know who the largest clients of data companies are (professional clubs? the media? bookmakers?) or how much it costs to provide and get access to all the most detailed data but it is something not every club can afford. Why is it so expensive?

To listen to some speak, you would believe the data companies are evil gatekeepers holding all the data for themselves in their ivory towers and charging a king’s ransom to anyone and everyone for the privilege of using it.

Yet I have seen first-hand the intensive process Opta undertakes to fully code just a single game – I am sure similar processes exist at Prozone – and obviously data companies cannot provide this service for free. My understanding of the current system is that clubs are responsible for their own relationships with data providers – they are individual clients and have to pay for the breadth and depth of service that meets their own needs. This ad-hoc system is in contrast to how much of the same information is provided in the major American sports. For example, the NBA recently agreed a deal to install optical tracking cameras for every team and also to make the data available to the public. In soccer, the MLS has a league-wide relationship with Opta which has been considered a great success both on and off the pitch.

A hot topic in UK football at present is the perceived poor performance of our national teams and the relatively limited opportunities given to young British players at Premier League clubs. The Football Association are currently commissioning their own investigation into this and only this week the Times newspaper is running a series entitled The Good of the Game.

What would happen if a governing body such as the FA, SFA or Football League decided to invest in data analytics for the benefit of every club?

I do not know what the cost of this might be – this could be an impractical non-starter – but investment does not have to be purely financial. If clubs’ own analysts do not have the time or skills to deal with the newly available ‘big data’ then could this work be centralised and centrally funded? I am sure there are lots of people with the necessary skills out there who are only too willing to help as this article makes clear. If statistics and video coverage make their way to all of the domestic leagues will we necessarily see a recovery in transfer revenue in those leagues? We do not know. Will it just confirm the suspicion that clubs have been overpaying for players in these divisions for years? Possibly.

Only time will tell but if you are a club and you don’t adapt to these new market conditions your future could be difficult.

Or if not, and you are a young player at one of those clubs, it might be time to check you have a valid passport.

NOTES [Note 1: TransferMarkt historic values for transfer fees are inflated to reflect current market prices. At present I have not received a response from them to confirm the exact method for doing this. It is assumed that this is consistent across world leagues] [Note 2: To account for the demotion of Glasgow Rangers in 2012 the combined revenue from all Scottish divisions is included throughout] [Note 3: Sales by promoted and relegated clubs are counted for the division they were playing in the previous season. E.g. Wigan sold James McCarthy to Everton when they were officially a Championship club (summer 2013) but because Wigan was relegated the previous season this is counted as a Premier League to Premier League transfer (i.e. it is assumed Everton made the signing on the basis of his performances in the Premier League the previous season). At the other end, Dwight Gayle’s transfer from Peterborough to Crystal Palace is considered Championship to Premier League despite Peterborough’s relegation to League One.]

## Has Britain Got Talent? Is A Lack Of Data Holding Back British Football Clubs?

In the first of a two-part series Oliver Page investigates what data and analysis services are available to British clubs, how they are currently utilising them for recruitment, and what effect this might be having on the UK transfer market.

“Economic efficiency is likely to be greatest when information is comprehensive, accurate and cheaply available.”

(The Economist, A-Z of Economics, online 2014)

At the OptaPro Analytics Forum a recurring question was “For which leagues is it possible to reproduce the work being discussed?” Almost all of the presentations utilised data from the Opta f24 data feed for the English Premier League. This data includes descriptions of every on-ball action and corresponding x-y coordinates to determine location around the pitch (as seen on websites such as Squawka and Statszone). My personal interest has always been greatest in domestic football outside of the Premier League so I was disappointed to learn that Opta do not currently provide this level of data for the Scottish Premier League or English League One and League Two. The only data that is available is ‘headline stats’ such as goals scored and assists. The full detail dataset for the English Championship only became available midway through the current 2013/14 season.

My dream of becoming the lower league Billy Beane is temporarily on hold.

More importantly, however, this raises the question of whether clubs are able to scout and compare players in these divisions as accurately as elsewhere.

What effect is this lack of data having on the recruitment of players at clubs playing in these ‘black hole’ divisions?

What Exactly Is The Problem?

The website Transfermarkt keeps a comprehensive record of every major transfer and its value. Historic values are adjusted to reflect current market prices [see note 1 at end of article]. The chart and table below show transfer revenue generated by the English Championship, League One, League Two and Scottish Leagues over the previous 8 seasons. [see notes 2 & 3 for further notes on methodology] We can see that transfer revenue generated by these divisions peaked in the 2007/08 season, after which the general trend has been one of steady decline. Scottish leagues did see an increase in 13/14 although this is almost entirely due to sales by Celtic alone. The 07/08 peak coincided with the start of a lucrative new broadcasting rights deal for the Premier League. The EPL recently signed another record rights deal meaning there is now more money than ever in the top division, yet this time it has not been matched by a similar boom in domestic transfers.

For these 4 divisions 2013/14 represents the lowest combined total transfer revenue since 2004/05.

Why does this matter? Recruitment is often only considered in terms of teams trying to sign players, yet for all but a minority of clubs their ability to sell players efficiently is equally crucial to their financial performance. Clubs outside the English Premier League are concerned about the current financial climate and the potential impact of new Financial Fair Play rules. Some are even worried enough to be considering legal action. Infostrada Sports’ Head of Analysis Simon Gleave recently coined the terms Superior 7 and Threatened 13 to describe the two-tier structure that is often said to exist within the Premier League. The increasingly powerful Superior 7 teams are considered to be Manchester United, Manchester City, Arsenal, Chelsea, Liverpool, Tottenham and Everton.

This current 2013/14 season, there has not been a single permanent signing by a ‘Superior 7’ club of a player who played the previous season in the Championship, League One, League Two or the Scottish Premier League.

There’s Plenty Of Fish In The Sea

“…99% of player recruitment is who you don’t buy.”

(Mike Forde, former Director of Football Operations at Chelsea FC)

The 1995 Bosman ruling and the collective economic power of the top European clubs means there is now a vast pool of footballers from across the world to be evaluated every transfer window. There is only a small fraction of the world’s footballers that can be considered ‘off the market’. Michael Calvin’s book ‘The Nowhere Men’ describes how many clubs are finding that traditional scouting methods alone no longer meet their recruitment needs. One way clubs can improve their recruitment process is through detailed analysis of players’ statistical performance. An increasing number of teams are now using statistics to objectively compare players they may wish to sign. Statistical analysis can also help them to assign reasonable transfer values to potential targets.

There has been much criticism of ‘meaningless stats’ and their potential lack of context recently; however, full and thorough analysis done well can actually ADD context to decision making.

Go Compare

Example 1 – Goal Scorers: Just knowing how many goals a striker scores does not necessarily tell us how good a finisher he is. For example, what is the quality of the chances he is being presented with by his teammates? Colin Trainor is leading much of the work in this area.

Was it really not possible to produce chance quality analyses for these goal scorers?

Gary Hooper (63 goals in 3 seasons in the SPL)

Jordan Rhodes (70 goals in 3 seasons in L1)

Example 2 – Goal Keepers: Paul Riley uses shot location information for the purpose of rating goalkeeper performance.

How can teams objectively evaluate goalkeeper performance without full data coverage in these leagues?

Fraser Forster (14 goals conceded in 28 SPL games)

Wayne Hennessey (£3m signing from League One)

Example 3 – Midfielders: Some of the most criticised statistics in football are those related to passing (otherwise known as the Leon Britton effect). Marek Kwiatkowski’s work attempts to address this by comparing the passing (field position, length, angle and volume) of central midfielders.

If we know that two players are attempting similar types of passes each game then suddenly a stat such as pass completion percentage does become meaningful.

Liam Bridcutt (121 Championship appearances 2010 to 2014)

Is it merely a coincidence that a Championship midfielder known primarily for his passing was only signed for a Premier League team by one of his former managers?

How many other young players in these leagues might be going unnoticed because of a lack of data coverage?

Even when players are noticed, are Premier League clubs increasingly reluctant to ‘pull the trigger’ on these signings?

Will Hughes (18, Derby County)

Thomas Ince (only on loan at Crystal Palace)

How is Data Actually Being Used by Clubs?

I wanted to find out more about what data is available and how clubs are actually using it ‘on the ground’. I spoke to several representatives from football clubs at all levels and also directly to data companies. Opta has been producing their full level f24 data for the Premier League for 12 seasons. Full details of which leagues they cover can be seen here but highlights include:

• Germany Bundesliga (9 seasons)
• Italy Serie A (9 seasons)
• Spain La Liga (8 seasons)
• France Ligue One (8 seasons)
• UEFA Champions League (8 seasons)
• Russian Premier League (5 seasons)
• Dutch Eredivisie (4 seasons)
• Portuguese Primeira Liga (4 seasons – 75% coverage)

Leagues that have been added for the 2013/14 season include the Championship, Brazilian Serie A and Argentinian Primera. The level of coverage is expanding every year but, as mentioned earlier, there is not currently the level of demand to cover League One, League Two and Scotland in this detail. Opta is not the only data company however and pure number-crunching is not the only modern technique available. For example, data can now be used in conjunction with tailor-made video analysis via services such as Opta’s VideoHub Elite and Wyscout. One of the first companies in this field was Prozone who now provide both video and data services to over 300 professional clubs worldwide.

Statistical analysis and worldwide video scouting can together be considered the modern developments in football player recruitment.

How many clubs are fully embracing these new developments? At what levels are these methods most prevalent? Is it even possible to use these methods the further down the league structure you go? Every club representative I spoke to was interested in the potential impact of incorporating a more data-driven approach to the transfer market. However they also all spoke of the fact that, when it comes to recruitment at least, these techniques are currently in their infancy e.g. “Most teams do minimal stats and rely on traditional scouting and manager input. Only a few teams do what I would call proper statistical analysis of potential transfers”. From this correspondence it seemed initially that the data is not currently available outside the EPL. For example, “I see the work that appears on StatsBomb and it’s very interesting, however I could not replicate it in terms of the League as I don’t have the data to do so” and “From what I know of this XY data is simply not available – I would very much doubt it is even collected by the clubs themselves.” However, when I spoke to representatives from Prozone I was informed that a lot of this data is being recorded. Opta is really the only company to make any data available publicly via the media whereas Prozone and Wyscout’s business models focus on the professional game and provide more bespoke services tailored to the needs of individual clubs as clients.

The reality is that some clubs have access to data that the public and other clubs do not.

Prozone records what they term technical data of 2,500 on-ball events per match. They currently record this for all of the Premier League and Championship and much of League One. Paul Boanas, Senior Account Manager, told me “We work with 23 of the 24 Championship clubs and 10 of the clubs in League One…they all get every touch of the ball in their games, with the vast majority of championship clubs having access to all touches of the ball for all 552 games in their division”. In the Championship this coverage has been in place since 2007 and for League One the last 4 to 5 years. Having increased in coverage over the past 10 years Prozone now provide this level of service in 25 leagues worldwide. In addition, all Premier League clubs, 17 Championship clubs and 2 League One clubs now have a fixed-position camera system in place to record player movements!
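The figure of 552 games in the quote is consistent with a 24-club division in which every club plays every other club home and away. A minimal sketch of that count, assuming a plain double round-robin and ignoring play-off matches (the function name is hypothetical):

```python
def league_fixtures(n_clubs: int) -> int:
    """Number of matches in a season where each club hosts every other club once."""
    if n_clubs < 2:
        raise ValueError("a league needs at least two clubs")
    # Each of the n clubs hosts the other (n - 1) clubs exactly once.
    return n_clubs * (n_clubs - 1)


if __name__ == "__main__":
    print(league_fixtures(24))  # 24-club Championship: 552 matches
```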

This presents the even more interesting scenario that the data is out there but it is just not yet being fully utilised by all but a small minority of clubs.

What could be causing this dichotomy? One reason is simply the cost and practicalities of fully using data analytics at a professional football club. The data may be available, but as one person told me “At best a lot of these clubs have 1 full time analyst who is covering everything at first team level” and “Unfortunately football is a business…clubs will feel reluctant to pay for services that they do not appreciate as yet.” Also, much of the focus is understandably placed on analysing one’s own team’s performance (post-match and physical analysis) and also preparing for forthcoming opponents (pre-match). One analyst told me “Things like shot locations for example, I can do this for my club as I have access to each game, but I don’t know how it compares to the rest of the league.” I know from my own experience preparing for the OptaPro Forum that it can take a lot of time and programming skill to get these vast datasets into something approaching a workable format for analysis. The week-to-week practicalities of the football schedule unfortunately do not allow for this kind of time investment.

It is for exactly this reason that Colin Trainor this week put forward an offer of assistance to clubs from all of us at StatsBomb.

Ultimately we cannot be sure exactly what data individual clubs have available and how they are using it – clubs are secretive and if they find a competitive edge will want to hold it for as long as possible. What is clear, however, is that data companies are expanding their coverage all the time and certain clubs are the early adopters embracing new analytic methods. Although not exclusively, this innovation appears to be starting with some of the richest clubs (e.g. Manchester City, Chelsea, Liverpool) but the inevitable trend is that it will feed its way down the football food chain.

What effect might this trend have on teams further down the league ladder? Has this process already started? And can we already see its impact?

In Part Two I will look at how the developing use of data analysis might be impacting the domestic transfer market. Why has transfer revenue outside of the Premier League reduced? How do domestic leagues compare to similar standard leagues worldwide? Where do Premier League clubs now buy their players from and is the transfer market becoming more efficient?

NOTES [Note 1: TransferMarkt historic values for transfer fees are inflated to reflect current market prices. At present I have not received a response from them to confirm the exact method for doing this. It is assumed that this is consistent across world leagues] [Note 2: To account for the demotion of Glasgow Rangers in 2012 the combined revenue from all Scottish divisions is included throughout] [Note 3: Sales by promoted and relegated clubs are counted for the division they were playing in the previous season. E.g. Wigan sold James McCarthy to Everton when they were officially a Championship club (summer 2013) but because Wigan was relegated the previous season this is counted as a Premier League to Premier League transfer (i.e. it is assumed Everton made the signing on the basis of his performances in the Premier League the previous season). At the other end, Dwight Gayle’s transfer from Peterborough to Crystal Palace is considered Championship to Premier League despite Peterborough’s relegation to League One.]

## M******** (a.k.a “The Scottish Play”)


“Sometimes what doesn’t happen can be just as important as what does”

(Chris Anderson, The Footballers’ Football Show, 2013)

Last week I had the good fortune of being invited to present at the inaugural OptaPro Analytics Forum. This forum featured many interesting guests from a wide variety of backgrounds, including representatives from professional clubs, getting together to talk about how numbers can be used to enhance our understanding of “the beautiful game”.

When contemplating the event later that evening it dawned on me that there had actually been a rather large elephant in the room. Just like superstitious Shakespearian actors, even though many interesting topics were covered, not a single speaker or audience member mentioned “the M-word”.

Statistics in sport hit the mainstream in 2011 when Brad Pitt starred as former Oakland Athletics’ general manager Billy Beane in “That Movie”. The film tells the story of how a down-on-its-luck, low budget baseball club achieved unprecedented success through the innovative use of statistics. The club’s success is primarily due to Beane (Pitt) and, in the Hollywood version, his hiring of a young stats whiz kid (Jonah Hill). Around the same time “M********” was premiering, John W Henry was settling in to his new role as owner of Liverpool Football Club. Henry is also owner of the Boston Red Sox baseball team, had worked with Beane previously, and made no secret of his intention to try to implement a similar approach to building and running a “soccer” club. One of Henry’s first moves was to appoint Frenchman Damien Comolli as “Director of Football Strategy”. Comolli had spoken before of his admiration for the work of people such as Beane, and was immediately given the remit of managing Liverpool’s recruitment strategy. In the summer of 2011, Comolli oversaw an unprecedented signing spree:

Liverpool’s signings that season included Andy Carroll (£35m), Charlie Adam (£9m), Stewart Downing (£20m) and Jordan Henderson (£16m).

The 2011/12 season was a disaster. Despite winning the League Cup, next to none of the signings paid off. Liverpool finished 8th in the Premier League and both Comolli and manager Kenny Dalglish were sacked soon after. It was not difficult to sense a feeling of satisfaction in many quarters that such a numbers-driven “experiment” had failed.

What struck me from several discussions at the OptaPro Forum was that many people who work in this fledgling analytics community now see it as part of their job to defend statistics in sport and rebuild the appetites of club owners and fans for such work.

Although there have been several positive developments recently – notably Chris Anderson and David Sally publishing The Numbers Game and Sky Sports devoting a whole episode of The Footballers’ Football Show to statistics (3rd December 2013) – there has been an undeniable sense of a stalling, despondency and collective banging of heads against brick walls. Chris Anderson is a well-respected and thoughtful academic but it was perhaps unfortunate that his co-panellists on The Footballers’ Football Show were Sam Allardyce and our friend Mr Comolli. Allardyce speaks often of his admiration for statistics and video analysis, but the consistent lack of aesthetic appeal with which his teams often play is surely part of the reason that the phrase “playing the percentages” comes with such negative connotations. A recent article for When Saturday Comes in the Guardian caused a stir with a scathing attack on the field, including such lines as “these people don’t deserve football”. Only last night journalist and radio DJ Danny Baker tweeted to his 300,000 followers “Surely now we can finally see ‘stats’ are train spotting bullshit”.

Are Allardyce and Comolli really the best advert for football analytics?

“M********” is an adaptation of a 2003 book by Michael Lewis which, for me, is far more interesting than the movie. This book was a catalyst for an explosion of similar work in the US, whose major sports all seem more prepared to embrace such work than soccer. Indeed Henry himself is a confirmed speaker at the forthcoming MIT Sloan Sports Analytics Conference alongside such high profile figures as current Indianapolis Colts quarterback Andrew Luck. However, in this country it is the juxtaposition of the sugar-coated Hollywood story and the failings of Comolli-era Liverpool with which most conversations about statistics in football both begin and end.

It seems the shadow of Charlie Adam looms large in more ways than one.

One analogy that sprang to my mind was the 2011 UK referendum on changing the voting system (bear with me!). I, and many people I know, were advocates of a proportional representation system but I remember being not entirely satisfied with the actual proposal, the Alternative Vote, put forward by the Yes campaign. As such, I am sure many either abstained or actually voted to keep the current system. I personally voted Yes, not particularly because I liked the system for which I was voting – I just recognised that if the campaign was defeated this time the issue would likely not make its way back onto the political agenda in my lifetime.

Just because Liverpool was the first club to experiment with analytics and it failed, do we now write off the entire field of work for a generation?

It was therefore with some trepidation that I made my way to Birkbeck University this week – how would the forum play out? What would be the response from the army of mysterious and secretive “performance analysts” that the clubs had (almost) all sent? I personally envisaged one of two responses – either “Stats in football? Nonsense. What can you do for me? Tell me to sign Andy Carroll?” or maybe the polar opposite “Stats, brilliant, yes we have been doing that for years but we can’t possibly tell anyone, our managers would shoot us”. Fortunately what actually transpired was a series of varied and interesting presentations across a wide variety of topics, with equally stimulating debate both during and after. My personal background is in the betting industry, but there were presentations from people in fields as varied as telecommunications, accountancy and theoretical biology. It was hugely rewarding for “outsiders” such as myself to have the opportunity to share our work. In addition, much of the work was of a level that it was easy to envisage immediate and tangible benefits for clubs resulting from it.

Of particular interest were two presentations of work that is actually currently ongoing at two of the top clubs in the whole of Europe. Pedro Marques (1st team analyst at Manchester City) presented some of his and his colleagues’ work on visualising the nature and frequency of passing networks by forthcoming City opponents. This included not just detail on how they collect and look for patterns in the data, but also actual training ground footage of the coaching staff implementing their findings. Also, representatives from Spanish telecommunications giant Telefonica presented some of the work they have done for none other than FC Barcelona on “Players’ pre- and post-pass movements” – that is, not only measuring what happens when players have the ball, but actually measuring players’ movements when they do not have the ball.

After the event, it was enlightening to talk with representatives sent from Liverpool FC (who were not, incidentally, part of the Comolli regime) about the challenges they faced in trying to rebuild the reputation of analytics at the club in the wake of its previous experience. I think this remains a challenge for the whole industry. A club requires the full and unstinting support of all stakeholders to have any kind of success with data analysis but the vision really has to come right from the top. They spoke positively of the support they continue to receive from John W Henry and, judging by recent performances at least, it seems that finally Henry’s faith could be on its way to being rewarded. To see clubs of the stature of Liverpool FC, Manchester City and FC Barcelona put their weight behind data analytics should be a huge incentive for other clubs to make similar investments – and not just financial investments. Personally, I found the forum immensely energising and I hope to be able to continue to do work in this field in the future. Just don’t mention the M-word.