A toolbox for football analytics

My article “Towards a new kind of analytics”, published on this site several weeks ago, has received a lot of attention, for which I am very grateful. Most feedback, though, was along the lines of “this is all well and good, but how do I go about doing this kind of stuff?” This follow-up is designed to answer that question in a narrow sense, by listing some of the basic computational and statistical tools that I have found indispensable in “proper” analytics work. As I argue in that previous article, the lowest-hanging fruit in football analytics has been picked and basic tools are no longer sufficient. To put it bluntly, the vast majority of analysts active in the public sphere today need to make a qualitative methodological leap forward or lose contact with the state of the art forever. The narrative link list below is divided into three parts: databases, programming and statistics.

*** Databases ***

You must kill everything you love, and that includes Excel, because a well-organised data store is critical to your analytics pipeline. It ensures that you have a single cross-linked source of data that you can curate and keep up to date, and it minimises the time required to extract and prepare datasets for analysis. In practice, this means a relational (SQL) database. SQL is an industry-standard language for extracting (“querying”) data from highly optimised databases. It was designed to be relatively easy to use for people without a programming background, and I would say that this goal has been achieved — I have personally converted several non-technical people into able SQL query-writers. To use SQL you must operate an actual database containing your data. There are two basic questions here: which database variety to use, and where to run it. Starting with the latter, it is certainly possible, and an entirely reasonable initial step, to maintain a database on your personal computer.
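To give a taste of what querying looks like, here is a self-contained example using Python's built-in sqlite3 module; the table layout and data are invented, but a standalone MySQL or PostgreSQL server accepts essentially the same SQL:

```python
import sqlite3

# A minimal sketch of a relational store for match data, kept in memory.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE matches ("
    "id INTEGER PRIMARY KEY, home TEXT, away TEXT, "
    "home_goals INTEGER, away_goals INTEGER)"
)
cur.executemany(
    "INSERT INTO matches (home, away, home_goals, away_goals) VALUES (?, ?, ?, ?)",
    [("Leeds", "Chelsea", 2, 1), ("Chelsea", "Arsenal", 0, 0), ("Arsenal", "Leeds", 3, 1)],
)

# The kind of question SQL answers in a single statement:
# average total goals per match across the stored games.
cur.execute("SELECT AVG(home_goals + away_goals) FROM matches")
avg_goals = cur.fetchone()[0]
```

The point is not this toy query but that the same one-liner keeps working unchanged when the table holds a decade of leagues rather than three rows.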
However, having it in the cloud is the superior option, because you don't need to worry about setup, backups or power failures. Amazon Web Services (AWS) offers a year's worth of free database hosting, so you can start there and after a year either pay to continue or move to another solution. As to which flavour of SQL to use, I recommend MySQL if you have no previous relational database experience, and PostgreSQL otherwise. My impression, having used both, is that PostgreSQL is vastly superior to MySQL in a number of respects, but trickier to tame without experience. If you go down the MySQL route, MySQL Workbench is the leading front-end to MySQL databases.

*** Programming ***

In an ideal world, analytics requires little actual programming: the right data emerges from the database in the right form for a machine learning algorithm to analyse it and report the results to you. Practice makes a cruel mockery of this vision, and as an analyst I spend most of my time programming rather than designing models and analysing results. The three most common programming tasks I have to perform are: loading raw data into databases; transforming data extracted from a database into the form required by statistical software; and validating and reporting the results of my analyses. I suspect that the experience of other analysts is broadly similar. Thus, unless you work in a team or have a programmer accomplice, you need a decent grasp of a programming language to do football analytics well. Two languages are the overwhelming favourites of the community: Python and R. My general advice on which one to choose is this: if you know the basics of one of them, stick with it and put in the effort to learn it to a good standard. If you don't know either, learn R.
Personally, I believe that as a programming language R is an abomination and it really ought to be banned; but it is also the pragmatic choice for football analytics, because of the breadth of statistical methods available, and because two truly game-changing packages due to Hadley Wickham, dplyr and ggplot2, can take care of 95% of your data manipulation and visualisation needs. RStudio is the leading environment for R development. The Advanced R book (guess the author) is in fact not that advanced in the first few chapters and is a decent guide, unless you're truly starting from zero.

If you go down the Python route, install the Anaconda distribution, which pre-packages Python 2.7/3.5 (either is fine) for data science, including the scipy, numpy, matplotlib, statsmodels and scikit-learn add-ons essential for data analysis. PyCharm is a wonderful, feature-rich Python editor. An added benefit of Python is that you can use it to structure and query your SQL database using so-called ORMs: technology that integrates the database with the language so closely that database records appear as variables directly in the code, and you can manipulate them without SQL (I have no doubt that R has ORMs too, but the mere thought makes me shudder). The two leading Python ORMs are Django and SQLAlchemy. I use the former, which is actually a complete framework for database-driven websites, but SQLAlchemy is a perfectly fine choice too; Soccermetrics use it, and you can find plenty of code samples in Howard's repos.

Lastly, whether you end up with R or Python, version control software (VCS) is essential. A VCS lets you move easily between multiple versions of your code, making sure that nothing you write is ever lost, and helps you understand how your code evolved over time and why. There is no better VCS than Git.
If you can afford it, pay GitHub $7/month and they will host your code in private repositories, and you can use their excellent web interface, which adds tons of genuinely useful features on top of Git itself. If you'd rather not pay, Bitbucket will host private repos for free, but the interface is atrocious. The last option is GitLab: it is free and the interface is perfectly decent, but you have to host the code repository yourself. In all cases you will have to learn Git itself, which is a command-line program of considerable complexity, but understanding the basic commands (checkout, status, add, commit, push, pull, branch) takes no more than a day and is all you are going to need. The official Git webpage has plenty of good educational resources.

*** Statistics ***

Perhaps my main complaint about public analytics today is that analysts do not use the proper statistical machinery to tackle their questions. As I have said before, all the progress on the key analytics questions that can be achieved by counting and averaging event data has already been made. Football is complex and football data is noisy, and to derive robust insight, powerful, specialist tools are necessary. Unfortunately, learning advanced statistics on your own is a long and challenging process, and despite having been engaged in it for the past several years, I have only scratched the surface. Perhaps a more efficient way of doing it would be to attend an online course, or follow a general statistics textbook. I can't recommend any particular course, but I can't imagine that a randomly chosen one can do harm. As to textbooks, Cosma Shalizi's draft is very decent, as is Norman Matloff's (Thom Lawrence's find), and they are both free. Gelman et al.'s Bayesian Data Analysis is a comprehensive, advanced treatment of, erm, Bayesian data analysis, and if you google hard enough there is a PDF of that on the web too.
One concrete statistical method that I believe is simple enough to get a grip on very quickly, and that could instantly improve a majority of your analyses, is Generalised Linear Models (GLMs). GLMs generalise linear regression in two ways: first, the expected value of the dependent (predicted) variable no longer has to be a linear function of the predictor variables, but is instead related to one through a so-called link function; and second, the distribution of errors is not necessarily Gaussian. Because of these two areas of added flexibility, many of the common football analytics models fit in the GLM framework. An expected goals model can be a GLM, but so can a score prediction model or a power ranking, and so on. R has the built-in glm function, which allows you to specify GLMs with a single, powerful command, and the payoff is great: you get the coefficients, statistical significance and fit diagnostics all for free.

***

My objective in this article is to enable budding football analysts to build a platform from which they can analyse the game in the most efficient way possible, meaning maximum insight per unit of time spent. Unfortunately, unless you're already half-way there, this entails a heavy initial time investment. In my view it is not just worth it; it is a necessary precondition for serious work. If serious work is what you have in mind, do not hesitate to ask me on Twitter (@statlurker) for further pointers.

Many thanks to Thom Lawrence for his feedback on this article.
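As a small appendix to the GLM recommendation above: the glm call hides a fairly simple fitting loop (iteratively reweighted least squares). The sketch below hand-rolls a logistic-regression GLM on invented shot-distance data; in practice you would just call R's glm or a statistics library, so treat this purely as a peek under the bonnet.

```python
import math

def sigmoid(z):
    # numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

# Invented toy data: shot distance in metres, and whether a goal resulted.
dist = [2, 4, 5, 7, 9, 11, 14, 17, 20, 25, 28, 30]
goal = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

b0, b1 = 0.0, 0.0                      # intercept and distance coefficient
for _ in range(25):                    # Newton (IRLS) iterations
    g0 = g1 = h00 = h01 = h11 = 0.0
    for x, y in zip(dist, goal):
        p = sigmoid(b0 + b1 * x)       # fitted goal probability
        w = p * (1.0 - p)              # IRLS weight
        g0 += y - p                    # gradient of the log-likelihood
        g1 += (y - p) * x
        h00 += w                       # negative Hessian entries
        h01 += w * x
        h11 += w * x * x
    det = h00 * h11 - h01 * h01
    b0 += (h11 * g0 - h01 * g1) / det  # 2x2 Newton step
    b1 += (h00 * g1 - h01 * g0) / det
```

On this data the fitted slope b1 comes out negative: the further out the shot, the lower the goal probability, which is exactly the shape an expected goals model needs and which a plain linear regression of 0/1 outcomes cannot guarantee.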

Towards a new kind of analytics

I have been involved in football analytics for four years and have been doing it for a living since 2014. It has been a wonderful adventure, but there is no denying that the public side of the field has stalled. This is not really a “crisis of analytics” piece or an indictment of the community, though. Instead, I want to point out one critical barrier to further advancement and plot a course around it. In short, I want to argue for a more theoretical, concept-driven approach to football analysis, which is in my opinion overdue.

It is going to be easy to read this short article as a call to basic, as opposed to applied, research, and consequently dismiss the ideas as impractical. Try not to do this. I like applied football analytics and I firmly believe that it has value — even the public variety. But I also believe that we have now reached the point where all the obvious work has been done, and to progress we must take a step back and reassess the field as a whole.

I think about football analytics as a bona fide scientific discipline: the quantitative study of a particular class of complex systems. Put like this, it is not fundamentally different from other sciences like biology or physics or linguistics. It is just much less mature. And in my view we have now reached a point where the entire discipline is held back by a key aspect of this immaturity: the lack of theoretical developments. Established scientific disciplines rely on abstract concepts to organise their discoveries and provide a language in which conjectures can be stated, arguments conducted and findings related to each other. We lack this kind of language for football analytics. We are doing biology without evolution; physics without calculus; linguistics without grammar. As a result, instead of building a coherent and ever-expanding body of knowledge, we collect isolated factoids.
Almost the entire conceptual arsenal that we use today to describe and study football consists of on-the-ball event types; that is to say, it maps directly to raw data. We speak of “tackles” and “aerial duels” and “big chances” without pausing to consider whether they are the appropriate units of analysis. I believe that they are not. That is not to say that the events are not real; but they are merely side effects of the complex and fluid process that is football, and in isolation they carry little information about its true nature. To focus on them is to watch the train passing by looking at the sparks it sets off on the rails.

The only established abstract concept in football analytics at present is expected goals. For good reasons it has become central to the field, a framework in its own right. But because it focuses on the end result (goal probability), all variation without impact on xG is ignored. This focus on the value of a football action or pattern, rather than on its nature, seriously undercuts our understanding of the fundamental principles of the game. Just like isolated on-the-ball events, expected goals tell us next to nothing about the dynamic properties of football.

Indeed, it is the quantitative dynamics of football that remain the biggest as-yet unexplored area of the game. We have very little understanding of how the ball and the players traverse time and space in the course of a game, and how their trajectories and actions coalesce into team dynamics and, eventually, produce team outputs including goals. This gap in knowledge casts real doubt on the entirety of quantitative player analysis: since we do not know how individual player actions fit into the team dynamics, how can we claim to be rating players robustly? And before the obvious objection is raised: these dynamic processes remain unexplored not for the lack of tracking data.
The event data that is widely available nowadays contains plenty of dynamic information, but as long as our vocabulary forces us to consider every event in isolation, we can do no more than glimpse it. Luckily, a newer concept is emerging into view and taking a central place: the possession chain (possession for short). A possession is a sequence of consecutive on-the-ball events during which the ball is under the effective control of a single team. A football game can then be seen as an (ordered) collection of possessions. This is a very positive development, since possessions make much more sense as the fundamental building blocks of the game than events do. This is because they are inherently dynamic: they span time and space. I believe that they should be studied for their own sake, and if you only compute them to figure out who should get partial credit for the shot at the end of one, then in my opinion you are doing analytics wrong, or at least not as well as you could be.

To give an example of such a study and why it is important, consider the question: what makes two possessions similar? To a human brain, trained in pattern recognition over millions of years, it is a relatively easy question. It is, however, a quite difficult basic research task to devise a formal similarity measure, given the disparate nature of the data that makes up a possession (continuous spatial and time coordinates, discrete events, and their ordering). For the sake of argument, assume that we have a measure that we are happy with. It has an immediate, powerful application: we can now measure the similarity of the playing styles of teams and players by measuring the similarity of the possessions in which they are involved. This method is bound to be much more precise than the current, purely output-based methods, and as we know, playing style similarity has a wealth of applications in tactics and scouting. But that's not the end of the story.
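For concreteness, here is one deliberately naive candidate measure, written as a Python sketch: it reduces a possession to its ordered sequence of event types (all names invented) and uses normalised edit distance. It throws away exactly the spatial and temporal information that a serious measure would have to incorporate, which is the basic-research gap described above.

```python
def edit_distance(a, b):
    # classic Levenshtein distance via dynamic programming
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,            # deletion
                            curr[j - 1] + 1,        # insertion
                            prev[j - 1] + (x != y)  # substitution
                            ))
        prev = curr
    return prev[-1]

def possession_similarity(p, q):
    # 1.0 = identical event sequences, 0.0 = nothing in common
    return 1 - edit_distance(p, q) / max(len(p), len(q))

# Two wing moves that differ only in how the ball was won, and a long ball.
wing_move   = ["recovery", "pass", "pass", "cross", "shot"]
wing_move_2 = ["tackle",   "pass", "pass", "cross", "shot"]
long_ball   = ["goal_kick", "header", "shot"]
```

Under this measure the two wing moves score 0.8 against each other and both score much lower against the long ball, which at least matches intuition; whether such a measure behaves sensibly across millions of real possessions is precisely the open question.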
Our hypothetical measure, having already provided a considerable applied benefit, can now be fed back into basic research. Under a few relatively mild additional assumptions, the measure gives a rich structure to the set of all possible possessions, potentially allowing us to deploy a century of research in general topology and metric spaces to make statements about football. But for all these potential rewards, the subject remains unexplored because of the twin obstacles of inadequate conceptual arsenal and perceived lack of immediate applied benefit. * * * Based on what I sketched above, my suggestion for everyone involved in the field is to be more ambitious, to think more expansively and to not settle for imperfect investigations of lesser questions just because the data seems to be limiting us in this way. It isn’t, at least not all the time. Instead of counting events in more and more sophisticated ways, let us focus on possessions, ask broader questions and interrogate the data in more creative ways than before. I firmly believe that the payoff of this approach will be far greater than anything we have achieved so far.   I want to thank Colin Trainor (@colintrainor), Ted Knutson (@mixedknuts), James Yorke (@jair1970) and especially Thom Lawrence (@deepxg) for their feedback on earlier versions of this article.

3’781 ways to score a goal

Is every goal unique? Instinct says yes, but one needs only to remember Steven Gerrard to realise that it is possible to make a fine career out of scoring the same three goals over and over. The truth sits somewhere in the boring middle, but it is undeniable that many goals share similarities, including in how they are created. It is this similarity that I set out to explore here. My analysis rests on the notion of possession chains. For every on-the-ball event — such as a goal — it is possible to find the unbroken, ordered sequence of previous events leading to it. This is the idea on which Colin Trainor based his recent article about players' attacking contributions. My definition of a chain is likely different from his, and too technical to give in full here, but the broad outline is as follows:

  • I only look at chains terminating in a goal,
  • The events in the chain are strictly consecutive (ie. no intermediate events are excluded),
  • Only actions by the scoring team belong in a chain,
  • A set piece can only be the first event in the chain (ie. we never look past a set piece),
  • Ditto possession regain events (tackle, interception, recovery).
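A minimal sketch of these rules in Python; the event names and the match snippet are invented, and the real definition has many more cases:

```python
# Events are (team, event_type) pairs in match order.
SET_PIECES = {"corner", "free_kick", "throw_in", "goal_kick", "penalty"}
REGAINS = {"tackle", "interception", "recovery"}

def chain_for_goal(events, goal_index):
    """Walk back from a goal, collecting the scoring team's chain."""
    team = events[goal_index][0]
    chain = [events[goal_index]]
    for i in range(goal_index - 1, -1, -1):
        ev_team, ev_type = events[i]
        if ev_team != team:        # an opposition touch breaks the chain
            break
        chain.append(events[i])
        if ev_type in SET_PIECES or ev_type in REGAINS:
            break                  # these may only open a chain
    return list(reversed(chain))

match = [("A", "corner"), ("A", "pass"), ("A", "shot"), ("B", "clearance"),
         ("A", "recovery"), ("A", "pass"), ("A", "goal")]
chain = chain_for_goal(match, 6)
```

Here the chain is just recovery–pass–goal: the defender's clearance wipes out the corner move that preceded it, which is exactly the strictness discussed below.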

The numerous minor choices I had to make on top of these may mean that the overall definition is so arbitrary that I am unsure how much of what follows is insightful (never mind useful), and how much is just having unwholesome fun with the data. Caveat emptor. The data I looked at was kindly provided by Opta and comprises all games from 2010/11 to 2013/14 in the top divisions of England, Spain, Italy, France and Germany. For every goal I derived the possession chain and grouped identical chains.

It turns out that the 50 most common goals look like this: [table of the 50 most common chains; yep, that's a screenshot.] As you can see, by far the most common goal is scored with the team's first touch in the chain. I think this is partly a testament to the randomness of the game itself, but also to the strictness of the definition of the chain: if a defender manages to get a touch just before an intricate move is about to be crowned with a goal, none of the move will count in the chain. A penalty and a header from a corner complete the top 3.

Another summary of the results is provided in the figure below. Apologies for the terse, but hopefully still unambiguous, codes for individual events. Note that a listed event can occur anywhere in the chain for the goal to be counted, so for example the “head” bar comprises not only the headed goals, but also any goals where there was a headed pass in the move. [Bar chart of event frequencies in goal chains.] It turns out that only 64.5% of goals have a completed pass in the buildup (again, under my restrictive definition of buildup). I was delighted to discover that this agrees nicely with the classic analysis of Reep and Benjamin (“Skill and Chance in Association Football”, J. R. Stat. Soc. 134(4):623-9, 1968, cited here after The Numbers Game), whose number is 60.6%. A quarter of goals involve a cross (but not necessarily as an assist), and about 1 in 19 see a shot saved before the ball goes into the net. Own goals are 3% of the total.
Finally, for the theory-minded, here is the distribution of frequencies of individual chains on a log-log scale. It's tempting to drop some names here (cough Zipf cough), but in truth so many things look linear-ish on a log-log plot that it's best not to. Perhaps if and when the definition of the chain is made more robust, the distribution plot will become more interesting. [Figure: frequency distribution of individual chains, log-log scale.]

All data provided by Opta.

How low can you go: Assorted thoughts about crosses

There is a pass that David Silva and Mesut Ozil, the Premier League's outstanding playmakers, are very fond of. Standing just inside the opposition penalty area close to a corner, with options inside and outside the box, they slip the ball instead to the overlapping fullback, who crosses it in. Why do they do this if, as the common wisdom has it, crossing is a low-percentage play? The obvious answer is that not all crosses are equal, and with good setup play a cross is a dangerous weapon. I think this is particularly true of short, low crosses, precisely the kind Ozil and Silva encourage. I set out to investigate this hypothesis, only to realise that I don't have a clean way of separating low and high crosses in my database. What follows are two simple analyses trying to work around this problem.

Completion and conversion

Two definitions: I consider a cross completed if the next on-the-ball action is performed by a player from the crossing team. A cross is converted if the crossing team scores within 5 seconds of the cross. This is an arbitrary window, but it should catch all the goals which are “due” to the cross to a significant degree, including own goals, rebounds and goals from brief goal-line scrambles. Note that conversion and completion are independent in this formulation: a completed cross may or may not be converted, and a converted cross needn't have been completed. I looked at the last four full seasons of the five big European leagues and only considered locations where I have more than 1000 attempted crosses. Unless indicated otherwise, only open-play crosses were considered. [Maps of cross conversion and completion by pitch location.]

As expected, the premium crossing area is on the edge of the penalty area and inside it (shall we call it the Zabaleta Zone?), where around 5% of crosses are converted. If this sounds low, consider that the average cross conversion rate is just 1.76%. What was a bit of a surprise to me is that it seems to be easier to complete a cross from the wide areas, farther away from the box. I suspect this is because with a short cross the area is on average more crowded, and who takes the next touch becomes more random. Average completion across all areas is 23.58%.
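In code, the two definitions look like this; a Python sketch over an invented, timestamped event stream, not the actual pipeline behind the maps:

```python
# Events are (seconds, team, event_type) tuples in match order.
events = [
    (100.0, "A", "cross"), (101.2, "A", "header"), (101.9, "A", "goal"),
    (240.0, "A", "cross"), (241.0, "B", "clearance"),
    (300.0, "B", "cross"), (301.5, "B", "shot"), (330.0, "B", "goal"),
]

def cross_outcomes(events):
    """For each cross, record whether it was completed and converted."""
    completed, converted = [], []
    for i, (t, team, typ) in enumerate(events):
        if typ != "cross":
            continue
        # completed: the very next on-the-ball action is by the same team
        completed.append(i + 1 < len(events) and events[i + 1][1] == team)
        # converted: the crossing team scores within 5 seconds
        converted.append(any(t2 - t <= 5 and tm == team and ty == "goal"
                             for t2, tm, ty in events[i + 1:]))
    return completed, converted

completed, converted = cross_outcomes(events)
```

The third cross illustrates the independence of the two notions: it is completed (the crosser's teammate shoots) but not converted, because the goal arrives 30 seconds later, outside the window.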

Crossing and success

There is a weak positive relationship between success (measured in points per game) and the proportion of cross-assisted shots that aren't headers. (Here “assist” is taken in the strict sense, and not in the sense of conversion defined above.) The correlation coefficient for the attached graphic is 0.43, which drops to 0.32 when Manchester City, the ultimate low-crossing team, are removed. Interestingly, this relationship doesn't exist in the Bundesliga and Ligue 1. [Scatter plots for the EPL.] I don't want to speculate on the nature of this relationship beyond what Devin Pleuler said on Twitter on Monday: https://twitter.com/devinpleuler/status/519196963129294850 https://twitter.com/devinpleuler/status/519197095887380481 That is, a preference for low crossing will come naturally to better teams. Success and reliance on crosses are inversely related: the higher the proportion of a team's shots that come from crosses, the lower its points per game. The strength of this relationship is similar to the previous one, and once again the effect is not found in Germany or France.
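A quick illustration of why removing a single team matters for correlations of this size: Pearson's r computed from scratch on invented numbers, with and without one extreme Manchester City-like point.

```python
import math

def pearson(xs, ys):
    # Pearson correlation coefficient, from the definition
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented league: six ordinary teams plus one extreme outlier (last entry).
low_cross_share = [0.30, 0.35, 0.32, 0.40, 0.38, 0.36, 0.70]
points_per_game = [1.1,  1.3,  1.2,  1.5,  1.4,  1.0,  2.4]

r_all = pearson(low_cross_share, points_per_game)
r_trimmed = pearson(low_cross_share[:-1], points_per_game[:-1])
```

With the outlier in, the correlation is far stronger than without it: a single team sitting alone in the corner of a scatter plot can carry much of an r of this magnitude, which is why reporting both values, as above, is good practice.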


There can be no firm conclusions until I find a way of separating low crosses from the rest. However, it does appear that not all crosses are equal, and that a team that relies heavily on crosses for chance creation should make sure they know what they’re doing.   Data provided by Opta.

Picking the optimal Colombian XI for the World Cup

This article is part of the Goalimpact World Cup series. The Colombian XI was picked by Bobby Gardiner. Bobby regularly writes about a variety of football topics on his own blog and other outlets. To read more from him, follow him on Twitter at @BobbyGardiner or have a look at www.falseix.com. The eminently skippable subjective introduction to player ratings and Goalimpact is by Marek Kwiatkowski (@statlurker).

Player analysis is a big deal, given the sums spent in transfer fees and player wages. Accurate measurement of players' actions is now possible in a number of areas thanks to the detailed data collected by companies like Prozone, Opta and Infostrada. But individual output is at best a proxy for performance, and the same has to be true of any player rating built on top of individual action counts. Often there will be players with excellent output who are clearly less valuable than some of their peers with lower output. You could call this effect the Podolski Paradox, only it is anything but a paradox: it is a logical, mundane consequence of the fact that at the low level, football is not bean counting but a complex, non-linear and, above all, densely interlinked dynamical system, where individual events take place in a rich context.

Consider for a moment the steps necessary to turn actions into a rating: a single-number performance score for individual players. First, you need to select the relevant actions, and these will be different for different player positions. (Incidentally, “position” is not a very well defined concept at all, but we plough on.) Now you need to weight the actions: how many tackles is a key pass worth for a full-back? How about for a forward? Hmmmm. But let's go on; say we have scored all actions separately. Now we need to normalise these scores for various factors such as time spent on the pitch (easy-peasy), quality of opposition (feasible) and the opportunity to perform every kind of action (errrrrr).
But say we’ve managed to do that, so now it’s time to combine the scores into a single rating. Is it a straight sum, or at least a linear combination? No it isn’t, and so on. The multi-dimensionality and complexity of the game bites you in the ass as soon as you begin and doesn’t give up until you do. In other words, what we need before we can build a robust player rating system based on individual actions is nothing less than a complete theory of football: a set of axioms that would allow us to put a precise value on every action in every context. We often — always — proceed as if such theory were not a prerequisite for bottom-up player evaluation, and as long as we do ad-hoc comparisons and the signal is strong we can get away with it, too; whatever that elusive ultimate theory is, it contains rules like “more goals is better” and “give-aways are bad”. But a comprehensive player rating system developed on such a flimsy basis is guaranteed to be bad — witness the WhoScored ratings, whose chuckle:insight ratio is somewhere in high single digits. Lastly, even if we ever arrived at an apparently robust action-based rating and tied it to real-world outcomes such as wages, chances are it would be gamed by players before you could say “Hang on, Mr Mendes”. But once we free ourselves from thinking in terms of individual actions, a new perspective opens.  Football is a team sport, as the cliché goes, so how about giving equal credit for every goal scored (and equal blame for every goal conceded) to all players on the team, regardless of who scored it, who assisted it, who intercepted the ball for the move and who made the decoy run? This simple, elegant and fair approach is the basis of top-down player ratings, chief among them the Goalimpact, developed by Jörg Seidel. Goalimpact has its weaknesses, but in my opinion is still miles ahead of any other systematic, public player rating scheme. 
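The core accounting behind any top-down rating fits in a few lines. The sketch below, with invented player names and segments, shares each goal scored and conceded equally among the players on the pitch and expresses the result per 90 minutes; Goalimpact layers age curves, opponent strength and regression to the mean on top of something like this, so this is the idea, not the actual formula.

```python
from collections import defaultdict

def plus_minus(segments):
    """segments: (minutes, players_on_pitch, goals_for, goals_against)."""
    diff = defaultdict(float)     # goal difference credited to each player
    minutes = defaultdict(float)  # minutes played by each player
    for mins, players, gf, ga in segments:
        for p in players:
            diff[p] += gf - ga
            minutes[p] += mins
    # express as goal difference per 90 minutes on the pitch
    return {p: 90 * diff[p] / minutes[p] for p in diff}

# Invented match split into two segments by a substitution.
segments = [
    (60, {"Lahm", "Messi", "Podolski"}, 2, 0),
    (30, {"Lahm", "Messi", "Xavi"},     0, 1),
]
ratings = plus_minus(segments)
```

Note how the scheme is blind to who actually scored: Podolski gets full credit for the first-half goals simply for being on the pitch, which is both the strength (no action-weighting needed) and the weakness (credit dilution) discussed in the text.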
To give you a flavour of Goalimpact, the most recent update (February 2014) has Ronaldo, Lahm, Fabregas, Schweinsteiger and Messi, in that order, as the five best players in the world. I wouldn't be a lapsing academic if I didn't use this list as a starting point for a quick overview of the strengths and weaknesses of the model. Ronaldo makes perfect sense and provides basic validation of the model. Lahm is a fantastic pick, highlighting the core strength of top-down ratings: independence from individual output, the scoring of which is as a rule more difficult (or at least less settled) for defenders. By way of strengthening this point: Lahm is not in the top 50 players according to WhoScored, who instead have Wolfsburg's attacking left-back Ricardo Rodriguez as the 5th best player in the world (good player; no further comment). Messi in 5th seems low, but perhaps it's a sign that his otherworldly performances over the years should be partly credited to his excellent Barcelona teammates? Schweinsteiger, on the other hand, looks too high, but maybe he really is responsible for Bayern's runaway success — or maybe his position is a signal of a bias in the GI formula as yet unidentified (by me). That leaves Fabregas, who can be a poster boy for Goalimpact's major weakness: a standout player who remains at a level lower than the one to which he belongs will be overvalued. This refers to Fabregas' time at Arsenal, and we found a similar issue with the Colombians: the Goalimpact of those who made it to the big European leagues is often lower than that of the best players in the Colombian league. Over to Bobby. * * *

Thanks to Jörg Seidel of GoalImpact for the data. If you’re wondering what GoalImpact is/entails, check it out here – http://www.goalimpact.com/p/blog-page.html (or, indeed, scroll up –ed).

After a sixteen-year wait, Colombia are returning to the World Cup Finals. Inspired by a ‘golden generation’, Los Cafeteros have risen to 5th in FIFA's World Rankings and given their notoriously passionate fan-base cause for genuine hope. Their first obstacle is escaping unscathed from what is possibly the most even of the tournament's groups – Greece, Côte D'Ivoire and Japan join them in Group C.

The XI crafted from GoalImpact scores alone:


This particular setup is quite far off a likely XI, probably due to a combination of old defenders (Perea, Yepes and Mosquera are all in their 30s and in Pekerman's squad) and a lot of players in the Colombian League. That isn't to criticise GoalImpact as a measurement, though, as we all know that players aren't always picked or not picked based on ability and/or output alone (ask Samir Nasri or Carlos Tevez). Context is needed, and so I'll somewhat systematically rifle through each area of the team:


In between the posts, David Ospina of Nice is likely to start. His GI of 92.6 is quite noticeably lower than the 123.67 of Faryd Mondragon, but he is 17 years younger and fast establishing himself as Colombia’s first choice keeper. They’ll be joined in the squad by the uncapped Camilo Vargas (102.78).

Oscar Murillo (110.46) and Alejandro Bernal (110.78), both of Atletico Nacional in Colombia, failed to make Pekerman’s initial 30 man squad and so are out of contention.

Pablo Armero (116.39), recently on loan at West Ham, is likely to start at LB while PSV’s Santiago Arias (94.5) should take up the other full back spot. The young right back may have been capped just 4 times by his country but he is only 22 and has been linked with the likes of Manchester United recently.

I’m not sure what the oldest centre back pairing at a WC is, but if Luis Perea (93.92) and Mario Yepes (40.97) start together as they did in Colombia’s most recent friendly against Tunisia, their combined 72 years may just break that record. Although their GIs (especially Yepes’) are very low, their peak GoalImpact scores are 123.67 and 138.33 respectively and so an experience vs current output trade-off may be Pekerman’s thinking here. I would personally start AC Milan’s Cristian Zapata (102.17) over Yepes. At 27, he’s not far off his peak but still possesses the necessary experience for the occasion.


The general rule of thumb with the Colombian team is that the more you push into the attack, the higher the quality of the players. Sadly, Diego Arias (110.86) will not play a part in Rio – like his aforementioned Atletico Nacional teammates, he failed to make the provisional squad.

If you didn’t know who Fredy Guarin (115.04) is, you probably did come January after a frankly confusing transfer fiasco with his club Internazionale and a whole host of English and Italian clubs. In the end the talented all-rounder stayed and he is extremely likely to start in Rio. One of Abel Aguilar of Toulouse (94.08), Elche’s Carlos Sanchez (92.35) or Monarcas’ Aldo Ramirez (93.6) will probably start next to him at centre-mid. I’d go with Aguilar myself. In terms of GI, there’s almost no difference between the three, but the first two are a tad more defensively minded than Ramirez while Aguilar edges it over Sanchez because of his ability to (albeit occasionally) score.

James (pronounced Ha-Mez) Rodriguez (122.87) is one of my favourite players in the world. At 22, he is quickly becoming the perfect combination of dangerous pace and brilliant creativity, managing an extremely healthy NPGA90 of 0.65 this season at Monaco. A lot of attention directed towards the Colombian team will focus on his Monaco teammate, but keep an eye on the man likely to start as a winger but equally adept in a 10 role. On the opposite flank, Juan Cuadrado (106.87) is my choice. An extremely important part of Fiorentina's attack this season, the skilful winger has improved tremendously in terms of efficiency, with a NPGA90 of 0.52. If Pekerman wants a more central attacking midfielder, Macnelly Torres (116.13) is almost a certainty for the squad. Although now plying his trade at Al-Shabab in Qatar, the quirkily named creator is renowned and feared in South America for his trickery.


To be honest, the shape of Colombia's attack is entirely dependent upon the fitness of one man – Radamel Falcao, whose GoalImpact score is the highest of any of the squad at 131.46. Any 'golden generation' comments or 'dark horse' bets are almost entirely focused on him. Seen by many as one of the best strikers in the world, his race to fitness has encouraged Pekerman enough to include him in the provisional squad and we all hope that he makes it.

The thing is… it might not be THAT big of a deal if he doesn't. Before I'm thrown into some kind of metaphorical football taboo dungeon, let's have a look at Colombia's other striker options. Jackson Martinez of Porto (119.5), soon-to-be Dortmund's Adrian Ramos (109.4), Sevilla's Carlos Bacca (112.46), Luis Muriel of Udinese (105.08) and Cagliari's Victor Ibarbo (99.52) have all made the provisional squad. That's an incredible number of good-quality strikers, and Ramos and Bacca especially are coming off the back of very strong seasons with their club sides. As for which of those start if Falcao is out, I'd go with Martinez and/or Ramos depending on the formation. Both are adept target men, but deceptively good with their feet, and I think they'd provide the best outlet for James.

My XI, combining context and GI scores:


Formation-wise, it's quite difficult to predict what Pekerman will do. A 4222-type set-up helped him through the qualifiers, but he was equally reliant on a more defensive 4231 away from home. Against the offensive prowess of Côte d'Ivoire and Japan, the latter might be the better option.

Muriel has been regularly used by club and country as a winger and putting him there allows James to unleash his creativity more centrally. Obviously, if Falcao is fit, he starts either over Martinez or alongside him (or any other one of their billion strikers) in a 4222. The average GoalImpact score of this particular team is 107.60 which is pretty low, but a lot of these players are either young and having their first few good seasons (Muriel, James, Cuadrado) or old and likely to bow out after the World Cup.

‘16’ is an important number for Los Cafeteros in another sense – the furthest they’ve been in a World Cup was the round of 16 in 1990. Maybe, with or without Falcao, Pekerman’s men will be able to better that this summer.

Distributions of Central Midfield Stats

Part 1: distributions of goalscoring stats.

In the second instalment of my Distributions series (for book & movie rights, contact me at the address below) I look at central midfielders. Considerable versatility is required in this position, and so I will look at a larger number of stats than usual, six in total: passing accuracy, key passes, tackles, interceptions, and resistance to opponents' dribbles and to dispossessions. The choice is arbitrary, and linked to the exclusion from the dataset of #10s, i.e. the advanced central playmakers. Perhaps I'll treat them separately in a future article; for now I didn't want them to dull the distributions of the defensive stats.

The dataset includes all central midfielders from the five big leagues since 2009/10. Different seasons are treated separately, so a player active since 2009/10 contributed five records to the dataset (the current season is included). To be included, a player had to spend at least 900 minutes on the pitch over the course of the season. As was the case with the previous piece, this one contains little actual analysis, that is, the clever things one does with one's dataset. I realise this makes these pieces a bit dry, but we here at StatsBomb think they are important: they are perhaps the first published benchmark for players' statistical output. Like so:
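For readers assembling a similar dataset, the inclusion rule is easy to express in code. A minimal Python sketch with made-up records; the field names are illustrative, not the actual data schema:

```python
# Hypothetical player-season records; each season of a player's
# career is a separate record, so a regular starter since 2009/10
# contributes five rows. Field names are invented for illustration.
player_seasons = [
    {"player": "A", "season": "2012/13", "minutes": 2700},
    {"player": "B", "season": "2012/13", "minutes": 450},
    {"player": "C", "season": "2011/12", "minutes": 900},
]

MIN_MINUTES = 900  # inclusion threshold used in the article

# Keep only player-seasons with at least 900 minutes played.
eligible = [ps for ps in player_seasons if ps["minutes"] >= MIN_MINUTES]
```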



The vertical line in each histogram marks the 80th percentile, that is, the value of the stat beyond which a player outperforms 80% of the field. For each category I have also listed the 10 best player-seasons in the dataset. The reason I plotted Spain-based players' interceptions separately is that we suspect the interception stat is inflated, likely erroneously, for the 2010/11 and (especially) 2011/12 seasons of La Liga.
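The percentile cut-offs marked by those vertical lines can be computed directly. A minimal Python sketch with made-up numbers; the helper uses linear interpolation between closest ranks, which is one of several common percentile conventions:

```python
def percentile(values, pct):
    """Value below which `pct` percent of the data falls,
    with linear interpolation between closest ranks."""
    xs = sorted(values)
    k = (len(xs) - 1) * pct / 100.0
    lo = int(k)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (k - lo)

# Made-up tackles-per-90 figures for a small field of midfielders.
tackles_per90 = [0.9, 1.1, 1.5, 1.8, 2.0, 2.2, 2.4, 2.7, 2.9, 3.0]
cutoff = percentile(tackles_per90, 80)  # the "vertical line" value
```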

One question this graph immediately suggested to me was: how strong are the trade-offs between these six facets of midfield play? Unless you are Arturo Vidal, you're probably not both an elite key passer and an elite tackler, and even Vidal doesn't rank towards the top in, for example, avoiding dispossessions (a rather abject 2.53 per 90 in 2011/12). So how close to the perfect central midfielder can you get, and who would you resemble?

A quick way to find out is to take the relative performance in all six categories (i.e. what % of the field does worse), and average it. The resulting histogram, with the list of the top 30 performers, is below. Note that I didn't correct for the Spanish interception bias, so the 2010/11 and 2011/12 La Liga players had a bit of an upper hand.

As an Arsenal fan, I'm delighted to see Mikel Arteta, my favourite player, in second place. As an analyst, I'm compelled to remind you that the choice of the six stats was arbitrary in the first place, and that averaging percentiles assumes all six are of equal importance.



Somewhat related: comparing central midfielders.

Data collected by Opta.

From Awesome to Average to Awful: Goalscoring Stat Distributions

Thanks to Ted Knutson's work, we know the players with the best attacking output between 2008/09 and 2012/13. (For all the usual reasons I hesitate to say the "best attacking players".) But his litany of superstars (a few surprise names notwithstanding) says nothing about how we should evaluate the performance of more ordinary players. We know Olivier Giroud isn't as good as Messi, but can we quantify that gap? And how does he compare to the rest of the field? In short, we need to see how the basic performance metrics are distributed across all players in the game. This is what I'm setting out to do, in what will hopefully be a series of articles.

Today, I focus on attacking production, i.e. goals and assists. I use two simple metrics that you have probably seen before: non-penalty goals scored per 90 minutes spent on the pitch (NPG90), and non-penalty goals plus assists per 90 minutes (NPG+A90). Ted has already written about the need to discount penalty goals from analyses and the importance of normalisation by time in the article linked above, so I don't have to. Naturally, normalisation for other factors, most notably team and opponent strength, would be nice, but I don't do it since there is no canonical method of doing so. Caveat emptor.

The dataset I used consists of players from the five big European leagues, and spans five full seasons (2008/09 to 2012/13) plus 2013/14 until last weekend. For this article I restricted it to the players who can reasonably be termed "attackers", because I didn't want the low attacking output of defenders and deeper midfielders to overwhelm the distributions. The actual algorithm used to determine whether a player should be counted is rather complex, and I will not describe it here, except to say that it did not rely on the goal and assist numbers and so didn't introduce bias.
I wouldn't expect it to be 100% accurate, but the collection of players considered here should contain most forwards, wingers and purely attacking midfielders from my dataset. Playing 900 minutes or more over the course of a season was also required for inclusion in this study.

The histograms of NPG90 and NPG+A90 are shown below (NPG90 mean: 0.28, std dev: 0.18; NPG+A90 mean: 0.44, std dev: 0.23).

[histograms of NPG90 and NPG+A90]

Now, I am not a statistician, but to my eye both distributions resemble the normal distribution, with the left side thinned out and the left tail chopped off by the boundary. This makes sense intuitively: with the multitude of factors contributing to a player's performance, we'd expect it to be normally distributed; and the missing players in the left half are simply those who are not good enough for a team in Europe's big five leagues, and ply their trade elsewhere.

Another, and perhaps better, way of visualising this data is the cumulative distribution plot:

[cumulative distribution plot of NPG90 and NPG+A90]

Here we can see, for example, that to be in the top 20% of attackers in Europe, a player should score at a rate of at least 0.42 goals per 90 minutes, and have a "goal involvement rate" of 0.59 per 90 (NB: for a top-class #9 these numbers are not enough, as they are biased downwards by all the midfielders in the dataset). We can also see why Arsenal believed in Gervinho, and that Miroslav Klose is not doing badly for a 35-year-old.

With thanks to Ted Knutson for discussions on this subject. Data collected by Opta.
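Both steps above, the per-90 normalisation and reading a threshold off the cumulative distribution, are simple to sketch in Python. The numbers and the nearest-rank quantile convention here are illustrative, not the article's exact method:

```python
def per90(count, minutes):
    """Normalise a counting stat to a per-90-minutes rate."""
    return count * 90.0 / minutes

npg90 = per90(12, 2430)  # e.g. 12 non-penalty goals in 2430 minutes, ~0.44

def threshold_for_top(values, share):
    """Smallest rate that puts a player in the top `share` of the
    field (simple nearest-rank empirical quantile)."""
    xs = sorted(values)
    idx = min(int((1 - share) * len(xs)), len(xs) - 1)
    return xs[idx]

# Made-up NPG90 field; the article reads its 0.42 figure off the real CDF.
npg90_field = [0.10, 0.18, 0.25, 0.31, 0.38, 0.42, 0.47, 0.55, 0.63, 0.80]
top20_cutoff = threshold_for_top(npg90_field, 0.20)
```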