Thanks to Ted Knutson’s work, we know the players with the best attacking output between 2008/09 and 2012/13. (For all the usual reasons I hesitate to say the “best attacking players”.) But his litany of superstars (a few surprise names notwithstanding) says nothing about how we should evaluate the performance of more ordinary players. We know Olivier Giroud isn’t as good as Messi, but can we quantify that gap? And how does he compare to the rest of the field? In short, we need to see how the basic performance metrics are distributed across all players in the game. This is what I’m setting out to do, in what will hopefully be a series of articles. Today, I focus on attacking production, i.e. goals and assists.

I use two simple metrics that you have probably seen before: non-penalty goals scored per 90 minutes spent on the pitch (NPG90), and non-penalty goals plus assists per 90 minutes (NPG+A90). Ted has already written about the need to discount penalty goals from analyses and the importance of normalisation by time in the article linked above, so I don’t have to. Naturally, normalisation for other factors, most notably team and opponent strength, would be nice, but I don’t do it since there is no canonical method of doing so. Caveat emptor.
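For concreteness, the two metrics can be sketched in a few lines of Python. The function and parameter names below are illustrative assumptions, not the code behind this analysis; the only thing taken from the text is the definition itself (non-penalty goals, optionally plus assists, scaled to 90 minutes).

```python
def per90(count, minutes):
    """Scale a raw event count to a rate per 90 minutes played."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return 90.0 * count / minutes


def npg90(goals, penalty_goals, minutes):
    """Non-penalty goals per 90 minutes (NPG90)."""
    return per90(goals - penalty_goals, minutes)


def npg_a90(goals, penalty_goals, assists, minutes):
    """Non-penalty goals plus assists per 90 minutes (NPG+A90)."""
    return per90(goals - penalty_goals + assists, minutes)
```

A hypothetical player with 20 goals (5 of them penalties) and 9 assists in 2700 minutes would come out at 0.5 NPG90 and 0.8 NPG+A90.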

The dataset I used consists of players from the five big European leagues, and spans almost five full seasons (full 2008/09 to 2012/13 and 2013/14 until last weekend). For this article I restricted it to the players who can reasonably be termed “attackers”, because I didn’t want the low attacking output of defenders and deeper midfielders to overwhelm the distributions. The actual algorithm used to determine whether a player should be counted is rather complex, and I will not describe it here, except to say that it did not rely on the goal and assist numbers and so didn’t introduce bias. I wouldn’t expect it to be 100% accurate, but the collection of players considered here should contain most forwards, wingers and purely attacking midfielders from my dataset. Playing 900 minutes or more over the course of a season was also required for inclusion in this study.
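The inclusion rule can be sketched roughly as follows. The record layout and the `is_attacker` flag are hypothetical: the flag stands in for the output of the classification algorithm, which is deliberately not described here, and only the 900-minute threshold comes from the text.

```python
from typing import NamedTuple


class PlayerSeason(NamedTuple):
    """One player's season; a hypothetical record layout for illustration."""
    player: str
    season: str
    minutes: int
    is_attacker: bool  # assumed to be supplied by a separate classifier


MIN_MINUTES = 900  # threshold stated in the article


def eligible(rows):
    """Keep attacker seasons with at least MIN_MINUTES played."""
    return [r for r in rows if r.is_attacker and r.minutes >= MIN_MINUTES]
```

Note that the filter looks only at position and playing time, never at goals or assists, which is what keeps it from biasing the distributions.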

The histograms of NPG90 and NPG+A90 are shown below:
[Figure: histograms of NPG90 and NPG+A90]
(NPG90 mean: 0.28, std dev: 0.18; NPG+A90 mean: 0.44, std dev: 0.23)
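Summary numbers and histogram counts of this kind can be produced with the Python standard library alone. This is a minimal sketch under two assumptions that the article does not state: the standard deviation is the sample (n-1) version, and the bin width of 0.05 is arbitrary.

```python
import statistics
from collections import Counter


def summarise(rates):
    """Return (mean, sample standard deviation) of a list of per-90 rates."""
    return statistics.mean(rates), statistics.stdev(rates)


def histogram(rates, bin_width=0.05):
    """Count rates per bin; bin i covers [i * bin_width, (i + 1) * bin_width)."""
    return Counter(int(r / bin_width) for r in rates)
```

Values that fall almost exactly on a bin edge can land on either side because of floating-point division, which is harmless for a plot but worth knowing when comparing counts.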

Now, I am not a statistician, but to my eye both distributions resemble the normal distribution, albeit with the left side thinned out and the left tail chopped off by the boundary at zero. This makes sense intuitively: with the multitude of factors contributing to a player’s performance we’d expect it to be normally distributed; the missing players in the left half are simply those who are not good enough for a team in Europe’s big 5 leagues and ply their trade elsewhere.

Another, and perhaps better way of visualising this data is the cumulative distribution plot:
[Figure: cumulative distribution plot of NPG90 and NPG+A90]
Here we can see, for example, that to be in the top 20% of attackers in Europe, a player should score at a rate of at least 0.42 goals per 90 minutes and have a “goal involvement rate” of at least 0.59 per 90. (NB: for a top-class #9 these numbers are not enough, since they are biased downwards by all the midfielders in the dataset.) We can also see why Arsenal believed in Gervinho, and that Miroslav Klose is not doing badly for a 35-year-old.
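Reading a cutoff off the cumulative plot amounts to taking a quantile of the observed rates. The sketch below shows one simple way to do it; the function names and the convention of rounding the size of the top group to the nearest whole player are assumptions, and other quantile conventions would give slightly different thresholds.

```python
def empirical_cdf(rates, x):
    """Share of observed rates that are less than or equal to x."""
    return sum(1 for r in rates if r <= x) / len(rates)


def top_share_threshold(rates, top_share=0.20):
    """Lowest rate among the best `top_share` of the observed players."""
    if not rates:
        raise ValueError("rates must not be empty")
    # Size of the top group, at least one player.
    k = max(1, round(top_share * len(rates)))
    return sorted(rates, reverse=True)[k - 1]
```

Applied to the NPG90 values of all attackers in the dataset, `top_share_threshold` with `top_share=0.20` is the kind of calculation that yields a cutoff like the 0.42 quoted above. No assumption about the shape of the distribution is involved.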

With thanks to Ted Knutson for discussions on this subject. Data collected by Opta.

  • Miki

    Fascinating! At the end of the post you mention those numbers would not be enough for a top class #9…So what values do you get when you only count strikers?

    • http://www.mareklab.org Marek Kwiatkowski

      A good question. Annoyingly, the weakest part of this analysis is the most mundane one: who should or should not be counted as an attacker? The same problem appears for strikers, of course. By experimentally tweaking my already experimental classification model, I arrived at 0.50 NPG90 and 0.66 NPG+A90 for forwards (all the usual #9s I think, but also people like Sanchez and Podolski) at the 80% cutoff line, but this is to be taken with a lot of salt.

      • Miki

        Thanks for the reply.

  • Geraint Morgan

    Go look up what a Poisson distribution is. I suspect that would fit the results much better than a normal distribution.
    Normal isn’t the be-all and end-all of statistics.

    • Marek Kwiatkowski

      Thank you for your wonderfully haughty comment.

      • Geraint Morgan

        It was not meant to be taken that way at all.
        Sorry you felt it was said that way. It was genuinely meant to be helpful: statisticians dealing with rare events such as goalscoring can make use of techniques other than just the normal distribution.

        • Marek Kwiatkowski

          Ah, alright, I’m sorry I snapped. Could you explain how you see the Poisson distribution arising for the NPG90 and NPG+A90 rates?

          • Geraint Morgan

            My original comment was that the shape looked like a Poisson distribution to me rather than a normal one.
            There is a wealth of articles on the Internet that look at using the Poisson distribution for betting purposes, and I am definitely not clever enough to explain it as well as they do. I googled “poisson distribution goal scoring” and found a host of articles which I suspect can be applied to the distribution of goals scored by players.

    • Marek Kwiatkowski

      One thing that occurred to me is that variance=mean for Poisson, which suggests that NPG90 can’t be Poisson-distributed. On the other hand, NPG+A90 passes this simple test.

      I think I’ll ask our in-house statistician about this later today.

      • Lukasz

        Interesting read. Thanks.

        I think there might be a bit of confusion here as for which variable we might expect to be, or not to be, Poisson distributed. You would not be very far off if you assumed that the conditional distribution of a player’s goals per game given his scoring ability is Poisson. But this is not what you have plotted in figure 1. What is in there, if I understand it right, is the empirical distribution of what I guess could be an estimate of the scoring ability itself. You could argue that Poisson is a much worse choice (than the normal) for it as it is a discrete distribution to start with! Normal would make more sense if you assumed it on the log scale. (If you want to stick to the natural scale I would try Gamma as a first guess.)

        As for some of the other points, we tried to do a similar analysis of the goalscoring ability in a recently published paper. Some features are quite similar, e.g. discarding penalties, accounting for time on the pitch, position, etc. We have also accounted for the strength of the team mates and the opposition, with some admittedly controversial results :)

        • Marek Kwiatkowski

          You’re quite right: the distribution of the goals scored by individual players in a fixed amount of time may be Poisson, but will likely be independent from the distributions of players’ Poisson parameters (i.e. scoring abilities). I wish I had put it this way myself yesterday.

          As to the rest, well, give us the paper! In the spirit of academic co-operation. We may even cite! Please?

  • Geraint Morgan

    Brevity of reply on phone may have caused odd tone

  • Constantinos Chappas

    Hi Marek,

    The nature of the data (non-negative numbers with a small mean and a long tail) would not suggest a Normal distribution. There are tests that one can use to see if a Normal distribution is a good fit, and I’d be more than happy to have a look if you wish (I’d just need the data points used to create the first plot). In fact, I don’t think it’s Poisson either because for a Poisson fit, you’d need the horizontal axis “Score” to be a discrete variable whereas you have a continuous variable. Perhaps a Gamma distribution or a log-normal, but I don’t think identifying the distribution or assuming that it is Normal would be needed as you are not making any assumptions in your article which demand the presence of a Normal distribution. You are essentially using the empirical distribution (i.e. the one actually observed) to choose the thresholds which is perfectly reasonable given that the sample size is reasonably large. The normality assumption is not relevant here.

    • Marek Kwiatkowski

      Thanks, Constantinos. Yes, you’re right: I don’t need to assume that this distribution is normal. I just like the idea of providing a little bit of mechanistic explanation for such distributions, and that’s what I tried to do here (normal starting point (via CLT) + directed selection process). I won’t insist this is correct.

  • Dan

    How do the two distributions look across the years? Are they broadly similar, or do they differ substantially?
