
My article “Towards a new kind of analytics”, published on this site several weeks ago, has received a lot of attention, for which I am very grateful. Most of the feedback I received, though, was along the lines of “this is all well and good, but how do I go about doing this kind of stuff?” This follow-up is designed to answer that question in a narrow sense, by listing some of the basic computational and statistical tools that I have found indispensable in “proper” analytics work. As I argued in that previous article, the lowest-hanging fruit in football analytics has been picked and basic tools aren’t sufficient anymore. To put things bluntly, the vast majority of analysts active in the public sphere today need to make a qualitative methodological leap forward or lose contact with the state of the art forever.

The narrative link list below is divided into three parts: databases, programming and statistics.

*** Databases ***

You must kill everything you love, and that includes Excel, because a well-organised data store is critical to your analytics pipeline. It ensures that you have a single cross-linked source of data that you can curate and keep up to date, and it minimises the time required to extract and prepare datasets for analysis. In practice, this means a relational (SQL) database. SQL is an industry-standard language for extracting (“querying”) data from highly optimised databases. It was designed to be relatively easy to use by people without a programming background, and I would say that this goal has been achieved — I have personally converted several non-technical people into able SQL query-writers.
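
To give a flavour of what querying looks like in practice, here is a minimal sketch using Python’s built-in sqlite3 module; the matches table and its columns are hypothetical, and with MySQL or PostgreSQL you would connect through the corresponding driver instead of sqlite3.

```python
import sqlite3

# Connect to a local SQLite file; with MySQL/PostgreSQL you would use the
# appropriate driver (e.g. mysql-connector-python or psycopg2) instead.
conn = sqlite3.connect("football.db")

# Hypothetical schema: matches(id, season, home_team, away_team, home_goals, away_goals)
query = """
    SELECT home_team,
           COUNT(*)        AS games,
           AVG(home_goals) AS avg_home_goals
    FROM matches
    WHERE season = '2015/16'
    GROUP BY home_team
    ORDER BY avg_home_goals DESC;
"""

for team, games, avg_goals in conn.execute(query):
    print("{}: {:.2f} goals per home game over {} games".format(team, avg_goals, games))

conn.close()
```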

To use SQL you must operate an actual database containing your data. There are two basic questions here: what database variety to use, and where to run it. Starting with the latter, it is certainly possible and an entirely reasonable initial step to maintain a database on your personal computer. However, having it in the cloud is the superior option, because you don’t need to worry about setup, backups or power failures. Amazon Web Services (AWS) offer a year’s worth of free database hosting, so you can start with them and after a year either pay to continue or move to another solution. As to what flavour of SQL to use, I recommend MySQL if you have no previous relational database experience, and PostgreSQL otherwise. My impression having used both is that PostgreSQL is vastly superior to MySQL in a number of aspects, but trickier to tame without experience. If you go down the MySQL route, the MySQL Workbench is the leading front-end to MySQL databases.

*** Programming ***

In an ideal world, analytics requires little actual programming: the right data emerges from the database in the right form for a machine learning algorithm to analyse it and report the results to you. Practice makes a cruel mockery of this vision, and as an analyst I spend most of my time programming rather than designing models and analysing results. The three most common programming tasks that I have to perform are: loading raw data into databases; transforming the data extracted from a database into the form required by statistical software; and validating and reporting the results of my analyses. I suspect that the experience of other analysts is broadly similar.
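
As an illustration of the second task, here is a minimal sketch in Python (one of the two languages discussed below) that turns a flat, hypothetical extract of shot events into the per-player summary a model or report might expect; pandas ships with the Anaconda distribution mentioned later, and all column names and values are made up.

```python
import pandas as pd

# Hypothetical flat extract, one row per shot, as it might come out of an SQL query.
shots = pd.DataFrame({
    "player":   ["A. Striker", "A. Striker", "B. Winger", "B. Winger", "B. Winger"],
    "distance": [11.0, 23.5, 8.2, 15.0, 30.1],
    "goal":     [1, 0, 1, 0, 0],
})

# Reshape into the per-player summary required by the next stage of the analysis.
summary = shots.groupby("player").agg(
    shots=("goal", "size"),
    goals=("goal", "sum"),
    avg_distance=("distance", "mean"),
).reset_index()

print(summary)
```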

Thus, unless you work in a team or have a programmer accomplice, you need a decent grasp of a programming language to do football analytics well. Two languages are the overwhelming favourites of the community: Python and R. My general advice on which one to choose is this: if you know the basics of one of them, stick with it and put in the effort to learn it to a good standard. If you don’t know either, learn R. Personally, I believe that, as a programming language, R is an abomination and it really ought to be banned; but it is also the pragmatic choice for football analytics, because of the breadth of statistical methods available, and because two truly game-changing packages due to Hadley Wickham, dplyr and ggplot2, can take care of 95% of your data manipulation and visualisation needs. RStudio, made by the company Wickham works for, is the leading environment for R development. The Advanced R book (guess the author) is in fact not that advanced in the first few chapters and is a decent guide, unless you’re truly starting from zero.

If you go down the Python route, install the Anaconda distribution, which pre-packages Python 2.7/3.5 (either is fine) for data science, including the scipy, numpy, matplotlib, statsmodels and scikit-learn add-ons essential for data analysis. PyCharm is a wonderful, feature-rich Python editor. An added benefit of Python is that you can use it to structure and query your SQL database using so-called ORMs (object-relational mappers), a technology that integrates the database with the language so closely that database records appear as variables directly in the code and you can manipulate them without writing SQL (I have no doubt that R has ORMs too, but the mere thought makes me shudder). The two leading Python ORMs are Django and SQLAlchemy. I use the former, which is actually a complete framework for database-driven websites, but SQLAlchemy is a perfectly fine choice too; Soccermetrics use it, and you can find plenty of code samples in Howard’s repos.
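
To give a flavour of the ORM style, here is a minimal SQLAlchemy sketch; the Player table and its fields are hypothetical, and the in-memory SQLite engine is just a stand-in for your real MySQL or PostgreSQL database.

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

# Hypothetical table: one row per player.
class Player(Base):
    __tablename__ = "players"
    id = Column(Integer, primary_key=True)
    name = Column(String)
    team = Column(String)
    minutes = Column(Integer)

# An in-memory SQLite database keeps the sketch self-contained;
# point the URL at your MySQL/PostgreSQL instance in a real project.
engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

# Records are plain Python objects; no hand-written SQL required.
session.add(Player(name="A. Midfielder", team="Example FC", minutes=2700))
session.commit()

for player in session.query(Player).filter(Player.minutes > 2000):
    print(player.name, player.team, player.minutes)
```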

Lastly, whether you end up with R or Python, version control software (VCS) is essential. A VCS lets you move easily between multiple versions of your code, makes sure that nothing you ever write is lost, and helps you understand how your code evolved over time and why. There is no better VCS than Git. If you can afford it, pay GitHub 7 money/month and they will host your code in private repositories, and you can use their excellent web interface, which adds tons of genuinely useful features on top of Git itself. If you’d rather not pay, Bitbucket will host private repos for free, but the interface is atrocious. The last option is GitLab — it is free and the interface is perfectly decent, but you have to host the code repository yourself. In all cases, you will have to learn Git itself, which is a command-line program of considerable complexity, but understanding the basic commands (checkout, status, add, commit, push, pull, branch) takes no more than a day and is all you are going to need. The official Git webpage linked above has plenty of good educational resources.

*** Statistics ***

Perhaps my main complaint with public analytics today is that analysts do not use the proper statistical machinery to tackle their questions. As I have said before, all the progress on the key analytics questions that can be achieved by counting and averaging event data has been achieved. Football is complex and football data is noisy, and deriving robust insight requires powerful, specialist tools. Unfortunately, learning advanced statistics on your own is a long and challenging process, and despite having been engaged in it for the past several years, I have only scratched the surface. Perhaps a more efficient way of doing it would be to attend an online course, or follow a general statistics textbook. I can’t recommend any particular course, but I can’t imagine that a randomly chosen one would do harm. As to textbooks, Cosma Shalizi’s draft is very decent, as is Norman Matloff’s (Thom Lawrence’s find), and they are both free. Gelman et al.’s Bayesian Data Analysis is a comprehensive, advanced treatment of, erm, Bayesian data analysis, and if you google hard enough there is a PDF of that on the web too.

One concrete statistical method that I believe is simple enough to get a grip on very quickly, yet could instantly improve a majority of your analyses, is Generalized Linear Models (GLMs). GLMs generalize linear regression in two ways: first, the expected value of the dependent (predicted) variable no longer has to be a linear function of the predictor variables; the two are instead connected through a so-called link function. Second, the distribution of the response is not necessarily Gaussian: it can be, for example, binomial or Poisson. Because of these two areas of added flexibility, many of the common football analytics models fit in the GLM framework. An expected goals model can be a GLM, but so can a score prediction model or a power ranking, and so on. R has the built-in glm function, which allows you to specify a GLM with a single, powerful command, and the payoff is great: you get the coefficients, statistical significance and fit diagnostics all for free.
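
If you are on the Python side instead, statsmodels (bundled with Anaconda) offers much the same. Below is a minimal sketch of an expected-goals-style binomial GLM on made-up shot data; the column names and values are hypothetical.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical shot-level data: distance to goal in metres and whether a goal was scored.
shots = pd.DataFrame({
    "distance": [6, 8, 10, 12, 16, 18, 22, 25, 28, 30],
    "goal":     [1, 1, 1, 0, 1, 0, 0, 0, 0, 0],
})

# A logistic-regression GLM: binomial response, logit link by default.
model = smf.glm("goal ~ distance", data=shots, family=sm.families.Binomial()).fit()

# Coefficients, standard errors, significance and fit diagnostics in one place.
print(model.summary())

# Predicted scoring probability for a hypothetical 15-metre shot.
print(model.predict(pd.DataFrame({"distance": [15]})))
```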

***

My objective in this article is to enable budding football analysts to build a platform from which they can analyse the game in the most efficient way possible, meaning maximum insight per unit of time spent. Unfortunately, unless you’re already half-way there, this entails a heavy initial time investment. In my view it is not just worth it; it is a necessary precondition for serious work. If serious work is what you have in mind, do not hesitate to ask me on Twitter (@statlurker) for further pointers.

 

Many thanks to Thom Lawrence for his feedback on this article.

  • Ville Sillanpää

    What about the data itself? I have all the capabilities you talk about in the article, but they are of no use since all interesting data is proprietary. Looking at analytics twitter it seems that all interesting stuff is done with datasets that have event info with spatial data. I don’t think you can get that kind of data for cheap from anywhere.

    Great article though! Thanks for writing this.

    • Paul Tiensuu

      Not only is it not cheap, but it is virtually impossible to get for “personal usage”, which is how Opta put it when I asked to buy data from them. They refused to sell their data to me. I am not sure what exactly I should have done to be allowed to buy it (they were not explicit about this), but it seemed to me that if I had an enterprise with a plan for commercial use of the data, they could have sold it to me. I’d be really interested in hearing how Marek, Ted and others got their data in the first place.

      • Ron IsNotMyRealName

        Definitely a good question.

      • kolyvanov

        This question has been asked over and over again, not only to StatsBomb people, but it never gets a reply. I think this kind of attitude is holding football analytics back.
        By the way, have you asked other data providers?

        • Paul Tiensuu

          I haven’t. Tbh I don’t know others.

          I’ve noticed the silence around this issue too. I’ve personally posed the question a few times here. It’s strange: they repeatedly tell us to get real data in our hands to get started, but when it comes to the question of how we get the data, silence falls.

          • kolyvanov

            I can think of Prozone as an alternative provider.
            But possibly there are many others.

      • BrunoP

        I can confirm this. I have also tried to buy their data (Opta) but they refused to sell it to me. They would only sell basic, aggregated data like number of shots, interceptions, etc.

      • Johan

        Don’t want to get into the Opta data argument (for now) but I have some old Opta-like match data I could let you all have so that you can practice your analytic skills.

        • Misha

          I’d appreciate that very much!

          • Johan

            OK, can someone suggest the best way for me to give you access to the data? Ideally I would like to deal with each one of you directly via email and find out what you want to do with it. I would also expect to receive back some interesting analysis, etc.

          • Hannesson1

            Very much appreciated! Can you provide an email address where we can send our requests?
            Regards

          • Johan

            Write to walter.bert@digitalartlab.co.uk with a brief description of who you are and what you aim to do with the data.

        • Andre Forbes

          Ditto. Would be quite interested to mess with some real data.

        • Paul Tiensuu

          I would appreciate that too.

        • StefanoR

          I’d love that. Thanks!

        • http://bmarques.net El Saico

          A Master’s student with no data yet to speak of would deeply appreciate it.

    • akoshodi

      Here is a link to a Kaggle dataset which contains 25k+ matches and player stats for European professional football: https://www.kaggle.com/hugomathien/soccer I don’t know if it’s what you are looking for, but it’s good enough for learning.

      • kolyvanov

        The one with the player attributes from the EA Sports FIFA games?
        Are you serious? There are no real individual player stats in that dataset.

    • Ville Sillanpää

      I guess after a few years it might be that Opta will release some old datasets to the public to accelerate data use. I guess it could make commercial sense to release some 5+ year-old data for academic purposes, as then you could still sell the latest data to relevant parties.

      • kolyvanov

        or they may sell data to individuals and increase revenues.

    • Tuiuan Veloso

      Well, I live in the country of tricks (Brazil), but when I was faced with the same situation in a different area I used my company as a “front” to buy the data so I could use it. It was nothing remotely related to my business, but since I was buying it as a company, they sold it to me anyway… I don’t know if that’s a possibility here, but it is an idea.

      • Ron IsNotMyRealName

        I am very interested in hearing more about this (like how much it cost, and what you got).

        • Tuiuan Veloso

          It was some market data for a piece of college work, by me and a teacher. The university refused to support us, not exactly because they didn’t want to, but because the bureaucracy involved in me and my teacher paying the university to buy something was kinda hard, and could raise some eyebrows. So he and I just split the cost, using my company as a front to buy it. It was kinda expensive: converting the currency at the time, it was roughly $600 for a quite old, small piece of data. The most damning thing is that in the end we had to abandon the article, as it was getting increasingly difficult for us to find time and we couldn’t get financing from any source, so it was money down the toilet in the end.

          • Ron IsNotMyRealName

            Oh I thought you meant it was football data haha.

          • Tuiuan Veloso

            There was a company here in Brazil that gathered some limited data from Brasileirão and sold it to anyone. It was some very basic stuff, and I don’t know if they’re still around. Footstats is the name of the company, I think.

          • Ron IsNotMyRealName

            Well, WhoScored has Brazil Serie A now, but having historical data would be nice too. I’m actually trying to get a project going for which that might be quite useful.

    • http://bmarques.net El Saico

      100% this.

      My Master’s dissertation will be about machine learning-based tactical analysis, and as such is entirely reliant on spatiotemporal match data.
      However, attempting to contact Opta was fruitless (I managed to reach their LatAm representative, but he did not answer me further), Prozone tried to sell me some packaged stuff, and even local provider Footstats didn’t even bother to reply.

      This means I risk having to either give up on my degree or quit the area.

      • Johan

        Did you write to me? Not sure if you did…very busy last few days. Don’t worry, help is coming.

      • Dale Micallef

        I am also doing a thesis on a similar topic, which also requires spatiotemporal match data. Have you had any luck finding this?

  • Ville Sillanpää

    For basics of Bayesian data analysis I would recommend “Doing Bayesian Data Analysis” by John K Kruschke instead of the Gelman book you suggested. In my opinion the Kruschke book is more hands on whereas the Gelman book is more theoretical.

  • Dave Foster

    May I add Jupyter Notebooks to the list of Python tools – having a friendly and forgiving REPL for exploration comes in really handy for quickly iterating on ideas before bringing a project into a finalized state.

  • Ron IsNotMyRealName

    Fantastic article. I’m in a program that offers experience in all of these right now.

  • kidmugsy

    Analysis of individual players makes sense for attacks: for defences, analysis of the group makes more sense. Or at least it would if any club had the insight to buy a whole defence rather than a scatter of individual players. Or, if not a whole defence, at least it could buy centre backs as a pair. Or centre backs plus goalie. Or centre backs plus DMF.

    • Ron IsNotMyRealName

      It’s hard to say what statistics even make a good CB.

      If you had everything you wanted in the world, how would you quantify and value a CB independently of team quality (just not giving up goals is not enough, and may even be a false positive)?

      And even further, how would you do this for a young CB that hasn’t had experience of playing in multiple partnerships at senior level?

    • Misha

      La Squadra Azzurra got that for free. 🙂

    • Tuiuan Veloso

      Sometimes a great CB may fool you into thinking his bad partner is not that bad, or defenders that have complementary qualities or play well in that particular style…Not even talking about differences between how coaches set up their defenses, zonal/man marking, adjust for opponents…Defense is fascinating, but it’s a goddamn mess.

  • kidmugsy

    To extend this: even if you suspect that no club will want to sell a pair of good centre backs, a buying club could at least look for centre backs who have played well as a pair for their country.

  • Johan

    Wonder why my offer of some Opta-like data has been taken up by only a few people. Please check the post below for details. Last orders…

    • Hannesson

      Hi Johan, I’ve sent you two emails but have not got an answer so far. Could you double-check your inbox?
      And just tell me if I need to be more concrete concerning my research topic.
      Best, Hannes

      • Johan

        Just found your email in Spam… will reply soon. Sorry!

        • Hannesson

          Thank you very much, just received the file.
          No problem at all 🙂
          Cheers

  • 1912George

    Lovely article, many thanks for that.
    I have to add that sourcing data is another big part of this. Much of your programming time will go into extracting data from different sources, cleaning it and planning your database. As mentioned in the article, you will be spending 95% of your time preparing your databases and looking for new data, and the other 5% will be your football analysis.

    I do scraping myself, and there are a number of websites that use Opta data. So learning data mining is one of the important skills you should pick up.

    I wouldn’t say that it is a heavy time investment: 6-12 months will get you very close to intermediate level in databases, programming and stats.

  • moshe ziat

    Great article, thanks.

    I started such a project a few months ago. Two tips for starters to make your life easier:

    1. You don’t need Opta datasets to begin. You can just scrape data from the big sites covering the major leagues and get started. Usually you can extract data about goals, scorers, minutes, yellow and red cards, fouls, corners etc. fairly easily.

    2. Start with a rule-based algorithm. Machine learning is a different level of complexity, and a rule-based algorithm can be enough to start with.

    I am working on stats I extracted for the English Premier League, and the most interesting (and simple) signal I was able to extract is “team form behavior”. It goes as follows:

    – Every team result can be translated to “W” (win), “D” (draw) or “L” (loss).
    – Every team has a sequence of letters that represents its form from the start of the season, constructed from W, D and L.
    – As a linguist, I’m treating it as a language with a grammar. Based on all teams’ letter sequences from past seasons, I can extract a general syntax.
    – Now, given a form – let’s say WDD – I can check which letter most commonly comes after this sequence (see the sketch below).
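
    A minimal sketch of this idea in Python, on made-up form strings, simply counting which result most often follows a given sequence:

    ```python
    from collections import Counter

    # Made-up form sequences: one string of W/D/L results per team.
    histories = ["WWDLWDDWLW", "DLLWWDDLWD", "WDDWWLDDWW"]

    def next_result_counts(histories, pattern):
        """Count which result follows each occurrence of `pattern`."""
        counts = Counter()
        for history in histories:
            for i in range(len(history) - len(pattern)):
                if history[i:i + len(pattern)] == pattern:
                    counts[history[i + len(pattern)]] += 1
        return counts

    counts = next_result_counts(histories, "WDD")
    print(counts)                        # Counter({'W': 2, 'L': 1}) for the data above
    if counts:
        print(counts.most_common(1)[0])  # the most common follow-up result
    ```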

    Marek, I can’t send you a Twitter message. I’d appreciate it if you contact me via my Twitter account (@ziatmoshe).

    Thanks.