A toolbox for football analytics
My article "Towards a new kind of analytics", published on this site several weeks ago, has received a lot of attention, for which I am very grateful. Most of the feedback I received, though, was along the lines of "this is all well and good, but how do I go about doing this kind of stuff?" This follow-up is designed to answer that question in a narrow sense, by listing some of the basic computational and statistical tools that I have found indispensable in "proper" analytics work. As I argued in the previous article, the lowest-hanging fruit in football analytics has been picked and basic tools aren't sufficient anymore. To put things bluntly, the vast majority of analysts active in the public sphere today need to make a qualitative methodological leap forward or lose contact with the state of the art forever.
The narrative link list below is divided into three parts: databases, programming and statistics.
*** Databases ***
You must kill everything you love, and that includes Excel, because a well-organised data store is critical to your analytics pipeline. It ensures that you have a single cross-linked source of data that you can curate and keep up-to-date, and it minimises the time required to extract and prepare datasets for analysis. In practice, this means a relational (SQL) database. SQL is an industry-standard language for extracting ("querying") data from highly optimized databases. It was designed to be relatively easy to use by people without a programming background, and I would say that this goal has been achieved -- I have personally converted several non-technical people into able SQL query-writers.
To use SQL you must operate an actual database containing your data. There are two basic questions here: what database variety to use, and where to run it. Starting with the latter, it is certainly possible and an entirely reasonable initial step to maintain a database on your personal computer. However, having it in the cloud is the superior option, because you don't need to worry about setup, backups or power failures. Amazon Web Services (AWS) offer a year's worth of free database hosting, so you can start with them and after a year either pay to continue or move to another solution. As to what flavour of SQL to use, I recommend MySQL if you have no previous relational database experience, and PostgreSQL otherwise. My impression having used both is that PostgreSQL is vastly superior to MySQL in a number of aspects, but trickier to tame without experience. If you go down the MySQL route, the MySQL Workbench is the leading front-end to MySQL databases.
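To give a flavour of what querying looks like, here is a minimal sketch. It assumes a made-up two-table schema (`teams` and `shots`) with invented rows, and it uses SQLite through Python's built-in `sqlite3` module purely as a stand-in so that it runs without a server. The query itself is plain SQL that should carry over to MySQL or PostgreSQL with little or no change.

```python
import sqlite3

# In-memory SQLite database: a stand-in for MySQL/PostgreSQL (assumption),
# chosen only so that the sketch runs without any setup.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: two cross-linked tables.
conn.executescript("""
CREATE TABLE teams (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE shots (
    id      INTEGER PRIMARY KEY,
    team_id INTEGER NOT NULL REFERENCES teams(id),
    is_goal INTEGER NOT NULL  -- 1 if the shot was scored, 0 otherwise
);
""")

# Invented toy data, for illustration only.
conn.executemany("INSERT INTO teams (id, name) VALUES (?, ?)",
                 [(1, "Reds"), (2, "Blues")])
conn.executemany("INSERT INTO shots (team_id, is_goal) VALUES (?, ?)",
                 [(1, 1), (1, 0), (1, 0), (1, 1),
                  (2, 0), (2, 1), (2, 0), (2, 0), (2, 0)])

# The query: shots and goals per team, joining the two tables.
query = """
SELECT t.name, COUNT(*) AS shots, SUM(s.is_goal) AS goals
FROM shots AS s
JOIN teams AS t ON t.id = s.team_id
GROUP BY t.name
ORDER BY t.name
"""
rows = conn.execute(query).fetchall()
for name, shots, goals in rows:
    print(name, shots, goals)
```

The point of the design is that the join and the aggregation happen inside the database, so the analysis code receives a small, already summarised result rather than the raw rows.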
*** Programming ***
In an ideal world, analytics requires little actual programming: the right data emerges from the database in the right form for a machine learning algorithm to analyse it and report the results to you. The practice makes a cruel mockery of this vision, and as an analyst I spend most of my time programming rather than designing models and analysing results. The three most common programming tasks that I have to perform are: loading raw data into databases; transforming the data extracted from a database into the form required by statistical software; and validating and reporting the results of my analyses. I suspect that the experience of other analysts is broadly similar.
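The three tasks above can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the CSV layout, the `events` table, the function names and the data are all invented, and SQLite again stands in for a real database server.

```python
import csv
import io
import sqlite3

# Invented raw data: one row per shot attempt, with its outcome.
RAW_CSV = """match_id,team,minute,outcome
1,Home,12,saved
1,Home,30,goal
1,Away,44,missed
1,Away,70,saved
2,Home,5,goal
2,Away,18,goal
2,Away,60,missed
"""


def load_events(conn, text):
    """Task 1: parse raw CSV text and load it into a (hypothetical) events table."""
    conn.execute(
        "CREATE TABLE events (match_id INTEGER, team TEXT, minute INTEGER, outcome TEXT)"
    )
    reader = csv.DictReader(io.StringIO(text))
    rows = [(int(r["match_id"]), r["team"], int(r["minute"]), r["outcome"])
            for r in reader]
    conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", rows)
    return len(rows)


def match_features(conn):
    """Task 2: reshape one-row-per-shot data into one row per match and team."""
    cur = conn.execute("""
        SELECT match_id, team,
               COUNT(*)                AS shots,
               SUM(outcome = 'goal')   AS goals
        FROM events
        GROUP BY match_id, team
        ORDER BY match_id, team
    """)
    return cur.fetchall()


def validate(features, n_events):
    """Task 3 (validation part): basic sanity checks before any modelling."""
    assert sum(shots for _, _, shots, _ in features) == n_events
    assert all(0 <= goals <= shots for _, _, shots, goals in features)


conn = sqlite3.connect(":memory:")
n_events = load_events(conn, RAW_CSV)
features = match_features(conn)
validate(features, n_events)
for row in features:
    print(row)
```

Real pipelines are messier than this, but the shape tends to be the same: a loader, a reshaping query, and checks that fail loudly when the numbers do not add up.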
Thus, unless you work in a team or have a programmer accomplice, you need a decent grasp of a programming language to do football analytics well. Two languages are the overwhelming favourites of the community: Python and R. My general advice on which one to choose is this: if you know the basics of one of them, stick with it and put in the effort to learn it to a good standard. If you don't know either, learn R. Personally, I believe that as a programming language, R is an abomination and it really ought to be banned; but it is also the pragmatic choice for football analytics, because of the breadth of statistical methods available, and because two truly game-changing packages due to Hadley Wickham, dplyr and ggplot2, can take care of 95% of your data manipulation and visualisation needs. RStudio, made by the company Wickham works for, is the leading environment for R development. The Advanced R book (guess the author) is in fact not that advanced in the first few chapters and is a decent guide, unless you're truly starting from zero.
If you go down the Python route, install the Anaconda distribution, which pre-packages Python 2.7/3.5 (either is fine) for data science, including the scipy, numpy, matplotlib, statsmodels and scikit-learn add-ons essential for data analysis. PyCharm is a wonderful, feature-rich Python editor. An added benefit of Python is that you can use it to structure and query your SQL database using a so-called ORM (object-relational mapper), a technology that integrates the database with the language so closely that database records appear as objects directly in the code and you can manipulate them without SQL (I have no doubt that R has ORMs too, but the mere thought makes me shudder). The two leading Python ORMs are Django and SQLAlchemy. I use the former, which is actually a complete framework for database-driven websites, but SQLAlchemy is a perfectly fine choice too; Soccermetrics use it, and you can find plenty of code samples in Howard's repos.
Lastly, whether you end up with R or Python, version control software (VCS) is essential. A VCS lets you move easily between multiple versions of your code, so that nothing you ever write is lost, and it helps you understand how your code evolved over time and why. There is no better VCS than Git. If you can afford it, pay GitHub 7 money/month and they will host your code in private repositories, and you can use their excellent web interface which adds tons of genuinely useful features on top of Git itself. If you'd rather not pay, Bitbucket will host private repos for free, but the interface is atrocious. The last option is GitLab -- it is free and the interface is perfectly decent, but you have to host the code repository yourself. In all cases, you will have to learn Git itself, which is a command-line program of considerable complexity, but understanding the basic commands (checkout, status, add, commit, push, pull, branch) takes no more than a day and is all you are going to need. The official Git website has plenty of good educational resources.
*** Statistics ***
Perhaps my main complaint about public analytics today is that analysts do not use the proper statistical machinery to tackle their questions. As I have said before, all progress on the key analytics questions that can be achieved with counting and averaging event data has been achieved. Football is complex and football data is noisy, and to derive robust insight, powerful, specialist tools are necessary. Unfortunately, learning advanced statistics on your own is a challenging and long process, and despite having been engaged in it for the past several years, I have only scratched the surface. Perhaps a more efficient way of doing it would be to attend an online course, or follow a general statistics textbook. I can't recommend any particular course but I can't imagine that a randomly chosen one can do harm. As to textbooks, Cosma Shalizi's draft is very decent, as is Norman Matloff's (Thom Lawrence's find), and they are both free. Gelman et al.'s Bayesian Data Analysis is a comprehensive, advanced treatment of, erm, Bayesian data analysis, and if you google hard enough there is a PDF of that on the web too.
One concrete statistical method that I believe is simple enough to get a grip on very quickly but could instantly improve a majority of your analyses is Generalized Linear Models (GLMs). GLMs generalize linear regression in two ways: first, the dependent (predicted) variable doesn't have to be a linear function of the predictor variables anymore (its expected value is tied to a linear combination of them through a "link" function), and second, the distribution of errors is not necessarily Gaussian. Because of these two areas of added flexibility, many of the common football analytics models fit in the GLM framework. An expected goals model can be a GLM but so can a score prediction model or a power ranking, and so on. R has the in-built glm function, which allows you to specify GLMs with a single, powerful command, and the payoff is great: you get the coefficients, statistical significance and fit diagnostics all for free.
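To make the idea less abstract, here is a minimal sketch of the simplest possible expected goals GLM: a logistic regression of shot outcome on shot distance, i.e. P(goal) = 1 / (1 + exp(-(b0 + b1 * distance))). It is written in plain Python so that the mechanics are visible, and it is fitted by Newton-Raphson, which for this model coincides with the iteratively reweighted least squares procedure that GLM software typically uses. The data is invented for illustration, and the function and variable names are my own. In real work you would call a library routine instead, for example `glm(goal ~ distance, family = binomial)` in R, and get standard errors and diagnostics that this sketch does not compute.

```python
import math


def fit_logistic_glm(x, y, iterations=25):
    """Fit P(y=1) = 1 / (1 + exp(-(b0 + b1*x))) by Newton-Raphson.

    Illustrative only: one predictor, no convergence check, no standard errors.
    """
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0          # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0  # Fisher information (negative Hessian)
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: beta += H^{-1} g, with the 2x2 inverse written out.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Invented toy data: shot distance in metres and whether the shot was scored.
distances = [6, 8, 10, 11, 12, 14, 16, 18, 20, 22, 25, 30]
goals = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0]

b0, b1 = fit_logistic_glm(distances, goals)


def xg(distance):
    """Expected goals for a single shot under the fitted toy model."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * distance)))


print("intercept:", b0, "slope:", b1)
print("xG at 8 m:", xg(8), "xG at 25 m:", xg(25))
```

A useful sanity check on any fit of this kind is that the predicted probabilities over the training shots sum to the number of goals actually scored, which follows from the model having an intercept.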
My objective in this article is to enable budding football analysts to build a platform from which they can analyse the game in the most efficient way possible, meaning maximum insight per unit of time spent. Unfortunately, unless you're already half-way there, this entails a heavy initial time investment. In my view it is not just worth it; it is a necessary precondition for serious work. If serious work is what you have in mind, do not hesitate to ask me on Twitter (@statlurker) for further pointers.
Many thanks to Thom Lawrence for his feedback on this article.