Messi Data Release Part 1 & Working with StatsBomb Data in R
Today's the day. Er, hoy es el día.
Last year we released free StatsBomb Data for the Men's 2018 World Cup, as well as for several women's leagues. This year we did the same for the Women's World Cup. Now, as previously announced, we're rolling out free data for every La Liga game that Lionel Messi has played in to date.
All of this will be released in four parts. Today we're putting out the first chunk of data, covering the 2004/2005 to 2007/2008 seasons. This comes along with a few changes to the open data repository on GitHub. But don't worry, our lovely CTO Thom Lawrence is on hand to explain all that:
"We've made a few changes to the open data repo. The first of these might break your existing code (although we've updated StatsBombR to support it): matches are now split into competitions and seasons, whereas previously only a single season of each competition was available. Secondly, we've upgraded all the matches currently available to our latest spec. This includes goodies like derived carry events, which might be of interest if you wanted to study the dribbles of a certain diminutive Argentinian. These events are calculated automatically, so you might see some very odd short looking carries, but for now we leave them all in the data for completeness' sake. Various other improvements to the data are detailed in the updated docs available alongside the data. Have a look at the docs and the new file layout, and please update your code to be compatible with the changes. Of course, if you don't have time, you can also just pull previous versions of the data from git if you really need to."
Just to emphasise: if you have done work with our data previously, please update StatsBombR.
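To make Thom's two points a little more concrete, here is a rough sketch in base R. It assumes the repo now stores matches at `data/matches/<competition_id>/<season_id>.json`, and that each carry has a start location and an end location. The helper names, the toy data frame with its column names, the example ids and the 2-unit cut-off are all made up for illustration; none of them come from StatsBombR or the official docs.

```r
# Hypothetical helper: relative path to a matches file under the assumed
# layout data/matches/<competition_id>/<season_id>.json.
matches_path <- function(competition_id, season_id) {
  paste("data", "matches", competition_id, paste0(season_id, ".json"), sep = "/")
}

# Hypothetical helper: straight-line length of each carry, in pitch units.
carry_length <- function(events) {
  sqrt((events$end_x - events$start_x)^2 + (events$end_y - events$start_y)^2)
}

# Toy events with made-up coordinates, standing in for flattened event data.
events <- data.frame(
  type    = c("Carry", "Pass", "Carry", "Carry"),
  start_x = c(60, 60, 70, 80),
  start_y = c(40, 40, 30, 20),
  end_x   = c(60.5, 75, 73, 92),
  end_y   = c(40, 45, 34, 25),
  stringsAsFactors = FALSE
)

# Keep only the carries and measure them.
carries <- events[events$type == "Carry", ]
carries$length <- carry_length(carries)

# Arbitrary cut-off for this sketch: drop carries shorter than 2 units.
min_length <- 2
long_carries <- carries[carries$length >= min_length, ]

# Example ids only; look up the real ones in the repo's competitions file.
print(matches_path(11, 37))
print(long_carries)
```

With the toy numbers above, the 0.5-unit carry is dropped and the two longer ones (lengths 5 and 13) are kept. Whether you filter the very short carries out at all, and where you draw the line, depends entirely on what you are studying.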
To get your hands on the data just hop on over to our resource centre, sign the user agreement and go have some fun.
Working with StatsBomb Data In R
Now, we know that wrangling the data is an intimidating process for some. Coding can be tough to get into. For the uninitiated, staring at walls of code is a bit like gazing into a misty forest of unnervingly arranged characters. And just who is this json?
We'd like this to be as approachable as possible for as many people as possible. We want you to feel comfortable jumping in and having a play around. With that in mind, we've put together a little primer for working with our data in the R programming language.
R was chosen for a few reasons. For starters, it's the language that those of us on the internal analysis side of StatsBomb use most commonly. It's quite handy in various ways for parsing, visualising and generally working with large datasets (although I've no doubt some will have objections to this). And, perhaps most importantly, our former data scientist Derrick Yam (now of the bloody Baltimore Ravens! Go you Hamsterdam birdfriends!) put together StatsBombR, an R 'package' with a whole suite of functions that make it a breeze to pull and manipulate our data. For those who have already committed to Python, we apologise for having nothing that caters to you. Perhaps in the future we will. It's just not something we have the time and expertise to do right now.
The primer will run you through the fundamentals of R itself, show you how to pull our data, walk through a few initial use cases (like pulling out various numbers and building some basic plots) and point you towards some resources that'll help you along your way. You can find it here in PDF form:
GUIDE IN ENGLISH HERE--> Using StatsBomb Data In R
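If you'd like a taste before opening the PDF, pulling one season of the free La Liga data with StatsBombR looks roughly like the sketch below. It assumes StatsBombR is already installed and that you have an internet connection. It also assumes that the competitions table has `competition_name` and `season_name` columns with the values shown. The wrapper function `pull_la_liga_events` is a made-up name for this example, and the exact arguments may differ between package versions, so treat the guide as the authority.

```r
# Hypothetical wrapper around StatsBombR calls; needs the package attached
# and a network connection to reach the open data repo.
pull_la_liga_events <- function(season = "2004/2005") {
  # List every competition and season available in the free data.
  comps <- FreeCompetitions()

  # Keep a single competition-season (column names and values are assumptions).
  wanted <- comps[comps$competition_name == "La Liga" &
                    comps$season_name == season, ]

  # Fetch the matches for that season, then the events for those matches.
  matches <- FreeMatches(wanted)
  events <- StatsBombFreeEvents(MatchesDF = matches)

  # Tidy the events and add the package's extra derived columns.
  allclean(events)
}

if (requireNamespace("StatsBombR", quietly = TRUE)) {
  library(StatsBombR)
  events <- pull_la_liga_events()
  print(nrow(events))
} else {
  message("StatsBombR is not installed; see the guide for installation steps.")
}
```

The guard at the bottom just stops the script from falling over if the package isn't there yet.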
That's not all though. This is, after all, a project revolving around Spanish football. With that in mind, we also have a Spanish-language version of the primer. The translation was kindly done by the effervescent Pablo P. Rodríguez. If you have any questions about the translation specifically then he will happily* field them (*poor bugger). And while you're there, please do thank him for his efforts.
GUIDE IN SPANISH HERE--> Using StatsBomb Data In R - Spanish
Just one example of what can be conjured up in R with StatsBomb Data. I made this and it's crap, but you can do better and take my job!
If you have any questions about the primer, or just generally about using the data in R, do give me a shout on Twitter (DMs are better for longer inquiries) or via my work email, firstname.lastname@example.org.
Hopefully having a look through this will ease your anxiety about diving into R. It's just a start but it should give you that first little nudge on the way to getting fully stuck in to the wild world of data witchcraft. We sincerely hope that you have a grand ol' time rambling through the avenues of Messi's La Liga history and all the free data beyond that.
Be well and have great days.