Doing More With StatsBomb Data in R

Alongside the release of our Messi dataset we also put out a PDF guide to using our data in R. It was intended as a basic introduction to not only our dataset but also the R programming language itself, for those who have yet to use it at any level. Hopefully that gave anyone interested in digging into football data a nice, smooth onboarding to the whole process.

For those who have taken the plunge, this article is going to go through a few more involved things that one could do with the data. This is for those who have already gone through the guide and have been playing about with SBD for a while now. It’s important that you have done this first, as we will not be walking through absolutely everything and will assume a certain level of familiarity with R. Now that the base terminology of it all has been established it should be easier to explore uncharted territory with a bit less trepidation. So far we have released open data on the women’s and men’s World Cups, the FAWSL, the NWSL, Lionel Messi’s entire La Liga career, the 2003/04 Arsenal Invincibles and 15 years of Champions League finals. You can follow along with this article using any dataset you like but for consistency’s sake we will be using the 2019/20 FAWSL season in all examples.

One last disclaimer: this is, of course, all about R. We also have a package for Python that isn’t quite as developed but still handles plenty of the basics for you if that’s your programming language of choice.


A big hurdle to doing anything nuanced with any dataset is one’s underlying understanding of it. There are so many distinct variables and considerations in the SB dataset that even I – having worked with it as my job for two years now – forget about some parts of it every now and then. To this end it helps to not only have our specs to hand for checking, but also to be aware of the names() and unique() functions. These allow you to get a top-down look at the columns and values a dataframe contains. So let’s assume you have your data in an R df called ‘events’. We will be using this name for the data in all examples throughout this article. If you were to do names(events) that would give you a list of all the columns in your dataset. Similarly, if you were to do unique(events$type.name) you would get a list of every unique value that the ‘type.name’ column contains, i.e. all the event types in our data. You can of course do that with any column. It’s good to have these two in your back pocket should you get lost in the forest of data at any point.
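As a quick illustration of those two functions, here is a minimal sketch on a made-up data frame. The name events_toy and its contents are invented for this example; the real events data has far more columns and rows.

```r
# Toy stand-in for the 'events' data frame (hypothetical values).
events_toy <- data.frame(
  id = c("a1", "a2", "a3", "a4"),
  type.name = c("Pass", "Shot", "Pass", "Pressure"),
  player.name = c("Player A", "Player B", "Player A", "Player C"),
  stringsAsFactors = FALSE
)

# Every column name in the data frame
column_names <- names(events_toy)
print(column_names)

# Every distinct value in one column, in order of first appearance
event_types <- unique(events_toy$type.name)
print(event_types)

# A count per event type is often the natural next question
type_counts <- table(events_toy$type.name)
print(type_counts)
```

Swapping events_toy for your own events data frame should give you the full column list and the full set of event types.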

xGA, Joining and xG+xGA

xG assisted does not exist in our data initially. However, given that xGA is the xG value of a shot that a key pass/assist created, and that xG values do exist in our data, we can create xGA quite easily via joining. Here’s the code for that, we’ll go through it bit-by-bit afterwards:

library(tidyverse)
library(StatsBombR)

xGA = events %>%
  filter(type.name == "Shot") %>% #1
  select(shot.key_pass_id, xGA = shot.statsbomb_xg) #2

shot_assists = left_join(events, xGA, by = c("id" = "shot.key_pass_id")) %>% #3
  select(team.name, player.name, player.id, type.name, pass.shot_assist, pass.goal_assist, xGA) %>% #4
  filter(pass.shot_assist == TRUE | pass.goal_assist == TRUE) #5

  1. Filtering the data to just shots, as they are the only events with xG values.
  2. select() allows you to choose which columns you want to, well, select, from your data, as not all are always necessary – especially with big datasets. First we are selecting the shot.key_pass_id column, which is a variable attached to shots that is just the ID of the pass that created the shot. You can also rename columns within select() which is what we are doing with xGA = shot.statsbomb_xg. This is so that, when we join it with the passes, it already has the correct name.
  3. left_join() lets you combine the columns from two different DFs by using a column on either side of the join as reference keys. So in this example we are taking our initial DF (‘events’) and joining it with the one we just made (‘xGA’). The key is the by = c("id" = "shot.key_pass_id") part. This is saying ‘join these two DFs on instances where the id column in events matches the shot.key_pass_id column in xGA’. So now the passes have the xG of the shots they created attached to them under the new column ‘xGA’.
  4. Again selecting just the relevant columns.
  5. Filtering our data down to just key passes/assists.
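To see the join in isolation, here is a small sketch on invented data. The ids, xG values and the names events_toy, xGA_toy and joined are all made up for illustration; only the column names mirror the ones used above.

```r
library(dplyr)

# Made-up events: two passes, the two shots they created, and one unassisted shot.
events_toy <- tibble(
  id = c("p1", "p2", "s1", "s2", "s3"),
  type.name = c("Pass", "Pass", "Shot", "Shot", "Shot"),
  shot.key_pass_id = c(NA, NA, "p1", "p2", NA),
  shot.statsbomb_xg = c(NA, NA, 0.10, 0.35, 0.05)
)

# Shots only, keeping the id of the pass that created each one
xGA_toy <- events_toy %>%
  filter(type.name == "Shot") %>%
  select(shot.key_pass_id, xGA = shot.statsbomb_xg)

# Attach each shot's xG to the event whose id matches shot.key_pass_id
joined <- left_join(events_toy, xGA_toy, by = c("id" = "shot.key_pass_id"))

# Passes p1 and p2 now carry an xGA value; every other row has NA
print(select(joined, id, type.name, xGA))
```

Every row of the left-hand table is kept, which is why the later filter on pass.shot_assist and pass.goal_assist is still needed to get down to just the key passes and assists.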

 

The end result should look like this:

All lovely. But what if you want to make a chart out of it? Say you want to combine it with xG to make a handy xG+xGA per90 chart:

player_xGA = shot_assists %>%
  group_by(player.name, player.id, team.name) %>%
  summarise(xGA = sum(xGA, na.rm = TRUE)) #1

player_xG = events %>%
  filter(type.name == "Shot") %>%
  filter(shot.type.name != "Penalty" | is.na(shot.type.name)) %>%
  group_by(player.name, player.id, team.name) %>%
  summarise(xG = sum(shot.statsbomb_xg, na.rm = TRUE)) %>%
  left_join(player_xGA) %>%
  # players with no key passes come out of the join with NA xGA; treat that as 0 so their xG is not lost
  mutate(xGA = replace_na(xGA, 0), xG_xGA = xG + xGA) #2

player_minutes = get.minutesplayed(events)

player_minutes = player_minutes %>%
  group_by(player.id) %>%
  summarise(minutes = sum(MinutesPlayed)) #3

player_xG_xGA = left_join(player_xG, player_minutes) %>%
  mutate(nineties = minutes/90,
         xG_90 = round(xG/nineties, 2),
         xGA_90 = round(xGA/nineties, 2),
         xG_xGA90 = round(xG_xGA/nineties, 2)) #4

chart = player_xG_xGA %>%
  ungroup() %>%
  filter(minutes >= 600) %>%
  top_n(n = 15, wt = xG_xGA90) #5

chart <- chart %>%
  select(1, 9:10) %>% # columns 1, 9 and 10 are player.name, xG_90 and xGA_90
  pivot_longer(-player.name, names_to = "variable", values_to = "value") %>%
  filter(variable == "xG_90" | variable == "xGA_90") #6

  1. Grouping by player and summing their total xGA for the season.
  2. Filtering out penalties and summing each player’s xG, then joining with the xGA and adding the two together to get a third combined column.
  3. Getting minutes played for each player. If you went through the initial R guide you will have done this already.
  4. Joining the xG/xGA to the minutes, creating the 90s and dividing each stat by the 90s to get xG per 90 etc.
  5. Here we ungroup as we need the data in ungrouped form for what we’re about to do. First we filter to players with a minimum of 600 minutes, just to get rid of notably small samples. Then we use top_n(). This filters your DF to the top *insert number of your choice here* based on a column you specify. So here we’re filtering to the top 15 players in terms of xG90+xGA90.
  6. The pivot_longer() function flattens out the data. It’s easier to explain what that means if you see it first:
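A small sketch of steps 4 and 6 on made-up totals may help here. The players, totals and minutes below are invented, and columns are selected by name rather than by position purely to keep the example readable.

```r
library(dplyr)
library(tidyr)

# Hypothetical season totals for two players
totals <- tibble(
  player.name = c("Player A", "Player B"),
  xG = c(4.5, 3.0),
  xGA = c(2.0, 3.5),
  minutes = c(900, 900)
)

# Per-90 values: divide each total by the number of 90-minute blocks played
wide <- totals %>%
  mutate(nineties = minutes / 90,
         xG_90 = round(xG / nineties, 2),
         xGA_90 = round(xGA / nineties, 2)) %>%
  select(player.name, xG_90, xGA_90)

# One row per player becomes one row per player per metric
long <- wide %>%
  pivot_longer(-player.name, names_to = "variable", values_to = "value")

print(wide)
print(long)
```

The wide table has two rows and the long one has four: each player now appears once for xG_90 and once for xGA_90, which is the shape ggplot wants for a stacked bar.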

 

It has used the player.name as a reference point and creates separate rows for every variable that’s left over. We then filter down to just the xG90 and xGA90 variables so now each player has a separate variable and value row for those two metrics. Now let’s plot it:

ggplot(chart, aes(x = reorder(player.name, value), y = value, fill = fct_rev(variable))) + #1
  geom_bar(stat = "identity", colour = "white") +
  labs(title = "Expected Goal Contribution", subtitle = "FAWSL, 2019-20",
       x = "", y = "Per 90", caption = "Minimum 600 minutes\nNPxG = Value of shots taken (no penalties)\nxG assisted = Value of shots assisted") +
  theme(axis.text.y = element_text(size = 14, color = "#333333", family = "Source Sans Pro"),
        axis.title = element_text(size = 14, color = "#333333", family = "Source Sans Pro"),
        axis.text.x = element_text(size = 14, color = "#333333", family = "Source Sans Pro"),
        axis.ticks = element_blank(),
        panel.background = element_rect(fill = "white", colour = "white"),
        plot.background = element_rect(fill = "white", colour = "white"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        plot.title = element_text(size = 24, color = "#333333", family = "Source Sans Pro", face = "bold"),
        plot.subtitle = element_text(size = 18, color = "#333333", family = "Source Sans Pro", face = "bold"),
        plot.caption = element_text(color = "#333333", family = "Source Sans Pro", size = 10),
        text = element_text(family = "Source Sans Pro"),
        legend.title = element_blank(),
        legend.text = element_text(size = 14, color = "#333333", family = "Source Sans Pro"),
        legend.position = "bottom") + #2
  scale_fill_manual(values = c("#3371AC", "#DC2228"), labels = c("xG Assisted", "NPxG")) + #3
  scale_y_continuous(expand = c(0, 0), limits = c(0, max(chart$value) + 0.3)) + #4
  coord_flip() + #5
  guides(fill = guide_legend(reverse = TRUE)) #6

  1. Two things are going on here that are different from your average bar chart. First is reorder(), which allows you to reorder a variable along either axis based on a second variable. In this instance we are putting the player names on the x axis and reordering them by value – i.e. the xG and xGA combined – meaning they are now in descending order from most to least combined xG+xGA. Second is that we’ve put the ‘variable’ on the bar fill. This allows us to put two separate metrics onto one bar chart and have them stack, as you will see below, by having them be separate fill colours.
  2. Everything within labs() and theme() is fairly self explanatory and is just what we have used internally. You can get rid of all this if you like and change it to suit your own design tastes.
  3. Here we are providing specific colour hex codes to the values (so xG = red and xGA = blue) and then labelling them so they are named correctly on the chart’s legend.
  4. The expand argument allows you to expand the boundaries of the x or y axis, but if you set the values to (0,0) it also removes all space between the axis and the inner chart itself (if you’re having a hard time envisioning that, try removing expand and see what it looks like). Then we are setting the limits of the y axis so the longest bar on the chart isn’t too close to the edge of the chart. ‘max(chart$value) + 0.3’ is saying ‘take the max value and add 0.3 to make that the upper limit of the y axis’.
  5. Flipping the x axis and y axis so we have a nice horizontal bar chart rather than a vertical one.
  6. Reversing the legend so that the order of it matches up with the order of xG and xGA on the chart itself.
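The ordering behaviour of reorder() from note 1 can be checked outside of ggplot. The sketch below uses an invented chart_toy data frame with three hypothetical players. By default reorder() ranks each name by the mean of its values, lowest first; with two rows per player that gives the same order as ranking by the sum, and after coord_flip() the highest bar ends up at the top.

```r
# Made-up long-format data: two rows (xG_90, xGA_90) per player
chart_toy <- data.frame(
  player.name = rep(c("Player A", "Player B", "Player C"), each = 2),
  variable = rep(c("xG_90", "xGA_90"), times = 3),
  value = c(0.45, 0.20, 0.30, 0.40, 0.10, 0.05)
)

# Default: order factor levels by the mean value per player, ascending
ordered_names <- reorder(chart_toy$player.name, chart_toy$value)
print(levels(ordered_names))
# "Player C" "Player A" "Player B"

# Being explicit about summing the two metrics gives the same order here
ordered_by_sum <- reorder(chart_toy$player.name, chart_toy$value, FUN = sum)
print(levels(ordered_by_sum))
```

If a chart ever has a different number of rows per player, passing FUN = sum explicitly is the safer way to order by the combined total.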

All in, that should look like this:

Heatmaps

Heatmaps are one of the everpresents in football data. They are fairly easy to make in R once you get your head round how to do so, but can be unintuitive without having it explained to you first. For this example we’re going to do a defensive heatmap, looking at how often teams make a % of their overall defensive actions in certain zones, then comparing that % vs league average:

library(tidyverse)

heatmap = events %>%
  mutate(location.x = ifelse(location.x > 120, 120, location.x),
         location.y = ifelse(location.y > 80, 80, location.y),
         location.x = ifelse(location.x < 0, 0, location.x),
         location.y = ifelse(location.y < 0, 0, location.y)) #1

heatmap$xbin <- cut(heatmap$location.x, breaks = seq(from = 0, to = 120, by = 20), include.lowest = TRUE)
heatmap$ybin <- cut(heatmap$location.y, breaks = seq(from = 0, to = 80, by = 20), include.lowest = TRUE) #2

heatmap = heatmap %>%
  filter(type.name == "Pressure" | duel.type.name == "Tackle" | type.name == "Foul Committed" |
           type.name == "Interception" | type.name == "Block") %>%
  group_by(team.name) %>%
  mutate(total_DA = n()) %>%
  group_by(team.name, xbin, ybin) %>%
  summarise(total_DA = max(total_DA),
            bin_DA = n(),
            bin_pct = bin_DA/total_DA,
            location.x = median(location.x),
            location.y = median(location.y)) %>%
  group_by(xbin, ybin) %>%
  mutate(league_ave = mean(bin_pct)) %>%
  group_by(team.name, xbin, ybin) %>%
  mutate(diff_vs_ave = bin_pct - league_ave) #3

  1. Some of the coordinates in our data sit outside the bounds of the pitch (you can see the layout of our pitch coordinates in our event spec, but it’s 0-120 along the x axis and 0-80 along the y axis). This will cause issues with a heatmap and give you dodgy looking zones outside the pitch. So what we’re doing here is using ifelse() to say ‘if a location.x/y coordinate is outside the bounds that we want, then replace it with one that’s within the boundaries. If it is not outside the bounds just leave it as is’.
  2. cut() literally cuts up the data how you ask it to. Here, we’re cutting along the x axis (from 0-120, again the length of our pitch according to our coordinates in the spec) and the y axis (0-80), and we’re cutting them ‘by’ the value we feed it, in this case 20. So we’re splitting it up into buckets of 20. This creates 6 buckets/zones along the x axis (120/20 = 6) and 4 along the y axis (80/20 = 4). This creates the buckets we need to plot our zones.
  3. This is using those buckets to create the zones. Let’s break it down bit-by-bit:
     - Filtering to only defensive events.
     - Grouping by team and getting how many defensive events they made in total (n() just counts every row that you ask it to, so here we’re counting every row for every team, i.e. counting every defensive event for each team).
     - Then we group again by team and the xbin/ybin to count how many defensive events a team has in a given bin/zone. That’s what ‘bin_DA = n()’ is doing. ‘total_DA = max(total_DA)’ is just grabbing the team totals we made earlier. ‘bin_pct = bin_DA/total_DA’ is dividing the two to see what percentage of a team’s overall defensive events were made in a given zone. The ‘location.x = median(location.x/y)’ is doing what it says on the tin and getting the median coordinate for each zone. This is used later in the plotting.
     - Then we regroup by bin only (dropping the team) and mutate to find the league average for each bin, followed by grouping by team/bin again and subtracting the league average in each bin from each team’s % in those bins to get the difference.
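To make the clamping and binning in notes 1 and 2 concrete, here is a base R sketch on a handful of made-up coordinates. The pmin()/pmax() clamp is an alternative way of writing the ifelse() calls above, not what the code above uses.

```r
# Hypothetical x coordinates, two of them just off the pitch
raw_x <- c(-2, 0, 5, 20, 20.1, 59.9, 120, 121.5)

# Clamp to the 0-120 range; this gives the same result as the ifelse() approach
clamped_x <- pmax(pmin(raw_x, 120), 0)
print(clamped_x)

# Cut into 20-unit buckets; include.lowest = TRUE keeps x = 0 in the first bucket
xbin <- cut(clamped_x, breaks = seq(from = 0, to = 120, by = 20), include.lowest = TRUE)

print(levels(xbin))
# "[0,20]" "(20,40]" "(40,60]" "(60,80]" "(80,100]" "(100,120]"

# Bucket edges belong to the lower bucket: 20 lands in [0,20], 20.1 in (20,40]
print(data.frame(x = clamped_x, bin = xbin))
```

Without include.lowest = TRUE, an event at exactly x = 0 would fall outside every bucket and come back as NA.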

Now onto the plotting. For this you will need the package ‘grid’. It ships with R itself, so there is nothing to install and you just need to load it in. You could use a package like ‘ggsoccer’ or ‘SBPitch’ for drawing the pitch, but for these purposes it’s helpful to try and show you how to create your own pitch, should you want to:

library(grid)

defensiveactivitycolors <- c("#dc2429", "#dc2329", "#df272d", "#df3238", "#e14348", "#e44d51", "#e35256", "#e76266", "#e9777b", "#ec8589", "#ec898d", "#ef9195", "#ef9ea1", "#f0a6a9", "#f2abae", "#f4b9bc", "#f8d1d2", "#f9e0e2", "#f7e1e3", "#f5e2e4", "#d4d5d8", "#d1d3d8", "#cdd2d6", "#c8cdd3", "#c0c7cd", "#b9c0c8", "#b5bcc3", "#909ba5", "#8f9aa5", "#818c98", "#798590", "#697785", "#526173", "#435367", "#3a4b60", "#2e4257", "#1d3048", "#11263e", "#11273e", "#0d233a", "#020c16") #1

ggplot(data = heatmap, aes(x = location.x, y = location.y, fill = diff_vs_ave, group = diff_vs_ave)) +
  geom_bin2d(binwidth = c(20, 20), position = "identity", alpha = 0.9) + #2
  annotate("rect", xmin = 0, xmax = 120, ymin = 0, ymax = 80, fill = NA, colour = "black", size = 0.6) +
  annotate("rect", xmin = 0, xmax = 60, ymin = 0, ymax = 80, fill = NA, colour = "black", size = 0.6) +
  annotate("rect", xmin = 18, xmax = 0, ymin = 18, ymax = 62, fill = NA, colour = "white", size = 0.6) +
  annotate("rect", xmin = 102, xmax = 120, ymin = 18, ymax = 62, fill = NA, colour = "white", size = 0.6) +
  annotate("rect", xmin = 0, xmax = 6, ymin = 30, ymax = 50, fill = NA, colour = "white", size = 0.6) +
  annotate("rect", xmin = 120, xmax = 114, ymin = 30, ymax = 50, fill = NA, colour = "white", size = 0.6) +
  annotate("rect", xmin = 120, xmax = 120.5, ymin = 36, ymax = 44, fill = NA, colour = "black", size = 0.6) +
  annotate("rect", xmin = 0, xmax = -0.5, ymin = 36, ymax = 44, fill = NA, colour = "black", size = 0.6) +
  annotate("segment", x = 60, xend = 60, y = -0.5, yend = 80.5, colour = "white", size = 0.6) +
  annotate("segment", x = 0, xend = 0, y = 0, yend = 80, colour = "black", size = 0.6) +
  annotate("segment", x = 120, xend = 120, y = 0, yend = 80, colour = "black", size = 0.6) +
  theme(rect = element_blank(), line = element_blank()) +
  annotate("point", x = 12, y = 40, colour = "white", size = 1.05) + # penalty spot at x = 12
  annotate("point", x = 108, y = 40, colour = "white", size = 1.05) + # penalty spot at x = 108
  annotate("path", colour = "white", size = 0.6, x = 60 + 10*cos(seq(0, 2*pi, length.out = 2000)),
           y = 40 + 10*sin(seq(0, 2*pi, length.out = 2000))) + # centre circle
  annotate("point", x = 60, y = 40, colour = "white", size = 1.05) + # centre spot
  annotate("path", x = 12 + 10*cos(seq(-0.3*pi, 0.3*pi, length.out = 30)), size = 0.6,
           y = 40 + 10*sin(seq(-0.3*pi, 0.3*pi, length.out = 30)), col = "white") +
  annotate("path", x = 108 - 10*cos(seq(-0.3*pi, 0.3*pi, length.out = 30)), size = 0.6,
           y = 40 - 10*sin(seq(-0.3*pi, 0.3*pi, length.out = 30)), col = "white") + #3
  theme(axis.text.x = element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.caption = element_text(size = 13, family = "Source Sans Pro", hjust = 0.5, vjust = 0.5),
        plot.subtitle = element_text(size = 18, family = "Source Sans Pro", hjust = 0.5),
        axis.text.y = element_blank(),
        legend.title = element_blank(),
        legend.text = element_text(size = 22, family = "Source Sans Pro"),
        legend.key.size = unit(1.5, "cm"),
        plot.title = element_text(margin = margin(r = 10, b = 10), face = "bold", size = 32.5, family = "Source Sans Pro", colour = "black", hjust = 0.5),
        legend.direction = "vertical",
        axis.ticks = element_blank(),
        plot.background = element_rect(fill = "white"),
        strip.text.x = element_text(size = 13, family = "Source Sans Pro")) + #4
  scale_y_reverse() + #5
  scale_fill_gradientn(colours = defensiveactivitycolors, trans = "reverse", labels = scales::percent_format(accuracy = 1), limits = c(0.02, -0.02)) + #6
  labs(title = "Where Do Teams Defend vs League Average?", subtitle = "FAWSL, 2019/20") + #7
  coord_fixed(ratio = 95/100) + #8
  annotation_custom(grob = linesGrob(arrow = arrow(type = "open", ends = "last", length = unit(2.55, "mm")), gp = gpar(col = "black", fill = NA, lwd = 2.2)), xmin = 25, xmax = 95, ymin = -83, ymax = -83) + #9
  facet_wrap(~team.name) + #10
  guides(fill = guide_legend(reverse = TRUE)) #11

  1. These are the colours we’ll be using for our heatmap later on.
  2. ‘geom_bin2d’ is what will create the heatmap itself. We’ve set the binwidths to 20 as that’s what we cut the pitch up into earlier along the x and y axis. Feeding ‘diff_vs_ave’ to ‘fill’ and ‘group’ in the ggplot() will allow us to colour the heatmaps by that variable.
  3. Everything up to here is what is drawing the pitch. There’s a lot going on here and, rather than have it explained to you, just delete a line from it and see what disappears from the plot. Then you’ll see which line is drawing the six-yard-box, which is drawing the goal etc.
  4. Again more themeing. You can change this to be whatever you like to fit your aesthetic preferences.
  5. Reversing the y axis so the pitch is the correct way round along that axis (0 is left in SBD coordinates, but starts out as right in ggplot).
  6. Here we’re setting the parameters for the fill colouring of heatmaps. First we’re feeding the ‘defensiveactivitycolors’ we set earlier into the ‘colours’ parameter. ‘trans = "reverse"’ is there to reverse the output so red = high. ‘labels = scales::percent_format(accuracy = 1)’ formats the text on the legend as a percentage rather than a raw number and ‘limits = c(0.02, -0.02)’ sets the limits of the chart to 2%/-2% (reversed because of the previous trans = reverse).
  7. Setting the title and subtitle of the chart.
  8. ‘coord_fixed()‘ allows us to set the aspect ratio of the chart to our liking. Means the chart doesn’t come out looking all stretched along one of the axes.
  9. This is what the grid package is used for. It’s drawing the arrow across the pitches to indicate direction of play. There are multiple ways you could accomplish this though, up to you how you do it.
  10. ‘facet_wrap()’ creates separate ‘facets’ for your chart according to the variable you give it. Without it, we’d just be plotting every team’s numbers all at once on one chart. With it, we get every team on their own individual pitch.
  11. Our previous trans = reverse also reverses the legend, so to get it back with the positive numbers pointing upwards we can re-reverse it.

Shot Maps

Another of the quintessential football visualisations, shot maps come in many shapes and sizes with an inconsistent overlap in design language between them. This version will attempt to give you the basics and let you get to grips with how to put one of these together, so that if you want to elaborate or make any of your own changes you can explore outwards from it. Be forewarned though – the options for what makes a good, readable shot map are surprisingly small when you get into visualising it!

shots = events %>%
  filter(type.name == "Shot" & (shot.type.name != "Penalty" | is.na(shot.type.name)) & player.name == "Bethany England") #1

shotmapxgcolors <- c("#192780", "#2a5d9f", "#40a7d0", "#87cdcf", "#e7f8e6", "#f4ef95", "#FDE960", "#FCDC5F", "#F5B94D", "#F0983E", "#ED8A37", "#E66424", "#D54F1B", "#DC2608", "#BF0000", "#7F0000", "#5F0000") #2

ggplot() +
  annotate("rect", xmin = 0, xmax = 120, ymin = 0, ymax = 80, fill = NA, colour = "black", size = 0.6) +
  annotate("rect", xmin = 0, xmax = 60, ymin = 0, ymax = 80, fill = NA, colour = "black", size = 0.6) +
  annotate("rect", xmin = 18, xmax = 0, ymin = 18, ymax = 62, fill = NA, colour = "black", size = 0.6) +
  annotate("rect", xmin = 102, xmax = 120, ymin = 18, ymax = 62, fill = NA, colour = "black", size = 0.6) +
  annotate("rect", xmin = 0, xmax = 6, ymin = 30, ymax = 50, fill = NA, colour = "black", size = 0.6) +
  annotate("rect", xmin = 120, xmax = 114, ymin = 30, ymax = 50, fill = NA, colour = "black", size = 0.6) +
  annotate("rect", xmin = 120, xmax = 120.5, ymin = 36, ymax = 44, fill = NA, colour = "black", size = 0.6) +
  annotate("rect", xmin = 0, xmax = -0.5, ymin = 36, ymax = 44, fill = NA, colour = "black", size = 0.6) +
  annotate("segment", x = 60, xend = 60, y = -0.5, yend = 80.5, colour = "black", size = 0.6) +
  annotate("segment", x = 0, xend = 0, y = 0, yend = 80, colour = "black", size = 0.6) +
  annotate("segment", x = 120, xend = 120, y = 0, yend = 80, colour = "black", size = 0.6) +
  theme(rect = element_blank(), line = element_blank()) +
  annotate("point", x = 108, y = 40, colour = "black", size = 1.05) + # penalty spot at x = 108
  annotate("path", colour = "black", size = 0.6, x = 60 + 10*cos(seq(0, 2*pi, length.out = 2000)),
           y = 40 + 10*sin(seq(0, 2*pi, length.out = 2000))) + # centre circle
  annotate("point", x = 60, y = 40, colour = "black", size = 1.05) + # centre spot
  annotate("path", x = 12 + 10*cos(seq(-0.3*pi, 0.3*pi, length.out = 30)), size = 0.6,
           y = 40 + 10*sin(seq(-0.3*pi, 0.3*pi, length.out = 30)), col = "black") +
  annotate("path", x = 107.84 - 10*cos(seq(-0.3*pi, 0.3*pi, length.out = 30)), size = 0.6,
           y = 40 - 10*sin(seq(-0.3*pi, 0.3*pi, length.out = 30)), col = "black") +
  geom_point(data = shots, aes(x = location.x, y = location.y, fill = shot.statsbomb_xg, shape = shot.body_part.name), size = 6, alpha = 0.8) + #3
  theme(axis.text.x = element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.caption = element_text(size = 13, family = "Source Sans Pro", hjust = 0.5, vjust = 0.5),
        plot.subtitle = element_text(size = 18, family = "Source Sans Pro", hjust = 0.5),
        axis.text.y = element_blank(),
        legend.position = "top",
        legend.title = element_text(size = 22, family = "Source Sans Pro"),
        legend.text = element_text(size = 20, family = "Source Sans Pro"),
        legend.margin = margin(c(20, 10, -85, 50)),
        legend.key.size = unit(1.5, "cm"),
        plot.title = element_text(margin = margin(r = 10, b = 10), face = "bold", size = 32.5, family = "Source Sans Pro", colour = "black", hjust = 0.5),
        legend.direction = "horizontal",
        axis.ticks = element_blank(),
        aspect.ratio = c(65/100),
        plot.background = element_rect(fill = "white"),
        strip.text.x = element_text(size = 13, family = "Source Sans Pro")) +
  labs(title = "Beth England, Shot Map", subtitle = "FAWSL, 2019/20") + #4
  scale_fill_gradientn(colours = shotmapxgcolors, limits = c(0, 0.8), oob = scales::squish, name = "Expected Goals Value") + #5
  scale_shape_manual(values = c("Head" = 21, "Right Foot" = 23, "Left Foot" = 24), name = "") + #6
  guides(fill = guide_colourbar(title.position = "top"), shape = guide_legend(override.aes = list(size = 7, fill = "black"))) + #7
  coord_flip(xlim = c(85, 125)) #8

  1. Simple filtering, leaving out penalties. Choose any player you like of course.
  2. Much like the defensive activity colours earlier, these will set the colours for our xG values.
  3. Here’s where the actual plotting of shots comes in, via geom_point. We’re using the xG values as the fill and the body part for the shape of the points. This could reasonably be anything though. You could even add in colour parameters which would change the colour of the outline of the shape.
  4. Again titling. This can be done dynamically so that it changes according to the player/season etc but we will leave that for now. Feel free to explore for yourself though.
  5. Same as last time but worth pointing out that ‘name’ allows you to change the title of a legend from within the gradient setting. oob=scales::squish takes any values that are outside the bounds of our limits and squishes them to the nearest limit.
  6. Setting the shapes for each body part name. The shape numbers correspond to ggplot’s pre-set shapes, which are listed in the ggplot2 documentation. The shapes numbered 21 and up are the ones which have inner colouring (controlled by fill) and outline colouring (controlled by colour) so that’s why those have been chosen here.
  7. guides() allows you to alter the legends for shape, fill and so on. Here we are changing the title position for the fill so that it is positioned above the legend, as well as changing the size and colour of the shape symbols on that legend.
  8. coord_flip() does what it says on the tin – switches the x and y axes. xlim allows us to set boundaries for the x axis so that we can show only a certain part of the pitch, giving us:
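The effect of oob = scales::squish on the colour scale can be seen by calling squish() directly. The xG values below are made up, and the 0 to 0.8 range simply mirrors the limits used in the plot.

```r
library(scales)

# Hypothetical xG values, one of them above the 0.8 upper limit of the colour scale
xg_values <- c(0.05, 0.50, 0.95)

# Values outside the range are pulled back to the nearest limit
squished <- squish(xg_values, range = c(0, 0.8))
print(squished)
# 0.05 0.50 0.80
```

Without this, ggplot’s default is to turn out-of-range values into NA, so a very high xG shot would be drawn in the NA colour instead of the darkest red.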

 


That’s all for now. Hopefully this wasn’t all too confusing and you picked up some bits and bobs you can take away to play with yourselves. Don’t worry if some of this is overwhelming or you have to do copious amounts of googling to overcome odd specific errors and whatnot. That’s just part and parcel of coding (seriously, get used to googling for errors, everyone has to).

Much love. Be well and have great days.

Taking ‘Em On: Digging Deeper With Dribbles

To define our terms upfront: a successful dribble in this context is when a player with possession of the ball takes it past an opponent. This is the definition that Opta uses, as well as sites like Squawka and Whoscored that present Opta’s data.

‘Dribbles’/’Take-Ons’/whatever your football website of choice calls them are an odd stat in isolation. We can probably assume that a player who completes a lot of dribbles is of a certain stylistic mould. Other than that though, there isn’t a lot to be learnt from those raw numbers about where these dribbles take place, where they go and what the players do afterwards.

To illustrate the point let’s compare two players: Manchester City’s Leroy Sané and Huddersfield’s Rajiv van La Parra. Both are wingers who play predominantly on the left and complete a bit over 3 dribbles per 90 in similar minutes (Sané at a 63.5% completion rate overall and La Parra at 53.5%). Their base stats are very similar, however if we map out where those dribbles start/end and the actions they follow them up with we can see a difference.

La Parra is often starting from deeper areas – he has attempted near the most dribbles starting in his own half of any player in the top 5 leagues – and ends up going inwards surprisingly frequently (most of his completed passes are received by their strikers or other attackers). His post-dribble work, especially in more traditional winger areas, isn’t great. However he does win his share of fouls and generally advances his team up the pitch. Sané, meanwhile, is operating in the opposition’s third and a whole hell of a lot in that cutback area his manager Guardiola loves. He’s already in such a dangerous area to begin with that the simple act of just beating his man is hugely concerning for the opposition to deal with. The rest is just the icing on the cake.

Obviously there is complexity wrapped around all this. Huddersfield are a world away from Man City, especially in terms of wider squad quality. Different players are needed to bring different qualities to different situations. La Parra needs to beat his man in order to help his team’s progression. Sané, who on average receives the ball already in the final third, needs to beat his man in order to help his team break down deep blocks. Po-tay-to po-tah-to.

Below are the top players in terms of their dribble and post-dribble numbers entering and within the opposition box (all stats per 90 for the 2017/18 season. This dataset is missing a couple of Ligue 1 matches). There are clear standouts here: Messi is eye-watering (he already has more post-dribble box passes/shots in 17/18 than in the entirety of 16/17. At 30-years-old!), the Premier League names you’d expect are all there, Leon Bailey is having a lovely season for himself, so on and so forth.

 

Player | Dribbles Ending In Opposition Box | Post-Dribble Passes Ending In Opposition Box | Post-Dribble Shots In Opposition Box | Post-Dribble Box Passes + Shots
Lionel Messi | 1.88 | 0.76 | 0.58 | 1.34
Eden Hazard | 1.00 | 0.39 | 0.50 | 0.89
Leon Bailey | 0.61 | 0.54 | 0.34 | 0.87
Kingsley Coman | 1.20 | 0.56 | 0.24 | 0.80
Kylian Mbappe | 1.21 | 0.42 | 0.24 | 0.67
Raheem Sterling | 0.75 | 0.37 | 0.28 | 0.66
Christian Pulisic | 0.51 | 0.45 | 0.17 | 0.62
Philippe Coutinho | 0.61 | 0.27 | 0.34 | 0.61
Wilfried Zaha | 1.16 | 0.30 | 0.30 | 0.60
Riyad Mahrez | 0.86 | 0.14 | 0.46 | 0.59
Gonçalo Guedes | 1.15 | 0.22 | 0.36 | 0.58
Ángel Correa | 0.86 | 0.19 | 0.37 | 0.56
Iago Aspas | 0.68 | 0.32 | 0.23 | 0.55
Johan Mojica | 0.34 | 0.54 | 0.00 | 0.54
Leroy Sané | 1.16 | 0.29 | 0.23 | 0.52
Florian Thauvin | 0.98 | 0.13 | 0.38 | 0.51
Neymar | 0.83 | 0.33 | 0.17 | 0.50
Mohamed Salah | 0.97 | 0.09 | 0.40 | 0.49
Ruben Loftus-Cheek | 0.74 | 0.27 | 0.20 | 0.47
Dennis Praet | 0.11 | 0.33 | 0.11 | 0.44

 

We can zoom out further, to take a look at involvement in possessions that go on to reach the opposition’s final 18 yards (both via a dribble or a post-dribble pass), along with a player’s own individual entries to those areas. The added value from the dribbling of someone like a Hazard, a Boufal or whomever is obvious here. The final ball is the eye-catcher but offering a means of progression is important too.

 

Player | Unique Possessions Ending in Opposition Final 18 Yards Involved In (Via a Dribble That Starts Outside Final 18 Yards) | Individual Entries to Final 18 Yards (Via Dribble or Post-Dribble Pass) | Individual Entries to Final Third (Via Dribble or Post-Dribble Pass) | Average Vertical Dribble Distance On Those Possessions (Metres)
Eden Hazard | 2.39 | 1.11 | 0.78 | 7.71
Neymar | 2.17 | 0.89 | 0.72 | 6.73
Kingsley Coman | 2.16 | 1.12 | 0.48 | 10.48
Lionel Messi | 1.97 | 0.98 | 0.67 | 6.82
Diego Perotti | 1.90 | 0.59 | 1.03 | 5.86
Sofiane Boufal | 1.78 | 0.77 | 0.39 | 8.40
Isco | 1.59 | 0.30 | 0.53 | 8.50
Douglas Costa | 1.45 | 1.16 | 0.29 | 11.06
Luka Modric | 1.42 | 0.37 | 0.31 | 5.47
Jonathan Viera | 1.39 | 0.22 | 0.61 | 5.86
Fede Cartabia | 1.38 | 0.58 | 0.51 | 6.51
Jack Wilshere | 1.38 | 0.20 | 0.69 | 7.73
Rémy Cabella | 1.38 | 0.72 | 0.22 | 6.59
Tanguy NDombele Alvaro | 1.31 | 0.36 | 0.36 | 6.10
Ruben Loftus-Cheek | 1.28 | 0.95 | 0.20 | 8.11
Malcom | 1.28 | 0.09 | 0.46 | 7.62
Andros Townsend | 1.25 | 0.83 | 0.29 | 8.10
Christian Pulisic | 1.25 | 0.91 | 0.45 | 14.47
Alex Oxlade-Chamberlain | 1.23 | 1.23 | 0.48 | 15.96
Florian Thauvin | 1.23 | 0.68 | 0.30 | 6.63
Mario Lemina | 1.20 | 0.28 | 0.28 | 8.84
Valentin Rosier | 1.18 | 0.45 | 0.11 | 9.73
Gonçalo Guedes | 1.15 | 0.86 | 0.58 | 19.56
Naby Keita | 1.14 | 0.43 | 0.43 | 7.34
Manuel Lanzini | 1.11 | 0.37 | 0.50 | 10.25
Paul Pogba | 1.11 | 0.13 | 0.39 | 5.34

 

The focus shouldn’t be on just pure attackers though. The list below shows involvement in possessions that end in the opposition’s final third, filtered to players whose median dribble location is outside the final third. Sort by percentage of these dribbles that come through the centre (within the width of the penalty boxes, minimum 30 possessions involved in) and this is where some real atypical profiles come up. E.g: Mousa Dembélé. A main point of consternation for Tottenham right now is what the team will look like without Dembélé. He’s an attacking midfielder turned central midfielder with the ability to move with the ball like an AM in congested areas while also holding up as a defensive presence. His dribbles don’t always directly lead to the final third, but they help the team eventually get there (and this has even been a slight down season by his standards).

 

Player | Unique Possessions Ending in Final Third Involved In (via a Dribble) | % of Dribbles That Occur Centrally
Giannelli Imbula | 1.78 | 87.2%
Tanguy NDombele Alvaro | 2.32 | 77.6%
Mousa Dembélé | 2.34 | 76.1%
Naby Keita | 2.36 | 74.4%
Mario Lemina | 2.41 | 72.7%
Jonathan Viera | 2.41 | 66.3%
Nabil Fekir | 2.14 | 58.0%
Houssem Aouar | 1.82 | 56.3%
Manuel Lanzini | 2.35 | 55.6%
Éver Banega | 2.19 | 55.4%
Radja Nainggolan | 1.61 | 54.5%
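
For anyone wanting to try a filter like this themselves, here is a rough sketch in base R. It is not the code behind the list above. The column names copy the style of StatsBomb's R output (player.name, location.x, location.y), the possession_id column and the rows are made up, and the pitch geometry (120 x 80 units, final third from x = 80, penalty box between y = 18 and y = 62) is an assumption you should check against whatever data you are using. It also assumes the data has already been cut down to dribbles in possessions that ended in the final third.

```r
# Made-up dribble events. Assumed to be pre-filtered to possessions
# that ended in the opposition's final third.
dribbles <- data.frame(
  player.name   = c("A", "A", "A", "A", "B", "B", "B", "C", "C", "C"),
  possession_id = 1:10,   # hypothetical unique possession identifier
  location.x    = c(60, 70, 75, 85, 95, 100, 105, 50, 65, 72),
  location.y    = c(40, 30, 10, 45, 40, 20, 60, 5, 70, 40),
  stringsAsFactors = FALSE
)

# Assumed pitch geometry: 120 x 80 units, attacking left to right.
final_third_x <- 80          # x value where the final third begins
box_y         <- c(18, 62)   # width of the penalty box
min_involved  <- 3           # the article uses 30; lowered for the toy data

# Flag dribbles that start within the width of the penalty boxes.
dribbles$central <- dribbles$location.y >= box_y[1] &
  dribbles$location.y <= box_y[2]

# One summary row per player.
by_player <- split(dribbles, dribbles$player.name)
summary_df <- do.call(rbind, lapply(by_player, function(d) {
  data.frame(
    player.name = d$player.name[1],
    involved    = length(unique(d$possession_id)),
    median_x    = median(d$location.x),
    pct_central = 100 * mean(d$central),
    stringsAsFactors = FALSE
  )
}))

# Keep players who typically dribble outside the final third and who
# clear the minimum, then sort by the share of central dribbles.
keep   <- summary_df$median_x < final_third_x &
  summary_df$involved >= min_involved
result <- summary_df[keep, ]
result <- result[order(-result$pct_central), ]
print(result)
```

Player B drops out here because their median dribble starts inside the final third, which is the same idea as filtering out the pure attackers. Using the median rather than the mean stops a handful of very deep or very advanced dribbles from dragging a player across the line.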

 

It’s a difficult skill set to replace. Do you try to re-train a younger AM like Dembélé himself was? 20-year-old Amine Harit (who just misses out on this list) at Schalke could fit the bill, with similar dribbling tendencies. Or do you go for someone who is doing a similar job at CM elsewhere? Southampton’s Mario Lemina is one of those, and a growing favourite candidate amongst supporters. Outside of the PL there’s Tanguy Ndombele who is doing a stellar job in a messily structured Lyon team. He had 10(!) dribbles against Rennes at the weekend.

The list of prospective replacements could go on but the point is that, while there are players who rack up more eye-popping dribbling numbers than Dembélé, the type of dribble he’s executing combined with his efficacy is maddeningly rare.

Of course none of this is to knock the dribbles stat entirely, or to say it’s without use. StatsBomb’s own radars use it in what seems like the ideal way: presenting it alongside the rest of a player’s numbers so as to give an at-a-glance impression of their general ability and style. If two players have high Key Pass numbers but one completes a lot of dribbles and the other doesn’t, that tells you a fair amount right there.

__________________

That’s all for right now though. Maybe we’ll expand on this and have more fun with it in the future. Thank you for reading. You can find me on Twitter @EuanDewar.

Bournemouth: South Coasting

In May 2015 Bournemouth clinched top spot in the Championship and achieved promotion to England’s top division for the first time in the club’s history. When that happened there was much fanfare (deserved, of course), yet a few matches into the 2015/16 season the novelty appeared to wear off a little and they haven’t received a great deal of coverage since. This despite having achieved safety from relegation twice in a row. It’s a shame really because a deeper look at them reveals some interesting tidbits and lessons to learn.

Their first season in the Premier League was shaky in the extreme. They finished 16th and five points above the relegation spots, yet had a worse goal difference (-22) than the two teams below them (Sunderland and the relegated Newcastle). Their non-pen expected goal difference however was -11.4.

This disparity stems from a defence with a tendency to collapse. Despite having a non-pen xGA of 48.4 they ended up actually conceding 63 goals. Parts of this may well be variance, but we have to also consider that goals were possibly easier to score against them than a team with a more defensive focus. They conceded 3 or more goals in 11 matches. Of those 11 only 5 were against teams who finished in the PL’s top 6, so this wasn’t a case of just getting bullied by the big boys.

Cut to the end of the 2016/17 season and they finished 9th with an improved goal difference of -12. So they sorted out that defence then? Well, no. In fact, it got even worse. They conceded 60 non-penalty goals and again conceded 3 or more goals in 11 matches. But this time with an expected goals against of 57.2. They gave up more shots (11.6 in 15/16 vs 14.5 in 16/17) and those shots were also closer to goal on average (17.7 metres in 16/17 vs 18.3 metres in 15/16).

There’s also little to suggest they’re getting defenders in front of shots and fooling expected goals models in a Burnley-esque style. Stratagem data for 2016/17 shows that they had 2 to 4 players in front of the shot 72% of the time and 5+ players in front of it 13.7% of the time, both numbers being bang in line with the league average.

The improvement they did make, such as it is, lay in the attack. In 2015/16 they posted an iffy non-pen expected goals for (xGF) tally of 36.9. In 2016/17 that moved forward to 43.0 xGF, a number more befitting a midtable finish.

 

Despite this improvement, however, open play is not where Bournemouth’s bread is buttered. They scored 7 penalties in 16/17, joint-first in the PL alongside teams you’d perhaps ‘expect’ to be there: Tottenham, Liverpool and Man City. It could have been even more as their 10 penalties won overall was the best in the league.

This is an interesting quirk to Howe’s Bournemouth. It appeared absent in 15/16 when they only won 4 penalties, but if you look back to their time in the Championship it becomes clear it is a point of emphasis. In 2014/15 when they achieved promotion and the Championship title they won a staggering 16 penalties, a full 7(!) more than the team who won the 2nd most.

This could well be a result (intended or otherwise) of Bournemouth’s playing style. Per Stratagem data, they were 8th in the league in key entries into the box via a run this last season:

[Chart: key entries into the box via a run, by team, Premier League 2016/17]

The seven teams that were ahead of them are the league’s actual top seven, so in this regard Bournemouth are the best of the rest. The evidence points towards Howe telling his players to put their heads down and run when near the opposition box, and to some degree it’s working. Something he should be given praise for as most teams are dying for anything similar that separates them from the morass. Whether it’s a sustainable edge is another question entirely. One that depends on how thoroughly their opponents are scouting them and whether their coaching staff can drill it into them to just not foul.

Transfer business needs to be addressed because it’s been a bit of a bumpy ride in that department. There has been a degree of success. Benik Afobe, Josh King and Nathan Aké on loan were all agreeable moves. Problem is these bright spots have been in the margins of a wider, more confusing transfer picture.

Jordon Ibe – a player who, with the best will in the world, didn’t even flash much talent at Liverpool – was brought in for £15m. Even if he did turn out to be the absolute bee’s knees Bournemouth were rumoured to have very generously offered Liverpool a buy-back clause, all but dooming the deal to be an overpaid loan at best and a complete waste at worst (Ibe played a shade over 1000 minutes in 2016/17, registering no goals or assists).

This week, as you’ve no doubt heard, they picked up Jermain Defoe on a three-year deal from newly relegated Sunderland. There’s plenty of reason to believe that Defoe isn’t all he’s cracked up to be but let’s put that to one side for a moment and assume for the sake of argument that the conventional wisdom (‘he gets you goals’) is correct on him. He’s 34 years old, turning 35 in October, yet has been signed up on a three-year deal.

Josh King alone scored 16 for Bournemouth in 16/17 as a primary option, not to mention Afobe and Callum Wilson’s contributions. Defoe had a whole team built around him last season and notched 15 goals, 5 of those being penalties (Is that it? Did they bring him in to take all these penalties they’re winning? He certainly didn’t win any penalties himself last year). There’s no case for him as a creation option either: Last season Defoe produced fewer key passes (20) in 3323 minutes than Afobe produced in 1454 minutes (23).

(While we’re here: Bournemouth were dead last in terms of regaining the ball past their opposition’s 18-yard-line via a turnover this season, and by a large margin too. Having a 35-year-old up front doing the pressuring will not help with generating those sorts of opportunities.)

Then you have all these players hanging around the squad seemingly without much purpose like Lys Mousset or Max Gradel. Howe has been loyal to his starting group of players, but left others apparently marginalised. Not to mention the strange Jack Wilshere experiment. It’s very hard to get a grasp on what the overall plan is, if one exists.

That’s Bournemouth in a nutshell really. Their supporters should probably feel relatively at ease as they head into a third year in the PL. The team’s overall profile is likely complete enough to expect safety for the near future, barring a freak season. An accomplishment that is fairly ahead of schedule for a club their size. Yet with a defence as dodgy as theirs that margin of error is always going to be a bit slighter than you’d like. However, Aké and Begovic are decent defensive signings that could be the bedrock for good things to come and may well address this.

At the moment they scan as a side devoid of a direction outside of just keeping on keeping on. Cut out the head-scratcher signings, tighten up the defence and we could be looking at a Premier League mainstay. As it stands though, there’s plenty to work on.

(Parts of this article were written with the aid of StrataData, which is property of Stratagem Technologies. StrataData powers the StrataBet Sports Trading Platform, in addition to StrataBet Premium Recommendations.)

Creeping Forward: Improving Shot Location

The idea behind shot quality in football is really a fairly intuitive one. A shot from the halfway line isn’t as good an idea as a shot from inside the six-yard box. There’s more nuance to it but you don’t need any sort of deep analytical education to grasp it. Hell, it’s right there in why goals like this one from Memphis are so memorable: because we don’t expect them to happen.

 

[GIF: the Memphis goal]

Yet, in spite of this, shot quality and the improvement of it feels like a bit of an uphill battle. Especially when it comes to coaching the idea into younger players who have a particularly frustrating problem with their shot selection (Hakim Ziyech, bij voorbeeld). However, it would appear that, one way or another, teams are starting to pay real attention to their shot selection. I noticed this while compiling some shot numbers and tweeted about it, prompting the ever lovely Colin Trainor to produce this nice summary:

 

 

Whether you talk about it in terms of shot distance or shot zones, teams across Europe’s top five leagues are cutting the fat off of their shots. This article is going to focus on the Premier League specifically, mainly because there are just so many things to digest across Europe that this could go on forever, so a cutoff point has to be set somewhere. If you want details on what’s going on elsewhere give me a bell on Twitter and if there’s enough curiosity there might be a follow-up.

 

Season | Total Shots | Outside Box Shots | % of Shots Outside Box | Average Shot Distance (metres)
12/13 | 10562 | 4626 | 43.80% | 18.96
13/14 | 10238 | 4599 | 44.92% | 19.15
14/15 | 9881 | 4221 | 42.72% | 18.72
15/16 | 9781 | 4046 | 41.37% | 18.50
16/17 | 9734 | 3971 | 40.80% | 18.37*

 

(*distance numbers for 16/17 are a few matches out of date, but you get the gist)
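
If you want to check the percentages or poke at the numbers yourself, the table is small enough to type straight into R. This is just a sketch using the totals from the table above. The only thing added is the inside-the-box count, which is simply total shots minus outside-the-box shots.

```r
# League-wide shot totals copied from the table above.
shots <- data.frame(
  season  = c("12/13", "13/14", "14/15", "15/16", "16/17"),
  total   = c(10562, 10238, 9881, 9781, 9734),
  outside = c(4626, 4599, 4221, 4046, 3971),
  stringsAsFactors = FALSE
)

# Derived columns: shots inside the box and the outside-the-box share.
shots$inside      <- shots$total - shots$outside
shots$pct_outside <- round(100 * shots$outside / shots$total, 2)

# Season-on-season change in each count (NA for the first season).
shots$d_outside <- c(NA, diff(shots$outside))
shots$d_inside  <- c(NA, diff(shots$inside))

print(shots)
```

The inside-the-box column comes out at 5936, 5639, 5660, 5735 and 5763, so those shots have crept up slightly since 13/14 while the outside-the-box count has fallen every season.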

The first thing that sticks out is the relationship between shots taken outside the box and the total shots numbers. Bits are getting shaved off the outside numbers with each passing season, yet those shots aren’t really being replaced with anything. However, this isn’t really ending up as a loss in end product because of the increased focus on better shots. Everything is floating around in similar totals, and the goals aren’t going away, that’s for sure.

(If you’re wondering about the slight increase in distance in 13/14, that season was very, very odd in an attacking sense. There were 184 goals scored from outside the box that season, 22 more goals than the next highest total over the last five seasons. Most of those were Luis Suarez scoring against Norwich. Or at least that’s what it felt like.)

Which teams then are embracing this change and leading the charge in these numbers?

Arsenal put up the lowest % from outside the box in the recently finished 2016/17 season with an exceedingly low 33.03%. This makes sense for a couple of reasons. Firstly it fits with the image of them as the English Barcelona, building their attack around getting high value shots (by the by, Barca’s % of shots outside the box in 16/17 was 31.9%). You may also remember that in late 2014 they bought StatsDNA, an analytics company. Now, obviously it’s hard to tell from the outside how much sway they have, but Wenger has mentioned things like expected goals in the past so it seems quite likely that the sharp dropoff between the 14/15 and 15/16 seasons is at least partially down to StatsDNA being in the discussion and Wenger being open to what they have to say.

In that 15/16 season they absolutely crushed it on the attacking end. It was the ne plus ultra of ‘they always try to walk it in’. Their average shot distance that season was the lowest of any team over the last 5 seasons. This saw their xG per shot jump from 0.105 in 2014/15 to 0.125 which, again, was the highest of any team over that timeframe. This season they’ve become more dysfunctional in attack but that’s a whole other story entirely.

 

Arsenal
Season | xG per shot | Average shot distance (metres)
12/13 | 0.1056 | 18.45
13/14 | 0.1114 | 17.93
14/15 | 0.1056 | 17.80
15/16 | 0.1253 | 16.08
16/17 | 0.1035 | 17.25
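
Both columns in a table like this are just averages over a team's shots. As a rough sketch of how you might build them from shot-level data in R: the column names below follow StatsBomb's R output (team.name, location.x, location.y, shot.statsbomb_xg), the rows are invented rather than real Arsenal shots, and the geometry (a 120 x 80 pitch measured in yards with the centre of the goal at x = 120, y = 40) is an assumption. The numbers in this article may well have been calculated differently.

```r
# Invented shots in StatsBomb-style columns; not real data.
shots <- data.frame(
  team.name         = c("X", "X", "X", "Y", "Y"),
  location.x        = c(108, 114, 96, 90, 102),
  location.y        = c(40, 36, 50, 40, 30),
  shot.statsbomb_xg = c(0.12, 0.35, 0.04, 0.02, 0.07),
  stringsAsFactors = FALSE
)

# Assumption: coordinates are in yards and the goal centre is (120, 40).
yards_to_metres <- 0.9144
shots$distance_m <- sqrt((120 - shots$location.x)^2 +
                           (40 - shots$location.y)^2) * yards_to_metres

# Average xG and average distance per team.
per_team <- aggregate(cbind(shot.statsbomb_xg, distance_m) ~ team.name,
                      data = shots, FUN = mean)
names(per_team) <- c("team.name", "xg_per_shot", "avg_distance_m")
print(per_team)
```

The toy data shows the relationship you would expect: team X shoots from closer in and gets more xG out of each shot than team Y.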

 

Their North London neighbours Tottenham are another interesting case. Plenty has been said about how Mauricio Pochettino seems to emphasise long range shots as a part of his gameplan, and sure enough his Tottenham sides have a similarly high % from outside the box as his Southampton one. Yet even though they had the highest % overall in the 16/17 season he has still actually brought the number down from where it was before he took over. It appears that AVB was even more content for his players to take pot shots than Pochettino is. Bless his soul.

 

Tottenham
Season | Average shot distance (metres)
12/13 | 20.75
13/14 | 19.98
14/15 | 20.22
15/16 | 19.81
16/17 | 19.76

 

Another big (and perhaps unexpected) contributor to the overall league dropoff is your friend and mine Sam Allardyce. West Ham under Allardyce from 2012 to 2015 were always posting low %s, and then as soon as he leaves and Slaven Bilić takes over those numbers shoot up. Sure enough in his lone season at Palace they had a similarly low average. His time at Sunderland is the outlier, but it seems none of the many managers they’ve gone through have been able to greatly change their numbers. Much was made from early on in Allardyce’s career about how he embraced stats and let them shape how he worked. Bilić meanwhile seems to prefer the volume over quality approach.

 

West Ham
Manager | Season | Average shot distance (metres)
Allardyce | 12/13 | 16.60
Allardyce | 13/14 | 17.33
Allardyce | 14/15 | 17.25
Bilić | 15/16 | 18.29
Bilić | 16/17 | 18.52

 

Funnily enough there’s another manager who has this effect: the Right Honourable Tony Pulis.

 

West Brom
Period | Season | Average shot distance (metres)
Pre-Pulis | 12/13 | 19.27
Pre-Pulis | 13/14 | 18.85
Pre-Pulis | 14/15 | 19.25
Under Pulis | 15/16 | 18.13
Under Pulis | 16/17 | 17.68

 

Allardyce and Pulis doing this shows that it’s the idea of shot location that matters, not how you achieve it. They aren’t bringing down their teams’ average shot distances with intricate play and sly throughballs like an Arsenal or a Man City are. They’re adapting the idea to the strengths of their players, utilising more headers and the like. An equally valid way of reaching the same end result.

And that’s the point of all this: teams are getting the message on shot locations and starting to remove some of the more pointless shots from their attacking diet. Will long shots ever go away? No, nor should they. Everyone loves a thunderbastard goal from outside the box. The aim here isn’t to turn every team into a Poundland version of Barcelona. It’s just to make them a little bit smarter and to maximise what they get out of their attack.