Attacking Contributions: Markov Models for Football

Messi or Ronaldo? Kroos or Modric? Mbappe or Neymar? Every football fan loves to argue over who they think is a better player. Depending on where your loyalties lie, arguments can range from simple statistics; like the number of goals they’ve scored or the trophies they have won, to advanced metrics like expected goal values from ghosting. To the layman football fan, the former argument is almost certainly more digestible. But for the rest of us, we often want a metric that’s more objective, more extendable, and more rigorous, while still being able to understand it and explain it to your counterpart to assert your football dominance.

The evolution of football analytics – how we got to non-shot expected goals models.

Every football analytics nerd understands the (slow) evolution of football statistics. The story begins with football’s notorious and frustratingly difficult objective of scoring goals, the historic hindrance for American spectating. Analyzing goals scored and goals conceded appeased few and people quickly realized the value of shot volume for depicting a team’s performance and ability. The obvious pitfall in comparing shot volume was the quality of a shot can vary drastically. This led to everyone under the sun defining their own expected goals (xG) model to objectify chance quality and aggregate goal likelihood as a better metric for attacking production. xG is now omnipresent in football analytics as a tool for attackers’ and teams’ performance. Most recently in the sports analytics community, people have extended the concept of expected goals to allocate ball progression contributions throughout a team’s possession of the ball.

Commonly referred to as “non-shot expected goals (NSxG) models”, these models are effective tools to quantify passes and carries into dangerous areas of the pitch, assigning value to actions other than shots and allowing for the comparison of attacking contribution of ALL players. Fivethirtyeight even uses a non-shot xG model as a component in their soccer projections.

The original research – before “non-shot expected goals” became a thing – was by Sarah Rudd, presented at NESSIS in 2011. Rudd used markov models to assign individuals offensive production values defined as the change in the probability of a possession ending in a goal from  the previous state of possession to the current state of possession.

For example, imagine a player standing 30 yards from the goal line, close to the sideline. They are in a non-threatening position and that possession will rarely result in a goal. Let’s say it has a 1% chance of resulting in a goal. Now, that player gets a cross off, the defender clears it out of bounds for a corner kick. Corner kicks resulted in goals approximately 4% of the time. This play would attribute a +3% change in NSxG for the player who crossed the ball.

As data becomes increasingly utilized and accessible, the variants of NSxG models grow just as xG models did. Mark Taylor further explains NSxG models here. Nils Mackay defines “xG added” to grade passing skill and extends it to allocate value for carries and structures as a possession based model. Similarly (and most recently), Karun Singh published his version of xG added, introducing xG threat, explaining it with beautiful interactive visualizations.

All of this publicly facing research has been pivotal in advancing the applications and effectiveness of sports analytics. Today, I am going to walk through a tutorial on StatsBomb’s first iteration of a Ball Progression Model. I like to refer to NSxG as “contribution”, simply because it’s easier to say and not everything in football analytics needs an “x” in it.

Markov Model – Framework and Methodology.

Adopting the framework set forth by Rudd, we construct a possession based markov model we call our “Ball Progression Model”. We define attacking possessions to have two possible outcomes, a Goal or a Turnover. In a markov model, these two outcomes are known as the “absorption states”. The most crucial condition of an absorption state is that the probability of transitioning out of the state is 0 and the probability of remaining in the state is 1, given that it is the end of a possession this condition holds and the data must be structured as such (this condition makes it more difficult to consider shots or xG bins as potential absorption states). Leading up to the absorption state, a possession can transition between any number of “transient states”. We define transient states based on the context of the state and the geographical location of the possession at a current state. Extending the states defined by Rudd and applying to StatsBomb data, we define the following context-based transient states:

  • Attacking Third Free Kick
  • Central Third Free Kick
  • Defending Third Free Kick
  • Attacking Third Throw In
  • Central Third Throw In
  • Defending Third Throw In
  • Corner Kick
  • Penalty Won

We then define the following geographic zones as transient states:

 

Since a state can depend greatly on defensive pressure, we define the geographic zones each when they are absent of pressure and when they are under pressure. This leaves us with 76 geographic zones (38 with pressure, 38 without pressure) and 8 contextual zones for a total of 84 transient states.

Transient states can transition between other transient states and ultimately an absorption state based on some observed transition probability. The transition probability is dependent only on the current state of the possession and is independent of previous states. This is known as the markov property and is a key assumption in markov models (in the discussion we consider this a limitation and propose extensions to this property). For instance, if you have the ball in zone 21, the probability that you pass the ball to zone 28 is the same regardless of the fact that the ball came from zone 14 as opposed to any other zone. This is known as the “memoryless” property.

Quick notation – n is the number of transient states (in our case 84), r is the number of absorbing states (in our case 2). Q is the matrix of transition probabilities, Q is n x n. R is the matrix of absorption probabilities, R is n x r. N is known as the fundamental matrix and it is calculated as the inverse to the n x n identity matrix, I,  minus the transition matrix Q, formally N = (I – Q)-1 .

Calculations – for each transient state, we can calculate the expected number of plays (progressing actions: passes, carries, and shots) until absorption as the row sums of the fundamental matrix.  Then, the probability of reaching either absorption state for the current transient state is equal to N x R. For more on the theory behind markov models, please see here.  Special thanks to Ron Yurko for the code.

Results

We prepare the data (this is the most time consuming portion) and run our ball progression model for Europe’s big five leagues, England Championship and England League One for the 2017/2018 and 2018/2019 (through 2/18/19) seasons.

For each transient state, we calculate the probability of a goal in absorption as well as the expected number of plays until absorption. The three most likely states to result in a goal are (refer to geographic zones above): 36 w/ pressure (Pr(Goal) = 19.2%), 31 w/ pressure (Pr(Goal) = 9%), and 36 w/o pressure (Pr(Goal) = 8.3%). The three most likely zones to result in a turnover are: 1 w/ pressure (Pr(Turnover) = 99.5%), 3 w/ pressure (Pr(Turnover) = 99.5%), and 2 w/ pressure (Pr(Turnover) = 99.5%). We present a possession that resulted in a goal below, with the contribution value for each action.

We then calculate our “contribution” metric as the change in the probability of a goal from the current state to the next state. Formally,

contribution = Pr(Goal|Statet+1) – Pr(Goal|Statet) for each transient state at time t.

We can also calculate total attacking contributions for each individual, i, as the sum of all of their attacking contributions,

contributioni = ∑Pr(Goal|Statet+1) – Pr(Goal|Statet)i  ·I(action by player i)

We then scale their total contribution by the number of matches played to get a player’s “contribution per game”. We choose to leave the contribution per game metric raw, not standardizing by league strength. This is to simply to see the crude output from the model, giving every player a fair chance to shine regardless of where they play. The top five contributors for each position (attackers, midfielders, defenders and goalkeepers) are presented in the tables below:

We also formulate a hypothetical “Ultimate Team” for the top contributors for each position of a standard 4-4-2 against a 4-3-3. Again, we purposely make the naive assumption that contributions between different leagues are equal. We also, in order to show you some names you might know, purposely didn’t stress that the ultimate teams are extremely broad and unrealistic when it comes to positional categorizations. The two squads we formulate highlight plenty of young stars to remember during the next transfer window.

Discussion

Our ball progression model has clearly identified the top players across Europe, and offers some justification for the money needed to acquire them. We have clearly designed a model that is easily interpretable even by the less-technical analytics sides. And ultimately, the model works without much computational power. Markov models are good at handling sequences of arbitrary length (as possessions in soccer can be anywhere from one event to 100s of events), and they allow for the attribution of final outcome contributions further along in the sequence. Nonetheless, there exist several limitations to a simple markov model.

  • First, markov models’ assume the “memoryless” property when in reality a soccer possession is not memoryless. The probability of scoring when you are in a current state can depend on previous passes and carries leading up to the current state.
    • A further extension of our ball progression model, that would appease this limitation, is higher order markov models. In higher order markov models, instead of assuming the markov property of independence, you assume that transition probabilities are conditionally independent based on the value of the current state and the value of the previous, 2nd previous, nth previous state, where the number of previous states you consider is the nth order of the markov model.
  • Another limitation is that this simple markov model does not consider the action required to transition between states. For instance,  the probability of a possession resulting in a goal may be different given that you passed into a zone vs. dribbled into a zone.
    • This limitation can be appeased with markov decision processes, in which you consider the action at each state and time step, some examples of markov decision processes in other sports can be found here, here, and here.
  • Lastly, and perhaps the most obvious limitation of this markov model is the categorized structure of transient and absorption states. This causes the loss of information and limits applications especially in the free-flowing game of football.  
    • There exists some methods for continuous stochastic processes, but their use in the public sphere is limited and the concepts are far more difficult to understand.

This leads us to StatsBomb’s latest endeavor. Based on the limitations outlined above, we recognized the need for a model that accounts for the continuous nature of football, the retention of information from previous states, and the actions chosen by decision makers. Our next model will improve on the limitations noted above as well as layer on additional components essential to a football team’s success such as the timing of goals and the style of play under different game states. This will be the primary model we use for holistic ball progression in player and team stats, and a white paper detailing the model will be made available to current StatsBomb customers in March.

Shots Under Pressure Part 2: Headers

We first introduced our exploratory data analysis of shot pressure last week, that article can be found here. Now we’re going to examine specifically at headers. The goal is to see if there are any striking differences in the raw pressure metrics, and also look at the players topping each list and which passers are the best at pinpointing “wide open” shooters.

As the tables show, 72% of shots from a player’s head are under pressure versus 62% of all shots are under pressure. The average distance from goal for headers is 9.6 yards, roughly half of the average distance for all shots which is 18.8. That means the pressure radius is on average only, 2.3 yards. Thus, the defensive pressure on these headers is typically even closer to the shooter. Let’s take a closer look at where the pressure is coming from.

Looking at the pressure directions for headers, we see nearly identical results as what we saw with all shots in part 1 of this exploration into shot pressure.

Which players in the EPL take the most shots under pressure?

Players who take the highest percentage of their shots under pressure tend to be the big beefy boys who come up and crowd the box on set pieces. Then there is Steve Mounie and Ashley Barnes, attackers for notably conservative teams Huddersfield and Burnley. To some degree this is likely to be a function of the fact that the vast majority of many defenders shots come from set plays while most attackers will have some mix of set piece and open play headers on their statistical resume. In the cases of the attackers on the list disentangling their set play attempts from open play attempts would be a worthwhile next step.

 

Which players in the EPL take the fewest shots under pressure?

Notably, players that take a low percentage of their headers under pressure are a much more traditional set of attackers.

Which passers find the “open” player most often?

What we want to look at now is which players are finding teammates in positions to take open shots. It can be assumed at this point, that shots free from pressure are prefered to shots under pressure. Using the “pass_shot_assist” variable in the StatsBomb data we first look at the difference in proportion of shots under pressure given that they were taken after a key pass or not.

Now we can argue this is confounded by team strengths, teammates who make great runs and other factors. We could also argue that headers taken under pressure are not necessarily bad shots. But, given the clear, increased likelihood of scoring in the absence of pressure, it is definitely not a bad thing to be able to target open shooters. The players in the list above definitely pass the eye-test and in coming articles of this series on shot pressure, we will dive even deeper into those passers, their own shooting tendencies and identifying open players/better decisions through freeze frames.

Header image courtesy of the Press Association

Closing Down: How Defensive Pressure Impacts Shots

One of the things that makes StatsBomb data unique is that we have defensive positioning data for every shot. The challenge is rigorously translating the freeze frames that capture that positioning into usable information about how defensive pressure impacts the players taking the shots.

Defining the Problem

To start, we need to define what specifically pressure means. Obviously we’re looking at defenders within a certain radius of a shooter. Our first challenge is defining the distance around the shooter that we should examine. Adjusting the Pressure Radius (PR) for arbitrary distance categories would unjustly increase or decrease the radius for shots just inside or just outside a given category. Therefore, we create a piecewise linear function for all shots. This is defined as:

–          PR = DFG*0.15 + 0.85 for DFG <= 24 (where DFG = distance from goal)

–          PR = 4.5 for DFG > 24.

–          In other words, it is the straight line from the point (PR, DFG) of (1, 1) to (4.5, 24).

The PR is relatively arbitrary, but intuitively defined. It is designed to differentiate pressure at different positions on the pitch and with further research can be easily modified to best model defensive pressure. In simplest terms, the closest to the goal you are, the closer defenders must be to pressure you and influence your shot and the further away from the goal you are defenders may be slightly further away in order to impact a shot.

Here is an example of what that radius looks like:

 

Pressures from each side

In addition to dealing with distance, we also need to separate out the pressure defenders apply by direction. The idea here is to more specifically understand both how pressure effects change goal scoring probabilities and also how they interact with a shooter’s own tendencies. If we take the circle around the shooter with radius = PR as defined above, then we can define pressure from four different sides as:

–          Pressure Left = AngleToShooter > 315 and AngleToShooter < 45.

–          Pressure Behind = AngleToShooter >= 45 and AngleToShooter <= 135.

–          Pressure Right = AngleToShooter > 135 and AngleToShooter < 225.

–          Pressure Front = AngleToShooter >= 225 and AngleToShooter <= 315.

We will later see how the pressure from each of these sides influences a player’s decision of shooting foot and the observed goal rates. For example, here is a shot that had pressure from the left of the shooter.

 

 

Overall Shot and Goal Rates Under Pressure

After testing the pressure metrics both visually and numerically we apply the pressure values to all shots in our data set, and then view some summary statistics. Please note that all results below only reflect shots in the run of play.

In the tables above we can see that roughly 65% of all shots, under this definition are “under pressure”. We can also see that a significantly higher proportion of goals are scored in the absence of pressure than under pressure. The only angle that does not have a higher proportion of goals scored is the pressure behind the ball which would make sense, since the pressure behind the shooter is not disrupting the view to goal.

The Cool Stuff We Found

Once we define “under pressure” values for all of our shots, we can analyze a multitude of different aspects of the game and the player. First, we will look at whether the direction of pressure has any impact on the player’s choice of foot to shoot with. Furthermore, we look at whether the combination of the player’s dominant foot, shooting foot and direction of pressure has a significant relationship with the proportion of shots that become goals. We view the results in the tables below.

 

 

When the pressure from the left is greater than the pressure from the right, the proportion of shots taken with the player’s right foot is about 75%. When the pressure from the right is greater than the pressure from the left, the proportion of shots taken with the player’s right foot is still greater than 50% at approximately 50.2%. An obvious explanation for this is the vast majority of right footed players continuing to want to shoot with their right foot regardless of the pressure. Interestingly enough, we see in Table 3 that the highest proportion of goals are scored when a right footed player gets off a shot with his right foot and there is less pressure on the right side.

The next thing we wanted to look at with shots under pressure were the raw numbers on a player level. Which players players are trigger happy and will pull up regardless of who’s around? Which players are more selective with their shots? Or, which players happen to find themselves in better places to get shots off without defenders breathing all over them. Using 2017-18 Premier League numbers, the 10 players who take the greatest proportion of shots under pressure are summarized in table 5 and the 10 players who take the lowest proportion of shots under pressure are in table 6.

This pressure data contains tons of interesting information about what exactly is happening and players put the ball on net. There are certainly further factors that can be worked in too. This is only the beginning of the kind of information that we can glean from how players are pressured as they take shots.