Expected Goals, Tracking Data & Data Accuracy: An Investigation

“Tracking data will lead to lush fields of new insights and knowledge” is a common perception. The jump leads for public analytics, they say. Yet oil-covered mechanics extract their heads from under the bonnet… “You can get it started, but… there are problems ahead.”


I am convinced that once tracking data becomes (more) available in football, we’ll be in for 2-5 years of absolute chaos before we understand how to use it. There will be wailing and gnashing of teeth and absolute dogshit visualisations everywhere.

Marek Kwiatkowski, New Kind of Analytics Inc.

Tracking data won’t affect football decision making for two decades, if ever.

Paul Riley, Brand Excel

The advice of wise veterans is oft ignored by the adrenaline plugged ears of the masses – their logic, an unwanted dampening whilst we career into the ecstasy of the unexplored. A more careful entry would no doubt make for an easier ride to our destinations. But alas.
So… what challenges are ahead of us? 
Many… but at the very basic level… first up is…


Event data is coded by humans and with this comes error. Big thumbs, tired brains and even unknowingly racist eyes. Tracking data is algorithmically derived from multiple cameras, yet has some inherent error.


As you enter the unexplored, you will quickly notice a mismatch between the datasets.. excitedly you continue.. then you notice another.. and another.. then with a sinking feeling you realise this is a big issue.


Now you have two choices. One, ignore the mismatch, don’t tell anyone and hope it’s never mentioned again. Two, investigate further.


Let’s choose option two…


… For each event compare the distance between the player’s x,y positioning from the event data and the tracking data… for 202,719 events from 77 games. The difference in distance shows the dataset matching error.
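As a rough sketch of that comparison, assuming the two datasets have already been synced by timestamp and share one pitch coordinate system in centimetres (the tuple layout, the example coordinates and the name `difference_distance` are mine, for illustration only):

```python
import math


def difference_distance(event_xy, tracking_xy):
    """Euclidean distance between the event-data and tracking-data positions."""
    ex, ey = event_xy
    tx, ty = tracking_xy
    return math.hypot(tx - ex, ty - ey)


# Hypothetical matched pairs: (event position, tracking position), both in cm.
pairs = [
    ((5000.0, 3400.0), (5030.0, 3440.0)),
    ((1200.0, 600.0), (1200.0, 600.0)),
]

distances = [difference_distance(e, t) for e, t in pairs]
mean_distance = sum(distances) / len(distances)
```

The same list of distances feeds both the overall mean and the distribution plot.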


The mean difference distance between the datasets was 470 cm. Alarming but let’s look closer at the distribution of the differences.


Distribution of Difference Distance


A high number of events are very accurately matched, but there is a long tail of events with a significant difference distance.
Let’s see whether the difference varies with the ‘type’ of event.


Defensive Actions: 120cm
Goalkeeper Actions: 151cm
Shots: 216cm
Possession Based: 502cm


Some evident differences in the means, let’s have a look at the distributions.


The distributions clearly show that the largest differences occur when matching possession-based actions such as passes and receives. This is probably down to the volume of fast-paced actions and a lower level of scrutiny of accuracy (compared to shots). Worryingly, a number of shots were mismatched by over 500cm.


Let’s see if there are any directional patterns in the differences, using the event data locations as the origin for all 202,719 events.
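One way to check for direction, sketched under the assumption that positions are plain (x, y) tuples in a shared coordinate system: treat the event location as the origin, take the offset to the tracking location, and count how many errors fall into each compass-style sector. The function names and the choice of eight sectors are illustrative, not the method behind the plots.

```python
import math
from collections import Counter


def error_bearing_degrees(event_xy, tracking_xy):
    """Angle (0-360 degrees) of the tracking position as seen from the event position."""
    dx = tracking_xy[0] - event_xy[0]
    dy = tracking_xy[1] - event_xy[1]
    return math.degrees(math.atan2(dy, dx)) % 360.0


def sector_counts(pairs, n_sectors=8):
    """Count errors per angular sector; exact matches have no direction and are skipped."""
    width = 360.0 / n_sectors
    counts = Counter()
    for event_xy, tracking_xy in pairs:
        if event_xy == tracking_xy:
            continue
        bearing = error_bearing_degrees(event_xy, tracking_xy)
        counts[int(bearing // width) % n_sectors] += 1
    return counts
```

If the error had a systematic direction (say, a constant offset between the two coordinate systems), one or two sectors would dominate the counts.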


Visually, no significant directional patterns can be identified. The error is scary. In an xG model, a 100cm change in a shot’s position could produce a large change in its xG value.

Let’s see if there are any patterns in the direction of error in the subset of shots.


Once again there are no visually identified patterns in the direction of error. There appears to be a generalised error. My gut reaction would be this could have a large impact on Expected Goals (xG) values.

Impacts on xG 

Let’s calculate xG values for all 3,751 shots based on their event data and tracking data x,y positions. Ben Torvaney’s xG model is used (there are more accurate models but this is the one I have access to).  Comparing the difference in xG between the two locations will give us an initial glimpse at the size of the problem.

The maximum shift in xG values was -0.3770, which is a significant change in value and a real issue as a single figure….

A quick peek at the distribution…

Look at that green middle-finger to my hypothesis!

So the easy-defence team wins? … “yeah but with enough data the errors will drain away via insignificant runoff!” … let’s see…

The mean difference in xG was -0.0081 whilst the median was 0.0004. So it’s true: when aggregating over a season, the potential error in xG values becomes insignificant.

However.. what about at a player level?

Let’s get rid of all players that haven’t taken 10 shots or more and then rank the players by biggest mean xG difference. Here’s the top 10.
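A minimal sketch of that filter-and-rank step, assuming each shot is a `(player, xg_event, xg_tracking)` tuple. Two choices here are my assumptions rather than anything stated above: the difference is taken as tracking minus event, and “biggest” is read as largest absolute mean.

```python
from collections import defaultdict
from statistics import mean


def rank_players(shots, min_shots=10, top_n=10):
    """Return (player, mean xG difference, shot count), biggest absolute mean first."""
    diffs_by_player = defaultdict(list)
    for player, xg_event, xg_tracking in shots:
        # Assumed sign convention: tracking-based xG minus event-based xG.
        diffs_by_player[player].append(xg_tracking - xg_event)

    ranked = [
        (player, mean(diffs), len(diffs))
        for player, diffs in diffs_by_player.items()
        if len(diffs) >= min_shots  # drop players with fewer than min_shots shots
    ]
    ranked.sort(key=lambda row: abs(row[1]), reverse=True)
    return ranked[:top_n]
```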

Once again the xG difference is almost insignificant and would not significantly impact the use of season long xG values when filtering and comparing players to recruit.

So… what about at per match basis?

We could use the mean distance difference for all shots of 216cm and do some simulations. A more reflective methodology would be to probabilistically generate distance differences based on their frequencies within the dataset.

The probability distribution shows clear patterns:

So… Let’s simulate each game 1,880 times using the following steps for each shot within that game:

  1. Generate a ‘distance error’ based on the above probability distribution.
  2. Choose a random spot that is the same distance away from the original event as the ‘distance error’, yet still on the pitch!
  3. Calculate the new xG value for the simulated x,y position.
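The three steps can be sketched as below. Everything specific is an assumption for illustration: the 105m x 68m pitch in centimetres, resampling observed shot errors as a stand-in for the probability distribution, and above all `toy_xg`, which is a made-up distance-only curve and not Ben Torvaney’s model.

```python
import math
import random

PITCH_LENGTH_CM = 10500.0  # assumed 105 m pitch
PITCH_WIDTH_CM = 6800.0    # assumed 68 m pitch


def toy_xg(x, y):
    """Made-up stand-in xG that decays with distance to goal. NOT the article's model."""
    dist_m = math.hypot(PITCH_LENGTH_CM - x, PITCH_WIDTH_CM / 2.0 - y) / 100.0
    return 1.0 / (1.0 + math.exp(0.15 * dist_m - 0.5))


def on_pitch(x, y):
    return 0.0 <= x <= PITCH_LENGTH_CM and 0.0 <= y <= PITCH_WIDTH_CM


def perturb_shot(x, y, observed_errors_cm, rng, max_tries=1000):
    # Step 1: draw a distance error by resampling the observed errors.
    distance = rng.choice(observed_errors_cm)
    # Step 2: try random directions until the new spot is on the pitch.
    for _ in range(max_tries):
        angle = rng.uniform(0.0, 2.0 * math.pi)
        new_x = x + distance * math.cos(angle)
        new_y = y + distance * math.sin(angle)
        if on_pitch(new_x, new_y):
            return new_x, new_y
    return x, y  # fallback: keep the original spot if no valid one was found


def simulate_match_xg(shots, observed_errors_cm, n_sims, xg_model=toy_xg, seed=0):
    """Return one summed xG total per simulation for a team's list of (x, y) shots."""
    rng = random.Random(seed)
    totals = []
    for _ in range(n_sims):
        total = 0.0
        for x, y in shots:
            new_x, new_y = perturb_shot(x, y, observed_errors_cm, rng)
            total += xg_model(new_x, new_y)  # Step 3: xG at the simulated position
        totals.append(total)
    return totals
```

Retrying the direction (rather than clipping to the touchline) keeps the simulated spot exactly the drawn distance from the original event.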

For each match simulation the original and simulated xG values can be summed and compared to better understand the impact on single match xG analysis.

Let’s study the probability of various swings in xG per team. We are more interested in the extent of the swing than in its direction, so all negative values have been converted to their additive inverse (in other words, we use absolute values).

It’s positive to see that there is a 19.8% chance that there will be a swing of less than 0.1 xG per team per match. However, it’s concerning to see that there is a 5% chance that there will be a swing of more than 1 xG per team per match!

The potential of large swings makes you think that many single match xG results are wrong! What % of simulated games showed a reversal of xG match result? 6.7%!
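Both numbers fall out of the simulated totals with a little counting. The sketch below assumes you hold the original summed xG for each team plus a list of simulated totals per team; the function names are hypothetical, and a “reversal” is taken to mean the sign of the xG margin flipping.

```python
def swing_probability(original_xg, simulated_totals, threshold, above=True):
    """Share of simulations whose absolute xG swing is above (or below) a threshold."""
    swings = [abs(total - original_xg) for total in simulated_totals]
    if above:
        hits = sum(1 for swing in swings if swing > threshold)
    else:
        hits = sum(1 for swing in swings if swing < threshold)
    return hits / len(swings)


def reversal_rate(original_home, original_away, sim_home, sim_away):
    """Share of simulations where the xG 'winner' differs from the original one."""
    original_margin = original_home - original_away
    flips = sum(
        1
        for home, away in zip(sim_home, sim_away)
        if (home - away) * original_margin < 0  # margin changed sign
    )
    return flips / len(sim_home)
```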

Final Thoughts..

Although coaches utilise long-term trends, they have a large thirst for the data of singular matches. Therefore we have to ask ourselves: are we comfortable with the dataset’s inherent error when presenting results at this level of granularity? If we are not, how do we present this margin of error/uncertainty to coaches?

Our audience may be skeptical and probabilities can be intuitively misread, so it would be easy to hide our error/uncertainty in a cupboard. However, I believe this only provides skeptics with ammunition to shoot us in the head if that error/uncertainty ever trips us up. Others will strongly disagree. There are good examples emerging within the media.

I would love to see more data visualisations within our small community that develop our competencies to discuss error and uncertainty with our audiences. Showing some weakness just might strengthen our relationships and influence with our audience.

** Disclaimer – TRACAB and Opta data are not used in this article **



  1. Very interesting findings. Is there not a case to use tracking data alone as the ‘ground truth’? Given its inherent error can be quantified quite easily (I’m guessing it’s mostly instrumentation error in the cameras) and, as you say, Opta’s event data error is more variable, could we not ignore Opta entirely and derive all events from the tracking data? I suppose we could still use the Opta event timestamps to confirm what occurred and ignore the x/y coordinates.

    • It’s true that the inherent error could be more precisely quantified and therefore accounted for. I believe that companies are working on auto-tagging events based on tracking data; this should reduce the errors but also make the data much cheaper to produce… and therefore maybe to buy?

      I have ‘resynced’ the x/y coordinates based on the timestamp before… it’s a solution, but this relies on the timestamp being right 😉

    • “…Opta’s event data error is more variable, could we not ignore Opta entirely and derive all events from the tracking data?…”

      I must have missed this declaration in the article. Where does it mention that Opta’s data was even used, and so how can we conclude the above statement?

  2. This is really good. I quite often see people noting the model’s inaccuracies but input data not usually mentioned. Also some of those locations being off by > 20m! Are those extremes from the human or tracking (or both) side?

  3. Hi,
    Great study, could you do a Bland and Altman chart to see if there is any pattern between measures?
