How and Why

The story behind the analysis — what motivated it, how it was built, and what it demonstrates.

Why I Built This

I have run several marathons. My first did not go particularly well — I went off too fast, paid for it from about 30km onwards, and finished considerably slower than I should have. It is a very common experience. What interested me afterwards was not the physical side but the decision-making: almost everyone who blows up in a marathon does so in exactly the same way, at roughly the same point, for the same reason.

I wanted to know whether that was actually true in the data, and if so, by how much. The London Marathon publishes split times for hundreds of thousands of runners going back over a decade. That felt like a dataset worth taking seriously.

What This Project Does

The analysis covers 390,000 finishers across eleven London Marathon editions between 2014 and 2025. It looks at how runners pace their races, specifically the relationship between how fast they start and how much they slow down in the second half.
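To make the second-half slowdown concrete, here is a minimal sketch that treats fade as the second-half time minus the first-half time. That definition, the `H:MM:SS` input format, and the function names are assumptions for illustration, not taken from the project's code.

```python
def parse_hms(text: str) -> int:
    """Convert an 'H:MM:SS' string to a whole number of seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def second_half_fade(half_split: str, finish: str) -> int:
    """Seconds lost in the second half relative to the first (assumed definition).

    Positive values mean the runner slowed down; negative values mean a
    negative split.
    """
    first_half = parse_hms(half_split)
    second_half = parse_hms(finish) - first_half
    return second_half - first_half


# Made-up example: 1:45:00 at halfway, 3:45:00 at the finish.
fade_seconds = second_half_fade("1:45:00", "3:45:00")
print(f"Fade: {fade_seconds / 60:.1f} minutes")
```

An evenly paced race gives a fade of zero under this definition, which makes it a convenient baseline for comparing runners.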

The headline finding is that starting 10% faster than runners at your ability level is associated with an additional 11 minutes of second-half fade. The project also covers year-on-year trends, gender differences in pacing discipline, age effects within ability groups, and a breakdown of which countries and runner profiles produce the most even pacing. A race strategy calculator applies the regression model to a user's own target time.
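The arithmetic behind a calculator of this kind can be sketched as follows. Only the slope (11 minutes per 10% faster start) comes from the text. The linear extrapolation, the pace-based definition of "faster", and both function names are assumptions, and the project's actual regression may differ.

```python
# Slope quoted in the text: a 10% faster start goes with 11 extra minutes of fade.
FADE_MINUTES_PER_10_PCT = 11.0


def pct_faster_than_peers(own_pace_s_per_km: float, peer_pace_s_per_km: float) -> float:
    """Percentage by which an opening pace beats the peer-group pace.

    Assumed definition: relative reduction in seconds per kilometre.
    """
    return (peer_pace_s_per_km - own_pace_s_per_km) / peer_pace_s_per_km * 100.0


def expected_extra_fade_minutes(pct_faster: float) -> float:
    """Scale the quoted figure linearly (an assumption, not the fitted model)."""
    return FADE_MINUTES_PER_10_PCT * pct_faster / 10.0


# Made-up example: peers open at 5:30/km (330 s), this runner at 4:57/km (297 s).
pct = pct_faster_than_peers(own_pace_s_per_km=297.0, peer_pace_s_per_km=330.0)
extra = expected_extra_fade_minutes(pct)
print(f"{pct:.1f}% faster start, about {extra:.1f} extra minutes of fade")
```

Because the finding is an association, the output is best read as what similar runners typically experienced rather than a prediction for any individual.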

How I Approached It

The project runs as a Python pipeline with four discrete stages: data cleaning, feature engineering, statistical analysis, and site output. Each stage is reproducible and logged. The raw data required substantial cleaning: handling missing splits, normalising inconsistent nationality codes, and parsing names across 45,000 distinct first names.
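A four-stage pipeline of this shape might look like the sketch below. The stage names, record fields, and toy data are hypothetical, and it uses only the standard library to stay self-contained, whereas the project itself lists pandas and SciPy.

```python
import json
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("pipeline")


def clean(rows):
    """Drop rows with missing splits and normalise nationality codes."""
    kept = [
        {**row, "nationality": row["nationality"].strip().upper()}
        for row in rows
        if row.get("half_s") is not None and row.get("finish_s") is not None
    ]
    log.info("clean: kept %d of %d rows", len(kept), len(rows))
    return kept


def engineer_features(rows):
    """Add fade: second-half seconds minus first-half seconds."""
    out = [{**row, "fade_s": row["finish_s"] - 2 * row["half_s"]} for row in rows]
    log.info("features: added fade_s to %d rows", len(out))
    return out


def analyse(rows):
    """Summarise the field; a real analysis stage would fit models here."""
    summary = {
        "runners": len(rows),
        "mean_fade_s": sum(row["fade_s"] for row in rows) / len(rows),
    }
    log.info("analyse: %s", summary)
    return summary


def build_site_output(summary):
    """Serialise results as JSON that a static page could embed."""
    return json.dumps(summary, sort_keys=True)


def run_pipeline(raw_rows):
    """Run the stages in order, passing each result to the next."""
    data = raw_rows
    for stage in (clean, engineer_features, analyse, build_site_output):
        data = stage(data)
    return data


# Made-up records: one is missing its halfway split and gets dropped.
raw = [
    {"half_s": 6300, "finish_s": 13500, "nationality": " gbr"},
    {"half_s": None, "finish_s": 14000, "nationality": "FRA"},
    {"half_s": 7200, "finish_s": 14700, "nationality": "usa "},
]
payload = run_pipeline(raw)
print(payload)
```

Keeping each stage a plain function of its input means any stage can be rerun or tested on its own, which is one way to get the reproducibility described above.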

The frontend is plain HTML, CSS, and JavaScript with Chart.js — no frameworks, no build step, no dependencies to break. That was a deliberate choice: the analysis should be what stands up, not the tooling around it. The charts are interactive but the underlying numbers are embedded directly in the page, so everything works without a server.

What It Demonstrates

The strongest correlation in the dataset — r = 0.94 — is between pacing consistency and finish time. That is not surprising in isolation. What is useful is being able to show it precisely, control for ability group, and isolate it from confounding factors like field composition and race conditions.
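One simple way to check a correlation while holding ability roughly fixed is to compute Pearson's r separately within each ability band. The sketch below does that with made-up numbers. The band labels, the choice of fade minutes as the consistency measure, and the function names are all assumptions, since the text does not say how the project defines them.

```python
from collections import defaultdict
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def r_by_group(rows):
    """Compute r within each group for rows of (group, x, y)."""
    groups = defaultdict(list)
    for group, x, y in rows:
        groups[group].append((x, y))
    return {
        group: pearson_r([x for x, _ in points], [y for _, y in points])
        for group, points in groups.items()
    }


# Illustrative data only: (ability band, fade in minutes, finish time in minutes).
rows = [
    ("sub-3:30", 2.0, 195.0),
    ("sub-3:30", 6.0, 204.0),
    ("sub-3:30", 10.0, 209.0),
    ("3:30-4:30", 5.0, 225.0),
    ("3:30-4:30", 12.0, 248.0),
    ("3:30-4:30", 20.0, 262.0),
]
for band, r in r_by_group(rows).items():
    print(f"{band}: r = {r:.2f}")
```

If the relationship holds within each band as well as across the whole field, it is less likely to be an artefact of faster and slower runners being mixed together.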

More broadly, the project is an exercise in turning a large, messy dataset into something that communicates clearly. The technical work is only useful if the conclusions are easy to understand. That balance — rigorous analysis presented simply — is what I was trying to get right.

390,000 runners analysed
11 years (2014–2025)
10 charts (interactive findings)
r = 0.94 (pacing consistency vs finish time)

Built with: Python, pandas, NumPy, matplotlib, SciPy, HTML / CSS / JS, Chart.js