Data Science Analysis of Running Form

6 minute read

Analyzing the Correlation Between Running Stride Length and Average Pace Per Mile

from datascience import *
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
import numpy as np

Importing the Data

The data generating process was as follows: I went on my run. Once I finished my run, I saved it. I uploaded the run to my iPhone via bluetooth. Once the running data uploaded, I exported a CSV file of the data to my computer.

Here is a link to the google sheets version of the data. For reference, the watch I used to collect this data is the COROS Pace 2.

As a side note, this statistical analysis only takes into account one run. In the future I plan to run a more comprehensive analysis because it is essential to have a variety of data to truly prove a statistical relationship. However, this is a good jumping-off point.

The following running data is from an 18 mile long run. I started at 6:39 mile pace and closed in 5:27 for the last mile. As it was progressive, I chose to analyze the linear relationship between stride length as running pace increases. For the actual analysis, I used the column that contains minutes per kilometer rather than minutes per mile. We are also, for the purpose of this project, holding constant external factors such as elevation gain per mile, heart rate, and prior muscle fatigue.

progressive_longrun_data = Table.read_table("ProgressiveLongRun.csv")
progressive_cleaned = progressive_longrun_data.take(np.arange(0,18,1))
progressive_cleaned.show()

Split	Time	Moving Time	Distance	Elevation Gain	Elev Loss	Avg Pace	Avg Moving Pace	Best Pace	Avg Run Cadence	Max Run Cadence	Avg Stride Length	Avg HR	Max HR	Avg Temperature	Calories
1	0:06:39	0:06:39	1.61	0	31	0:04:08	0:04:05	0:03:39	82	84	148	128	147	24	48
2	0:06:33	0:06:33	1.61	6	30	0:04:04	0:04:03	0:03:46	82	83	150	141	145	22	63
3	0:06:33	0:06:33	1.61	27	21	0:04:04	0:04:08	0:03:47	83	86	149	149	154	20	59
4	0:06:31	0:06:31	1.61	16	31	0:04:03	0:04:04	0:03:54	82	85	150	151	153	21	70
5	0:06:33	0:06:33	1.61	56	15	0:04:04	0:04:06	0:03:14	84	89	147	148	156	21	58
6	0:06:31	0:06:31	1.61	5	31	0:04:03	0:04:02	0:03:52	82	85	150	155	165	23	72
7	0:06:29	0:06:29	1.61	49	0	0:04:02	0:04:01	0:03:51	84	87	148	167	172	23	70
8	0:06:31	0:06:27	1.61	34	4	0:04:00	0:04:01	0:03:47	84	86	148	165	173	24	68
9	0:06:27	0:06:27	1.61	7	29	0:04:00	0:04:01	0:03:48	83	85	151	159	167	25	65
10	0:06:21	0:06:21	1.61	0	34	0:03:57	0:03:57	0:03:52	83	84	154	162	170	25	78
11	0:06:15	0:06:15	1.61	0	35	0:03:53	0:03:53	0:03:38	83	85	155	165	169	25	69
12	0:06:12	0:06:12	1.61	34	24	0:03:51	0:03:53	0:03:37	84	88	154	165	170	24	68
13	0:06:05	0:06:05	1.61	15	32	0:03:47	0:03:49	0:03:28	84	89	157	159	167	24	66
14	0:05:55	0:05:55	1.61	46	17	0:03:41	0:03:41	0:02:46	86	90	159	164	176	23	67
15	0:05:51	0:05:51	1.61	17	30	0:03:38	0:03:39	0:03:23	84	86	163	168	173	24	71
16	0:05:46	0:05:46	1.61	48	1	0:03:35	0:03:38	0:03:29	87	89	162	171	177	23	72
17	0:05:39	0:05:39	1.61	38	5	0:03:31	0:03:32	0:03:24	86	89	165	172	179	23	61
18	0:05:27	0:05:27	1.61	9	22	0:03:23	0:03:21	0:02:53	87	91	169	173	180	24	74

Initial Visualization of the Data

We want to look at an initial visualization of the data so that we can see if there are any obvious patterns. Upon initial inspection, there looks to be good possibility for a linear relationship between average running pace and average stride length.

progressive_cleaned.scatter("Avg Pace", "Avg Stride Length")
plots.xlabel("Average Running Pace (Minutes per Kilometer)")
plots.ylabel("Average Stride Length (Centimeters)")

Text(0, 0.5, 'Average Stride Length (Centimeters)')

png

In the initial visualization of the data, it becomes apparent that the minutes and second display on the x axis isn’t very elegant. Consequently, I spent the following lines of code converting the average pace per kilometer from a minutes and seconds version to purely seconds form.

import datetime

string_version = [str(i) for i in progressive_cleaned.column("Avg Pace")]
#modifying the running time in minutes per kilometer from elements in a numpy array to individual strings

updated_list = []
#creating an empty list that we can append the datetime version of the strings to
for i in string_version:
    date_time_version = datetime.datetime.strptime(i, "%H:%M:%S")
    updated_list.append(date_time_version)

updated_list

final_list = []
#the final list for which we are going to append the time in seconds per kilometer
for i in updated_list:
    a_timedelta = i - datetime.datetime(1900, 1, 1)
    seconds = a_timedelta.total_seconds()
    final_list.append(int(seconds))

running_pace_vs_stride_length = Table().with_column("Average Running Pace", final_list).with_column("Average Stride Length", progressive_cleaned.column("Avg Stride Length"))

running_pace_vs_stride_length.show()

Average Running Pace	Average Stride Length
248	148
244	150
244	149
243	150
244	147
243	150
242	148
240	148
240	151
237	154
233	155
231	154
227	157
221	159
218	163
215	162
211	165
203	169

running_pace_vs_stride_length.scatter("Average Running Pace", "Average Stride Length")
plots.xlabel("Average Running Pace (Seconds per Kilometer)")
plots.ylabel("Average Stride Length (Centimeters)")

Text(0, 0.5, 'Average Stride Length (Centimeters)')

png

Now we have cleaned the data, dropped the unnecessary parts, and have a simple visualization of what we are going to look at in further detail. We have a two column table containing what we would like to analyze for this particular moment. Time for some statistical testing.

# Creating some functions that we will use to analyze the data.
#We need standard units so that we can run our correlation function.

def standard_units(x):
    """ Converts an array x to standard units """
    return (x - np.mean(x)) / np.std(x)

def correlation(t, x, y):
    """ Computes correlation: t is a table, and x and y are column names """
    x_su = standard_units(t.column(x))
    y_su = standard_units(t.column(y))
    return np.mean(x_su * y_su)

#generating the correlation coefficient
correlation(running_pace_vs_stride_length, "Average Running Pace", "Average Stride Length")

-0.9819409345425044

Wow! We generated a correlation coefficient of -0.98, which means there is almost an exactly proportionate decrease in stride length for an increase in the amount of seconds it take to complete each kilometer! However, a correlation coefficient alone is not enough. We need to also look at a linear regression line and plot the residuals.

#our slope and intercept functions
def slope(t, x, y):
    """ Computes the slope of the regression line, like correlation above """
    r = correlation(t, x, y)
    y_sd = np.std(t.column(y))
    x_sd = np.std(t.column(x))
    return r * y_sd / x_sd

def intercept(t, x, y):
    """ Computes the intercept of the regression line, like slope above """
    x_mean = np.mean(t.column(x))
    y_mean = np.mean(t.column(y))
    return y_mean - slope(t, x, y)*x_mean

slope_for_regression_line = slope(running_pace_vs_stride_length, "Average Running Pace", "Average Stride Length")
slope_for_regression_line

-0.48736086175942556

intercept_for_regression_line = intercept(running_pace_vs_stride_length, "Average Running Pace", "Average Stride Length")
intercept_for_regression_line

267.67321364452425

#generating the predicted stride length from running pace and adding that to the table
predictions = slope_for_regression_line * running_pace_vs_stride_length.column("Average Running Pace") + intercept_for_regression_line

predictions

array([146.80771993, 148.75716338, 148.75716338, 149.24452424,
       148.75716338, 149.24452424, 149.7318851 , 150.70660682,
       150.70660682, 152.16868941, 154.11813285, 155.09285458,
       157.04229803, 159.9664632 , 161.42854578, 162.89062837,
       164.84007181, 168.73895871])

run_with_predictions = running_pace_vs_stride_length.with_column("Predicted Stride Length", predictions)

run_with_errors = run_with_predictions.with_column("Error", run_with_predictions.column("Average Stride Length") - run_with_predictions.column("Predicted Stride Length"))

run_with_errors.show()

Average Running Pace	Average Stride Length	Predicted Stride Length	Error
248	148	146.808	1.19228
244	150	148.757	1.24284
244	149	148.757	0.242837
243	150	149.245	0.755476
244	147	148.757	-1.75716
243	150	149.245	0.755476
242	148	149.732	-1.73189
240	148	150.707	-2.70661
240	151	150.707	0.293393
237	154	152.169	1.83131
233	155	154.118	0.881867
231	154	155.093	-1.09285
227	157	157.042	-0.042298
221	159	159.966	-0.966463
218	163	161.429	1.57145
215	162	162.891	-0.890628
211	165	164.84	0.159928
203	169	168.739	0.261041

root_mean_squared_error = np.mean(run_with_errors.column("Error") ** 2) ** 0.5

root_mean_squared_error

1.2311567965332846

run_with_errors.scatter("Average Running Pace")

png

Taking A Closer Look at the Residuals

From that plot, we can tell that the trend in the residuals looks good. The aim of any linear regression estimate is to generate residuals that appear as white noise. However, the plot is a little bit zoomed out. Let’s manipulate the x and y axis to generate a closer look at the residuals.

run_with_errors.scatter("Average Running Pace", "Error")

png

mean_stride_length = np.mean(run_with_errors.column("Average Stride Length"))
mean_stride_length

154.38888888888889

maximum_error = max(abs(run_with_errors.column("Error")))
maximum_error

2.706606822262131

2 / 154.38

0.012955045990413267

root_mean_squared_error

1.2311567965332846

root_mean_squared_error / mean_stride_length

0.007974387311118792

With this zoomed in look, we can tell that most of our errors hover around the range +- 2. This means that our usual error is around 2 centimeters. That’s not a bad error. Given the mean stride length of 154, the predicted percentage of stride length that we will be off by is about 1.3 percent. That’s a darn good estimate. If we take into consideration the RMSE of 1.23, we take 1.23 / 154.38 and get a result of 0.007, which is 0.7 percent.

Conclusion

From analysis of the errors, and correlation coefficient of -0.98, we can conclude that linear prediction is a good indicator for calculating my stride length, given a running pace between 200 and 300 seconds per kilometer.

Share on

X Facebook LinkedIn Bluesky

Colin FitzGerald

Data Science Analysis of Running Form

Analyzing the Correlation Between Running Stride Length and Average Pace Per Mile

Importing the Data

Initial Visualization of the Data

Taking A Closer Look at the Residuals

Conclusion

Share on

You may also enjoy

Behavioral Economics, An Introspective

Collegiate Track and Field Data, Summarized

Analyzing the Difference Between 2019 and 2021 Collegiate Outdoor 1500m Data

Prickly Pear Invitational Split Analysis

Average Running Pace	Average Stride Length
248	148
244	150
244	149
243	150
244	147
243	150
242	148
240	148
240	151
237	154
233	155
231	154
227	157
221	159
218	163
215	162
211	165
203	169

Average Running Pace	Average Stride Length
248	148
244	150
244	149
243	150
244	147
243	150
242	148
240	148
240	151
237	154
233	155
231	154
227	157
221	159
218	163
215	162
211	165
203	169

Average Running Pace	Average Stride Length
248	148
244	150
244	149
243	150
244	147
243	150
242	148
240	148
240	151
237	154
233	155
231	154
227	157
221	159
218	163
215	162
211	165
203	169