Difference between 2019 and 2021 Outdoor Track and Field Times - Collegiate Level

Question: Is there a significant difference in the average 1500m time between 2019 and 2021? More specifically, are the 2021 College Outdoor 1500m times faster, on average, than those of 2019? For context, the 1500m is often called the metric mile. While the mile itself is not run at the Olympics, the 1500m is the championship standard.

  • Spikes are the shoes that runners use to compete. There has been quite a bit of commentary regarding Nike's new "super spikes", namely the Air Zoom Victory and the Dragonfly. Both shoes offer superior cushioning (Nike ZoomX foam, to be exact), and the Air Zoom Victory also includes a full-length carbon plate. Some have even said that these spikes offer an artificial advantage over traditional spikes.
  • According to Nike, "...[ZoomX] is lighter, softer and more responsive than any Nike foam, designed to maximize speed by delivering greater energy return. ZoomX was derived from a foam traditionally used in aerospace innovation, applied for the first time in performance footwear in the Nike Zoom Vaporfly Elite and 4%."

The data used for this project is the simplest out there: the top 95 1500m times in the country. Both lists are taken from Division 1, combining the East and West regions. Here is a link to the 2019 Data. Here is a link to the 2021 Data.

The 2021 data were pulled on 4/26/2021. The season is not complete; I will update the data as meets finish and new results become available.

  • This is not a randomized control trial. This is an observational study.
  • This data collection process was quite simple.
  • These results are what they are; take from them what you will.
  • I do not have data on the athletes' shoes; however, from a layman's perspective, almost all of the athletes we have seen competing at an extremely high level this season have been wearing, you guessed it, the Victory or the Dragonfly. In a more general sense, then, we are testing whether the 2021 and 2019 1500m times are significantly different.

$H_{0}: \mu_{2019} = \mu_{2021}$ : The average 1500m time in 2019 is the same as that of 2021.

$H_{1}: \mu_{2019} \neq \mu_{2021}$ : The average 1500m time in 2021 is either faster or slower than that of 2019; there is a significant difference between the two.

In [1]:
#import all of the necessary modules 
from datascience import * 
import datetime
import numpy as np 
from scipy import stats 
from scipy.optimize import curve_fit 
import matplotlib.pyplot as plot
%matplotlib inline
In [2]:
#load datasets 
twenty_twenty_one_outdoor_times = Table.read_table("2021 Outdoor 1500m Times 426.csv")
In [3]:
twenty_nineteen_outdoor_times = Table.read_table("2019 Outdoor Times.csv")
In [4]:
twenty_nineteen_outdoor_times
Out[4]:
Rank Name Year School Time Meet Date
1 Hoare, Oliver JR-3 Wisconsin 03:37.2 Bryan Clay Invitational 17-Apr-19
2 Villarreal, Carlos JR-3 Arizona 03:37.2 Bryan Clay Invitational 17-Apr-19
3 Nuguse, Yared SO-2 Notre Dame 03:38.3 Bryan Clay Invitational 17-Apr-19
4 Paulson, William SR-4 Arizona State 03:38.3 Bryan Clay Invitational 17-Apr-19
5 Worley, Sam SO-2 Texas 03:38.6 Bryan Clay Invitational 17-Apr-19
6 Suliman, Waleed SO-2 Ole Miss 03:38.7 Bryan Clay Invitational 17-Apr-19
7 Brown, Reed SO-2 Oregon 03:38.8 Payton Jordan Invitational 2-May-19
8 Beamish, Geordie JR-3 Northern Arizona 03:39.1 Bryan Clay Invitational 17-Apr-19
9 Kusche, George SO-2 Nebraska 03:39.3 Payton Jordan Invitational 2-May-19
10 Grijalva, Luis SO-2 Northern Arizona 03:39.5 Bryan Clay Invitational 17-Apr-19

... (85 rows omitted)

In [5]:
#convert each time entry to a string so it can be parsed consistently 
string_2019_times = [str(i) for i in twenty_nineteen_outdoor_times.column("Time ")]
In [6]:
string_2021_times = [str(i) for i in twenty_twenty_one_outdoor_times.column("Time")]
In [7]:
#write a function to convert each time string in the dataset into a seconds-only value, which makes it easier to graph 
def split_minutes_and_seconds(time_str):
    """Convert a "MM:SS.s" time string into a total number of seconds."""
    minutes, seconds = time_str.split(":")
    return int(minutes) * 60 + float(seconds)

split_minutes_and_seconds("03:39.7")
Out[7]:
219.7
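As a quick sanity check on a hypothetical input (not one that appears in the dataset), the same parser also handles a time reported to the hundredth:

#hypothetical input, just to confirm the parser handles hundredths as well 
split_minutes_and_seconds("03:39.75")   # returns 219.75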
In [8]:
clean_2019_times = [split_minutes_and_seconds(i) for i in string_2019_times]
In [9]:
clean_2021_times = [split_minutes_and_seconds(i) for i in string_2021_times]
In [10]:
len(clean_2021_times)
Out[10]:
100
In [11]:
#deleting the last five rows of the 2021 dataset so that the length of each dataset is the same 
del clean_2021_times[95:100]
In [12]:
len(clean_2021_times)
Out[12]:
95
In [13]:
len(clean_2019_times)
Out[13]:
95
In [14]:
#creating the final table 
final_table = Table().with_columns("Rank", np.arange(1, 96, 1))
In [15]:
final_table = final_table.with_column("2019 Outdoor 1500m Times", clean_2019_times)
In [16]:
final_table = final_table.with_column("2021 Outdoor 1500m Times", clean_2021_times)
In [17]:
final_table
Out[17]:
Rank 2019 Outdoor 1500m Times 2021 Outdoor 1500m Times
1 217.2 216
2 217.2 216.5
3 218.3 217
4 218.3 217.2
5 218.6 217.2
6 218.7 217.8
7 218.8 217.8
8 219.1 218.1
9 219.3 218.3
10 219.5 218.6

... (85 rows omitted)

In [18]:
final_table.hist(1, bins = np.arange(217, 225, 0.5))
final_table.hist(2, bins = np.arange(216, 225, 0.5))
final_table.hist(1,2, bins = np.arange(215, 225, 0.5))

You can see visually from these distributions that the 2019 times are less tightly clustered in the range 218 to 223 seconds (3:38 to 3:43). A visual inspection is nice, but let's put some numbers behind it and work toward a formal hypothesis test. Mind you, the 2021 season has not even been completed; there are still weeks to go!

In [97]:
#code to create a scatter of the overall rank vs. 2019 outdoor 1500m time 
final_table.scatter("Rank","2019 Outdoor 1500m Times")
In [99]:
#code to create a scatter of the overall rank vs. 2021 outdoor 1500m time 
final_table.scatter("Rank","2021 Outdoor 1500m Times")

Upon inspection of both of these datasets, I am actually tempted to use exponential regression to model the data. I'll do that another day though, because that is not the focus of this current project.
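For what it's worth, curve_fit is already imported from scipy at the top of this notebook, so a minimal sketch of that future fit could look like the cell below. The model form time ≈ a·e^(b·rank) + c and the starting guesses are illustrative assumptions on my part, not results of this project.

#illustrative sketch only: fit an exponential model, time ~ a * exp(b * rank) + c, to the 2021 column 
#the model form and starting guesses (p0) are assumptions, not conclusions of this project 
def exponential_model(rank, a, b, c):
    return a * np.exp(b * rank) + c

fitted_params, _ = curve_fit(exponential_model,
                             final_table.column("Rank"),
                             final_table.column("2021 Outdoor 1500m Times"),
                             p0 = (1, 0.01, 215), maxfev = 10000)
fitted_params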

In [19]:
nineteen_avg = np.mean(final_table.column("2019 Outdoor 1500m Times"))
twentyone_avg = np.mean(final_table.column("2021 Outdoor 1500m Times"))
nineteen_sd = np.std(final_table.column("2019 Outdoor 1500m Times"))
twentyone_sd = np.std(final_table.column("2021 Outdoor 1500m Times"))                    
In [20]:
print(nineteen_avg, nineteen_sd)
223.37578947368414 2.1763182714935256
In [21]:
print(twentyone_avg, twentyone_sd)
221.62105263157895 2.0807839988479087

Now it's time to talk about the two distributions. For 2019, the average, $\bar{X}_{2019}$, was 223.38 seconds (about 3:43.4), while the average for 2021, $\bar{X}_{2021}$, was 221.62 seconds (about 3:41.6). The standard deviation for 2019, $\sigma_{2019}$, was 2.176, while that of 2021, $\sigma_{2021}$, was 2.081.

You can tell from these two histograms that there is obvious left skewness and that the times are not normally distributed. I'm going to rely on the central limit theorem: for a sample of sufficient size, the distribution of the sample mean is approximately normal even when the underlying data are not.
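To make that claim concrete, here is a small resampling illustration that is not part of the original analysis: it repeatedly resamples the 2019 times with replacement and plots the resulting means, which pile up in a roughly normal shape even though the raw times do not.

#illustration only: resample the 2019 times with replacement and look at the distribution of the resampled means 
resampled_means = [np.mean(np.random.choice(clean_2019_times, size = 95, replace = True)) for _ in range(5000)]
plot.hist(resampled_means, bins = 30)
plot.xlabel("Resampled mean 2019 1500m time (seconds)")
plot.show()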

It follows that the difference in sample means, $\bar{X}_{2019} - \bar{X}_{2021}$, is also approximately normally distributed, and its standard deviation (standard error) can be computed as $\sqrt{\frac{\sigma_{2019}^{2}}{95} + \frac{\sigma_{2021}^{2}}{95}} = \sqrt{\frac{2.1763^{2}}{95} + \frac{2.0808^{2}}{95}}$.

In [22]:
import math 
In [23]:
nineteen_variance = (nineteen_sd**2 / 95)
twentyone_variance = (twentyone_sd**2 / 95)
difference_sd = math.sqrt(nineteen_variance + twentyone_variance)
difference_sd
Out[23]:
0.3089204167435882

So the standard error of the difference in sample means is about 0.309 seconds. The center of that distribution is the observed difference between the two averages:

In [24]:
difference_mean = (nineteen_avg - twentyone_avg)
difference_mean 
Out[24]:
1.7547368421051885
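Putting the two computed pieces together, the normal approximation for the difference in sample means is roughly

$\bar{X}_{2019} - \bar{X}_{2021} \approx \mathcal{N}(1.755,\ 0.309^{2})$

with the mean and standard deviation taken directly from the two cells above.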

A 95% confidence interval for the true difference between the two average times can be constructed as follows:

To select our z-score, we look at the standard normal curve. For a $95\%$ confidence interval, we want the middle $95\%$ of the distribution, so the upper critical value is $\Phi^{-1}(0.975)$, where $\Phi^{-1}$ is the inverse of the standard normal CDF.

The formula to find the upper and lower bounds of our confidence interval is:

CI = $(\bar{X}_{2019} - \bar{X}_{2021}) \pm z \cdot SD(\bar{X}_{2019} - \bar{X}_{2021})$
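For reference, plugging in the values computed above gives $1.7547 \pm 1.96 \times 0.3089 \approx (1.149,\ 2.360)$ seconds, which is what the next cell computes.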

In [102]:
z = stats.norm.ppf(0.975)
ci_lower = difference_mean - (z * difference_sd)
ci_upper = difference_mean + (z * difference_sd)
ci_lower, ci_upper
Out[102]:
(1.1492639511986513, 2.3602097330117258)

Notice how 0 is not in the interval. This is strong evidence of a significant difference between the 1500m times of 2019 and 2021. The positive difference indicates that the 2019 1500m times were slower, on average. I will eventually update this page to include an exponential regression model that minimizes the SSE, so we can use a runner's rank to estimate their time.

  • It must be noted that this is a 95% confidence interval. We still need to perform a formal hypothesis test to come to an outright conclusion about whether the data favor the null or the alternative hypothesis. Once the season is over and the data are final, I will complete that test; a preview of what it will look like is sketched below.
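As a preview only, not the final test, here is a minimal sketch of that two-sided test, reusing the same normal approximation and the difference_mean and difference_sd values computed above:

#preview sketch only -- the 2021 season is not finished, so this is not the final hypothesis test 
#two-sided z-test for H0: mu_2019 = mu_2021, reusing the normal approximation from the confidence interval 
z_statistic = difference_mean / difference_sd
p_value = 2 * stats.norm.sf(abs(z_statistic))
z_statistic, p_value

Given the numbers already computed above (a difference of about 1.75 seconds against a standard error of about 0.31), the resulting p-value would fall far below 0.05.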

Now let's have some fun with the data. I mentioned before that I thought there was some clustering in the range 3:38 to 3:43 (218 to 223 seconds). Let's count how many times fall in that range in each dataset.

In [26]:
final_table.where("2019 Outdoor 1500m Times", are.between_or_equal_to(218,223)).num_rows
Out[26]:
27
In [27]:
final_table.where("2021 Outdoor 1500m Times", are.between_or_equal_to(218,223)).num_rows
Out[27]:
56

Wow! There are more than twice as many 2021 times between 3:38 and 3:43 as in 2019: 56 vs. 27. That works out to $\frac{56 - 27}{27} \approx 107\%$ more 1500m times concentrated in that range than in 2019.

In [28]:
nineteen_percentiles = [percentile(i, final_table.column(1)) for i in range(0, 110, 10)]
twentyone_percentiles = [percentile(i, final_table.column(2)) for i in range(0, 110, 10)]
x_axis_values = np.arange(0, 101, 10)
In [76]:
figure, axis_1 = plot.subplots()
axis_1.plot(x_axis_values, nineteen_percentiles, color = "blue")
axis_2 = axis_1.twinx()
axis_2.plot(x_axis_values, twentyone_percentiles, color = "purple")
Out[76]:
[<matplotlib.lines.Line2D at 0x11fe380d0>]

We can see from the graph that at every percentile the 2019 time is higher (slower) than the 2021 time, signaling that you must run faster this year to reach the same percentile.
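Since both percentile curves are measured in the same units (seconds), an alternative view that is not in the original notebook is to draw them on a single shared axis, which makes the gap between the two years directly comparable:

#alternative view, not in the original notebook: both percentile curves on one shared axis 
plot.plot(x_axis_values, nineteen_percentiles, color = "blue", label = "2019")
plot.plot(x_axis_values, twentyone_percentiles, color = "purple", label = "2021")
plot.xlabel("Percentile")
plot.ylabel("1500m time (seconds)")
plot.legend()
plot.show()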

In [55]:
#export the cleaned data to my computer so I can easily visualize within R and Plotly 
final_table.to_csv("2019 and 2021 Outdoor 1500m Data.csv")
In [69]:
#creating an x-axis for the normal curve 
x_values = np.arange(0, 500, 0.01)
In [70]:
#using the gaussian density function to model the sampling distribution of the mean for each year 
y_values = (1 / ((nineteen_sd / math.sqrt(95)) * math.sqrt(2 * math.pi))) * np.exp(-0.5 * ((x_values - nineteen_avg) / (nineteen_sd / math.sqrt(95)))**2)
In [71]:
y_values_2 = (1 / ((twentyone_sd / math.sqrt(95)) * math.sqrt(2 * math.pi))) * np.exp(-0.5 * ((x_values - twentyone_avg) / (twentyone_sd / math.sqrt(95)))**2)
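Since stats is already imported from scipy at the top of the notebook, the same two curves can also be produced with scipy's built-in normal density; this is just an equivalent cross-check, not part of the original analysis:

#equivalent cross-check using scipy's normal pdf (same curves as the two cells above) 
y_values_check = stats.norm.pdf(x_values, loc = nineteen_avg, scale = nineteen_sd / math.sqrt(95))
y_values_2_check = stats.norm.pdf(x_values, loc = twentyone_avg, scale = twentyone_sd / math.sqrt(95))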
In [93]:
#using the gaussian equation to draw two normal distributions of the data 
figure_2, axis_2 = plot.subplots()
axis_2.plot(x_values, y_values, color = "red", label = "2019")
axis_3 = axis_2.twinx()
axis_3.plot(x_values, y_values_2, color = "blue")
plot.xlim(220, 226)
plot.axvline(x=nineteen_avg)
plot.axvline(x=twentyone_avg)
Out[93]:
<matplotlib.lines.Line2D at 0x1201d2a00>

These are the two normal curves, drawn from the gaussian equation, approximating the sampling distribution of the mean for 2019 (in red) and 2021 (in blue). You can see that the two curves barely overlap: virtually all of 2021's distribution of the mean lies to the left of (i.e., is faster than) where 2019's begins.