Question: Is there a significant difference in the average 1500m time between 2019 and 2021? More specifically, are the 2021 College Outdoor 1500m times faster, on average, than those of 2019? For context, the 1500m is the metric equivalent of the mile. While the mile is not run at the Olympics, the 1500m is the championship standard.
The data used for this project is the simplest out there: the top 95 1500m times in the country. Both lists are taken from Division 1, combining the East and West regions. Here is a link to the 2019 Data. Here is a link to the 2021 Data.
The 2021 data is current as of 4/26/2021; the season is not yet complete. I will update the data as more meets conclude and new results become available.
$H_{0}: \mu_{2019} = \mu_{2021}$ : The average 1500m time in 2019 is the same as that of 2021.
$H_{1}: \mu_{2019} \neq \mu_{2021}$ : The average 1500m time in 2021 is either faster or slower than that of 2019; there is a significant difference between the two.
#import all of the necessary modules
from datascience import *
import datetime
import numpy as np
from scipy import stats
from scipy.optimize import curve_fit
import matplotlib.pyplot as plot
%matplotlib inline
#load datasets
twenty_twenty_one_outdoor_times = Table.read_table("2021 Outdoor 1500m Times 426.csv")
twenty_nineteen_outdoor_times = Table.read_table("2019 Outdoor Times.csv")
twenty_nineteen_outdoor_times
#convert each time entry to a string so that it can be parsed as text
string_2019_times = [str(i) for i in twenty_nineteen_outdoor_times.column("Time ")]
string_2021_times = [str(i) for i in twenty_twenty_one_outdoor_times.column("Time")]
#write a function to convert each time string in the dataset into a seconds-only version, which makes it easier to graph
def split_minutes_and_seconds(time_str):
    """Convert a "MM:SS.s" time string into total seconds."""
    minutes, seconds = time_str.split(":")
    whole_seconds, fraction = seconds.split(".")
    #divide by 10 ** len(fraction) so both tenths ("39.7") and hundredths ("39.74") are handled
    return int(minutes) * 60 + int(whole_seconds) + int(fraction) / 10 ** len(fraction)
split_minutes_and_seconds("03:39.7")
clean_2019_times = [split_minutes_and_seconds(i) for i in string_2019_times]
clean_2021_times = [split_minutes_and_seconds(i) for i in string_2021_times]
len(clean_2021_times)
#drop the last five entries of the 2021 list so both datasets contain 95 times
del clean_2021_times[95:100]
len(clean_2021_times)
len(clean_2019_times)
#creating the final table
final_table = Table().with_columns("Rank", np.arange(1, 96, 1))
final_table = final_table.with_column("2019 Outdoor 1500m Times", clean_2019_times)
final_table = final_table.with_column("2021 Outdoor 1500m Times", clean_2021_times)
final_table
final_table.hist(1, bins = np.arange(217, 225, 0.5))
final_table.hist(2, bins = np.arange(216, 225, 0.5))
final_table.hist(1,2, bins = np.arange(215, 225, 0.5))
You can see from these distributions that the 2019 times are less tightly clustered in the range of 218 to 223 seconds (3:38 to 3:43) than the 2021 times. A visual inspection is nice, but let's perform a couple of hypothesis tests. Mind you, the 2021 season has not even been completed. There are still weeks to go!
#code to create a scatter of the overall rank vs. 2019 outdoor 1500m time
final_table.scatter("Rank","2019 Outdoor 1500m Times")
#code to create a scatter of the overall rank vs. 2021 outdoor 1500m time
final_table.scatter("Rank","2021 Outdoor 1500m Times")
Upon inspection of both of these datasets, I am actually tempted to use exponential regression to model the data. I'll do that another day though, because that is not the focus of this current project.
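For reference, here is a minimal sketch of what that could look like with curve_fit (already imported above). The model form $a e^{b \cdot rank} + c$ and the starting guesses are assumptions on my part, not part of this analysis.
#sketch only: fit time = a * exp(b * rank) + c to the 2019 times (model form is an assumption)
def exponential_model(rank, a, b, c):
    return a * np.exp(b * rank) + c

fit_params, _ = curve_fit(exponential_model,
                          final_table.column("Rank"),
                          final_table.column("2019 Outdoor 1500m Times"),
                          p0=(1, 0.01, 215))
fit_params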
nineteen_avg = np.mean(final_table.column("2019 Outdoor 1500m Times"))
twentyone_avg = np.mean(final_table.column("2021 Outdoor 1500m Times"))
nineteen_sd = np.std(final_table.column("2019 Outdoor 1500m Times"))
twentyone_sd = np.std(final_table.column("2021 Outdoor 1500m Times"))
print(nineteen_avg, nineteen_sd)
print(twentyone_avg, twentyone_sd)
Now it's time to talk about the two distributions. For 2019, the average, $\bar{X}_{2019}$, was $03:43.37$ (223.37 seconds), while the average for 2021, $\bar{X}_{2021}$, was $03:41.62$ (221.62 seconds). The standard deviation for 2019, $\sigma_{2019}$, was 2.176 seconds, while that of 2021, $\sigma_{2021}$, was 2.081 seconds.
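To connect the raw second values back to the familiar minutes:seconds notation, here is a small helper. The name format_seconds is one I'm introducing here for illustration; it is not part of the original notebook.
#hypothetical helper: format a time in seconds back into M:SS.ss notation
def format_seconds(total_seconds):
    minutes, seconds = divmod(total_seconds, 60)
    return f"{int(minutes)}:{seconds:05.2f}"

format_seconds(nineteen_avg), format_seconds(twentyone_avg)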
You can tell from these two histograms that there is obvious left skewness and that the times themselves are not normally distributed. However, the sample mean of a sample of sufficient size is approximately normally distributed, so by the central limit theorem I can still work with normal distributions for the means.
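Concretely, the claim is that each sample mean is approximately normal: $\bar{X}_{2019} \approx N\left(\mu_{2019}, \frac{\sigma_{2019}^2}{95}\right)$ and $\bar{X}_{2021} \approx N\left(\mu_{2021}, \frac{\sigma_{2021}^2}{95}\right)$, where 95 is the number of times in each list.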
What follows is that the difference in sample means, $\bar{X}_{2019} - \bar{X}_{2021}$, is also approximately normally distributed, and its standard deviation (the standard error of the difference) can be computed as $\sqrt{\frac{\sigma_{2019}^2}{95} + \frac{\sigma_{2021}^2}{95}} = \sqrt{\frac{2.176^2}{95} + \frac{2.081^2}{95}}$.
import math
#variance of each sample mean: sigma squared over n, with n = 95 times per list
nineteen_variance = nineteen_sd**2 / 95
twentyone_variance = twentyone_sd**2 / 95
#standard error of the difference in sample means
difference_sd = math.sqrt(nineteen_variance + twentyone_variance)
difference_sd
So the distribution of the difference in sample means is centered at the observed difference:
difference_mean = (nineteen_avg - twentyone_avg)
difference_mean
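Putting these together, and plugging in the values computed above, the difference in sample means is approximately distributed as $\bar{X}_{2019} - \bar{X}_{2021} \approx N\left(\mu_{2019} - \mu_{2021}, \frac{\sigma_{2019}^2}{95} + \frac{\sigma_{2021}^2}{95}\right)$, with an observed difference of about 1.75 seconds and a standard error of about 0.31 seconds.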
A 95% confidence interval for the true difference between the two population means, $\mu_{2019} - \mu_{2021}$, would be as follows:
To select our z-score, we look at the standard normal curve. For a $95\%$ confidence interval, we want the middle $95\%$ of the distribution, which leaves $2.5\%$ in each tail. This means that we need to find $\Phi^{-1}(0.975)$ for the positive upper value. Here, $\Phi^{-1}$ indicates that we are using the inverse of the normal CDF.
The formula to find the upper and lower bounds of our confidence interval is:
CI = $(\bar{X}_{2019} - \bar{X}_{2021}) \pm z \cdot SE$, where $SE$ is the standard error of the difference computed above.
z = stats.norm.ppf(0.975)
ci_lower = difference_mean - (z * difference_sd)
ci_upper = difference_mean + (z * difference_sd)
ci_lower, ci_upper
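As a complementary check (not in the original write-up), the same quantities give a two-sided z-test p-value for $H_{0}: \mu_{2019} = \mu_{2021}$ under the normal approximation used above:
#two-sided z-test using the difference in means and its standard error from above
z_stat = difference_mean / difference_sd
p_value = 2 * (1 - stats.norm.cdf(abs(z_stat)))
z_stat, p_value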
Notice how 0 is not in the interval. This is strong evidence of a significant difference between the 1500m times of 2019 and 2021. The positive difference indicates that the 2019 1500m times were slower, on average. I will eventually update this page to include an exponential regression model that minimizes the SSE, so we can use a runner's rank to estimate their time.
Now let's have some fun with the data. I mentioned before that I thought there was some clustering in the range 3:38 to 3:43 (218 to 223 seconds). Let's count how many times fall in that range in each data set.
final_table.where("2019 Outdoor 1500m Times", are.between_or_equal_to(218,223)).num_rows
final_table.where("2021 Outdoor 1500m Times", are.between_or_equal_to(218,223)).num_rows
Wow! There are more than double the number of times between 3:38 and 3:43 in 2021 compared to 2019: 56 vs. 27. That's a 107% increase in the concentration of 1500m times in that range since 2019.
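For completeness, that 107% figure comes straight from the two counts above:
#percent increase in the number of times in the 218-223 second range, 2019 to 2021
count_2019 = final_table.where("2019 Outdoor 1500m Times", are.between_or_equal_to(218, 223)).num_rows
count_2021 = final_table.where("2021 Outdoor 1500m Times", are.between_or_equal_to(218, 223)).num_rows
(count_2021 - count_2019) / count_2019 * 100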
#compute the 0th through 100th percentiles (in steps of 10) of each year's times
nineteen_percentiles = [percentile(i, final_table.column(1)) for i in range(0, 110, 10)]
twentyone_percentiles = [percentile(i, final_table.column(2)) for i in range(0, 110, 10)]
x_axis_values = np.arange(0, 101, 10)
#plot both years on the same axis so the percentile curves are directly comparable
figure, axis_1 = plot.subplots()
axis_1.plot(x_axis_values, nineteen_percentiles, color = "blue", label = "2019")
axis_1.plot(x_axis_values, twentyone_percentiles, color = "purple", label = "2021")
axis_1.set_xlabel("Percentile")
axis_1.set_ylabel("1500m time (seconds)")
axis_1.legend()
We can see from the graph that at every percentile the 2019 curve sits above the 2021 curve, signaling that you must be faster this year to reach the same percentile.
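Here is a quick programmatic check of that claim, comparing the two percentile lists element by element:
#verify that every 2019 percentile is at least as slow as the corresponding 2021 percentile
np.all(np.array(nineteen_percentiles) >= np.array(twentyone_percentiles))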
#export the cleaned data to my computer so I can easily visualize within R and Plotly
final_table.to_csv("2019 and 2021 Outdoor 1500m Data.csv")
#creating an x-axis for the normal curve
x_values = np.arange(0, 500, 0.01)
#using the Gaussian density to model the theoretical sampling distribution of the mean for each year
nineteen_se = nineteen_sd / math.sqrt(95)
twentyone_se = twentyone_sd / math.sqrt(95)
y_values = (1 / (nineteen_se * math.sqrt(2 * math.pi))) * np.exp(-0.5 * ((x_values - nineteen_avg) / nineteen_se) ** 2)
y_values_2 = (1 / (twentyone_se * math.sqrt(2 * math.pi))) * np.exp(-0.5 * ((x_values - twentyone_avg) / twentyone_se) ** 2)
#using the Gaussian curves to draw the two normal sampling distributions on a shared axis
figure_2, axis_2 = plot.subplots()
axis_2.plot(x_values, y_values, color = "red", label = "2019")
axis_2.plot(x_values, y_values_2, color = "blue", label = "2021")
axis_2.legend()
plot.xlim(220, 226)
plot.axvline(x=nineteen_avg)
plot.axvline(x=twentyone_avg)
These are the two normal curves, drawn from the Gaussian density, for the sampling distribution of the 2019 mean (in red) and the 2021 mean (in blue). You can see that the two curves barely overlap: virtually all of the 2021 curve sits to the left of (i.e., faster than) the 2019 curve.
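One way to quantify that separation (an extra check, not in the original notebook) is to ask how much probability the 2019 sampling distribution places below the observed 2021 mean:
#probability that a 2019 sample mean would fall below the observed 2021 mean, under the normal approximation
stats.norm.cdf(twentyone_avg, loc=nineteen_avg, scale=nineteen_sd / math.sqrt(95))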