Collegiate Track and Field Data, Summarized
Statistical Updates - Collegiate Track and Field
By Colin FitzGerald
Author’s Note: This page will be updated weekly as new data becomes available.
Page last updated: 5/16/2021 at 10:09 P.M.
As I watched the times go up on the board, I couldn’t believe my eyes.
“That many guys broke 3:40?!”
I couldn’t believe it. Then again, maybe I could.
With the release of Nike's new Dragonfly and Air Zoom Victory spikes, the running world has experienced a new depth of speed across the board. The shock of each meet's results parallels 2017, when Nike first released the Vaporfly.
When new shoe technology is released, people are quick to question the credibility of the performances.
Are we experiencing advancements in diet, coaching, and training that account for the sheer depth and difference in times? Is there a new miracle drug? How can we explain this insanity?
Those are all questions I, among others, ask myself as a fan of track and field.
This year's depth across distance events from 800 meters to 10,000 meters has been remarkable. Times that used to be considered "other-worldly" are now commonly run in a prelim heat.
Go back to the 2016 Olympic Trials.
The very best runners lined up and went toe to toe in the final of the men’s 1500m. The result? A commanding win by Matthew Centrowitz. The time? 3:34.09, with a 53.95 final lap.
By comparison, Yared Nuguse just soloed a 3:34.68 in the prelims of the men's 1500m at the ACC Championships.
That got me thinking: is there any way I can prove that this year's results are statistically different from those of years past?
I went to work and created the project below, which compares summary statistics for every event from the 800m to the 10,000m, starting in 2012 and ending with this year.
The project and this page will be updated over time as I continue to work on it and build out more functionality. Here is what I have so far:
The Project
The TFRRS website has an archive page that contains the NCAA Track and Field Outdoor Final Qualifying lists for every year since 2012.
The following functions reproduce those tables from the TFRRS website as pandas DataFrames that can be manipulated directly for data visualization.
I collected the data this way so that I do not have to download each dataset to my computer; instead, I can pull each year's list directly into Python.
I spent time studying the TFRRS website in order to properly configure my web scraper. The following functions are a cleaned-up, more robust version of several Python Jupyter notebooks.
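These functions assume the following imports; the aliases are inferred from how the names are used in the code below (soup for BeautifulSoup, pd for pandas, np for NumPy):

import re
from urllib.request import Request, urlopen

import numpy as np
import pandas as pd
from bs4 import BeautifulSoup as soup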
The first function below generates the URL on the TFRRS website. You input a year as a string, and the function will search the TFRRS archive HTML for the correct link to the outdoor performance list for that year.
def url_generator(year):
    # Search the TFRRS archive page for the outdoor final qualifying list of the given year
    base_url = "https://www.tfrrs.org/archives.html"
    req = Request(base_url, headers={'User-Agent': 'Mozilla/5.0'})
    webpage = urlopen(req).read()
    page_soup = soup(webpage, "html.parser")
    # Find the archive link whose text contains, e.g., "2012 Outdoor"
    url_search = page_soup.find_all("a", text=re.compile(year + " " + "Outdoor"))
    # The href is protocol-relative, so strip the leading slashes and rebuild the full URL
    refined_url = url_search[0]["href"].strip("//").partition(")")
    return "http://" + refined_url[0] + refined_url[1]
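As an illustrative usage example (the variable name here is just for demonstration):

# Illustrative usage: get the link to the 2012 outdoor final qualifying list
outdoor_2012_url = url_generator("2012")
print(outdoor_2012_url)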
The event_dictionary maps each discipline to its corresponding "div" class in the TFRRS website HTML. The event_code_generator is the function that, as you will see later in the code, looks events up in that dictionary. Having the event dictionary allows our table generator to be more robust and abstract.
event_dictionary = {"100m": "row Men 6",
                    "200m": "row Men 7",
                    "400m": "row Men 11",
                    "800m": "row Men 12",
                    "1500m": "row Men 13",
                    "5000m": "row Men 21",
                    "10,000m": "row Men 22"}

def event_code_generator(event_name):
    # Look up the TFRRS div class for a given event, e.g. "1500m" -> "row Men 13"
    return event_dictionary[event_name]
The table generator function pulls the table for a given event from the URL created for the input year. However, what comes back is a tangle of raw HTML, so we need a couple more functions to recover the original table. I also created a separate generator for the year 2021, because that data is still updating as the season completes. This way, as meets are completed and new data is uploaded to TFRRS, my functions will automatically update.
def table_generator(url, event):
    url_to_search = url
    div_class = event_code_generator(event)
    req_generator = Request(url_to_search, headers={'User-Agent': 'Mozilla/5.0'})
    webpage_generator = urlopen(req_generator).read()
    page_soup_generator = soup(webpage_generator, "html.parser")
    # Find the div that wraps this event's performance list, then pull out its <table>
    html_class = page_soup_generator.find_all("div", class_=div_class)
    for i in html_class:
        table_html_format = i.find("table")
    return table_html_format
The twenty_twenty_one table generator was built specifically for this year, because that information lives at a different base URL and is still updating on a weekly or daily basis.
def twenty_twenty_one_table_generator(event):
    # The 2021 qualifying list lives on its own page and updates as the season progresses
    url_to_search = "https://www.tfrrs.org/lists/3191/2021_NCAA_Division_I_Outdoor_Qualifying/2021/o?gender=m"
    div_class = event_code_generator(event)
    req_generator = Request(url_to_search, headers={'User-Agent': 'Mozilla/5.0'})
    webpage_generator = urlopen(req_generator).read()
    page_soup_generator = soup(webpage_generator, "html.parser")
    html_class = page_soup_generator.find_all("div", class_=div_class)
    for i in html_class:
        table_html_format = i.find("table")
    return table_html_format
The split minutes and seconds function turns a time string into a float. For example, a time such as "3:39.7" would be converted to 219.7 seconds. Working with a single number makes the data visualization process easier. The function does little on its own; it is used in conjunction with the table formatter function below.
def split_minutes_and_seconds(time_str):
    """Get seconds from a "minutes:seconds" time string."""
    split_list = time_str.split(":")
    return int(split_list[0]) * 60 + float(split_list[1])
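As a quick sanity check, with purely illustrative values:

# Illustrative values: 3:39.70 is 3*60 + 39.70 seconds
split_minutes_and_seconds("3:39.70")   # -> 219.7
split_minutes_and_seconds("13:52.10")  # -> 832.1
# Note: the function assumes a "minutes:seconds" format, so sprint times
# without a colon (e.g. "10.05") would need separate handling.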
The table formatter turns the raw HTML table into a pandas DataFrame and converts each time to seconds.
def table_formatter(table):
    # Column names; "Year" appears twice, once for the athlete's class year and
    # once for the date of the meet, matching the TFRRS layout
    headings = ["Rank", "Name", "Year", "School", "Time", "Meet", "Year"]
    body = table.find_all("tr")
    # The first <tr> holds the column names, so everything after it is data
    body_rows = body[1:]
    all_rows = []  # will be a list of lists, one list per row
    for row_num in range(len(body_rows)):  # a row at a time
        row = []  # this will hold the entries for one row
        for row_item in body_rows[row_num].find_all("td"):  # loop through all row entries
            # row_item.text removes the tags from the entries.
            # The regex removes \xa0 (non-breaking space), \n (newline),
            # and commas (thousands separators in numbers)
            aa = re.sub("(\xa0)|(\n)|,", "", row_item.text)
            aa_stripped = aa.strip()
            # append the cleaned entry to the row
            row.append(aa_stripped)
        # append one complete row to all_rows
        all_rows.append(row)
    new_df = pd.DataFrame(data=all_rows, columns=headings)
    # Keep only the first seven characters of each time string, then convert it to seconds
    new_df["Time"] = [i[:7] for i in new_df["Time"]]
    new_df["Time"] = [split_minutes_and_seconds(i) for i in new_df["Time"]]
    return new_df
The TFRRS table generator function combines all of the above functions to properly create the table for a specific year and event.
def tfrrs_table_generator(year, event):
    if year == "2021":
        # 2021 comes from its own, still-updating list
        table_created = twenty_twenty_one_table_generator(event)
        table_formatted = table_formatter(table_created)
    else:
        url = url_generator(year)
        table_created = table_generator(url, event)
        table_formatted = table_formatter(table_created)
    return table_formatted
tfrrs_table_generator("2012", "1500m")
Here's the output:
| | Rank | Name | Year | School | Time | Meet | Year |
|---|---|---|---|---|---|---|---|
| 0 | 1 | Lalang Lawi | | Arizona | 216.77 | Payton Jordan Cardinal Invitational | Apr 29 2012 |
| 1 | 2 | O'Hare Chris | JR-3 | Tulsa | 217.95 | Payton Jordan Cardinal Invitational | Apr 29 2012 |
| 2 | 3 | van Ingen Erik | SR-4 | Binghamton | 218.06 | Virginia Challenge | May 12 2012 |
| 3 | 4 | Leslie Cory | SR-4 | Ohio State | 219.00 | Payton Jordan Cardinal Invitational | Apr 29 2012 |
| 4 | 5 | Hammond Michael | SR-4 | Virginia Tech | 219.22 | Payton Jordan Cardinal Invitational | Apr 29 2012 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 96 | Shawel Johnathan | SR-4 | Notre Dame | 225.65 | GVSU 2nd to Last Chance Meet | May 11 2012 |
| 96 | 97 | Goose Mitch | SR-4 | Iona | 225.70 | Princeton Larry Ellis Invitational | Apr 20 2012 |
| 97 | 98 | Vaziri Shyan | FR-1 | UC Santa Barbara | 225.71 | Big West Track & Field Championships | May 11 2012 |
| 98 | 99 | Kalinowski Grzegorz | SO-2 | Eastern Michigan | 225.73 | 54th Annual Mt. SAC Relays | Apr 19 2012 |
| 99 | 100 | Dowd Kevin | | Virginia Tech | 225.79 | Wolfpack Last Chance (College) | May 13 2012 |

100 rows × 7 columns
The code above is sufficient to create plots for every archived year and for 2021.
Now that we have all of the functions written, let's look at cleaned kernel density plots for each event, 800m to 10,000m, compared from 2012 until now.
For these kernel density plots, we need to compile the times for each event for each year. The function below stores each year's top-100 times as a column of a single DataFrame, one DataFrame per event.
def y_value_generator(event):
    # Start with a Rank column (1-100), then add one column of times per year
    rank = np.arange(1, 101, 1)
    base_df = pd.DataFrame(data=rank, columns=["Rank"])
    # 2012 through 2019; 2020 is skipped because the season was cancelled by the pandemic
    for i in range(2012, 2020):
        year = str(i)
        y_values = tfrrs_table_generator(year, event)["Time"]
        base_df[year] = y_values
    # 2021 comes from its own, still-updating list
    twenty_one = tfrrs_table_generator("2021", event)["Time"]
    base_df["2021"] = twenty_one
    return base_df
The function above gathers all of the times for every year, for each event from the 800m to the 10,000m; the summary statistics print-outs below confirm that it works as intended. Now it's time to generate the values for each event.
#generate all the values for the years from 2012 until now, for the 800m
eight_hundred = y_value_generator("800m")
#generate all the values for the years from 2012 until now, for the 1500m
fifteen_hundred = y_value_generator("1500m")
#generate all the values for the years from 2012 until now, for the 5000m
five_thousand = y_value_generator("5000m")
#generate all the values for the years from 2012 until now, for the 10,000m
ten_thousand = y_value_generator("10,000m")
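The summary tables in the sections below match the standard pandas describe() print-out for each event's DataFrame. As a minimal sketch of how that print-out and the kernel density plots can be produced from these DataFrames (assuming matplotlib and seaborn are installed; the styling choices here are my own):

import matplotlib.pyplot as plt
import seaborn as sns

# Summary statistics for the 800m (this is what the tables below show)
print(eight_hundred.describe())

# Kernel density estimate of the top-100 times, one curve per year
plt.figure(figsize=(10, 6))
for year in ["2012", "2013", "2014", "2015", "2016", "2017", "2018", "2019", "2021"]:
    sns.kdeplot(eight_hundred[year], label=year)
plt.xlabel("Time (seconds)")
plt.title("800m NCAA top-100 times, kernel density by year")
plt.legend()
plt.show()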
800m Kernel Density
The kernel density for the 800m, years 2012-2021 (all times in seconds).
| | Rank | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2021 |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 100.000000 | 100.000000 | 100.000000 | 100.00000 | 100.000000 | 100.000000 | 100.000000 | 100.000000 | 100.000000 | 100.000000 |
| mean | 50.500000 | 108.741600 | 109.148700 | 108.85960 | 108.378000 | 108.189200 | 108.419100 | 108.629700 | 108.754000 | 108.403100 |
| std | 29.011492 | 1.044447 | 0.850278 | 0.96219 | 1.061886 | 1.155218 | 1.211719 | 1.154088 | 1.128423 | 1.019719 |
| min | 1.000000 | 104.750000 | 106.200000 | 105.35000 | 105.580000 | 104.630000 | 103.730000 | 103.250000 | 104.760000 | 105.160000 |
| 25% | 25.750000 | 108.240000 | 108.670000 | 108.47750 | 107.717500 | 107.335000 | 107.765000 | 108.040000 | 108.275000 | 107.927500 |
| 50% | 50.500000 | 109.030000 | 109.290000 | 109.19500 | 108.545000 | 108.500000 | 108.745000 | 108.990000 | 108.925000 | 108.755000 |
| 75% | 75.250000 | 109.495000 | 109.810000 | 109.56500 | 109.252500 | 109.142500 | 109.302500 | 109.452500 | 109.610000 | 109.190000 |
| max | 100.000000 | 109.890000 | 110.210000 | 109.90000 | 109.730000 | 109.690000 | 109.830000 | 109.870000 | 110.060000 | 109.530000 |
800m Analysis
2020 is not included because the season was cancelled by the pandemic. As results continue to come in and we head into the postseason, let's take note: while this year's average is not an outlier compared to years past, the percentiles show a downward shift toward a concentration of faster times. Right now, this year's summary statistics are closest to those of 2016 and 2017.
1500m Kernel Density
The kernel density for the 1500m, years 2012-2021.
| | Rank | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2021 |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 100.000000 | 100.000000 | 100.000000 | 100.000000 | 100.000000 | 100.000000 | 100.000000 | 100.000000 | 100.000000 | 100.000000 |
| mean | 50.500000 | 223.449200 | 223.508700 | 223.412600 | 222.748600 | 222.981800 | 223.132800 | 222.998600 | 223.486600 | 220.620800 |
| std | 29.011492 | 1.957686 | 1.774577 | 1.811622 | 1.611957 | 1.605672 | 1.659264 | 2.259134 | 2.174883 | 2.130243 |
| min | 1.000000 | 216.770000 | 218.530000 | 216.340000 | 218.350000 | 217.740000 | 215.990000 | 215.010000 | 217.200000 | 214.680000 |
| 25% | 25.750000 | 222.592500 | 222.247500 | 222.602500 | 221.987500 | 222.275000 | 222.355000 | 222.402500 | 222.927500 | 218.990000 |
| 50% | 50.500000 | 223.970000 | 224.185000 | 223.895000 | 223.185000 | 223.470000 | 223.220000 | 223.805000 | 224.430000 | 221.165000 |
| 75% | 75.250000 | 224.980000 | 224.930000 | 224.780000 | 224.010000 | 224.220000 | 224.367500 | 224.597500 | 224.895000 | 222.430000 |
| max | 100.000000 | 225.790000 | 225.680000 | 225.330000 | 224.610000 | 224.930000 | 225.440000 | 225.220000 | 225.620000 | 223.150000 |
1500m Analysis
2020 is not included because the season was cancelled by the pandemic. The striking shape of this graph matches the summary statistics above: the distribution looks as though it has been shifted to the left. Not surprisingly, the 75th, 50th, and 25th percentiles have all shifted down, and the slowest time in the 2021 top-100 is over two seconds faster than its 2019 counterpart.
5,000m Kernel Density
The kernel density for the 5000m, years 2012-2021.
| | Rank | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2021 |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 100.000000 | 100.00000 | 100.000000 | 100.00000 | 100.000000 | 100.000000 | 100.000000 | 100.000000 | 100.000000 | 100.000000 |
| mean | 50.500000 | 833.05800 | 834.613000 | 833.60200 | 832.736000 | 832.357000 | 833.640000 | 831.041000 | 831.195000 | 821.896000 |
| std | 29.011492 | 10.80636 | 10.625248 | 9.40652 | 9.044129 | 8.876655 | 8.766822 | 10.151828 | 9.936428 | 8.773674 |
| min | 1.000000 | 798.40000 | 795.300000 | 806.90000 | 800.300000 | 804.200000 | 797.500000 | 798.700000 | 805.000000 | 799.900000 |
| 25% | 25.750000 | 828.00000 | 830.400000 | 828.90000 | 830.250000 | 828.725000 | 829.600000 | 822.850000 | 825.525000 | 816.425000 |
| 50% | 50.500000 | 836.85000 | 837.900000 | 836.00000 | 834.400000 | 834.850000 | 835.750000 | 832.650000 | 834.400000 | 824.700000 |
| 75% | 75.250000 | 841.62500 | 842.225000 | 840.47500 | 839.150000 | 838.800000 | 840.275000 | 839.750000 | 839.375000 | 828.650000 |
| max | 100.000000 | 844.30000 | 846.400000 | 845.10000 | 844.400000 | 842.400000 | 844.200000 | 843.500000 | 843.300000 | 832.100000 |
5,000m Analysis
2020 is not included because the season was cancelled by the pandemic. Once again, the kernel density plot looks as though the entire distribution was picked up and shifted toward faster times. Compared to 2019, the 2021 75th percentile is about 10.7 seconds faster, the 50th about 9.7 seconds faster, and the 25th about 9.1 seconds faster.
10,000m Kernel Density
The kernel density for the 10,000m, years 2012-2021.
| | Rank | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2021 |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 100.000000 | 100.000000 | 100.000000 | 100.000000 | 100.000000 | 100.000000 | 100.000000 | 100.000000 | 100.000000 | 100.000000 |
| mean | 50.500000 | 1757.587000 | 1766.814000 | 1754.033000 | 1756.877000 | 1758.971000 | 1755.463000 | 1756.511000 | 1746.805000 | 1733.517000 |
| std | 29.011492 | 28.643347 | 19.773306 | 22.069459 | 21.690103 | 17.572248 | 22.791127 | 22.929263 | 21.014883 | 21.015766 |
| min | 1.000000 | 1647.900000 | 1672.300000 | 1656.700000 | 1674.200000 | 1672.700000 | 1684.900000 | 1684.400000 | 1691.300000 | 1667.200000 |
| 25% | 25.750000 | 1749.250000 | 1760.375000 | 1744.350000 | 1745.875000 | 1749.650000 | 1740.050000 | 1748.475000 | 1735.625000 | 1722.050000 |
| 50% | 50.500000 | 1766.700000 | 1771.950000 | 1757.300000 | 1759.800000 | 1759.700000 | 1761.750000 | 1760.000000 | 1753.300000 | 1737.700000 |
| 75% | 75.250000 | 1775.775000 | 1780.275000 | 1771.575000 | 1775.625000 | 1772.800000 | 1773.800000 | 1774.075000 | 1763.325000 | 1750.675000 |
| max | 100.000000 | 1787.900000 | 1787.600000 | 1778.900000 | 1786.600000 | 1782.300000 | 1783.400000 | 1782.600000 | 1772.200000 | 1759.500000 |
10,000m Analysis
2020 is not included because the season was cancelled by the pandemic.
Once again, the percentiles have shifted: compared to 2019, the 2021 75th, 50th, and 25th percentiles are roughly 12.7, 15.6, and 13.6 seconds faster, respectively. Check out the charts, study them for a little while, and see if you notice anything interesting that you think I might have missed.
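For reference, here is a small sketch (my own addition, not part of the original scraper) showing how those percentile shifts can be computed directly from the DataFrames built above:

def percentile_shift(event_df, old="2019", new="2021", quantiles=(0.25, 0.50, 0.75)):
    """Return how many seconds faster the new year is than the old year at each quantile."""
    return {f"{int(q * 100)}%": event_df[old].quantile(q) - event_df[new].quantile(q)
            for q in quantiles}

# Positive values mean 2021 is faster; e.g. for the 10,000m DataFrame built earlier:
print(percentile_shift(ten_thousand))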
Final Remarks
I've been analyzing this data since April, and it has been a joy to run the kernel density plots each week. There is a special satisfaction in making predictions and watching them come to life in real-world results.
I have been talking my family members' ears off since the first indoor races of the season, telling them I thought the spikes would blow track and field times wide open. According to the data, we are witnessing a new era in track and field.
While I still give much credit to the athletes for working extremely hard during quarantine, Nike has also done an amazing job with innovation.
The addition of ZoomX foam to spikes has been transformational, and I think the outdoor seasons in years to come should be quite thrilling. I feel quite confident in saying that I think a world record is possible in the 1500m this year.
I figured that the new spikes would shift the density curves more or less uniformly, and that has been borne out by each and every plot. It's truly astounding.
If you found something in this project interesting, have a criticism or suggestion on what I can improve, or would like to collaborate on a project, feel free to shoot me an email: colinfitzgerald@berkeley.edu.
Have a great day. I hope you enjoyed reading this, and thank you for your time.