
Statistical Updates - Collegiate Track and Field

By Colin FitzGerald

Author’s Note: This page will be updated weekly as new data becomes available.

Page last updated: 5/16/2021 at 10:09 P.M.

As I watched the times go up on the board, I couldn’t believe my eyes.

“That many guys broke 3:40?!”

I couldn’t believe it. Then again, maybe I could.

With the release of Nike’s new Dragonfly and Air Zoom Victory spikes, the running world has experienced a new depth of speed across the board. The shocking results of each meet parallel those of 2017, when Nike first released the Vaporfly.

When new shoe technology is released, people are quick to question the credibility of performances.

Are we experiencing advancements in diet, coaching, and training that account for the sheer depth and difference in times? Is there a new miracle drug? How can we explain this insanity?

Those are all questions I, among others, ask myself as a fan of track and field.

This year’s depth across distance events from 800 meters to 10,000 meters has been remarkable. Times that used to be considered “other-worldly” are now commonly run in prelim heats.

Go back to the 2016 Olympic Trials.

The very best runners lined up and went toe to toe in the final of the men’s 1500m. The result? A commanding win by Matthew Centrowitz. The time? 3:34.09, with a 53.95 final lap.

By comparison, Yared Nuguse just soloed a 3:34.68 in the prelims of the Men’s 1500m at ACC’s.

That got me thinking: is there any way I can show that this year’s results are statistically different from those of past years?

I went to work and created the project below, which compares statistics for the distance events from 800m to 10,000m, starting in 2012 and ending with this year.

The project and this page will be updated over time as I continue working on it and adding functionality. Here is what I have so far:

The Project

The TFRRS website has an archive page that contains the NCAA Track and Field Outdoor Final Qualifying lists for every year since 2012.

The following functions reproduce those tables from the TFRRS website as pandas DataFrames that can be manipulated directly for data visualization.

I collected the data this way so that I do not have to download each dataset to my computer; instead, I can pull each year’s list directly into Python.

I spent time studying the TFRRS website in order to properly configure my web scraper. The following functions are a cleaned-up, more robust version of code from several Python Jupyter notebooks.

The first function below generates the URL on the TFRRS website. You input a year as a string, and the function searches the TFRRS archive HTML for the correct link to the outdoor performance list for that year. The imports used throughout the project are included at the top of this first code block.

import re
from urllib.request import Request, urlopen
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup as soup

def url_generator(year): 
    base_url = "https://www.tfrrs.org/archives.html"
    req = Request(base_url, headers = {'User-Agent': 'Mozilla/5.0'})
    webpage = urlopen(req).read()
    page_soup = soup(webpage, "html.parser")
    # find the archive link whose text contains, e.g., "2015 Outdoor"
    url_search = page_soup.find_all("a", text=re.compile(year + " Outdoor"))
    refined_url = url_search[0]["href"].strip("//").partition(")")
    return "http://" + refined_url[0] + refined_url[1]

The event_dictionary maps each discipline to its corresponding “div” class in the TFRRS website HTML. The event_code_generator function, which you will see used later in the code, looks events up in that dictionary. Having the event dictionary allows our table generator to be more robust and abstract.

event_dictionary = {"100m": "row Men 6", 
                    "200m": "row Men 7", 
                    "400m": "row Men 11", 
                    "800m": "row Men 12", 
                    "1500m": "row Men 13", 
                    "5000m": "row Men 21", 
                    "10,000m": "row Men 22"}

def event_code_generator(event): 
    return event_dictionary[event]
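For example, looking up the 1500m returns its div class straight from the dictionary above:

event_code_generator("1500m")  # "row Men 13"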

The table generator function generates a table from the URL created for the input year. However, the table is a bunch of gobbledygook HTML, so we need a couple more functions to recover the original table. I also created a separate function for the year 2021, because that data is still updating as the season completes. This way, as meets are completed and new data is uploaded to TFRRS, my functions automatically update.

def table_generator(url, event): 
    url_to_search = url
    div_class = event_code_generator(event)
    req_generator = Request(url_to_search, headers = {'User-Agent': 'Mozilla/5.0'})
    webpage_generator = urlopen(req_generator).read()
    page_soup_generator = soup(webpage_generator, "html.parser")
    # locate the div for the requested event and pull out its <table> element
    html_class = page_soup_generator.find_all("div", class_ = div_class)
    for i in html_class: 
        table_html_format = i.find("table")
    return table_html_format

The twenty_twenty_one table generator was built specifically for this year because that information lives at a different base URL and is still being updated on a weekly or daily basis.

def twenty_twenty_one_table_generator(event): 
    url_to_search = "https://www.tfrrs.org/lists/3191/2021_NCAA_Division_I_Outdoor_Qualifying/2021/o?gender=m"
    div_class = event_code_generator(event)
    req_generator = Request(url_to_search, headers = {'User-Agent': 'Mozilla/5.0'})
    webpage_generator = urlopen(req_generator).read()
    page_soup_generator = soup(webpage_generator, "html.parser")
    html_class = page_soup_generator.find_all("div", class_ = div_class)
    for i in html_class: 
        table_html_format = i.find("table")
    return table_html_format

The split minutes and seconds function turns a time string into a float in seconds. For example, a time such as “3:39.7” is converted to 219.7, which makes the data visualization easier. On its own the function is just a small helper; it is used in conjunction with the table formatter function below.

def split_minutes_and_seconds(time_str):
    """Get Seconds from time."""
    split_list = time_str.split(":")
    return int(split_list[0]) * 60 + float(split_list[1])
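A quick sanity check using the example from the text:

print(split_minutes_and_seconds("3:39.7"))  # 3*60 + 39.7 = 219.7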

The table formatter turns that raw HTML table into a pandas DataFrame and converts the times to seconds.

def table_formatter(table): 
    headings = ["Rank", "Name", "Year", "School", "Time", "Meet", "Year"]
    # every <tr> in the table; the first one holds the column names
    body = table.find_all("tr")
    body_rows = body[1:]  # all remaining rows are the data rows
    all_rows = []  # will be a list of lists, one inner list per row
    for row_num in range(len(body_rows)):  # a row at a time
        row = []  # this will hold the entries for one row
        for row_item in body_rows[row_num].find_all("td"):  # loop through all row entries
            # row_item.text removes the tags from the entries.
            # The regex removes \xa0 (a non-breaking space), \n (the newline),
            # and the comma that separates thousands in numbers.
            aa = re.sub("(\xa0)|(\n)|,", "", row_item.text)
            row.append(aa.strip())
        # append one row to all_rows
        all_rows.append(row)
    # build the DataFrame once, after all rows have been collected
    new_df = pd.DataFrame(data=all_rows, columns=headings)
    new_df["Time"] = [i[:7] for i in new_df["Time"]]
    new_df["Time"] = [split_minutes_and_seconds(i) for i in new_df["Time"]]
    return new_df

The TFRRS table generator function combines all of the above functions to properly create the table for a specific year and event.

def tfrrs_table_generator(year, event): 
    if year == "2021": 
        table_created = twenty_twenty_one_table_generator(event)
        table_formatted = table_formatter(table_created)
    else: 
        url = url_generator(year)
        table_created = table_generator(url, event)
        table_formatted = table_formatter(table_created)
    return table_formatted

tfrrs_table_generator("2012", "1500m")

Here’s the output:

Rank Name Year School Time Meet Year
0 1 Lalang Lawi Arizona 216.77 Payton Jordan Cardinal Invitational Apr 29 2012
1 2 O'Hare Chris JR-3 Tulsa 217.95 Payton Jordan Cardinal Invitational Apr 29 2012
2 3 van Ingen Erik SR-4 Binghamton 218.06 Virginia Challenge May 12 2012
3 4 Leslie Cory SR-4 Ohio State 219.00 Payton Jordan Cardinal Invitational Apr 29 2012
4 5 Hammond Michael SR-4 Virginia Tech 219.22 Payton Jordan Cardinal Invitational Apr 29 2012
... ... ... ... ... ... ... ...
95 96 Shawel Johnathan SR-4 Notre Dame 225.65 GVSU 2nd to Last Chance Meet May 11 2012
96 97 Goose Mitch SR-4 Iona 225.70 Princeton Larry Ellis Invitational Apr 20 2012
97 98 Vaziri Shyan FR-1 UC Santa Barbara 225.71 Big West Track & Field Championships May 11 2012
98 99 Kalinowski Grzegorz SO-2 Eastern Michigan 225.73 54th Annual Mt. SAC Relays Apr 19 2012
99 100 Dowd Kevin Virginia Tech 225.79 Wolfpack Last Chance (College) May 13 2012

100 rows × 7 columns

The code above is sufficient to create plots for all of the archived years as well as 2021.

Now that we have all of the functions written, let’s look at cleaned kernel density plots for each event, 800m to 10,000m, compared from 2012 until now.

For these kernel density plots, we need to compile the top-100 times for each event and each year. Each year’s times are stored as a column, producing one DataFrame per event.

def y_value_generator(event): 
    rank = np.arange(1, 101, 1)
    base_df = pd.DataFrame(data = rank, columns = ["Rank"])
    # 2012 through 2019; 2020 is skipped because of the pandemic
    for i in range(2012, 2020): 
        year = str(i)
        y_values = tfrrs_table_generator(year, event)["Time"]
        base_df[year] = y_values
    # 2021 comes from its own, still-updating qualifying list
    twenty_one = tfrrs_table_generator("2021", event)["Time"]
    base_df["2021"] = twenty_one
    return base_df

The function above collects the times for every year for a given event; the summary statistics print-outs for each event below verify that it works properly. Now it’s time to generate the values for each event, 800m through 10,000m.

#generate all the values for the years from 2012 until now, for the 800m 
eight_hundred = y_value_generator("800m")
#generate all the values for the years from 2012 until now, for the 1500m
fifteen_hundred = y_value_generator("1500m")
#generate all the values for the years from 2012 until now, for the 5000m
five_thousand = y_value_generator("5000m")
#generate all the values for the years from 2012 until now, for the 10,000m
ten_thousand = y_value_generator("10,000m")
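The plots below were produced from these DataFrames. The exact plotting code is not shown on this page; the following is a minimal sketch of how the kernel density plots and summary statistics could be generated, assuming seaborn’s kdeplot. The helper name plot_event_kde and the styling are illustrative, not the exact code behind the figures below.

import matplotlib.pyplot as plt
import seaborn as sns

def plot_event_kde(event_df, event_name):
    """Overlay the kernel density of the top-100 times for each year of one event."""
    years = [col for col in event_df.columns if col != "Rank"]
    fig, ax = plt.subplots(figsize=(10, 6))
    for year in years:
        sns.kdeplot(event_df[year], ax=ax, label=year)
    ax.set_xlabel("Time (seconds)")
    ax.set_ylabel("Density")
    ax.set_title(event_name + " NCAA top-100 times, kernel density by year")
    ax.legend(title="Year")
    plt.show()

# kernel density plot and describe() summary for the 800m; the other events work the same way
plot_event_kde(eight_hundred, "800m")
print(eight_hundred.describe())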

800m Kernel Density


The kernel density for the 800m, years 2012-2021 (times in seconds).

Rank 2012 2013 2014 2015 2016 2017 2018 2019 2021
count 100.000000 100.000000 100.000000 100.00000 100.000000 100.000000 100.000000 100.000000 100.000000 100.000000
mean 50.500000 108.741600 109.148700 108.85960 108.378000 108.189200 108.419100 108.629700 108.754000 108.403100
std 29.011492 1.044447 0.850278 0.96219 1.061886 1.155218 1.211719 1.154088 1.128423 1.019719
min 1.000000 104.750000 106.200000 105.35000 105.580000 104.630000 103.730000 103.250000 104.760000 105.160000
25% 25.750000 108.240000 108.670000 108.47750 107.717500 107.335000 107.765000 108.040000 108.275000 107.927500
50% 50.500000 109.030000 109.290000 109.19500 108.545000 108.500000 108.745000 108.990000 108.925000 108.755000
75% 75.250000 109.495000 109.810000 109.56500 109.252500 109.142500 109.302500 109.452500 109.610000 109.190000
max 100.000000 109.890000 110.210000 109.90000 109.730000 109.690000 109.830000 109.870000 110.060000 109.530000

800m Analysis

2020 is not included because of the pandemic. As results continue to come in and we head into the postseason, let’s take note: while this year’s average is not an outlier compared to years past, the percentiles show a shift down toward a concentration of faster times. Right now, this year’s data is most comparable to 2016.

1500m Kernel Density


The kernel density for the 1500m, years 2012-2021 (times in seconds).

Rank 2012 2013 2014 2015 2016 2017 2018 2019 2021
count 100.000000 100.000000 100.000000 100.000000 100.000000 100.000000 100.000000 100.000000 100.000000 100.000000
mean 50.500000 223.449200 223.508700 223.412600 222.748600 222.981800 223.132800 222.998600 223.486600 220.620800
std 29.011492 1.957686 1.774577 1.811622 1.611957 1.605672 1.659264 2.259134 2.174883 2.130243
min 1.000000 216.770000 218.530000 216.340000 218.350000 217.740000 215.990000 215.010000 217.200000 214.680000
25% 25.750000 222.592500 222.247500 222.602500 221.987500 222.275000 222.355000 222.402500 222.927500 218.990000
50% 50.500000 223.970000 224.185000 223.895000 223.185000 223.470000 223.220000 223.805000 224.430000 221.165000
75% 75.250000 224.980000 224.930000 224.780000 224.010000 224.220000 224.367500 224.597500 224.895000 222.430000
max 100.000000 225.790000 225.680000 225.330000 224.610000 224.930000 225.440000 225.220000 225.620000 223.150000

1500m Analysis

2020 is not included because of the pandemic. The striking shape of this graph is reflected in the summary statistics above: the distribution looks as though it has been shifted to the left. Not surprisingly, the 75th, 50th, and 25th percentiles have all shifted down, and the slowest time in the 2021 list is over two seconds faster than its 2019 counterpart.

5,000m Kernel Density


The kernel density for the 5000m, years 2012-2021 (times in seconds).

Rank 2012 2013 2014 2015 2016 2017 2018 2019 2021
count 100.000000 100.00000 100.000000 100.00000 100.000000 100.000000 100.000000 100.000000 100.000000 100.000000
mean 50.500000 833.05800 834.613000 833.60200 832.736000 832.357000 833.640000 831.041000 831.195000 821.896000
std 29.011492 10.80636 10.625248 9.40652 9.044129 8.876655 8.766822 10.151828 9.936428 8.773674
min 1.000000 798.40000 795.300000 806.90000 800.300000 804.200000 797.500000 798.700000 805.000000 799.900000
25% 25.750000 828.00000 830.400000 828.90000 830.250000 828.725000 829.600000 822.850000 825.525000 816.425000
50% 50.500000 836.85000 837.900000 836.00000 834.400000 834.850000 835.750000 832.650000 834.400000 824.700000
75% 75.250000 841.62500 842.225000 840.47500 839.150000 838.800000 840.275000 839.750000 839.375000 828.650000
max 100.000000 844.30000 846.400000 845.10000 844.400000 842.400000 844.200000 843.500000 843.300000 832.100000

5,000m Analysis

2020 is not included because of the pandemic. Once again, we see what looks like a kernel density plot that has simply been translated; the graph looks like someone physically picked it up and moved it. Compared to 2019, the 2021 75th percentile is about 10.7 seconds faster, the 50th about 9.7 seconds, and the 25th about 9.1 seconds.

10,000m Kernel Density


The kernel density for the 10,000m, years 2012-2021 (times in seconds).

Rank 2012 2013 2014 2015 2016 2017 2018 2019 2021
count 100.000000 100.000000 100.000000 100.000000 100.000000 100.000000 100.000000 100.000000 100.000000 100.000000
mean 50.500000 1757.587000 1766.814000 1754.033000 1756.877000 1758.971000 1755.463000 1756.511000 1746.805000 1733.517000
std 29.011492 28.643347 19.773306 22.069459 21.690103 17.572248 22.791127 22.929263 21.014883 21.015766
min 1.000000 1647.900000 1672.300000 1656.700000 1674.200000 1672.700000 1684.900000 1684.400000 1691.300000 1667.200000
25% 25.750000 1749.250000 1760.375000 1744.350000 1745.875000 1749.650000 1740.050000 1748.475000 1735.625000 1722.050000
50% 50.500000 1766.700000 1771.950000 1757.300000 1759.800000 1759.700000 1761.750000 1760.000000 1753.300000 1737.700000
75% 75.250000 1775.775000 1780.275000 1771.575000 1775.625000 1772.800000 1773.800000 1774.075000 1763.325000 1750.675000
max 100.000000 1787.900000 1787.600000 1778.900000 1786.600000 1782.300000 1783.400000 1782.600000 1772.200000 1759.500000

10,000m Analysis

2020 is not included because of the pandemic.

Once again, the 75th, 50th, and 25th percentiles have shifted down, by roughly 12.7, 15.6, and 13.6 seconds relative to 2019, respectively. Take some time to study the charts and see if you notice anything interesting that you think I might have missed.

Final Remarks

I’ve been analyzing this data since April, and it has been a joy to run the kernel density plots each week, make predictions, and see them come to life in real-world races.

I have been talking my family members’ ears off since the first indoor races of the season, telling them that I thought the new spikes would blow track and field times wide open. According to the data, we are witnessing a new era in track and field.

While I still give much credit to the athletes for working extremely hard during quarantine, Nike has also done an amazing job with innovation.

The addition of ZoomX foam to spikes has been transformational, and I think the outdoor seasons in years to come should be quite thrilling. I feel quite confident saying that a world record is possible in the 1500m this year.

I figured that the new spikes would cause what amounts to a linear shift in each event’s density of times, and each and every plot corroborates that. It’s truly astounding.

If you found something in this project interesting, have a criticism or suggestion on what I can improve, or would like to collaborate on a project, feel free to shoot me an email at colinfitzgerald@berkeley.edu.

Have a great day. I hope you enjoyed reading this, and thank you for your time.

