BuzzFeedNews / figure-skating-scores

ISU Figure Skating Score Sheets as Structured Data

At the end of each competition it oversees, the International Skating Union releases a PDF containing all scores given for each performance. That report is known as a "Protocol," and an example can be found here. The code in this repository downloads a series of protocol PDFs, and then extracts structured data from the scoring sheets they contain.

Currently, the data in this repository includes every major international competition from October 2016 through December 2017. You can find a list of those 17 competitions below.

Competitions Included

2016–17 season:

  • ISU GP 2016 Progressive Skate America (Oct. 20-23, 2016)
  • ISU GP 2016 Skate Canada International (Oct. 27-30, 2016)
  • ISU GP Rostelecom Cup 2016 (Nov. 4-6, 2016)
  • ISU GP Trophee de France 2016 (Nov. 11-13, 2016)
  • ISU GP Audi Cup of China 2016 (Nov. 17-20, 2016)
  • ISU GP NHK Trophy 2016 (Nov. 25-27, 2016)
  • ISU Grand Prix of Figure Skating Final 2016 (Dec. 8-11, 2016)
  • ISU European Figure Skating Championships 2017 (Jan. 23-29, 2017)
  • ISU Four Continents Championships 2017 (Feb. 14-19, 2017)
  • ISU World Figure Skating Championships 2017 (Mar. 27 - Apr. 2, 2017)

2017–18 season:

  • ISU GP Rostelecom Cup 2017 (Oct. 20-22, 2017)
  • ISU GP 2017 Skate Canada International (Oct. 27-29, 2017)
  • ISU GP Audi Cup of China 2017 (Nov. 3-5, 2017)
  • ISU GP NHK Trophy 2017 (Nov. 10-12, 2017)
  • ISU GP Internationaux de France de Patinage 2017 (Nov. 17-19, 2017)
  • ISU GP 2017 Bridgestone Skate America (Nov. 24-26, 2017)
  • Grand Prix Final 2017 Senior and Junior (Dec. 7-10, 2017)

Data

The structured data in this repository is available in two formats: JSON and CSV.

CSV Structure

The CSV-formatted data is split up into four files:

  • programs.csv: One row for each program at each competition, e.g., the "ICE DANCE FREE DANCE" at the "Grand Prix Final 2017 Senior and Junior". Each row includes a reference to the source PDF.

  • performances.csv: One row for each skater/team, for each program.

  • judged-aspects.csv: One row for each "executed element" and "program component", for each performance at each competition.

  • judge-scores.csv: One row for each judge, for each judged aspect, for each performance at each competition.
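The four files nest inside one another, so a single judge's mark can be traced back to the skater who earned it. The following is a minimal sketch of that lookup using only Python's standard library. It assumes the ID columns (performance_id, aspect_id) described in the data dictionary; the inline sample rows are invented for illustration and are not real scores.

```python
import csv
import io

# Tiny made-up stand-ins for three of the CSV files (NOT real data).
PERFORMANCES_CSV = """performance_id,competition,program,name,nation
p1,Example Cup 2017,LADIES SHORT PROGRAM,Skater A,AAA
"""

JUDGED_ASPECTS_CSV = """aspect_id,performance_id,aspect_num,aspect_desc
a1,p1,1,2Lz
a2,p1,1,Skating Skills
"""

JUDGE_SCORES_CSV = """aspect_id,judge,score
a1,J1,1
a1,J2,0
a2,J1,7.5
a2,J2,7.75
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def join_scores(performances, aspects, scores):
    """Attach aspect and performance details to every judge score row."""
    perf_by_id = {p["performance_id"]: p for p in performances}
    aspect_by_id = {a["aspect_id"]: a for a in aspects}
    joined = []
    for score in scores:
        aspect = aspect_by_id[score["aspect_id"]]
        perf = perf_by_id[aspect["performance_id"]]
        # Later dicts win on shared keys; the shared keys hold equal values.
        joined.append({**perf, **aspect, **score})
    return joined


rows = join_scores(
    read_rows(PERFORMANCES_CSV),
    read_rows(JUDGED_ASPECTS_CSV),
    read_rows(JUDGE_SCORES_CSV),
)
for row in rows:
    print(row["name"], row["aspect_desc"], row["judge"], row["score"])
```

To run this against the real files, replace the inline strings with the contents of the CSVs on disk. The same joins can be written with pandas, which the extraction scripts already depend on.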

Data Dictionary

  • programs.csv:

    • competition: The name of the competition, e.g., "ISU European Figure Skating Championships 2017".
    • program: The name of the program, e.g., "LADIES SHORT PROGRAM".
    • pdf: The filename of the corresponding Protocol PDF.
  • performances.csv:

    • performance_id: An ID unique to each performance in a program of a competition. Autogenerated for the CSV files.
    • competition: The name of the competition, e.g., "ISU European Figure Skating Championships 2017".
    • program: The name of the program, e.g., "LADIES SHORT PROGRAM".
    • name: The name(s) of the skater(s).
    • nation: The home country of the skater(s).
    • rank: Final place in the program.
    • starting_number: The order in which the skaters skated.
    • total_segment_score: The total score for the program.
    • total_element_score: The total score of all elements in the program.
    • total_component_score: The total score of all components in the program.
    • total_deductions: The total deductions given by the technical panel for the performance.
  • judged-aspects.csv:

    • aspect_id: An ID unique to each element or component during a skater's performance. Autogenerated for the CSV files.
    • performance_id: See above.
    • section: The type of aspect; either element or component.
    • aspect_num: The positional order of the aspect within the performance and section.
    • aspect_desc: Shorthand notation for the aspect. For instance, a double lutz would be marked 2Lz.
    • info_flag: A marking by the technical panel, such as "<" for an under-rotated jump.
    • credit_flag: An "X" in this column means that the skater received "credit for highlight distribution" for that element, which increases the base value.
    • base_value: The base number of points for the performed element.
    • factor: The amount by which the component score is multiplied to calculate its final value.
    • goe: The overall translated Grade of Execution (GOE) given by the judging panel.
    • scores_of_panel: The judging panel's total score for the aspect.
  • judge-scores.csv:

    • aspect_id: See above.
    • judge: The identifier assigned to the judge, e.g., "J1".
    • score: The GOE (for elements) or score (for components) awarded by the judge for the aspect.
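As a rough illustration of how these columns relate, the sketch below works through the arithmetic with invented numbers. It rests on assumptions drawn from the general ISU Judging System rather than from this repository: that a panel score is a trimmed mean of the judges' marks (highest and lowest dropped), that an element's panel score is base_value plus goe, and that the segment total is elements plus factored components minus deductions. The function names are hypothetical, and the sign used for total_deductions in the CSV files is not confirmed here, so the code takes its absolute value.

```python
def trimmed_mean(scores):
    """Average the marks after dropping one highest and one lowest.

    ASSUMPTION: mirrors the usual ISU panel calculation; not taken
    from this repository's code.
    """
    if len(scores) < 3:
        raise ValueError("need at least three scores to trim")
    kept = sorted(scores)[1:-1]
    return round(sum(kept) / len(kept), 2)


def element_points(base_value, goe):
    """Panel score for an element: base value plus translated GOE."""
    return round(base_value + goe, 2)


def component_points(panel_score, factor):
    """Final value of a component: panel score times its factor."""
    return round(panel_score * factor, 2)


def segment_total(total_element, total_component, total_deductions):
    """Segment score; abs() guards against either sign convention."""
    return round(total_element + total_component - abs(total_deductions), 2)


# Invented marks from nine judges for one program component.
judge_marks = [7.5, 7.75, 8.0, 7.25, 7.5, 7.75, 8.25, 7.5, 7.75]
panel = trimmed_mean(judge_marks)
print("panel score:", panel)
print("factored component:", component_points(panel, 0.8))
print("element:", element_points(2.10, 0.30))
print("segment:", segment_total(38.50, 33.20, 1.00))
```

Comparing values recomputed this way against scores_of_panel and total_segment_score is one way to sanity-check the extracted data, though small rounding differences are possible.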

Downloading the PDFs

This repository does not contain the PDFs themselves.

You can, however, find a list of the URLs of each PDF in the scripts/urls.txt file.

To automate the process of downloading the PDFs, download or clone this repository to your computer, navigate to the repository's root directory, and run sh scripts/download_pdfs.sh.
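For readers who would rather stay in Python, the download step can be approximated as follows. This is only a sketch and not a port of scripts/download_pdfs.sh: it assumes scripts/urls.txt holds one URL per line, and the pdfs destination directory and all function names are hypothetical.

```python
import os
import urllib.request
from urllib.parse import urlparse


def parse_url_list(text):
    """Return the non-empty, stripped lines of a URL list."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def local_pdf_path(url, dest_dir):
    """Build a local path from the last segment of the URL's path."""
    filename = os.path.basename(urlparse(url).path)
    return os.path.join(dest_dir, filename)


def download_pdfs(url_file="scripts/urls.txt", dest_dir="pdfs"):
    """Fetch every listed PDF, skipping files that already exist.

    ASSUMPTIONS: one URL per line in url_file; dest_dir is a
    hypothetical location, not necessarily where the repository's
    extraction scripts expect to find the PDFs.
    """
    with open(url_file) as f:
        urls = parse_url_list(f.read())
    os.makedirs(dest_dir, exist_ok=True)
    for url in urls:
        target = local_pdf_path(url, dest_dir)
        if os.path.exists(target):
            continue
        urllib.request.urlretrieve(url, target)
        print("saved", target)


# Usage, from the repository's root directory:
#     download_pdfs()
```

Skipping files that already exist makes the function safe to re-run after an interrupted download.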

Extracting the Data Yourself

If you'd like to re-run the data-extraction scripts yourself, do the following:

  • Download or clone this repository to your computer
  • Navigate to the repository's root directory
  • Download the PDFs, per the instructions above
  • Ensure that you have Python 3 installed
  • Install the required libraries (ideally in a Python 3 virtual environment) by running pip3 install pandas==1.2.4; pip3 install -e git+https://github.com/jsvine/pdfplumber@v0.6.0-alpha#egg=pdfplumber
  • Run make reproduce

That last step will clear all previously extracted data and then re-run the PDF-to-JSON and JSON-to-CSV extractions.

That process will overwrite the data/parsing-log.txt file, which contains a transcript of each page that has been parsed, and whether the parser found any score sheets on that particular page.

Licensing

All code in this repository is available under the MIT License. All data files are available under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.

Questions / Feedback

Contact Jeremy Singer-Vine at jeremy.singer-vine@buzzfeed.com and John Templon at john.templon@buzzfeed.com.

Looking for more from BuzzFeed News? Click here for a list of our open-sourced projects, data, and code.
