beanumber / deuce

R package for web scraping of tennis data

deuce

Overview

The goal of deuce is to provide easy access to many different sets of online data on professional tennis. By making tennis data more available to R users, deuce aims to be a useful tool for tennis analysts and a fun resource for teachers of statistics.

If you are new to tennis analytics, please check out this guide.

Installation

To install in R, use the devtools package and the following:

library(devtools)
install_github("skoval/deuce")

Caution: the package includes 274 MB of data, so the installation may take several minutes.

About the Contents

To find out about the datasets and functions included in deuce, you can use the following command to bring up the package index.

help(package = "deuce")

Datasets

Any of the individual datasets can be loaded with the data command. For example, the following commands load deuce, which makes the atp_matches data available in the R environment, and then use str to display the structure of all of its columns.

library(deuce)
str(atp_matches)
#> 'data.frame':    776620 obs. of  72 variables:
#>  $ tourney_id        : chr  "1968-580" "1968-580" "1968-580" "1968-580" ...
#>  $ tourney_name      : chr  "Australian Chps." "Australian Chps." "Australian Chps." "Australian Chps." ...
#>  $ surface           : chr  "Grass" "Grass" "Grass" "Grass" ...
#>  $ draw_size         : int  64 64 64 64 64 64 64 64 64 64 ...
#>  $ tourney_level     : Factor w/ 8 levels "250 or 500","Challenger",..: 5 5 5 5 5 5 5 5 5 5 ...
#>  $ match_num         : int  1 2 3 4 5 6 7 8 9 10 ...
#>  $ winner_id         : int  110023 109803 100257 100105 109966 107759 100101 100025 108519 109799 ...
#>  $ winner_seed       : chr  NA NA NA "5" ...
#>  $ winner_entry      : chr  "" "" "" "" ...
#>  $ winner_name       : chr  "Richard Coulthard" "John Brown" "Ross Case" "Allan Stone" ...
#>  $ winner_hand       : chr  "R" "R" "R" "R" ...
#>  $ winner_ht         : int  NA NA NA NA NA NA NA 173 NA NA ...
#>  $ winner_ioc        : chr  "AUS" "AUS" "AUS" "AUS" ...
#>  $ winner_age        : num  NA 27.5 16.2 22.3 29.9 ...
#>  $ winner_rank       : int  NA NA NA NA NA NA NA NA NA NA ...
#>  $ winner_rank_points: int  NA NA NA NA NA NA NA NA NA NA ...
#>  $ loser_id          : int  107760 106964 110024 110025 110026 110027 110028 108430 110029 110030 ...
#>  $ loser_seed        : chr  NA NA "15" NA ...
#>  $ loser_entry       : chr  "" "" "" "" ...
#>  $ loser_name        : chr  "Max Senior" "Ernie Mccabe" "Gondo Widjojo" "Robert Layton" ...
#>  $ loser_hand        : chr  "R" "R" "R" "R" ...
#>  $ loser_ht          : int  NA NA NA NA NA NA NA NA NA NA ...
#>  $ loser_ioc         : chr  "AUS" "AUS" "INA" "AUS" ...
#>  $ loser_age         : num  NA NA NA NA NA NA NA NA NA NA ...
#>  $ loser_rank        : int  NA NA NA NA NA NA NA NA NA NA ...
#>  $ loser_rank_points : int  NA NA NA NA NA NA NA NA NA NA ...
#>  $ score             : chr  "12-10 7-5 4-6 7-5" "6-3 6-2 6-4" "6-4 3-6 6-3 7-5" "6-4 6-2 6-1" ...
#>  $ best_of           : int  5 5 5 5 5 5 5 5 5 5 ...
#>  $ round             : Ord.factor w/ 22 levels "BR"<"Q1"<"Q2"<..: 10 10 10 10 10 10 10 10 10 10 ...
#>  $ minutes           : int  NA NA NA NA NA NA NA NA NA NA ...
#>  $ w_ace             : int  NA NA NA NA NA NA NA NA NA NA ...
#>  $ w_df              : int  NA NA NA NA NA NA NA NA NA NA ...
#>  $ w_svpt            : int  NA NA NA NA NA NA NA NA NA NA ...
#>  $ w_1stIn           : int  NA NA NA NA NA NA NA NA NA NA ...
#>  $ w_1stWon          : int  NA NA NA NA NA NA NA NA NA NA ...
#>  $ w_2ndWon          : int  NA NA NA NA NA NA NA NA NA NA ...
#>  $ w_SvGms           : int  NA NA NA NA NA NA NA NA NA NA ...
#>  $ w_bpSaved         : int  NA NA NA NA NA NA NA NA NA NA ...
#>  $ w_bpFaced         : int  NA NA NA NA NA NA NA NA NA NA ...
#>  $ l_ace             : int  NA NA NA NA NA NA NA NA NA NA ...
#>  $ l_df              : int  NA NA NA NA NA NA NA NA NA NA ...
#>  $ l_svpt            : int  NA NA NA NA NA NA NA NA NA NA ...
#>  $ l_1stIn           : int  NA NA NA NA NA NA NA NA NA NA ...
#>  $ l_1stWon          : int  NA NA NA NA NA NA NA NA NA NA ...
#>  $ l_2ndWon          : int  NA NA NA NA NA NA NA NA NA NA ...
#>  $ l_SvGms           : int  NA NA NA NA NA NA NA NA NA NA ...
#>  $ l_bpSaved         : int  NA NA NA NA NA NA NA NA NA NA ...
#>  $ l_bpFaced         : int  NA NA NA NA NA NA NA NA NA NA ...
#>  $ W1                : num  12 6 6 6 6 6 6 6 6 6 ...
#>  $ W2                : num  7 6 3 6 6 6 3 6 6 6 ...
#>  $ W3                : num  4 6 6 6 7 6 6 6 9 6 ...
#>  $ W4                : num  7 NA 7 NA NA NA 9 NA 6 NA ...
#>  $ W5                : num  NA NA NA NA NA NA NA NA NA NA ...
#>  $ L1                : num  10 3 4 4 4 4 2 3 0 4 ...
#>  $ L2                : num  5 2 6 2 1 1 6 0 2 2 ...
#>  $ L3                : num  6 4 3 1 5 2 4 3 11 4 ...
#>  $ L4                : num  5 NA 5 NA NA NA 7 NA 3 NA ...
#>  $ L5                : num  NA NA NA NA NA NA NA NA NA NA ...
#>  $ Retirement        : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
#>  $ WTB1              : num  NA NA NA NA NA NA NA NA NA NA ...
#>  $ LTB1              : num  NA NA NA NA NA NA NA NA NA NA ...
#>  $ WTB2              : num  NA NA NA NA NA NA NA NA NA NA ...
#>  $ LTB2              : num  NA NA NA NA NA NA NA NA NA NA ...
#>  $ WTB3              : num  NA NA NA NA NA NA NA NA NA NA ...
#>  $ LTB3              : num  NA NA NA NA NA NA NA NA NA NA ...
#>  $ WTB4              : num  NA NA NA NA NA NA NA NA NA NA ...
#>  $ LTB4              : num  NA NA NA NA NA NA NA NA NA NA ...
#>  $ WTB5              : num  NA NA NA NA NA NA NA NA NA NA ...
#>  $ LTB5              : num  NA NA NA NA NA NA NA NA NA NA ...
#>  $ tourney_start_date: Date, format: "1968-01-19" "1968-01-19" ...
#>  $ year              : num  1968 1968 1968 1968 1968 ...
#>  $ match_id          : chr  "1968-580:1" "1968-580:2" "1968-580:3" "1968-580:4" ...
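
Once a dataset is loaded, ordinary base R tools are enough for a first look. The sketch below is only an illustration: it builds a small stand-in data frame (the values are made up) that borrows three column names from the str output above, so that it runs without the package installed. With deuce installed, replacing matches with atp_matches should give the same kind of summary on the real data.

```r
# Toy stand-in for a few atp_matches columns. The column names come from the
# str() output above; the values are illustrative, not real match data.
matches <- data.frame(
  surface    = c("Grass", "Grass", "Clay", "Hard", "Clay"),
  best_of    = c(5L, 5L, 3L, 3L, 3L),
  Retirement = c(FALSE, FALSE, TRUE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Number of matches played on each surface.
surface_counts <- table(matches$surface)

# Share of matches on each surface that ended in a retirement.
retirement_rate <- tapply(matches$Retirement, matches$surface, mean)

print(surface_counts)
print(retirement_rate)
```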

Updating the Datasets

The make.R file under the package parent directory does all of the pre-processing for the major historical datasets. All of the data sources can be accessed over an internet connection. The only change the user would have to make in order to update their local package would be to change the package_root to the path where their local version of deuce lives.

Functions

The package includes some analytic functions and some functions for fetching additional tennis data from the Web. One example of an analytic function is elo_prediction, which computes the win chance for a player against a specific opponent given both players' Elo ratings. Suppose the player has an Elo rating of 2100. What is their implied win chance versus a player with a rating of 1950? We can compute that as follows:

elo_prediction(2100, 1950)
#> [1] 0.703385
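
For readers who want to see where a number like this comes from, here is a minimal sketch of the standard Elo expected-score formula, which uses a logistic curve in base 10 with a scale of 400 rating points. The function name elo_win_prob is hypothetical, and this is an assumption about what elo_prediction computes rather than a copy of the package source, although it agrees with the value printed above for these two ratings.

```r
# Standard Elo expected-score formula (an assumption, not taken from the
# deuce source): the win probability depends only on the rating difference.
elo_win_prob <- function(player_rating, opponent_rating) {
  1 / (1 + 10^((opponent_rating - player_rating) / 400))
}

# Same ratings as in the elo_prediction example above.
p <- elo_win_prob(2100, 1950)
print(p)

# The two players' win chances always sum to one.
print(p + elo_win_prob(1950, 2100))
```

A 150-point rating advantage therefore corresponds to roughly a 70% win chance under this formula, and equal ratings give exactly 50%.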

An example of one of the data-scraping functions is fetch_activity. When connected to the Internet, this can be used to retrieve the match results for an ATP player for a specific year or for their career. As an example, let’s show how we would fetch the 2017 match results for Rafael Nadal.

head(fetch_activity("Rafael Nadal", 2017))
#>                        name                          location start_date
#> 1          Nitto ATP Finals             London, Great Britain 2017-11-13
#> 2    ATP Masters 1000 Paris                     Paris, France 2017-10-30
#> 3    ATP Masters 1000 Paris                     Paris, France 2017-10-30
#> 4    ATP Masters 1000 Paris                     Paris, France 2017-10-30
#> 5 ATP Masters 1000 Shanghai                   Shanghai, China 2017-10-09
#> 6 ATP Masters 1000 Shanghai                   Shanghai, China 2017-10-09
#>     end_date draw matches surface    prize          score          round winner
#> 1 2017-11-19    8       7      NA $8000000 67(5)/76(4)/46    Round Robin      0
#> 2 2017-11-05   48      47      NA €4835975            W/O Quarter-Finals      0
#> 3 2017-11-05   48      47      NA €4835975    63/67(5)/63    Round of 16      1
#> 4 2017-11-05   48      47      NA €4835975          75/63    Round of 32      1
#> 5 2017-10-15   56      55      NA $8092625          46/36         Finals      0
#> 6 2017-10-15   56      55      NA $8092625       75/76(3)    Semi-Finals      1
#>         player player_rank         opponent opponent_rank player1 player2
#> 1 Rafael Nadal           1     David Goffin             8       6       7
#> 2 Rafael Nadal           1 Filip Krajinovic            77      NA      NA
#> 3 Rafael Nadal           1     Pablo Cuevas            36       6       6
#> 4 Rafael Nadal           1      Hyeon Chung            55       7       6
#> 5 Rafael Nadal           1    Roger Federer             2       4       3
#> 6 Rafael Nadal           1      Marin Cilic             5       7       7
#>   player3 player4 player5 opponent1 opponent2 opponent3 opponent4 opponent5
#> 1       4      NA      NA         7         6         6        NA        NA
#> 2      NA      NA      NA        NA        NA        NA        NA        NA
#> 3       6      NA      NA         3         7         3        NA        NA
#> 4      NA      NA      NA         5         3        NA        NA        NA
#> 5      NA      NA      NA         6         6        NA        NA        NA
#> 6      NA      NA      NA         5         6        NA        NA        NA
#>   TBplayer1 TBplayer2 TBplayer3 TBplayer4 TBplayer5 TBopponent1 TBopponent2
#> 1         5         7        NA        NA        NA           7           4
#> 2        NA        NA        NA        NA        NA          NA          NA
#> 3        NA         5        NA        NA        NA          NA           7
#> 4        NA        NA        NA        NA        NA          NA          NA
#> 5        NA        NA        NA        NA        NA          NA          NA
#> 6        NA         7        NA        NA        NA          NA           3
#>   TBopponent3 TBopponent4 TBopponent5
#> 1          NA          NA          NA
#> 2          NA          NA          NA
#> 3          NA          NA          NA
#> 4          NA          NA          NA
#> 5          NA          NA          NA
#> 6          NA          NA          NA
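
The result is an ordinary data frame, so it can be summarized directly. The sketch below re-creates a handful of columns from the six rows printed above so that it runs offline; it assumes that winner equal to 1 means the queried player won the match, which is consistent with the scores shown but is not stated in the output itself.

```r
# Offline copy of a few columns from the six rows shown above.
activity <- data.frame(
  name = c("Nitto ATP Finals",
           rep("ATP Masters 1000 Paris", 3),
           rep("ATP Masters 1000 Shanghai", 2)),
  opponent = c("David Goffin", "Filip Krajinovic", "Pablo Cuevas",
               "Hyeon Chung", "Roger Federer", "Marin Cilic"),
  opponent_rank = c(8L, 77L, 36L, 55L, 2L, 5L),
  winner = c(0L, 0L, 1L, 1L, 0L, 1L),
  stringsAsFactors = FALSE
)

# Assumption: winner == 1 marks a win for the queried player.
win_rate <- mean(activity$winner == 1)

# Wins per tournament.
wins_by_event <- tapply(activity$winner, activity$name, sum)

# Matches against opponents ranked in the top 10.
top10 <- activity[activity$opponent_rank <= 10, ]

print(win_rate)
print(wins_by_event)
print(top10[, c("opponent", "opponent_rank", "winner")])
```

The same expressions should work on the full data frame returned by fetch_activity, since they only rely on the name, opponent_rank, and winner columns.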

There are also several tidy functions for pre-processing the major datasets that are included with the package.

For users interested in updating or running their own player Elo ratings, I would recommend looking at the Rcpp implementation by martiningram, which you can find here.
