deuce
Overview
The goal of deuce is to provide easy access to many different sets of
online data on professional tennis. By making tennis data more available
to R users, deuce aims to be a useful tool for tennis analysts and a
fun resource for teachers of statistics.
If you are new to tennis analytics, please check out this guide.
Installation
To install in R, use the devtools package and the following:
library(devtools)
install_github("skoval/deuce")
Caution: there are 274 MB of data included with the package so the installation may take several minutes.
About the Contents
To find out about the datasets and functions included in deuce, you
can use the following command to bring up the package index.
help(package = "deuce")
Datasets
Any of the individual datasets can be loaded with the data command.
For example, the following commands load the package and display the
structure of every column of the atp_matches data.
library(deuce)
str(atp_matches)
#> 'data.frame': 776620 obs. of 72 variables:
#> $ tourney_id : chr "1968-580" "1968-580" "1968-580" "1968-580" ...
#> $ tourney_name : chr "Australian Chps." "Australian Chps." "Australian Chps." "Australian Chps." ...
#> $ surface : chr "Grass" "Grass" "Grass" "Grass" ...
#> $ draw_size : int 64 64 64 64 64 64 64 64 64 64 ...
#> $ tourney_level : Factor w/ 8 levels "250 or 500","Challenger",..: 5 5 5 5 5 5 5 5 5 5 ...
#> $ match_num : int 1 2 3 4 5 6 7 8 9 10 ...
#> $ winner_id : int 110023 109803 100257 100105 109966 107759 100101 100025 108519 109799 ...
#> $ winner_seed : chr NA NA NA "5" ...
#> $ winner_entry : chr "" "" "" "" ...
#> $ winner_name : chr "Richard Coulthard" "John Brown" "Ross Case" "Allan Stone" ...
#> $ winner_hand : chr "R" "R" "R" "R" ...
#> $ winner_ht : int NA NA NA NA NA NA NA 173 NA NA ...
#> $ winner_ioc : chr "AUS" "AUS" "AUS" "AUS" ...
#> $ winner_age : num NA 27.5 16.2 22.3 29.9 ...
#> $ winner_rank : int NA NA NA NA NA NA NA NA NA NA ...
#> $ winner_rank_points: int NA NA NA NA NA NA NA NA NA NA ...
#> $ loser_id : int 107760 106964 110024 110025 110026 110027 110028 108430 110029 110030 ...
#> $ loser_seed : chr NA NA "15" NA ...
#> $ loser_entry : chr "" "" "" "" ...
#> $ loser_name : chr "Max Senior" "Ernie Mccabe" "Gondo Widjojo" "Robert Layton" ...
#> $ loser_hand : chr "R" "R" "R" "R" ...
#> $ loser_ht : int NA NA NA NA NA NA NA NA NA NA ...
#> $ loser_ioc : chr "AUS" "AUS" "INA" "AUS" ...
#> $ loser_age : num NA NA NA NA NA NA NA NA NA NA ...
#> $ loser_rank : int NA NA NA NA NA NA NA NA NA NA ...
#> $ loser_rank_points : int NA NA NA NA NA NA NA NA NA NA ...
#> $ score : chr "12-10 7-5 4-6 7-5" "6-3 6-2 6-4" "6-4 3-6 6-3 7-5" "6-4 6-2 6-1" ...
#> $ best_of : int 5 5 5 5 5 5 5 5 5 5 ...
#> $ round : Ord.factor w/ 22 levels "BR"<"Q1"<"Q2"<..: 10 10 10 10 10 10 10 10 10 10 ...
#> $ minutes : int NA NA NA NA NA NA NA NA NA NA ...
#> $ w_ace : int NA NA NA NA NA NA NA NA NA NA ...
#> $ w_df : int NA NA NA NA NA NA NA NA NA NA ...
#> $ w_svpt : int NA NA NA NA NA NA NA NA NA NA ...
#> $ w_1stIn : int NA NA NA NA NA NA NA NA NA NA ...
#> $ w_1stWon : int NA NA NA NA NA NA NA NA NA NA ...
#> $ w_2ndWon : int NA NA NA NA NA NA NA NA NA NA ...
#> $ w_SvGms : int NA NA NA NA NA NA NA NA NA NA ...
#> $ w_bpSaved : int NA NA NA NA NA NA NA NA NA NA ...
#> $ w_bpFaced : int NA NA NA NA NA NA NA NA NA NA ...
#> $ l_ace : int NA NA NA NA NA NA NA NA NA NA ...
#> $ l_df : int NA NA NA NA NA NA NA NA NA NA ...
#> $ l_svpt : int NA NA NA NA NA NA NA NA NA NA ...
#> $ l_1stIn : int NA NA NA NA NA NA NA NA NA NA ...
#> $ l_1stWon : int NA NA NA NA NA NA NA NA NA NA ...
#> $ l_2ndWon : int NA NA NA NA NA NA NA NA NA NA ...
#> $ l_SvGms : int NA NA NA NA NA NA NA NA NA NA ...
#> $ l_bpSaved : int NA NA NA NA NA NA NA NA NA NA ...
#> $ l_bpFaced : int NA NA NA NA NA NA NA NA NA NA ...
#> $ W1 : num 12 6 6 6 6 6 6 6 6 6 ...
#> $ W2 : num 7 6 3 6 6 6 3 6 6 6 ...
#> $ W3 : num 4 6 6 6 7 6 6 6 9 6 ...
#> $ W4 : num 7 NA 7 NA NA NA 9 NA 6 NA ...
#> $ W5 : num NA NA NA NA NA NA NA NA NA NA ...
#> $ L1 : num 10 3 4 4 4 4 2 3 0 4 ...
#> $ L2 : num 5 2 6 2 1 1 6 0 2 2 ...
#> $ L3 : num 6 4 3 1 5 2 4 3 11 4 ...
#> $ L4 : num 5 NA 5 NA NA NA 7 NA 3 NA ...
#> $ L5 : num NA NA NA NA NA NA NA NA NA NA ...
#> $ Retirement : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
#> $ WTB1 : num NA NA NA NA NA NA NA NA NA NA ...
#> $ LTB1 : num NA NA NA NA NA NA NA NA NA NA ...
#> $ WTB2 : num NA NA NA NA NA NA NA NA NA NA ...
#> $ LTB2 : num NA NA NA NA NA NA NA NA NA NA ...
#> $ WTB3 : num NA NA NA NA NA NA NA NA NA NA ...
#> $ LTB3 : num NA NA NA NA NA NA NA NA NA NA ...
#> $ WTB4 : num NA NA NA NA NA NA NA NA NA NA ...
#> $ LTB4 : num NA NA NA NA NA NA NA NA NA NA ...
#> $ WTB5 : num NA NA NA NA NA NA NA NA NA NA ...
#> $ LTB5 : num NA NA NA NA NA NA NA NA NA NA ...
#> $ tourney_start_date: Date, format: "1968-01-19" "1968-01-19" ...
#> $ year : num 1968 1968 1968 1968 1968 ...
#> $ match_id : chr "1968-580:1" "1968-580:2" "1968-580:3" "1968-580:4" ...
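As a rough illustration of working with these columns, the sketch below counts the sets played in each match from the W1 to W5 columns. To stay self-contained it uses a small stand-in data frame, toy_matches (a hypothetical name), whose values are copied from the first four rows of the output above; with the package loaded, the same lines should apply to atp_matches itself.

```r
# Stand-in for the first four rows of atp_matches; values are copied from
# the str() output above. toy_matches is an illustrative name, not a deuce object.
toy_matches <- data.frame(
  match_id = c("1968-580:1", "1968-580:2", "1968-580:3", "1968-580:4"),
  W1 = c(12, 6, 6, 6),
  W2 = c(7, 6, 3, 6),
  W3 = c(4, 6, 6, 6),
  W4 = c(7, NA, 7, NA),
  W5 = rep(NA_real_, 4),
  stringsAsFactors = FALSE
)

# A set was played if the winner's game count for that set is recorded
set_cols <- paste0("W", 1:5)
toy_matches$sets_played <- rowSums(!is.na(toy_matches[, set_cols]))

toy_matches[, c("match_id", "sets_played")]
```

The counts agree with the score strings shown above: the first match ("12-10 7-5 4-6 7-5") went four sets and the second ("6-3 6-2 6-4") went three.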
Updating the Datasets
The make.R file under the package parent directory does all of the
pre-processing for the major historical datasets. All of the data
sources can be accessed over an internet connection. To update a local
copy of the package, the only change the user needs to make is to set
package_root to the path where their local version of deuce lives.
Functions
There are some analytic functions and some functions for fetching
additional tennis data from the Web. One example of an analytic function
is elo_prediction, which computes the win chance for a player
against a specific opponent given both players' Elo ratings. Suppose the
player has an Elo rating of 2100. What is their implied win chance
versus a player with a rating of 1950? We can compute that as follows:
elo_prediction(2100, 1950)
#> [1] 0.703385
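For readers who want to see the arithmetic, the value above is what the standard logistic Elo expectation, 1 / (1 + 10^((opponent - player) / 400)), gives for these ratings. The sketch below is an illustration only: elo_win_prob is a hypothetical helper, and it is an assumption (not confirmed by the package source here) that elo_prediction uses this exact form and a scale of 400.

```r
# Hypothetical helper, not part of deuce: standard logistic Elo expectation.
# The scale of 400 is the conventional choice and is assumed here.
elo_win_prob <- function(player_elo, opponent_elo, scale = 400) {
  1 / (1 + 10^((opponent_elo - player_elo) / scale))
}

# Win chance for a 2100-rated player against a 1950-rated opponent
p <- elo_win_prob(2100, 1950)
round(p, 6)

# The two players' win chances are complementary
elo_win_prob(2100, 1950) + elo_win_prob(1950, 2100)
```

Only the rating difference matters under this formula, so a 150-point gap implies the same win chance at any rating level.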
An example of one of the data-scraping functions is fetch_activity.
When connected to the Internet, this can be used to retrieve the match
results for an ATP player for a specific year or for their career. As an
example, let’s show how we would fetch the 2017 match results for Rafael
Nadal.
head(fetch_activity("Rafael Nadal", 2017))
#> name location start_date
#> 1 Nitto ATP Finals London, Great Britain 2017-11-13
#> 2 ATP Masters 1000 Paris Paris, France 2017-10-30
#> 3 ATP Masters 1000 Paris Paris, France 2017-10-30
#> 4 ATP Masters 1000 Paris Paris, France 2017-10-30
#> 5 ATP Masters 1000 Shanghai Shanghai, China 2017-10-09
#> 6 ATP Masters 1000 Shanghai Shanghai, China 2017-10-09
#> end_date draw matches surface prize score round winner
#> 1 2017-11-19 8 7 NA $8000000 67(5)/76(4)/46 Round Robin 0
#> 2 2017-11-05 48 47 NA €4835975 W/O Quarter-Finals 0
#> 3 2017-11-05 48 47 NA €4835975 63/67(5)/63 Round of 16 1
#> 4 2017-11-05 48 47 NA €4835975 75/63 Round of 32 1
#> 5 2017-10-15 56 55 NA $8092625 46/36 Finals 0
#> 6 2017-10-15 56 55 NA $8092625 75/76(3) Semi-Finals 1
#> player player_rank opponent opponent_rank player1 player2
#> 1 Rafael Nadal 1 David Goffin 8 6 7
#> 2 Rafael Nadal 1 Filip Krajinovic 77 NA NA
#> 3 Rafael Nadal 1 Pablo Cuevas 36 6 6
#> 4 Rafael Nadal 1 Hyeon Chung 55 7 6
#> 5 Rafael Nadal 1 Roger Federer 2 4 3
#> 6 Rafael Nadal 1 Marin Cilic 5 7 7
#> player3 player4 player5 opponent1 opponent2 opponent3 opponent4 opponent5
#> 1 4 NA NA 7 6 6 NA NA
#> 2 NA NA NA NA NA NA NA NA
#> 3 6 NA NA 3 7 3 NA NA
#> 4 NA NA NA 5 3 NA NA NA
#> 5 NA NA NA 6 6 NA NA NA
#> 6 NA NA NA 5 6 NA NA NA
#> TBplayer1 TBplayer2 TBplayer3 TBplayer4 TBplayer5 TBopponent1 TBopponent2
#> 1 5 7 NA NA NA 7 4
#> 2 NA NA NA NA NA NA NA
#> 3 NA 5 NA NA NA NA 7
#> 4 NA NA NA NA NA NA NA
#> 5 NA NA NA NA NA NA NA
#> 6 NA 7 NA NA NA NA 3
#> TBopponent3 TBopponent4 TBopponent5
#> 1 NA NA NA
#> 2 NA NA NA
#> 3 NA NA NA
#> 4 NA NA NA
#> 5 NA NA NA
#> 6 NA NA NA
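Once results like these are in a data frame, a win tally per event takes one line. The sketch below avoids the network call by rebuilding three columns of the six rows shown above in a stand-in data frame, toy_activity (a hypothetical name). It assumes that winner equals 1 when the player won the match, which is consistent with the scores shown but is not stated in the output itself.

```r
# Stand-in for the six rows printed above; toy_activity is an illustrative
# name and only three of the columns are reproduced.
toy_activity <- data.frame(
  name = c("Nitto ATP Finals",
           "ATP Masters 1000 Paris", "ATP Masters 1000 Paris",
           "ATP Masters 1000 Paris",
           "ATP Masters 1000 Shanghai", "ATP Masters 1000 Shanghai"),
  round = c("Round Robin", "Quarter-Finals", "Round of 16",
            "Round of 32", "Finals", "Semi-Finals"),
  winner = c(0, 0, 1, 1, 0, 1),  # assumed: 1 = player won, 0 = player lost
  stringsAsFactors = FALSE
)

# Wins per event and overall share of matches won in this sample
wins_by_event <- tapply(toy_activity$winner, toy_activity$name, sum)
win_share <- mean(toy_activity$winner)

wins_by_event
win_share
```

The Paris quarter-final is a walkover (score W/O) recorded with winner equal to 0, so whether such rows should count toward a win-loss tally is a judgment call for the analyst.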
There are also several tidy functions for pre-processing the major
datasets that are included with the package.
For users interested in updating or running their own player Elo ratings, I would recommend looking at the Rcpp implementation by martiningram, which you can find here.