Funnyball is a binary classifier for predicting post-season college basketball victories based on regular season data and seeds.
- Kaggle data
- NCAA
### Other potential data sets to incorporate
Team Stats - transform %
Vegas Odds
Regular season head-to-head matchups. Unfortunately, I found that regular season matchups between teams that meet in the postseason are too sparse to use, and not predictive.
Regular season win records vs common teams - i.e. top 4/8/16 seeds that season - i.e. all other seeded teams in the tournament
Aggregated Team Ratings - Sagarin
News - Injuries - Coaching Changes
Player based - biometric - Individual statistics
- Predict winner of team A vs B - Gaussian
- Predict scores for A and B - Binomial or Poisson
- Predict proportion of possessions
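As a rough illustration of the second option, score prediction could be framed as a Poisson regression. The sketch below uses base R's glm on made-up numbers. The data frame, its column names, and the choice of seed advantage as the only predictor are assumptions for illustration, not part of Funnyball.

```r
# Toy data, made up for illustration: points scored by a team and its
# seed advantage over the opponent in that game.
scores <- data.frame(
  points         = c(82, 75, 68, 71, 64, 59, 77, 66),
  seed.advantage = c(15, 9, 3, 1, -1, -3, 7, -9)
)

# Poisson regression with a log link: models the expected score as
# exp(intercept + slope * seed.advantage).
fit <- glm(points ~ seed.advantage, family = poisson(link = "log"), data = scores)

# Expected points for a team with a seed advantage of 5.
expected <- predict(fit, newdata = data.frame(seed.advantage = 5), type = "response")
print(expected)
```

Fitting one such model per team and comparing the expected scores would give a predicted winner, though whether this beats direct classification is an open question here.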
For all matchups in previously known postseasons
- Observation (team1_team2)
- DID_WIN_IN_POSTSEASON (0,1)
- SEED_DIFFERENTIAL (lseed - wseed; higher means winner was favored)
- REGULAR_SEASON_WIN_LOSS (when the teams matched up in the regular season, what was the ratio of team 1 to team 2's wins)
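The observation layout above can be sketched in base R. This is a minimal example under stated assumptions: the games data frame and its wteam/lteam/wseed/lseed columns are hypothetical, and emitting one row per game from each team's perspective (so that the response contains both 0 and 1) is an assumption about how the rows might be built, not a description of what funnyball.build does.

```r
# Hypothetical tournament results; team ids and seeds are made up.
games <- data.frame(
  wteam = c(101, 205),
  lteam = c(330, 118),
  wseed = c(1, 12),
  lseed = c(16, 5)
)

# Rows from the winner's perspective: did.win = 1 and
# seed.advantage = lseed - wseed (positive when the winner was favored).
winner_rows <- data.frame(
  observation    = paste(games$wteam, games$lteam, sep = "_"),
  did.win        = 1,
  seed.advantage = games$lseed - games$wseed
)

# Mirrored rows from the loser's perspective: did.win = 0 and the sign flipped.
loser_rows <- data.frame(
  observation    = paste(games$lteam, games$wteam, sep = "_"),
  did.win        = 0,
  seed.advantage = games$wseed - games$lseed
)

observations <- rbind(winner_rows, loser_rows)
print(observations)
```

In the second game a 12 seed beat a 5 seed, so the winner's row carries a negative seed advantage, which is what an upset looks like in this encoding.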
This can be done with
(use 'funnyball.build :reload-all)
(save-to-r-file)
Visualize the data in a scatter plot
(use 'funnyball.model :reload-all)
view-dataset
You should see a graph like this:
Run a random forest classifier on this data to see how significant the features are in predicting the response variable:
- Run the steps in R_model.R in RStudio. Note that you may need to install the following packages:
  - e1071
  - ggplot2
  - randomForest
library(randomForest)
data <- read.table("output/input-r.csv", header=TRUE, sep=",")
rf <- randomForest(x=data[,c("seed.advantage","seed.win.loss.advantage.64")], y=as.factor(data[,c("did.win")]), importance=TRUE, proximity=TRUE)
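Once a forest has been fit with importance=TRUE, the randomForest package's importance() and varImpPlot() functions report how much each feature contributes. The sketch below is self-contained and runs on synthetic data standing in for output/input-r.csv; the toy data frame, the rf_toy name, and the way did.win is simulated are assumptions for illustration only.

```r
library(randomForest)

set.seed(42)  # make the synthetic data and the forest reproducible

# Synthetic stand-in for the real training file (assumption, not real results).
n <- 200
toy <- data.frame(
  seed.advantage             = sample(-15:15, n, replace = TRUE),
  seed.win.loss.advantage.64 = rnorm(n)  # pure noise in this toy setup
)
# Simulated outcome: driven by seed advantage plus noise.
toy$did.win <- as.factor(as.integer(toy$seed.advantage + rnorm(n, sd = 5) > 0))

features <- c("seed.advantage", "seed.win.loss.advantage.64")
rf_toy <- randomForest(x = toy[, features], y = toy$did.win, importance = TRUE)

# One row per feature; for classification the columns include
# MeanDecreaseAccuracy and MeanDecreaseGini.
imp <- importance(rf_toy)
print(imp)

# Dot chart of the same importance measures.
varImpPlot(rf_toy)
```

On the real data, the same two calls applied to rf would show whether seed.advantage or seed.win.loss.advantage.64 carries more of the predictive weight.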
Run the random forest model created in the previous step against the current year's regular season results and tournament seeds
data <- read.table("kaggle_data/current_season-r.csv", header=TRUE, sep=",")
data$predictWillWin <- predict(rf,data[,c("seed.advantage","seed.win.loss.advantage.64")])
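If class probabilities are more useful than a hard win/lose label, predict() for randomForest objects also accepts type = "prob", which returns the fraction of trees voting for each class. The following is a sketch on synthetic data; the train and current data frames, the rf_toy name, and the probWillWin column are hypothetical and not part of the project.

```r
library(randomForest)

set.seed(1)  # reproducible synthetic data

# Synthetic training data standing in for output/input-r.csv (assumption).
train <- data.frame(seed.advantage = sample(-15:15, 300, replace = TRUE))
train$seed.win.loss.advantage.64 <- train$seed.advantage / 15 + rnorm(300, sd = 0.5)
train$did.win <- as.factor(as.integer(train$seed.advantage + rnorm(300, sd = 6) > 0))

features <- c("seed.advantage", "seed.win.loss.advantage.64")
rf_toy <- randomForest(x = train[, features], y = train$did.win)

# Hypothetical current-season matchups.
current <- data.frame(
  seed.advantage             = c(15, 0, -7),
  seed.win.loss.advantage.64 = c(0.9, 0.0, -0.4)
)

# Matrix with one column per class level ("0" and "1"); each row sums to 1.
probs <- predict(rf_toy, current[, features], type = "prob")
current$probWillWin <- probs[, "1"]
print(current)
```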