
2023/2024 NBA Awards prediction

Prediction of the 1st, 2nd and 3rd All-NBA teams and the 1st and 2nd Rookie All-NBA teams - project for the course "Selected topics of machine learning"

Running the prediction

To run the prediction for the 2023/2024 season, run the main.py script with the path to a file where the predictions (as JSON) should be saved. Example:

python main.py ~/Documents/predictions.json

Requirements

The project was written in Python 3.11. The required packages are listed in the requirements.txt file. To install them, run:

pip install -r requirements.txt

1. Getting the data

The data was downloaded from nba.com/stats using the nba_api library. The data was downloaded for all NBA seasons (1946-47 - 2023-24) and contains:

player statistics in each game (downloaded by this script) - because the file with statistics from all matches is too big to be uploaded to the repository (around 250MB), it is available in this Kaggle dataset,
team statistics in each game (calculated based on the data from the previous point in this script),
player statistics in each season (calculated based on the data from the player statistics in this script),
player awards (downloaded by this script),
information about rookie seasons of the players (downloaded by this script),
dates of beginning and end of each season (regular season, playoffs and finals) - based on Wikipedia data.

All data was saved in the data directory in the csv format.

Note

I found some mistakes in the data for older seasons. For example, some players were listed in the box score of a game they did not play in (they were not on either team in that game). This was usually caused by players sharing the same last name, which duplicated the data, so the final scores may differ from the real ones.

2. Data preprocessing

The data had to be preprocessed because the NBA website provides seasonal statistics only for seasons 1996-97 to 2023-24. Because of that, the data was downloaded for every game in NBA history and then aggregated into seasonal statistics (where a given statistic was recorded at the time - link to list).

2.1. Seasons and types of matches

Because the All-NBA teams are selected after the regular season, the data was divided into the following types of matches:

Regular Season,
All-Star Game,
Play-in Tournament,
Playoffs,
Finals,
In-Season Tournament Final (other games of the In-Season Tournament are officially considered as Regular Season games).

The NBA_Seasons_Dates.csv file contains the start and end dates of the regular season, playoffs and finals for each season. That information was used to label each row of the statistics with its season and type of match.
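The date-based labelling can be sketched as follows. This is a minimal illustration, not the project's actual code: the dictionary layout, the `match_type` helper and the dates are assumptions made for the example, not values read from NBA_Seasons_Dates.csv.

```python
from datetime import date

# Illustrative date ranges for one season (assumed values, not taken from the CSV).
EXAMPLE_SEASON = {
    "season": "2022-23",
    "Regular Season": (date(2022, 10, 18), date(2023, 4, 9)),
    "Playoffs": (date(2023, 4, 15), date(2023, 5, 29)),
    "Finals": (date(2023, 6, 1), date(2023, 6, 12)),
}


def match_type(game_date, season):
    """Return the label of the first date range that contains game_date."""
    for label in ("Regular Season", "Playoffs", "Finals"):
        start, end = season[label]
        if start <= game_date <= end:
            return label
    # Games outside all ranges need extra information; the Play-in Tournament
    # falls between the regular season and the playoffs, and the All-Star Game
    # falls inside the regular-season range, so dates alone cannot identify it.
    return "Other"


print(match_type(date(2023, 1, 15), EXAMPLE_SEASON))
print(match_type(date(2023, 6, 4), EXAMPLE_SEASON))
```

Dates alone are not enough for every type in the list above, so a real implementation would need an additional marker for the All-Star Game and the In-Season Tournament Final.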

2.2. Player statistics

Apart from the statistics available on the NBA website, the following statistics were calculated:

Fantasy Points - based on the formula: FP = PTS + 1.2 * REB + 1.5 * AST + 3 * STL + 3 * BLK - TO,
Player Impact Estimate - based on the formula: PIE = (PTS + FGM + FTM - FGA - FTA + DREB + 0.5 * OREB + AST + STL + 0.5 * BLK - PF - TO) / (GmPTS + GmFGM + GmFTM - GmFGA - GmFTA + GmDREB + 0.5 * GmOREB + GmAST + GmSTL + 0.5 * GmBLK - GmPF - GmTO),
number of statistics in double digits - if the number was >= 2, then the player had a double-double DD and if the number was >= 3, then the player had a triple-double TD,
field goals made (and 3PT shots made) only if the number of attempts was available - in older seasons not all of the statistics were saved and that could cause FG% to be over 100%,
information about win/loss in the match.

After that, the data was summed up to get the seasonal statistics for each player.
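The formulas above can be written down directly. The sketch below assumes each stat line is a plain dictionary keyed by the abbreviations used in the formulas, and that the double-digit count covers PTS, REB, AST, STL and BLK; the function names and the sample numbers are made up for illustration.

```python
def fantasy_points(s):
    """FP = PTS + 1.2 * REB + 1.5 * AST + 3 * STL + 3 * BLK - TO."""
    return s["PTS"] + 1.2 * s["REB"] + 1.5 * s["AST"] + 3 * s["STL"] + 3 * s["BLK"] - s["TO"]


def pie_term(s):
    """The expression used in both the numerator and the denominator of PIE."""
    return (s["PTS"] + s["FGM"] + s["FTM"] - s["FGA"] - s["FTA"] + s["DREB"]
            + 0.5 * s["OREB"] + s["AST"] + s["STL"] + 0.5 * s["BLK"] - s["PF"] - s["TO"])


def pie(player, game):
    """Player term divided by the game-total term (0.0 if the total is zero)."""
    denominator = pie_term(game)
    return pie_term(player) / denominator if denominator else 0.0


def double_digit_flags(s, categories=("PTS", "REB", "AST", "STL", "BLK")):
    """Return (double_double, triple_double); the category list is an assumption."""
    count = sum(s[c] >= 10 for c in categories)
    return count >= 2, count >= 3


# Made-up single-game stat line and game totals.
line = {"PTS": 30, "FGM": 11, "FTM": 6, "FGA": 20, "FTA": 7, "DREB": 8, "OREB": 2,
        "REB": 10, "AST": 5, "STL": 2, "BLK": 1, "PF": 2, "TO": 3}
game_totals = {key: 2 * value for key, value in line.items()}

print(fantasy_points(line))          # 55.5
print(pie(line, game_totals))        # 0.5
print(double_digit_flags(line))      # (True, False)
```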

2.3. Awards

The data about awards was downloaded for each player and information about the following awards was added to the dataset:

Most Valuable Player,
Rookie of the Year,
Defensive Player of the Year,
Most Improved Player,
6th Man of the Year,
All-NBA teams (1st, 2nd, 3rd),
All-Defensive teams (1st, 2nd),
All-Rookie teams (1st, 2nd),
All-Star Game player,
All-Star Game MVP,
Finals MVP,
number of Player of the Week awards,
number of Player of the Month awards,
number of Rookie of the Month awards.

2.3.1. Awards and All-NBA teams

The relationship between the awards and selection to All-NBA teams since the 1988-89 season was checked. The counts are shown in the table below (for POTM and POTW, a player is counted if he won at least one such award during the season):

Award	1st All-NBA Team	2nd All-NBA Team	3rd All-NBA Team	Not selected
MVP	35	0	0	0
DPOY	11	6	8	10
ROY	1	0	1	35
6MOY	0	0	1	24
MIP	0	5	4	26
All-Star Game Player	163	156	142	356
All-Star Game MVP	26	7	1	2
POTW	151	125	105	423
POTM	109	56	31	52

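The data shows that MVPs were always selected to the 1st All-NBA team (35 of 35) and that All-Star Game MVPs were usually selected to the 1st or 2nd All-NBA team (33 of 36). DPOYs (25 of 35) and POTM winners (196 of 248) were selected to an All-NBA team most of the time, while All-Star Game players (461 of 817) and POTW winners (381 of 804) were selected roughly half of the time.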

2.3.2. Awards and Rookie All-NBA teams

The relationship between the awards and selection to All-NBA Rookie teams since the 1988-89 season was checked. The counts are shown in the table below (for ROTM, a player is counted if he won at least one such award during the season):

Award	1st All-NBA Rookie Team	2nd All-NBA Rookie Team	Not selected
MVP	0	0	0
DPOY	0	0	0
ROY	37	0	0
6MOY	1	0	0
MIP	0	0	0
All-Star Game Player	7	0	0
All-Star Game MVP	0	0	0
POTW	24	0	0
POTM	0	0	0
ROTM	111	28	23

The table shows that most of the awards were not won by any player selected to an All-NBA Rookie Team. However, all Rookie of the Year winners were selected to the 1st All-NBA Rookie Team. Winning Rookie of the Month also means a player has a high chance of being selected to one of the All-NBA Rookie Teams (139 of 162). Unfortunately, the data does not include the Rising Stars games played during All-Star Weekend, which could also have an impact.

2.4. Average statistics and normalization

The statistics were averaged for each player to get his average impact per game (so that the number of games a player played doesn't matter).

Because basketball and its players have evolved over the years, the statistics were also normalized within each season: the player with the highest value of a given statistic in a season gets the value 1 and the other players get proportionally lower values.

However, this can cause problems with players who played just a few games during a season and had very high statistics in those games.
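A minimal sketch of the per-game averaging and per-season normalization described above, and of the problem with small samples. The row layout, the helper names and the numbers are assumptions made for the example.

```python
from collections import defaultdict


def per_game(total, games_played):
    """Average a season total over the games played."""
    return total / games_played if games_played else 0.0


def normalize_by_season_max(rows, stat):
    """Divide each value by the highest value of that statistic in the same season."""
    season_max = defaultdict(float)
    for row in rows:
        season_max[row["season"]] = max(season_max[row["season"]], row[stat])
    normalized = []
    for row in rows:
        top = season_max[row["season"]]
        normalized.append({**row, stat + "_norm": row[stat] / top if top else 0.0})
    return normalized


rows = [
    {"player": "A", "season": "S1", "PTS_per_GP": per_game(2400, 80)},  # 30.0 over 80 games
    {"player": "B", "season": "S1", "PTS_per_GP": per_game(1200, 80)},  # 15.0 over 80 games
    {"player": "C", "season": "S1", "PTS_per_GP": per_game(80, 2)},     # 40.0 over 2 games
]
for row in normalize_by_season_max(rows, "PTS_per_GP"):
    print(row["player"], row["PTS_per_GP_norm"])
```

In this made-up example player C, with only 2 games, becomes the season leader (value 1.0) and pushes the full-season scorer A down to 0.75, which is the distortion the filters in the next sections are meant to remove.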

2.4.1. Eliminating players with low statistics for All-NBA teams prediction

To eliminate the issue, after displaying data for all players who were selected to All-NBA teams (graph below), the following filters were applied:

Games Played >= 40,
Minutes played during season >= 1250,
Points scored during season >= 333,
Fantasy Points scored during season >= 1250.

By doing so, the data for seasons 1988-89 to 2022-23 was reduced from 16711 to 6074 players. For the 2023-24 season there was an additional requirement of Games Played >= 65, which left only 146 players eligible for All-NBA teams.
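The filters can be expressed as a simple threshold check. This is a sketch that assumes season totals are stored in a dictionary with the keys GP, MIN, PTS and FP; the function name and the sample players are hypothetical.

```python
# Thresholds from the list above (season totals).
ALL_NBA_THRESHOLDS = {"GP": 40, "MIN": 1250, "PTS": 333, "FP": 1250}


def is_eligible(totals, thresholds=ALL_NBA_THRESHOLDS, min_games=None):
    """True if every total meets its threshold and the optional games requirement."""
    if min_games is not None and totals["GP"] < min_games:
        return False
    return all(totals[stat] >= minimum for stat, minimum in thresholds.items())


# Made-up season totals.
starter = {"GP": 60, "MIN": 2000, "PTS": 1200, "FP": 2400}
short_season = {"GP": 12, "MIN": 400, "PTS": 300, "FP": 520}

print(is_eligible(starter))                 # True
print(is_eligible(short_season))            # False
print(is_eligible(starter, min_games=65))   # False under the 2023-24 games rule
```

Passing a different `thresholds` dictionary lets the same check be reused for other player groups.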

2.4.2. Eliminating players with low statistics for Rookie All-NBA teams prediction

First, all non-rookie players were removed from the dataset. After that, the same statistics were chosen and displayed as for the All-NBA teams. The filters for rookie players were applied as follows:

Games Played >= 24,
Minutes played during season >= 650,
Points scored during season >= 250,
Fantasy Points scored during season >= 500.

The filters reduced the data from 2801 players to just 917. Only 25 players were eligible for selection to the All-NBA Rookie Teams in the 2023-24 season.

2.4.3. Statistics correlation for All-NBA teams prediction

After normalizing the data, the correlation between the normalized statistics and the selection to All-NBA teams was checked. The correlation matrix is shown below:

Based on the correlation matrix, the statistics most strongly correlated with selection to All-NBA teams are:

Player Impact Estimate,
Fantasy Points,
Points,
Free Throws Made,
Field Goals Made.

The high correlation between the statistics above and being selected to All-NBA teams is understandable, as those statistics (apart from Free Throws Made) directly show impact on the game. Free Throws Made may be correlated because good players usually play more and create more plays, so the possibility of being fouled is higher.

The least correlated statistics are:

Free Throw Percentage,
3PT Field Goal Percentage,
3 PT Field Goals Made.

The low correlation of the 3PT statistics is probably caused by the fact that Centers and Power Forwards usually don't shoot 3PT shots. In the past those players also weren't good at free throws, which explains the low correlation with Free Throw Percentage.

2.4.4. Statistics correlation for All-NBA Rookie teams prediction

After normalizing the data for Rookie players, the correlation between the normalized statistics and the selection to All-NBA Rookie teams was checked. The correlation matrix is shown below:

The most correlated statistics with being selected to All-NBA Rookie teams are:

Field Goals Made,
Fantasy Points,
Points,
Player Impact Estimate,
Minutes.

Most of the statistics are the same as for the All-NBA teams prediction. Minutes may be highly correlated with selection to All-NBA Rookie teams because most rookies aren't starters and don't play as much as experienced players, so only really good rookies get a lot of minutes.

The least correlated statistics are:

3PT Field Goal Percentage,
Free Throw Percentage,
Wins,
Field Goal Percentage,
Triple Doubles.

The low correlation between Wins and being selected to All-NBA Rookie teams is understandable because the best rookies usually play for teams from the bottom of the standings (the worst teams get the first picks in the draft). The low correlation between Triple Doubles and being selected to All-NBA Rookie teams is probably caused by the fact that a triple-double is difficult to achieve even for experienced players, and rookies usually spend less time on the court, so most of them don't achieve even one.

3. Splitting the data for training and validation sets

The data was split into training and validation sets so that each season is entirely in either the training or the validation set. Four validation seasons were randomly selected, and the validation score was calculated as the mean of the metric over those 4 seasons.
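A season-wise split like the one described above might look like this. The row layout, the function name and the fixed seed are assumptions for the sketch; the project's actual selection of validation seasons is not shown here.

```python
import random


def split_by_season(rows, n_val_seasons=4, seed=0):
    """Hold out whole seasons so that no season is shared between the two sets."""
    seasons = sorted({row["season"] for row in rows})
    rng = random.Random(seed)
    val_seasons = set(rng.sample(seasons, n_val_seasons))
    train = [row for row in rows if row["season"] not in val_seasons]
    valid = [row for row in rows if row["season"] in val_seasons]
    return train, valid, val_seasons


# Made-up rows: 10 seasons with 3 players each.
rows = [{"season": f"S{s}", "player": f"P{s}_{i}"} for s in range(10) for i in range(3)]
train, valid, val_seasons = split_by_season(rows)
print(sorted(val_seasons), len(train), len(valid))
```

Splitting by season rather than by row keeps all candidates for one year's teams together, which the metric needs because it is computed per season.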

4. Metric

The following metric was used to evaluate the model (proposed by the course lecturer):

+10 points for each player in correct team,
+8 points for each player classified in a team whose number differs by 1 from the correct one,
+6 points for each player classified in a team whose number differs by 2 from the correct one,
+5 points if 2 players are in correct team,
+10 points if 3 players are in correct team,
+20 points if 4 players are in correct team,
+40 points if 5 players are in correct team.

That means that the maximum number of points for a season is $5 \cdot (5 \cdot 10 + 40) = 450$ (5 teams in total, 3 All-NBA and 2 All-NBA Rookie, each worth at most 90 points).
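The metric can be sketched in a few lines. This is one reading of the rules above, assuming the bonus is awarded per predicted team based on how many of its players are in exactly the right team; the function name and the placeholder player names are hypothetical.

```python
POINTS_BY_DISTANCE = {0: 10, 1: 8, 2: 6}   # distance between predicted and actual team
BONUS_BY_EXACT_HITS = {2: 5, 3: 10, 4: 20, 5: 40}


def score_teams(predicted, actual):
    """predicted and actual are lists of teams (1st, 2nd, ...), each a list of names."""
    actual_index = {name: i for i, team in enumerate(actual) for name in team}
    total = 0
    for i, team in enumerate(predicted):
        exact_hits = 0
        for name in team:
            if name not in actual_index:
                continue  # player was not selected to any team: 0 points
            distance = abs(actual_index[name] - i)
            total += POINTS_BY_DISTANCE.get(distance, 0)
            exact_hits += distance == 0
        total += BONUS_BY_EXACT_HITS.get(exact_hits, 0)
    return total


actual = [["a1", "a2", "a3", "a4", "a5"],
          ["b1", "b2", "b3", "b4", "b5"],
          ["c1", "c2", "c3", "c4", "c5"]]
predicted = [["a1", "a2", "a3", "a4", "a5"],   # 5 exact: 50 + 40
             ["b1", "b2", "b3", "b4", "c1"],   # 4 exact, 1 off by one: 40 + 8 + 20
             ["c2", "c3", "x", "b5", "c4"]]    # 3 exact, 1 off by one, 1 miss: 30 + 8 + 10

print(score_teams(actual, actual))     # 270
print(score_teams(predicted, actual))  # 206
```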

Using metrics like accuracy would be misleading because the number of players not selected to any of the All-NBA teams is much higher than the number of those selected. For example, classifying every one of the 146 players eligible for All-NBA teams in the 2023-24 season as not selected would give an accuracy of about 0.90 (131 of 146).

5. Models

5.1. All-NBA teams prediction

Below are some of the models that were used to predict the players selected to All-NBA teams.

5.1.1. Baseline model (score: 148.25)

The baseline model was a Random Forest Classifier with n_estimators = 100 that predicted the probability of a player being selected to each of the All-NBA teams. The mean score on the validation set was 148.25 out of 270 points. The feature importance for the model is shown below:

The baseline model got a high score so it's a good starting point, but also makes it harder to find early improvements.

5.1.2. Random Forest Classifier with only per game statistics (score: 141.50)

After removing the statistics that weren't calculated as mean per game, the score of the model decreased to 141.5.

5.1.3. Random Forest Classifier with prediction voting (score: 154.50)

The model predicted the probability of a player being selected to each of the All-NBA teams and then the predictions were used to calculate voting points from the formula:

$VotPts=5 \cdot P_{1st Team} + 3 \cdot P_{2nd Team} + 1 \cdot P_{3rd Team}$.

The formula is based on the formula to calculate results of real All-NBA Team voting. After calculating the points, top players were added to each team. The score of the model was 154.5.
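A minimal sketch of the voting step, assuming the classifier output is available as a mapping from player to the three team probabilities. The function name, the input layout and the probabilities are made up; ties and the "not selected" class are not handled here.

```python
def assign_teams(probabilities, weights=(5, 3, 1), team_size=5):
    """Rank players by weighted voting points and cut the ranking into teams."""
    points = {
        player: sum(w * p for w, p in zip(weights, probs))
        for player, probs in probabilities.items()
    }
    ranked = sorted(points, key=points.get, reverse=True)
    return [ranked[i * team_size:(i + 1) * team_size] for i in range(len(weights))]


# Made-up probabilities (P_1st, P_2nd, P_3rd); team_size=1 keeps the example short.
probabilities = {
    "A": (0.8, 0.1, 0.05),   # 5*0.8 + 3*0.1 + 1*0.05 = 4.35
    "B": (0.2, 0.6, 0.1),    # 2.9
    "C": (0.0, 0.2, 0.5),    # 1.1
    "D": (0.0, 0.0, 0.1),    # 0.1
}
print(assign_teams(probabilities, team_size=1))  # [['A'], ['B'], ['C']]
```

Ranking on a single weighted score guarantees that every team gets exactly `team_size` players, which taking the most probable class per player does not.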

Only the mean per game statistics were used as the score was higher than for the model with all statistics.

5.1.4. Comparison of different default models - Logistic Regression, Support Vector Machine, Decision Tree, Random Forest, K-Nearest Neighbors, XGBOOST, LightGBM, Voting Classifier (score: 158.75)

The comparison of the models is shown in the table below:

Model	Only per game stats + Voting	No per game stats + Voting	All stats + Voting	Only per game stats + No Voting	No per game stats + No Voting	All stats + No Voting
Logistic Regression	121.75	109.25	109.25	116.00	109.25	111.75
Support Vector Machine	113.25	124.25	124.25	118.50	122.75	122.75
Decision Tree	120.50	102.75	110.00	117.00	86.75	87.75
Random Forest	154.50	158.75	145.75	142.00	149.25	148.25
K-Nearest Neighbors	106.50	105.25	105.25	95.00	104.25	104.25
XGBOOST	141.00	143.00	143.25	131.75	131.00	138.50
LightGBM	135.50	147.75	137.50	137.25	140.50	137.50
Voting Classifier*	133.75	145.00	136.50	139.25	139.50	136.25

*Voting Classifier was built from all the above models.

The best score in each configuration was achieved by Random Forest.

The best score was achieved by Random Forest Classifier (158.75). Scores above 140 points were achieved also by:

XGBOOST - in 3 configurations,
LightGBM - in 2 configurations,
Voting Classifier - in 1 configuration.

5.1.5. Hyperparameter tuning and feature selection (score: 175.75)

Only the 4 models that achieved 140 points at least once were selected for hyperparameter tuning.

The following parameter grid was created (Voting Classifier was built only from other models in this table):

	Random Forest Classifier	XGBoost	LightGBM	Voting Classifier
Parameters	`{'n_estimators': [100, 200, 300, 400, 500],` `'max_depth': [10, 25, 50, 100, None],` `'min_samples_split': [2, 5, 10],` `'min_samples_leaf': [1, 2, 4],` `'max_features': ['sqrt', 'log2']}`	`{'n_estimators': [100, 200, 300, 400, 500],` `'max_depth': [10, 25, 50, 100, None],` `'learning_rate': [0.01, 0.05, 0.1, 0.2],` `'subsample': [0.6, 0.8, 1],` `'colsample_bytree': [0.5, 0.8, 1],` `'gamma': [0, 0.1, 0.2, 0.3, 0.4]}`	`{'n_estimators': [100, 200, 300, 400, 500],` `'max_depth': [10, 25, 50, 100, None],` `'learning_rate': [0.01, 0.05, 0.1, 0.2],` `'subsample': [0.6, 0.8, 1],` `'colsample_bytree': [0.5, 0.8, 1]}`	`{'weights': [[1, 1, 1], [1, 2, 1], [1, 1, 2], [2, 1, 1], [2, 2, 1], [1, 2, 2]]` `'voting': 'soft'}`

Feature selection was also implemented. In each iteration a random number of features (at least 5) was randomly chosen from the list of statistics, and for each set of features there were 50 iterations of hyperparameter tuning.

By randomly choosing the features 50 times and then randomly choosing parameters 50 times, there was a total of 2500 results for each model (10000 in total). Each model was also tested with and without additional voting for the prediction, so 20000 combinations were checked in total. The optimization process took ~6 hours. The best model got a score of 175.75 (which is a significant improvement over the baseline model). Features and hyperparameters for the best model are as follows:

model: Random Forest Classifier,
model parameters: {'n_estimators': 200, 'max_depth': 10, 'min_samples_split': 2, 'min_samples_leaf': 2, 'max_features': 'sqrt'},
features: ['STL', 'FTM_2', 'STL_per_GP', 'PIE_per_GP', 'FTA', 'FG3M_per_GP', 'POTM', 'All-Star', 'MIN', 'FGM_2', 'FP', 'DD', 'GP', 'FTA_per_GP', 'PTS', 'REB', 'DPOY', 'FT_PCT', 'REB_per_GP', 'All-Star-MVP', 'L', 'TD', 'FG3_PCT', 'BLK_per_GP', 'PTS_per_GP', 'AST', 'PIE', 'W', 'FG3M_2', 'TO_per_GP', 'FGM_per_GP', 'FGA', 'FTM_per_GP', 'ROTM'],
additional voting: True.

All the models with their parameters and features were saved to a csv file.
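The random search over feature sets and hyperparameters described in this section can be sketched as below. It is a simplified illustration: `evaluate` stands in for training a model and computing the validation score, and `toy_evaluate`, the short feature list and the small grid are placeholders, not the project's code.

```python
import random


def random_search(features, grid, evaluate, n_feature_sets=50, n_param_draws=50,
                  min_features=5, seed=0):
    """Try random feature subsets, random grid points and both voting settings."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_feature_sets):
        subset = rng.sample(features, rng.randint(min_features, len(features)))
        for _ in range(n_param_draws):
            params = {name: rng.choice(values) for name, values in grid.items()}
            for voting in (False, True):
                score = evaluate(subset, params, voting)
                results.append((score, subset, params, voting))
    best = max(results, key=lambda result: result[0])
    return best, results


def toy_evaluate(subset, params, voting):
    """Placeholder score; a real run would fit the model and apply the metric."""
    return len(subset) + (1 if voting else 0)


features = ["PTS", "REB", "AST", "STL", "BLK", "FP", "PIE", "MIN"]
grid = {"n_estimators": [100, 200, 300], "max_depth": [10, 25, None]}
best, results = random_search(features, grid, toy_evaluate, n_feature_sets=3, n_param_draws=4)
print(len(results), best[0])
```

With the default 50 feature sets and 50 parameter draws this gives 5000 evaluations per model (2500 combinations, each with and without voting), which matches the 20000 combinations reported for the 4 models.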

5.1.6. How predictions for validation set could be improved

Before the 2023/2024 season, each of the All-NBA teams contained 2 guards, 2 forwards and 1 center (since 2023/24 the voting is positionless). With that in mind, the model could be improved by adding information about each player's position and then filtering the predictions so that every team has the correct number of players at each position.
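One possible way to apply such a position filter is a greedy fill over the ranked predictions. This is only a sketch of the proposed improvement, which was not implemented in the project; the function name, the single-letter position codes and the sample ranking are assumptions.

```python
def fill_teams_by_position(ranked_players, n_teams=3, quota=None):
    """Place each player, best first, into the first team with a free slot at his position."""
    quota = quota or {"G": 2, "F": 2, "C": 1}
    teams = [{"players": [], "free": dict(quota)} for _ in range(n_teams)]
    for name, position in ranked_players:
        for team in teams:
            if team["free"].get(position, 0) > 0:
                team["players"].append(name)
                team["free"][position] -= 1
                break  # players with no free slot in any team are skipped
    return [team["players"] for team in teams]


# Made-up ranking (best first) with two centers at the top.
ranked = [("c1", "C"), ("c2", "C"), ("g1", "G"), ("g2", "G"), ("f1", "F"),
          ("f2", "F"), ("g3", "G"), ("f3", "F"), ("g4", "G"), ("f4", "F")]
print(fill_teams_by_position(ranked, n_teams=2))
```

In this example the second-ranked player is a center, so he drops to the 2nd team even though he outranks every guard and forward on the 1st team.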

5.2. Rookie All-NBA teams prediction

5.2.1. Baseline model (score: 131.25)

Random Forest Classifier with n_estimators = 100 was used as a baseline model. The score on the validation set (with all features) was 131.25 (out of 180).

5.2.2. Random Forest Classifier with voting (score: 136.50)

Similar to the model for All-NBA teams, the probability voting was added. The formula for voting was changed to:

$VotPts = 2 \cdot P_{1st Team} + 1 \cdot P_{2nd Team}$

After adding the voting, the score for the model increased to 136.5.

5.2.3. Best model for predicting All-NBA teams (score: 126.00)

Using the model that was best for predicting All-NBA teams, with the same features, resulted in a score of 126.00 with additional voting and 115.50 without it.

5.2.4. Hyperparameter tuning and feature selection (score: 174.50)

The same parameter grid was used as for All-NBA teams prediction. Once again the parameters were randomly chosen 50 times for each of 50 randomly chosen sets of features. The best score (174.5) was achieved by 3 models:

XGBoost:
- model parameters: {'n_estimators': 400, 'max_depth': 25, 'learning_rate': 0.1, 'subsample': 0.8, 'colsample_bytree': 0.5, 'gamma': 0.0},
- features: ['All-Star-MVP' 'FG3_PCT' 'W' 'REB_per_GP' 'POTW' 'FP_per_GP' 'FGM_2' 'STL_per_GP' 'STL' 'TD' 'FG_PCT' 'REB' 'FG3A' 'ROTM' 'FGA' 'FGA_per_GP' 'FTM_2' 'PTS' 'FP' 'AST_per_GP' 'DD' 'POTM' 'MIN' 'AST' 'FG3M_per_GP' 'TO_per_GP' 'All-Star' 'PIE' 'PTS_per_GP' 'L' 'BLK_per_GP' 'PIE_per_GP' 'FTA_per_GP'],
- additional voting: True,
LightGBM:
- model parameters: {'n_estimators': 300, 'max_depth': 10, 'learning_rate': 0.01, 'subsample': 0.6, 'colsample_bytree': 0.8},
- features: ['All-Star-MVP' 'POTW' 'W' 'ROTM' 'FG3A' 'STL' 'REB' 'PIE' 'REB_per_GP' 'TO_per_GP' 'FG3M_2' 'PTS' 'STL_per_GP' 'FP_per_GP' 'L' 'FP' 'MIN' 'MIN_per_GP' 'FG3A_per_GP' 'PIE_per_GP' 'FTM_per_GP' 'FGA_per_GP' 'FG3M_per_GP' 'POTM' 'GP' 'FG3_PCT'],
- additional voting: True,
LightGBM:
- model parameters: {'n_estimators': 100, 'max_depth': None, 'learning_rate': 0.05, 'subsample': 0.8, 'colsample_bytree': 1.0},
- features: ['All-Star-MVP' 'POTW' 'W' 'ROTM' 'FG3A' 'STL' 'REB' 'PIE' 'REB_per_GP' 'TO_per_GP' 'FG3M_2' 'PTS' 'STL_per_GP' 'FP_per_GP' 'L' 'FP' 'MIN' 'MIN_per_GP' 'FG3A_per_GP' 'PIE_per_GP' 'FTM_per_GP' 'FGA_per_GP' 'FG3M_per_GP' 'POTM' 'GP' 'FG3_PCT'],
- additional voting: True.

The XGBoost model was chosen because it was the first of the models that reached the highest score. All the models with their parameters, used features and scores were saved to a csv file.

6. Predictions for 2023/2024 season

6.1. All-NBA teams

Predictions for the 2023/2024 season are based on the best model from section 5.1.5. The predictions are shown in the table below:

1st Team	2nd Team	3rd Team
Nikola Jokic	Jalen Brunson	Devin Booker
Luka Doncic	Anthony Davis	Domantas Sabonis
Shai Gilgeous-Alexander	Anthony Edwards	Damian Lillard
Giannis Antetokounmpo	Kevin Durant	Kawhi Leonard
Jayson Tatum	LeBron James	Tyrese Haliburton

6.2. Rookie All-NBA teams

Predictions for the 2023/2024 season are based on the best model from section 5.2.4. The predictions are shown in the table below:

1st Team	2nd Team
Victor Wembanyama	Scoot Henderson
Chet Holmgren	Keyonte George
Brandon Miller	Amen Thompson
Jaime Jaquez Jr.	Dereck Lively II
Brandin Podziemski	GG Jackson

7. Summary

The final score on the 2023-24 season was 356 points (out of 450):

All-NBA Teams:
- 1st Team: 10+10+10+10+10+40 = 90,
- 2nd Team: 10+10+10+10+8+20 = 68,
- 3rd Team: 10+10+0+8+10+10 = 48,
All-NBA Rookie Teams:
- 1st Team: 10+10+10+10+10+40 = 90,
- 2nd Team: 0+10+10+10+10+20 = 60.

The models correctly predicted 21 out of 25 players, 2 players were in wrong teams (difference of 1 team) and 2 players were missing.

As mentioned in section 5.1.6, before the 2023/24 season players were chosen to All-NBA Teams based on their position. Position information could be added to the model to improve predictions on the validation set, but not for the 2023-24 season (this doesn't apply to All-NBA Rookie Teams, as they have always been positionless). It also means that the training data wasn't a perfect match for the 2023-24 voting rules.

8. Possible improvements

Apart from adding the information about the positions of the players, the following improvements could be implemented:

add the player's age or number of seasons in the NBA as a feature,
add the draft pick number as a feature,
use the whole dataset since the 1946-47 season (the problem is that before 1988-89 only 2 All-NBA teams were selected),
create an even bigger parameter grid and test more combinations.

About

Prediction of players selected to the All-NBA 1st, 2nd and 3rd Teams and the All-NBA Rookie 1st and 2nd Teams. For the 2023/24 season, 21 out of 25 players were predicted correctly. Project for the course "Selected topics of machine learning" during the 1st semester of a master's degree.