NitinN77 / DOTA2-Match-Result-Prediction

Dota 2 match result prediction benchmarking using a multitude of machine learning classification algorithms and datasets with different feature sets.


DOTA 2 Match Result Prediction

Datasets

Analysis

Predicting a Dota 2 match's outcome at any given point during the match is notoriously hard, even with all the related data: net worth advantage, team kills, creep scores, hero levels, hero picks, and tower scores. Valve's Dota Plus win prediction during a pro game is an excellent example of this. Its heavy fluctuation around the 50% probability mark throughout the game leaves the outcome uncertain even with a state-of-the-art model.

In lower-ranked public games, there are more trends for models to pick up on. This is reflected in the hero win rates during a specific major patch on a source like Dotabuff.

However, lower rank also means lower skill: a hero's countering potential is sometimes lost to weaker mechanical play and subpar or ineffective item builds.

This means the picked heroes alone would not suffice for an accurate prediction of the result; the items purchased by each player need to be added to the feature set for better accuracy on a held-out test set.

To show the difference in performance between the two feature sets (heroes only, and heroes plus items), the same models were trained on both datasets and benchmarked.

I decided to use two separate sources to verify that the inference holds irrespective of source. The Kaggle dataset contains older matches from 2019 across a wide skill bracket, while the OpenDota dataset contains only more recent matches from above-average skill brackets.

Training

The features were one-hot encoded, with each hero having a Radiant and a Dire slot to account for side bias and team separation.
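The encoding described above can be sketched as follows. This is a minimal illustration, not the repo's actual preprocessing code; the hero-pool size of 121 is an assumption inferred from the 242-feature OpenDota dataset (2 slots per hero), and hero IDs are assumed to be 0-based.

```python
import numpy as np

N_HEROES = 121  # assumed hero-pool size; 2 * 121 = 242 features (OpenDota)

def encode_match(radiant_ids, dire_ids, n_heroes=N_HEROES):
    """One-hot encode a match: one Radiant slot and one Dire slot per hero."""
    x = np.zeros(2 * n_heroes, dtype=np.int8)
    for h in radiant_ids:        # Radiant block occupies indices [0, n_heroes)
        x[h] = 1
    for h in dire_ids:           # Dire block occupies indices [n_heroes, 2*n_heroes)
        x[n_heroes + h] = 1
    return x

# Example: Radiant picks heroes 0, 5, 9, 14, 20; Dire picks 1, 3, 7, 11, 30
row = encode_match([0, 5, 9, 14, 20], [1, 3, 7, 11, 30])
```

Keeping separate Radiant and Dire slots lets a model learn side-specific effects (e.g. a hero that performs better on Radiant) that a single shared slot per hero would hide.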

The kaggle-heroes dataset contains 50000 rows and 222 features, while the heroes&items dataset contains 50000 rows and 612 features, with a 10% test split for both.

The opendota dataset contains 50000 rows and 242 features (more heroes, since it is more recent), with a 5% test split.
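A split like the one described above is typically produced with scikit-learn. This sketch uses random stand-in data of the same width as the OpenDota feature matrix; the `random_state` and `stratify` choices are illustrative assumptions, not values from the repo.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 242))  # stand-in for the one-hot features
y = rng.integers(0, 2, size=1000)         # 1 = Radiant win, 0 = Dire win

# Hold out 5% for testing, mirroring the OpenDota split; stratify keeps the
# win/loss ratio identical across train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.05, random_state=42, stratify=y
)
```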

All machine learning algorithms were trained after hyperparameter tuning with scikit-learn's GridSearchCV.
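The tuning step looks roughly like this for one of the models. The parameter grid below is a hypothetical example, not the grid actually used in the notebooks, and the data is a small random stand-in.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 20)).astype(float)
y = rng.integers(0, 2, size=200)

# Hypothetical grid over the regularization strength; GridSearchCV runs
# 5-fold cross-validation for every candidate and keeps the best scorer.
param_grid = {"C": [0.01, 0.1, 1, 10]}
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
search.fit(X, y)

best_model = search.best_estimator_  # refit on the full data with the best C
```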

Machine Learning and Deep Learning Algorithms

  • Decision Tree Classifier
  • Logistic Regression
  • Stochastic Gradient Descent SVM
  • Linear Support Vector Machine
  • Gaussian Naive Bayes
  • XGBClassifier
  • Random Forest Classifier
  • Multi-Layer Perceptron
  • Soft-Voting Ensemble (LR, GNB, XGB, RFC)
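The soft-voting ensemble listed above can be sketched with scikit-learn's `VotingClassifier`. To keep this sketch dependency-free it substitutes a plain Random Forest for xgboost's XGBClassifier, which the actual ensemble includes; the data is again a random stand-in.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 20)).astype(float)
y = rng.integers(0, 2, size=200)

# voting="soft" averages each member's predicted class probabilities,
# rather than taking a majority vote over hard labels.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("gnb", GaussianNB()),
        ("rfc", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    voting="soft",
)
ensemble.fit(X, y)
probs = ensemble.predict_proba(X[:1])
```

Soft voting requires every member to implement `predict_proba`, which is why probabilistic models (LR, GNB, XGB, RFC) were chosen for the ensemble.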

Results

Heroes Only

| Algorithm | Accuracy | Precision | Recall |
| --- | --- | --- | --- |
| Decision Tree | 55% | 57% | 67% |
| Logistic Regression | 59% | 61% | 65% |
| Stochastic Gradient Descent SVM | 59% | 62% | 62% |
| Linear SVM | 59% | 61% | 64% |
| Gaussian Naive Bayes | 60% | 60% | 70% |
| XGB Classifier | 59% | 61% | 64% |
| Random Forest | 59% | 60% | 69% |
| Multi-layer Perceptron | 59-60% | - | - |
| Soft-Voting Ensemble | 62% | - | - |

Heroes+Items

| Algorithm | Accuracy | Precision | Recall |
| --- | --- | --- | --- |
| Decision Tree | 83% | 85% | 84% |
| Logistic Regression | 97% | 97% | 97% |
| Stochastic Gradient Descent SVM | 97% | 98% | 96% |
| Linear SVM | 97% | 97% | 97% |
| Gaussian Naive Bayes | 86% | 86% | 88% |
| XGB Classifier | 95% | 96% | 94% |
| Random Forest | 95% | 96% | 94% |
| Multi-layer Perceptron | 97-99% | - | - |

Conclusion

The results make it clear that heroes alone are not enough to reliably predict the outcome of a match; the factors that come into play during the match have a far larger role in deciding it. One of these factors is the items purchased, which produced much better models across the board. The best model on the heroes-only dataset was the soft-voting ensemble of Logistic Regression, Gaussian Naive Bayes, XGBClassifier, and Random Forest, which reached 62% accuracy on the test set, respectable given the lack of strongly deciding features. Meanwhile, the models trained on heroes and items ranged from a worst of 83% to a best of ~99% accuracy on the test set, the latter achieved by a deep neural network.

About


License: MIT


Languages

Jupyter Notebook 98.5%, Python 1.5%