NitinN77 / DOTA2-Match-Result-Prediction

Dota 2 match result prediction benchmarking using a multitude of machine learning classification algorithms and datasets with different feature sets.


DOTA 2 Match Result Prediction

Datasets

Analysis

Predicting a Dota 2 match's outcome at any given point during the match is notoriously hard, even with all the related data: net worth advantage, team kills, creep scores, hero levels, hero picks, and tower scores. Valve's Dota Plus win prediction during a pro game is an excellent example of this. Its heavy fluctuation around the 50% probability mark throughout the game leaves the outcome uncertain even with a state-of-the-art model.

In lower-ranked public games, there are more trends for models to pick up on. This is reflected in the hero win rates during a specific major patch on a source like Dotabuff.

However, lower rank also means lower skill: a hero's countering potential is sometimes lost to weaker mechanical play and subpar or ineffective item builds.

This means the picked heroes alone would not suffice for an accurate prediction of the result; the items purchased by each player need to be added to the feature set for better accuracy on a held-out test set.

To show the difference in performance between the two feature sets (heroes only, and heroes plus items), the same models were trained on both datasets and benchmarked.

I decided to use two separate sources to verify that the inference holds irrespective of source. The Kaggle dataset contains older matches from 2019 across a wide skill bracket, while the OpenDota dataset contains only more recent matches from above-average skill brackets.

Training

The features were one-hot encoded, with each hero having a Radiant and a Dire slot to account for side bias and team separation.
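The encoding described above can be sketched as follows. This is a minimal illustration, not the repo's actual preprocessing code; the hero-pool size of 121 is an assumption inferred from the 242-feature OpenDota dataset (2 slots per hero), and hero IDs are assumed to be 0-based.

```python
import numpy as np

N_HEROES = 121  # assumed hero-pool size; 2 * 121 = 242 features (OpenDota)

def encode_match(radiant_ids, dire_ids, n_heroes=N_HEROES):
    """One-hot encode a match: one Radiant slot and one Dire slot per hero."""
    x = np.zeros(2 * n_heroes, dtype=np.int8)
    for h in radiant_ids:        # Radiant block occupies indices [0, n_heroes)
        x[h] = 1
    for h in dire_ids:           # Dire block occupies indices [n_heroes, 2*n_heroes)
        x[n_heroes + h] = 1
    return x

# Example: Radiant picks heroes 0, 5, 9, 14, 20; Dire picks 1, 3, 7, 11, 30
row = encode_match([0, 5, 9, 14, 20], [1, 3, 7, 11, 30])
```

Keeping separate Radiant and Dire slots lets a model learn side-specific effects (e.g. a hero that performs better on Radiant) that a single shared slot per hero would hide.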

The kaggle-heroes dataset contains 50000 rows and 222 features, while the heroes&items dataset contains 50000 rows and 612 features, with a 10% test split for both.

The opendota dataset contains 50000 rows and 242 features (more heroes, since it is more recent), with a 5% test split.
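A split like the one described above is typically produced with scikit-learn. This sketch uses random stand-in data of the same width as the OpenDota feature matrix; the `random_state` and `stratify` choices are illustrative assumptions, not values from the repo.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 242))  # stand-in for the one-hot features
y = rng.integers(0, 2, size=1000)         # 1 = Radiant win, 0 = Dire win

# Hold out 5% for testing, mirroring the OpenDota split; stratify keeps the
# win/loss ratio identical across train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.05, random_state=42, stratify=y
)
```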

All machine learning algorithms were trained after hyperparameter tuning with scikit-learn's GridSearchCV.
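The tuning step looks roughly like this for one of the models. The parameter grid below is a hypothetical example, not the grid actually used in the notebooks, and the data is a small random stand-in.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 20)).astype(float)
y = rng.integers(0, 2, size=200)

# Hypothetical grid over the regularization strength; GridSearchCV runs
# 5-fold cross-validation for every candidate and keeps the best scorer.
param_grid = {"C": [0.01, 0.1, 1, 10]}
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
search.fit(X, y)

best_model = search.best_estimator_  # refit on the full data with the best C
```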

Machine Learning and Deep Learning Algorithms

  • Decision Tree Classifier
  • Logistic Regression
  • Stochastic Gradient Descent SVM
  • Linear Support Vector Machine
  • Gaussian Naive Bayes
  • XGBClassifier
  • Random Forest Classifier
  • Multi-Layer Perceptron
  • Soft-Voting Ensemble (LR, GNB, XGB, RFC)
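The soft-voting ensemble listed above can be sketched with scikit-learn's `VotingClassifier`. To keep this sketch dependency-free it substitutes a plain Random Forest for xgboost's XGBClassifier, which the actual ensemble includes; the data is again a random stand-in.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 20)).astype(float)
y = rng.integers(0, 2, size=200)

# voting="soft" averages each member's predicted class probabilities,
# rather than taking a majority vote over hard labels.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("gnb", GaussianNB()),
        ("rfc", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    voting="soft",
)
ensemble.fit(X, y)
probs = ensemble.predict_proba(X[:1])
```

Soft voting requires every member to implement `predict_proba`, which is why probabilistic models (LR, GNB, XGB, RFC) were chosen for the ensemble.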

Results

Heroes Only

| Algorithm | Accuracy | Precision | Recall |
| --- | --- | --- | --- |
| Decision Tree | 55% | 57% | 67% |
| Logistic Regression | 59% | 61% | 65% |
| Stochastic Gradient Descent SVM | 59% | 62% | 62% |
| Linear SVM | 59% | 61% | 64% |
| Gaussian Naive Bayes | 60% | 60% | 70% |
| XGB Classifier | 59% | 61% | 64% |
| Random Forest | 59% | 60% | 69% |
| Multi-layer Perceptron | 59-60% | - | - |
| Soft-Voting Ensemble | 62% | - | - |

Heroes+Items

| Algorithm | Accuracy | Precision | Recall |
| --- | --- | --- | --- |
| Decision Tree | 83% | 85% | 84% |
| Logistic Regression | 97% | 97% | 97% |
| Stochastic Gradient Descent SVM | 97% | 98% | 96% |
| Linear SVM | 97% | 97% | 97% |
| Gaussian Naive Bayes | 86% | 86% | 88% |
| XGB Classifier | 95% | 96% | 94% |
| Random Forest | 95% | 96% | 94% |
| Multi-layer Perceptron | 97-99% | - | - |

Conclusion

The results make it clear that heroes alone are not enough to reliably predict the outcome of a match; the factors that come into play during the match have a far larger role in deciding it. One of these factors is the items purchased, which produced much better models across the board. The best model on the heroes-only dataset was the soft-voting ensemble of Logistic Regression, Gaussian Naive Bayes, XGBClassifier, and Random Forest, which reached 62% accuracy on the test set, respectable given the lack of strongly deciding features. Meanwhile, the models trained on heroes and items ranged from a worst of 83% to a best of ~99% accuracy on the test set, the latter achieved by a deep neural network.

About


License: MIT


Languages

Jupyter Notebook 98.5%, Python 1.5%