RutvijBhutaiya/Cricket-World-Cup-2019

Source: ESPN Cricket World Cup

Cricket World Cup 2019 - Cricket Match Prediction Game

How To Use The Project


Step 1: Install R Studio

Step 2: Download ODI Matches - Data

Step 3: Clean Data and Get Format ready

Step 4: Clone/Download the Repository

Step 5: Make necessary changes [e.g., add new match data to the WC_Train.csv file]

Step 6: Perform exploratory data analysis (EDA)

Step 7: Run Random Forest Model

Step 8: Store Results in Random Forest Prediction.csv

Step 9: Run Logistic Regression Model

Step 10: Store Results in Logistic Regression Prediction.csv

Step 11: Run Compare Model Predict

Step 12: Store Models vs Actual Results in Compare Predict - RF vs. LR

Objective

To predict ICC World Cup 2019 cricket matches based on each team's past performance.

Approach

Collect data from – Link
Data Cleaning and Data Normalization
Exploratory Data Analysis Link - Repository
Build Random Forest Model Link - Repository
Performance of RF model & Results Link - Repository
Build Logistic Regression Model Link - Repository
Performance of LR model & Results Link - Repository
Compare model performance vs. actual match results - Results

Data Collection

In this study, our approach is to predict ICC WC 2019 matches based on past ODI match results. On that basis, stronger teams like Australia, India and New Zealand would be expected to perform better, and weaker teams like Pakistan and West Indies to struggle. This is not our opinion: our study of past ODI match data reveals the strong and weak contenders for World Cup 2019.

Hence, we decided to study past ODI matches from 2007 to 2018. To collect the dataset, we used HowStats.

For data collection, we extracted ODI matches year by year [since 1987] and stored the dataset in Excel sheets. However, for our study we considered only ODI matches played from 2007 to 2018, because we believe very old match results [like those from the early 1990s] should not have a significant impact on team-wise performance for the 2019 WC. Hence, we decided to study the latest team-wise performances.

Data Cleaning

After extracting data from HowStats, we stored the datasets in Excel sheets, one per year.

For cleaning, we used the 'Text to Columns' function very frequently [basically, we used a few Excel functions to clean the entire dataset].

NOTE: Due to the lack of data for Afghanistan's matches, we decided to exclude Afghanistan from the study. [Had we included Afghanistan in the WC 2019 prediction study, the model would probably have shown Afghanistan losing every match, and could have become biased!]
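The exclusion above was done in Excel, but a minimal sketch of the same step in R might look like this. It assumes the team names sit in the `Trim.Team.A` and `Trim.Team.B` columns used later in this README; the toy data frame `matches` and its rows are hypothetical.

```r
# Hypothetical toy data; the real matches live in WC_Train.csv
matches = data.frame(
  Trim.Team.A = c("India", "Afghanistan", "England", "Australia"),
  Trim.Team.B = c("Australia", "Pakistan", "Afghanistan", "New Zealand"),
  Team.A.Won  = c(1, 0, 1, 0),
  stringsAsFactors = TRUE
)

# Keep only matches where neither side is Afghanistan
keep = matches$Trim.Team.A != "Afghanistan" & matches$Trim.Team.B != "Afghanistan"

# droplevels() also removes the now-unused "Afghanistan" factor level,
# so no empty dummy column is created for it later
matches.clean = droplevels(matches[keep, ])

nrow(matches.clean)
levels(matches.clean$Trim.Team.A)
```

Dropping the unused level matters because the later `model.matrix()` step creates one column per factor level, whether or not any row uses it.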

Exploratory Data Analysis

For the WC 2019 match prediction study we decided to use data from 2007 to 2018. Many studies find that more data makes a model better. True! But for the objective of this study, we limited the number of observations, because we feel that team performance from the early 1990s is no longer representative (especially the players, who have a significant impact on winning or losing a particular match). For example, West Indies was a star-performing team, but for the last decade and longer the team has barely been able to win consistently.

We also assume that the more matches a team plays, the greater its ODI experience, and that this contributes to the team's overall performance.

For the training dataset, we chose 983 observations, where most of the variables are factors.

> dim(ws)   ## Dimension of dataset
> str(ws)   ## Structure of dataset

Hence, before building the supervised learning models, we converted the factors into dummy variables. Using the rpivotTable(wc) function, we found some interesting patterns.

As we can see from the above chart table, over the last 2 years (2017 & 2018) England and India have been winning consistently and are trending at the top positions.

Similarly, you can see that the 2011 World Cup final was between India and Sri Lanka. In that cluster of years Australia was a top contender for the final, so how did Sri Lanka reach the final? Because India knocked out Australia in the 2nd Quarter Final, and Sri Lanka faced New Zealand in the Semi Final, which Sri Lanka won by 5 wickets.

Similarly, for World Cup 2015, based on the following bar chart, we can see how New Zealand emerged from 2012 to 2014 and challenged Australia in the 2015 WC final.

In World Cup 2019, the strong contenders are India, England, New Zealand and South Africa.

Build Random Forest Model

We successfully loaded the dataset into R and created a train variable for the 2007 to 2018 cricket matches.

NOTE: As of 26th June, the code has been tuned for more accurate results, and WC 2019 matches are also included to train the model.

wc = read.csv('WC_Train.csv')

## Data From 2007 World Cup till 2018 Cricket Matches

train = wc[which(wc$Year >= 2007 & wc$Year <=2018),]

For the supervised learning technique RF, we converted the Team A and Team B categorical variables into dummy variables.

## Create dummy variables for Team A and Team B (TRAIN)

Team.A.matrix = model.matrix(~ Trim.Team.A - 1, data = train)
train = data.frame(train, Team.A.matrix)

Team.B.matrix = model.matrix(~ Trim.Team.B - 1, data = train)
train = data.frame(train, Team.B.matrix)
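To see what `model.matrix()` produces here, a small self-contained sketch on a hypothetical three-row data frame (`toy` is an invented name; only the column name `Trim.Team.A` comes from the code above):

```r
# Hypothetical three-row example; the column name follows the README
toy = data.frame(Trim.Team.A = factor(c("India", "England", "India")))

# "- 1" drops the intercept, so every team gets its own 0/1 column
dummies = model.matrix(~ Trim.Team.A - 1, data = toy)

# Column names are the variable name followed by the factor level
colnames(dummies)

# Append the dummy columns to the original data, as done for train above
toy = data.frame(toy, dummies)
toy
```

Each row has exactly one `1` among the team columns, which is why the dummy columns for a single factor always sum to 1 per row.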

As discussed earlier, the target variable in the study is Team.A.Won, which is '1' when Team A won a particular match and '0' when Team A lost it. Here, '0' means Team B won the match. With the randomForest() function from the randomForest library we built a random forest model on the train dataset. After tuning the model, we predicted results as both 'class' and 'prob' types.

print(wc.rf.tune)

test1$Team.A.Win = predict(wc.rf.tune, test1, type = 'class')
test1$Team.A.Score = predict(wc.rf.tune, test1, type = 'prob')

The results are stored in the Random Forest Prediction.csv file.
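The model-fitting call itself is not shown above. A rough sketch of how a classification forest could be fitted and used for both kinds of prediction with the randomForest package follows. The toy data, the column names other than `Team.A.Won`, and the `ntree` and `mtry` values are all assumptions for illustration, not the settings used in this repository.

```r
# Sketch only: toy data and parameter values are assumptions
library(randomForest)

set.seed(2019)
n = 200
toy.train = data.frame(
  Team.A.Won       = factor(rbinom(n, 1, 0.5)),  # factor target => classification
  Team.A.India     = rbinom(n, 1, 0.3),          # hypothetical dummy columns
  Team.B.Australia = rbinom(n, 1, 0.3),
  Year             = sample(2007:2018, n, replace = TRUE)
)

# Fit the forest; ntree and mtry are the usual knobs to tune
toy.rf = randomForest(Team.A.Won ~ ., data = toy.train,
                      ntree = 201, mtry = 2, importance = TRUE)
print(toy.rf)  # reports the out-of-bag (OOB) error and confusion matrix

# Predict hard classes and class probabilities for "new" matches
toy.test   = toy.train[1:5, -1]
pred.class = predict(toy.rf, toy.test, type = "response")
pred.prob  = predict(toy.rf, toy.test, type = "prob")  # columns "0" and "1"
pred.prob
```

Because `type = "prob"` returns one column per class, storing it in a single data frame column and writing it to .csv yields two columns, one for class 0 and one for class 1.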

Random Forest Results

The random forest model had a high error rate, and even after tuning the model we were not able to reduce the error.

Based on the results we were not fully satisfied, and hence decided to work on the supervised learning technique Logistic Regression to predict ICC Cricket World Cup 2019 matches.

NOTE: As of 26th June, the code has been tuned for more accurate results, and WC 2019 matches are also included to train the model.

After 26 June, match results are stored in the Random Forest Prediction after 25th June Matches.csv file.

Build Logistic Regression Model

Similarly, for Logistic Regression we created a train dataset of ODI matches from 2007 to 2018, and created dummy variables so that the target variable Team.A.Won could be modelled against all the independent variables.

NOTE: As of 26th June, the code has been tuned for more accurate results, and WC 2019 matches are also included to train the model.

NOTE: As of 08th July, the code has been tuned for semi-final predictions.

logit = Team.A.Won ~ .  # A few variables are not significant; however, because they represent teams, we decided to keep all variables.

logit.plot = glm(logit, data = train, family = binomial)

summary(logit.plot)

However, we also found that a few dummy variables among the independent variables are not significant for the study [like Bangladesh and West Indies]. In the end, we decided to keep all the team dummy variables in the study.

Based on the model logit.plot we predicted the matches in the test1 file for the 2019 World Cup, and stored the results in the Logistic Regression Prediction.csv file. We also evaluated the Logistic Regression model. However, we believe the correct evaluation of the model is the actual match result.
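The prediction step described above can be sketched on toy data as follows. `type = "response"` makes `predict()` return probabilities rather than log-odds, and `ifelse()` turns them into a 0/1 win flag. The names `toy.train`, `toy.test` and `Strength.Diff` are invented for illustration, and the 0.5 cut-off is an assumption.

```r
# Hypothetical toy data: one numeric predictor and a 0/1 target
set.seed(1)
n = 100
toy.train = data.frame(Strength.Diff = rnorm(n))
toy.train$Team.A.Won = rbinom(n, 1, plogis(1.5 * toy.train$Strength.Diff))

# Same model form as above: target against all other columns
toy.logit = glm(Team.A.Won ~ ., data = toy.train, family = binomial)

# Predict win probabilities for three hypothetical upcoming matches
toy.test      = data.frame(Strength.Diff = c(-2, 0, 2))
predict.logit = predict(toy.logit, newdata = toy.test, type = "response")

# Convert probabilities into a binary prediction (1 = Team A wins)
Team.A.Win = ifelse(predict.logit > 0.5, 1, 0)
data.frame(toy.test, predict.logit, Team.A.Win)
```

Keeping the raw probability next to the binary flag makes it easy to spot close calls later, when comparing against actual results.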

Logistic Regression Results

To evaluate the model we plotted the ROC curve and calculated the accuracy of the predicted results.

## Model Evaluation 

m3.matrix = confusion.matrix(test1$Team.A.Win, predict.logit, threshold = 0.5)
m3.matrix

library(pROC)
m3.roc = roc(test1$Team.A.Win, predict.logit)
m3.roc
plot(m3.roc)

## ON RESULT RATIOS DATA SET
accuracy.logit<-sum(diag(m3.matrix))/sum(m3.matrix)
accuracy.logit
[1] 0.7567568

As shown, the model accuracy is about 75.7%, and the following are the predicted results for the WC 2019 matches.

NOTE: As of 26th June, the code has been tuned for more accurate results, and WC 2019 matches are also included to train the model.

NOTE: As of 08th July, the code has been tuned for semi-final predictions.

After 26 June, match results are stored in the Logistic Regression Prediction after 25th June Matches.csv file.
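If a `confusion.matrix()` helper is not available, the same accuracy calculation can be sketched in base R with `table()`. The outcome and probability vectors below are hypothetical.

```r
# Hypothetical outcomes (1 = Team A won) and predicted probabilities
actual    = c(1, 0, 1, 1, 0, 1, 0, 0)
prob      = c(0.81, 0.35, 0.62, 0.44, 0.20, 0.90, 0.58, 0.10)
predicted = ifelse(prob > 0.5, 1, 0)

# Rows = actual, columns = predicted; fixed levels keep the table 2 x 2
# even if one class is never predicted
cm = table(Actual    = factor(actual,    levels = c(0, 1)),
           Predicted = factor(predicted, levels = c(0, 1)))
cm

# Accuracy = correct predictions (the diagonal) / all predictions
accuracy = sum(diag(cm)) / sum(cm)
accuracy  # 6 of 8 correct = 0.75
```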

CHAID Model

We also built a CHAID model to predict the WC 2019 matches. However, we didn't get good outputs from the model, hence we didn't highlight it in the study. CHAID codes in Repository

Out of a total of 37 matches, 4 matches had no result (due to rain). CHAID predicted only 17 matches correctly (i.e., the predicted team actually won the match). Hence, the success ratio for the model is 48.5% (17/35 matches). CHAID results

Compare Model Performance

Using the two supervised learning techniques, we built models that can predict the outcome of WC 2019 matches before the actual match starts, and we compared the model results with the actual match results.

Hence, we uploaded the results of both models, RF and LR, to -> Compare Predict - RF vs. LR

colnames(ComparePredict)[colnames(ComparePredict) == 'Team.A.Win'] = 'RF Team.A.Win'
colnames(ComparePredict)[colnames(ComparePredict) == 'Team.A.Win.1'] = 'LR Team.A.Win'

colnames(ComparePredict)[colnames(ComparePredict) == 'Team.A.Score.1'] = 'Prob % RF Team.A.Win'
colnames(ComparePredict)[colnames(ComparePredict) == 'predict.logit'] = 'Prob % LR Team.A.Win'

In the same .csv file we also manually entered the actual match results.

Update Date (15/07/2019)

Random Forest predicted 23 matches correctly out of 34: 67.6% correct
Logistic Regression predicted 22 matches correctly out of 34: 64.7% correct

Note: Afghanistan's matches and matches abandoned due to rain are not included in the result score.

However, a few matches were very close calls in terms of the % probability of winning for each team.
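The percentages above are simply correct predictions divided by matches scored. A minimal sketch of that tally on toy rows, assuming a hypothetical `Actual.Team.A.Win` column for the manually entered results (the real column name in the .csv may differ), with `NA` marking an abandoned match:

```r
# Hypothetical rows; column names for RF and LR follow the renaming code above
ComparePredict = data.frame(
  `RF Team.A.Win`   = c(1, 0, 1, 1, 0),
  `LR Team.A.Win`   = c(1, 1, 1, 0, 0),
  Actual.Team.A.Win = c(1, 0, 0, 1, NA),  # NA = no result (e.g., rain)
  check.names = FALSE
)

# Score only matches that produced a result
scored = ComparePredict[!is.na(ComparePredict$Actual.Team.A.Win), ]

rf.correct = sum(scored$`RF Team.A.Win` == scored$Actual.Team.A.Win)
lr.correct = sum(scored$`LR Team.A.Win` == scored$Actual.Team.A.Win)
round(100 * c(RF = rf.correct, LR = lr.correct) / nrow(scored), 1)

# The reported figures follow the same arithmetic
round(100 * 23 / 34, 1)  # 67.6
round(100 * 22 / 34, 1)  # 64.7
```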

NOTE: Python code updated for the Neural Network technique to predict WC 2019 results.

FINAL RESULT: ENGLAND WON THE ICC WORLD CUP 2019 [Our prediction was a 74.12% probability of England winning WC 2019 and a 25.8% probability of New Zealand winning.] We would have been happier if our predicted probabilities had been closer to 50%, because the match went to a Super Over and both teams came very close to winning the trophy.

How to Improve Project Results

Work closely on overfitting in model building.
Build Model based on CART
Build Model based on LDA
Build Model based on Neural Network
Data Collection from various sources

Learnings:

First time working on a real-time Machine Learning project. It was interesting to choose data from previous matches and build Random Forest and Logistic Regression models.
Initially we tried to build CHAID; however, due to the (numerical) data we were not able to fit the model the way we wanted. Hence, we decided to create dummy variables for the categorical variables and build Random Forest (RF) and Logistic Regression (LR) models.
Our initial thought was that RF would not give good results, and hence we depended on LR. But we saw that in a few matches RF worked very well.
Converted probability results into binary (0 or 1) [Logistic Regression] based on Match Win using the ifelse() function. Simple!

LICENSE

This Project/Repository is licensed under the MIT license.

Acknowledgement

This Project/Repository is part of Great Learning - Cricket World Cup Challenge.
