Big Data Analytics - Group Project

IPL dataset

Authors:

Suhas CV
Mihir Manjrekar
Gautam Gadipudi

The Raw Data of the project can be found in Data directory.

The Source code is found src directory.

The following external python libraries were used

Installation: execute the codes in the following order to have dataset installed on your local Mongo DB server:

python3 src/load_player.py

python3 src/load_matches.py

python3 src/update_players.py

python3 src/get_partnerships.py

python3 src/extract_score.py

python3 src/runs_per_match.py

Analysis:

Clustering:

We are going to cluster based on the following attributes per match / game:

runs
strike_rate
sixes
fours

Firstly, normalize the data. Run the below script to get the normalized data into a collection named clustering_data:

python3 src/normalize_batting_stats.py

Then, run the following to get a plot of k (number of clusters) vs SSE (Sum of Squared Errors):

python3 src/Clustering/plot_sse_k.py

Then, select a particular k at the elbow of the plot (saved under ./Visualizations/Clustering) and run the following:

python3 src/analysis/Clustering/cluster.py <k-value> <iteration-limit>

Association:

We are going to perform item set mining (apriori) on batting partnerships which has following attributes:

partners(partner-1 ,partner-2)
venue
total runs

Initially, we will filter the partnerships with total runs > 30. The algorithm works sligthly different from usual apriori, here we have fixed number of items in a partnership(transaction), thus the algorithm stops at level 3. The minimum support is set to 15.

At Level 1 we will get the the individual players/venues involved in atleast 15 partnerships(30+ run).
At Level 2 we will have the player-venue/player1-player2 involved in atleast 15 partnerships(30+ run).
At Level 3 we will have the most frequent partnerships(partner1,partner2,venue).

python3 src/analysis/Association/partnershipVenueMining.py

suhascv / IPL_Analytics