Kritagya-web / IPL-Data-Analysis-Using-Apache-Spark

About: This project focuses on performing an end-to-end analysis of IPL data using Apache Spark on Databricks. It begins with setting up a Databricks environment, followed by ingesting and exploring the IPL dataset.

Home Page: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/19652298897236/1310443996444177/4655662666255799/latest.html

Repository from GitHub: https://github.com/Kritagya-web/IPL-Data-Analysis-Using-Apache-Spark

IPL Data Analysis Using Apache Spark on Databricks


Untitled Diagram drawio (5)

This project focuses on performing an end-to-end analysis of IPL data using Apache Spark on Databricks. It begins with setting up a Databricks environment, followed by ingesting and exploring the IPL dataset. The project also involves optimizing Spark queries to handle large datasets efficiently, leveraging Databricks’ capabilities for distributed computing. Finally, the results are visualized using Databricks notebooks and integrated tools, creating interactive dashboards or reports. These visualizations are intended to provide stakeholders with actionable insights.

Steps Involved:

1. Cleaning Data

  • Handling missing values.
  • Changing data types as needed.
  • Aggregating total and average runs scored in each match and innings.
  • Filtering to include only valid deliveries.
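
The cleaning steps above can be sketched as a single SQL query. This is a minimal illustration, not the project's actual notebook code: the table name `ball_by_ball`, its columns, and the sample rows are all hypothetical, and treating a missing run value as 0 is only one possible policy. The query sticks to standard SQL so it can be tried locally with Python's built-in `sqlite3`; on Databricks the same text could be passed to `spark.sql(...)` once the data is registered as a temporary view.

```python
import sqlite3


def query_tables(tables, sql):
    """Load {name: (columns, rows)} into in-memory SQLite and run one query."""
    conn = sqlite3.connect(":memory:")
    for name, (columns, rows) in tables.items():
        conn.execute(f"CREATE TABLE {name} ({', '.join(columns)})")
        marks = ", ".join("?" for _ in columns)
        conn.executemany(f"INSERT INTO {name} VALUES ({marks})", rows)
    result = conn.execute(sql).fetchall()
    conn.close()
    return result


# Hypothetical schema and rows; runs arrive as text to mimic a raw CSV load.
BALL_COLUMNS = ["match_id", "inning", "runs_scored", "wides", "noballs"]
BALL_ROWS = [
    (1, 1, "4", 0, 0),
    (1, 1, "1", 1, 0),   # wide: not a valid delivery
    (1, 1, None, 0, 0),  # missing value
    (1, 2, "6", 0, 0),
    (1, 2, "2", 0, 1),   # no-ball: not a valid delivery
    (1, 2, "0", 0, 0),
]

CLEAN_AND_AGGREGATE = """
SELECT match_id,
       inning,
       -- fill missing values, then change the type from text to integer
       SUM(CAST(COALESCE(runs_scored, '0') AS INTEGER)) AS total_runs,
       AVG(CAST(COALESCE(runs_scored, '0') AS INTEGER)) AS avg_runs
FROM ball_by_ball
WHERE wides = 0 AND noballs = 0   -- keep only valid deliveries
GROUP BY match_id, inning
ORDER BY match_id, inning
"""

cleaned = query_tables({"ball_by_ball": (BALL_COLUMNS, BALL_ROWS)}, CLEAN_AND_AGGREGATE)
for row in cleaned:
    print(row)
```

Filtering before aggregating keeps wides and no-balls out of both the total and the average.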

2. Data Analysis

  • Using Apache Spark to perform comprehensive data analysis, leveraging Databricks for efficient data processing.
  • Analyzing key metrics such as player performance, match statistics, and team trends over different IPL seasons.
  • Employing Spark's capabilities to calculate additional metrics like average scores, win rates, and player consistency across seasons.
  • Using Databricks’ integrated tools to visualize the data, making complex patterns easier to interpret.
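
As one example of the metrics listed above, a per-season win rate could be computed along the following lines. This is a sketch under stated assumptions: the `matches` table, its columns (`season`, `team1`, `team2`, `winner`), and the placeholder team names are hypothetical and do not come from the project's dataset. The SQL is kept standard so it runs with Python's built-in `sqlite3`; on Databricks it could be run through `spark.sql(...)` against a temporary view.

```python
import sqlite3


def query_tables(tables, sql):
    """Load {name: (columns, rows)} into in-memory SQLite and run one query."""
    conn = sqlite3.connect(":memory:")
    for name, (columns, rows) in tables.items():
        conn.execute(f"CREATE TABLE {name} ({', '.join(columns)})")
        marks = ", ".join("?" for _ in columns)
        conn.executemany(f"INSERT INTO {name} VALUES ({marks})", rows)
    result = conn.execute(sql).fetchall()
    conn.close()
    return result


# Hypothetical match results with placeholder team names.
MATCH_COLUMNS = ["season", "team1", "team2", "winner"]
MATCH_ROWS = [
    (2008, "Team A", "Team B", "Team A"),
    (2008, "Team A", "Team C", "Team A"),
    (2008, "Team B", "Team C", "Team B"),
    (2009, "Team A", "Team B", "Team B"),
]

WIN_RATE_BY_SEASON = """
WITH appearances AS (
    -- one row per team per match, so both sides of a fixture are counted
    SELECT season, team1 AS team, winner FROM matches
    UNION ALL
    SELECT season, team2 AS team, winner FROM matches
)
SELECT season,
       team,
       COUNT(*) AS played,
       SUM(CASE WHEN winner = team THEN 1 ELSE 0 END) AS wins,
       ROUND(100.0 * SUM(CASE WHEN winner = team THEN 1 ELSE 0 END) / COUNT(*), 1)
           AS win_rate_pct
FROM appearances
GROUP BY season, team
ORDER BY season, win_rate_pct DESC, team
"""

win_rates = query_tables({"matches": (MATCH_COLUMNS, MATCH_ROWS)}, WIN_RATE_BY_SEASON)
for row in win_rates:
    print(row)
```

Comparing a team's win rate from one season to the next is one simple way to look at consistency.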

3. Insights:

Team Performance After Winning the Toss:
  • Chennai Super Kings (CSK) has the highest number of wins after winning the toss, followed by Mumbai Indians and Kolkata Knight Riders.
  • Overall, winning the toss appears to be associated with winning the match, though some teams convert this advantage more often than others.
Average Runs Scored by Batsmen in Winning Matches:
  • Rashid Khan stands out with the highest average runs scored in matches that his team won, significantly ahead of others.
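
An average of this kind could be computed roughly as follows. The sketch is illustrative only and may differ from the query the project actually used: the `deliveries` and `matches` tables, their columns, and the placeholder player and team names are assumptions. It totals each batter's runs per match first, averages those totals over matches the batter's team won, and also reports the number of innings so that averages built on very few matches are easy to spot.

```python
import sqlite3


def query_tables(tables, sql):
    """Load {name: (columns, rows)} into in-memory SQLite and run one query."""
    conn = sqlite3.connect(":memory:")
    for name, (columns, rows) in tables.items():
        conn.execute(f"CREATE TABLE {name} ({', '.join(columns)})")
        marks = ", ".join("?" for _ in columns)
        conn.executemany(f"INSERT INTO {name} VALUES ({marks})", rows)
    result = conn.execute(sql).fetchall()
    conn.close()
    return result


# Hypothetical data: one row per ball faced, plus one row per match result.
TABLES = {
    "deliveries": (
        ["match_id", "batter", "batting_team", "runs_scored"],
        [
            (1, "Player X", "Team A", 4),
            (1, "Player X", "Team A", 6),
            (1, "Player Y", "Team B", 1),
            (2, "Player X", "Team A", 2),
            (2, "Player Y", "Team B", 3),
            (2, "Player Y", "Team B", 6),
            (3, "Player X", "Team A", 6),
            (3, "Player X", "Team A", 0),
        ],
    ),
    "matches": (
        ["match_id", "winner"],
        [(1, "Team A"), (2, "Team B"), (3, "Team A")],
    ),
}

AVG_RUNS_IN_WINS = """
WITH innings AS (
    -- runs per batter per match, restricted to matches the batter's team won
    SELECT d.match_id, d.batter, SUM(d.runs_scored) AS runs
    FROM deliveries d
    JOIN matches m ON m.match_id = d.match_id
    WHERE m.winner = d.batting_team
    GROUP BY d.match_id, d.batter
)
SELECT batter,
       COUNT(*)  AS innings_in_wins,
       AVG(runs) AS avg_runs_in_wins
FROM innings
GROUP BY batter
ORDER BY avg_runs_in_wins DESC, batter
"""

avg_in_wins = query_tables(TABLES, AVG_RUNS_IN_WINS)
for row in avg_in_wins:
    print(row)
```

In this toy data the batter with a single winning innings ranks first, which shows why the innings count is worth reading alongside the average.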
Top Venues for High Scores:
  • OUTsurance Oval and Buffalo Park are the venues with the highest average scores, suggesting they may be favorable for batting.
  • Other high-scoring venues include Sheikh Zayed Stadium and Subrata Roy Sahara Stadium.
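
A venue ranking like this could be derived by totalling runs per innings and then averaging those totals per venue. The following is a minimal sketch, assuming hypothetical `deliveries` and `matches` tables with made-up venue names; the project's real schema and query may differ.

```python
import sqlite3


def query_tables(tables, sql):
    """Load {name: (columns, rows)} into in-memory SQLite and run one query."""
    conn = sqlite3.connect(":memory:")
    for name, (columns, rows) in tables.items():
        conn.execute(f"CREATE TABLE {name} ({', '.join(columns)})")
        marks = ", ".join("?" for _ in columns)
        conn.executemany(f"INSERT INTO {name} VALUES ({marks})", rows)
    result = conn.execute(sql).fetchall()
    conn.close()
    return result


# Hypothetical data with placeholder venue names.
TABLES = {
    "matches": (
        ["match_id", "venue"],
        [(1, "Ground P"), (2, "Ground Q")],
    ),
    "deliveries": (
        ["match_id", "inning", "runs_scored"],
        [
            (1, 1, 4), (1, 1, 6),  # match 1, first innings: 10 runs
            (1, 2, 2), (1, 2, 2),  # match 1, second innings: 4 runs
            (2, 1, 1), (2, 1, 1),  # match 2, first innings: 2 runs
            (2, 2, 6),             # match 2, second innings: 6 runs
        ],
    ),
}

AVG_SCORE_BY_VENUE = """
WITH innings_totals AS (
    SELECT m.venue, d.match_id, d.inning, SUM(d.runs_scored) AS total_runs
    FROM deliveries d
    JOIN matches m ON m.match_id = d.match_id
    GROUP BY m.venue, d.match_id, d.inning
)
SELECT venue,
       COUNT(*)        AS innings_played,
       AVG(total_runs) AS avg_score
FROM innings_totals
GROUP BY venue
ORDER BY avg_score DESC, venue
"""

venue_scores = query_tables(TABLES, AVG_SCORE_BY_VENUE)
for row in venue_scores:
    print(row)
```

Keeping `innings_played` in the output makes it easier to judge whether a high average rests on many matches or only a few.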
Most Frequent Dismissal Types:
  • "Caught" is the most common conventional dismissal, followed by "Bowled" and "Run Out."
  • Dismissal types like "Obstructing the field" and "Retired hurt" are the least frequent.
Team Performance After Winning the Toss (Top and Lower Performers):
Top Performers:
  • Chennai Super Kings and Mumbai Indians have the highest number of wins after winning the toss, suggesting that these teams make good use of the toss advantage.
  • Kolkata Knight Riders and Royal Challengers Bangalore also have a significant number of wins after winning the toss.
Lower Performers:
  • Teams like Pune Warriors and Kochi Tuskers Kerala have the fewest wins after winning the toss, suggesting that winning the toss has not been as beneficial for them.
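
The toss comparison can be sketched as a simple grouped count. Raw win counts favor teams that have played more matches, so this hedged example also computes the share of toss wins that were converted into match wins. The `matches` table, its columns, and the placeholder team names are assumptions for illustration and are not taken from the project.

```python
import sqlite3


def query_tables(tables, sql):
    """Load {name: (columns, rows)} into in-memory SQLite and run one query."""
    conn = sqlite3.connect(":memory:")
    for name, (columns, rows) in tables.items():
        conn.execute(f"CREATE TABLE {name} ({', '.join(columns)})")
        marks = ", ".join("?" for _ in columns)
        conn.executemany(f"INSERT INTO {name} VALUES ({marks})", rows)
    result = conn.execute(sql).fetchall()
    conn.close()
    return result


# Hypothetical toss and match outcomes.
MATCH_COLUMNS = ["match_id", "toss_winner", "winner"]
MATCH_ROWS = [
    (1, "Team A", "Team A"),
    (2, "Team A", "Team B"),
    (3, "Team A", "Team A"),
    (4, "Team B", "Team B"),
    (5, "Team C", "Team A"),
]

WINS_AFTER_TOSS = """
SELECT toss_winner AS team,
       COUNT(*) AS tosses_won,
       SUM(CASE WHEN winner = toss_winner THEN 1 ELSE 0 END) AS wins_after_toss,
       ROUND(100.0 * SUM(CASE WHEN winner = toss_winner THEN 1 ELSE 0 END) / COUNT(*), 1)
           AS win_pct_after_toss
FROM matches
GROUP BY toss_winner
ORDER BY wins_after_toss DESC, team
"""

toss_stats = query_tables({"matches": (MATCH_COLUMNS, MATCH_ROWS)}, WINS_AFTER_TOSS)
for row in toss_stats:
    print(row)
```

Sorting by `win_pct_after_toss` instead of `wins_after_toss` gives a ranking that is less sensitive to how many seasons a team has played.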

Snapshots:

Screenshot 2024-08-08 180103

Screenshot 2024-08-08 180123

Screenshot 2024-08-08 180739


Credits: Darshil Parmar
