COVID-19 Data Analysis with PySpark

Overview

This project uses PySpark to perform data analysis on the [NeurIPS 2020 Data Science for COVID-19 (DS4C)] dataset obtained from Kaggle. The dataset contains valuable information about COVID-19 patients, such as patient demographics, infection details, and outcomes. In this repository, we demonstrate how to load and preprocess the data, as well as perform various analyses to gain insights.

Prerequisites

Before getting started, make sure you have the following software and datasets ready:

Python: You should have Python installed on your system. We recommend using Python 3.x.

Apache Spark and PySpark: Install Apache Spark and PySpark to work with big data efficiently. You can find installation instructions on the Apache Spark website.

COVID-19 Dataset: Download the dataset, named "PatientInfo.csv," from the Kaggle dataset page. Place this CSV file in your working directory.

Project Structure

Here's a brief overview of the project structure:

README.md: The documentation you are currently reading.

covid-19-image.jpg: An optional image for visual appeal (replace it with your project-related image). covid19_data_analysis.py: The Python script containing PySpark code for data analysis. Setup To start the analysis, follow these steps:

Install Dependencies: Make sure you have all the necessary dependencies installed. This includes Python, Apache Spark, and PySpark.

Dataset: Download the "PatientInfo.csv" dataset and place it in the project's root directory.

Run the Script: Open your terminal or command prompt, navigate to the project directory, and execute the PySpark script using the following command:

bash Copy code spark-submit covid19_data_analysis.py This script will load the dataset, perform data preprocessing, and execute various analyses, displaying the results in your terminal.

Data Analysis In this project, we perform the following data analysis tasks:

Display Schema: We show the structure of the dataset, including column names and data types.

Summary Statistics: We provide summary statistics for the dataset, giving you an overview of key metrics.

Survival Analysis: We analyze the survival status of patients (released, isolated, or deceased).

Null Value Analysis: We identify the number of null values in each column.

Data Preprocessing: We demonstrate techniques such as filling null values, calculating time differences, and creating new columns based on specific criteria.

Results The results of each analysis are displayed in the terminal as you run the script. You will see information about the dataset's structure, summary statistics, survival analysis, and null value counts.

Abdullah28-gheyad / Practical-Session-Spark-SQL-

COVID-19 Data Analysis with PySpark

Overview

Prerequisites

Project Structure

README.md: The documentation you are currently reading.

About

Languages