Access Notebook: PySpark Vehicles.
Access Interactive Web Blog: PySpark Vehicles.
MIT License
Copyright (c) 2019 Devesh Sharma
This project is a work in progress.
This project uses PySpark to load millions of used-car sale records from across the United States (>500 MB) and processes them through an ETL pipeline to identify predictors of listed price.
The project is focused on data engineering: the goal is to develop a safe and reliable ETL pipeline, built on PySpark, that can be deployed for machine learning tasks. It is organized around the three ETL stages (Extract, Transform, and Load), preceded by these planning steps:
- Identify the Problem Features, Aims and Variables.
- Configure Tools and Packages required.
- Determine Input and Output formats.
- Installation Setup:
- Environment Configuration
- Python Packages
- Apache Spark
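Before the pipeline runs, the environment setup above can be sanity-checked. The helper below is a minimal plain-Python sketch (not part of the notebook): it verifies the Python version and reports which of the assumed packages are importable.

```python
import sys
from importlib import util

def check_environment(min_python=(3, 6),
                      packages=("pyspark", "pandas", "matplotlib", "seaborn")):
    """Report whether the environment meets the notebook's assumed requirements.

    Returns a dict mapping "python" and each package name to True/False.
    The minimum version and package list here are illustrative assumptions.
    """
    status = {"python": sys.version_info[:2] >= min_python}
    for name in packages:
        # find_spec() checks importability without actually importing the package
        status[name] = util.find_spec(name) is not None
    return status

print(check_environment())
```

Running this before launching a Spark session gives an early, readable failure instead of an import error mid-pipeline.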
Various Stages of the ETL Pipeline:
- Extract:
- Data Collection (Kaggle)
- Data Validation
- Data Cleaning
- Caching Data on S3
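The validation and cleaning steps above can be sketched as row-level rules. This is a plain-Python illustration, not the notebook's actual logic; in the pipeline the same rules would be expressed as PySpark DataFrame filters. The column names (`price`, `year`) and the bounds used are assumptions based on the dataset description.

```python
def is_valid_listing(row):
    """Keep rows with a positive, plausible price and a parseable model year."""
    try:
        price = float(row.get("price", ""))
        year = int(row.get("year", ""))
    except (TypeError, ValueError):
        return False  # unparseable fields fail validation
    return 0 < price < 500_000 and 1950 <= year <= 2019

raw_rows = [
    {"price": "12500", "year": "2014"},   # valid
    {"price": "0", "year": "2010"},       # zero price: dropped
    {"price": "8900", "year": "abc"},     # unparseable year: dropped
]
clean_rows = [r for r in raw_rows if is_valid_listing(r)]
```

On the Spark side, the equivalent step would be a `.filter(...)` over the full DataFrame rather than a Python list comprehension.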
- Transform:
- Cleaning Data for Project Specifications
- Feature Engineering
- Sampling Data
- Exploratory Data Analysis using Pandas, Matplotlib and Seaborn
- Data Visualization
- Caching Data on S3
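A feature-engineering step from the Transform stage might look like the following plain-Python sketch. The `age` and `log_price` features are hypothetical illustrations, not the notebook's actual feature set, and the posting year of 2018 is taken from the dataset description.

```python
import math

POSTING_YEAR = 2018  # listings were collected September-November 2018

def engineer_features(row):
    """Derive example features from a cleaned listing."""
    out = dict(row)
    out["age"] = POSTING_YEAR - row["year"]    # vehicle age at listing time
    out["log_price"] = math.log(row["price"])  # tames the long right tail of prices
    return out

example = engineer_features({"year": 2012, "price": 10000})
```

In the pipeline itself, steps like these map onto `withColumn(...)` calls on the Spark DataFrame.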
- Load:
- Data Preprocessing for Learning Model
- Model Selection.
- Feature & Target Preparation.
- Model Deployment.
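The feature and target preparation step above can be sketched as splitting cleaned rows into a feature matrix and a target vector. This is a minimal plain-Python stand-in; in a Spark ML pipeline this role is typically played by a `VectorAssembler`. The feature names used here are illustrative.

```python
def prepare_xy(rows, feature_names, target="price"):
    """Split rows into a feature matrix X and a target vector y."""
    X = [[row[name] for name in feature_names] for row in rows]
    y = [row[target] for row in rows]
    return X, y

rows = [
    {"age": 4, "odometer": 45000, "price": 12500},
    {"age": 9, "odometer": 110000, "price": 6200},
]
X, y = prepare_xy(rows, ["age", "odometer"])
```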
Contents covered in this notebook include:
- Environment configuration: Jupyter Notebook, UNIX, Python and PySpark.
- Management of a Spark Session.
- Data Collection, Cleaning and Transformation.
- Data Analysis using PySpark DataFrames.
- EDA using pandas DataFrames.
- SQL queries with SparkSQL.
- Visualization with Matplotlib.
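The SparkSQL pattern covered in the notebook is: register a DataFrame as a temporary view, then query it with plain SQL. The example below demonstrates such a query with the stdlib `sqlite3` module so it runs standalone; the table and column names are assumptions, and in the notebook the same SQL would go through `df.createOrReplaceTempView("vehicles")` and `spark.sql(...)` instead.

```python
import sqlite3

# In-memory stand-in for the registered "vehicles" view
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vehicles (manufacturer TEXT, price REAL)")
conn.executemany(
    "INSERT INTO vehicles VALUES (?, ?)",
    [("ford", 9000.0), ("ford", 11000.0), ("honda", 8000.0)],
)

# Average listed price per manufacturer, highest first
avg_by_maker = conn.execute(
    "SELECT manufacturer, AVG(price) FROM vehicles "
    "GROUP BY manufacturer ORDER BY AVG(price) DESC"
).fetchall()
# avg_by_maker → [('ford', 10000.0), ('honda', 8000.0)]
```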
The dataset used for this project can be found through the following link:
Contains over 1.5 million unique car postings made between September and November 2018 on Craigslist.
Contains all relevant information on each sale, including columns such as price, condition, manufacturer, latitude/longitude, and 16 other categories.
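To make the dataset's shape concrete, the snippet below parses a tiny in-memory sample that mimics its layout. The sample values and the exact column subset are illustrative; the real file has roughly 20 columns and 1.5 million rows, and in the pipeline it is read with Spark's CSV reader rather than the stdlib `csv` module used here.

```python
import csv
import io

# Tiny fabricated sample mimicking the dataset's CSV layout
sample = io.StringIO(
    "price,condition,manufacturer,lat,long\n"
    "12500,good,ford,41.88,-87.63\n"
)
rows = list(csv.DictReader(sample))
```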
Contributors - Devesh Sharma
For any questions, please contact me at devsharma.work@gmail.com.