Car-Price-Prediction

Access Interactive Web Blog: PySpark Vehicles.

MIT License

This project is a work in progress.

This project uses PySpark to load millions of used-car sale records from across the United States (>500 MB) and process them through an ETL pipeline to understand the predictors of listed price.

The project focuses on data engineering: the goal is to develop a safe and reliable ETL pipeline, built on PySpark, that can be deployed for machine learning tasks. It is divided into three stages:

1. Problem Definition

Identify the Problem Features, Aims and Variables.
Configure Tools and Packages required.
Determine Input and Output formats.
Installation Setup:
- Environment Configuration
- Python Packages
- Apache Spark
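
As a rough illustration of the installation checks listed above, the following sketch reports which prerequisites appear to be present. The function name `check_environment` and the default package list are assumptions made for this example, not part of the project; the Java check is included because Spark runs on the JVM.

```python
import importlib.util
import shutil
import sys


def check_environment(packages=("pyspark", "pandas", "matplotlib", "seaborn")):
    """Report which prerequisites appear to be available on this machine."""
    report = {
        # Python version as a (major, minor) tuple
        "python": sys.version_info[:2],
        # True if a `java` executable is on PATH (Spark runs on the JVM)
        "java_on_path": shutil.which("java") is not None,
    }
    for name in packages:
        # find_spec returns None when a top-level package is not installed
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

Running the script prints one line per check, which makes it easy to spot a missing package before starting a Spark session.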

2. ETL (Extract, Transform, Load)

Various Stages of the ETL Pipeline:

Extract:
- Data Collection (Kaggle)
- Data Validation
- Data Cleaning
- Caching Data on S3
Transform:
- Cleaning Data for Project Specifications
- Feature Engineering
- Sampling Data
- Exploratory Data Analysis using Pandas, Matplotlib and Seaborn
- Data Visualization
- Caching Data on S3
Load:
- Data Preprocessing for Learning Model

3. Machine Learning

Model Selection.
Feature & Target Preparation.
Model Deployment.

Contents covered in this notebook include:

Environment configuration: Jupyter Notebook, UNIX, Python and PySpark.
Management of a Spark Session.
Data Collection, Cleaning and Transformation.
Data Analysis using PySpark DataFrames.
EDA using pandas DataFrames.
SQL queries with SparkSQL.
Visualization with Matplotlib.

Dataset

The dataset used for this project can be found through the following link:

https://www.kaggle.com/austinreese/craigslist-carstrucks-data

About the Dataset

The dataset contains over a million and a half unique car postings made on Craigslist between September and November 2018. It includes all relevant information on each car sale, with columns such as price, condition, manufacturer, latitude/longitude, and 16 other categories.

Contributors - Devesh Sharma

For any questions, please contact me at devsharma.work@gmail.com.
