Technocolabs100 / Car-Price-Prediction

Jupyter Notebook processing online car sales records using PySpark (SparkSQL and DataFrames)

Car-Price-Prediction

Access Notebook: PySpark Vehicles.

Access Interactive Web Blog: PySpark Vehicles.

MIT License

Copyright (c) 2019 Devesh Sharma

This project is a work in progress.

This project uses PySpark to load millions of used-car sale records from across the United States (>500 MB) and runs them through an ETL pipeline to understand the predictors of listed price.

The project focuses on data engineering: the goal is to develop a safe and reliable ETL pipeline, built on PySpark, that can be deployed for machine learning tasks. It is divided into three stages:

1. Problem Definition

  • Identify the Problem Features, Aims and Variables.
  • Configure Tools and Packages required.
  • Determine Input and Output formats.
  • Installation Setup:
    • Environment Configuration
    • Python Packages
    • Apache Spark

2. ETL (Extract, Transform, Load)

Various Stages of the ETL Pipeline:

  • Extract:
    • Data Collection (Kaggle)
    • Data Validation
    • Data Cleaning
    • Caching Data on S3
  • Transform:
    • Cleaning Data for Project Specifications
    • Feature Engineering
    • Sampling Data
    • Exploratory Data Analysis using Pandas, Matplotlib and Seaborn
    • Data Visualization
    • Caching Data on S3
  • Load:
    • Data Preprocessing for Learning Model

3. Machine Learning

  • Model Selection.
  • Feature & Target Preparation.
  • Model Deployment.

Contents covered in this notebook include:

  • Environment configuration: Jupyter Notebook, UNIX, Python and PySpark.
  • Management of a Spark Session.
  • Data Collection, Cleaning and Transformation.
  • Data Analysis using PySpark DataFrames.
  • EDA using Pandas DataFrames.
  • SQL queries with SparkSQL.
  • Visualization with Matplotlib.

Dataset

The dataset used for this project can be found through the following link:

About the Dataset

The dataset contains over 1.5 million unique car postings made on Craigslist between September and November 2018. It includes all relevant information on each car sale, with columns such as price, condition, manufacturer, latitude/longitude, and 16 other categories.


Contributors - Devesh Sharma

For any questions, please contact me - devsharma.work@gmail.com.


Languages

Jupyter Notebook 99.5%, Shell 0.5%