saras152 / TaxiDataPipeLine

Project to create a simple data pipeline using Yellow Taxis trip data

TaxiDataPipeLine - Data Pipeline Project

Description

TaxiDataPipeLine is an application to demonstrate creation of a simple data pipeline using Yellow Taxis trip data.

Build Status

Build & Test
Windows x64 Build & Test
Linux x64 Build & Test

Folder Structure

  • docs - Project documentation
  • src - Python source code
  • test - Unit tests

Getting Started

Follow these instructions to get the source code and run it on your local machine.

Prerequisites

You need Python 3.7.3 (Official download link) to run this project.
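
As a quick sanity check before setting anything up, the sketch below verifies the interpreter version. The helper name `meets_minimum` and the 3.7 floor are assumptions made for illustration; the project only states that it needs Python 3.7.3.

```python
import sys

# Assumed floor for illustration; the README names Python 3.7.3.
MINIMUM = (3, 7)


def meets_minimum(version_info=sys.version_info, minimum=MINIMUM):
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


print("Python", sys.version.split()[0], "meets minimum:", meets_minimum())
```

Newer interpreters may well work, but they are not covered by the stated prerequisite.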

Clone repository

git clone https://github.com/write2sushma/TaxiDataPipeLine.git

Set-up development environment

Navigate to source folder

cd TaxiDataPipeLine

Create a virtual environment

On Linux

python3 -m venv env
source env/bin/activate

On Windows

python -m venv env
env\Scripts\activate

Install project dependencies

Project dependencies are listed in the requirements.txt file. Install them with:

pip3 install -r requirements.txt

If dask fails to install from requirements.txt, run the following commands in a command prompt or terminal window:

pip3 install "dask[complete]"

pip3 install dask distributed

How to run

Navigate to the TaxiDataPipeLine\taxidata folder and run data_processor.py:

python data_processor.py

How to Unit Test

Unit tests are written with Python's unittest library. Run them with either of the following commands:

pytest

or

python -m unittest test/test_data_processor.py
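
For readers new to unittest, here is a minimal sketch of the kind of test case either command would pick up. The function `average_trip_distance` is hypothetical and is not taken from data_processor.py; it exists only to give the test something to check.

```python
import unittest


def average_trip_distance(distances):
    """Hypothetical helper: mean of a list of trip distances (0.0 if empty)."""
    if not distances:
        return 0.0
    return sum(distances) / len(distances)


class AverageTripDistanceTest(unittest.TestCase):
    def test_mean_of_values(self):
        self.assertAlmostEqual(average_trip_distance([1.0, 2.0, 3.0]), 2.0)

    def test_empty_input(self):
        self.assertEqual(average_trip_distance([]), 0.0)
```

Both pytest and python -m unittest discover classes that subclass unittest.TestCase, so the same file works with either runner.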

How to check coverage

Run the command below to measure code coverage:

python -m coverage run test/test_data_processor.py

Then print a coverage summary in the terminal, or generate a report in HTML format:

coverage report
coverage html

Data Source

The pipeline uses the following data source URLs:

https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-01.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-02.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-03.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-04.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-05.csv
https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-06.csv
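
The six URLs above follow a single naming pattern, so they can be generated instead of hard-coded. The sketch below shows one way to do that; the names `BASE_URL` and `monthly_urls` are assumptions for illustration and may not match how data_processor.py builds its list.

```python
# Common prefix shared by the URLs listed above.
BASE_URL = "https://s3.amazonaws.com/nyc-tlc/trip+data"


def monthly_urls(year=2019, first_month=1, last_month=6):
    """Build yellow-taxi CSV URLs for an inclusive range of months."""
    return [
        f"{BASE_URL}/yellow_tripdata_{year}-{month:02d}.csv"
        for month in range(first_month, last_month + 1)
    ]


for url in monthly_urls():
    print(url)
```

Called with its defaults, the function reproduces the list above in the same order.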

Automated build setup

Azure DevOps Pipelines is used to set up and configure the automated build pipeline.

Future Enhancement Plan:

• Optimize performance by using the dask scheduler for faster parallel processing.
    - This is already implemented in the 'enhancements' feature branch.
• Scale the pipeline to datasets several times larger than a single machine can hold, using multi-node clusters in the cloud (e.g. AWS)
• Set up performance monitoring
• Automate deployment using Azure DevOps Pipelines

License:MIT License

