nahumsa / nyc-yellow-tripdata

MLOps project for predicting trip duration using NYC taxi data.

Predicting the duration of NYC yellow taxi trips

The main goal of this project is to develop MLOps skills: experiment tracking, workflow orchestration, a model registry, and deployment, along with unit tests, integration tests, and CI/CD. This is demonstrated by building and serving a model that predicts trip duration from the yellow taxi pickup and dropoff locations, plus additional features.

Problem statement

We are constantly concerned with time, especially when commuting. It is common to take a taxi to commute or simply to get somewhere, and, as stated before, we want to know how long the trip will take. In this project, I use the data provided by the NYC Yellow Taxi Trip Records to build a model that predicts the duration of a trip given its pickup and dropoff locations.

How to run the code

To run the code, you first need to install pipenv; after that you can use the Makefile. To prepare the project environment, run:

make setup

This will install the dependencies using pipenv and set up the pre-commit hooks.

Makefile

Each subsection below describes one of the Makefile targets.

start_mlflow

This will start the local MLflow server at http://127.0.0.1:5000. It is needed to run the training scripts and to serve the deployed model.
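
The training and serving code then talks to this server through the tracking URI. A minimal sketch of how a script would point at it (the experiment name is an assumption, not necessarily the one used in this repo):

    import mlflow

    mlflow.set_tracking_uri("http://127.0.0.1:5000")  # server started by make start_mlflow
    mlflow.set_experiment("nyc-yellow-tripdata")      # hypothetical experiment name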

train_lr

This will train the linear regression model on the first month of 2022, validate it on the following month, and log the results to MLflow.
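
Conceptually, this target fits a scikit-learn pipeline on the January 2022 trips, validates on February, and logs everything to MLflow. A minimal sketch of the idea (the file paths, feature set, and outlier filter are assumptions, not the exact code in models/):

    import mlflow
    import mlflow.sklearn
    import pandas as pd
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_absolute_error
    from sklearn.pipeline import make_pipeline

    mlflow.set_tracking_uri("http://127.0.0.1:5000")

    def load(path):
        # Compute the target: trip duration in seconds
        df = pd.read_parquet(path)
        df["duration"] = (df.tpep_dropoff_datetime - df.tpep_pickup_datetime).dt.total_seconds()
        return df[(df.duration > 0) & (df.duration < 3 * 3600)]  # assumed outlier filter

    def to_records(df):
        # DictVectorizer one-hot encodes the location IDs when they are strings
        return df[["PULocationID", "DOLocationID"]].astype(str).to_dict(orient="records")

    train = load("data/yellow_tripdata_2022-01.parquet")  # assumed local paths
    valid = load("data/yellow_tripdata_2022-02.parquet")

    with mlflow.start_run():
        pipeline = make_pipeline(DictVectorizer(), LinearRegression())
        pipeline.fit(to_records(train), train.duration)
        preds = pipeline.predict(to_records(valid))
        mlflow.log_metric("val_mae", mean_absolute_error(valid.duration, preds))
        mlflow.sklearn.log_model(pipeline, artifact_path="model")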

register_best_model

This will register the best model under the name nyc-yellow-taxi in the MLflow model registry, with model_stage=None.
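
In MLflow terms, this amounts to searching the experiment's runs for the lowest validation error and registering that run's model. A hedged sketch (the experiment and metric names are assumptions):

    import mlflow
    from mlflow.tracking import MlflowClient

    mlflow.set_tracking_uri("http://127.0.0.1:5000")
    client = MlflowClient()

    experiment = client.get_experiment_by_name("nyc-yellow-tripdata")  # assumed experiment name
    best_run = client.search_runs(
        experiment_ids=[experiment.experiment_id],
        order_by=["metrics.val_mae ASC"],  # assumed metric name
        max_results=1,
    )[0]

    # Newly registered versions start with stage None
    mlflow.register_model(f"runs:/{best_run.info.run_id}/model", name="nyc-yellow-taxi")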

run_deployed_locally

This will start the Flask server locally on port 8000. The server exposes a /predict endpoint that receives a JSON payload with PULocationID and DOLocationID. You need to run start_mlflow so that the model is available to the endpoint, and you must first train and register a model by running train_lr and register_best_model.
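
The deploy/ application is essentially a small Flask app that loads the registered model and predicts on the incoming JSON. A minimal sketch of that shape (the model name and version mirror the sample response shown later; the exact code in deploy/ may differ):

    import mlflow
    import mlflow.sklearn
    from flask import Flask, jsonify, request

    mlflow.set_tracking_uri("http://127.0.0.1:5000")
    MODEL_NAME, MODEL_VERSION = "nyc-yellow-trip", "2"  # as in the sample response below
    model = mlflow.sklearn.load_model(f"models:/{MODEL_NAME}/{MODEL_VERSION}")

    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])
    def predict():
        ride = request.get_json()
        # Assumes the DictVectorizer pipeline sketched in the train_lr section
        features = {
            "PULocationID": str(ride["PULocationID"]),
            "DOLocationID": str(ride["DOLocationID"]),
        }
        duration = float(model.predict([features])[0])
        return jsonify(
            {"duration": duration, "model_name": MODEL_NAME, "model_version": MODEL_VERSION}
        )

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=8000)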

test_deploy

To run this command you first have to do two things (the request it sends is sketched after this list):

  • Run the run_deployed_locally command to spin up the local prediction endpoint.
  • Run start_mlflow so that the model can be loaded.
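
With both of those running, the test itself boils down to a POST against the endpoint, along these lines (the location IDs are illustrative):

    import requests

    ride = {"PULocationID": 100, "DOLocationID": 102}  # illustrative location IDs
    response = requests.post("http://127.0.0.1:8000/predict", json=ride)
    print(response.json())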

prefect_start

This will start Prefect Orion on port 4200.

prefect_deploy

This will deploy the Prefect flow for training the linear regression model.

prefect_run

This will run the Prefect flow for training the linear regression model.
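
In Prefect 2 (Orion), such a flow is just Python functions decorated as tasks and composed in a flow; the deployment created by prefect_deploy points at it, and prefect_run triggers it. A hedged sketch of the structure (names and paths are assumptions, not the repo's exact flow):

    import pandas as pd
    from prefect import flow, task

    @task(retries=2)
    def load_data(path: str) -> pd.DataFrame:
        # Read one month of yellow taxi trips
        return pd.read_parquet(path)

    @task
    def train(train_df: pd.DataFrame, valid_df: pd.DataFrame) -> None:
        # Fit the linear regression and log to MLflow,
        # as in the train_lr target (details elided here)
        ...

    @flow(name="train-linear-regression")  # hypothetical flow name
    def training_flow(train_path: str, valid_path: str) -> None:
        train_df = load_data(train_path)
        valid_df = load_data(valid_path)
        train(train_df, valid_df)

    if __name__ == "__main__":
        training_flow(
            "data/yellow_tripdata_2022-01.parquet",
            "data/yellow_tripdata_2022-02.parquet",
        )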

Sample workflow for running this project

First, install pipenv

pip install pipenv

Then set up the local environment by installing the dependencies and the pre-commit hooks:

make setup

Start the MLflow server (from here on, each long-running command needs its own terminal):

make start_mlflow

Train the linear regression model:

make train_lr

Register the linear regression model (this registers the model with the best validation mean absolute error):

make register_best_model

This will register the model with stage=None in the model registry.

Create the local deployment by starting the flask server:

make run_deployed_locally

Alternatively, you can run the docker-compose command:

docker-compose up -d --build

This will build the image and start the Flask server on port 8000.

In another terminal, you can test the local deployment:

make test_deploy

This should return a JSON response like: {'duration': 685.0228426897559, 'model_name': 'nyc-yellow-trip', 'model_version': '2'}

Caveat for the Dockerfile

In the deployment Dockerfile, I copy the artifacts folder created by MLflow into the image, because the model has to be loaded inside the Docker container. This could easily be avoided by using an S3 bucket as the artifact store; sadly, however, I do not have the resources to use any cloud infrastructure.

An improvement would be to use LocalStack to mock the S3 bucket that stores the MLflow artifacts.
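
For reference, pointing an S3 client at LocalStack mostly means overriding the endpoint URL; MLflow would then be started with the bucket as its artifact root. A sketch of the idea (the bucket name is hypothetical; 4566 is LocalStack's default edge port):

    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="http://localhost:4566",  # LocalStack edge endpoint
        aws_access_key_id="test",              # LocalStack accepts dummy credentials
        aws_secret_access_key="test",
        region_name="us-east-1",
    )
    s3.create_bucket(Bucket="mlflow-artifacts")  # hypothetical artifact bucket

MLflow would then point at it via --default-artifact-root s3://mlflow-artifacts, with MLFLOW_S3_ENDPOINT_URL=http://localhost:4566 set in the environment.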

Folder structure

In this project, there are the following folders:

  • etl: Responsible for data preprocessing and loading.
  • deploy: The Flask application that uses the model to make predictions.
  • data: Where the data is stored.
  • models: Where the following models are trained and evaluated:
    • Linear Regression
    • Random Forest
  • register: Where the model is registered to the MLflow model registry.
  • notebooks: Exploratory notebooks used before generating the modules.
  • orchestration/prefect: The Prefect orchestration for training the linear regression model, including the deployment of the training flow.
  • scripts: Scripts for CI and for initializing MLflow.
  • tests: Tests for the other modules.

MLOps Zoomcamp: Peer Review Criteria

  • Problem description

    • 0 points: Problem is not described

    • 1 point: Problem is described but shortly or not clearly

    • 2 points: Problem is well described and it's clear what problem the project solves

  • Cloud

    • 0 points: Cloud is not used, things run only locally

    • 2 points: The project is developed on the cloud OR the project is deployed to Kubernetes or similar container management platforms

    • 4 points: The project is developed on the cloud and IaC tools are used for provisioning the infrastructure

  • Experiment tracking and model registry

    • 0 points: No experiment tracking or model registry

    • 2 points: Experiments are tracked or models are registered in the registry

    • 4 points: Both experiment tracking and model registry are used

  • Workflow orchestration

    • 0 points: No workflow orchestration

    • 2 points: Basic workflow orchestration

    • 4 points: Fully deployed workflow

  • Model deployment

    • 0 points: Model is not deployed

    • 2 points: Model is deployed but only locally

    • 4 points: The model deployment code is containerized and could be deployed to cloud or special tools for model deployment are used

  • Model monitoring

    • 0 points: No model monitoring

    • 2 points: Basic model monitoring that calculates and reports metrics

    • 4 points: Comprehensive model monitoring that sends alerts or runs a conditional workflow (e.g. retraining, generating a debugging dashboard, switching to a different model) if the defined metric threshold is violated

  • Reproducibility

    • 0 points: No instructions on how to run the code at all

    • 2 points: Some instructions are there, but they are not complete

    • 4 points: Instructions are clear, it's easy to run the code, and the code works. The versions of all the dependencies are specified.

  • Best practices

    • There are unit tests (1 point)

    • There is an integration test (1 point)

    • Linter and/or code formatter are used (1 point)

    • There's a Makefile (1 point)

    • There are pre-commit hooks (1 point)

    • There's a CI/CD pipeline (2 points)
