
# Build an ML Pipeline for Short-Term Rental Prices in NYC

This is the capstone project of the "Building a Reproducible Model Workflow" Udacity course within the "Machine Learning DevOps Engineer" nanodegree. Project instructions can be found in the NOTES.md file.

## 🎯 Project Overview

You are working for a property management company renting rooms and properties for short periods of time on various platforms. You need to estimate the typical price for a given property based on the price of similar properties. Your company receives new data in bulk every week. The model needs to be retrained with the same cadence, necessitating an end-to-end pipeline that can be reused.

In this project you will build such a pipeline.

## 🛠️ Setup

I used a Taskfile to document all of this project's tasks in a single YAML file; you can read more about Taskfiles here. The tasks are listed in the table below, followed by a minimal sketch of what such a file looks like.

| Task | Description |
| --- | --- |
| `create_env` | create the `nyc_airbnb_dev` conda environment |
| `remove_env` | remove the `nyc_airbnb_dev` conda environment |
| `list_envs` | list all available conda environments |
| `create_mlflow_step` | create a new MLflow step under the `src/` directory using the template in the `cookie-mlflow-step/` folder |
| `install_precommit_hooks` | install the pre-commit git hooks defined in `.pre-commit-config.yaml` |
| `list_mlflow_envs` | list all the environments having the `mlflow` prefix |
| `remove_mlflow_envs` | remove **all** the environments having the `mlflow` prefix |
| `download` | run the download step of the pipeline |
| `eda` | run the EDA step of the pipeline |
| `data_clean` | run the data cleaning step of the pipeline |
| `data_check` | run the data testing step of the pipeline |
| `data_split` | run the data splitting step of the pipeline |
| `train_model` | run the model training step of the pipeline |
| `opt_hyper_params` | run the hyperparameter optimization step of the pipeline |
| `prod_model_test` | run tests against the model we wish to deploy to the production environment |
| `train_model_new_data` | train the released model on new data |
- To execute a particular task, run `task <task_name>`, e.g. `task create_env`.
- To list all tasks, run `task --list`.
- To get more details on a particular task, run `task --summary <task_name>`, e.g. `task --summary create_env`.
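As an illustration, here is a minimal sketch of what such a Taskfile might look like. The `environment.yml` file name and the `mlflow run` entry point are assumptions based on the course's usual project layout, not necessarily this repo's exact configuration:

```yaml
# Taskfile.yml — illustrative sketch, not the project's exact file
version: '3'

tasks:
  create_env:
    desc: create the nyc_airbnb_dev conda environment
    cmds:
      - conda env create -f environment.yml  # file name assumed

  download:
    desc: run the download step of the pipeline
    cmds:
      # "steps=download" follows the hydra-override convention used in the course
      - mlflow run . -P steps=download
```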

## ℹ️ Notes

- Before running the "Data Testing" step, make sure to add the `reference` tag to the `clean_sample.csv` W&B artifact (see the first sketch after these notes).

- To find the best model, sort the runs by ascending MAE (Mean Absolute Error); the best result will be at the top.

- Once you have identified the best model, click on its name, go to its artifacts, select the "model_export" output artifact, and add a `prod` tag to it to mark it as production ready (the second sketch after these notes shows how to do this from code).

- When your pipeline is ready for release, its artifact graph should look like the following:

- Once released, your releases page should look similar to the following:

  - GitLab

  - GitHub

Note that I used GitLab to release the pipeline and pushed the same repo to both GitLab and GitHub. GitLab's releases show up under Tags on GitHub (not Releases) because GitHub only lists releases created through its own Releases feature; tags pushed from another remote appear as plain git tags.
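For the first note above: instead of using the W&B UI, the `reference` tag (an artifact alias in the W&B API) can also be added programmatically. A minimal sketch, assuming a placeholder project path `<entity>/nyc_airbnb`:

```python
import wandb

# "<entity>/nyc_airbnb" is a placeholder W&B path; substitute your own
# entity and project names.
api = wandb.Api()
artifact = api.artifact("<entity>/nyc_airbnb/clean_sample.csv:latest")

# Add the "reference" alias so the data_check step can use this version
# as its reference dataset.
artifact.aliases.append("reference")
artifact.save()
```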
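For the second and third notes: the runs can also be sorted by MAE and the winning model tagged from code. A sketch under the same placeholder path, assuming the training step logged the metric under the summary key `mae` and exported the model as an artifact of type `model_export`:

```python
import wandb

api = wandb.Api()

# Sort runs by ascending MAE ("+" = ascending); the best run comes first.
# Assumes the metric was logged under the summary key "mae".
runs = api.runs("<entity>/nyc_airbnb", order="+summary_metrics.mae")
best_run = runs[0]
print(f"Best run: {best_run.name} (MAE = {best_run.summary['mae']})")

# Mark the best run's exported model as production ready by adding
# the "prod" alias. The artifact type "model_export" is an assumption.
for artifact in best_run.logged_artifacts():
    if artifact.type == "model_export":
        artifact.aliases.append("prod")
        artifact.save()
```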
