Machine Learning Workflow - From EDA to Production

A hands-on case study demonstrating the stages involved in a machine learning project, from EDA to production.

Introduction

This repo studies and applies the minimal set of steps involved in a machine learning workflow, done the right way. It was compiled during the first cohort of the "Machine Learning Zoomcamp" course, instructed by the amazing @alexeygrigorev.


Problem Description

The problem we will study was posed as a Kaggle competition titled "Allstate Claims Severity". The data was provided by Allstate, a personal insurance company in the United States, which was looking for ML-based methods to reduce the cost of insurance claims. The objective is to predict the 'loss' value for a claim, which makes this a regression problem. Submissions on the test data are evaluated using the Mean Absolute Error (MAE) between the predicted loss and the actual loss. All column names and values in the provided dataset are obfuscated for privacy reasons, so we have no "domain knowledge" to lean on for this problem.
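For concreteness, here is a minimal sketch of how MAE could be computed; the loss values below are made up for illustration and do not come from the dataset:

```python
import numpy as np

# Made-up actual vs. predicted 'loss' values, purely for illustration
y_true = np.array([2213.18, 1283.60, 3005.09])
y_pred = np.array([2100.00, 1500.00, 2900.00])

# Mean Absolute Error: the average of |actual - predicted|
mae = np.mean(np.abs(y_true - y_pred))
print(f"MAE: {mae:.2f}")  # MAE: 144.89
```

The same metric is also available as `mean_absolute_error` in scikit-learn's `sklearn.metrics` module.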


About the Dataset

The dataset used in this repo is a "tabular" one, meaning the data is represented in rows and columns, corresponding to samples and features respectively.
The columns (features) include both categorical and numerical types. The train and test datasets contain 188,318 and 125,546 rows (samples) respectively, with 130 feature columns, plus two more columns, 'id' and 'loss', representing the claim id and the target.

Regarding availability, the dataset is already provided in the repo's scripts/data folder, so you don't need to download it separately.
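As a quick sanity check, a sketch like the following could load the data and confirm the shapes described above. The train.csv / test.csv file names and the cat*/cont* column prefixes are taken from the original Kaggle competition, not from this repo's scripts, so adjust them if the files are named differently:

```python
import pandas as pd

# File names assumed from the original Kaggle competition; adjust if needed
train = pd.read_csv("scripts/data/train.csv")
test = pd.read_csv("scripts/data/test.csv")

print(train.shape)  # expected: (188318, 132) -> 130 features + 'id' + 'loss'
print(test.shape)   # expected: 125546 rows

# In the Kaggle data, categorical columns start with 'cat', numerical with 'cont'
cat_cols = [c for c in train.columns if c.startswith("cat")]
cont_cols = [c for c in train.columns if c.startswith("cont")]
print(len(cat_cols), len(cont_cols))  # 116 categorical, 14 numerical
```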


Requirements

This entire repo relies on the Python programming language for demonstrating a machine learning workflow. The workflow steps are explained in "Jupyter Notebook" documents, which let you run them in an interactive environment. The easiest way to install Python and many popular data science libraries is to set up Anaconda. You may refer to the virtual-env folder in this repo for a quick guide on setting up a virtual environment without running into conflicts with your current setup.

I'd also recommend going through the notebooks one by one, in order.


Important Note

For the sake of simplicity, we assume the "Data Collection" step is already done, since we're using a publicly available Kaggle dataset. Please note, though, that this is not the case in real-world scenarios. Most of the time, tabular data is collected by querying multiple database tables and running fairly complex SQL commands (a toy illustration follows below). If you're planning to become a machine learning engineer, make sure you understand databases and can write efficient SQL queries; that, ladies and gents, turns out to be an essential and invaluable asset for an ML engineer.
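To make that point concrete, here is a toy sketch of such a collection step, using an in-memory SQLite database with hypothetical claims and policies tables. None of these table or column names come from this repo or the Allstate data:

```python
import sqlite3

import pandas as pd

# Hypothetical schema and rows, purely for illustration
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE policies (policy_id INTEGER, state TEXT);
    CREATE TABLE claims (claim_id INTEGER, policy_id INTEGER, loss REAL);
    INSERT INTO policies VALUES (1, 'IL'), (2, 'TX');
    INSERT INTO claims VALUES (10, 1, 2213.18), (11, 2, 1283.60);
""")

# Real pipelines join many such tables to assemble one flat feature table
query = """
    SELECT c.claim_id, p.state, c.loss
    FROM claims AS c
    JOIN policies AS p ON p.policy_id = c.policy_id
"""
df = pd.read_sql_query(query, conn)
print(df)
```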

The main intention of this workflow is not to achieve the best benchmark score on the subject dataset, and it by no means claims to contain the most complete set of sub-steps.

Given the above, you might ask: what is the focus here? The answer can be summarized in the following points:

  • To take a quick look at the minimal required steps involved in a machine learning problem, from EDA to production.
  • To avoid common slips, and to conduct each step the right way.

A good machine learning solution involves many steps; the ML algorithm, for instance, is just the tip of the iceberg. Hopefully, the material here gives you a good sense of what to expect on your journey 😉.


Pull requests are very welcome. Please don't hesitate to contribute if you think something is missing or needs improvement. I hope you find the content useful, and don't forget to show your support by hitting the ⭐.
