Introduction
A quick, basic practice run through a data science workflow. Part of the goal for this exercise is to have a quick reference for data science methods and to create a reusable data science framework. This repo demonstrates the following:
Data Science
- Data analysis
- Feature engineering
- Short text analysis
- Long text analysis (unavailable via the UCI dataset)
- Model development
  - Supervised methods (linear model)
  - Unsupervised methods
  - Neural networks
- Model interpretability
MLOps
- Utility libraries and pipeline development under `src`
- Operationalization
- Dashboarding
TODOs
After importing the dataset, the pipeline should automatically feature-engineer the columns and evaluate them against a custom series of models. This requires the following functions:
- Automatic feature engineering based on column types
  - Normalization, categorical expansion, and imputation strategies (see the first sketch after this list)
- Create model files
- Take in a model file and generate custom reports/plots for easy comparison
  - Checks prediction error, overfitting, mean absolute error (where applicable), AUC, and average precision (AP); see the second sketch after this list
- Model interpretability
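A minimal sketch of what the automatic feature-engineering step could look like, assuming pandas and scikit-learn; the `build_preprocessor` helper and the specific strategies are illustrative assumptions, not existing repo code.

```python
# Hypothetical sketch: infer column types and build a preprocessing pipeline.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def build_preprocessor(df: pd.DataFrame) -> ColumnTransformer:
    numeric_cols = df.select_dtypes(include="number").columns.tolist()
    categorical_cols = df.select_dtypes(include=["object", "category"]).columns.tolist()

    numeric_pipe = Pipeline([
        ("impute", SimpleImputer(strategy="median")),   # imputation strategy
        ("scale", StandardScaler()),                    # normalization
    ])
    categorical_pipe = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),  # categorical expansion
    ])
    return ColumnTransformer([
        ("num", numeric_pipe, numeric_cols),
        ("cat", categorical_pipe, categorical_cols),
    ])
```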
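And a rough sketch of the report step under the same assumptions: load a saved model file and compute the comparison metrics (MAE where applicable, AUC, average precision). The `evaluate_model` function and the task flag are hypothetical.

```python
# Hypothetical sketch of the metric-report step; metric choice depends on task type.
import joblib
from sklearn.metrics import average_precision_score, mean_absolute_error, roc_auc_score

def evaluate_model(model_path, X_test, y_test, task="classification"):
    model = joblib.load(model_path)  # the "model file" created earlier in the pipeline
    metrics = {}
    if task == "regression":
        preds = model.predict(X_test)
        metrics["mae"] = mean_absolute_error(y_test, preds)
    else:
        # assumes a binary classifier exposing predict_proba
        scores = model.predict_proba(X_test)[:, 1]
        metrics["auc"] = roc_auc_score(y_test, scores)
        metrics["ap"] = average_precision_score(y_test, scores)
    return metrics
```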
Logs
- 7/5: Define project skeleton, create import function w/ basic preprocessing
- 7/6: Basic EDA via pairplot visualization. Additional preprocessing/feature engineering.
- 7/9: Fuzzy matching and integration into the import pipeline (see sketch below). Trained baseline model.
- 7/13: OLS assumptions testing (see sketch below), as referenced in Jeff Maculuso's Cookbook.
- 7/XX: Implement the train/metric report pipeline through scikit-learn, xgboost, and catboost, with comparisons for each model.
- 7/XX: Train a neural network, evaluate encoding methodologies, and compare performance.
- 7/XX: Recreate paper1 & paper2.
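For reference, a minimal sketch of the fuzzy-matching step from the 7/9 entry, using the standard-library difflib; the canonical-label list and cutoff are assumptions, not necessarily what the import pipeline uses.

```python
# Hypothetical sketch: map free-text values onto a list of canonical labels
# using difflib's similarity ratio; labels and cutoff are illustrative.
import difflib

def fuzzy_normalize(value: str, canonical: list[str], cutoff: float = 0.8) -> str:
    matches = difflib.get_close_matches(value, canonical, n=1, cutoff=cutoff)
    return matches[0] if matches else value  # keep original if nothing is close enough

# Example: fuzzy_normalize("new yrok", ["new york", "boston"]) -> "new york"
```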
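Likewise, a rough sketch of the 7/13 OLS assumptions checks using statsmodels; the diagnostics shown (Jarque-Bera, Breusch-Pagan, Durbin-Watson) are common choices and may differ from the Cookbook's exact tests.

```python
# Hypothetical sketch of OLS assumption checks on a fitted statsmodels model.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson, jarque_bera

def check_ols_assumptions(X: np.ndarray, y: np.ndarray) -> dict:
    X_const = sm.add_constant(X)
    model = sm.OLS(y, X_const).fit()
    resid = model.resid
    jb_stat, jb_pvalue, _, _ = jarque_bera(resid)                 # normality of residuals
    bp_stat, bp_pvalue, _, _ = het_breuschpagan(resid, X_const)   # homoscedasticity
    return {
        "r_squared": model.rsquared,
        "jarque_bera_p": jb_pvalue,
        "breusch_pagan_p": bp_pvalue,
        "durbin_watson": durbin_watson(resid),                    # autocorrelation (~2 is good)
    }
```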