VictoryJin / uci_autompg

Introduction

A quick practice run through a basic data science workflow. Part of the goal of this exercise is to build a quick reference for data science methods and a reusable data science framework. This repo demonstrates the following:

Data Science

  • Data Analysis
  • Feature Engineering
  • Short Text Analysis
  • Long text analysis (unavailable via the UCI dataset)
  • Model development
    • Supervised methods - Linear model
    • Unsupervised methods
    • Neural Networks
    • Model interpretability

MLOps

  • Utility libraries and pipeline development under src
  • Operationalization
  • Dashboarding

TODOs

Goal: once the dataset is imported, the pipeline should automatically feature-engineer the columns and evaluate them against a custom series of models. This requires the following functions:

  • Automatic feature engineering based on column types
    • normalization, categorical expansion, imputation strategies
  • Create model files
  • Take in a model file and generate custom reports/plots for easy comparison
    • These check prediction error, overfitting, mean absolute error (if applicable), AUC, and AP (average precision)
  • Model interpretability
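The type-based feature engineering step described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this repo: the function names (`is_numeric`, `engineer_features`), the toy column names, and the specific choices (median imputation and z-score normalization for numeric columns, most-frequent imputation and one-hot expansion for categorical ones) are all hypothetical. It uses only the standard library; scikit-learn's `ColumnTransformer` with `SimpleImputer`, `StandardScaler`, and `OneHotEncoder` covers the same ground.

```python
from statistics import mean, median, mode, pstdev


def is_numeric(values):
    """Treat a column as numeric if every non-missing value is an int or float."""
    present = [v for v in values if v is not None]
    return bool(present) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )


def engineer_features(columns):
    """Hypothetical helper: map {column name: raw values} to engineered columns.

    Missing values are represented as None. Assumes every column has at
    least one non-missing value.
    """
    engineered = {}
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        if is_numeric(values):
            # Numeric column: impute with the median, then z-score normalize.
            fill = median(present)
            filled = [fill if v is None else v for v in values]
            mu = mean(filled)
            sigma = pstdev(filled) or 1.0  # avoid dividing by zero on constant columns
            engineered[name] = [(v - mu) / sigma for v in filled]
        else:
            # Categorical column: impute with the most frequent value,
            # then expand into one indicator column per category.
            fill = mode(present)
            filled = [fill if v is None else v for v in values]
            for category in sorted(set(filled)):
                engineered[f"{name}={category}"] = [
                    1.0 if v == category else 0.0 for v in filled
                ]
    return engineered


# Toy data with illustrative column names, not the real dataset.
raw = {
    "horsepower": [130.0, None, 150.0, 95.0],
    "origin": ["usa", "japan", None, "usa"],
}
features = engineer_features(raw)
for column, values in features.items():
    print(column, values)
```

Dispatching on the inferred column type keeps the import step free of per-column configuration, at the cost of misclassifying numerically coded categories, which would need an explicit override.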

Logs

  • 7/5: Define project skeleton, create import function with basic preprocessing.
  • 7/6: Basic EDA via pairplot visualization. Additional preprocessing/feature engineering.
  • 7/9: Fuzzy matching, integrated into the import pipeline. Train baseline model.
  • 7/13: OLS assumptions testing, as referenced from Jeff Macaluso's cookbook.
  • 7/XX: Implement train/metric report pipeline with scikit-learn, XGBoost, and CatBoost. Compare each model.
  • 7/XX: Train neural network, try encoding methodologies, compare performance.
  • 7/XX: Recreate paper1 & paper2.
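The 7/9 entry does not say what the fuzzy matching is applied to. A minimal sketch, assuming it is used to repair misspelled car makes during import, could look like the following. The `KNOWN_MAKES` list, the `normalize_make` helper, the sample names, and the `0.8` cutoff are illustrative assumptions; the matching itself uses `difflib.get_close_matches` from the standard library.

```python
from difflib import get_close_matches

# Hypothetical reference list of canonical make names.
KNOWN_MAKES = ["chevrolet", "toyota", "volkswagen", "mazda", "mercedes-benz", "ford"]


def normalize_make(raw_make, known=KNOWN_MAKES, cutoff=0.8):
    """Return the closest canonical make, or the cleaned input if nothing is close enough."""
    candidate = raw_make.strip().lower()
    matches = get_close_matches(candidate, known, n=1, cutoff=cutoff)
    return matches[0] if matches else candidate


# Illustrative car names containing typical misspellings.
car_names = [
    "chevroelt chevelle malibu",
    "toyouta corona mark ii",
    "vokswagen rabbit",
    "ford pinto",
]

# Assume the make is the first token of the car name.
makes = [normalize_make(name.split()[0]) for name in car_names]
print(makes)
```

Unrecognized makes pass through unchanged, so the step cannot silently merge a legitimate make into the wrong category; a lower cutoff catches more typos but raises that risk.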

Finish


Languages

Jupyter Notebook 98.9%, Python 1.1%