semanurkps / ML_Pipeline

Summary of weekly ML Pipeline sessions

Automated Machine Learning Pipeline Class


Session 1 - From Modelling to Production

Intro to ML modelling

  • DS Modelling

  • DS Life Cycle

  • DS principles

  • ML pipeline

  • Production of ML - ML development

  • Production of ML - tasks for apply

  • ML Pipeline - Target

  • Directed Acyclic Graph

  • ML Pipeline - Production ML Infrastructure

  • Orchestration

References:

    • Executive Data Science: A Guide to Training and Managing the Best Data Scientists (Brian Caffo, Jeff Leek, Roger Peng)
    • The Practical Guide to Managing Data Science at Scale (Domino)
    • Executive Data Science: Coursera-Johns Hopkins University
    • Building Machine Learning Pipelines by Hannes Hapke, Catherine Nelson

💠 From Modelling to Production Video


Session 2 - Software Engineering for ML

  • Application Life Cycle
    • Software Development Life Cycle
  • Data Science Life Cycle

💠 Software Engineering for ML Video


Session 3 - Toolkit: Git

  • Github
  • Gitbash

💠 Toolkit: Git Video


Session 4 - Toolkit: Colab & Python

  • Google Colab
  • Install Python

💠 Toolkit: Colab & Python Video


Session 5 - Toolkit: Python Environments

In this session, Thom Ives explains how to build a Python virtual environment ...

  • Python 3.x
  • Virtual environment wrapper
  • System Variables
  • Health Informatics Intro (starts 36:14)

💠 Toolkit: Python Environments & Health Informatics Intro (starts 36:14) Video

Ghaith Sankari shows an example of integrating a Python project with a .NET Core Web API project using Visual Studio.

VS Video 1 | VS Video 2 | VS Video 3


Session 6 - Data Set, Data Sample, Data Issues

Why data matters in the ML process, what sampling is, why data issues appear, and which issues matter most.

  • Feature Space
  • Data Samples
  • Data Issues
  • Data Drift
  • Concept Drift

Assignment (explanation only): You take random samples of the same size from a large population, compute the mean of each sample, and plot the distribution of those means. What shape does that distribution take?

Central Limit Theorem
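A minimal sketch of the assignment, using only the standard library. The uniform population, the population size, the sample size of 50 and the 2,000 repetitions are all illustrative assumptions, not values from the session.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population: 100,000 draws from a flat (clearly non-normal) distribution
population = [random.uniform(0, 100) for _ in range(100_000)]

def sample_means(pop, sample_size, n_samples):
    """Return the mean of each of n_samples random samples of equal size."""
    return [statistics.fmean(random.sample(pop, sample_size)) for _ in range(n_samples)]

means = sample_means(population, sample_size=50, n_samples=2_000)

# The sample means cluster around the population mean, and their spread
# is close to the population standard deviation divided by sqrt(sample size).
print("population mean:          ", round(statistics.fmean(population), 2))
print("mean of sample means:     ", round(statistics.fmean(means), 2))
print("std of sample means:      ", round(statistics.stdev(means), 2))
print("population std / sqrt(50):", round(statistics.pstdev(population) / 50 ** 0.5, 2))
```

A histogram of `means` (for example `plt.hist(means, bins=40)`) should look roughly bell-shaped even though the population itself is flat.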

Resource: ML Data and Concept Drift

💠 Data Issues Video


Session 7 - Create Fake Data (is Fun!)

How to create fake data with Python.

Assignment: What is heteroskedasticity? Why is it a challenge? Illustrate it in a notebook.

  • Send a DM to Thom; correct answers can be shared with the group.

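import random

import matplotlib.pyplot as plt

# Feature values: 0.0, 0.1, ..., 9.9
X = [x / 10.0 for x in range(100)]

# Linear model y = 2x + 5 plus uniform noise in [-0.5, 0.5) times a noise amplitude.
# The amplitude is 0 here, so every point lies exactly on the line; raise it to add scatter.
Y = [2.0 * x + (random.random() - 0.5) * 0 + 5 for x in X]

plt.scatter(X, Y)
plt.title('This Is The Title')
plt.xlabel('These Are The X Values')
plt.ylabel('These Are The Y Values')
plt.show()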

Added a Colab workbook on heteroskedasticity here
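One possible way to illustrate the assignment, as a sketch rather than the workbook's actual content: the same line y = 2x + 5 is given noise with a constant spread in one data set and noise whose spread grows with x in the other. The Gaussian noise and the scale factors 1.0 and 0.5 * x are assumptions chosen for illustration.

```python
import random
import statistics

random.seed(1)

X = [x / 10.0 for x in range(1, 1001)]  # 0.1, 0.2, ..., 100.0

# Homoskedastic: the noise spread is constant.
Y_homo = [2.0 * x + 5 + random.gauss(0, 1.0) for x in X]
# Heteroskedastic: the noise spread grows with x (assumed form: 0.5 * x).
Y_hetero = [2.0 * x + 5 + random.gauss(0, 0.5 * x) for x in X]

def residual_spread(xs, ys):
    """Standard deviation of the residuals around the known line y = 2x + 5."""
    return statistics.pstdev([y - (2.0 * x + 5) for x, y in zip(xs, ys)])

half = len(X) // 2
for name, Y in [("homoskedastic", Y_homo), ("heteroskedastic", Y_hetero)]:
    low = residual_spread(X[:half], Y[:half])
    high = residual_spread(X[half:], Y[half:])
    print(f"{name}: residual std for small x = {low:.2f}, for large x = {high:.2f}")
```

In the heteroskedastic set the residual spread differs between the two halves. That non-constant spread is what breaks the constant-variance assumption behind ordinary least squares standard errors.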

💠 Fake Data is Fun Video


Session 8 - Linear Regression with Fake Data

Assignment: Play with the models. ❗ (Please re-pull the repo)

  1. First run the fake data creation .py scripts:
    1. Fake_Single_Feature_Linear_Data.py
    2. Fake_Single_Feature_NonLinear_Data.py
    3. Fake_Double_Feature_Linear_Data.py
    4. Fake_Double_Feature_NonLinear_Data_with_Functional_Noise.py
  2. These will create 5 different .csv files of data.
  3. Next, run each of the files in the folder Intro_to_Regression_Modeling. Explore, play with, and understand what each script does, and look back at how the fake data was created.

💡 You can import sys and call sys.exit() in the script to force it to stop at that point, so you are not running the complete script.

  1. General_Toolls.py: this file is a module that you can import from your script. It has functions to calculate and print:
    1. Mean Square Error --> MSE
    2. Root Mean Square Error --> RMSE
    3. Mean Absolute Error --> MAE
    4. Median Absolute Error --> MeDAE
    5. R^2 --> r2
    6. Adjusted R^2 --> r2_adj
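For reference, the six metrics above can be computed as follows. This is a standalone sketch using the textbook definitions, not the code in General_Toolls.py; the function name `regression_metrics` and the sample values are hypothetical.

```python
import math
import statistics

def regression_metrics(y_true, y_pred, n_features):
    """Return the six metrics listed above, using their textbook definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)               # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    mse = ss_res / n
    r2 = 1.0 - ss_res / ss_tot
    # Adjusted R^2 penalises R^2 for the number of features used
    r2_adj = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)
    return {
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "MAE": sum(abs(e) for e in errors) / n,
        "MeDAE": statistics.median([abs(e) for e in errors]),
        "r2": r2,
        "r2_adj": r2_adj,
    }

# Hypothetical targets and predictions from a single-feature model
y_true = [3.0, 5.0, 7.0, 9.0, 11.0]
y_pred = [2.5, 5.5, 7.0, 9.5, 10.0]
for name, value in regression_metrics(y_true, y_pred, n_features=1).items():
    print(f"{name}: {value:.4f}")
```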

Regression Analysis

Regression Statistics

💠 Linear Regression with Fake Data Video


Session 9 - Deep Learning Scenario, Intro to TensorFlow, Data Preparation

Convolutional neural networks (CNN)

Summary of session
  • Convolutional Layer
  • Effect of Filter Size (Kernel Size)
  • Max Pool
  • Average Pool
  • Batch Sizing
  • Padding
  • Epochs
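A small sketch of how several of these terms interact, in plain Python without TensorFlow. The stride parameter and all of the sizes used are assumptions added for illustration; the output-size rule is the standard floor((n + 2p - k) / s) + 1.

```python
import math

def conv_output_size(input_size, kernel_size, padding=0, stride=1):
    """Spatial output size of a convolution or pooling layer along one dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

def pool2x2(image, reduce_fn):
    """Apply non-overlapping 2x2 pooling to a 2-D list with even dimensions."""
    return [
        [reduce_fn([image[r][c], image[r][c + 1], image[r + 1][c], image[r + 1][c + 1]])
         for c in range(0, len(image[0]), 2)]
        for r in range(0, len(image), 2)
    ]

# Effect of filter size and padding on a 28-pixel-wide input
print(conv_output_size(28, kernel_size=3))             # larger kernels shrink the output more
print(conv_output_size(28, kernel_size=5))
print(conv_output_size(28, kernel_size=3, padding=1))  # padding of 1 keeps the size for a 3x3 kernel

image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
max_pooled = pool2x2(image, max)
avg_pooled = pool2x2(image, lambda values: sum(values) / len(values))
print(max_pooled, avg_pooled)

# Batch size and epochs: number of weight updates for a hypothetical data set
n_samples, batch_size, epochs = 1000, 32, 5
steps_per_epoch = math.ceil(n_samples / batch_size)
print(steps_per_epoch, steps_per_epoch * epochs)
```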

👇 Here are some links that have some visual explanations and a playground to experiment.

Regex: regular expressions (import re). What is regex?
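As a hedged illustration of where `re` can help in data preparation, the sketch below pulls a label and an index out of file names. The file names and the naming scheme are hypothetical, not taken from the session.

```python
import re

# Hypothetical file names; the "<label>_<index>.png" scheme is an assumption
filenames = ["infected_0012.png", "uninfected_0104.png", "notes.txt"]

# Named groups make the extracted pieces self-describing
pattern = re.compile(r"^(?P<label>[a-z]+)_(?P<index>\d+)\.png$")

parsed = []
for name in filenames:
    match = pattern.match(name)
    if match:  # names that do not fit the scheme are skipped
        parsed.append((match.group("label"), int(match.group("index"))))

print(parsed)
```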

💠 Deep Learning Scenario, Intro to TensorFlow, Data Preparation Video


Session 10 - Data Augmenting

Data Augmenting Techniques

  • Mirroring

    • Flip Horizontal / Vertical
    • Flip Random
  • Cropping

  • Rotate

  • Recolor

  • PCA, Principal Component Analysis (topic for later lesson)
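The techniques above (except PCA) can be sketched with plain NumPy array operations. This is an illustration on a toy array, not the TensorFlow code used in the session; the crop window and the brightness factor are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy image: height 4, width 6, 3 colour channels
image = np.arange(4 * 6 * 3, dtype=np.uint8).reshape(4, 6, 3)

flipped_h = np.flip(image, axis=1)  # mirror left-right
flipped_v = np.flip(image, axis=0)  # mirror top-bottom
# Random flip: mirror with probability 0.5
flipped_random = np.flip(image, axis=1) if rng.random() < 0.5 else image

cropped = image[1:3, 2:5]                       # keep rows 1-2 and columns 2-4
rotated_90 = np.rot90(image, k=1, axes=(0, 1))  # 90 degrees counter-clockwise

# Recolor (one simple variant): scale brightness, then clip to the valid range
brighter = np.clip(image.astype(np.float32) * 1.2, 0, 255).astype(np.uint8)

for name, arr in [("flipped_h", flipped_h), ("cropped", cropped), ("rotated_90", rotated_90)]:
    print(name, arr.shape)
```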

Ghaith

Here are some notes about the data augmenting session:

Data augmentation techniques are used in deep learning, but they are still part of data preparation. Because of this, data augmentation mechanisms will be customized to form an important part of the ML pipeline.

I wanted to start by solving data quantity issues; then we will get back to the fancier and more fun part related to data quality. The assignment for next week is to answer the following questions:

  • How do you perform a customized rotation (any angle in degrees, not only 90)? Python code is required, and I hope to find volunteers to present. This task can be solved in many ways; cooperating with other family members to cover several of them is allowed and appreciated.

  • Is it possible to re-color a grayscale image, and how? For this question we are not looking for coding examples, only an explanation and evidence for the answer. Consider it a research task; brave presenters are highly appreciated.
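One of the many possible approaches to the rotation question, sketched with NumPy only: for every output pixel, compute which source pixel lands there under the rotation and copy it (nearest-neighbour inverse mapping). The function name `rotate_image` is hypothetical, and library routines with interpolation would normally give smoother results.

```python
import numpy as np

def rotate_image(image, degrees):
    """Rotate a 2-D or H x W x C array about its centre by any angle.

    Positive angles rotate counter-clockwise as displayed. The output keeps
    the input size; pixels with no source are left as zeros.
    """
    theta = np.deg2rad(degrees)
    h, w = image.shape[:2]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.indices((h, w))
    x_rel, y_rel = xs - cx, ys - cy
    # Inverse mapping: source coordinates for each output pixel
    src_x = np.rint(np.cos(theta) * x_rel - np.sin(theta) * y_rel + cx).astype(int)
    src_y = np.rint(np.sin(theta) * x_rel + np.cos(theta) * y_rel + cy).astype(int)
    valid = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    out = np.zeros_like(image)
    out[valid] = image[src_y[valid], src_x[valid]]
    return out

toy = np.arange(25).reshape(5, 5)
print(rotate_image(toy, 90))  # matches np.rot90(toy) on this square array
print(rotate_image(toy, 30))  # arbitrary angle; corners fall outside and stay 0
```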

TensorFlow data augmentation tutorial

💠 Data Augmenting Video


Session 11 - Data Imbalance

A classification data set with skewed class proportions is called imbalanced. Classes that make up a large proportion of the data set are called majority classes. Those that make up a smaller proportion are minority classes.

  • Created an imbalanced data set by taking a sample from each set (infected, uninfected)

    • using .sample from the Pandas library
  • Data set, Name of image, folder name, label

    • label: 1 = infected, 0 = uninfected
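A minimal sketch of building such a table and forcing an imbalance with `DataFrame.sample`. The file names, the 1,000 images per class, and the 100 / 900 sample sizes are assumptions for illustration, not the values used in the notebook.

```python
import pandas as pd

# Hypothetical file tables with the columns described above
infected = pd.DataFrame(
    {"image": [f"inf_{i}.png" for i in range(1000)], "folder": "infected", "label": 1}
)
uninfected = pd.DataFrame(
    {"image": [f"uninf_{i}.png" for i in range(1000)], "folder": "uninfected", "label": 0}
)

# Draw different sample sizes from each class, then shuffle the combined rows
imbalanced = (
    pd.concat(
        [infected.sample(n=100, random_state=0), uninfected.sample(n=900, random_state=0)],
        ignore_index=True,
    )
    .sample(frac=1.0, random_state=0)
    .reset_index(drop=True)
)

print(imbalanced["label"].value_counts())

# Class counts in the first few batches of size 10
for start in range(0, 30, 10):
    batch = imbalanced.iloc[start:start + 10]
    print(f"batch {start // 10}:", batch["label"].value_counts().to_dict())
```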

MONAI is good to use on medical data sets, with predefined tools for 2D and 3D images. Explanation in the video at 14:00 minutes.

  • Example of batches of size 10 showing the class imbalance.

Assignment

See the impact of oversampling and how the distribution of the data changes.

  • Experiment with batch sizes
  • Experiment with import sample sizes
  • What other techniques are there to solve the imbalance by changing the number of samples in each batch?
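To get a feel for the assignment, the sketch below compares plain random batches with batches drawn using inverse-class-frequency weights. It uses only the standard library and a hypothetical 10% / 90% label split; it is not the sampler used in the notebook.

```python
import random
from collections import Counter

random.seed(0)

labels = [1] * 100 + [0] * 900  # hypothetical split: 10% infected, 90% uninfected

# Weight each sample by the inverse of its class frequency
counts = Counter(labels)
weights = [1.0 / counts[label] for label in labels]

def draw_batches(labels, weights, batch_size, n_batches):
    """Sample batches with replacement; weights=None gives plain random sampling."""
    return [random.choices(labels, weights=weights, k=batch_size) for _ in range(n_batches)]

def minority_share(batches):
    """Fraction of label-1 samples across all batches."""
    flat = [label for batch in batches for label in batch]
    return sum(flat) / len(flat)

plain = draw_batches(labels, None, batch_size=10, n_batches=200)
oversampled = draw_batches(labels, weights, batch_size=10, n_batches=200)

print("minority share, plain sampling:   ", round(minority_share(plain), 3))
print("minority share, weighted sampling:", round(minority_share(oversampled), 3))
```

Changing `batch_size` here is a quick way to see how often a small batch contains no minority samples at all under plain sampling.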

Always good to see other tools and share our findings in our pipeline_class_chat

MORE TOGETHER!

💠 Notebook 💠 Data Imbalance Video


Session 12 - Data balancing & training effect

  • Weight Computation for Oversampling & Penalization
  • Use of Pre-Trained Models.
    • Trained on labelled samples
    • ImageNet (1 million images, split into 1000 categories)
    • Uses ResNet18; there are others, and different varieties can be used.
  • Training and Validation
    • training uses RandFlipd, RandRotate90d, RandGaussianNoised
    • validation uses no transformations
  • Training vs Test Accuracy

Confusion Matrix

                        Actual Positive (1)   Actual Negative (0)
Predicted Positive (1)  TP                    FP
Predicted Negative (0)  FN                    TN

TP: True Positive, TN: True Negative, FP: False Positive (Type 1 error), FN: False Negative (Type 2 error)

  • Recall = TP / (TP + FN)
  • Precision = TP / (TP+FP)
  • F-Score = 2 * Recall * Precision / (Recall + Precision) (used to compare models)
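A short worked example of the three formulas, with hypothetical labels and predictions (1 = infected, 0 = uninfected):

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

# Hypothetical ground truth and model output
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
recall = tp / (tp + fn)
precision = tp / (tp + fp)
f_score = 2 * recall * precision / (recall + precision)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"recall={recall:.3f} precision={precision:.3f} f_score={f_score:.3f}")
```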

💠 Data balancing & training effect Video


Session 13 - Collecting Data From Storage

Creating Data with SQL, Microsoft SQL Server Management Studio (SSMS)

  • Collection of data for Timeseries Analysis
  • Randomize data collection
  • Using While < 10000 to collect 10000 samples
  • Using Date to randomize patient transactions for collection
  • Create a Procedure that can be called, for example, from Python
  • Example of creating the ERD (Entity Relationship Diagram) in SSMS

💠 Collecting Data From Storage Video


Session 14 - SQL & Python

  • sqlalchemy
  • sqlalchemy engine
  • Define functions for server and db connection
  • Functions for
    • Checking table exists
    • Create_table
    • Drop_table
    • Insert Dataframes as Table
    • Update DB

Examples of:

  • SQL query pull and convert to Pandas DF.
  • Pandas DF to SQL Table.
  • Checksum, for detecting errors
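The helper functions and the checksum idea can be sketched with the standard library alone. Note the substitutions: this uses `sqlite3` with an in-memory database in place of sqlalchemy and SQL Server, leaves out the Pandas DataFrame round trip, and the table name, columns and helper names are hypothetical.

```python
import hashlib
import sqlite3

def table_exists(conn, name):
    """Check the SQLite catalogue for a table with this name."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?", (name,)
    ).fetchone()
    return row is not None

def rows_checksum(rows):
    """SHA-256 over the sorted rows; any changed value changes the digest."""
    return hashlib.sha256(repr(sorted(rows)).encode("utf-8")).hexdigest()

conn = sqlite3.connect(":memory:")  # throwaway database for the example

# Create a table and insert some hypothetical rows
conn.execute("CREATE TABLE visits (patient_id INTEGER, visit_date TEXT, cost REAL)")
source_rows = [(1, "2021-01-05", 120.0), (2, "2021-02-11", 80.5), (3, "2021-03-02", 42.0)]
conn.executemany("INSERT INTO visits VALUES (?, ?, ?)", source_rows)
conn.commit()

# Pull the rows back and compare checksums to detect transfer errors
loaded_rows = conn.execute("SELECT patient_id, visit_date, cost FROM visits").fetchall()
print("table exists:", table_exists(conn, "visits"))
print("checksums match:", rows_checksum(source_rows) == rows_checksum(loaded_rows))

conn.execute("DROP TABLE visits")
print("table exists after drop:", table_exists(conn, "visits"))
```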

💠 More on Data SQL & Python Video

Session 15 - Fake Data Creation Part 2

  • Reference to Khuyen Tran, Faker Article

  • Fake Data for Regression.

    • Functions to define features / noise / model
    • Plotting Model
  • Create Fake Classification Data

    • Functions for Clusters, and Labels
    • Plotting Model
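A minimal sketch of the two kinds of fake data described above, using NumPy. All function names, the slope and intercept, the noise scale and the cluster centres are hypothetical choices, not the ones used in the session; plotting is left out.

```python
import numpy as np

rng = np.random.default_rng(42)

# --- Fake regression data: separate functions for features, noise and model ---
def make_features(n_samples, low=0.0, high=10.0):
    return rng.uniform(low, high, size=n_samples)

def make_noise(n_samples, scale=1.0):
    return rng.normal(0.0, scale, size=n_samples)

def linear_model(x, slope=2.0, intercept=5.0):
    return slope * x + intercept

x = make_features(200)
y = linear_model(x) + make_noise(200, scale=0.5)

# --- Fake classification data: one Gaussian cluster per class ---
def make_clusters(centers, n_per_cluster, spread=0.5):
    """Return (points, labels); each label is the index of the cluster centre."""
    points = np.vstack(
        [rng.normal(center, spread, size=(n_per_cluster, len(center))) for center in centers]
    )
    labels = np.repeat(np.arange(len(centers)), n_per_cluster)
    return points, labels

points, labels = make_clusters(centers=[(0.0, 0.0), (4.0, 4.0)], n_per_cluster=100)

print("regression data:", x.shape, y.shape)
print("classification data:", points.shape, np.bincount(labels))
```

A quick sanity check is to fit a line with `np.polyfit(x, y, 1)` and confirm that the recovered slope and intercept are close to the values passed to `linear_model`.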

Libraries: pandas, numpy, json, matplotlib.pyplot

💠 Fake Data Creation Part 2
