automated-pipeline ci-cd machine-learning-algorithms

Wafer Fault Detection

The aim is to detect faults in a wafer sensor by examining the data the sensor generates and classifying each wafer as Good (-1) or Faulty/Bad (1).
Each wafer sensor has 590 sensors, each of which sends one value, so one record (i.e., one wafer sensor) has 590 columns of data.
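
To make the record layout concrete, here is a minimal sketch of a per-record check based only on the two facts stated above (590 sensor values per record, labels of -1 or 1). The names `validate_record`, `EXPECTED_SENSOR_COLUMNS`, and `VALID_LABELS` are hypothetical and are not taken from the repository's `validator.py`.

```python
EXPECTED_SENSOR_COLUMNS = 590  # one value per sensor, as described above
VALID_LABELS = {-1, 1}         # -1 = good wafer, 1 = faulty/bad wafer


def validate_record(sensor_values, label=None):
    """Return a list of problems found in one wafer record (empty if none)."""
    problems = []
    if len(sensor_values) != EXPECTED_SENSOR_COLUMNS:
        problems.append(
            f"expected {EXPECTED_SENSOR_COLUMNS} sensor values, "
            f"got {len(sensor_values)}"
        )
    # Prediction data has no label, so the label check is optional.
    if label is not None and label not in VALID_LABELS:
        problems.append(f"label must be -1 or 1, got {label!r}")
    return problems


good_record = validate_record([0.0] * 590, label=-1)
bad_record = validate_record([0.0] * 589, label=0)
print(good_record)      # no problems reported
print(len(bad_record))  # one problem for the length, one for the label
```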

Training Pipeline -
Once triggered, the training pipeline fetches the files in the Training Data folder from the cloud and validates them. If all acceptance criteria are met, the data from those files is inserted into the MongoDB database (MongoDB Atlas). Training then starts: it fetches the data from the database as a single dataframe containing all the training data, preprocesses it, performs clustering on it, and triggers the Model Builder for each cluster.
The Model Builder class trains 5 ML classification models on the provided data over various sets of parameters using GridSearchCV to find the best parameters for each model, then retrains the 5 models with those best parameters. The ROC-AUC score is then used to pick the best-performing of the 5 models. The selected model is saved to the cloud, and the corresponding cluster-model mapping is stored in the prediction_schema file, which is also saved to the cloud for use by the prediction process.
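
The selection step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the repository's Model Builder: the README does not name the 5 classifiers or their parameter grids, so the two candidates and grids below are placeholders, the hold-out split is an assumption, and `select_best_model` is a hypothetical name. Synthetic data stands in for one cluster of wafer records.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Placeholder candidates and grids (assumption: the real project uses 5 models).
CANDIDATES = {
    "logistic_regression": (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0]}),
    "random_forest": (
        RandomForestClassifier(random_state=0),
        {"n_estimators": [20, 50], "max_depth": [3, None]},
    ),
}


def select_best_model(X, y, candidates=None, cv=3):
    """Tune each candidate with GridSearchCV, then rank by hold-out ROC-AUC."""
    candidates = CANDIDATES if candidates is None else candidates
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=0
    )
    results = {}
    for name, (estimator, grid) in candidates.items():
        search = GridSearchCV(estimator, grid, scoring="roc_auc", cv=cv)
        # refit=True (the default) retrains on the best parameters found.
        search.fit(X_train, y_train)
        # Column 1 of predict_proba is the probability of the larger label (1).
        scores = search.best_estimator_.predict_proba(X_test)[:, 1]
        results[name] = {
            "roc_auc": roc_auc_score(y_test, scores),
            "model": search.best_estimator_,
            "params": search.best_params_,
        }
    best_name = max(results, key=lambda n: results[n]["roc_auc"])
    return best_name, results


# Synthetic stand-in for one cluster of wafer data.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)
y = np.where(y == 1, 1, -1)  # relabel to the -1 / 1 convention used here
best_name, results = select_best_model(X, y)
print(best_name, results[best_name]["params"])
```

In the pipeline described above, this selection would run once per cluster, and the winning model for each cluster would be the one recorded in the cluster-model mapping.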

Prediction Pipeline -
Because data is received quite frequently, an automated pipeline triggers the prediction process as soon as a new file arrives in the cloud storage.
The automated pipeline fetches the file from the cloud and validates it against certain criteria. If the file meets all the acceptance criteria, its data is inserted into the database. For predictions, the data is fetched from the database as a single dataframe containing all the data and preprocessed, and the cluster for the data is predicted. The model corresponding to the predicted cluster is then fetched from the cloud and used to make predictions, which are stored in the database with the wafer id to identify the wafer sensor.
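
The cluster-to-model routing described above can be sketched as follows. This is a simplified illustration, not the code in `prediction.py`: `predict_by_cluster`, the toy `assign_cluster` rule, the model names, and the stub models are all hypothetical stand-ins for the trained clustering model, the prediction_schema mapping, and the models stored in the cloud.

```python
from collections import defaultdict


def predict_by_cluster(records, assign_cluster, cluster_to_model, load_model):
    """Route each (wafer_id, features) record to the model for its cluster."""
    grouped = defaultdict(list)
    for wafer_id, features in records:
        grouped[assign_cluster(features)].append((wafer_id, features))

    predictions = {}
    for cluster, rows in grouped.items():
        # Load each cluster's model once, then reuse it for all its records.
        model = load_model(cluster_to_model[cluster])
        for wafer_id, features in rows:
            predictions[wafer_id] = model(features)
    return predictions


# Toy stand-ins (assumptions) for the clustering step and the stored models.
def assign_cluster(features):
    return 0 if sum(features) < 1.0 else 1


cluster_to_model = {0: "model_cluster_0", 1: "model_cluster_1"}
stub_models = {
    "model_cluster_0": lambda features: -1,  # always predicts "good"
    "model_cluster_1": lambda features: 1,   # always predicts "faulty"
}

records = [
    ("wafer_001", [0.1, 0.2]),
    ("wafer_002", [0.9, 0.8]),
    ("wafer_003", [0.3, 0.1]),
]
predictions = predict_by_cluster(
    records, assign_cluster, cluster_to_model, stub_models.__getitem__
)
print(predictions)
```

Keying the result by wafer id mirrors the description above, where each stored prediction carries the wafer id so it can be traced back to its wafer sensor.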

Frontend -
A dashboard provides the functionality to start the training process, manually start the prediction process, view the trained model statistics, and check the training, prediction, and file_validation logs.

Repository Structure

main
└─── .github
|     └─── workflows
|           |── ci-cd.yaml
|
└─── config
|     |── prediction_config.json
|     |── training_config.json
|
└─── src
|     |──  __init__.py
|     |── cloud_connect.py
|     |── create_clusters.py
|     |── custom_exceptions.py
|     |── custom_logger.py
|     |── db_connect.py
|     |── models_utils.py
|     |── prediction.py
|     |── prediction_preprocessor.py
|     |── prediction_validation.py
|     |── prepare_data.py
|     |── prepare_prediction_data.py
|     |── preprocessor.py
|     |── training.py
|     |── validator.py
|
└─── webapp
|     |
|     └─── static
|     |     └─── css 
|     |     └─── script
|     |
|     └─── templates
|           |── 404.html
|           |── index.html
|           |── logs.html
|           |── metrics.html
|           |── prediction_completed.html
|           |── training_completed.html
|
|── Procfile
|── README.md
|── app.py
|── requirements.txt

About

The aim is to detect a fault in a wafer sensor by examining the data the sensor generates and classifying each wafer as Good (-1) or Faulty/Bad (1). Both training and prediction run as automated processes. During training, the program automatically selects the best-performing of 5 different classification algorithms using GridSearchCV and the ROC-AUC score.

Languages

Python 82.2%, HTML 17.8%