KrishArul26 / Defect-detection-in-Air-Pressure-system-APS-of-a-truck-using-ML

The dataset consists of data collected from heavy Scania trucks in everyday usage. The system in focus is the Air Pressure system (APS), which generates pressurised air that is utilized in various functions in a truck, such as braking and gear changes. The dataset’s positive class consists of component failures for a specific component of the APS system. The negative class consists of trucks with failures for components not related to the APS. So, I created a model that can detect whether the APS is going to fail or not.


Defect-detection-in-Air-Pressure-system-APS-of-a-truck-using-ML with Minimum Cost

Introduction

The dataset consists of data collected from heavy Scania trucks in everyday usage. The system in focus is the Air Pressure system (APS), which generates pressurised air that is utilized in various functions in a truck, such as braking and gear changes. The dataset’s positive class consists of component failures for a specific component of the APS system. The negative class consists of trucks with failures for components not related to the APS. So, I created a model that can detect whether the APS is going to fail or not. Further, the attribute names of the data have been anonymized for proprietary reasons.

Keywords: Missing Values, PCA, SMOTE, KNN, SVM, Random Forest, Hyperparameter Tuning, Pickle, F1-Score and AUC values.

Cost metric of misclassification (the challenge metric)

| Predicted \ True | Positive | Negative |
|------------------|----------|----------|
| Positive         | 0        | Cost_1   |
| Negative         | Cost_2   | 0        |

Cost_1 = 10 and Cost_2 = 500

  • The total cost of a prediction model is the sum of 'Cost_1' multiplied by the number of instances with a Type 1 failure (false positives) and 'Cost_2' multiplied by the number of instances with a Type 2 failure (false negatives), resulting in a 'Total_cost'.

  • In this case, Cost_1 refers to the cost of an unnecessary check performed by a mechanic at a workshop, while Cost_2 refers to the cost of missing a faulty truck, which may cause a breakdown.

Total_cost = Cost_1 × No_Instances(Type 1) + Cost_2 × No_Instances(Type 2)

  • The training set contains 60,000 examples in total, of which 59,000 belong to the negative class and 1,000 to the positive class. The test set contains 16,000 examples.

  • Number of Attributes: 171

  • Attribute Information: The attribute names of the data have been anonymized for proprietary reasons. The data consists of both single numerical counters and histograms consisting of bins with different conditions. Typically the histograms have open-ended conditions at each end. For example, if we are measuring the ambient temperature 'T', then the histogram could be defined with 4 bins where:

        •	bin 1 collects values for temperature T < -20

        •	bin 2 collects values for temperature T >= -20 and T < 0

        •	bin 3 collects values for temperature T >= 0 and T < 20

        •	bin 4 collects values for temperature T >= 20
    
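The challenge metric above can be sketched as a small function (the costs come from the description; the function name is just illustrative):

```python
# Misclassification costs from the challenge description.
COST_FP = 10    # Cost_1: unnecessary workshop check (false positive)
COST_FN = 500   # Cost_2: missed faulty truck (false negative)

def total_cost(n_false_negatives: int, n_false_positives: int) -> int:
    """Total_cost = Cost_1 * #FP + Cost_2 * #FN."""
    return COST_FP * n_false_positives + COST_FN * n_false_negatives

# Example: a model with 77 false negatives and 20 false positives.
print(total_cost(77, 20))  # 38700
```

Because a false negative is 50 times more expensive than a false positive, the model should be tuned to keep false negatives low, even at the price of some extra workshop checks.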

Files Descriptions

  1. Data Preprocessing Folder: This folder contains the preprocessing file, which provides methods to remove unnecessary columns, separate the label from the features, replace invalid values with Null, find the null values present in the dataset, encode all the categorical values in the training set to numeric values, impute the missing values, scale the numerical values, reduce dimensionality using PCA, find the columns that have a standard deviation of zero, and handle the imbalance in the dependent variable using SMOTE.

  2. EDA: This folder only contains the Jupyter Notebook and the sample data files.

  3. Model_Train Folder: This folder only contains model_training.py and tuner.py files.

a. tuner.py: This file contains the code for initializing the SVM and KNN parameters in order to get the best prediction with good accuracy using GridSearchCV. Finally, it returns the best model.

b. model_training.py: Contains the code to train the model and save it in the working directory.

  4. Models: Contains only the KNN model that we trained.

  5. Files_models.py: This file contains the code to save the trained model and load it back.

  6. predictionFrom_Model.py: This file is only for predicting on unseen data.

  7. Main.py: For predicting whether the truck’s APS is going to fail or not on unseen data.

Data Preprocessing

Data Preprocessing of the raw data was done in Google Colab. For the EDA, visit here.

           1.	Remove Unnecessary Columns

           2.	Separate Label Feature

           3.	Replace Invalid Values With Null

           4.	Finding The Null Values Present In The Dataset

           5.	Encodes All The Categorical Values In The Training Set To Numeric Values

           6.	Missing Values Replace With Mean

           7.	Scales The Numerical Values Using StandardScaler

           8.	Dimensionality Reduction Using  PCA

           9.	Remove Columns Which Have A Standard Deviation Of Zero

           10.	Balance The Dependent Variables Using SMOTE 

Model Selection

We are using two algorithms, "SVM" and "KNN". The data is trained on both algorithms, with the hyperparameters for each algorithm derived from GridSearchCV. We calculate the AUC scores for both models and select the model with the best score.

Prediction

  1. Data Export from Db - The data in the stored database is exported as a CSV file to be used for prediction.

  2. Data Preprocessing

    a) Replace the invalid values with numpy "nan" so we can use imputer on such values.

    b) Check for null values in the columns. If present, impute the null values using the mean.

  3. Prediction - The respective model is loaded and is used to predict the data.

  4. Once the prediction is made, the predictions along with the original class names (before label encoding) are saved in a CSV file at a given location, and the location is returned to the working directory.

Result of KNN Model

Confusion Matrix of KNN

Precision - Recall Curve For KNN

ROC Curve For KNN

Result of SVM Model

Confusion Matrix of SVM

Precision - Recall Curve For SVM

ROC Curve For SVM

Result of RandomForest Model

Confusion Matrix of RandomForest

Precision - Recall Curve For RandomForest

ROC Curve For RandomForest

Cost for Prediction with False Negatives (FN) and False Positives (FP)

| Model         | FN | FP | Total Cost |
|---------------|----|----|------------|
| KNN           | 77 | 20 | 38,700     |
| SVM           | 35 | 38 | 17,780     |
| Random Forest | 23 | 45 | 11,950     |

Conclusion

Among KNN, SVM and Random Forest, the **Random Forest model** provides the least cost for prediction, meaning it has the highest F1 score and AUC values. Since the total cost is equal to 500 × FN + 10 × FP, achieving a minimal cost requires keeping FN low. This can be achieved by choosing a proper precision-recall tradeoff.


