This repository contains a loan status prediction pipeline built using TensorFlow Extended (TFX), Kubeflow, and Airflow.
The loan status prediction pipeline aims to predict whether a loan applicant will default, based on features such as credit score, income, and loan amount. The pipeline consists of the following main components:
- Data Ingestion: Raw data is ingested from various sources and stored in a centralized data store (e.g., Google Cloud Storage, Amazon S3).
- Data Validation and Transformation: Data is validated for inconsistencies and transformed into a format suitable for model training.
- Model Training: Several machine learning models are trained using the preprocessed data.
- Model Evaluation: The trained models are evaluated on a held-out validation dataset to assess their performance.
- Model Deployment: The best-performing model is deployed to a serving infrastructure to make predictions.
TFX provides several components that are used in the pipeline:
- ExampleGen: Responsible for ingesting and splitting the input data into training and evaluation sets.
- StatisticsGen: Computes statistics over the dataset for visualization and data analysis.
- SchemaGen: Generates a schema based on the computed statistics to define the expected data format.
- ExampleValidator: Validates the data against the generated schema to detect anomalies.
- Transform: Performs feature engineering and preprocessing on the dataset.
- Trainer: Trains machine learning models on the preprocessed data.
- Evaluator: Evaluates the trained models using a held-out evaluation dataset.
- Pusher: Deploys the best-performing model to a serving infrastructure.
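The components above are wired together in a TFX pipeline definition. The following is a minimal sketch of that wiring; the paths, module file, and pipeline name are hypothetical and may differ from the actual code in the `tfx` directory:

```python
from tfx import v1 as tfx

# Hypothetical paths -- adjust to this repository's layout.
DATA_ROOT = "data/loan-data"
MODULE_FILE = "loan_utils.py"  # assumed to contain preprocessing_fn and run_fn
SERVING_DIR = "serving_model/loan-model"

example_gen = tfx.components.CsvExampleGen(input_base=DATA_ROOT)
statistics_gen = tfx.components.StatisticsGen(
    examples=example_gen.outputs["examples"])
schema_gen = tfx.components.SchemaGen(
    statistics=statistics_gen.outputs["statistics"])
example_validator = tfx.components.ExampleValidator(
    statistics=statistics_gen.outputs["statistics"],
    schema=schema_gen.outputs["schema"])
transform = tfx.components.Transform(
    examples=example_gen.outputs["examples"],
    schema=schema_gen.outputs["schema"],
    module_file=MODULE_FILE)
trainer = tfx.components.Trainer(
    module_file=MODULE_FILE,
    examples=transform.outputs["transformed_examples"],
    transform_graph=transform.outputs["transform_graph"],
    schema=schema_gen.outputs["schema"])
evaluator = tfx.components.Evaluator(
    examples=example_gen.outputs["examples"],
    model=trainer.outputs["model"])
pusher = tfx.components.Pusher(
    model=trainer.outputs["model"],
    model_blessing=evaluator.outputs["blessing"],
    push_destination=tfx.proto.PushDestination(
        filesystem=tfx.proto.PushDestination.Filesystem(
            base_directory=SERVING_DIR)))

pipeline = tfx.dsl.Pipeline(
    pipeline_name="loan-status-pipeline",
    pipeline_root="tfx/pipeline_root",
    components=[example_gen, statistics_gen, schema_gen, example_validator,
                transform, trainer, evaluator, pusher],
)
```

Each component consumes the outputs of its upstream components, so the DAG ordering follows directly from the output references.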
```shell
C:.
├───.idea
│ └───inspectionProfiles
├───.ipynb_checkpoints
├───airflow
│ ├───data
│ │ └───loan-data
│ ├───logs
│ │ └───dag_id=Loan-Airflow-train
│ │ ├───run_id=manual__2024-03-25T213610.489433+0000
│ │ │ ├───task_id=preprocess
│ │ │ ├───task_id=train
│ │ │ └───task_id=update
│ │ ├───run_id=manual__2024-03-25T213611.693321+0000
│ │ │ ├───task_id=preprocess
│ │ │ ├───task_id=train
│ │ │ └───task_id=update
│ │ ├───run_id=manual__2024-03-25T213612.553556+0000
│ │ │ ├───task_id=preprocess
│ │ │ ├───task_id=train
│ │ │ └───task_id=update
│ │ ├───run_id=manual__2024-03-25T213613.347592+0000
│ │ │ ├───task_id=preprocess
│ │ │ ├───task_id=train
│ │ │ └───task_id=update
│ │ └───run_id=manual__2024-03-25T213614.109691+0000
│ │ ├───task_id=preprocess
│ │ ├───task_id=train
│ │ └───task_id=update
│ └───storage
├───data
├───images
├───keras_tuner
│ └───tf-loan-model
│ ├───trial_0
│ ├───trial_1
│ ├───trial_2
│ ├───trial_3
│ └───trial_4
├───notebooks
│ └───.ipynb_checkpoints
├───packaging-ml-model
│ ├───prediction_model
│ │ ├───config
│ │ │ └───__pycache__
│ │ ├───datasets
│ │ ├───processing
│ │ │ └───__pycache__
│ │ ├───trained_models
│ │ └───__pycache__
│ └───tests
└───tfx
```
The project also includes a classical machine learning pipeline for predicting loan status, comprising preprocessing steps, feature engineering, and a logistic regression classifier.
We have implemented several custom transformers for preprocessing the data:
- Fills missing values with the mean of the column.
- Fills missing values with the mode of the column.
- Drops specified columns from the DataFrame.
- Modifies domain-specific features by adding values from another column.
- Encodes categorical variables with custom logic.
- Applies a logarithmic transformation to specified columns.
- Performs custom transformations on specified columns, such as replacing '3+' with '3'.
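As a concrete illustration, a mean-imputation transformer in the style described above might look like the following. Class, column, and variable names here are hypothetical, not necessarily those used in `packaging-ml-model`:

```python
import numpy as np
import pandas as pd

class MeanImputer:
    """Fill missing values in the given columns with each column's mean."""

    def __init__(self, variables):
        self.variables = variables

    def fit(self, X, y=None):
        # Learn the per-column means from the training data.
        self.means_ = {col: X[col].mean() for col in self.variables}
        return self

    def transform(self, X):
        X = X.copy()
        for col in self.variables:
            X[col] = X[col].fillna(self.means_[col])
        return X

df = pd.DataFrame({"LoanAmount": [100.0, np.nan, 200.0],
                   "Dependents": ["0", "3+", "1"]})
imputed = MeanImputer(variables=["LoanAmount"]).fit(df).transform(df)
# The '3+' -> '3' replacement described above is a one-liner with pandas:
imputed["Dependents"] = imputed["Dependents"].str.replace("3+", "3", regex=False)
print(imputed["LoanAmount"].tolist())  # [100.0, 150.0, 200.0]
print(imputed["Dependents"].tolist())  # ['0', '3', '1']
```

Implementing `fit` and `transform` with this signature lets the class drop into a scikit-learn `Pipeline` alongside the other transformers.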
The pipeline is configured with the following steps:
- Domain processing
- Mean imputation for numerical features
- Mode imputation for categorical features
- Dropping unnecessary features
- Custom column transformation
- Label encoding for categorical features
- Log transformation for numerical features
- Min-Max scaling
- Logistic regression classification
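The chain of steps above can be sketched with scikit-learn's `Pipeline`. This sketch substitutes built-in transformers for brevity; the repository's actual pipeline uses the custom transformers described earlier, and the step names and sample data are illustrative:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression

loan_pipe = Pipeline([
    # The custom domain-processing, imputation, and encoding transformers
    # would slot in here; built-in equivalents stand in for them.
    ("mean_imputation", SimpleImputer(strategy="mean")),
    ("minmax_scaling", MinMaxScaler()),
    ("classifier", LogisticRegression()),
])

# Tiny synthetic example: columns are [credit_score, income].
X = np.array([[600.0, 3000.0],
              [700.0, np.nan],
              [550.0, 2000.0],
              [720.0, 8000.0]])
y = np.array([0, 1, 0, 1])

loan_pipe.fit(X, y)
preds = loan_pipe.predict(X)
print(preds)
```

Because every step lives in one `Pipeline` object, the imputation statistics and scaling ranges learned during `fit` are reapplied consistently at prediction time.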
The `perform_training` function is responsible for training the model using the pipeline and saving the trained model to disk.
Utility functions are provided for loading datasets and saving/loading the serialized model.
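Model serialization of this kind is commonly done with `joblib`. A sketch of such save/load utilities follows; the function, file, and directory names are hypothetical, chosen to mirror the `trained_models` directory in the tree above:

```python
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

MODEL_DIR = Path("trained_models")  # hypothetical output directory

def save_model(model, name="loan_classifier.pkl"):
    """Serialize a fitted model to disk and return its path."""
    MODEL_DIR.mkdir(exist_ok=True)
    path = MODEL_DIR / name
    joblib.dump(model, path)
    return path

def load_model(path):
    """Deserialize a previously saved model."""
    return joblib.load(path)

# Round-trip a trivially fitted model to check the utilities.
model = LogisticRegression().fit(np.array([[0.0], [1.0]]), np.array([0, 1]))
saved_path = save_model(model)
restored = load_model(saved_path)
print(restored.predict(np.array([[0.0], [1.0]])))  # [0 1]
```

Saving the entire fitted pipeline, rather than only the classifier, ensures the preprocessing learned at training time travels with the model.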
To train the model, navigate to the model's directory and run the `training_pipeline.py` script:

```shell
python training_pipeline.py
```
This project is licensed under the MIT License - see the LICENSE file for details.