This repository contains a loan status prediction pipeline built using TensorFlow Extended (TFX), Kubeflow, and Airflow.
The loan status prediction pipeline aims to predict whether a loan applicant will default, based on features such as credit score, income, and loan amount. The pipeline consists of the following main components:
- Data Ingestion: Raw data is ingested from various sources and stored in a centralized data store (e.g., Google Cloud Storage, Amazon S3).
- Data Validation and Transformation: Data is validated for inconsistencies and transformed into a format suitable for model training.
- Model Training: Several machine learning models are trained using the preprocessed data.
- Model Evaluation: The trained models are evaluated on a held-out validation dataset to assess their performance.
- Model Deployment: The best-performing model is deployed to a serving infrastructure to make predictions.
TFX provides several components that are used in the pipeline:
- ExampleGen: Responsible for ingesting and splitting the input data into training and evaluation sets.
- StatisticsGen: Computes statistics over the dataset for visualization and data analysis.
- SchemaGen: Generates a schema based on the computed statistics to define the expected data format.
- ExampleValidator: Validates the data against the generated schema to detect anomalies.
- Transform: Performs feature engineering and preprocessing on the dataset.
- Trainer: Trains machine learning models on the preprocessed data.
- Evaluator: Evaluates the trained models using a held-out evaluation dataset.
- Pusher: Deploys the best-performing model to a serving infrastructure.
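The components above are wired together in a TFX pipeline definition. The following is a minimal sketch of that wiring; the paths, module file, and pipeline name are hypothetical and may differ from the actual code in the `tfx` directory:

```python
from tfx import v1 as tfx

# Hypothetical paths -- adjust to this repository's layout.
DATA_ROOT = "data/loan-data"
MODULE_FILE = "loan_utils.py"  # assumed to contain preprocessing_fn and run_fn
SERVING_DIR = "serving_model/loan-model"

example_gen = tfx.components.CsvExampleGen(input_base=DATA_ROOT)
statistics_gen = tfx.components.StatisticsGen(
    examples=example_gen.outputs["examples"])
schema_gen = tfx.components.SchemaGen(
    statistics=statistics_gen.outputs["statistics"])
example_validator = tfx.components.ExampleValidator(
    statistics=statistics_gen.outputs["statistics"],
    schema=schema_gen.outputs["schema"])
transform = tfx.components.Transform(
    examples=example_gen.outputs["examples"],
    schema=schema_gen.outputs["schema"],
    module_file=MODULE_FILE)
trainer = tfx.components.Trainer(
    module_file=MODULE_FILE,
    examples=transform.outputs["transformed_examples"],
    transform_graph=transform.outputs["transform_graph"],
    schema=schema_gen.outputs["schema"])
evaluator = tfx.components.Evaluator(
    examples=example_gen.outputs["examples"],
    model=trainer.outputs["model"])
pusher = tfx.components.Pusher(
    model=trainer.outputs["model"],
    model_blessing=evaluator.outputs["blessing"],
    push_destination=tfx.proto.PushDestination(
        filesystem=tfx.proto.PushDestination.Filesystem(
            base_directory=SERVING_DIR)))

pipeline = tfx.dsl.Pipeline(
    pipeline_name="loan-status-pipeline",
    pipeline_root="tfx/pipeline_root",
    components=[example_gen, statistics_gen, schema_gen, example_validator,
                transform, trainer, evaluator, pusher],
)
```

Each component consumes the outputs of its upstream components, so the DAG ordering follows directly from the output references.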
```shell
C:.
├───.idea
│ └───inspectionProfiles
├───.ipynb_checkpoints
├───airflow
│ ├───data
│ │ └───loan-data
│ ├───logs
│ │ └───dag_id=Loan-Airflow-train
│ │ ├───run_id=manual__2024-03-25T213610.489433+0000
│ │ │ ├───task_id=preprocess
│ │ │ ├───task_id=train
│ │ │ └───task_id=update
│ │ ├───run_id=manual__2024-03-25T213611.693321+0000
│ │ │ ├───task_id=preprocess
│ │ │ ├───task_id=train
│ │ │ └───task_id=update
│ │ ├───run_id=manual__2024-03-25T213612.553556+0000
│ │ │ ├───task_id=preprocess
│ │ │ ├───task_id=train
│ │ │ └───task_id=update
│ │ ├───run_id=manual__2024-03-25T213613.347592+0000
│ │ │ ├───task_id=preprocess
│ │ │ ├───task_id=train
│ │ │ └───task_id=update
│ │ └───run_id=manual__2024-03-25T213614.109691+0000
│ │ ├───task_id=preprocess
│ │ ├───task_id=train
│ │ └───task_id=update
│ └───storage
├───data
├───images
├───keras_tuner
│ └───tf-loan-model
│ ├───trial_0
│ ├───trial_1
│ ├───trial_2
│ ├───trial_3
│ └───trial_4
├───notebooks
│ └───.ipynb_checkpoints
├───packaging-ml-model
│ ├───prediction_model
│ │ ├───config
│ │ │ └───__pycache__
│ │ ├───datasets
│ │ ├───processing
│ │ │ └───__pycache__
│ │ ├───trained_models
│ │ └───__pycache__
│ └───tests
└───tfx
```
The project also includes a classical machine learning pipeline for predicting loan status, comprising preprocessing steps, feature engineering, and a logistic regression classifier.
We have implemented several custom transformers for preprocessing the data:
- Fills missing values with the mean of the column.
- Fills missing values with the mode of the column.
- Drops specified columns from the DataFrame.
- Modifies domain-specific features by adding values from another column.
- Encodes categorical variables with custom logic.
- Applies a logarithmic transformation to specified columns.
- Performs custom transformations on specified columns, such as replacing '3+' with '3'.
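As a concrete illustration, a mean-imputation transformer in the style described above might look like the following. Class, column, and variable names here are hypothetical, not necessarily those used in `packaging-ml-model`:

```python
import numpy as np
import pandas as pd

class MeanImputer:
    """Fill missing values in the given columns with each column's mean."""

    def __init__(self, variables):
        self.variables = variables

    def fit(self, X, y=None):
        # Learn the per-column means from the training data.
        self.means_ = {col: X[col].mean() for col in self.variables}
        return self

    def transform(self, X):
        X = X.copy()
        for col in self.variables:
            X[col] = X[col].fillna(self.means_[col])
        return X

df = pd.DataFrame({"LoanAmount": [100.0, np.nan, 200.0],
                   "Dependents": ["0", "3+", "1"]})
imputed = MeanImputer(variables=["LoanAmount"]).fit(df).transform(df)
# The '3+' -> '3' replacement described above is a one-liner with pandas:
imputed["Dependents"] = imputed["Dependents"].str.replace("3+", "3", regex=False)
print(imputed["LoanAmount"].tolist())  # [100.0, 150.0, 200.0]
print(imputed["Dependents"].tolist())  # ['0', '3', '1']
```

Implementing `fit` and `transform` with this signature lets the class drop into a scikit-learn `Pipeline` alongside the other transformers.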
The pipeline is configured with the following steps:
- Domain processing
- Mean imputation for numerical features
- Mode imputation for categorical features
- Dropping unnecessary features
- Custom column transformation
- Label encoding for categorical features
- Log transformation for numerical features
- Min-Max scaling
- Logistic regression classification
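The chain of steps above can be sketched with scikit-learn's `Pipeline`. This sketch substitutes built-in transformers for brevity; the repository's actual pipeline uses the custom transformers described earlier, and the step names and sample data are illustrative:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression

loan_pipe = Pipeline([
    # The custom domain-processing, imputation, and encoding transformers
    # would slot in here; built-in equivalents stand in for them.
    ("mean_imputation", SimpleImputer(strategy="mean")),
    ("minmax_scaling", MinMaxScaler()),
    ("classifier", LogisticRegression()),
])

# Tiny synthetic example: columns are [credit_score, income].
X = np.array([[600.0, 3000.0],
              [700.0, np.nan],
              [550.0, 2000.0],
              [720.0, 8000.0]])
y = np.array([0, 1, 0, 1])

loan_pipe.fit(X, y)
preds = loan_pipe.predict(X)
print(preds)
```

Because every step lives in one `Pipeline` object, the imputation statistics and scaling ranges learned during `fit` are reapplied consistently at prediction time.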
The `perform_training` function is responsible for training the model using the pipeline and saving the trained model to disk.
Utility functions are provided for loading datasets and saving/loading the serialized model.
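Model serialization of this kind is commonly done with `joblib`. A sketch of such save/load utilities follows; the function, file, and directory names are hypothetical, chosen to mirror the `trained_models` directory in the tree above:

```python
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

MODEL_DIR = Path("trained_models")  # hypothetical output directory

def save_model(model, name="loan_classifier.pkl"):
    """Serialize a fitted model to disk and return its path."""
    MODEL_DIR.mkdir(exist_ok=True)
    path = MODEL_DIR / name
    joblib.dump(model, path)
    return path

def load_model(path):
    """Deserialize a previously saved model."""
    return joblib.load(path)

# Round-trip a trivially fitted model to check the utilities.
model = LogisticRegression().fit(np.array([[0.0], [1.0]]), np.array([0, 1]))
saved_path = save_model(model)
restored = load_model(saved_path)
print(restored.predict(np.array([[0.0], [1.0]])))  # [0 1]
```

Saving the entire fitted pipeline, rather than only the classifier, ensures the preprocessing learned at training time travels with the model.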
To train the model, navigate to the model's directory and run the `training_pipeline.py` script:

```shell
python training_pipeline.py
```
This project is licensed under the MIT License - see the LICENSE file for details.