Many social programs struggle to make sure the right people receive enough aid. Targeting is especially hard for programs that focus on the poorest segment of the population, because these households often cannot provide the income and expense records needed to prove that they qualify.
In Latin America, a popular method called the Proxy Means Test (PMT) uses an algorithm to verify income qualification. With PMT, agencies use a model that considers a family's observable household attributes, such as the material of their walls and ceiling or the assets found in their home, to classify them and predict their level of need. While this is an improvement, accuracy remains a problem as the region's population grows and poverty declines.
The Inter-American Development Bank (IDB) believes that new methods beyond traditional econometrics, trained on a dataset of Costa Rican household characteristics, might help improve PMT's performance.
'Other' - households that missed rent payments.
1 - Extreme poverty
2 - Moderate poverty
3 - Vulnerable households
4 - Non-vulnerable households
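For reference, the four poverty levels above can be kept as a simple lookup in the notebook. This is an illustrative sketch; the mapping names and the `describe_target` helper are not part of the original dataset:

```python
# Illustrative mapping of the four target codes to readable labels
TARGET_LABELS = {
    1: "Extreme poverty",
    2: "Moderate poverty",
    3: "Vulnerable households",
    4: "Non-vulnerable households",
}

def describe_target(code):
    """Return the human-readable poverty level for a target code."""
    return TARGET_LABELS.get(code, "Unknown")

print(describe_target(1))  # Extreme poverty
```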
- Data Manipulation
- Machine Learning
- Visualization
- EDA and Data Analysis
- Fitting the model; Hyperparameter Tuning
- Cross-validation and Model Evaluation
| ML Tool | Implementation |
|---|---|
| Library | scikit-learn |
| Classifier | RandomForestClassifier |
| Hyperparameter Tuning | GridSearchCV |
| Metrics | Confusion Matrix |
To run this project, you will need to import the following modules in your .ipynb file:
sklearn.ensemble
sklearn.model_selection
sklearn.preprocessing
sklearn.pipeline
pandas
numpy
matplotlib.pyplot
seaborn
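Assuming the modules listed above, the top of the notebook might look like this (the specific classes imported from `preprocessing` and `pipeline` are an assumption, since the README does not show them in use):

```python
# Core data stack
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# scikit-learn components used in this project
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
```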
This project uses Python 3.9+
Install the required libraries
pip install -U scikit-learn
pip install pandas
pip install numpy
To run tests, run the following command
python -m pytest
Clone the project
git clone https://github.com/JohnTan38/project-income-qualification.git
Go to the project directory
cd project-income-qualification
Install libraries and dependencies
pip install -r requirements.txt
Start jupyter notebook
jupyter notebook
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV

# Stratified split preserves the balance of the four poverty levels
X_train, X_test, Y_train, Y_test = train_test_split(
    X_data_1, Y_data, test_size=0.25, stratify=Y_data, random_state=10)

rfc = RandomForestClassifier(random_state=10)

# Tune the forest over a grid of candidate hyperparameters
param_grid = {'n_estimators': [100, 200], 'max_depth': [None, 10, 20]}
grid = GridSearchCV(rfc, param_grid, cv=5)
grid.fit(X_train, Y_train)

RFC = grid.best_estimator_
pred = RFC.predict(X_test)
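The tooling table above lists the confusion matrix as the evaluation metric. A self-contained sketch of that evaluation step, using a synthetic four-class dataset as a stand-in for the household data (which is not included here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the household features and 4-level target
X, y = make_classification(n_samples=500, n_features=10, n_informative=6,
                           n_classes=4, random_state=10)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=10)

model = RandomForestClassifier(random_state=10).fit(X_train, y_train)
pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, pred)
print(cm)
print(classification_report(y_test, pred))
```

Reading the per-class rows of the matrix is particularly useful here, since misclassifying an extreme-poverty household as non-vulnerable is a costlier error than the reverse.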
When collecting data to build a training set for Machine Learning solutions, it is important to understand the breadth and depth of the data available to you. In the Duke-Margolis research, they call out that "...there are geographic biases to much of the data used to train AI. If the tools being built are deployed in more rural or varied regional populations, the representation of the data may not overlay in the same way and can lead to unexpected outcomes based on biased machine-learning data sets. The population structure of the source data can also be weighted based on who is included or excluded."
Missing feature values: If your data set has one or more features with missing values for a large number of examples, that could be an indicator that certain key characteristics of your data set are under-represented.
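One quick way to surface such gaps is to compute the fraction of missing values per feature with pandas. A minimal sketch with illustrative column names (not the actual dataset's columns):

```python
import pandas as pd

# Toy frame standing in for the household data
df = pd.DataFrame({
    "wall_material": ["brick", "wood", None, "brick"],
    "monthly_rent": [250.0, None, None, 180.0],
    "rooms": [3, 2, 4, 1],
})

# Fraction of missing values per feature, highest first
missing = df.isna().mean().sort_values(ascending=False)
print(missing)

# Flag features missing in more than a quarter of examples
flagged = missing[missing > 0.25].index.tolist()
print(flagged)  # ['monthly_rent']
```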
Question the preconceptions: A machine learning model learns from historical decisions and their intent, where the intent is known. At every stage of training data preparation, it is important to question where the data is coming from, whose perceptions affected earlier decisions, and what changes need to be made in the data accordingly to clean it for training purposes.
Continuous development / testing: An algorithm that works for one data set will not necessarily work on an extended version of the same data. It can stay reliable only if we keep testing the system against challenger models and verify its predictive accuracy, transparency, and rate of improvement.
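A minimal version of this champion-versus-challenger check is to score both models under the same cross-validation split. Synthetic data is used here as a placeholder for the household features, and the choice of logistic regression as the challenger is illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder data; the real pipeline would use the household features
X, y = make_classification(n_samples=400, n_features=8, random_state=10)

champion = RandomForestClassifier(random_state=10)
challenger = LogisticRegression(max_iter=1000)

# Scoring both models on the same 5-fold split keeps the comparison fair
champ_scores = cross_val_score(champion, X, y, cv=5)
chall_scores = cross_val_score(challenger, X, y, cv=5)

print(f"champion   mean accuracy: {champ_scores.mean():.3f}")
print(f"challenger mean accuracy: {chall_scores.mean():.3f}")
```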
- To conduct further research in Proxy Means Testing using multivariate regression to correlate proxies such as assets and household characteristics with poverty and income
Contributions of any kind are always welcome! This project follows the [all-contributors](https://github.com/all-contributors/all-contributors) specification.
Please read the Code of Conduct
Distributed under the MIT License. See LICENSE.txt for more information.
Email ✉️ - vieming@gmail.com
Project Link 🌐 https://github.com/JohnTan38/project-income-qualification.git
Useful resources and libraries used in this project.