Many social programs struggle to make sure the right people receive enough aid. Targeting is especially hard for programs that focus on the poorest segment of the population, because these households often cannot provide the income and expense records needed to prove that they qualify.
In Latin America, a popular method called the Proxy Means Test (PMT) uses an algorithm to verify income qualification. With PMT, agencies use a model that considers a family's observable household attributes, such as the material of their walls and ceiling or the assets found in their home, to classify them and predict their level of need. While this is an improvement, accuracy remains a problem as the region's population grows and poverty declines.
The Inter-American Development Bank (IDB) believes that new methods beyond traditional econometrics, trained on a dataset of Costa Rican household characteristics, might help improve PMT's performance.
'Other' - households that missed rent payments.
1 - Extreme poverty
2 - Moderate poverty
3 - Vulnerable households
4 - Non-vulnerable households
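For reference, the four poverty levels above can be kept as a simple lookup in the notebook. This is an illustrative sketch; the mapping names and the `describe_target` helper are not part of the original dataset:

```python
# Illustrative mapping of the four target codes to readable labels
TARGET_LABELS = {
    1: "Extreme poverty",
    2: "Moderate poverty",
    3: "Vulnerable households",
    4: "Non-vulnerable households",
}

def describe_target(code):
    """Return the human-readable poverty level for a target code."""
    return TARGET_LABELS.get(code, "Unknown")

print(describe_target(1))  # Extreme poverty
```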
- Data Manipulation
- Machine Learning
- Visualization
- EDA and Data Analysis
- Fitting the model; Hyperparameter Tuning
- Cross-validation and Model Evaluation
| ML Tool | Implementation |
|---|---|
| Library | scikit-learn |
| Classifier | RandomForestClassifier |
| Hyperparameter Tuning | GridSearchCV |
| Metrics | Confusion Matrix |
To run this project, you will need to import the following modules in your .ipynb file:
sklearn.ensemble
sklearn.model_selection
sklearn.preprocessing
sklearn.pipeline
pandas
numpy
matplotlib.pyplot
seaborn
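Assuming the modules listed above, the top of the notebook might look like this (the specific classes imported from `preprocessing` and `pipeline` are an assumption, since the README does not show them in use):

```python
# Core data stack
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# scikit-learn components used in this project
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
```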
This project uses Python 3.9+
Install the required libraries
pip install -U scikit-learn
pip install pandas
pip install numpy
To run tests, run the following command
python -m pytest
Clone the project
git clone https://github.com/JohnTan38/project-income-qualification.git
Go to the project directory
cd project-income-qualification
Install libraries and dependencies
pip install -r requirements.txt
Start jupyter notebook
jupyter notebook
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV

# Stratified split preserves the balance of the four poverty levels
X_train, X_test, Y_train, Y_test = train_test_split(
    X_data_1, Y_data, test_size=0.25, stratify=Y_data, random_state=10)

rfc = RandomForestClassifier(random_state=10)

# Tune the forest over a grid of candidate hyperparameters
param_grid = {'n_estimators': [100, 200], 'max_depth': [None, 10, 20]}
grid = GridSearchCV(rfc, param_grid, cv=5)
grid.fit(X_train, Y_train)

RFC = grid.best_estimator_
pred = RFC.predict(X_test)
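The tooling table above lists the confusion matrix as the evaluation metric. A self-contained sketch of that evaluation step, using a synthetic four-class dataset as a stand-in for the household data (which is not included here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the household features and 4-level target
X, y = make_classification(n_samples=500, n_features=10, n_informative=6,
                           n_classes=4, random_state=10)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=10)

model = RandomForestClassifier(random_state=10).fit(X_train, y_train)
pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, pred)
print(cm)
print(classification_report(y_test, pred))
```

Reading the per-class rows of the matrix is particularly useful here, since misclassifying an extreme-poverty household as non-vulnerable is a costlier error than the reverse.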
When collecting data to build a training set for Machine Learning solutions, it is important to understand the breadth and depth of the data available to you. In the Duke-Margolis research, they call out that "...there are geographic biases to much of the data used to train AI. If the tools being built are deployed in more rural or varied regional populations, the representation of the data may not overlay in the same way and can lead to unexpected outcomes based on biased machine-learning data sets. The population structure of the source data can also be weighted based on who is included or excluded."
Missing feature values: If your data set has one or more features with missing values for a large number of examples, that could be an indicator that certain key characteristics of your data set are under-represented.
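One quick way to surface such gaps is to compute the fraction of missing values per feature with pandas. A minimal sketch with illustrative column names (not the actual dataset's columns):

```python
import pandas as pd

# Toy frame standing in for the household data
df = pd.DataFrame({
    "wall_material": ["brick", "wood", None, "brick"],
    "monthly_rent": [250.0, None, None, 180.0],
    "rooms": [3, 2, 4, 1],
})

# Fraction of missing values per feature, highest first
missing = df.isna().mean().sort_values(ascending=False)
print(missing)

# Flag features missing in more than a quarter of examples
flagged = missing[missing > 0.25].index.tolist()
print(flagged)  # ['monthly_rent']
```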
Question the preconceptions: A machine learning model learns from historical decisions and their intent, where the intent is known. At every stage of training data preparation, it is important to question where the data is coming from, whose perceptions affected earlier decisions, and what changes need to be made in the data accordingly to clean it for training purposes.
Continuous development / testing: An algorithm that works for one data set will not necessarily work on an extended version of the same data. It can stay reliable only if we keep testing the system against challenger models and verify its predictive accuracy, transparency, and rate of improvement.
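A minimal version of this champion-versus-challenger check is to score both models under the same cross-validation split. Synthetic data is used here as a placeholder for the household features, and the choice of logistic regression as the challenger is illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder data; the real pipeline would use the household features
X, y = make_classification(n_samples=400, n_features=8, random_state=10)

champion = RandomForestClassifier(random_state=10)
challenger = LogisticRegression(max_iter=1000)

# Scoring both models on the same 5-fold split keeps the comparison fair
champ_scores = cross_val_score(champion, X, y, cv=5)
chall_scores = cross_val_score(challenger, X, y, cv=5)

print(f"champion   mean accuracy: {champ_scores.mean():.3f}")
print(f"challenger mean accuracy: {chall_scores.mean():.3f}")
```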
- To conduct further research in Proxy Means Testing using multivariate regression to correlate proxies such as assets and household characteristics with poverty and income
Contributions of any kind are always welcome! This project follows the [all-contributors](https://github.com/all-contributors/all-contributors) specification.
Please read the Code of Conduct
Distributed under the MIT License. See LICENSE.txt for more information.
Email ✉️ - vieming@gmail.com
Project Link 🌐 https://github.com/JohnTan38/project-income-qualification.git
Useful resources and libraries used in this project.