adiag321 / Adult-Census-Income-Prediction

In this project, we predict the income of an adult based on the 1994 Census Bureau database.

Adult's Income Prediction

This data was extracted from the 1994 Census Bureau database by Ronny Kohavi and Barry Becker (Data Mining and Visualization, Silicon Graphics). A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)).
The prediction task is to determine whether a person makes over $50K a year.
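
As a minimal sketch of the extraction conditions and the prediction target described above, the filter and the binary label could be written as follows. The field names AAGE, AGI, AFNLWGT and HRSWK come from the conditions quoted above; the sample records and the `">50K"` / `"<=50K"` label strings are assumptions made for illustration, not taken from the project's notebook.

```python
def is_clean_record(rec):
    """Return True if a raw census record passes the extraction conditions."""
    return (
        rec["AAGE"] > 16
        and rec["AGI"] > 100
        and rec["AFNLWGT"] > 1
        and rec["HRSWK"] > 0
    )


def income_label(income):
    """Map the income string to a binary target: 1 for over $50K, else 0."""
    # Assumption: the income column uses the strings ">50K" and "<=50K".
    return 1 if income.strip() == ">50K" else 0


# Hypothetical raw records, for illustration only.
raw = [
    {"AAGE": 39, "AGI": 40000, "AFNLWGT": 77516, "HRSWK": 40},
    {"AAGE": 15, "AGI": 5000, "AFNLWGT": 1200, "HRSWK": 10},   # fails AAGE > 16
    {"AAGE": 52, "AGI": 90000, "AFNLWGT": 20000, "HRSWK": 0},  # fails HRSWK > 0
]
clean = [r for r in raw if is_clean_record(r)]
```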

DATA DEFINITION

The Adult's Income dataset was obtained from Kaggle: https://www.kaggle.com/uciml/adult-census-income.
This dataset has 48842 rows × 15 columns: the 14 attributes listed below plus the income label.

  1. age: continuous.
  2. workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
  3. fnlwgt: continuous.
  4. education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
  5. education-num: continuous.
  6. marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
  7. occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
  8. relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
  9. race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
  10. sex: Female, Male.
  11. capital-gain: continuous.
  12. capital-loss: continuous.
  13. hours-per-week: continuous.
  14. native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

METHODS

In this project, we have employed the following steps:

  1. Data Collection
  2. Data Cleaning
  3. Data Analysis
  4. Data Preprocessing (Preparation)
  5. Model Training
  6. Performance Evaluation
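
The cleaning and preprocessing steps above are not spelled out in this README, so the following is only a sketch of two common choices for this kind of data: filling missing categorical values with the most frequent value, and one-hot encoding a categorical column. The `"?"` missing-value marker, the helper names and the sample rows are assumptions for illustration; the notebook may handle these steps differently.

```python
from collections import Counter

MISSING = "?"  # assumption: missing categorical values are encoded as "?"


def fill_missing_with_mode(rows, column):
    """Replace missing markers in `column` with the most frequent known value."""
    known = [r[column] for r in rows if r[column] != MISSING]
    mode = Counter(known).most_common(1)[0][0]
    return [{**r, column: mode if r[column] == MISSING else r[column]} for r in rows]


def one_hot(rows, column):
    """Expand a categorical column into 0/1 indicator columns."""
    categories = sorted({r[column] for r in rows})
    encoded = []
    for r in rows:
        new_row = {k: v for k, v in r.items() if k != column}
        for cat in categories:
            new_row[f"{column}={cat}"] = 1 if r[column] == cat else 0
        encoded.append(new_row)
    return encoded


# Hypothetical rows using two of the attributes listed in the data definition.
rows = [
    {"age": 39, "workclass": "State-gov"},
    {"age": 50, "workclass": "Private"},
    {"age": 38, "workclass": "?"},
    {"age": 53, "workclass": "Private"},
]
cleaned = fill_missing_with_mode(rows, "workclass")
prepared = one_hot(cleaned, "workclass")
```

One-hot encoding keeps models such as logistic regression and SVM from reading an artificial order into category codes, at the cost of more columns.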

We have implemented the following models:

  1. SVM
  2. Logistic Regression
  3. Naive Bayes

PERFORMANCE EVALUATION

For each model, we built a confusion matrix and used its true positive and true negative counts to compute Precision and Recall.
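
To make the relationship between the confusion matrix and these metrics concrete, here is a small sketch in plain Python. The function names and the example labels are illustrative assumptions; they are not the project's own evaluation code.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def precision_recall_accuracy(tp, fp, tn, fn):
    """Derive precision, recall and accuracy from the four confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return precision, recall, accuracy


# Hypothetical labels: 1 means income over $50K, 0 means $50K or less.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
precision, recall, accuracy = precision_recall_accuracy(tp, fp, tn, fn)
```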

RESULTS

  1. Gender Distribution based on Income

[Figure: gender]

  2. Education Count

[Figure: Education_Count]

  3. Age Vs Marital Status

[Figure: AgeVsMarital_Status]

MODEL EVALUATION

The table below lists the accuracy and the time taken to run each model:

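| Model               | Resampling   | Accuracy | Time (s) |
|---------------------|--------------|----------|----------|
| Logistic Regression | Oversampled  | 83.2%    | 824.2    |
| Logistic Regression | Undersampled | 82.5%    | 102.9    |
| Logistic Regression | SMOTE        | 83.7%    | 848      |
| SVM                 | Oversampled  | 83%      | 303.9    |
| SVM                 | Undersampled | 81%      | 22.89    |
| SVM                 | SMOTE        | 84%      | 300.4    |
| Naive Bayes         | Oversampled  | 76%      | 23.41    |
| Naive Bayes         | Undersampled | 76%      | 3.69     |
| Naive Bayes         | SMOTE        | 85%      | 23.67    |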
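
Each model above was run on oversampled, undersampled and SMOTE-resampled data. As a rough sketch of the first two strategies, random oversampling and random undersampling can be written in plain Python as below. The function names, the toy rows and the assumption that the over-$50K class is the minority are illustrative; the notebook may use a library implementation instead. SMOTE is not shown: rather than duplicating rows, it creates synthetic minority samples by interpolating between neighbouring minority points.

```python
import random


def random_oversample(majority, minority, rng):
    """Duplicate minority rows (with replacement) until the classes are balanced."""
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra


def random_undersample(majority, minority, rng):
    """Keep a random subset of majority rows the same size as the minority class."""
    return rng.sample(majority, len(minority)) + minority


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical toy data: label 0 for income <=50K, label 1 for income >50K.
majority = [{"id": i, "label": 0} for i in range(6)]
minority = [{"id": i, "label": 1} for i in range(6, 8)]

over = random_oversample(majority, minority, rng)
under = random_undersample(majority, minority, rng)
```

Undersampling trains on far fewer rows, which is consistent with the shorter run times in the table; oversampling and SMOTE keep every majority row and enlarge the training set.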

CONCLUSION

  1. According to the correlation table, only a few of the attributes are strongly correlated with income.
  2. The choice of columns has a big impact on the model's accuracy. Random Forest is also reported as the best fit for this dataset, although its results are not listed in the table above. Predictions for males are incorrect more often than predictions for females.
  3. Real datasets will not always be clean or small, so we need a defined workflow that can be applied to any dataset.
  4. This project gives a step-by-step walkthrough of that workflow, which you can apply to your own dataset.
  5. Different types of visualizations portray the data effectively depending on the combination of attribute types (numerical, categorical), and the choice of classification technique depends entirely on the dataset.

Languages

Language:Jupyter Notebook 100.0%