leix28 / ballet-predict-census-income

A feature engineering pipeline for income prediction using the Ballet framework

Predict Census Income

This is a collaborative predictive modeling project built on the Ballet framework.

This project contains a feature engineering pipeline and associated models that can be used to predict personal income from raw survey responses to the US Census American Community Survey. The model built from features submitted by the community can then be used to optimize administration of the survey, direct public policy interventions, and assist empirical researchers.

Join the collaboration

Are you interested in joining the collaboration?

Getting started

First, get acquainted with the Ballet framework if you are not yet familiar with it.

  1. Look over the Ballet Contributor Guide
  2. Look over the Ballet Feature Engineering Guide

Once you have done so, you can check out the features that are currently part of this project, in the contributed features directory (src/predict_census_income/features/contrib).

Your task

Your task is to create and submit one feature to the project.

  1. The easiest way to get started is to launch an interactive Jupyter Lab session to hack on this repository. You can read more about this development workflow here.

  2. Alternatively, you can use your preferred tools and development environment to create and submit a feature from your own machine. You can read about the local development workflow here.
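As a rough idea of what a feature might compute, here is a minimal sketch in plain pandas. It does not use the Ballet API; the Feature Engineering Guide describes how a transformation is paired with its input columns and submitted. The column name `WKHP` (usual hours worked per week), the 35-hour cutoff, and the helper `is_full_time` are assumptions made for illustration, not part of this project.

```python
import pandas as pd

# Illustrative cutoff; not defined anywhere in this project.
FULL_TIME_HOURS = 35


def is_full_time(hours: pd.Series) -> pd.Series:
    """Return 1 where usual weekly hours meet the cutoff, else 0."""
    return (hours >= FULL_TIME_HOURS).astype(int)


# Tiny made-up sample; "WKHP" is an assumed column name.
sample = pd.DataFrame({"WKHP": [40, 20, 35]})
sample["full_time"] = is_full_time(sample["WKHP"])
print(sample)
```

A real contribution would express the same logic as a Ballet feature over the project's raw columns.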

Dataset

Input data

The input data is the raw survey responses to the 2018 US Census American Community Survey (ACS). This release is known as the "Public Use Microdata Sample" because it contains individual responses, whereas most other numbers from the ACS are reported only in aggregate.

  • The data documentation can be viewed here
  • The data dictionary can be viewed here in PDF form, or here in CSV form.
  • The dataset is created by merging the "household" and "person" parts of the survey, so each row contains one person's responses to both the household and person surveys. A person is identified by a unique SERIALNO. The rows are then filtered to a "reasonable" subset: (1) individuals older than 16, (2) personal income greater than $100, and (3) hours worked in a typical week greater than 0.

The full script that minimally prepares the data is here.

The resulting training dataset has 30,085 rows (people) and 494 raw columns.
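The merge and filter steps described above can be sketched as follows. This is not the project's preparation script. It assumes the join key is `SERIALNO` and that age, personal income, and usual weekly hours are stored in columns named `AGEP`, `PINCP`, and `WKHP`; the helper `merge_and_filter`, the `NP` column, and the toy data are hypothetical.

```python
import pandas as pd


def merge_and_filter(household: pd.DataFrame, person: pd.DataFrame) -> pd.DataFrame:
    """Attach household responses to each person row, then keep 'reasonable' rows."""
    # Assumed join key: SERIALNO, present in both parts of the survey.
    merged = person.merge(household, on="SERIALNO", how="left")
    keep = (
        (merged["AGEP"] > 16)      # individuals older than 16
        & (merged["PINCP"] > 100)  # personal income greater than $100
        & (merged["WKHP"] > 0)     # hours worked in a typical week greater than 0
    )
    return merged.loc[keep].reset_index(drop=True)


# Made-up rows purely for illustration.
household = pd.DataFrame({"SERIALNO": ["A", "B", "C"], "NP": [1, 1, 1]})
person = pd.DataFrame(
    {
        "SERIALNO": ["A", "B", "C"],
        "AGEP": [35, 15, 50],
        "PINCP": [52000, 0, 90],
        "WKHP": [40, 0, 20],
    }
)
result = merge_and_filter(household, person)
print(result)
```

Missing values compare as false in pandas, so rows with a missing age, income, or hours value are dropped by this filter.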

Prediction target

The prediction target is whether an individual respondent will earn more than $84,770 in 2018. Though a bit contrived, this comes from adapting the classic ML "census" dataset to the modern era. The original prediction target is to

determine whether a person makes over 50K a year.

Adjusting this 50K threshold for inflation from 1994 to 2018 gives the $84,770 threshold used here.
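To make the target concrete, the sketch below encodes the binary label and shows the inflation factor implied by the two thresholds. The function `encode_target` and the sample incomes are hypothetical; only the two dollar amounts come from the text above.

```python
import pandas as pd

ORIGINAL_THRESHOLD_1994 = 50_000  # "over 50K a year" in the classic dataset
THRESHOLD_2018 = 84_770           # inflation-adjusted threshold used here

# Inflation factor implied by the two thresholds (not an official index value).
implied_factor = THRESHOLD_2018 / ORIGINAL_THRESHOLD_1994


def encode_target(income: pd.Series, threshold: float = THRESHOLD_2018) -> pd.Series:
    """Return 1 where income is strictly greater than the threshold, else 0."""
    return (income > threshold).astype(int)


# Made-up incomes for illustration.
labels = encode_target(pd.Series([50_000, 84_770, 84_771]))
print(implied_factor, labels.tolist())
```

The comparison is strict, so an income of exactly $84,770 falls in the negative class, matching "more than $84,770".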

Getting help


