Fragile Families Collaboration

This is a collaborative predictive modeling project built on the ballet framework.

The Fragile Families Challenge (FFC) is a recent attempt to better connect the social science research community to new tools in data science and machine learning. This challenge aimed to spur the development of predictive models for life outcomes from data collected as part of the Fragile Families and Child Wellbeing Study (FFCWS), which collects detailed longitudinal records on a set of disadvantaged children and their families. Organizers released anonymized and merged data on a set of 4,242 families with data collected from the birth of the child until age 9. Participants in the challenge were then tasked with predicting six life outcomes of the child or family at age 15: child grade point average, child grit, household eviction, household material hardship, primary caregiver layoff, and primary caregiver participation in job training. The FFC was run over a four-month period in 2017 and received 160 submissions from social scientists, machine learning practitioners, students, and others.

In this project, we ask: by collaborating rather than competing, can we develop impactful solutions to the FFC? Participants in the FFC competed against each other to produce the best-performing models, at the expense of collaboration across teams.

Your task is to create and submit feature definitions to our shared project that help us in predicting these key life outcomes.

Join the collaboration

Are you interested in joining the collaboration?

Apply for access to the dataset and then register yourself with us.
Read/skim the Ballet Contributor Guide.
Read/skim the Ballet Feature Engineering Guide.
Learn more about the Fragile Families dataset.
1. Read/skim the data documentation.
2. Skim additional resources.
Browse the currently accepted features in the contributed features directory (src/fragile_families/features/contrib).
Launch an interactive Jupyter Lab session to hack on this repository.

Data access

The data underlying the Fragile Families Challenge, which we are using in this collaboration, is sensitive and requires registration to access. More details are upcoming about how to access this data.

If you are already authorized to access the data, you can look over Data Documentation below.

Apply for access and registration

You must apply to Princeton's Office of Population Research (OPR) for access to the Fragile Families Challenge dataset.

✉️ Follow instructions here to apply for access

Once you have been granted access to the data from Princeton's Office of Population Research (OPR), you must register with us to join the collaboration. (This is step 7 in the instructions above, so don't repeat it if you already filled out the form.)

✋ Register here!

Authentication

Your AWS access key ID/secret will be automatically detected from standard locations (such as environment variables or credentials files).

If you are working in a notebook without access to other methods of configuration (such as using Assemblé) you can do the following in a code cell:

import os

# Replace the placeholders with your own credentials, and do not commit this cell.
os.environ['AWS_ACCESS_KEY_ID'] = 'your access key id'
os.environ['AWS_SECRET_ACCESS_KEY'] = 'your secret access key'
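
If you would rather not type your secret into a cell that gets saved with the notebook, one possible alternative is to prompt for it. The sketch below uses only the Python standard library; the helper names `missing_aws_vars` and `prompt_for_aws_credentials` are hypothetical and not part of this project or of ballet. It only inspects environment variables, so credentials stored in a credentials file will still be reported as missing here.

```python
import os
from getpass import getpass

# The two environment variables named in the section above.
AWS_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_aws_vars(env=os.environ):
    """Return the names of the AWS variables that are unset or empty in env."""
    return [name for name in AWS_VARS if not env.get(name)]


def prompt_for_aws_credentials(env=os.environ, prompt=getpass):
    """Ask for each missing variable without echoing it, then store it in env."""
    for name in missing_aws_vars(env):
        env[name] = prompt(f"{name}: ")
```

Calling `prompt_for_aws_credentials()` in a code cell asks only for the values that are not already set.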

Data documentation

The full challenge dataset contains a "background" table of 4,242 rows (one per child) and 12,942 columns.

Train split

The "train" split contains 2,121 rows (half of the background set) and 7 additional columns:

challengeID: A unique numeric identifier for each child.
Six outcome variables (each variable name links to a blog post about that variable)
1. Continuous variables: grit, gpa, materialHardship
2. Binary variables: eviction, layoff, jobTraining

These six outcome variables are the outcomes that we are trying to predict.

💡 For the purpose of validating feature contributions, we will focus on the materialHardship prediction problem. However, we want our feature definitions to be useful for all six prediction problems.
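
When writing modeling or evaluation code, it can help to keep the outcome types in one place. The following is a minimal sketch based only on the variable names listed above; the constants and the `outcome_kind` helper are hypothetical and not part of the project's API.

```python
# Outcome names as listed in the train split description above.
CONTINUOUS_OUTCOMES = ("grit", "gpa", "materialHardship")
BINARY_OUTCOMES = ("eviction", "layoff", "jobTraining")

# The outcome used for validating feature contributions.
VALIDATION_TARGET = "materialHardship"


def outcome_kind(name: str) -> str:
    """Return "continuous" or "binary" for one of the six outcome names."""
    if name in CONTINUOUS_OUTCOMES:
        return "continuous"
    if name in BINARY_OUTCOMES:
        return "binary"
    raise KeyError(f"unknown outcome variable: {name!r}")
```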

You can load the train split as follows:

from ballet import b
X_df, y_df = b.api.load_data()

Leaderboard and test splits

The other half of the rows are reserved for the "leaderboard" and "test" splits. We will use the leaderboard split to validate feature contributions. We will not look at the test split until the end of the collaboration.

Background variables

(This section is adapted from here)

To use the data, it may be useful to know something about what each variable (column) represents. (See also the full documentation.)

Waves and child ages

The background variables were collected in 5 waves.

Wave 1: Collected in the hospital at the child's birth.
Wave 2: Collected at approximately child age 1.
Wave 3: Collected at approximately child age 3.
Wave 4: Collected at approximately child age 5.
Wave 5: Collected at approximately child age 9.
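
The wave list above can be kept as a small lookup table when a feature needs the child's approximate age rather than the wave number. This is only an illustrative sketch; the names `WAVE_TO_APPROX_AGE` and `approx_child_age` are hypothetical, and wave 1 is mapped to age 0 because it was collected at birth.

```python
# Approximate child age (in years) at each wave, per the list above.
WAVE_TO_APPROX_AGE = {1: 0, 2: 1, 3: 3, 4: 5, 5: 9}


def approx_child_age(wave: int) -> int:
    """Return the approximate child age in years for a wave number (1 to 5)."""
    return WAVE_TO_APPROX_AGE[wave]
```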

Note that wave numbers are not the same as child ages. The variable names and survey documentation are organized by wave number.

Variable naming conventions

Predictor variables are identified by a prefix and a question number. Prefixes indicate the survey in which a question was collected. This is useful because the documentation is organized by survey. For instance, the variable m1a4 refers to the mother interview in wave 1, question a4.

The prefix c in front of any variable indicates variables constructed from other responses. For instance, cm4b_age is constructed from the mother wave 4 interview, and captures the child's age (baby's age).
m1, m2, m3, m4, m5: Questions asked of the child's mother in wave 1 through wave 5.
f1, ..., f5: Questions asked of the child's father in wave 1 through wave 5.
hv3, hv4, hv5: Questions asked in the home visit in waves 3, 4, and 5.
p5: Questions asked of the primary caregiver in wave 5.
k5: Questions asked of the child (kid) in wave 5.
ffcc: Questions asked in various child care provider surveys in wave 3.
kind: Questions asked of the kindergarten teacher.
t5: Questions asked of the teacher in wave 5.
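
The conventions above can be sketched as a small parser that splits a column name into its parts. This is an illustration under assumptions, not project code: `ParsedName` and `parse_variable_name` are hypothetical names, the pattern covers only the prefixes listed above, and real column names may deviate from it, in which case the function returns `None`.

```python
import re
from typing import NamedTuple, Optional

# Optional leading "c" (constructed variable), then one documented prefix,
# then whatever remains (the question identifier).
_NAME_RE = re.compile(
    r"^(?P<constructed>c)?"
    r"(?P<prefix>hv[345]|[mf][1-5]|[pkt]5|ffcc|kind)"
    r"(?P<question>.*)$"
)


class ParsedName(NamedTuple):
    constructed: bool     # True if the name starts with the "c" marker
    survey: str           # "m", "f", "hv", "p", "k", "t", "ffcc", or "kind"
    wave: Optional[int]   # None where the list above does not state a wave
    question: str         # remainder of the name, e.g. "a4" or "b_age"


def parse_variable_name(name: str) -> Optional[ParsedName]:
    """Split a variable name by the documented prefixes; None if no match."""
    match = _NAME_RE.match(name)
    if match is None:
        return None
    prefix = match.group("prefix")
    if prefix[-1].isdigit():
        survey, wave = prefix[:-1], int(prefix[-1])
    elif prefix == "ffcc":
        survey, wave = prefix, 3   # child care provider surveys: wave 3
    else:
        survey, wave = prefix, None   # "kind": no wave given above
    constructed = match.group("constructed") is not None
    return ParsedName(constructed, survey, wave, match.group("question"))
```

For example, `parse_variable_name("m1a4")` yields survey `"m"`, wave `1`, and question `"a4"`, while `parse_variable_name("cm4b_age")` is additionally flagged as constructed.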
