leix28 / ballet-fragile-families

Collaborating to solve the Fragile Families Challenge using the Ballet framework

Fragile Families Collaboration

This is a collaborative predictive modeling project built on the ballet framework.

The Fragile Families Challenge (FFC) is a recent attempt to better connect the social science research community to new tools in data science and machine learning. The challenge aimed to spur the development of predictive models for life outcomes from data collected as part of the Fragile Families and Child Wellbeing Study (FFCWS), which collects detailed longitudinal records on a set of disadvantaged children and their families. Organizers released anonymized and merged data on a set of 4,242 families, with data collected from the birth of the child until age 9. Participants in the challenge were then tasked with predicting six life outcomes of the child or family at age 15: child grade point average, child grit, household eviction, household material hardship, primary caregiver layoff, and primary caregiver participation in job training. The FFC ran over a four-month period in 2017 and received 160 submissions from social scientists, machine learning practitioners, students, and others.

Participants in the FFC competed against each other to produce the best-performing models, at the expense of collaboration across teams. In this project, we ask: by collaborating rather than competing, can we develop impactful solutions to the FFC?

Your task is to create and submit feature definitions to our shared project that help us in predicting these key life outcomes.

Join the collaboration

Are you interested in joining the collaboration?

  1. Apply for access to the dataset and then register yourself with us.
  2. Read/skim the Ballet Contributor Guide.
  3. Read/skim the Ballet Feature Engineering Guide.
  4. Learn more about the Fragile Families dataset.
    1. Read/skim the data documentation.
    2. Skim additional resources.
  5. Browse the currently accepted features in the contributed features directory (src/fragile_families/features/contrib).
  6. Launch an interactive Jupyter Lab session to hack on this repository:

Data access

The data underlying the Fragile Families Challenge, which we are using in this collaboration, is sensitive and requires registration to access.

If you are already authorized to access the data, you can skip ahead to "Data documentation" below.

Apply for access and registration

You must apply to Princeton's Office of Population Research (OPR) for access to the Fragile Families Challenge dataset.

✉️ Follow instructions here to apply for access

The Fragile Families Challenge dataset contains sensitive information. Keep the dataset secure, protect the privacy of the individuals in it, and abide by the data access agreement, which requires you not to share your copy of the dataset.

Once you have been granted access to the data by Princeton OPR (or if you already had access from prior research), you must register with us to join the collaboration. (This is step 7 in the instructions above, so don't repeat it if you already filled out the form.)

Register here!

Authentication

Your AWS access key ID/secret will be automatically detected from standard locations (such as environment variables or credentials files).

If you are working in a notebook without access to other methods of configuration (such as using Assemblé) you can do the following in a code cell:

import os
os.environ['AWS_ACCESS_KEY_ID'] = 'your access key id'
os.environ['AWS_SECRET_ACCESS_KEY'] = 'your secret access key'

Alternatively, if you are working locally, you can create a new AWS profile in ~/.aws/credentials:

[bff]
aws_access_key_id = your-access-key-id
aws_secret_access_key = your-secret-access-key

You can then use this profile when developing features for this project by exporting the environment variable AWS_PROFILE=bff (or by setting it through os.environ, as in the notebook example above).
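As a minimal sketch of the os.environ approach for profiles, the snippet below checks that a named profile exists in the credentials file before selecting it. The helper name use_aws_profile and its credentials_path parameter are hypothetical and not part of this project; only the bff profile name, the ~/.aws/credentials location, and the AWS_PROFILE variable come from the text above.

```python
import configparser
import os
from pathlib import Path


def use_aws_profile(profile="bff", credentials_path=None):
    """Select a named AWS profile for this process after checking it exists.

    Hypothetical helper for illustration only; it is not part of Ballet
    or of this repository.
    """
    if credentials_path is None:
        # Standard location of the shared credentials file.
        credentials_path = Path.home() / ".aws" / "credentials"

    parser = configparser.ConfigParser()
    # read() ignores missing files, so a missing file means "no profiles".
    parser.read(credentials_path)

    if not parser.has_section(profile):
        raise KeyError(f"profile {profile!r} not found in {credentials_path}")

    # Equivalent to `export AWS_PROFILE=<profile>` for this process only.
    os.environ["AWS_PROFILE"] = profile
    return profile
```

Calling use_aws_profile("bff") at the top of a notebook, before any data is loaded, should have the same effect as exporting AWS_PROFILE=bff in the shell that launched it.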

Data documentation

The full challenge dataset contains a "background" table of 4,242 rows (one per child) and 12,942 columns.

Train split

The "train" split contains 2,121 rows (half of the background set) and 7 additional columns:

  • challengeID: A unique numeric identifier for each child.
  • Six outcome variables (each variable name links to a blog post about that variable)
    1. Continuous variables: grit, gpa, materialHardship
    2. Binary variables: eviction, layoff, jobTraining

These six outcome variables are the outcomes that we are trying to predict.

💡 For the purpose of validating feature contributions, we will focus on the materialHardship prediction problem. However, we want our feature definitions to be useful for all six prediction problems.

You can load the train split as follows:

from ballet import b
X_df, y_df = b.api.load_data()

Leaderboard and test splits

The other half of the rows are reserved for the "leaderboard" and "test" splits. We will use the leaderboard split to validate feature contributions. We will not look at the test split until the end of the collaboration.

Background variables

(This section is adapted from here)

To use the data, it may be useful to know something about what each variable (column) represents. (See also the full documentation.)

Waves and child ages

The background variables were collected in 5 waves.

  • Wave 1: Collected in the hospital at the child's birth.
  • Wave 2: Collected at approximately child age 1.
  • Wave 3: Collected at approximately child age 3.
  • Wave 4: Collected at approximately child age 5.
  • Wave 5: Collected at approximately child age 9.

Note that wave numbers are not the same as child ages. The variable names and survey documentation are organized by wave number.

Variable naming conventions

Predictor variables are identified by a prefix and a question number. The prefix indicates the survey in which a question was collected, which is useful because the documentation is organized by survey. For instance, the variable m1a4 refers to the mother interview in wave 1, question a4.

  1. c: A prefix c in front of any variable indicates a variable constructed from other responses. For instance, cm4b_age is constructed from the mother wave 4 interview and captures the child's age (baby's age).
  2. m1, m2, m3, m4, m5: Questions asked of the child's mother in waves 1 through 5.
  3. f1, ..., f5: Questions asked of the child's father in waves 1 through 5.
  4. hv3, hv4, hv5: Questions asked in the home visit in waves 3, 4, and 5.
  5. p5: Questions asked of the primary caregiver in wave 5.
  6. k5: Questions asked of the child (kid) in wave 5.
  7. ffcc: Questions asked in various child care provider surveys in wave 3.
  8. kind: Questions asked of the kindergarten teacher.
  9. t5: Questions asked of the teacher in wave 5.
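The naming conventions above can be sketched as a small parser that splits a variable name into respondent, wave, approximate child age, and question. This is an illustration only: the function parse_variable_name and the lookup tables are hypothetical, the child ages are the approximate ones from the list of waves, and real column names in the dataset may not all follow these rules.

```python
import re

# Approximate child age in years at each wave, from the list of waves above.
WAVE_TO_CHILD_AGE = {1: 0, 2: 1, 3: 3, 4: 5, 5: 9}

# Prefix letters that are followed by a wave digit.
RESPONDENTS = {
    "m": "mother",
    "f": "father",
    "hv": "home visit",
    "p": "primary caregiver",
    "k": "child",
    "t": "teacher",
}

# Prefixes that carry no wave digit; the wave is None where the text gives none.
SPECIAL_PREFIXES = {
    "ffcc": ("child care provider", 3),
    "kind": ("kindergarten teacher", None),
}

_PATTERN = re.compile(
    r"^(?P<constructed>c)?(?P<who>hv|m|f|p|k|t)(?P<wave>[1-5])(?P<question>.*)$"
)


def parse_variable_name(name):
    """Return a dict describing `name`, or None if no convention matches."""
    for prefix, (respondent, wave) in SPECIAL_PREFIXES.items():
        if name.startswith(prefix):
            return {
                "constructed": False,
                "respondent": respondent,
                "wave": wave,
                "child_age": WAVE_TO_CHILD_AGE.get(wave),
                "question": name[len(prefix):],
            }

    match = _PATTERN.match(name)
    if match is None:
        return None

    wave = int(match.group("wave"))
    return {
        "constructed": match.group("constructed") is not None,
        "respondent": RESPONDENTS[match.group("who")],
        "wave": wave,
        "child_age": WAVE_TO_CHILD_AGE[wave],
        "question": match.group("question"),
    }
```

For example, parse_variable_name("m1a4") reports the mother interview in wave 1 with question a4, while a name such as challengeID matches no convention and yields None. Applied to the column names of the loaded data, a parser like this could help group candidate inputs by wave or respondent.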

Metadata search

We wrap the ffmetadata API for our own use in feature development. See here for details on the filter operations.

import fragile_families.analysis.metadata as metadata
metadata.info('m1a4')
metadata.search({'name': 'label', 'op': 'like', 'val': '%school%'})
# can use metadata.searchinfo to combine the two methods
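Filters are plain dictionaries, so they are easy to build programmatically. The helper below is a hypothetical convenience, not part of this repository; it only reproduces the dictionary shape used in the example above and assumes that the like operation uses SQL-style % wildcards, as the '%school%' pattern suggests.

```python
def label_like(keyword):
    """Build a filter matching variables whose label contains `keyword`.

    Hypothetical helper: it mirrors the dict passed to metadata.search above
    and assumes SQL-style `%` wildcards for the `like` operation.
    """
    return {"name": "label", "op": "like", "val": f"%{keyword}%"}
```

With this helper, the search above could be written as metadata.search(label_like('school')).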

Feature validation

In this project, feature contributions are validated to ensure that they contribute positively to our shared feature engineering pipeline. One part of this validation, called "feature acceptance" validation, asks: does the performance of our ML pipeline improve when the new feature is added? We run each feature through two feature accepters: the MutualInformationAccepter and the VarianceThresholdAccepter. Based on the parameters set in our ballet.yml configuration file, a feature definition is accepted if it meets two criteria:

  • the variance of each of its feature columns is greater than a threshold (set to 0.05), i.e. Var(z_i) > 0.05 ∀ z_i ∈ z, where z_i are the columns of feature z.
  • the mutual information of the feature values with the target on the held-out leaderboard dataset split is greater than a threshold (set to 0.001), i.e. I(z ; y) > 0.001.
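To build intuition for the two criteria, here is a rough standalone sketch of both checks. It is not Ballet's implementation: it assumes population variance, and it swaps in a simple plug-in mutual information estimate that only makes sense for discrete values, whereas the actual accepters may use different estimators. All function names are hypothetical; only the two thresholds come from the text above.

```python
import math
from collections import Counter
from statistics import pvariance

VARIANCE_THRESHOLD = 0.05  # threshold quoted above
MI_THRESHOLD = 0.001       # threshold quoted above


def discrete_mutual_information(z, y):
    """Plug-in estimate of I(z; y) in nats for two discrete sequences."""
    n = len(z)
    joint = Counter(zip(z, y))
    count_z = Counter(z)
    count_y = Counter(y)
    mi = 0.0
    for (zi, yi), count in joint.items():
        p_joint = count / n
        # p(z, y) / (p(z) p(y)) simplifies to count * n / (count_z * count_y).
        ratio = count * n / (count_z[zi] * count_y[yi])
        mi += p_joint * math.log(ratio)
    return mi


def passes_both_checks(columns, y):
    """Check both criteria for a feature given as a list of columns.

    `columns` is a list of equal-length lists (the z_i); `y` is the target.
    """
    # Criterion 1: every column must vary enough (population variance assumed).
    variance_ok = all(pvariance(col) > VARIANCE_THRESHOLD for col in columns)

    # Criterion 2: treat each row of the feature as one joint symbol of z.
    rows = list(zip(*columns))
    mi_ok = discrete_mutual_information(rows, y) > MI_THRESHOLD

    return variance_ok and mi_ok
```

Under this sketch, a constant column fails the variance check regardless of the target, and a column that is statistically independent of the target fails the mutual information check even if its variance is high.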

Discussion and help

Want to chat about the project, compare ideas, or debug features with other collaborators? Join either of our two chat rooms:

  • Slack: slack (forgive me if the invite link is ever temporarily expired)
  • Gitter: Join the chat at https://gitter.im/ballet-project/fragile-families

If you think a question might have been answered before, check out the Ballet FAQ.

If you think you found a bug with Ballet, please open an issue and mention that you are working on the ballet-fragile-families project.

Additional resources
