vertuli / college-yield-gap

A dive into college admissions data to understand the yield rate gender gap.

Yield Rate Gender Gap

This is an investigation into the yield rate disparity between women and men in college admissions. I started it as a personal project to tie together my use of Python for web scraping, data cleaning, data visualization, hypothesis testing, statistical modeling, machine learning, and more. I appreciate feedback!


I had been exposed to college admissions data for years while managing the team at Newton, my education startup in Shanghai. Only after building a primitive shell-script scraper to pull college data for a “match” tool did I notice what seemed to be a systemic gender gap in yield rates: the ratio of students who accept an offer of admission to the total number of offers a school extends.
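As a minimal sketch of the yield rate definition above, the snippet below computes the rate separately for women and men at a single school and takes the difference. The function name `yield_rate` and all of the numbers are invented for illustration; they are not taken from the project's data and say nothing about the direction or size of the real gap.

```python
def yield_rate(enrolled: int, admitted: int) -> float:
    """Return the share of admitted students who accepted their offer."""
    if admitted <= 0:
        raise ValueError("admitted must be a positive count")
    return enrolled / admitted


# Invented counts for one hypothetical school (not real data).
women = yield_rate(enrolled=300, admitted=1000)
men = yield_rate(enrolled=350, admitted=1000)

# The "gap" here is simply the difference between the two rates.
gap = men - women
print(f"women: {women:.1%}, men: {men:.1%}, gap: {gap:+.1%}")
```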

After brushing up on probability, statistics, and linear algebra, I started learning the fundamental tools of modern data science, and the old admissions data I had used at Newton seemed like a perfect candidate for a personal project to cement that learning. This project has given me hundreds of hours of experience with the following technologies, and I now feel confident in my ability to:

  • with Jupyter notebooks (soon upgrading to JupyterLab),
    • run Jupyter notebooks remotely (and securely) from a server.
    • work on notebooks from my iPad at the cafe!
  • with BeautifulSoup,
    • along with requests, retrieve large numbers of pages without supervision.
    • isolate and extract targeted values in web pages.
  • with PostgreSQL / SQLite,
    • install database software on a server.
    • use complex SQL queries to select specific data.
  • with pandas,
    • perform advanced manipulation of data.
    • comfortably use MultiIndexing.
    • efficiently clean text data using string methods and regular expressions.
  • with difflib / fuzzywuzzy,
    • join tables with close but not identical string keys.
  • with matplotlib / seaborn / bokeh,
    • visualize distributions with boxplots, ECDFs, etc.
    • finely control customization of figures.
    • create interactive visualization tools.
  • with scipy.stats,
    • conduct statistical hypothesis testing.
    • normalize data with Box-Cox transforms.
  • with scikit-learn,
    • split data into training and testing sets and cross-validate.
    • build linear, lasso, ridge, and isotonic regression models.
    • use logistic regression models for classification.
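To make one item from the list concrete, here is a rough sketch of joining two tables whose string keys are close but not identical, using only `difflib` from the standard library. The school names, values, the helper `fuzzy_join`, and the `cutoff` of 0.6 are all assumptions made for illustration; the project's actual matching code may look quite different.

```python
from difflib import get_close_matches

# Hypothetical records from two sources that spell school names differently.
admissions = {"Univ. of Example": 0.31, "Sample State University": 0.42}
rankings = {"University of Example": 12, "Sample State Univ": 57}


def fuzzy_join(left, right, cutoff=0.6):
    """Pair each left key with its closest right key, if one clears the cutoff."""
    joined = {}
    right_keys = list(right)
    for key, left_value in left.items():
        # n=1 keeps only the single best candidate above the similarity cutoff.
        match = get_close_matches(key, right_keys, n=1, cutoff=cutoff)
        if match:
            joined[key] = (left_value, right[match[0]])
    return joined


merged = fuzzy_join(admissions, rankings)
for name, (rate, rank) in merged.items():
    print(name, rate, rank)
```

Keys with no candidate above the cutoff are dropped rather than guessed, so in practice the unmatched rows would still need a manual check.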

In addition, I have a solid grasp of how multiple imputation by chained equations (MICE) works and have used iterative imputation to handle missing data in this project.
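The chained-equations idea can be sketched in a few lines of plain Python. This is a deliberately simplified illustration, not the project's implementation: it handles only two columns, uses a deterministic least-squares line instead of drawing from a predictive distribution, and produces a single completed dataset, whereas full MICE repeats the procedure with random draws to obtain several. The helper names `fit_line` and `chained_impute` and the toy data are assumptions.

```python
from statistics import fmean


def fit_line(xs, ys):
    """Ordinary least-squares fit of ys on xs; returns (slope, intercept)."""
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((a - mx) ** 2 for a in xs)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def chained_impute(x, y, n_iter=10):
    """Fill None entries in two paired columns by alternating regressions."""
    miss_x = [v is None for v in x]
    miss_y = [v is None for v in y]
    # Start from mean imputation so every row has a value in both columns.
    mean_x = fmean([v for v in x if v is not None])
    mean_y = fmean([v for v in y if v is not None])
    x = [mean_x if m else v for v, m in zip(x, miss_x)]
    y = [mean_y if m else v for v, m in zip(y, miss_y)]
    for _ in range(n_iter):
        # Regress x on y using rows where x was observed, then refill missing x.
        slope, intercept = fit_line(
            [b for b, m in zip(y, miss_x) if not m],
            [a for a, m in zip(x, miss_x) if not m],
        )
        x = [intercept + slope * b if m else a for a, b, m in zip(x, y, miss_x)]
        # Do the same for y, conditioning on the freshly updated x.
        slope, intercept = fit_line(
            [a for a, m in zip(x, miss_y) if not m],
            [b for b, m in zip(y, miss_y) if not m],
        )
        y = [intercept + slope * a if m else b for a, b, m in zip(x, y, miss_y)]
    return x, y


# Toy data following y = 2x, with one gap in each column.
filled_x, filled_y = chained_impute([1, 2, 3, None, 5], [2, 4, None, 8, 10])
print(filled_x, filled_y)
```

Each column takes a turn as the regression target while the other serves as the predictor, which is the "chained" part of the name.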



Languages

HTML 56.9%, Jupyter Notebook 42.6%, Python 0.5%