raquelrguima / diversity_innovation_paradox

The Diversity-Innovation Paradox in Science

This repository contains code and data associated with “The Diversity-Innovation Paradox in Science.” The preprint is available on arXiv (arXiv:1909.02063).

If you use any of the code or ideas presented here, please cite our paper:

  • Hofstra, Bas, Vivek V. Kulkarni, Sebastian Munoz-Najar Galvez, Bryan He, Dan Jurafsky, & Daniel A. McFarland. (2020). The Diversity Innovation Paradox in Science. arXiv, arXiv:1909.02063.

In a nutshell

By analyzing data from nearly all US PhD recipients and their dissertations across three decades, this paper finds that demographically underrepresented students innovate at higher rates than majority students, but their novel contributions are discounted and less likely to earn them academic positions. The discounting of minorities’ innovations may partly explain their underrepresentation in influential positions of academia.

Figure 1. The introduction of innovations and their subsequent uptake.

Code

The provided code constructs the novelty, impactful novelty, and distal novelty metrics from the ProQuest dissertation abstract data.

  • stms_estimate_at_K.R: Runs Structural Topic Models over a specified range of K (50 to 1000 in the paper).
  • concepts_k500_50.R: Extracts concepts from the structural topic model output. The number of words, the number of topics, and the FREX weighting can be adjusted in the code to obtain the different K/FREX scenarios.
  • novelty and impactful novelty:
  • proquest-skipgrams.py: Learns the concept embeddings used to determine which linkages are distal and which are proximal.
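
To give a rough feel for what these metrics capture, here is a minimal Python sketch. It is not the code in this repository, and every name in it is hypothetical. It assumes that novelty is the number of concept pairs a dissertation links for the first time, that uptake counts how often later dissertations reuse those links, and that a link's distance can be approximated by the cosine similarity of its two concept embeddings (lower similarity meaning a more distal link). The exact definitions, normalizations, and thresholds used in the paper may differ.

```python
from itertools import combinations
from math import sqrt


def novelty_and_uptake(docs):
    """Count new concept links per document and their later reuse.

    docs: iterable of (doc_id, year, set_of_concepts) tuples.
    Assumption: documents are ordered by year; ties keep input order.
    """
    first_seen = {}  # concept pair -> doc_id that introduced it
    stats = {}
    for doc_id, _year, concepts in sorted(docs, key=lambda d: d[1]):
        stats[doc_id] = {"novelty": 0, "uptake": 0}
        for link in combinations(sorted(concepts), 2):
            if link not in first_seen:
                # First time this pair of concepts co-occurs.
                first_seen[link] = doc_id
                stats[doc_id]["novelty"] += 1
            else:
                # A later document reuses a link; credit its introducer.
                stats[first_seen[link]]["uptake"] += 1
    return stats


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy data, purely illustrative.
docs = [
    ("d1", 1990, {"gene", "network"}),
    ("d2", 1995, {"gene", "network", "poetry"}),
    ("d3", 2000, {"gene", "network"}),
]
stats = novelty_and_uptake(docs)
print(stats["d1"])  # {'novelty': 1, 'uptake': 2}
print(stats["d2"])  # {'novelty': 2, 'uptake': 0}
```

In this toy example, d1 introduces one link that is reused twice, while d2 introduces two links that are never taken up. With real embeddings, `cosine` could be applied to the two concepts of each new link to separate distal from proximal linkages.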

Data

For the concepts extracted from the K = 500 Structural Topic Model, where frequency and exclusivity are weighted equally (extracted in concepts_k500_50.R), see k500_wordcouds_n_to_n.zip for visualizations or frexconcepts_k500_50.rda for the data (second element in the list).

For raw data of ProQuest or the Web of Science:

For inferring gender and race associated with names:
