Thank you for your interest in our poster! This repo contains code to reproduce the analysis and visualization.
Some content in this repo is not ours, and the MIT license does not apply to that content. Please see the directory LICENSED_CONTENT for identification of licensed content and to read their respective licenses.
If all you want to do is see how to make the poster, skip to the visualization section.
Otherwise, for the analysis, you will need Python 3.5+ because
I used `print` as a function and probably other things. If you use Python 3.4,
you can't call `help()` on some things in SQLAlchemy because of an issue in
`inspect.py` that's gone in 3.5+. I didn't realize that until cleaning
up this repo for sharing, though, so everything worked OK on 3.4.
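A minimal sketch of a guard for the version note above. The helper name `meets_minimum` is hypothetical and is not part of this repo; it only compares the major and minor version numbers.

```python
import sys

MINIMUM = (3, 5)  # assumption: taken from the "Python 3.5+" note above


def meets_minimum(version_info=None, minimum=MINIMUM):
    """Return True if the given (or running) interpreter is at least `minimum`."""
    if version_info is None:
        version_info = sys.version_info
    # Compare only (major, minor), e.g. (3, 4) < (3, 5).
    return tuple(version_info[:2]) >= tuple(minimum)


if not meets_minimum():
    print("Warning: Python %d.%d+ is recommended for the analysis" % MINIMUM)
```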
To set up the environment with pipenv:

    pipenv --three
    pipenv install --skip-lock
    pipenv shell
    # and `exit` to leave the shell

Enter each directory to do the relevant work for each step.
# data
This directory will contain the database (it's 30MB so it's on Dropbox, not GitHub),
plus the SQLAlchemy ORM. You don't need to directly run anything in here; the path
to `database.py` is prepended to the Python path in both `acquisition` and `analysis`.
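As an illustration of what prepending to the Python path can look like, here is a minimal sketch. The helper `prepend_to_path` is hypothetical, and the layout assumption (the calling script runs from a directory that sits next to `data/`) may not match how the repo's scripts actually do it.

```python
import os
import sys


def prepend_to_path(directory):
    """Put `directory` at the front of sys.path so its modules are found first."""
    directory = os.path.abspath(directory)
    if directory in sys.path:
        sys.path.remove(directory)  # avoid duplicate entries
    sys.path.insert(0, directory)
    return directory


# Assumption: the current working directory is acquisition/ or analysis/,
# so data/ is a sibling directory one level up.
data_dir = prepend_to_path(os.path.join(os.getcwd(), os.pardir, "data"))
# After this, `import database` would resolve to data/database.py if it exists.
```

Inserting at index 0 rather than appending means a module in `data/` takes precedence over any same-named module elsewhere on the path.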
# acquisition
This directory contains a script, `run_all_acquisitions.py`, to run the data
acquisition or download the database from Dropbox; it gives an interactive
choice. It will put the database in `data/PyCons.db`.
(The interactive choice is just between running all the scraping code and
downloading with `curl` from here: https://www.dropbox.com/s/3muutb5uw15g5tp/PyCons.db?dl=1
which you can also do manually if you'd rather.)
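If you would rather do the manual download from Python than with `curl`, the following is a minimal sketch using only the standard library. The function names `default_db_path` and `download_database` are hypothetical and are not taken from `run_all_acquisitions.py`; the URL and the `data/PyCons.db` destination come from the text above.

```python
import os
import shutil
import urllib.request

DB_URL = "https://www.dropbox.com/s/3muutb5uw15g5tp/PyCons.db?dl=1"


def default_db_path(repo_root="."):
    """Destination described above: data/PyCons.db under the repo root."""
    return os.path.join(repo_root, "data", "PyCons.db")


def download_database(url=DB_URL, dest=None):
    """Stream `url` to `dest` and return the path that was written."""
    dest = dest or default_db_path()
    parent = os.path.dirname(dest)
    if parent:
        os.makedirs(parent, exist_ok=True)
    # Copy in chunks so the ~30MB file is never held fully in memory.
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest
```

Calling `download_database()` from the repo root would write to `data/PyCons.db`; nothing is downloaded until the function is called.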
Scraping is partly manual to deal with different spellings of names, so expect to spend an hour or two answering 'Y' or 'n' to questions like "is 'Enthought' the same as 'Enthought, LLC.'?"
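To give a feel for the kind of prompt described above, here is a minimal sketch of interactive name reconciliation. It is not the repo's scraping code: the functions `similar_pairs` and `reconcile`, the use of `difflib`, and the 0.6 similarity cutoff are all assumptions chosen for illustration.

```python
import difflib


def similar_pairs(names, cutoff=0.6):
    """Yield (a, b) pairs of distinct names whose similarity ratio meets `cutoff`."""
    names = sorted(set(names))
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            ratio = difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()
            if ratio >= cutoff:
                yield a, b


def reconcile(names, ask=input, cutoff=0.6):
    """Map each name to a canonical spelling, confirming every merge via `ask`."""
    canonical = {name: name for name in names}
    for a, b in similar_pairs(names, cutoff):
        answer = ask("is '%s' the same as '%s'? [Y/n] " % (a, b)).strip().lower()
        if answer in ("", "y", "yes"):
            # Treat the alphabetically earlier spelling as canonical.
            canonical[b] = canonical[a]
    return canonical
```

Passing `ask` as a parameter keeps the default interactive behavior (`input`) while allowing a scripted answer function, for example `reconcile(names, ask=lambda question: "n")`.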
# analysis
The analysis is done in a Jupyter notebook that shows attempts at simple word frequency,
clustering, and Latent Dirichlet Allocation. In the end, it was clear that manual labeling
would be the best option. The Excel file in `data/all_talks_byhand.xlsx` contains
the manual labels. It was converted to JSON, then annotated to add the captions in
`visualization/data/topic_graph_byhand.json`.
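For readers unfamiliar with the simple word frequency attempt mentioned above, the idea can be sketched as follows. This is a stand-in, not the notebook's code: the function `word_frequencies`, the stopword list, and the example titles are all hypothetical.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stopword list, not the one used in the notebook.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "to", "with"}


def word_frequencies(titles, stopwords=STOPWORDS):
    """Count lowercase words across all titles, skipping stopwords."""
    counts = Counter()
    for title in titles:
        words = re.findall(r"[a-z0-9']+", title.lower())
        counts.update(w for w in words if w not in stopwords)
    return counts


# Hypothetical titles, for illustration only; not taken from the database.
example_titles = [
    "Intro to Testing in Python",
    "Testing the Web with Python",
    "Data Analysis for Beginners",
]
freqs = word_frequencies(example_titles)
```

Counts like these say which words are common but not which topic a talk belongs to, which is consistent with the conclusion above that manual labeling worked better.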
# visualization
This directory is independent of the rest of the project.
If all you want to do is reproduce the poster, go to the `visualization`
directory and follow the instructions there. You do not need to `pipenv install`
anything.