Look at the final result of this project! If you want to read the blog post for this project, take a look here.
The blog has more information about the motivation for this project and the process, so I'll try to keep this short.
project_fletcher_cleaning.py
- This file contains the scripts that I used to scrape the information from The American Presidency Project.
- The scraping process was more or less the same for each type of speech (State of the Union, inaugural address, etc.):
- scrape the links for each of the speeches of a given type
- use those links to scrape the individual speech page for the speech text, president name, date, and title
- create a dictionary for a particular speech
- loop through all the links to create a list of dictionaries (one for each speech)
- put everything into MongoDB
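
The steps above can be sketched roughly as follows. This is a minimal illustration rather than the actual contents of project_fletcher_cleaning.py: the CSS class names, the `fetch` callable, and the helper names are all hypothetical, since the real page markup and code are not reproduced in this README. It uses only the standard library for parsing and expects a collection object that offers `insert_many`, as a pymongo collection does.

```python
from html.parser import HTMLParser

# Hypothetical class names: the real markup of the speech pages is an assumption.
FIELD_CLASSES = {
    "speech-title": "title",
    "speech-president": "president",
    "speech-date": "date",
    "speech-text": "text",
}
# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class SpeechParser(HTMLParser):
    """Collect the text found inside elements tagged with a known class."""

    def __init__(self):
        super().__init__()
        self.record = {field: "" for field in FIELD_CLASSES.values()}
        self._field = None  # field currently being captured, if any
        self._depth = 0     # nesting depth inside the captured element

    def handle_starttag(self, tag, attrs):
        if self._field is not None:
            # Keep words in adjacent child tags from running together.
            self.record[self._field] += " "
            if tag not in VOID_TAGS:
                self._depth += 1
            return
        if tag in VOID_TAGS:
            return
        for cls in (dict(attrs).get("class") or "").split():
            if cls in FIELD_CLASSES:
                self._field = FIELD_CLASSES[cls]
                self._depth = 1
                break

    def handle_endtag(self, tag):
        if self._field is None or tag in VOID_TAGS:
            return
        self._depth -= 1
        if self._depth == 0:
            self._field = None

    def handle_data(self, data):
        if self._field is not None:
            self.record[self._field] += data


def parse_speech(html):
    """Build the dictionary for one speech page."""
    parser = SpeechParser()
    parser.feed(html)
    # Collapse runs of whitespace left over from the markup.
    return {key: " ".join(value.split()) for key, value in parser.record.items()}


def collect_speeches(links, fetch):
    """Loop over the links and return one dictionary per speech.

    `fetch` is any callable that takes a URL and returns the page HTML.
    """
    return [dict(parse_speech(fetch(url)), url=url) for url in links]


def store_speeches(collection, speeches):
    """Insert all speech dictionaries into a MongoDB-style collection."""
    if speeches:
        collection.insert_many(speeches)
```

With pymongo installed, `collection` could be something like `MongoClient()["some_db"]["speeches"]` (the database and collection names here are placeholders), and `fetch` could wrap whatever HTTP library is used for the scraping. Passing `fetch` in as an argument keeps the parsing logic testable without network access.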
project_fletcher_processing.py
- This file contains the scripts and functions used to generate topics via latent Dirichlet allocation (LDA) and to compute similarity scores from the document matrix produced by latent semantic indexing (LSI).
- I commented out the code that I wasn't using rather than deleting it, because I wanted the files to keep a record of what I did.
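
To make the similarity step concrete, here is a small sketch of how scores could be computed once each speech has been reduced to a vector in LSI space. The README does not say which similarity measure was used, so cosine similarity is an assumption here, and the function names and example vectors are made up for illustration rather than taken from project_fletcher_processing.py.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_u * norm_v)


def most_similar(doc_vectors, name, top_n=3):
    """Rank every other document by similarity to `name`.

    `doc_vectors` maps a document label to its (assumed) LSI vector.
    """
    target = doc_vectors[name]
    scores = [
        (other, cosine_similarity(target, vec))
        for other, vec in doc_vectors.items()
        if other != name
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


# Made-up two-dimensional vectors, purely for illustration.
example_vectors = {
    "speech_a": [0.9, 0.1],
    "speech_b": [0.8, 0.2],
    "speech_c": [0.1, 0.9],
}
ranking = most_similar(example_vectors, "speech_a", top_n=2)
```

In this toy example `speech_b` ranks above `speech_c` for `speech_a`, because its vector points in nearly the same direction. Cosine similarity ignores vector length, which is one reason it is a common choice for comparing documents of different sizes.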
president.html
- The final product was a webpage with a couple of visualizations created using D3.js.