The LDA topic modelling workflow includes several steps:
1. Creating a list of GOV.UK links for the experiment
2. Looking up these URLs on the GOV.UK content API to return a url, text file in CSV format (sketched below)
3. Performing LDA on an AWS EC2 instance (local upload and download of the url, text data)
4. (optional - if hierarchy) Splitting the output into multiple url, text CSVs based on which topic is most probable for each URL
5. (optional - if hierarchy) Running LDA on each split url, text CSV
6. Cleaning the LDA-tagged output (removing parentheses etc.)
7. Evaluating LDA performance against user research-derived taxons
This repo includes scripts to perform steps 4 (split_tier_to_urltext_for_lda.py) and 6 (clean_LDAoutput_forR.py).
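Step 2 is not covered by the scripts in this repo, but the sketch below shows the general idea, assuming the public GOV.UK content API and that page copy sits under `details.body`. The API root, field names and example paths here are assumptions for illustration, not the interface of the actual lookup script.

```python
# Rough sketch of step 2: look up each GOV.UK path on the content API and
# build a url, text table. Endpoint and field names are assumptions.
import pandas as pd
import requests

API_ROOT = "https://www.gov.uk/api/content"  # assumed content API root


def fetch_text(path):
    """Return the body text for one GOV.UK path, or '' if the lookup fails."""
    response = requests.get(API_ROOT + path)
    if response.status_code != 200:
        return ""
    # Many content items carry their copy under details.body (as HTML);
    # real pre-processing would strip the markup before topic modelling.
    return response.json().get("details", {}).get("body", "")


def build_urltext_csv(paths, out_fpath):
    rows = [{"url": p, "text": fetch_text(p)} for p in paths]
    pd.DataFrame(rows, columns=["url", "text"]).to_csv(out_fpath, index=False)


# Hypothetical usage with placeholder paths:
# build_urltext_csv(["/child-benefit", "/student-finance"], "urltext.csv")
```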
These scripts were written and tested in a Python 2.7 environment.
To install the required packages:
pip install -r requirements.txt
To perform step 4 using the example data provided, call the script from your console as follows:
python split_tier_to_urltext_for_lda.py --out_path ../DATA/output --preLDA_fpath example_preLDA_input.csv --tagged_fpath example_tag_input.csv
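For orientation, the sketch below illustrates the kind of splitting step 4 performs: merging the pre-LDA url, text table with the LDA-tagged table and writing one CSV per most-probable topic. The column names (`url`, `text`, `topic`) are placeholders rather than the script's real schema, so refer to split_tier_to_urltext_for_lda.py for the actual interface.

```python
# Illustrative version of the step 4 split: one url, text CSV per topic.
# Column names are placeholders, not the script's real schema.
import os
import pandas as pd


def split_by_topic(preLDA_fpath, tagged_fpath, out_path):
    urltext = pd.read_csv(preLDA_fpath)   # assumed columns: url, text
    tags = pd.read_csv(tagged_fpath)      # assumed columns: url, topic
    merged = urltext.merge(tags, on="url", how="inner")
    for topic, group in merged.groupby("topic"):
        out_fpath = os.path.join(out_path, "urltext_topic_{}.csv".format(topic))
        group[["url", "text"]].to_csv(out_fpath, index=False)


# Mirrors the console arguments above:
# split_by_topic("example_preLDA_input.csv", "example_tag_input.csv", "../DATA/output")
```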
To perform step 6 using the example data provided, call the script from your console as follows:
python clean_lda_out.py --out_path ../DATA/education/clean_lda_output/clean_tier1_educ.csv --taxonfile ../DATA/education/educ_link_taxonpath.csv --raw_lda ../DATA/education/raw_lda_tag_output/example_tag_input.csv
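The cleaning in step 6 amounts to stripping the parentheses, quotes and other punctuation that raw LDA tag strings carry. The sketch below shows one way that could look; the real rules live in the cleaning script and may differ.

```python
# Illustrative clean-up of a raw LDA tag string; see the cleaning script
# for the rules actually used in this repo.
import re


def clean_topic_string(raw):
    """Strip parentheses, square brackets and quotes, then tidy whitespace."""
    cleaned = re.sub(r"[()\[\]'\"]", "", raw)
    return re.sub(r"\s+", " ", cleaned).strip()


print(clean_topic_string('(0, "early years, school funding")'))
# -> 0, early years, school funding
```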
pytest will be used to run tests once they've been written!
These tests will be unit tests of each function and tests of the output format.
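As a hypothetical example, an output-format test might look like the sketch below; the path and column names are placeholders for whichever CSV the real tests end up inspecting.

```python
# Placeholder pytest-style check on the output format.
import pandas as pd


def test_output_has_url_and_text_columns():
    # Placeholder path; real tests would point at the scripts' actual output.
    output = pd.read_csv("../DATA/output/urltext_topic_0.csv")
    assert list(output.columns[:2]) == ["url", "text"]
    assert not output["url"].isnull().any()
```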
- Ellie King - Initial work - ellieking17
- Nicky Zachariou - Debugging - myst3ria
- Matt Upson - Helpful suggestions for debugging and git muddles
- Andrea Grandi - Suggestions for filtering dfs
- David Read - Looping help