Create an index of the most interesting words from a collection of text files
The code is split across three classes, one per file:
- DocumentReader.py [main]
- InterestingWords.py
- WordProcessor.py
Uncompress the provided zip file.

Install the requirements:
pip install -r requirements.txt
Run DocumentReader.py from the command line as follows:
Example usage:
python DocumentReader.py -i ./data/ -o ./output/interesting_and_frequent.html -s iaf
Input switches:
- -i: Path to the folder containing the input text files, e.g. ./data/
- -o: [optional] Path to the HTML output file, e.g. my_file.html
- -s: [optional] The sorting order, see below

Possible values for the sorting order:
- iaf: lists words by importance, then by frequency
- fai: lists words by frequency, then by importance

By default, words are listed by frequency only.
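For illustration only, the switches could be parsed with argparse roughly as below. This is a minimal sketch, not the actual contents of DocumentReader.py: the function name build_parser and the destination names input_dir, output_html and sort_order are assumptions made for the example.

```python
import argparse


def build_parser():
    # Hypothetical parser mirroring the documented switches;
    # the real DocumentReader.py may define them differently.
    parser = argparse.ArgumentParser(
        description="Index the most interesting words in a folder of text files"
    )
    parser.add_argument("-i", dest="input_dir", required=True,
                        help="path to the folder containing input text files")
    parser.add_argument("-o", dest="output_html", default=None,
                        help="[optional] path to the HTML output file")
    parser.add_argument("-s", dest="sort_order", choices=["iaf", "fai"],
                        default=None,
                        help="[optional] sorting order; frequency only if omitted")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["-i", "./data/", "-s", "iaf"])
print(args.input_dir, args.output_html, args.sort_order)
```

Leaving -s unset (None) stands in for the default frequency-only ordering.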
The ./output folder already contains HTML files generated by the code. Each file presents the words sorted in a different order:
- frequent_and_interesting.html - sorted by frequency, then by tf-idf score
- interesting_and_frequent.html - sorted by tf-idf score, then by frequency
- most_frequent.html - sorted from most to least frequent
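The three orderings differ only in which key is compared first. The sketch below shows the idea on a toy corpus; it is an assumption-laden illustration, not the code in InterestingWords.py. In particular, the function rank_words is hypothetical, and the score used here (total count multiplied by log(N / document frequency)) is just one simple tf-idf variant chosen for the example.

```python
import math
from collections import Counter


def rank_words(docs, order=None):
    """Rank words from tokenised documents (a list of lists of words).

    order: "iaf" (tf-idf, then frequency), "fai" (frequency, then tf-idf),
    or None (frequency only). Hypothetical helper for illustration.
    """
    n_docs = len(docs)
    freq = Counter(w for doc in docs for w in doc)           # total occurrences
    doc_freq = Counter(w for doc in docs for w in set(doc))  # documents containing w
    # Assumed tf-idf variant: corpus-wide count times inverse document frequency.
    tfidf = {w: freq[w] * math.log(n_docs / doc_freq[w]) for w in freq}

    if order == "iaf":
        key = lambda w: (-tfidf[w], -freq[w], w)
    elif order == "fai":
        key = lambda w: (-freq[w], -tfidf[w], w)
    else:
        key = lambda w: (-freq[w], w)  # alphabetical tie-break keeps output stable
    return sorted(freq, key=key)


docs = [["cat", "sat", "cat"], ["dog", "sat"], ["cat", "dog", "bird", "bird"]]
print(rank_words(docs, "iaf"))
print(rank_words(docs, "fai"))
print(rank_words(docs))
```

In this toy corpus "bird" appears in only one document, so it ranks first under iaf even though "cat" is more frequent overall.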
The code is documented with docstrings.
Sphinx-generated documentation can be viewed by opening ./documentation/index.html in a browser.
Rashed Karim,
rashed.karim@gmail.com