Introduction

Basic information

Creates an index of the most interesting words from a collection of text files.

Code

The code is organised into three Python files:

DocumentReader.py [main]
InterestingWords.py
WordProcessor.py

Instructions to run:

Uncompress the provided zip file.
Install the requirements:
pip install -r requirements.txt
Run DocumentReader.py from the command line as follows:

Example usage: python DocumentReader.py -i ./data/ -o ./output/interesting_and_frequent.html -s iaf

  Input switches:  
  -i - Path to the folder containing the input text files, e.g. ./data/  
  -o - [optional] Path to the output HTML file, e.g. my_file.html  
  -s - [optional] The sorting order, see below

  Possible values for the sorting order:  
  iaf - lists words by importance, then by frequency  
  fai - lists words by frequency, then by importance  
  By default, words are listed by frequency only.
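The switches above could be parsed along the following lines. This is only a minimal sketch using the standard argparse module, not the actual code in DocumentReader.py: the function name build_parser and the destination names input_dir, output_html and sort_order are hypothetical, and treating -i as required is an assumption based on -o and -s being marked optional.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the -i, -o and -s switches (names are assumptions)."""
    parser = argparse.ArgumentParser(
        description="Index the most interesting words in a folder of text files."
    )
    # Assumed to be required, since only -o and -s are described as optional.
    parser.add_argument("-i", dest="input_dir", required=True,
                        help="path to the folder containing the input text files")
    parser.add_argument("-o", dest="output_html", default=None,
                        help="path to the output HTML file")
    # None stands for the default behaviour: list words by frequency only.
    parser.add_argument("-s", dest="sort_order", choices=["iaf", "fai"], default=None,
                        help="sorting order: iaf or fai")
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(["-i", "./data/", "-s", "iaf"])
print(args.input_dir, args.output_html, args.sort_order)
```

Restricting -s with choices makes argparse reject any value other than iaf or fai, which keeps the "frequency only" default unambiguous.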

List of most frequent interesting words as HTML:

The ./output folder already contains HTML files generated by the code. Each file presents the words sorted in a different order:

frequent_and_interesting.html - sorted by frequency, then by tf-idf score
interesting_and_frequent.html - sorted by tf-idf score, then by frequency
most_frequent.html - sorted from most to least frequent
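The three orderings can be illustrated with a small sketch. It is not the implementation in InterestingWords.py or WordProcessor.py: the functions score_words and sort_words are hypothetical, the documents are assumed to be already tokenised, and the tf-idf used here is a simplified corpus-level variant (total count of a word multiplied by the log of the inverse document frequency).

```python
import math
from collections import Counter


def score_words(documents):
    """Return {word: (frequency, tfidf)} for a list of tokenised documents.

    Simplified, assumed scoring: frequency is the total count across all
    documents; tfidf is frequency * log(n_docs / docs_containing_word).
    """
    total = Counter()
    doc_freq = Counter()
    for tokens in documents:
        total.update(tokens)
        doc_freq.update(set(tokens))  # count each word once per document
    n_docs = len(documents)
    return {
        word: (freq, freq * math.log(n_docs / doc_freq[word]))
        for word, freq in total.items()
    }


def sort_words(scores, order=None):
    """Sort words by the orders described above; ties are broken alphabetically."""
    if order == "iaf":    # importance (tf-idf) first, then frequency
        key = lambda w: (-scores[w][1], -scores[w][0], w)
    elif order == "fai":  # frequency first, then importance (tf-idf)
        key = lambda w: (-scores[w][0], -scores[w][1], w)
    else:                 # default: frequency only
        key = lambda w: (-scores[w][0], w)
    return sorted(scores, key=key)


docs = [["the", "cat", "sat", "cat"], ["the", "dog", "the"], ["the", "cat"]]
scores = score_words(docs)
print(sort_words(scores, "iaf"))  # ['cat', 'dog', 'sat', 'the']
print(sort_words(scores, "fai"))  # ['the', 'cat', 'dog', 'sat']
print(sort_words(scores))         # ['the', 'cat', 'dog', 'sat']
```

In this toy corpus "the" appears in every document, so its tf-idf score is zero and it drops to the end of the importance-first ordering even though it is the most frequent word.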

Sphinx Documentation

The code is documented with docstrings.
Sphinx-generated documentation can be found at ./documentation/index.html

Contact:

    Rashed Karim, 
    rashed.karim@gmail.com
