drkarim / Word-Indexer

Creates a word index from a big collection of text documents

Introduction

Basic information

Creates an index of the most interesting words from a collection of text files.

Code

The code is organised into three classes:

  • DocumentReader.py   [main]
  • InterestingWords.py 
  • WordProcessor.py

Instructions to run: 

  • Uncompress the zip file provided

  • Install the requirements: 
    pip install -r requirements.txt 

  • Run DocumentReader.py from the command line, for example: 
    python DocumentReader.py -i ./data/ -o ./output/interesting_and_frequent.html -s iaf

      Input switches: 
      -i - Path to the folder containing the input text files, e.g. ./data/  
      -o - [optional] Path to the HTML output file, e.g. my_file.html  
      -s - [optional] The sorting order, see below
    
      The sorting order can take the following values:  
      iaf - lists words by importance, then by frequency,  
      fai - lists words by frequency, then by importance.  
      By default, words are listed by frequency only.
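As a rough illustration of how the switches above could be parsed, here is a minimal sketch using `argparse`. It is not taken from DocumentReader.py: the function name `build_parser` and the attribute names `input_dir`, `output_html` and `sort_order` are assumptions made for this example, and the real script may handle its arguments differently.

```python
import argparse


def build_parser():
    # Hypothetical parser mirroring the documented switches; the actual
    # DocumentReader.py may define them differently.
    parser = argparse.ArgumentParser(prog="DocumentReader.py")
    parser.add_argument("-i", dest="input_dir", required=True,
                        help="path to the folder containing input text files")
    parser.add_argument("-o", dest="output_html", default=None,
                        help="[optional] path to the HTML output file")
    parser.add_argument("-s", dest="sort_order", choices=["iaf", "fai"],
                        default=None,
                        help="[optional] sorting order; frequency only if omitted")
    return parser


# Parse the example command line from the instructions above.
args = build_parser().parse_args(
    ["-i", "./data/", "-o", "./output/interesting_and_frequent.html", "-s", "iaf"]
)
print(args.input_dir, args.output_html, args.sort_order)
```

Leaving `-s` unset (`None`) in this sketch stands for the documented default of listing words by frequency only.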
    

List of most frequent interesting words as HTML: 

The ./output folder already contains HTML files generated by the code. Each file presents the words sorted in a different order:  

  • frequent_and_interesting.html  - sorted by frequency, then by tf-idf score 
  • interesting_and_frequent.html  - sorted by tf-idf score, then by frequency 
  • most_frequent.html  - sorted from most to least frequent 
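
The three orderings above can be sketched as sort keys over a (frequency, tf-idf) pair per word. This is only an illustration under stated assumptions: the repository does not spell out its exact tf-idf formula here, so the sketch uses a simple corpus-level variant (relative frequency times the log of inverse document frequency), and the helper names `score_words` and `sort_words` are hypothetical.

```python
import math
from collections import Counter


def score_words(docs):
    """Map each word to (corpus frequency, tf-idf). docs is a list of token lists.

    Assumed formula: tf = count / total tokens, idf = log(N / document frequency).
    """
    n_docs = len(docs)
    freq = Counter(w for doc in docs for w in doc)
    doc_freq = Counter(w for doc in docs for w in set(doc))
    total = sum(freq.values())
    return {
        w: (f, (f / total) * math.log(n_docs / doc_freq[w]))
        for w, f in freq.items()
    }


def sort_words(scores, order=None):
    """Return words sorted by 'iaf', 'fai', or by frequency only (default)."""
    if order == "iaf":      # tf-idf first, frequency as tie-breaker
        key = lambda w: (-scores[w][1], -scores[w][0], w)
    elif order == "fai":    # frequency first, tf-idf as tie-breaker
        key = lambda w: (-scores[w][0], -scores[w][1], w)
    else:                   # frequency only
        key = lambda w: (-scores[w][0], w)
    return sorted(scores, key=key)


docs = [["the", "cat", "sat"], ["the", "dog", "sat", "sat"], ["the", "bird"]]
scores = score_words(docs)
for order in ("iaf", "fai", None):
    print(order, sort_words(scores, order))
```

In this toy corpus a word such as "the", which appears in every document, gets a tf-idf of zero and so drops to the bottom under `iaf` while staying near the top under `fai`.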

Sphinx Documentation 

The code is documented with docstrings.
Sphinx-generated documentation is available at:  ./documentation/index.html

Contact:

    Rashed Karim, 
    rashed.karim@gmail.com
