local-document-index

An implementation of local document index similar to elasticsearch, which can handle large number of documents and terms without using excessive memory or disk I/O.

The project can be separated in 4 parts: tokenizing, indexing, merging and retrieving.

Tokenizing

Dataset

This project uses trec_ap89 corpus that can be found from TREC.
The total size of the dataset is 284.1 MB.
Total number of documents is 84678.

Clearning files

Use Jsoup to read the HTML like formated files from the dataset.
For each file, replace characters that matches the regular expression "[.]+[^A-Za-z0-9]" with " " and replace "[.]+[^A-Za-z0-9]" with " ".
After that only words, numbers and terms like "127.0.0.0" are left.

Tokenizing

Convert each term in a document to a (term_id, doc_id, position) tuple.
Store term and term_id in a catalogue file.

For example, in document with doc_id 20:

The car was in the car wash.

After tokenizing it is converted to :

(1, 20, 1), (2, 20, 2), (3, 20, 3), (4, 20, 4), (1, 20, 5), (2, 20, 6), (5, 20, 7)

In catalogue:

1: the 
2: car 
3: was 
4: in 
5: wash

Stopping and stemming

Load this stop list in hash table and skip those words.
Use PorterStemmer from Open NLP library to stem terms.

Indexing

Record tokens from all documents and save them in the termlist file with following format(df for document frequency, ttf for total term frequency, tf for term frequency):

term_id df ttf;doc_id1 tf position1 position2 ...;doc_id2 tf position1 posiont2 ...; ...

For example, 1 1 1;47407 1 258 means term_id 1 appears once in the whole dataset and appears in only one document. In document_id 47407 it appears once in position 258.

Record term_id, term and the offset in termlist for all terms to catalogue file.

For exapmle, 2 flashier 33 72 means term number 2 is "flashier", and it is store in position 33 with length 72 in the termlist file.

Record document name, document_id and document length in docInfo file.

Merging

Retrieving

License

This project is under the MIT license.

About

An implementation of local document index similar to elasticsearch, which can handle large number of documents and terms without using excessive memory or disk I/O.

MIT License

Languages

Language:Java 100.0%