simpleindex
A simple inverted index for JavaScript. An Index is used to store and retrieve objects by one or more of the terms in the object.
Indexing arbitrarily formatted objects in 3 steps
Use these steps to index an object, an XML document, a web page, or whatever else you can put in an array.
- Build a document with DocumentBuilder
- Invert the document - Build a term vector with DocumentInverter
- Index the object - Add the object with its term vector to the Index.
Building Documents
For our purposes, a document is an object whose keys are field names and whose values are either strings ready for tokenization and filtering or pre-tokenized term vectors, like this:
document = {name:"Red delicious", color:["Red"]}
Documents can be built with the DocumentBuilder and inverted (turned into a term vector) with DocumentInverter.
DocumentBuilder
The DocumentBuilder builds a dictionary object of field to value pairs, where the value is a string that is ready to be inverted.
DocumentBuilder Example
# Objects to put in index
apples = [
  {
    variety: "Golden Delicious"
    identified: 1914
    color: "Yellow"
    description: "The Golden Delicious is a cultivar of apple with a yellow color..."
  },
  {
    variety: "Red Delicious"
    identified: 1880
    color: "Red"
    description: "The Red Delicious is a clone of apple cultigen..."
  }
]
# This converter defines the fields and where to get them from the object.
converter =
  name: (d) -> d.variety
  body: (d) -> d.description
  year: (d) -> d.identified.toString()
  color: (d) -> [d.color] # a vector is treated as pre-tokenized terms
# Builds a document object - a simple dictionary of field=value
# (where value is the string to be inverted).
db = new DocumentBuilder converter
# Build one document per apple (a parenthesized comprehension returns a flat array).
documents = (db.build a for a in apples)
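To make the converter idea concrete, here is a minimal JavaScript sketch of what building a document amounts to: each converter function is applied to the source object and the result is stored under the field name. The buildDocument function is a hypothetical stand-in, not the library's DocumentBuilder, whose internals may differ.

```javascript
// Hypothetical stand-in for DocumentBuilder; the real class may work differently.
function buildDocument(converter, source) {
  const doc = {};
  // Each converter entry maps a field name to a function that extracts its value.
  for (const [field, getValue] of Object.entries(converter)) {
    doc[field] = getValue(source);
  }
  return doc;
}

const converter = {
  name: (d) => d.variety,
  year: (d) => d.identified.toString(),
  color: (d) => [d.color], // an array is treated as pre-tokenized terms
};

const doc = buildDocument(converter, {
  variety: "Red Delicious",
  identified: 1880,
  color: "Red",
});
// doc is { name: "Red Delicious", year: "1880", color: ["Red"] }
```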
DocumentInverter
The DocumentInverter takes a document object or string and converts it to a term vector. By default, DocumentInverter will use Filters to normalize terms into lower case and remove duplicate terms.
DocumentInverter Example
docInv = new DocumentInverter new DedupFilter new LowerCaseFilter()
apple = variety: "Red Delicious", identified: 1880, color: "Red"
terms = docInv.invertSync db.build apple
# terms = ["name:red", "name:delicious", "year:1880", "color:Red"]
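The inversion step can be pictured with the following JavaScript sketch. It assumes that string values are tokenized on whitespace and lowercased, that array values pass through untouched, and that every term is prefixed with its field name. These choices are inferred from the example output above; invertDocument is a hypothetical function, not the library's DocumentInverter.

```javascript
// Hypothetical sketch of document inversion; whitespace tokenization is an assumption.
function invertDocument(doc) {
  const terms = [];
  for (const [field, value] of Object.entries(doc)) {
    // Arrays are taken as pre-tokenized and left untouched;
    // strings are split on whitespace and lowercased.
    const tokens = Array.isArray(value)
      ? value
      : value.split(/\s+/).filter(Boolean).map((t) => t.toLowerCase());
    for (const token of tokens) terms.push(`${field}:${token}`);
  }
  // Drop duplicate terms while keeping first-seen order.
  return [...new Set(terms)];
}

const terms = invertDocument({
  name: "Red Delicious",
  year: "1880",
  color: ["Red"],
});
// terms is ["name:red", "name:delicious", "year:1880", "color:Red"]
```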
Indexing an Object
Now that your object has been described with a term vector, it is ready to be added to the index.
Index
An Index is used to store and retrieve objects by one or more of the terms representing the object.
Indexing Example
index = new Index()
apple = variety: "Red Delicious", identified: 1880, color: "Red"
index.addSync apple, ["name:red", "name:delicious", "year:1880", "color:Red"]
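For readers new to inverted indexes, the idea can be sketched in a few lines of JavaScript: every term maps to the set of item ids stored under it. TinyInvertedIndex and its methods are hypothetical names used for illustration only and say nothing about how the library's Index is implemented.

```javascript
// Hypothetical sketch of an inverted index; not simpleindex's actual code.
class TinyInvertedIndex {
  constructor() {
    this.items = [];           // item id -> stored object
    this.postings = new Map(); // term -> Set of item ids
  }

  // Store an object under every term in its term vector.
  add(item, terms) {
    const id = this.items.length;
    this.items.push(item);
    for (const term of terms) {
      if (!this.postings.has(term)) this.postings.set(term, new Set());
      this.postings.get(term).add(id);
    }
    return id;
  }

  // Return the objects that were indexed under the given term.
  lookup(term) {
    const ids = this.postings.get(term) || new Set();
    return [...ids].map((id) => this.items[id]);
  }
}

const tiny = new TinyInvertedIndex();
tiny.add({ variety: "Red Delicious" }, ["name:red", "name:delicious"]);
tiny.add({ variety: "Golden Delicious" }, ["name:golden", "name:delicious"]);

const delicious = tiny.lookup("name:delicious").map((a) => a.variety);
// delicious is ["Red Delicious", "Golden Delicious"]
const golden = tiny.lookup("name:golden").map((a) => a.variety);
// golden is ["Golden Delicious"]
```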
Advanced
Using Filters
Filters transform a term stream to prepare it for indexing. Filters have a .filter method, which accepts and returns an array or array-like object.
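Assuming the .filter contract is all a filter needs, a custom filter could look like the JavaScript sketch below. MinLengthFilter is a hypothetical example and not part of the library.

```javascript
// Hypothetical custom filter; only the .filter(terms) contract comes from the text above.
class MinLengthFilter {
  constructor(minLength) {
    this.minLength = minLength;
  }

  // Accepts an array or array-like object of terms and returns a new array.
  filter(terms) {
    return Array.from(terms).filter((t) => t.length >= this.minLength);
  }
}

const kept = new MinLengthFilter(3).filter(["a", "an", "apple", "pie"]);
// kept is ["apple", "pie"]
```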
Standard Filters
These filters ought to get you started.
DedupFilter - Removes duplicate terms from the term stream
new DedupFilter()
new DedupFilter(subfilter)
LowerCaseFilter - Yields terms converted to lowercase
new LowerCaseFilter()
new LowerCaseFilter(subfilter)
StopWordFilter - Yields terms that are not in the configurable list of stopwords
new StopWordFilter(stopwordsArray)
new StopWordFilter(stopwordsArray, subfilter)
PrefixFilter - Yields terms prepended with a string
new PrefixFilter(prefix)
new PrefixFilter(prefix, subfilter)
# Example:
new PrefixFilter("tag:").filter(['salad', 'breakfast'])
# yields ['tag:salad', 'tag:breakfast']
Filter Chaining
Most filters can be chained together so that the output of one is the input of the next. Chains work inside-out: the innermost filter runs first.
For example, this combination converts each term to lowercase, then removes duplicates:
new DedupFilter(new LowerCaseFilter()).filter(["APPLE","apple", "Orange"])
# yields ["apple", "orange"]
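The inside-out order matters, as the following JavaScript sketch suggests. The two Sketch classes are hypothetical stand-ins written for this illustration; they assume each filter runs its subfilter first and then applies its own transformation, which matches the example output above but is not taken from the library's source.

```javascript
// Hypothetical stand-ins for the library's filters, written to show the chaining order.
class SketchLowerCaseFilter {
  constructor(subfilter) {
    this.subfilter = subfilter;
  }
  filter(terms) {
    // Run the inner filter first, if there is one.
    const input = this.subfilter ? this.subfilter.filter(terms) : Array.from(terms);
    return input.map((t) => t.toLowerCase());
  }
}

class SketchDedupFilter {
  constructor(subfilter) {
    this.subfilter = subfilter;
  }
  filter(terms) {
    const input = this.subfilter ? this.subfilter.filter(terms) : Array.from(terms);
    return [...new Set(input)];
  }
}

const lowerThenDedup = new SketchDedupFilter(new SketchLowerCaseFilter());
const dedupThenLower = new SketchLowerCaseFilter(new SketchDedupFilter());

const a = lowerThenDedup.filter(["APPLE", "apple", "Orange"]);
// a is ["apple", "orange"]
const b = dedupThenLower.filter(["APPLE", "apple", "Orange"]);
// b is ["apple", "apple", "orange"] because duplicates were removed before lowercasing
```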
Searching
Searching an index with an IndexSearcher and a Query
An IndexSearcher lets you query an index. A query finds all the matches in an index and returns a BitArray representing the matching documents.
lunchButNotSaladQuery = new Query (index) ->
  hits = index.getIndexesForTermSync 'tag:salad'
  hits = hits.copy() # don't edit original
  hits.not()
  hits.and index.getIndexesForTermSync 'tag:lunch'
  return hits
searcher = new IndexSearcher index
hits = searcher.search lunchButNotSaladQuery
documents = index.getItemsSync hits
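The boolean logic behind the query above can be sketched in JavaScript with plain boolean arrays standing in for the library's BitArray. The items, postings, and helper functions here are all hypothetical and exist only to show how "lunch AND NOT salad" selects documents position by position.

```javascript
// Hypothetical sketch: plain boolean arrays stand in for the library's BitArray.
const items = ["cobb salad", "club sandwich", "omelette"];

// One flag per item: true when the item was indexed under the term.
const postings = {
  "tag:salad": [true, false, false],
  "tag:lunch": [true, true, false],
};

// Both helpers return new arrays, so the stored postings are never modified.
const not = (bits) => bits.map((b) => !b);
const and = (x, y) => x.map((b, i) => b && y[i]);

// lunch AND NOT salad
const hits = and(not(postings["tag:salad"]), postings["tag:lunch"]);
const matches = items.filter((_, i) => hits[i]);
// matches is ["club sandwich"]
```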