getify / WordGraphs

A librery containing Trie, DAWG with a strong search functionality.

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

#Word Graphs Word graph is a NodeJs/Browser Typescript friendly library providing fast implementations of the trie and Minimal word graph (AKA DAWG, DAFSA) while also trying to provide you with an easy to use API for querying data from the graphs.

#Table of contacts

Tries and DAWGS:

Tries and DAWGS are final state automatons for solving problem's in many fields such as linguistics and bioinformatics The main difference between the two is the size of the end result, In a quick test I ran a Trie containing 0.5M words had around the 1.5M states while the equivalent DAWG contained only 20 thousand.

Installation

WordGraphs can be downloaded via NodeJs

npm i word-graphs --save-dev

WordGraphs uses a UMD module so you can use it in NodeJs or in the browser.

Typescript or es 6 modules
import {Trie} from 'word-graphs';

let trie = new Trie();

trie.add('Simple');
trie.add('Example');

console.log(trie.containsAny(['mp'])); // Will print ['Simple', 'Example']
Script tag
import {Trie} from 'word-graphs';

let trie = new Trie();

trie.add('Simple');
trie.add('Example');

console.log(trie.containsAny(['mp'])); // Will print ['Simple', 'Example']

Preformance

On my pc

Some simple performance tests for the librery, ran on 0.5Mil words. My PC (i5-6600, 16GB ram windows 10 64bit).

    Add 0.5Mil words to trie: toke 1160.777 Ms or 1.160777 Sec
    Add 0.5Mil words to minimalWordGraph: toke 3162.078 Ms or 3.162078 Sec

As can be seen a trie constructs much faster then a DAWG.

    Trie size: 1060026
    Compute trie size: toke 3015.024 Ms or 3.015024 Sec
    Mwg size: 222231
    Compute mwg size: toke 1476.980 Ms or 1.47698 Sec

But a DAWG is much smaller then a trie.

    Look up long word in trie: toke 0.637 Ms or 0.000637 Sec
    Look up long word in mwg: toke 0.093 Ms or 0.000093 Sec
    Trie Starts with: toke 1.387 Ms or 0.001387 Sec
    Mwg Starts with: toke 1.010 Ms or 0.00101 Sec
    Trie Ends with: toke 1785.083 Ms or 1.785083 Sec
    Mwg Ends with: toke 1752.311 Ms or 1.752311 Sec
    QueryBuilder Trie startsWith, endsWith, containsAny: toke 55.003 Ms or 0.055003 Sec
    QueryBuilder Mwg startsWith, endsWith, containsAny: toke 51.406 Ms or 0.051406 Sec

As you can see a DAWG may be a bit faster then a trie.

    Minimize trie, affectivly transforming it to a dawg.: toke 2428.354 Ms or 2.4283539999999997 Sec

This is not linear and will take MUCH more time as the size grows.

 Edit distance Trie.: toke 5439.492 Ms or 5.439492 Sec
 Edit distance MWG.: toke 5241.073 Ms or 5.241073 Sec
 Edit distance Trie, Max results set to 3, searching words similar to "pop".: toke 2.090 Ms or 0.00209 Sec
 Edit distance MWG. Max results set to 3, searching words similar to "pop".: toke 1.860 Ms or 0.00186 Sec

As you can spellchecking can be really fast or really slow, it depends on the word, limiting the number of results to 3 may help allot.

Running the tests

To run the tests simply:

  1. Clone or fork this repository.
  2. Run npm i in the root folder.
  3. cd ./src/PerformanceTests
  4. node node performanceTests.js

Good luck :)

Usage

Trie

Trie can accept words in any order:

    // Creating a new Trie.
    let trie = new Trie();
    
    // Add some words
    trie.add('Cat');
    trie.add('Bat');
    trie.add('Tool');
    trie.add('David');
    
    // Lookup word
    trie.lookup('Cat') // true
    trie.lookup('Human') // false
    trie.lookup('cat') // false, graphs are case insensitive (autoConvert Option will be added in future).

MinimalWordGraph Aka DAGW

MinimalWordGraph or DAGW needs words to be inserted in ascending alphabetical order:

    // Creating a new MinimalWordGraph.
    let mwg = new MinimalWordGraph();
    
    // Add some words
    mwg.add('Cat');
    mwg.add('Bat'); // Will error!
    
    mwg = new MinimalWordGraph();
    mwg.add('Cat');
    mwg.add('Catnip');
    mwg.add('Turnip');
    
    // Lookup word
    mwg.lookup('Cat') // true
    mwg.lookup('Human') // false
    mwg.lookup('cat') // false, graphs are case insensitive (Option will be added in future).

####Making the DAGW immutable It is highly recommanded that after you have finished building the dawg you call

   mwg.makeImmutable();

This will:

  • Make the dawg immutable - calling the add() method will result in an exception.
  • Free up unneeded resources.
  • Allow the size of the dawg to be callculated once and saved thus making calls to size() O(1).

Edit distance

Edit distance is an algorithm that helps us determine the amount of operations we need to make to transform a word A into word B. I have extended the algorithm linked above to take advantage of that fact we are using an automaton.

Using the edit distance feature of wordGraphs is just as simple as the rest of tha api:

    // Lets say we have a word graph: graph and we want to search for words like 'banna' in hope to find 'banana'
    graph.similarTo('banna'); // In my dictionery we get ['Ban', 'Bana', 'Banat', 'Banba', 'Banco', 'Banda', 'Bandana'] + 4030 more items....
    
    // To many results ? What if we want to change the maximus error amount or max results?
        graph.similarTo('banna', {maxDistance:1, maxResults : 5}); // [ 'banana', 'Canna', 'Danna', 'Hanna', 'Janna' ]

Searching

WordGraphs tries you give you a powerfull yet quick search api, you can either use the built in search methods or the more advanced queryBuilder for complex queries.

Using built in methods of wordGraph

    let mwg = new MinimalWordGraph();

    mwg = new MinimalWordGraph();
    mwg.add('Cat');
    mwg.add('Catnip');
    mwg.add('taC');
    mwg.add('Turnip');
    
    mwg.startsWith('Cat'); // ['Cat', 'Catnip']
    mwg.endsWith('nip'); // ['Catnip', 'Turnip']
    mwg.containsAll('C', 'nip'); // ['Catnip']
    mwg.containsOnly('t', 'a', 'C'); // ['Cat', 'taC']
    

Using QueryBuilder

    let mwg = new MinimalWordGraph();

    mwg = new MinimalWordGraph();
    mwg.add('Cat');
    mwg.add('Catnip');
    mwg.add('taC');
    mwg.add('Turnip');
    
    let queryBuilder = new QueryBuilder(minimalWordGraph);
    queryBuilder
        .startsWith('C')
        .maxLength(3)
        .build()(); // ['Cat']
        
    queryBuilder
        .minLength(4)
        .endsWith('ip')
        .containsAny('t')
        .build()(); // ['Catnip', 'Turnip']
    

Credits

About

A librery containing Trie, DAWG with a strong search functionality.

License:MIT License


Languages

Language:TypeScript 83.9%Language:JavaScript 16.1%