NorthGuard / dasg

Implements a DAG (also called a DAWG or DASG) that can be used to query a set of sequences. The graph supports fast existence lookups and searches for related sequences by edit distance.

Directed Acyclic Sequence Graph

Implements a DAG that can be used to query a set of sequences. Once a graph has been built from a set of sequences, the class can quickly check whether a sequence exists and search for related sequences by edit distance.

Also called a Directed Acyclic Word Graph (although this implementation is not restricted to strings) or a Deterministic Acyclic Finite State Automaton.
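To illustrate why the structure is not restricted to strings, here is a minimal sketch of an existence lookup over sequences of arbitrary hashable items. It is a plain prefix tree: it shares common prefixes only, whereas a DAWG would also merge common suffixes. The class `SequenceTrie` and its methods are hypothetical names chosen for this example and are not part of the dasg API.

```python
class SequenceTrie:
    """Toy prefix tree over sequences of hashable items (illustrative only)."""

    def __init__(self, sequences=()):
        self.children = {}    # maps one item to the next node
        self.is_end = False   # True if a stored sequence ends at this node
        for sequence in sequences:
            self.add(sequence)

    def add(self, sequence):
        node = self
        for item in sequence:
            child = node.children.get(item)
            if child is None:
                child = SequenceTrie()
                node.children[item] = child
            node = child
        node.is_end = True

    def __contains__(self, sequence):
        # Follow one edge per item; the cost depends on the length of the
        # query, not on how many sequences are stored.
        node = self
        for item in sequence:
            node = node.children.get(item)
            if node is None:
                return False
        return node.is_end


# Strings are sequences of characters ...
words = SequenceTrie(["pass", "past", "chill"])
print("pass" in words)
print("pas" in words)

# ... and tuples of tokens work the same way.
sentences = SequenceTrie([("the", "cold", "hill"), ("the", "cold", "wind")])
print(("the", "cold", "hill") in sentences)
```

A stored prefix such as `"pas"` is not reported as present, because no sequence ends at that node.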

Basic usage

The DASG is most useful when working with large sets of sequences that need to be queried for related sequences by edit distance. It is used as follows:

# 'the_sequences' is a list of sequences.
# 'query' is a single sequence.
the_dasg = DirectedAcyclicSequenceGraph(the_sequences)
related_sequences = the_dasg.edit_distance_search(query_sequence=query, max_cost=2, sort=True)

Visual Example

Comparison with editdistance

editdistance is a Python library that computes the edit distance between strings fairly quickly. Using it to search for related sequences requires computing the distance from the query to every other sequence, so the cost scales linearly with the number of sequences. A DASG can perform this search faster by traversing a graph. This search has logarithmic complexity for exact search, but the cost grows as the maximum edit distance is raised.
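The difference between the two approaches can be sketched in a few lines. The code below is an assumption-laden illustration, not the dasg implementation: `levenshtein` is a pure-Python stand-in for a library distance call, `linear_search` compares the query against every sequence, and `trie_search` walks a prefix tree while carrying one row of the edit-distance table per node, abandoning a branch as soon as every entry in its row exceeds `max_cost`. All function names are hypothetical, and the prefix tree shares prefixes only (a DAWG would also merge suffixes).

```python
END = object()  # sentinel key marking the end of a stored sequence


def levenshtein(a, b):
    """Plain dynamic-programming edit distance between two sequences."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (item_a != item_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def linear_search(vocabulary, query, max_cost):
    """Compare the query with every sequence in the vocabulary."""
    hits = [(word, levenshtein(query, word)) for word in vocabulary]
    hits = [hit for hit in hits if hit[1] <= max_cost]
    return sorted(hits, key=lambda hit: (hit[1], hit[0]))


def build_trie(vocabulary):
    """Nested dicts: one edge per item, END maps to the stored sequence."""
    root = {}
    for word in vocabulary:
        node = root
        for item in word:
            node = node.setdefault(item, {})
        node[END] = word
    return root


def trie_search(root, query, max_cost):
    """Walk the trie, extending the edit-distance table one row per edge."""
    results = []

    def visit(node, row):
        # row[j] is the distance between the path so far and query[:j].
        if END in node and row[-1] <= max_cost:
            results.append((node[END], row[-1]))
        for item, child in node.items():
            if item is END:
                continue
            new_row = [row[0] + 1]
            for j, query_item in enumerate(query, start=1):
                new_row.append(min(
                    row[j] + 1,
                    new_row[j - 1] + 1,
                    row[j - 1] + (query_item != item),
                ))
            # Prune: no extension of this path can get back under max_cost.
            if min(new_row) <= max_cost:
                visit(child, new_row)

    visit(root, list(range(len(query) + 1)))
    return sorted(results, key=lambda hit: (hit[1], hit[0]))


vocabulary = ["chill", "child", "hill", "chilly", "shall", "pass", "chime", "bill"]
by_scan = linear_search(vocabulary, "chill", max_cost=2)
by_trie = trie_search(build_trie(vocabulary), "chill", max_cost=2)
assert by_scan == by_trie
print(by_trie)
```

Both functions return the same hits; the traversal simply avoids recomputing the table for shared prefixes and skips whole subtrees that are already too far from the query. Raising `max_cost` weakens the pruning, which is why the search becomes more expensive at higher edit distances.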

Below is a comparison of two approaches: holding one's vocabulary in a set and using editdistance to find similar words, versus using a DASG. The setup is built on a fairly large text file (1.1 GB). The DASG is the faster option if more than about 10 queries are to be made (because of its build time).

Analysing text-file.

---------------------------------------------------------------------------
DASG

Creating DirectedAcyclicSequenceGraph...
Creation of DirectedAcyclicSequenceGraph with 2712289 sequences took 159.042s.

Read 2712289 words into 1459827 nodes and 3243435 edges

'pass' is in the DASG.

Words closest to 'chill':
[('chill', 0), ('"hill', 1), ("'chill", 1), ("'hill", 1), ('Chill', 1), ('Schill', 1), ('Whill', 1), ('chiel', 1), ('child', 1), ('chile', 1), ('chili', 1), ('chill!', 1), ('chill"', 1), ("chill'", 1), ('chill)', 1), ('chill,', 1), ('chill.', 1), ('chill:', 1), ('chill;', 1), ('chill?', 1), ('chill_', 1), ('chills', 1), ('chilly', 1), ('chilt', 1), ('hill', 1), ('shill', 1), ('thill', 1), ('whill', 1), ('"Bill', 2), ('"Chill', 2), ('"Fill', 2), ('"Gill', 2), ('"Hill', 2), ('"Jill', 2), ('"Kill', 2), ('"Lill', 2), ('"Mill', 2), ('"Nill', 2), ('"Pill', 2), ('"Till', 2), ('"Vill', 2), ('"Will', 2), ('"_ill', 2), ('"bill', 2), ('"chil"', 2), ('"child', 2), ('"chill"', 2), ('"fill', 2), ('"hall', 2), ('"hell', 2), ('"hills', 2), ('"ill', 2), ('"kill', 2), ('"till', 2), ('"vill', 2), ('"will', 2), ("'Bill", 2), ("'Chill", 2), ("'Fill", 2), ("'Gill", 2), ("'Hill", 2), ("'Kill", 2), ("'Mill", 2), ("'Till", 2), ("'Will", 2), ("'bill", 2), ("'child", 2), ("'chill'", 2), ("'fill", 2), ("'haill", 2), ("'hall", 2), ("'hell", 2), ("'ill", 2), ("'kill", 2), ("'till", 2), ("'will", 2), ('(Bill', 2), ('(Hill', 2), ('(Kill', 2), ('(Mill', 2), ('(Till', 2), ('(Will', 2), ('(bill', 2), ('(child', 2), ('(ill', 2), ('(kill', 2), ('(till', 2), ('(will', 2), ('-Will', 2), ('-ill', 2), ('-will', 2), ('Achill.', 2), ('Achille', 2), ("Ah'll", 2), ('Aill', 2), ('Athill', 2), ('Bhil', 2), ('Bhil,', 2), ('Bhil.', 2), ('Bhil;', 2), ('Bhils', 2), ('Bill', 2), ('Brill', 2), ('Chil', 2), ('Chil!', 2), ('Chil,', 2), ('Child', 2), ('Chile', 2), ('Chili', 2), ('Chill,', 2), ('Chills', 2), ('Chilly', 2), ('Chilo', 2), ('Dill', 2), ('Drill', 2), ('Duill', 2), ('Fill', 2), ('Frill', 2), ('Ghyll', 2), ('Gill', 2), ('Grill', 2), ('Hill', 2), ('Ichil', 2), ('Jhil', 2), ('Jhil.', 2), ('Jill', 2), ('Khil;', 2), ('Kill', 2), ('Lill', 2), ('McGill', 2), ('Mill', 2), ('Mohill', 2), ('Neill', 2), ('Nihill', 2), ('Nill', 2), ('Ochil', 2), ('Ochils', 2), ('Phial', 2), ('Phil', 2), ('Phil!', 2), ('Phil,', 2), 
('Phil.', 2), ('Phil:', 2), ('Phil;', 2), ('Phil?', 2), ('Phile', 2), ('Philly', 2), ('Philo', 2), ('Philp', 2), ('Pill', 2), ('Quhill', 2), ('Quill', 2), ('Rill', 2), ('STill', 2), ('Schall', 2), ('Schill!', 2), ('Schill,', 2), ('Scholl', 2), ('Shall', 2), ('Shell', 2), ('Shiel', 2), ('Shill,', 2), ('Shrill', 2), ('Sill', 2), ('Skill', 2), ('Spill', 2), ('Still', 2), ('Swill', 2), ('Thall', 2), ('Thrill', 2), ('Thull', 2), ('Till', 2), ('Twill', 2), ('Uphill', 2), ('Vill', 2), ('Weill', 2), ('While', 2), ('Whilk', 2), ('Whirl', 2), ('Will', 2), ('Yuill', 2), ('Zhall', 2), ('[Bill', 2), ('[Till', 2), ('[Will', 2), ('[hills', 2), ('[till', 2), ('[will', 2), ('_Bill', 2), ('_Chill', 2), ('_Hill', 2), ('_Jill', 2), ('_Mill', 2), ('_Rill', 2), ('_Till', 2), ('_Will', 2), ('_bill', 2), ('_chili', 2), ('_chill_', 2), ('_hill_', 2), ('_ill', 2), ('_kill', 2), ('_till', 2), ('_will', 2), ('`fill', 2), ('`will', 2), ("ah'll", 2), ('baill', 2), ('bill', 2), ('caill,', 2), ('caill.', 2), ('call', 2), ('ceil', 2), ('cell', 2), ("ch'al", 2), ('chail', 2), ('chail.', 2), ('chaille', 2), ('chal', 2), ('chale', 2), ('chalk', 2), ('chals', 2), ('chawl', 2), ("che'l", 2), ('cheild', 2), ('cheils', 2), ('chel', 2), ('chela', 2), ('chi', 2), ("chi'", 2), ('chi,', 2), ('chi-', 2), ('chic', 2), ('chic!', 2), ('chic)', 2), ('chic,', 2), ('chic.', 2), ('chic;', 2), ('chic?', 2), ('chica', 2), ('chice', 2), ('chick', 2), ('chid', 2), ('chid!', 2), ('chid,', 2), ('chid.', 2), ('chid;', 2), ('chide', 2), ('chief', 2), ('chiel!', 2), ('chiel"', 2), ("chiel'", 2), ('chiel,', 2), ('chiel.', 2), ('chiel;', 2), ('chield', 2), ('chiels', 2), ('chien', 2), ('chiep', 2), ("chil'd", 2), ('child!', 2), ('child"', 2), ("child'", 2), ('child)', 2), ('child,', 2), ('child-', 2), ('child.', 2), ('child:', 2), ('child;', 2), ('child?', 2), ('child]', 2), ('child_', 2), ('childe', 2), ('childly', 2), ('childs', 2), ('chile!', 2), ('chile,', 2), ('chile.', 2), ('chile;', 2), ('chile?', 2), ('chilen', 2), 
('chiles', 2), ('chilis', 2), ('chill!"', 2), ("chill!'", 2), ('chill";', 2), ("chill'd", 2), ("chill's", 2), ('chill),', 2), ('chill,"', 2), ("chill,'", 2), ('chill--', 2), ('chill."', 2), ("chill.'", 2), ('chill.]', 2), ('chill._', 2), ('chill?"', 2), ("chill?'", 2), ('chilled', 2), ('chillen', 2), ('chiller', 2), ('chills!', 2), ('chills,', 2), ('chills.', 2), ('chills:', 2), ('chills;', 2), ('chills?', 2), ('chillum', 2), ('chillun', 2), ('chilly!', 2), ('chilly,', 2), ('chilly.', 2), ('chilly:', 2), ('chilly;', 2), ('chilly?', 2), ('chilt,', 2), ('chime', 2), ('chimla', 2), ('chimly', 2), ('chin', 2), ('chin!', 2), ('chin)', 2), ('chin,', 2), ('chin-', 2), ('chin.', 2), ('chin:', 2), ('chin;', 2), ('chin?', 2), ('chin]', 2), ('china', 2), ('chine', 2), ('ching', 2), ('chink', 2), ('chins', 2), ('chiny', 2), ('chip', 2), ('chip!', 2), ('chip,', 2), ('chip.', 2), ('chip;', 2), ('chip?', 2), ('chips', 2), ('chirk', 2), ('chirp', 2), ('chirr', 2), ('chise', 2), ('chisel', 2), ('chist', 2), ('chit', 2), ('chit!', 2), ('chit)', 2), ('chit,', 2), ('chit-', 2), ('chit.', 2), ('chit;', 2), ('chit?', 2), ('chits', 2), ('chiv', 2), ('chive', 2), ('chivy', 2), ('choilt', 2), ('cholla', 2), ('choly', 2), ('chowl', 2), ('churl', 2), ('chyle', 2), ('ciel', 2), ('cil', 2), ('cirl', 2), ('coil', 2), ('coil!', 2), ('coil,', 2), ('coil.', 2), ('coil:', 2), ('coil;', 2), ('coil?', 2), ('coils', 2), ('coll', 2), ('cull', 2), ('dholl', 2), ('dhrill', 2), ('dill', 2), ('drill', 2), ('ehall', 2), ('evill', 2), ('ewill', 2), ('fill', 2), ('frill', 2), ('ghila', 2), ('ghyll', 2), ('gill', 2), ('grill', 2), ('h-ll', 2), ('haill', 2), ('hall', 2), ('hell', 2), ("hi'll", 2), ('hil', 2), ('hil,', 2), ('hild', 2), ('hile', 2), ('hill!', 2), ('hill"', 2), ("hill'", 2), ('hill)', 2), ('hill,', 2), ('hill-', 2), ('hill.', 2), ('hill:', 2), ('hill;', 2), ('hill?', 2), ('hill]', 2), ('hills', 2), ('hilly', 2), ('hilp', 2), ('hilt', 2), ('holl', 2), ('hull', 2), ('ill', 2), ('jhil,', 2), 
('jhils', 2), ('kill', 2), ('leill', 2), ('lill', 2), ('mill', 2), ('nill', 2), ('phial', 2), ('pill', 2), ('quhill', 2), ('quill', 2), ('rill', 2), ('schall', 2), ("sh'll", 2), ('shall', 2), ('shell', 2), ('shiel', 2), ('shild', 2), ('shill,', 2), ('shill:', 2), ('shill;', 2), ('shilly', 2), ('shily', 2), ('shll', 2), ('sholl', 2), ('shrill', 2), ('shtill', 2), ('shull', 2), ('sill', 2), ('skill', 2), ('spill', 2), ('still', 2), ('swill', 2), ('taill', 2), ('thall', 2), ('thell', 2), ('thilk', 2), ('thill,', 2), ('thrill', 2), ('thtill', 2), ('till', 2), ('toill', 2), ('trill', 2), ('twill', 2), ('uphill', 2), ('vhile', 2), ('vill', 2), ('weill', 2), ('while', 2), ('whilk', 2), ('whilly', 2), ('whirl', 2), ('whull', 2), ('will', 2), ('wrill', 2), ('yill', 2), ('zhall', 2), ('{will', 2)]

Querying of DASG: 0.351s

---------------------------------------------------------------------------
Set and editdistance

Creating vocabulary set

'pass' is in the vocabulary set.

Words closest to 'chill':
[('chill', 0), ('"hill', 1), ("'chill", 1), ("'hill", 1), ('Chill', 1), ('Schill', 1), ('Whill', 1), ('chiel', 1), ('child', 1), ('chile', 1), ('chili', 1), ('chill!', 1), ('chill"', 1), ("chill'", 1), ('chill)', 1), ('chill,', 1), ('chill.', 1), ('chill:', 1), ('chill;', 1), ('chill?', 1), ('chill_', 1), ('chills', 1), ('chilly', 1), ('chilt', 1), ('hill', 1), ('shill', 1), ('thill', 1), ('whill', 1), ('"Bill', 2), ('"Chill', 2), ('"Fill', 2), ('"Gill', 2), ('"Hill', 2), ('"Jill', 2), ('"Kill', 2), ('"Lill', 2), ('"Mill', 2), ('"Nill', 2), ('"Pill', 2), ('"Till', 2), ('"Vill', 2), ('"Will', 2), ('"_ill', 2), ('"bill', 2), ('"chil"', 2), ('"child', 2), ('"chill"', 2), ('"fill', 2), ('"hall', 2), ('"hell', 2), ('"hills', 2), ('"ill', 2), ('"kill', 2), ('"till', 2), ('"vill', 2), ('"will', 2), ("'Bill", 2), ("'Chill", 2), ("'Fill", 2), ("'Gill", 2), ("'Hill", 2), ("'Kill", 2), ("'Mill", 2), ("'Till", 2), ("'Will", 2), ("'bill", 2), ("'child", 2), ("'chill'", 2), ("'fill", 2), ("'haill", 2), ("'hall", 2), ("'hell", 2), ("'ill", 2), ("'kill", 2), ("'till", 2), ("'will", 2), ('(Bill', 2), ('(Hill', 2), ('(Kill', 2), ('(Mill', 2), ('(Till', 2), ('(Will', 2), ('(bill', 2), ('(child', 2), ('(ill', 2), ('(kill', 2), ('(till', 2), ('(will', 2), ('-Will', 2), ('-ill', 2), ('-will', 2), ('Achill.', 2), ('Achille', 2), ("Ah'll", 2), ('Aill', 2), ('Athill', 2), ('Bhil', 2), ('Bhil,', 2), ('Bhil.', 2), ('Bhil;', 2), ('Bhils', 2), ('Bill', 2), ('Brill', 2), ('Chil', 2), ('Chil!', 2), ('Chil,', 2), ('Child', 2), ('Chile', 2), ('Chili', 2), ('Chill,', 2), ('Chills', 2), ('Chilly', 2), ('Chilo', 2), ('Dill', 2), ('Drill', 2), ('Duill', 2), ('Fill', 2), ('Frill', 2), ('Ghyll', 2), ('Gill', 2), ('Grill', 2), ('Hill', 2), ('Ichil', 2), ('Jhil', 2), ('Jhil.', 2), ('Jill', 2), ('Khil;', 2), ('Kill', 2), ('Lill', 2), ('McGill', 2), ('Mill', 2), ('Mohill', 2), ('Neill', 2), ('Nihill', 2), ('Nill', 2), ('Ochil', 2), ('Ochils', 2), ('Phial', 2), ('Phil', 2), ('Phil!', 2), ('Phil,', 2), 
('Phil.', 2), ('Phil:', 2), ('Phil;', 2), ('Phil?', 2), ('Phile', 2), ('Philly', 2), ('Philo', 2), ('Philp', 2), ('Pill', 2), ('Quhill', 2), ('Quill', 2), ('Rill', 2), ('STill', 2), ('Schall', 2), ('Schill!', 2), ('Schill,', 2), ('Scholl', 2), ('Shall', 2), ('Shell', 2), ('Shiel', 2), ('Shill,', 2), ('Shrill', 2), ('Sill', 2), ('Skill', 2), ('Spill', 2), ('Still', 2), ('Swill', 2), ('Thall', 2), ('Thrill', 2), ('Thull', 2), ('Till', 2), ('Twill', 2), ('Uphill', 2), ('Vill', 2), ('Weill', 2), ('While', 2), ('Whilk', 2), ('Whirl', 2), ('Will', 2), ('Yuill', 2), ('Zhall', 2), ('[Bill', 2), ('[Till', 2), ('[Will', 2), ('[hills', 2), ('[till', 2), ('[will', 2), ('_Bill', 2), ('_Chill', 2), ('_Hill', 2), ('_Jill', 2), ('_Mill', 2), ('_Rill', 2), ('_Till', 2), ('_Will', 2), ('_bill', 2), ('_chili', 2), ('_chill_', 2), ('_hill_', 2), ('_ill', 2), ('_kill', 2), ('_till', 2), ('_will', 2), ('`fill', 2), ('`will', 2), ("ah'll", 2), ('baill', 2), ('bill', 2), ('caill,', 2), ('caill.', 2), ('call', 2), ('ceil', 2), ('cell', 2), ("ch'al", 2), ('chail', 2), ('chail.', 2), ('chaille', 2), ('chal', 2), ('chale', 2), ('chalk', 2), ('chals', 2), ('chawl', 2), ("che'l", 2), ('cheild', 2), ('cheils', 2), ('chel', 2), ('chela', 2), ('chi', 2), ("chi'", 2), ('chi,', 2), ('chi-', 2), ('chic', 2), ('chic!', 2), ('chic)', 2), ('chic,', 2), ('chic.', 2), ('chic;', 2), ('chic?', 2), ('chica', 2), ('chice', 2), ('chick', 2), ('chid', 2), ('chid!', 2), ('chid,', 2), ('chid.', 2), ('chid;', 2), ('chide', 2), ('chief', 2), ('chiel!', 2), ('chiel"', 2), ("chiel'", 2), ('chiel,', 2), ('chiel.', 2), ('chiel;', 2), ('chield', 2), ('chiels', 2), ('chien', 2), ('chiep', 2), ("chil'd", 2), ('child!', 2), ('child"', 2), ("child'", 2), ('child)', 2), ('child,', 2), ('child-', 2), ('child.', 2), ('child:', 2), ('child;', 2), ('child?', 2), ('child]', 2), ('child_', 2), ('childe', 2), ('childly', 2), ('childs', 2), ('chile!', 2), ('chile,', 2), ('chile.', 2), ('chile;', 2), ('chile?', 2), ('chilen', 2), 
('chiles', 2), ('chilis', 2), ('chill!"', 2), ("chill!'", 2), ('chill";', 2), ("chill'd", 2), ("chill's", 2), ('chill),', 2), ('chill,"', 2), ("chill,'", 2), ('chill--', 2), ('chill."', 2), ("chill.'", 2), ('chill.]', 2), ('chill._', 2), ('chill?"', 2), ("chill?'", 2), ('chilled', 2), ('chillen', 2), ('chiller', 2), ('chills!', 2), ('chills,', 2), ('chills.', 2), ('chills:', 2), ('chills;', 2), ('chills?', 2), ('chillum', 2), ('chillun', 2), ('chilly!', 2), ('chilly,', 2), ('chilly.', 2), ('chilly:', 2), ('chilly;', 2), ('chilly?', 2), ('chilt,', 2), ('chime', 2), ('chimla', 2), ('chimly', 2), ('chin', 2), ('chin!', 2), ('chin)', 2), ('chin,', 2), ('chin-', 2), ('chin.', 2), ('chin:', 2), ('chin;', 2), ('chin?', 2), ('chin]', 2), ('china', 2), ('chine', 2), ('ching', 2), ('chink', 2), ('chins', 2), ('chiny', 2), ('chip', 2), ('chip!', 2), ('chip,', 2), ('chip.', 2), ('chip;', 2), ('chip?', 2), ('chips', 2), ('chirk', 2), ('chirp', 2), ('chirr', 2), ('chise', 2), ('chisel', 2), ('chist', 2), ('chit', 2), ('chit!', 2), ('chit)', 2), ('chit,', 2), ('chit-', 2), ('chit.', 2), ('chit;', 2), ('chit?', 2), ('chits', 2), ('chiv', 2), ('chive', 2), ('chivy', 2), ('choilt', 2), ('cholla', 2), ('choly', 2), ('chowl', 2), ('churl', 2), ('chyle', 2), ('ciel', 2), ('cil', 2), ('cirl', 2), ('coil', 2), ('coil!', 2), ('coil,', 2), ('coil.', 2), ('coil:', 2), ('coil;', 2), ('coil?', 2), ('coils', 2), ('coll', 2), ('cull', 2), ('dholl', 2), ('dhrill', 2), ('dill', 2), ('drill', 2), ('ehall', 2), ('evill', 2), ('ewill', 2), ('fill', 2), ('frill', 2), ('ghila', 2), ('ghyll', 2), ('gill', 2), ('grill', 2), ('h-ll', 2), ('haill', 2), ('hall', 2), ('hell', 2), ("hi'll", 2), ('hil', 2), ('hil,', 2), ('hild', 2), ('hile', 2), ('hill!', 2), ('hill"', 2), ("hill'", 2), ('hill)', 2), ('hill,', 2), ('hill-', 2), ('hill.', 2), ('hill:', 2), ('hill;', 2), ('hill?', 2), ('hill]', 2), ('hills', 2), ('hilly', 2), ('hilp', 2), ('hilt', 2), ('holl', 2), ('hull', 2), ('ill', 2), ('jhil,', 2), 
('jhils', 2), ('kill', 2), ('leill', 2), ('lill', 2), ('mill', 2), ('nill', 2), ('phial', 2), ('pill', 2), ('quhill', 2), ('quill', 2), ('rill', 2), ('schall', 2), ("sh'll", 2), ('shall', 2), ('shell', 2), ('shiel', 2), ('shild', 2), ('shill,', 2), ('shill:', 2), ('shill;', 2), ('shilly', 2), ('shily', 2), ('shll', 2), ('sholl', 2), ('shrill', 2), ('shtill', 2), ('shull', 2), ('sill', 2), ('skill', 2), ('spill', 2), ('still', 2), ('swill', 2), ('taill', 2), ('thall', 2), ('thell', 2), ('thilk', 2), ('thill,', 2), ('thrill', 2), ('thtill', 2), ('till', 2), ('toill', 2), ('trill', 2), ('twill', 2), ('uphill', 2), ('vhile', 2), ('vill', 2), ('weill', 2), ('while', 2), ('whilk', 2), ('whilly', 2), ('whirl', 2), ('whull', 2), ('will', 2), ('wrill', 2), ('yill', 2), ('zhall', 2), ('{will', 2)]

Querying of set and editdistance: 18.4s
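The break-even point between the two approaches can be estimated from the timings in the log above. This is a rough sketch under two assumptions that the log does not confirm: the time to build the vocabulary set is neglected (it is not reported), and the single measured query of each kind is taken as representative of every query.

```python
import math

build_time_s = 159.042   # DASG creation time reported in the log
dasg_query_s = 0.351     # one DASG query
scan_query_s = 18.4      # one set + editdistance query


def total_dasg(n_queries):
    """Total time for the DASG approach: build once, then query."""
    return build_time_s + n_queries * dasg_query_s


def total_scan(n_queries):
    """Total time for the linear scan (set build time assumed negligible)."""
    return n_queries * scan_query_s


# Smallest number of queries for which the DASG is faster overall.
break_even = math.floor(build_time_s / (scan_query_s - dasg_query_s)) + 1
speedup_per_query = scan_query_s / dasg_query_s

print(f"DASG wins from {break_even} queries onward")
print(f"Per-query speedup: about {speedup_per_query:.0f}x")
```

Under these assumptions the DASG pays off from 9 queries onward, which is in line with the figure of about 10 quoted earlier.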

About

License: MIT License


Languages

Language: Python 100.0%