etoken
------

*UPDATE* The following problem in the Perl module has since been resolved.

An existence-table-based tokenizer implemented with hashes of hashes.

And that's it with the fancy words from me. I am not a C programmer. This
little program is a solution to the problem posed at

https://github.com/glts/Lingua-Deva/blob/master/lib/Lingua/Deva.pm#L187

Given a list of token definitions, i.e. the existence table,

    a
    ae
    abc
    abd

a hash of hashes is constructed that models the structure of all valid
tokens. On second thought, some kind of tree structure might have been more
appropriate.

                           a*
                          / \
                         e*  b
                            / \
                           c*  d*
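
For illustration, here is a minimal sketch of such a table as a plain tree in
C. It is an assumption-laden stand-in, not the code in this repository: the
names `struct node`, `node_new`, `table_add` and `table_has` are hypothetical,
and each node uses a 256-slot child array indexed by byte value in place of a
hash. The `is_end` flag plays the role of the asterisk in the diagram.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical node type: one child slot per byte value, plus an
 * endpoint flag (the '*' in the diagram). */
struct node {
    struct node *child[256];
    bool is_end;
};

/* Allocate a zeroed node; returns NULL on allocation failure. */
static struct node *node_new(void)
{
    return calloc(1, sizeof(struct node));
}

/* Add one token definition to the table rooted at `root`.
 * Returns false if memory runs out. */
static bool table_add(struct node *root, const char *token)
{
    struct node *n = root;
    for (const unsigned char *p = (const unsigned char *)token; *p; p++) {
        if (!n->child[*p] && !(n->child[*p] = node_new()))
            return false;
        n = n->child[*p];
    }
    n->is_end = true;   /* mark a valid endpoint */
    return true;
}

/* True only if `s` is exactly one of the defined tokens. */
static bool table_has(const struct node *root, const char *s)
{
    const struct node *n = root;
    for (const unsigned char *p = (const unsigned char *)s; *p && n; p++)
        n = n->child[*p];
    return n && n->is_end;
}
```

Adding "abc" creates the intermediate node for "ab" but leaves its `is_end`
flag unset, so `table_has(root, "ab")` stays false unless "ab" is added as a
token in its own right.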

This structure is then traversed repeatedly during the tokenization of some
input. Since possible endpoints are marked specially in the hash of hashes
(the asterisks in the diagram), all and only valid tokens are recognized. For
the hash of hashes pictured above, "abc" would be a valid token, but "ab"
would not.
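
The traversal can be sketched roughly as follows. This is only an
illustration under two assumptions that the text above does not confirm:
that matching is greedy (the longest valid token wins) and that input bytes
which start no token are skipped one at a time. All names (`struct tnode`,
`tnode_add`, `longest_match`, `count_tokens`) are hypothetical, and a byte-
indexed child array stands in for the hash of hashes.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical table node: '*' endpoints in the diagram map to is_end. */
struct tnode {
    struct tnode *child[256];
    bool is_end;
};

/* Insert `token` below `root`, creating nodes as needed. */
static bool tnode_add(struct tnode *root, const char *token)
{
    for (; *token; token++) {
        unsigned char c = (unsigned char)*token;
        if (!root->child[c] && !(root->child[c] = calloc(1, sizeof *root)))
            return false;
        root = root->child[c];
    }
    root->is_end = true;
    return true;
}

/* Length of the longest token that starts at `s`, or 0 if none does.
 * Walks down from the root and remembers the last endpoint passed. */
static size_t longest_match(const struct tnode *root, const char *s)
{
    size_t best = 0;
    const struct tnode *n = root;
    for (size_t i = 0; s[i]; i++) {
        n = n->child[(unsigned char)s[i]];
        if (!n)
            break;
        if (n->is_end)
            best = i + 1;
    }
    return best;
}

/* Count tokens in `input`, restarting at the root after each match.
 * Assumption: bytes that start no token are skipped one at a time. */
static size_t count_tokens(const struct tnode *root, const char *input)
{
    size_t count = 0;
    while (*input) {
        size_t len = longest_match(root, input);
        if (len > 0)
            count++;
        input += len > 0 ? len : 1;
    }
    return count;
}
```

Because only the last endpoint seen is remembered, an input beginning with
"ab" followed by something other than "c" or "d" falls back to the token "a"
rather than producing the invalid token "ab".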

The Perl function given above has the flaw that it requires all possible
prefixes of a token to be tokens too: a token "abcd" requires the tokens
"abc", "ab", and "a" to exist. Etoken does not have this limitation. On the
other hand, the Perl version does Unicode case folding; it would be foolish to
reimplement that in C, so etoken only does primitive (and optional) ASCII case
folding.
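
As a rough idea of what "primitive ASCII case folding" can mean, here is a
small sketch. The function names `ascii_fold` and `ascii_fold_str` are
hypothetical and this is not necessarily how etoken does it; the point is
only that the letters A to Z are lowered and every other byte is left alone.

```c
#include <stddef.h>

/* Fold one byte: map 'A'..'Z' to 'a'..'z', leave everything else alone.
 * Unlike tolower(3), this ignores the locale and never touches bytes
 * >= 0x80, so UTF-8 sequences pass through unchanged. */
static unsigned char ascii_fold(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? (unsigned char)(c + ('a' - 'A')) : c;
}

/* Fold a NUL-terminated string in place. */
static void ascii_fold_str(char *s)
{
    for (; *s; s++)
        *s = (char)ascii_fold((unsigned char)*s);
}
```

A non-ASCII capital such as "Ä" is stored in UTF-8 as bytes above 0x7F and is
therefore not folded, which is exactly the gap compared with Unicode case
folding.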

Check out the example run in main.c, which will read the token definitions in
"exampledef.txt" and print the separate tokens in "example.txt":

    make
    ./etoken

The Perl script etoken.pl does the same thing. For large data, the C version
is almost 20 times as fast on my machine.