mischasan / aho-corasick

A-C implementation in "C". Tight-packed (interleaved) state-transition matrix -- as fast as it gets, as small as it gets.

callback function text position value

alonbg opened this issue · comments

commented

cb(int strnum, int textpos, )
The callback function passed to acism_more/acism_scan is called with a textpos value equal to the length of the input string. That length is known in advance; if desired, it can be passed as part of the context.
It would be much more useful if the textpos value were the suffix position matched (i.e. the length of the string represented by strnum).

Regards,
Alon

Sorry, disagree on a few counts ...

The engine does not store string lengths, nor can it calculate them from the machine without major effort. It's not just a matter of counting forward hops from the root. As soon as the search traverses a backlink, it has no idea of the depth of that node (and doesn't need to).

You'll notice that the callback is given the end offset of the string in the match buffer. With acism_more(), the start of the matched string might not even be in the current input text buffer; so knowing the pattern-string length might be useless.

To do what you ask for, acism_create() would have to store the prefix length in every node that's the target of a backlink ... but that reduces how much data can be compiled in the 32-bit model. There are a lot of applications that only need to know the string number, so I didn't burden the engine with what can be handled using the string number ...

The caller provides the array of <length, data> pairs to acism_create(), so it knows the lengths. The caller can create a vector of lengths, and use the strnum as an O(1) lookup. That length vector can be (or be part of) the context object that's passed to the callback. That's what you need, right?
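A rough sketch of that length-vector idea, in case it helps. The callback shape (strnum, textpos, context pointer) follows this thread; the `match_ctx` struct and `on_match` name are made up for illustration and are not part of acism. It also assumes textpos is the offset just past the end of the match, and that returning 0 means "keep scanning".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical context object, owned by the caller (not an acism type). */
typedef struct {
    const size_t *lengths;    /* lengths[i] = length of pattern i, in the
                                 same order as passed to acism_create() */
    int           nstrs;      /* number of patterns */
    int           last_strnum;
    long          last_start; /* start offset of the last match; negative if
                                 the match began before the current buffer */
    int           last_end;   /* end offset of the last match (textpos) */
} match_ctx;

/* Callback: O(1) length lookup by strnum, then derive the start offset.
 * Assumption: textpos is one past the last matched byte. */
static int on_match(int strnum, int textpos, void *context)
{
    match_ctx *ctx = context;
    assert(ctx != NULL && strnum >= 0 && strnum < ctx->nstrs);

    ctx->last_strnum = strnum;
    ctx->last_end    = textpos;
    ctx->last_start  = (long)textpos - (long)ctx->lengths[strnum];
    return 0; /* assumed to mean "continue scanning" */
}
```

With acism_more(), a negative `last_start` would simply mean the match started in an earlier chunk, which is the case mentioned above where the start is not in the current buffer.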

I don't understand strlen being more useful than textpos; they are two different independent pieces of information about the match.

HTH
Mischa

commented

I appreciate you taking the time to answer, and I stand corrected :)
Thank you for this piece of software. The related model is now 20x faster.

I have ~10^6 unique suffixes. ~150 of them have additional handling.
I need only the longest match and use acism_scan.
I ended up doing what you suggested: placing them first in the input to acism_create (in an order known in advance; position 0 is reserved for invalid input).
if ( strnum > 0 && strnum <= max ) v[strnum](...);
In effect, I replaced a string hash lookup with a simple integer lookup (strnum). It is fast!!
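For anyone reading along, the dispatch described above could look roughly like this. It is only a sketch: `handler_fn`, `dispatch_ctx`, `count_hit` and `dispatch_match` are hypothetical names, the handler signature is invented, and the callback shape (strnum, textpos, context pointer) is taken from this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical handler type for the patterns that need extra handling. */
typedef void (*handler_fn)(int textpos, void *user);

/* Hypothetical context: v[1..max] hold handlers for the specially handled
 * patterns, which were placed first in the acism_create() input.
 * Slot 0 is reserved for invalid input and is never called. */
typedef struct {
    const handler_fn *v;
    int               max;
    void             *user;
} dispatch_ctx;

/* Example handler: just counts how often it was invoked. */
static void count_hit(int textpos, void *user)
{
    (void)textpos;
    ++*(int *)user;
}

/* Callback: a plain integer range check replaces a string hash lookup. */
static int dispatch_match(int strnum, int textpos, void *context)
{
    dispatch_ctx *ctx = context;
    assert(ctx != NULL);

    if (strnum > 0 && strnum <= ctx->max)
        ctx->v[strnum](textpos, ctx->user);
    return 0; /* assumed to mean "continue scanning" */
}
```

Patterns with strnum above `max` fall through without any table access, so the common case costs two integer comparisons.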

Thanks,
Alon

Cool. BTW I'm always interested in what people use acism for ...

On 6 January 2016 at 10:40, alonbg notifications@github.com wrote:
