Bug in extractor-enum.js with original text indexes

alberchou opened this issue a year ago

Good afternoon,
I ran into an issue with repeated tokens (I want to recognize operations in a query), and I think the function extract(srcInput) in extractor-enum.js has a small bug: originalTextIndex is increased by the token length but not by the length of the separators.

For example:

You have the following entity to be recognized: sum
You process the following sentence: I want the sum of something1, sum of something2, sum of something3... , sum of something10
Because the split characters (space or ,) are not counted, originalTextIndex falls behind the real position in the text, and the originalPositionMap dictionary ends up with repeated values.
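
A minimal sketch of how the repeated values can appear. It assumes each token is located with indexOf starting from a running index, which is a guess at the shape of the loop and not a copy of the node-nlp source. The function lagPositions and the short sample sentence are hypothetical and exist only for illustration.

```javascript
// Hypothetical stand-in for the position-mapping loop; NOT the node-nlp source.
// Assumption: each token is located with indexOf, starting at a running index.
function lagPositions(text, tokens) {
  const positions = [];
  let searchFrom = 0;
  for (const token of tokens) {
    positions.push(text.indexOf(token, searchFrom));
    // Only the token length is added, so separators are never skipped
    // and searchFrom falls further behind with every token.
    searchFrom += token.length;
  }
  return positions;
}

const text = 'sum a, sum b, sum c, sum d';
const tokens = text.split(/[\s,]+/).filter(Boolean);
const positions = lagPositions(text, tokens);
console.log(positions); // the third and fourth "sum" both resolve to index 14
```

With only four repetitions the running index already lags far enough that the search for the fourth "sum" finds the third one again. The longer the sentence, the more separators are missed and the more duplicates appear.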

I'm using version 4.27.0:
npm list node-nlp
`-- node-nlp@4.27.0

It happens in extractor-enum.js, lines 306 to 322 (async extract(srcInput)).

Best regards.

alberchou · Answer 1 · Wed Jul 05 2023 23:54:51 GMT+0800 (China Standard Time)

I think that changing this:
originalTextIndex += tokenizeResult.tokens[i].length;

to this:
originalTextIndex = originaltextPos + tokenizeResult.tokens[i].length;

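may solve the problem, because the index then restarts right after the end of the matched token and so accounts for the separators as well.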
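
A minimal sketch of the idea behind this change, again assuming the token position comes from an indexOf search that starts at the running index. The function trackedPositions and the sample sentence are hypothetical; this is not the library's code and has not been tested against it.

```javascript
// Hypothetical stand-in for the corrected loop; NOT the node-nlp source.
function trackedPositions(text, tokens) {
  const positions = [];
  let searchFrom = 0;
  for (const token of tokens) {
    const pos = text.indexOf(token, searchFrom);
    positions.push(pos);
    // Restart right after the end of the match, which skips separators too.
    // (Assumes every token occurs in the text, so pos is never -1.)
    searchFrom = pos + token.length;
  }
  return positions;
}

const sample = 'sum a, sum b, sum c, sum d';
const sampleTokens = sample.split(/[\s,]+/).filter(Boolean);
const fixedPositions = trackedPositions(sample, sampleTokens);
console.log(fixedPositions); // every "sum" now gets its own index: 0, 7, 14, 21
```

Setting the index from the matched position keeps it in step with the original text regardless of how many separator characters sit between tokens.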