samuelmarina / is-even

Is a number even?

Request to add support for misspelled+colloquially-phrased numbers

timdefrag opened this issue · comments

Support for written-out, long-form numbers is great! But the library cannot yet handle cases like "three hudnred" or "tree hunnit".

Implementing misspelling detection should be as easy as calculating the Levenshtein distance between the input string and each supported number's textual representation, then selecting the entry with the lowest distance (which would be 0 in the case of an exact match). This requires looping through all supported numbers on every call, but I don't think it will cause a big performance issue, since JS engines are pretty fast these days.
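The lookup described above could be sketched roughly as follows. This is only an illustration: `NUMBER_WORDS` is a made-up four-entry table standing in for the library's real set of supported numbers, and `levenshtein` and `closestNumber` are hypothetical helper names, not part of the library's API.

```javascript
// Hypothetical lookup table (assumption); the library's real table is not shown here.
const NUMBER_WORDS = {
  "three": 3,
  "four": 4,
  "five": 5,
  "three hundred": 300,
};

// Dynamic-programming Levenshtein distance (insertions, deletions, substitutions).
function levenshtein(a, b) {
  // prev[j] is the distance between the first i-1 chars of a and the first j chars of b.
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // delete a[i-1]
        curr[j - 1] + 1,    // insert b[j-1]
        prev[j - 1] + cost  // substitute, or keep if equal
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Scan every supported spelling and keep the one closest to the input.
// On a tie, the first entry encountered wins.
function closestNumber(input) {
  let best = null;
  for (const [word, value] of Object.entries(NUMBER_WORDS)) {
    const distance = levenshtein(input.toLowerCase(), word);
    if (best === null || distance < best.distance) {
      best = { word, value, distance };
    }
  }
  return best;
}

console.log(closestNumber("three hudnred"));
// { word: 'three hundred', value: 300, distance: 2 }
```

With this toy table, "three hudnred" resolves to 300 at distance 2, since plain Levenshtein counts the swapped "dn" as two edits.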

Implementing colloquially-phrased number support might be a little more work, but the added functionality would be worth it. An LSTM recurrent neural network could be trained to resolve string inputs into 375,000 output classes, which map to the set of supported numbers. For each of the 375,000 numbers, a training set of a few hundred colloquial spellings could be built and used to train the classifier. I assume this would only take a couple of days with PyTorch. The trained classifier could then be cross-compiled into JS source code and included in this library's implementation file.

Just some suggestions for the future direction of the work, thanks for all your great work so far supporting this community!

This would be a great feature to have indeed.

However, I do see a problem with the Levenshtein approach: results may be ambiguous. Consider the input "fxxx", which has the same distance (3) to "four" as it has to "five".
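To make the ambiguity concrete, here is a small sketch that returns every candidate tied for the lowest distance instead of a single winner. The names `editDistance` and `closestWords` and the three-word candidate list are assumptions for illustration only, not anything the library provides.

```javascript
// Levenshtein distance via a full DP table (insert, delete, substitute).
function editDistance(a, b) {
  // d[i][j] is the distance between the first i chars of a and the first j chars of b.
  const d = Array.from({ length: a.length + 1 }, (_, i) => [i]);
  for (let j = 1; j <= b.length; j++) d[0][j] = j;
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      d[i][j] = Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost);
    }
  }
  return d[a.length][b.length];
}

// Return all words that share the minimum distance to the input.
// More than one result means the input is ambiguous.
function closestWords(input, words) {
  const scored = words.map((word) => ({ word, distance: editDistance(input, word) }));
  const min = Math.min(...scored.map((s) => s.distance));
  return scored.filter((s) => s.distance === min).map((s) => s.word);
}

console.log(closestWords("fxxx", ["three", "four", "five"]));
// [ 'four', 'five' ]
```

A caller could treat a result with more than one entry as unresolved rather than silently picking the first match.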

For this reason, I suggest adding a simple AI to the library that respects the context of the input to make the correct decision. For example, if the question is "How many fingers do you see?", then "five" is more likely to be the correct output, whereas if the question is "How many wheels does your car have?", then "four" is more likely.