spencermountain / compromise

modest natural-language processing

Home Page: http://compromise.cool


Possible Issue with Root Matching

calware opened this issue

I need to search bodies of text for specific words that may not be normalized; that is, they may be plural, singular, or conjugated in some unusual way. The same is true of the search query being compared against each word in the target text. I would like to use the compromise library to solve this, perhaps by normalizing both the target word and the query word and then checking whether they are the same in their most basic form.

Judging by the examples for root matches, this looks like the feature that should solve my problem, but the following code does not yield the expected result (a positive match):

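const nlp = require('compromise')

function matchRoot() {
  const doc = nlp('Palatability')
  doc.compute('root') // compute the root form of each term
  const m = doc.match('{palate}') // {word} syntax matches on the root form
  return m.text()
}

console.log(matchRoot())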

The expected output would be "Palatability", but the above produces no match.

Am I doing something wrong with my implementation?
Thank you for your time, and I do hope this message finds you well.

Edit:
I ran "palatability" through a variety of online stemmers, and they consistently reduced it to "palat", but code such as the snippet below does not produce this result. The same is true of "goodness", which is left in its non-root form when the root should be "good".

nlp('palatability').text('root') // produces "palatability", should be "palat"
nlp('goodness').text('root') // produces "goodness", should be "good"
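
For comparison, the stemmer behaviour described above can be approximated with plain suffix stripping. This is only a minimal sketch: `naiveStem`, `sameStem`, and the suffix list are hypothetical names made up for illustration, not part of compromise, and a real stemmer applies many more rules than this.

```javascript
// Hypothetical helper, NOT part of compromise: strip a few derivational suffixes.
const SUFFIXES = ['ability', 'ibility', 'ness']

function naiveStem(word) {
  const lower = word.toLowerCase()
  for (const suffix of SUFFIXES) {
    // keep at least three characters so very short words are left alone
    if (lower.endsWith(suffix) && lower.length - suffix.length >= 3) {
      return lower.slice(0, -suffix.length)
    }
  }
  return lower
}

// Compare the query word and the target word in their stripped forms.
function sameStem(a, b) {
  return naiveStem(a) === naiveStem(b)
}

console.log(naiveStem('Palatability')) // "palat"
console.log(naiveStem('goodness')) // "good"
console.log(sameStem('Goodness', 'good')) // true
```

Note that this sketch would still not match "palate" against "palatability", since "palate" is left untouched while "palatability" becomes "palat"; both sides would need to reduce to the same stem for the comparison to work.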

Hey Cal - yep, you're right. There's a soft spot with this 'noun-ing' of verbs and adjectives that I've gone back and forth on a few times.
The problem is not the conjugation, but that some percentage of these just sound silly, and it's hard to machine-learn which ones.
You can see we kept the +'ness' adjective conjugation here, which produces some strangeness itself.

I think the verb+'ability' form may be the same. Browse through our verb-list and try to guess what percentage are good-sounding, like 'walkability', and what percentage are awkward enough to be wrong, like the forms built from 'backfire' or 'baffle'. I don't know, it's an odd problem.

That said, maybe the root lookup should quietly generate these, in order to grab the true-positives, like 'palatability'. It wouldn't be hard, as I think it is a pretty simple conjugation.

Maybe it would help to find, or generate, some data on how big of a problem this is. If there are only 100 cases, we could hard-code them. If it affects half of verbs, maybe we could look at their suffixes for patterns. Otherwise, if verb+'ability' is okay 90% of the time, I can just add it in.
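
One rough way to get that number might look like the sketch below. Everything in it is an assumption for illustration: the `verbs` and `attested` lists are toy data standing in for the real verb-list and a large word list, and `toAbility` is a made-up rule (drop a trailing 'e', append 'ability'), not how compromise conjugates anything.

```javascript
// Toy data (hypothetical): a real run would load the project's verb list
// and a large list of words known to be in actual use.
const verbs = ['walk', 'read', 'backfire', 'baffle', 'scale']
const attested = new Set(['walkability', 'readability', 'scalability'])

// Naive rule (an assumption): drop a trailing 'e', then append 'ability'.
function toAbility(verb) {
  const stem = verb.endsWith('e') ? verb.slice(0, -1) : verb
  return stem + 'ability'
}

// Count how many verbs produce a form that appears in the known-word set.
function coverage(verbList, knownWords) {
  const hits = verbList.filter((v) => knownWords.has(toAbility(v)))
  return { hits, ratio: hits.length / verbList.length }
}

const result = coverage(verbs, attested)
console.log(result.hits) // [ 'walk', 'read', 'scale' ]
console.log(result.ratio) // 0.6
```

With real data, a low ratio would point toward hard-coding the good cases, and a high one toward generating the form for every verb.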

Would love some advice, or help
cheers