AlexPoint / OpenNlp

Open source NLP tools (sentence splitter, tokenizer, chunker, coref, NER, parse trees, etc.) in C#

No head rule defined for INC

dawazualex opened this issue

I'm not sure what the proper fix is, but for sentence fragments I occasionally get this error:

No head rule defined for INC using  in INC-244

There are 2 spaces after "using" because this.getClass() is commented out, so no head finder name is printed there.

commented

Thanks for the feedback but I haven't looked at this project for some time now.
Could you give me a sentence to reproduce this bug?
And could you point me to the class raising the exception?
Thanks

Hey Alex, I really like the work you have done because I've been able to
integrate it directly into SQL Server via the assemblies. I could never
quite get the Java versions of NLP software to work with SQL Server due to
cyclic dependencies in IKVM.

The exception is raised in AbstractCollinsHeadFinder -> DetermineNonTrivialHead when
getting the typed dependencies:

      var tlp = new PennTreebankLanguagePack();
      var gsf = tlp.GrammaticalStructureFactory();
      var tree = new ParseTree(p);
      var gs = gsf.NewGrammaticalStructure(tree);
      var dependencies = gs.TypedDependencies();

Here is the sample sentence:
Had non-contrast MRI abdomen that was unrevealing and ERCP on 11/23
showing marked dilatation of the CBD with tight stricture and filling
defect in distal 1/3 with worry for pancreatic head mass.

It is weird that "non-contrast" gets split into 6 tokens: "non-", "c", "o",
"n", "t", "rast".

The same odd splitting happens with this sentence:

Spoke to patient's wife
(TOP (NP (NP (NNP Spoke)) (PP (TO to) (NP (NP (NN pat) (NN ient) (POS 's))
(NN wife)))))

"patient's" gets split into "pat", "ient" and "'s"

commented

I had exactly the same issues with IKVM (in addition to the fact that it's huge and shipping it could be a pain).

I'll look into the problem as soon as I have the time, but it seems to come from the tokenization (sometimes it does some really weird stuff and I couldn't figure out why).
What you can do for now is replace the tokenizer used in your example with EnglishRuleBasedTokenizer. I'm pretty sure it will solve this problem.

commented

Having the exact same issue here, using the EnglishRuleBasedTokenizer. Something's off.

Examples of sentences (these are from movies, don't blame me for them):

  • The rest of you, we're gonna drop in on Heidekker.
  • 'Cause last time I checked, work doesn't reassure you that liking a finger up your ass doesn't make you gay.
  • A system of mass incarceration that, once again, strips millions of poor people, overwhelmingly poor people of color, of the very rights supposedly won in the civil rights movement

I get the following 3 errors:

  • No head rule defined for INC using SemanticHeadFinder in INC-13
  • No head rule defined for INC using SemanticHeadFinder in INC-23
  • No head rule defined for INC using SemanticHeadFinder in INC-34

commented

This seems to happen when the parsed tree is incomplete, i.e. when tree.Type == "INC". When things go right, we have tree.Type == "TOP". Manually setting the tree type to "TOP" avoids the exception, but I'm not sure what consequences that has on the computed dependencies.

commented

Did some tests - I can confirm that manually setting the tree type to "TOP" yields terrible results and is not an option.
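
Since relabeling is not an option, one possible stopgap is to detect the incomplete parse and skip dependency extraction for that sentence rather than let the head finder throw. Below is a minimal sketch, not a fix for the underlying parsing problem. ParseGuard and IfComplete are hypothetical names, not part of OpenNlp, and the only thing assumed about the library is the "INC" root type reported above.

```csharp
using System;

static class ParseGuard
{
    // Root type of an incomplete parse, as observed in this thread.
    const string IncompleteType = "INC";

    // Hypothetical helper: calls `extract` only when the root type is not "INC";
    // otherwise returns `fallback` and never invokes `extract`.
    public static T IfComplete<T>(string rootType, Func<T> extract, T fallback)
    {
        return rootType == IncompleteType ? fallback : extract();
    }
}
```

With the snippet from earlier in the thread, the calls from NewGrammaticalStructure through TypedDependencies would go inside the `extract` lambda and tree.Type would be passed as `rootType`, so a sentence with an incomplete parse yields the fallback value instead of an exception.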