opencog / link-grammar

The CMU Link Grammar natural language parser

Error parsing specific text within double quotes.

brochington opened this issue · comments

I am running into a parsing discrepancy between very similar sentences:

Parses: "Create a folder named "Right Here"."
Doesn't parse: "Create a folder named "Something Here"."

Note the only difference is replacing the word "Right" with "Something".

Thoughts?

This happens because "Something" is not marked in en/4.0.dict with <marker-common-entity>.
@linas, would it be an improvement if words within quotes were treated as "common-entity"?

Sorry for late reply. The correct fix is not obvious.

One possibility would be to run optional de-capitalization on the first word after an open-quote (because quoted text is often a full-fledged sentence, including capitalization).

However, in this case, "something Here" (with lower-case s but upper-case H) is not a valid sentence -- it seems that the entire sequence was meant to be a named entity. Usually, named entities have names like "Great Southern and Northern Railroad Bank Corporation" -- all caps, nouns and adjectives, and "Something Here Banking Corporation LLC" doesn't quite fit that pattern.

My calendar is very busy for the next few weeks. I'm not sure I can think clearly about this just right now. Perhaps one fix is to add an UNKNOWN_CAP_WORD regex, which would use the common-entity disjunct class. Perhaps this is the best fix? That way, anything that consists of All Cap Words and Other Stuff automatically parses as a single entity.

(My main desktop computer died ... I can't do any work just right now; I can't even run LG right now.)

@linas Thanks for the response. For my use, basically determining a touch "Something Here", the text capitalization is not too relevant, and everything within the two double quotes should be captured as a whole. I'm wondering if it's better to key off a preceding "called", "named", or "titled".
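As a rough illustration of keying off a preceding trigger word, here is a minimal Python sketch that runs outside the parser. The function name `extract_names` and the trigger list are assumptions made for this example, and it only handles straight double quotes:

```python
import re

# Hypothetical trigger words; extend the alternation as needed.
# Captures the text between straight double quotes that directly
# follows one of the trigger words.
NAME_AFTER_TRIGGER = re.compile(
    r'\b(?:called|named|titled)\s+"([^"]+)"',
    re.IGNORECASE,
)


def extract_names(text):
    """Return every quoted name that follows a trigger word."""
    return NAME_AFTER_TRIGGER.findall(text)


print(extract_names('Create a folder named "Something Here".'))
# prints ['Something Here']
```

Quoted text that does not follow a trigger word is ignored, so ordinary quotations would still go to the parser unchanged.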

Thinking a little ahead: is there a way to pass in something like a "dictionary extension" when parsing occurs? I have a list of proper nouns that I would like to be identified, and that list can change based on context external to the parser. However, I would like to avoid calling dictionary_create on every call. Thoughts?

Without an extension to the library, maybe you can use the following solution (or maybe "solution"), which requires manipulating the text before parsing:

  1. Identify words within double quotes (e.g. by regex).
  2. Replace the blanks with a special character not in the character set used in your text (e.g. a letter from another language).
    You can do this conditionally according to the particular words.
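The two steps above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function names are made up for this example, the middle dot (U+00B7) is an arbitrary choice of joiner character, and only straight, non-nested double quotes are handled.

```python
import re

# Hypothetical joiner: any character that never appears in your input text.
JOINER = "\u00b7"  # middle dot

# Text between a pair of straight double quotes (no nesting).
QUOTED = re.compile(r'"([^"]*)"')


def join_quoted_words(text, joiner=JOINER):
    """Replace runs of blanks inside double-quoted spans with `joiner`."""
    def _join(match):
        inner = match.group(1).strip()
        return '"' + re.sub(r"\s+", joiner, inner) + '"'
    return QUOTED.sub(_join, text)


def split_joined_words(token, joiner=JOINER):
    """Restore the original blanks in a token after parsing."""
    return token.replace(joiner, " ")


print(join_quoted_words('Create a folder named "Something Here".'))
# prints: Create a folder named "Something·Here".
```

After parsing, `split_joined_words` can be applied to the joined token to recover the original text.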

If needed, you can add a regex to the 4.0.regex file that identifies strings containing the said special character, and add its name to <UNKNOWN-WORD.a> and <UNKNOWN-WORD.n> (or even add an additional entry for it with <marker-common-entity>).