Error parsing specific text within double quotes.

Question

brochington opened this issue 2 years ago · comments

Broch Stilley commented 2 years ago

I am running into a parsing discrepancy between very similar sentences:

Parses: "Create a folder named "Right Here"."
Doesn't parse: "Create a folder named "Something Here"."

Note the only difference is replacing the word "Right" with "Something".

Thoughts?

Amir Plivatsky · Answer 1 · Sat Jul 23 2022 18:30:51 GMT+0800 (China Standard Time)

This happens because "Something" is not marked in en/4.0.dict with <marker-common-entity>.
@linas, would it be an improvement if words within quotes were treated as "common-entity"?

Linas Vepštas · Answer 2 · Thu Aug 04 2022 20:30:27 GMT+0800 (China Standard Time)

Sorry for late reply. The correct fix is not obvious.

One possibility would be to run optional de-capitalization on the first word after an open-quote (because quoted text is often a full-fledged sentence, including capitalization).
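A minimal sketch of that idea, done as text preprocessing outside the parser rather than inside Link Grammar itself. The function name `decap_variants` is hypothetical, and the regex assumes plain ASCII double quotes with the opening quote at the start of the text or after whitespace. It returns the original sentence plus, when applicable, a variant with the first word after each open-quote lower-cased, so a caller could try both:

```python
import re

# An opening double quote (at start of text or after whitespace)
# followed by an upper-case ASCII letter.
OPEN_QUOTE_CAP = re.compile(r'(^|\s)"([A-Z])')


def decap_variants(text):
    """Return [text] plus a variant with open-quote words de-capitalized."""
    variant = OPEN_QUOTE_CAP.sub(
        lambda m: m.group(1) + '"' + m.group(2).lower(), text
    )
    # Only offer the variant if something actually changed.
    return [text] if variant == text else [text, variant]
```

For the sentence from this issue, the variant contains "something Here".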

However, in this case, "something Here" (with lower-case s but upper-case H) is not a valid sentence -- it seems that the entire sequence was meant to be a named entity. Usually, named entities have names like "Great Southern and Northern Railroad Bank Corporation" -- all caps, nouns and adjectives, and "Something Here Banking Corporation LLC" doesn't quite fit that pattern.

My calendar is very busy for the next few weeks. I'm not sure I can think clearly about this just right now. Perhaps one fix is to add an UNKNOWN_CAP_WORD regex, which would use the common-entity disjunct class. Perhaps this is the best fix? That way, anything that consists of All Cap Words and Other Stuff automatically parses as a single entity.

(My main desktop computer died ... I can't do any work just right now; I can't even run LG right now.)

Broch Stilley · Answer 3 · Fri Aug 05 2022 08:58:14 GMT+0800 (China Standard Time)

@linas Thanks for the response. For my use, basically determining a touch "Something Here", the text capitalization is not too relevant, and everything between the two double quotes should be captured as a whole. I'm wondering if it's better to go off of a preceding called | named | titled.
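As a rough illustration of keying off a preceding cue word, here is a sketch that pulls out the quoted text before any parsing happens. It is plain Python, not part of Link Grammar; the name `extract_names` and the exact cue list are assumptions for the example:

```python
import re

# A cue word, whitespace, then a double-quoted span (no nested quotes).
NAME_AFTER_CUE = re.compile(
    r'\b(?:called|named|titled)\s+"([^"]+)"', re.IGNORECASE
)


def extract_names(text):
    """Return every quoted string that directly follows a cue word."""
    return NAME_AFTER_CUE.findall(text)
```

Capitalization inside the quotes plays no role here; the whole quoted span is captured as one unit.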

Thinking a little ahead: Is there a way to pass in perhaps a "dictionary extension" when parsing occurs? I have a list of proper nouns that I would like to be identified, and that list can change based on context external to the parser. However, I would like to not have to call dictionary_create on every call. Thoughts?

Amir Plivatsky · Answer 4 · Fri Aug 05 2022 14:35:43 GMT+0800 (China Standard Time)

Without an extension to the library, maybe you can use the following solution (or maybe "solution"), that needs manipulation of the text before parsing:

1. Identify words within double quotes (e.g. by regex).
2. Replace the blanks with a special character not in the character set used in your text (e.g. a letter from another language). You can do this conditionally according to the particular words.
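The steps above can be sketched in Python roughly as follows. This is only the text manipulation; whether the resulting token parses as intended still depends on the dictionary, which is not checked here. The placeholder letter, the function names, and the `should_join` hook are all assumptions made for the example:

```python
import re

# Assumption: this letter never occurs in the input text. Any character
# outside the text's own character set would do.
JOINER = "ж"

# A double-quoted span without nested double quotes.
QUOTED = re.compile(r'"([^"]*)"')


def join_quoted(text, joiner=JOINER, should_join=lambda inner: True):
    """Replace blanks inside double-quoted spans with `joiner`.

    `should_join` is a hypothetical hook for doing this conditionally,
    based on the quoted words themselves.
    """
    def _join(match):
        inner = match.group(1)
        if not should_join(inner):
            return match.group(0)  # leave this span untouched
        return '"' + re.sub(r"\s+", joiner, inner.strip()) + '"'

    return QUOTED.sub(_join, text)


def restore_blanks(token, joiner=JOINER):
    """Undo the substitution on a token taken from the parse result."""
    return token.replace(joiner, " ")
```

With this, "Something Here" reaches the parser as a single token, and `restore_blanks` recovers the original name afterwards.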

If needed, you can add a regex to the 4.0.regex file to identify strings with the said special character, and add its name to <UNKNOWN-WORD.a> and <UNKNOWN-WORD.n> (or even add an additional entry for it with <marker-common-entity>).