opencog / link-grammar

The CMU Link Grammar natural language parser

classic_parse: Sentence disjunct count 108279 exceeded limit 105123

ampli opened this issue · comments

echo 'And yet he should be always ready to have a perfectly terrible scene, whenever we want one, and to become miserable, absolutely miserable, at a moment’s notice, and to overwhelm us with just reproaches in less than twenty minutes, and to be positively violent at the end of half an hour, and to leave us for ever at a quarter to eight, when we have to go and dress for dinner when, after that, one has seen him for really the last time, and he has refused to take back the little things he has given one, and promised never to communicate with one again, or to write one any foolish letters, he should be perfectly broken-hearted, and telegraph to one all day long, and send one little notes every half-hour by a private hansom, and dine quite alone at the club, so that every one should know how unhappy he was.' | lg -lim=10000 -sh=20 -v=2
%link-grammar-5.12.0-312-g4cdf9c562
%/tmp/link-grammar/master
%Dictionary directory: /usr/local/src/link-grammar-devel/master
+ time -o /tmp/link-grammar/master/time /tmp/link-grammar/master/link-parser/.libs/link-parser -lim=10000 -sh=20 -v=2
verbosity set to 2
link-grammar: Info: Dictionary found at ./data/en/4.0.dict
limit set to 10000
short set to 20
link-grammar: Info: Dictionary version 5.12.1, locale en_US.UTF-8
link-grammar: Info: Library version link-grammar-5.12.1. Enter "!help" for help.
#### Finished tokenizing (174 tokens)
++++ Finished expression pruning                 0.00 seconds
++++ Built disjuncts                             0.09 seconds
++++ Eliminated duplicate disjuncts              0.05 seconds
++++ Encoded for pruning                         0.08 seconds
++++ power pruned (for 0 nulls)                  0.07 seconds
++++ Built mlink_table                           0.00 seconds
++++ power pruned (for 0 nulls)                  0.01 seconds
++++ pp pruning                                  0.00 seconds
++++ power pruned (for 0 nulls)                  0.01 seconds
++++ Built mlink_table                           0.00 seconds
++++ power pruned (for 0 nulls)                  0.01 seconds
++++ Encoded for parsing                         0.00 seconds
++++ Initialized fast matcher                    0.00 seconds
++++ Counted parses (0 w/0 nulls)                0.93 seconds
++++ Finished parse                              0.00 seconds
No complete linkages found.
++++ Finished expression pruning                 0.00 seconds
++++ Built disjuncts                             0.08 seconds
++++ Eliminated duplicate disjuncts              0.05 seconds
++++ Encoded for pruning (one-step)              0.15 seconds
++++ power pruned (for 1 null)                   0.08 seconds
++++ Built mlink_table                           0.00 seconds
++++ power pruned (for 1 null)                   0.02 seconds
++++ pp pruning                                  0.00 seconds
++++ power pruned (for 1 null)                   0.01 seconds
++++ Built mlink_table                           0.00 seconds
++++ power pruned (for 1 null)                   0.01 seconds
++++ Encoded for parsing                         0.01 seconds
++++ Initialized fast matcher                    0.00 seconds
Trace: classic_parse: Sentence disjunct count 108279 exceeded limit 105123
++++ Finished parse                              0.00 seconds
++++ Time                                        0.42 seconds (0.42 total)
Bye.

As far as I recall, long sentences may have on the order of 1M disjuncts, especially when parsing with nulls (Russian maybe even more).
It also seems that many spell guesses may vastly increase the number of disjuncts.

I recommend removing the default limit (not even setting a higher one). If needed, we can use -test=disjunct-limit:105123 or add a parse-options API call.

Even when not parsing with nulls (I removed "quite" so the sentence parses):
And yet he should be always ready to have a perfectly terrible scene, whenever we want one, and to become miserable, absolutely miserable, at a moment’s notice, and to overwhelm us with just reproaches in less than twenty minutes, and to be positively violent at the end of half an hour, and to leave us for ever at a quarter to eight, when we have to go and dress for dinner when, after that, one has seen him for really the last time, and he has refused to take back the little things he has given one, and promised never to communicate with one again, or to write one any foolish letters, he should be perfectly broken-hearted, and telegraph to one all day long, and send one little notes every half-hour by a private hansom, and dine alone at the club, so that every one should know how unhappy he was.

Trace: classic_parse: Sentence disjunct count 108279 exceeded limit 105123

And this sentence has only 174 tokens. Longer sentences may have many more disjuncts.

This has to be fixed somehow, since it silently skips the parsing of some long sentences altogether.
No error is reported in that case (not even `+++++ error' in batch runs), so the skipped sentences are counted as correct.

If this limit is important for Atomese dict usage, I propose implementing one of the following (a rough usage sketch follows at the end of this comment):

  1. parse_options_*_disjunct_limit()
  2. Set it using parse_options_set_test("disjuncts_limit:1234456").

In any case, I propose a default of -1 (meaning unset), and if the setting is exceeded, produce a parse error (instead of just a debug message as now).

Or alternatively, enable it only for the Atomese dict!
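
For illustration only, here is a minimal C sketch of how a caller might use either proposal. The accessor parse_options_set_disjunct_limit() is hypothetical (it is proposal 1 above and does not exist in the current API), and "disjuncts_limit:..." is the test key proposed in option 2; both calls are therefore shown commented out. Everything else is the standard link-grammar API.

```c
#include <stdio.h>
#include <link-grammar/link-includes.h>

int main(void)
{
    Dictionary dict = dictionary_create_lang("en");
    if (dict == NULL) return 1;

    Parse_Options opts = parse_options_create();

    /* Proposal 1 (hypothetical accessor; -1 would mean "no limit"): */
    /* parse_options_set_disjunct_limit(opts, -1); */

    /* Proposal 2 (proposed test key, set via the test-string option): */
    /* parse_options_set_test(opts, "disjuncts_limit:1234456"); */

    Sentence sent = sentence_create("This is a test.", dict);
    sentence_split(sent, opts);
    printf("linkages found: %d\n", sentence_parse(sent, opts));

    sentence_delete(sent);
    parse_options_delete(opts);
    dictionary_delete(dict);
    return 0;
}
```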

This limit can be disabled.

There were several bugs & mis-designs that led to the introduction of this limit, all of which have been fixed. Some comments:

  • In this figure: #1402 (comment) it appears that pride-n-prejudice never needs more than about 300K disjuncts, which is one of the ways I selected this limit.
  • I never measured Russian, or the long-sentence cases.
  • The intent was that an error would be visible, as there would be zero parses (`sent->num_linkages_found = 0;`) -- clearly, I didn't test enough to see whether this was true (see the sketch after this list).
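
To make the visibility problem concrete, here is a minimal sketch of the caller's view, assuming the standard C API path; from outside the library, the limit trip and a genuine "no complete linkages" result look identical:

```c
#include <stdio.h>
#include <link-grammar/link-includes.h>

int main(void)
{
    Dictionary dict = dictionary_create_lang("en");
    if (dict == NULL) return 1;
    Parse_Options opts = parse_options_create();

    Sentence sent = sentence_create("This is a test.", dict);
    sentence_split(sent, opts);

    if (sentence_parse(sent, opts) == 0)
    {
        /* Reached both when the sentence genuinely has no complete
         * linkage and when the disjunct limit silently skipped parsing;
         * the caller cannot tell the two cases apart. */
        printf("No linkages found.\n");
    }

    sentence_delete(sent);
    parse_options_delete(opts);
    dictionary_delete(dict);
    return 0;
}
```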

For now, I no longer need this check. I don't know what might happen in the future.

We can disable this by setting the default to -1. Completely chopping out that code is OK, too.

> 300K disjuncts, which is one of the ways I selected this limit.

But it was set to ~100K.

In any case, note that for parsing with null_count>1, there are many more disjuncts than in null_count==0, because the pruning is much less effective for higher null_counts (and hence it is done per null_count).
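
For context, the null-count retry is what the second pass in the log above corresponds to. A caller enables it with the standard parse options, roughly like this (a sketch; the sentence text is just a placeholder):

```c
#include <stdio.h>
#include <link-grammar/link-includes.h>

int main(void)
{
    Dictionary dict = dictionary_create_lang("en");
    if (dict == NULL) return 1;
    Parse_Options opts = parse_options_create();

    /* Try a full parse first (0 nulls), then retry allowing null links.
     * The per-null_count pruning and the disjunct counts discussed above
     * happen inside the library during these retries. */
    parse_options_set_min_null_count(opts, 0);
    parse_options_set_max_null_count(opts, 10);

    Sentence sent = sentence_create("This is a test sentence.", dict);
    sentence_split(sent, opts);
    int n = sentence_parse(sent, opts);
    printf("linkages: %d (null count %d)\n", n, sentence_null_count(sent));

    sentence_delete(sent);
    parse_options_delete(opts);
    dictionary_delete(dict);
    return 0;
}
```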

> This limit can be disabled.
>
> We can disable this by setting the default to -1. Completely chopping out that code is OK, too.

If the code is corrected to produce an error (without then trying to parse with more nulls), we can set it to -1 permanently. However, I propose not making any effort to fix it and just removing it.

Closed because resolved in #1447

> But it was set to ~100K.

It was meant to be 1M.