opencog / link-grammar

The CMU Link Grammar natural language parser

Get rid of max-cost

linas opened this issue · comments

There is a long, old discussion about max-cost in #783, which suggests that getting rid of the concept of max-cost is a good idea. Newer evidence in #1449 confirms this.

My memory of max-cost was that, originally, it was related to panic-parsing, and that setting it to a higher value would somehow make panic parsing "find something, anything". But this seems untrue. In the English dict, there is only one thing with three brackets: [[[@Xc-]]] that would be chopped by max-cost. (There is nothing with [FOO+]3.0 because the panic-parse stuff predates the floating point costs.)

Rather than continuing the conversation in the already-long #783, let's just do it here. Or rather, let's just ... get rid of it, and be done with it.

As a side-effect, maybe get rid of panic-parse, also. Maybe.

one thing with three brackets: [[[@Xc-]]]

There are also things that get a cost >= 4 through 4.0.dialect; the first one originally had 4 brackets:

  1. <no-det-null>: [()]headline;
  2. then.#than: [[than.e]0.65]bad-spelling; etc.

get rid of it

  1. Do you mean that we should get rid of parse_options_set_disjunct_cost()?
  2. What is so bad about this concept?
  3. Without a disjunct cost limit, how do you implement dialects?

As a side-effect, maybe get rid of panic-parse, also. Maybe.

The current panic-parse parameters are indeed not useful for finding a parse in a shorter time.
But its implementation infrastructure could be reused to find a full parse when one has not been found (disregarding parse time, by using new facilities that take more time). But the details arguably do not belong in this discussion...

we should get rid of parse_options_set_disjunct_cost()?

Yes. For backwards compatibility, it would become a no-op.
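
For reference, a minimal sketch of how that option is driven through the public API today (assuming the current link-includes.h signatures; under the proposal the setter would simply be ignored):

#include <stdio.h>
#include <link-grammar/link-includes.h>

int main(void)
{
    Parse_Options opts = parse_options_create();

    /* Today this sets the disjunct-cost cutoff used during parsing;
     * the proposal would keep the call for backwards compatibility
     * but make it a no-op. */
    parse_options_set_disjunct_cost(opts, 2.7);
    printf("disjunct cost cutoff: %f\n",
           parse_options_get_disjunct_cost(opts));

    parse_options_delete(opts);
    return 0;
}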

What is so bad about this concept?

The discussion and confusion in #783 and about what Dennis had done indicate that it's a poorly formed concept, at least for the English dict. The English parses seem to work better if it is set to infinity.

For the Atomese dicts, it is perhaps ... not such a bad concept: there, some outrageously large costs show up, and could be discarded.

Without a disjunct cost limit, how do you implement dialects?

!? Hmm. Interesting question. I'm not sure it's a good idea to turn dialects on and off based on some cost threshold.


OK, how about a proposal B:

  • Keep the cost_cutoff idea
  • Get rid of the cost_max computations in the code
  • Increase English max to 5 or 10 or unbounded.

Arghh. Why is this so hard? I just set #define max-disjunct-cost 8.7; in the English dict, and now corpus-basic has 102 errors instead of 88 ... clearly, there is something going on that I do not understand. It's like ... the more I use it, the less I understand the subtle interactions. This is frustrating.

OK, here's what happened. After removing costs, this sentence from corpus-basic parses:

*He is the character of person who would do that

Why? Because of person.n 4.000 Js- R+ Bs+. How? Because person.n is in words/words.n.1-const and

words/words.n.1-const: <common-const-noun>

and

<common-const-noun>: <no-det-null> & <noun-rel-s> & <noun-main-s>;
<no-det-null>: [()]headline;
<noun-main-s>: Js-;
<noun-rel-s>: {R+ & Bs+};

and apparently the headline dialect cost must be 4.0. I conclude that the correct way to disable dialects is to set the cost to 500 ...
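
Spelling out the arithmetic (my reading of the entries above, with the headline dialect cost at 4.0 and the new cutoff at 8.7):

\mathrm{cost}([()]\,\text{headline}) + \mathrm{cost}(\text{Js-}) + \mathrm{cost}(\{\text{R+ \& Bs+}\}) = 4.0 + 0 + 0 = 4.0 \le 8.7

so the disjunct survives the cutoff, and the starred sentence gets a parse.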

Setting the cost to 500 gives

Error: en/4.0.dialect:54: After "headline": Invalid cost "500".

Then utilities.c has strtofC() ... why the heck is this so complicated? What's wrong with the <stdlib.h> version of strtof() ??? What the heck? Is this some crazy Windows or Apple nuttiness?

Setting the headline dialect to 99 allows corpus-basic to parse with only one additional error. That error is due to [[[@Xc-]]] on commas.

I will remove [[[@Xc-]]] in just a few minutes. I will change dialects to have a cost of 99 when disabled, soon after that. As for strtofC(), I'm controlling a very strong urge to nuke it, but you probably have some valid reason to keep it ...

So strtofC() appears in #1231 ... it's due to locales. Friggin locales. My impression is that, in Unix, they were mis-designed. I have never had a situation where locales made something easier instead of more complicated.

The discussion and confusion in #783 and about what Dennis had done indicate that it's a poorly formed concept, at least for the English dict.

Have you looked again at the document I referred you to there, which talks about the reason for such a cost computation?
(I'm not saying its author is right.)

The English parses seem to work better if it is set to infinity.

Does it include "headline" parsing? In the old dict it was commented out, because it often allows bad parses. But with an "infinite cost" it will be enabled permanently. On the other hand, if you remove it, how could one enable headline parsing (which you may want to try if there is no parse otherwise)?

!? Hmm. Interesting question. I'm not sure it's a good idea to turn dialects on and off based on some cost threshold.

Right now they can be turned on/off individually, just by mentioning their name.
Why is it important how it is implemented internally?
Would using it as a binary expression switch be better?

OK, how about a proposal B:

  • Keep the cost_cutoff idea

Of course, it seems fine to me...

  • Get rid of the cost_max computations in the code

I guess this may be fine too, but you may need to adjust some dictionary costs accordingly.
It would even slightly speed up build_clause(), because it would be able to immediately skip clauses that go over cost_cutoff (currently it cannot, because the total cost may be reduced as it proceeds, so it discards such disjuncts only at the end).
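
A hedged sketch of that early skip (hypothetical types and names, not the actual build_clause() code), assuming that once cost_max is gone, costs can only accumulate:

#include <stdbool.h>

/* Hypothetical partial-clause record, for illustration only; not the
 * actual data structure used by build_clause(). */
typedef struct { double cost; } Clause;

/* If costs can only accumulate while a clause is being built (i.e. no
 * later cost_max-style reductions), a partial clause that is already over
 * the cutoff can never come back under it, so it can be dropped
 * immediately instead of only at the end. */
static bool keep_partial_clause(const Clause *c, double cost_cutoff)
{
    return c->cost <= cost_cutoff;
}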

  • Increase English max to 5 or 10 or unbounded.

What should we do with headline and bad-spelling, which would then get enabled if we do nothing?

The increase in the cost cutoff is intended to let more sentences get parsed. However, when sentences marked with * then get parsed, this is counted as an error!

And indeed:

$ link-parser < data/en/corpus-basic.batch > /tmp/b0
...
88 errors.

$ link-parser -cost-max=8.7 < data/en/corpus-basic.batch > /tmp/b1
...
102 errors.

$ systest/lgerror /tmp/b0 /tmp/b1
<  *A tax on income increase may be necessary
<  *Alfred Baird, man I know, has been described as good gardener
<  *But my presents to win his heart have failed
<  *Clinton is expected to return to Thursday Washington office
<  *He is more black to go than to stay
<  *He is more likely than to stay
<  *He is the character of person who would do that
<  *Joan Smith is tourist
<  *Many Croat who had fled their homes are now returning to them
<  *Many people attended that they spilled over into several neighboring fields
<  *Many person were angered by the hearings
<  *That is the man, in Joe's opinion, we should hire
<  *The data on project will be used for the file at program
<  *Until initially, these fossils were believed to belong to different species

Error: en/4.0.dialect:54: After "headline": Invalid cost "500".

Then utilities.c has strtofC() ... why the heck is this so complicated? What's wrong with the <stdlib.h> version of strtof() ??? What the heck? Is this some crazy Windows or Apple nuttiness?

This is very simple. strtof() is locale-dependent - this is mandated by POSIX. It means you would need to temporarily change the locale to C on each call and then restore the previous locale. This is not thread-friendly (it changes the locale for all threads).
I fixed it by introducing my own function, which also has the slight benefit of being slightly faster.
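
For what it's worth, here is a minimal locale-independent alternative sketch (not the actual strtofC() implementation; availability of newlocale()/strtod_l() varies: glibc wants _GNU_SOURCE, macOS puts them in <xlocale.h>, and MSVC spells them _create_locale()/_strtod_l()):

#define _GNU_SOURCE           /* for strtod_l() on glibc */
#include <locale.h>
#include <stdlib.h>

/* Parse a number using the "C" locale without touching the global locale,
 * so other threads are unaffected. The one-time newlocale() call is not
 * guarded here; a real implementation would initialize it once, safely. */
static double parse_double_c_locale(const char *s, char **end)
{
    static locale_t c_loc = (locale_t)0;
    if (c_loc == (locale_t)0)
        c_loc = newlocale(LC_ALL_MASK, "C", (locale_t)0);
    return strtod_l(s, end, c_loc);
}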

The 99.9999 is just an arbitrary limit and I can increase it to 999.99 (if you do it, change max_int_digits to 3).

BTW, internally, costs of ~10000 are used for dialect handling. From dialect.h:

link-grammar/dict-common/dialect.h:	#define DIALECT_COST_MAX         9999.0F    /* Less than that is a real cost */
link-grammar/dict-common/dialect.h:	#define DIALECT_COST_DISABLE    10000.0F
link-grammar/dict-common/dialect.h:	#define DIALECT_SUB             10001.0F    /* Sub-dialect (a vector name) */
link-grammar/dict-common/dialect.h:	#define DIALECT_SECTION         10002.0F    /* Section header (a vector name) */

Have you looked again at the document I referred you to there

This one? https://www.semanticscholar.org/paper/Predicate-Expression-Cost-Functions-to-Guide-Search-Bottaci/52f5feca45ea50dc2eaba649a885ff66af575234 It does not seem to be relevant to our situation. I'm not sure if I should provide a long reply to this, but you seem to enjoy long replies, so what the heck.

That paper is about evolutionary search for test data. It seems to describe what we now call "program fuzzing". Although that paper uses the terms "disjunct" and "conjunct", it does not seem to be relevant to what we do with LG.

Let's review the concept of natural language parsing.

  • In the 1960's, 1970's we get this idea that the grammars of natural languages can be encoded on computers, and that computer programs can parse natural language.
  • These early definitions of parsing and grammar are strict: either a sentence is grammatical, or it is not grammatical, and there is no in-between.
  • You can define a "good grammar" as one that parses only grammatical sentences, and rejects the ungrammatical ones.
  • There are two problems with this naive conception: (1) writing good grammars is wildly complicated, and (2) most humans do not speak or write with good grammar; yet we still want to "understand" what they are trying to say.
  • Early LG did not have a cost system.
  • But if you try to write a grammar without a cost system, it becomes very complicated: simple rules become "do this except when that unless something else or a third case" and it keeps getting worse, and then these complicated rules start generating incorrect parses for simple sentences (a problem that the simple grammars did not have).
  • One solution is to write: "use simple rule, unless there are no parses, in which case try the more complicated rule".
  • This solution leads to two new ideas: (a) parsing with nulls: use the simple rule, and then if nulls result, try the complicated rule. (b) parsing with costs: parse with both rules at the same time, but give the simple rule a lower cost.
  • With plan (b), parsing never fails, the lowest cost parse is, by definition, the correct parse, and the higher-cost parses are, by definition "wrong".
  • Costs are integers, which provide a clear ranking or preference. Costs can be 0, 1 or 2, and that's it, except for the weird "panic parse" idea which allowed a cost of 3.

The above was the situation in 1995. Since then, some new ideas:

  • Costs vaguely resemble minus the logarithm of a probability, and so can be understood as providing a likelihood of a parse being correct.
  • In this case, cost should be fractional, floating point.
  • This provides a foot-hold for ideas from information theory to enter the domain of parsing.
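
A rough way to read that (my gloss, not a formal claim): if a disjunct d is chosen with probability P(d), then

\mathrm{cost}(d) \approx -\log P(d), \qquad P(d) \approx e^{-\mathrm{cost}(d)}

so a fractional cost of 0.7 corresponds to a likelihood of roughly e^{-0.7} ≈ 0.5, something an integer-only cost scale (e^{-1} ≈ 0.37, e^{-2} ≈ 0.14, ...) cannot express.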

At the same time, there are other forces at work:

  • Every human has their own individual "dialect" of what they think is grammatical, and what is not. What is acceptable and understandable, and what is not. What is preferred, and what is not.
  • This style of dialect varies with setting and age: young children love to explore newly learned swear words. Teenagers always want to sound cool, coupling speech gestures with movement and fashionable clothing. Adults talk differently at work and at home.
  • There are regional averages of dialects, where a "region" can be geographical, socio-economic or a specific setting: "newspaper English".
  • There are national averages: English as spoken in the US vs. in Britain.

The current questions are:

  • Can individual dialects be machine-learned? As ChatGPT shows, the answer is yes.
  • Can we use the cost system, interpreting it through the lens of information theory, to model both speaker differences and national averages? Using LG, that is, and not some DL/NN?

My personal answer to the second question is "sure, maybe yes, probably, I guess so, let's try it, eff around and find out". My personal mental picture of the situation is that it looks kind-of-like Morse theory, except in extremely high dimensions. Each human is coming at grammar from a slightly different direction. Each bifurcation is driven by phonetic discrimination, by the ability or inability to understand and articulate.

Great. What does this have to do with cost-max, and the paper you reference?

Well, the disjuncts and conjuncts that LG uses are a fragment of what is called "linear logic": it's a fragment, because some of the logical operators of linear logic are missing. Also, LG is not left-right symmetric; it is instead an associative algebra, and so resembles the algebra of a pregroup grammar https://en.wikipedia.org/wiki/Pregroup_grammar -- and the resemblance is not accidental; LG appears to be isomorphic to combinatory categorial grammar (CCG). See https://github.com/opencog/atomspace/blob/master/opencog/sheaf/docs/ccg.pdf for details.

What should the cost of.... OK, next comment. The above is too long. I don't know why I wrote it. You already understand this stuff.

Some cost questions:

  • What should the cost of [A-]0.3 & [B+]0.4 be?
  • What should the cost of [A-]0.3 or [B+]0.4 be?

The second question does not make sense, because the two terms result in two different parses. No suggestion from the Bottaci paper applies here.

For the first question, we have five possibilities. The cost of [A-]0.3 & [B+]0.4 is the same as [A- & B+]f(0.3, 0.4), where

  • f(x,y) = max(x,y)
  • f(x,y) = x+y
  • f(x,y) = (x+y)/xy -- this is the Bottaci table 4 suggestion
  • f is some other monotonic (convex) function of x & y
  • f is something else (not monotonic)
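
Plugging in the numbers from the question above (x = 0.3, y = 0.4):

\max(0.3,\,0.4) = 0.4, \qquad 0.3 + 0.4 = 0.7, \qquad \frac{0.3 + 0.4}{0.3 \times 0.4} \approx 5.83

so the choice of f is not merely cosmetic; it can change both the relative ranking of disjuncts and which of them fall under any cutoff.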

The answer to this question serves two purposes:

  • It is a convenience to the author of the dictionary, making it easier to write "correct" dictionaries,
  • It is anchored in some kind of theoretical foundation or framework: the preferred answer is governed by some preferred theory.

If the preferred theory is information theory, and costs are log-probabilities, then f(x,y) = x+y seems like the only reasonable answer.
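
Spelled out, assuming the two connector choices behave like (approximately) independent events:

\mathrm{cost}(A) = -\log P(A), \qquad \mathrm{cost}(B) = -\log P(B)
P(A \,\&\, B) = P(A)\,P(B) \;\Rightarrow\; \mathrm{cost}(A \,\&\, B) = -\log\!\big(P(A)\,P(B)\big) = \mathrm{cost}(A) + \mathrm{cost}(B)

which is exactly f(x,y) = x + y.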

I cannot think of any good reasons why other formulas could be used, or should be explored. There is nothing in the Bottaci paper that seems appropriate for our current situation: Bottaci is talking about logical operators and De Morgan's laws; the LG operators are NOT logical operators, and they do NOT obey De Morgan's laws. Bottaci is talking about statistical sampling; we are not doing statistical sampling....

I don't know where else to go with this. Yes, tiny details like this can sometimes be very important; but here, I just don't see it. Looks like a cul-de-sac to me.

  • Costs vaguely resemble minus the logarithm of a probability, and so can be understood as providing a likelihood of a parse being correct.
  • In this case, cost should be fractional, floating point.
  • This provides a foot-hold for ideas from information theory to enter the domain of parsing.

BTW, I think that the LG system should also have another type of cost, an integer one that is applied "locally" to disjuncts during tokenization and not summed up (using log or any other function) over the sentence. I tried to make my point several times in the last few years but you have never told me your opinion on that.

because some of the logical operators of linear logic are missing.

But a common thing is that all the connectors of a disjunct must match for the disjunct to be considered matched.

If the preferred theory is information theory, and costs are log-probabilities, then f(x,y) = x+y seems like the only reasonable answer.

I agree that this is preferable for combining the costs of disjuncts. As far as I know, the LG code also computes the disjunct cost by adding together the costs of its connectors.

Here the question is what is the meaning of the cost cutoff - does it apply to disjuncts or to connectors? If to connectors, then the max function results. But this has a property that looks improper - disjuncts that have a "high" cost still participate.
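
A concrete illustration of that improper-looking property, using a hypothetical cutoff of 2.7 and a made-up disjunct [A-]2.0 & [B+]2.0:

\max(2.0,\,2.0) = 2.0 \le 2.7, \qquad \text{but} \qquad 2.0 + 2.0 = 4.0 > 2.7

A per-connector cutoff lets it through; a per-disjunct (summed) cutoff rejects it.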

BTW, I think that the LG system should also have another type of cost, an integer one that is applied "locally" to disjuncts during tokenization and not summed up (using log or any other function) over the sentence. I tried to make my point several times in the last few years but you have never told me your opinion on that.

My apologies; maybe I did not notice that you were making a proposal, or asking a question. Maybe I did not understand what you were saying. Maybe I was not able to develop a strong opinion, one way or the other. Maybe I was in a hurry and just forgot to respond. Please open a new discussion/issue to clearly articulate the idea.

I still might not respond, and I might not respond clearly. My general attitude will be "does this make sense from an information-theory perspective?" Also, as conversations with Anton Kolonin @akolonin over the last year have made clear, we do not have any good, high-quality theories or systems for text segmentation. This includes segmentation of Chinese, splitting of Indo-European languages into morphemes+suffixes, detection of sentence boundaries, etc. And without some clear, good theory of how this can work, it will be hard to decide if this or that modification to LG tokenization is a good idea.

because some of the logical operators of linear logic are missing

What I meant is that the upside-down ampersand is missing, the question mark is missing, etc. See https://en.wikipedia.org/wiki/Linear_logic and https://plato.stanford.edu/entries/logic-linear/ -- linear logic describes tensor algebras; natural language is tensor-like, but is not the full thing (LG connectors can be understood as tensor indexes, and LG disjuncts as tensors; however, not all of the required axioms for tensors apply). I find this interesting, but it's another cul-de-sac.

Here the question is what is the meaning of the cost cutoff

I got tired of agonizing over this, and merged #1456 -- that thing did not help English in any way, and it was not needed for Atomese, so I just got rid of it.

As of this time, after merging the last few pull requests #1456 and #1455, the code base implements "proposal B" of comment #1453 (comment).