opencog / link-grammar

The CMU Link Grammar natural language parser

Stripping affix-class tokens

ampli opened this issue

While working on the affix stripping code I noted that subscripted LPUNC tokens are mishandled.
For example, using "any":

linkparser> ...test
Found 1 linkage (1 had no P.P. violations)
	Unique linkage, cost vector = (UNUSED=0 DIS= 0.00 LEN=0)

    +---ANY---+
    |         |
LEFT-WALL ...test[?]

Note that in any/affix-punc, ... appears as ....x (at the end of line 6):

"(" "{" "[" "<" « 〈 ( 〔 《 【 [ 『 「 """ `` „ “ ‘ ''.x '.x ….x ....x
¿ ¡ "$"
_ - ‐ ‑ ‒ – — ― ━ ー ~
£ ₤ € ¤ ₳ ฿ ₡ ₢ ₠ ₫ ৳ ƒ ₣ ₲ ₴ ₭ ₺ ℳ ₥ ₦ ₧ ₱ ₰ ₹ ₨ ₪ ﷼ ₸ ₮ ₩ ¥ ៛ 호점
† †† ‡ § ¶ © ® ℗ № "#": LPUNC+;

Several years ago, when I modified the affix stripping code, I also removed the subscripts from the affixes in en/4.0.affix, as the original code didn't use them for dict lookup.
At some point (I didn't check when) the LPUNC code (in strip_left()) stopped checking for subscripts, but this went unnoticed because en/4.0.affix didn't include LPUNC subscripts. Similarly, the later-written MPUNC code also doesn't handle subscripts.
The RPUNC code (in strip_right()) still handles subscripts because it shares common code with UNITS, and units may be subscripted.
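To make the subscript issue concrete, here is a minimal sketch (in Python, not the actual C code; the function name is made up) of how a subscripted affix-list entry like ....x splits into the text to strip from the word and the form to look up in the dict:

```python
# Sketch only: how a subscripted affix entry like "....x" could be handled.
# An affix-list entry "tok.s" means: strip the text "tok" from the word,
# and look it up in the dict under the subscripted form "tok.s".
def split_affix_entry(entry):
    base, dot, subscript = entry.rpartition(".")
    if dot and base and subscript:
        return base, entry          # strip `base`, look up `entry`
    return entry, entry             # unsubscripted: strip and look up as-is

print(split_affix_entry("....x"))   # ('...', '....x')
print(split_affix_entry("("))       # ('(', '(')
```

This is only the entry-parsing half; the stripping code must then actually perform the dict lookup with the subscripted form, which is the part that regressed.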

As a result, the following languages have some mishandled LPUNC tokens:
id, th, demo-sql, he, vn, demo-atomese, and the aforementioned any.

When I fixed LPUNC to re-consider subscripts, another problem appeared in the case of ''.y and ''.x (twice single quote): mangled results. The problem is that '' is not in the dict (especially not ''.y, as lookups are now done with the subscript), and is thus resolved as UNKNOWN-WORD. It then gets subscripted with the subscript of UNKNOWN-WORD, and the code doesn't expect this double subscript. Was it an unfinished attempt to add '' as a synonym for double quotes? (For that, a change is needed in the QUOTES handling; see (1) below.)

---> My proposed fix is to disallow subscripted affixes which are not in the dict.

Regarding unsubscripted affixes that are not in the dict, maybe they should be allowed if they match a regex (in that case the code should be modified to add a regex check). For example, '' currently matches EMOTICON (but there is no point in stripping it as EMOTICON unless we strip everything that matches EMOTICON, something that can be done with affix regexes; see (2) below).
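The proposed policy could be sketched as follows (Python, purely illustrative; the REGEXES table and the pattern standing in for EMOTICON are made-up placeholders, not the real en/4.0.regex contents):

```python
import re

# Illustrative only: a stand-in regex table.  The pattern below is NOT the
# real EMOTICON regex; it merely matches runs of ASCII punctuation so that
# '' matches something, as EMOTICON does in practice.
REGEXES = {"EMOTICON": re.compile(r"^[!-/:-@]+$")}

def affix_allowed(token, subscripted, dict_words):
    """Proposed policy: subscripted affixes must be in the dict;
    unsubscripted ones may instead match a defined regex."""
    if token in dict_words:
        return True
    if subscripted:
        return False  # disallow subscripted affixes that are not in the dict
    return any(r.match(token) for r in REGEXES.values())

print(affix_allowed("''.y", True, set()))   # False: subscripted, not in dict
print(affix_allowed("''", False, set()))    # True: matches the stand-in regex
```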

---> I will fix these bugs, add subscripts and send a PR.

While looking in the affix list, I got these questions/ideas:

  1. Note that BULLETS includes --. However, the relevant code may handle single characters only. If desired, the BULLETS list can be changed to a list of tokens like RPUNC etc. (a code modification is needed for that).
  2. In a PR that I would like to send next (originally to support stripping punctuation in amy), I added the ability to specify LPUNC and RPUNC regexes (as /regex/). I used this feature only for amy.
    However, I think it may be a good idea to add that to MPUNC as well (mostly copy/paste) so it can split on commas/colons using regexes with lookahead/lookbehind, avoiding the pitfalls mentioned in the comments below.
    % Split words that contain the following tokens in the middle of them.
    % We don't want commas in this list; it tends to mess up numbers. e.g.
    % "The enzyme has a weight of 125,000 to 130,000"
    % We don't want colons in this list; it tends to mess up time
    % expressions: "The train arrives at 13:42"
    % Some kind of fancier technique is needed for tokenizing those.
    %
    % TODO: this list should be expanded with other "typical"(?) junk
    % that is commonly (?) in broken texts.
    -- ‒ – — ― "(" ")" "[" "]" ... ";" ±: MPUNC+;

    It needs PCRE2/C++ regexes, but with a POSIX regex library, the lookahead/lookbehind expressions would be character sequences that just "never" match.
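For illustration, here is what such a lookaround-based MPUNC split could look like. This is a Python sketch (Python's re supports the fixed-width lookaround used here; the real implementation would use PCRE2), not the actual tokenizer code:

```python
import re

# Python sketch of a lookaround-based MPUNC split.  Commas and colons are
# split points only when NOT surrounded by digits on both sides, so
# "125,000" and "13:42" survive intact while "foo,bar" is split.
MPUNC_RE = re.compile(r"(?<!\d)[,:]|[,:](?!\d)")

def mpunc_split(token):
    """Split a token at matching punctuation, keeping it as its own token."""
    parts, last = [], 0
    for m in MPUNC_RE.finditer(token):
        if m.start() > last:
            parts.append(token[last:m.start()])
        parts.append(m.group())
        last = m.end()
    if last < len(token):
        parts.append(token[last:])
    return parts

print(mpunc_split("foo,bar"))  # ['foo', ',', 'bar']
print(mpunc_split("125,000"))  # ['125,000']
print(mpunc_split("13:42"))    # ['13:42']
```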

I implemented an affix-class tokens dict check, and I get the following:

link-grammar: Error: afdict_init: class { in file LPUNC: Token "en/4.0.affix" is not in the dictionary
link-grammar: Error: afdict_init: class 〈 in file LPUNC: Token "en/4.0.affix" is not in the dictionary
link-grammar: Error: afdict_init: class ( in file LPUNC: Token "en/4.0.affix" is not in the dictionary
link-grammar: Error: afdict_init: class 〔 in file LPUNC: Token "en/4.0.affix" is not in the dictionary
link-grammar: Error: afdict_init: class [ in file LPUNC: Token "en/4.0.affix" is not in the dictionary
link-grammar: Error: afdict_init: class 「 in file LPUNC: Token "en/4.0.affix" is not in the dictionary
link-grammar: Error: afdict_init: class `` in file LPUNC: Token "en/4.0.affix" is not in the dictionary
link-grammar: Error: afdict_init: class ‘ in file LPUNC: Token "en/4.0.affix" is not in the dictionary
link-grammar: Error: afdict_init: class '' in file LPUNC: Token "en/4.0.affix" is not in the dictionary
link-grammar: Error: afdict_init: class ¿ in file LPUNC: Token "en/4.0.affix" is not in the dictionary
link-grammar: Error: afdict_init: class ¡ in file LPUNC: Token "en/4.0.affix" is not in the dictionary
link-grammar: Error: afdict_init: class ﷼ in file LPUNC: Token "en/4.0.affix" is not in the dictionary
link-grammar: Error: afdict_init: class _ in file LPUNC: Token "en/4.0.affix" is not in the dictionary
link-grammar: Error: afdict_init: class ‐ in file LPUNC: Token "en/4.0.affix" is not in the dictionary
link-grammar: Error: afdict_init: class ‑ in file LPUNC: Token "en/4.0.affix" is not in the dictionary
link-grammar: Error: afdict_init: class } in file RPUNC: Token "en/4.0.affix" is not in the dictionary
link-grammar: Error: afdict_init: class 〉 in file RPUNC: Token "en/4.0.affix" is not in the dictionary
link-grammar: Error: afdict_init: class ) in file RPUNC: Token "en/4.0.affix" is not in the dictionary
link-grammar: Error: afdict_init: class 〕 in file RPUNC: Token "en/4.0.affix" is not in the dictionary
link-grammar: Error: afdict_init: class ] in file RPUNC: Token "en/4.0.affix" is not in the dictionary
link-grammar: Error: afdict_init: class 」 in file RPUNC: Token "en/4.0.affix" is not in the dictionary
link-grammar: Error: afdict_init: class ’’ in file RPUNC: Token "en/4.0.affix" is not in the dictionary
link-grammar: Error: afdict_init: class '' in file RPUNC: Token "en/4.0.affix" is not in the dictionary
link-grammar: Error: afdict_init: class ? in file RPUNC: Token "en/4.0.affix" is not in the dictionary
link-grammar: Error: afdict_init: class ! in file RPUNC: Token "en/4.0.affix" is not in the dictionary
link-grammar: Error: afdict_init: class _ in file RPUNC: Token "en/4.0.affix" is not in the dictionary
link-grammar: Error: afdict_init: class ‐ in file RPUNC: Token "en/4.0.affix" is not in the dictionary
link-grammar: Error: afdict_init: class ‑ in file RPUNC: Token "en/4.0.affix" is not in the dictionary
link-grammar: Error: afdict_init: class ‐ in file RPUNC: Token "en/4.0.affix" is not in the dictionary
link-grammar: Error: afdict_init: class 、= in file RPUNC: Token "en/4.0.affix" is not in the dictionary
link-grammar: Error: afdict_init: class ™ in file RPUNC: Token "en/4.0.affix" is not in the dictionary
link-grammar: Error: afdict_init: class ℠ in file RPUNC: Token "en/4.0.affix" is not in the dictionary

These tokens will be classified as UNKNOWN-WORD and appear with subscripts (.n etc.). That doesn't seem useful at all, and is even misleading, because sentences may parse fine for the wrong reason.
---> As I wrote above, I propose not to allow such tokens. It seems such tokens are also found in QUOTES and the hard-coded list of capitalized positions. (I have not checked that.)
Note that the 、= in the 2nd-to-last line should have been 2 separate characters. As a result, = is not separated as RPUNC.
Maybe some of these tokens can be added to the dict, e.g. with ZZZ linkage (like random quotes).
I also propose to add some affix splits to the tests to check the operation of LPUNC, RPUNC, MPUNC, and UNITS.

EDIT: I will of course fix the buggy argument order in this error message.

Question:
What to do on token errors in the affix file, like a token that is not in the dictionary?
Possibilities:

  1. Issue a warning, but otherwise do nothing. (The token will still get split off, but handled as an unknown word.)
  2. Issue a warning and ignore the token (remove it from the list).
  3. Fail the dictionary creation after reporting all the errors.

See also:

/* Check that the regex name is defined in the dictionary. */
if ((dict != NULL) && !dict_has_word(dict, rn->name))
{
	/* TODO: better error handling. Maybe remove the regex? */
	prt_error("Error: Regex name %s not found in dictionary!\n",
	          rn->name);

Here it currently issues an error, doesn't remove it from the list, but ignores it if it matches.

I'm for (2) or (3).
(3) is easier to implement, but may not be compatible with existing dictionaries.
I will implement (2) for now.
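A minimal sketch of option (2), with made-up names rather than the actual afdict_init() internals:

```python
def check_affix_class(class_name, tokens, dict_words):
    """Option (2): warn about affix tokens that are missing from the dict
    and drop them from the class, instead of letting them end up as
    UNKNOWN-WORD at parse time."""
    kept, missing = [], []
    for tok in tokens:
        (kept if tok in dict_words else missing).append(tok)
    if missing:
        print(f"Warning: Class {class_name}: Token(s) not in the dictionary: "
              + " ".join(f'"{t}"' for t in missing))
    return kept

kept = check_affix_class("LPUNC", ["(", "``"], {"("})
# kept == ['(']; a one-line warning was printed for "``"
```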

Was it an unfinished attempt to add '' as a synonym for double quotes?

Yes.

My proposed fix is to disallow subscripted affixes which are not in the dict.

OK

unsubscripted affixes that are not in the dict, maybe they should be allowed if they match a regex

No opinion.

What to do on token errors in the affix file, like token not in the dictionary

Option 3 is OK. I can fix the existing dictionaries. I'm surprised by these errors ... I'm looking now.

Since I added regex support to affix stripping (an upcoming PR), there is not much need to check the any/amy affixes.
I just replaced them with something like:
… .... "/[[:punct:]]$/": RPUNC+;
(I.e., I kept the multi-character tokens and added "/[[:punct:]]$/".)
(Using lookahead/lookbehind it is possible to implement context-aware stripping, and I think I have found good applications for that.)

What happens if an affix appears twice in the dict, e.g. 〔 《 【, which are a kind of parenthesis but are also used as quotation marks? I guess they need subscripts?

It seems QUOTES and BULLETS are only used in is_capitalizable(), so they don't interfere with strippable affix classes.
BTW:

  1. What about ''.x (2 single quotes)? Was it intended to be in the dict as a synonym for a double quote?
  2. Maybe QUOTES and BULLETS should be regexes. The tokenizer can check if the previous token matches them.
What about ''.x (2 single quotes)? Was it intended to be in the dict as a synonym for a double quote?

You already answered it above, sorry...

I just patched English in #1331 -- it's a minimalist fix; I didn't get fancy.

Maybe QUOTES and BULLETS should be regexes. The tokenizer can check if the previous token matches them.

I don't understand this remark.

BTW -- they're "bullets", as in "bullet points". I'm not sure, but I think gun bullets are named after the typographical mark .... (??) from the French/Latin "bull", a stamp ("papal bull").

Maybe QUOTES and BULLETS should be regexes. The tokenizer can check if the previous token matches them.

Consider this (see QUOTES and BULLETS at the end of is_capitalizable()):

/* Return true if the word might be capitalized by convention:
 * -- if its the first word of a sentence
 * -- if its the first word following a colon, a period, a question mark,
 *    or any bullet (For example: VII. Ancient Rome)
 * -- if its the first word following an ellipsis
 * -- if its the first word of a quote
 *
 * XXX FIXME: These rules are rather English-centric. Someone should
 * do something about this someday.
 */
static bool is_capitalizable(const Dictionary dict, const Gword *word)
{
	/* Words at the start of sentences are capitalizable. */
	if (MT_WALL == word->prev[0]->morpheme_type) return true;
	if (MT_INFRASTRUCTURE == word->prev[0]->morpheme_type) return true;

	/* Words following colons are capitalizable. */
	/* Mid-text periods and question marks are sentence-splitters. */
	if (strcmp(":", word->prev[0]->subword) == 0 ||
	    strcmp(".", word->prev[0]->subword) == 0 ||
	    strcmp("...", word->prev[0]->subword) == 0 ||
	    strcmp("…", word->prev[0]->subword) == 0 ||
	    strcmp("?", word->prev[0]->subword) == 0 ||
	    strcmp("!", word->prev[0]->subword) == 0 ||
	    strcmp("？", word->prev[0]->subword) == 0 ||
	    strcmp("！", word->prev[0]->subword) == 0)
		return true;
	if (in_afdict_class(dict, AFDICT_BULLETS, word->prev[0]->subword))
		return true;
	if (in_afdict_class(dict, AFDICT_QUOTES, word->prev[0]->subword))
		return true;
	return false;
}

'' (two single quotes) are now recognized in the dict as quotes, but they cannot be added to QUOTES because QUOTES is specified as a string and not as distinct tokens.

Similarly, -- is now mentioned in BULLETS, but it cannot be recognized as a BULLET for the same reason. Also, †† is in LPUNC, but cannot currently be added to BULLETS.

---> Proposal 1: Convert QUOTES/BULLETS to a list of tokens.
In the same occasion:
---> Proposal 2: Convert the hardcoded strcmp() list to an affix class CAPSTART(?) (or a regex if I find that to be a better implementation).

Or we can ditch QUOTES and BULLETS altogether and use a single name, say CAPSTART:

% quotes
" " « »《 》 【 】 『 』 ` „ “ ” ": CAPSTART+;
% bullets
" ( ) ¿ ¡ † †† ‡ § ¶ © ® ℗ № # * • ⁂ ❧ ☞ ◊ ※ ○ 。 ゜ ✿ ☆ * ◕ ● ∇ □ ◇ @ ◎ – ━ ー — -- - ‧ ": CAPSTART+;
% Additional capitalization start positions
. ... … ?  !  ?!: CAPSTART+;
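With a single CAPSTART class, the strcmp() chain and the QUOTES/BULLETS lookups in is_capitalizable() would collapse into one membership test. A Python sketch (the set contents are a small illustrative sample, not the full proposed lists):

```python
# Illustrative sketch only: a single CAPSTART token set replacing QUOTES,
# BULLETS, and the hardcoded strcmp() chain in is_capitalizable().
CAPSTART = {'"', "«", "»", "„", "“", "”",            # quotes
            "(", ")", "†", "‡", "§", "¶", "#", "*",  # bullets
            ".", "...", "…", ":", "?", "!"}          # sentence-splitters

def is_capitalizable_after(prev_token):
    """A word may be capitalized by convention if the previous
    token is any CAPSTART token."""
    return prev_token in CAPSTART

print(is_capitalizable_after("..."))  # True
print(is_capitalizable_after("the"))  # False
```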

a list of tokens.
an affix class CAPSTART(?) (or a regex

Ah, OK. Yes to all of the above. From what I can tell, there is no difference between QUOTES and BULLETS, and the only thing these serve is to indicate that the character immediately thereafter might be capitalized. So yes, these can be collapsed into one list, CAPSTART.

There is an interesting theoretical problem behind the idea of capitalization. I will think about it some more...

It seems to me it would be a major achievement if an unsupervised algo could find the equivalence between words that start with capital letters and those that do not, without being specifically programmed for that. I mean that it would just be designed to find patterns in the input, and would "automagically" indicate this equivalence.

The current code contains my initial implementation of capitalization parsing using dict definitions, but I stopped developing it when it seemed it would need disjunct manipulation, because I understood (maybe wrongly) that you dislike this idea (so I just continued with my very long LG todo list).

Is it reasonable to check the existence of the affix tokens in case of a DB/Atomese dict?
I guess this check should be skipped in that case.

case of a DB/Atomese dict?

There is a 4.0.affix used for splitting. I'll remove the subscripts in there.

if an unsupervised algo could find the equivalence of words that start with capital letters to those that are not

I think that it sometimes does this, but not consistently. It can't deduce this as a general rule, right now. To find the general rule, there would need to be work on general tokenization. I'm thinking about this. Deducing the simple regexes is also desirable. I think it's doable in principle; setting up the machinery in practice is ... a lot of work.

I still get this with the en dict:

link-grammar: Warning: afdict_init: Class LPUNC in file en/4.0.affix: Token "``" not in the dictionary!
link-grammar: Warning: afdict_init: Class LPUNC in file en/4.0.affix: Token "‘" not in the dictionary!

I guess they should be added to the dict.
(For dynamic dicts this check is skipped altogether.)

With most of the rest of the dictionaries there is of course a long list of such errors, as most of the tokens are not found in the dict.
Possible solutions:

  1. Remove them from the affix file.
  2. Add them to the dict, with a null expression. The result will be the same for all dicts but en, since they are resolved to UNKNOWN-WORD, which has a special mismatching connector. (For en it will just prevent them from being classified with a subscript.)
  3. Add a directive #define check_strippable_affixes false; to be used in desired cases.
    (A fancy implementation would allow redefining it; I don't imply that is actually needed...)

---> Which one is desired?

Comments:

  1. In the case of any/ady/amy, I will remove most of them in the PR that implements a /[[:punct:]]/ affix (leaving only those that are longer than one character).
  2. A simple implementation of check_strippable_affixes is very easy, as only one line of code is needed for it.
    The check for a dynamic dict could then be removed.
  3. It may be somewhat inconvenient to touch th/4.0.affix because it is under active development. Maybe (3) is appropriate in that case.

My proposals:

  1. Remove all the dict-unknown strippable affixes from the various dicts (except the ones that are longer than one character).
  2. Implement #define check_strippable_affixes false;.
  3. Define (2) for th and the dynamic dicts.
  4. Remove all the dict-unknown strippable affixes from the other dictionaries.

Please tell me what is desired and I will send the PR.

I'll do 1 or 2 on a case-by-case basis. I don't like 3 because it just adds complexity and hides a real problem.

In the PR I'm finishing now, I made the message on non-existent strippable affixes a warning only.
I also changed it to one long line per affix class (with a list of the offending affixes for that class).
(If desired, it is possible to add code to fold the lines to, say, 72 characters.)

I noticed that you made some minor changes to most/all 4.0.affix files (e20556a).
However, several dictionaries still report nonexistent affixes (some for most of their affixes).
I "fixed" them as follows (please tell me if changes are needed and I will change it accordingly):

de: I just changed the affixes to those that are used in the language and defined them with a null expression.
he: I left some, with null expressions.
vn: I only left comma (the only punctuation that is in the dict) since this demo dict doesn't work anyway.
tr: Only comma is in the dict. I left most of them intact and added them to the dict with a null expression.
id: It uses the affix file from en without changes, but defines only the comma. I removed everything but the comma.
kz: I removed everything but the comma.

th: In the dict, they define several Thai punctuation tokens, but they added only 3 of them to 4.0.affix (in RPUNC)!
Since this is a maintained project, I can think of these possibilities:

  1. Ask them what to do.
  2. Define all the non-existent punctuations with a null expression.
  3. Leave it as is (after applying my PR, link-parser will give 2 long warning lines, and the tests will emit warnings as well).

lt: I leave it to you...

ru: The current problems are:

link-grammar: Warning: afdict_init: Class LPUNC in file ru/4.0.affix: Token(s) not in the dictionary: "$" "``"
link-grammar: Warning: afdict_init: Class RPUNC in file ru/4.0.affix: Token(s) not in the dictionary: "%" "''" "'"

Since it is an implementation that is more than just a pure demo, I think we have to preserve the list of punctuation and just define the non-existent tokens with a null expression. However, the dict file is generated, and I guess such a definition should be added somewhere else (maybe as a new words file).

Maybe a comment should be added to all these modified affix files that most of the punctuation is not handled by their respective dict, and "see en/4.0.affix for a more complete list of strippable affixes".

Besides needing your input on this post (on what I have changed and also on things that still need to be fixed), this PR is ready. It also has code commits. Alternatively, I can just submit it and send fixes according to your comments. It may be convenient to apply it first because it implements the affix existence check.
The rest of the PRs are also mostly ready, but they will need merging and testing after this PR is applied and the needed affix fixes are done.

Go ahead and make the dict changes as you propose. I'll fix the russian dictionary.

For the Thai dictionary, let it emit warnings for now; maybe by tagging @kaamanita we'll get his attention and a pull req containing an appropriate fix.

Hmm. The Russian dict does not offer any handling at all for quotation marks ... I guess they'll need to be unknown-word ... you can add that, if desired. The current punctuation is at line 104, the unknown-word is at the very bottom.

you can add that, if desired.

I looked at it and it seems I can add them in 4.0.dict since only the files it includes are generated.
I will do that and will send the complete PR tomorrow. Besides it, I have several other PRs in various stages of readiness.

For the Thai dictionary, let it emit warnings for now

I will try to suppress them in tests.py since they are annoying... (by filtering out these warnings).


I forgot to include the warnings from lt:

link-grammar: Warning: afdict_init: Class LPUNC in file lt/4.0.affix: Token(s) not in the dictionary: "(" "$" "``"
link-grammar: Warning: afdict_init: Class RPUNC in file lt/4.0.affix: Token(s) not in the dictionary: ")" "%" ":" ";" "?" "!" "''" "'"