opencog / link-grammar

The CMU Link Grammar natural language parser

Empty subword!?

linas opened this issue · comments

I recently experienced a crash (stack trace below), and so I added this check:

--- a/link-grammar/tokenize/wordgraph.c
+++ b/link-grammar/tokenize/wordgraph.c
@@ -28,6 +28,7 @@ Gword *gword_new(Sentence sent, const char *s)
 
        memset(gword, 0, sizeof(*gword));
        assert(NULL != s, "Null-string subword");
+       assert(0 != *s, "Empty-string subword");
        gword->subword = string_set_add(s, sent->string_set);
 
        if (NULL != sent->last_word) sent->last_word->chain_next = gword;

As a result of this, tests.py fails. Investigating now.
The crash is this:

assert_failure (cond_str=cond_str@entry=0x7fffecce8be3 "temp_wend>w",
    func=func@entry=0x7fffecce9108 <__func__.10> "strip_right",
    src_location=src_location@entry=0x7fffecce7cb0 "../../link-grammar/tokenize/tokenize.c:2096",
    fmt=fmt@entry=0x7fffecce8bc6 "Unexpected empty-string word")
    at ../../link-grammar/error.c:467
467                     DEBUG_TRAP;  /* leave stack trace in debugger */ \

(gdb) bt
#0  assert_failure (cond_str=cond_str@entry=0x7fffecce8be3 "temp_wend>w",
    func=func@entry=0x7fffecce9108 <__func__.10> "strip_right",
    src_location=src_location@entry=0x7fffecce7cb0 "../../link-grammar/tokenize/tokenize.c:2096",
    fmt=fmt@entry=0x7fffecce8bc6 "Unexpected empty-string word")
    at ../../link-grammar/error.c:467
#1  0x00007fffeccd0f56 in strip_right (sent=sent@entry=0x7fff38411430,
    w=0x7fff382c6f10 "", wend=wend@entry=0x7fff44ff7c78,
    stripped=stripped@entry=0x7fff44ff7cd0,
    n_stripped=n_stripped@entry=0x7fff44ff7c68, p=p@entry=2,
    rootdigit=<optimized out>, classnum=<optimized out>)
    at ../../link-grammar/tokenize/tokenize.c:2096
#2  0x00007fffeccd4ea0 in separate_word (sent=sent@entry=0x7fff38411430,
    unsplit_word=unsplit_word@entry=0x7fff383e22f0, opts=0x7fff381fce10)
    at ../../link-grammar/tokenize/tokenize.c:2577
#3  0x00007fffeccd6155 in separate_sentence (sent=sent@entry=0x7fff38411430,
    opts=opts@entry=0x7fff381fce10)
    at ../../link-grammar/tokenize/tokenize.c:3116
#4  0x00007fffecc810c0 in sentence_split (sent=sent@entry=0x7fff38411430,
    opts=opts@entry=0x7fff381fce10) at ../../link-grammar/api.c:494
#5  0x00007fffecc8158f in sentence_parse (sent=sent@entry=0x7fff38411430,
    opts=opts@entry=0x7fff381fce10) at ../../link-grammar/api.c:679

(gdb) print word
$3 = (Gword *) 0x7fff383e22f0
(gdb) print word->start
$4 = 0x7fff382c6d4b " as he saw the figure nearby walking closer to him."
(gdb) print word->end
$5 = 0x7fff382c6d4b " as he saw the figure nearby walking closer to him."
(gdb) print word->subword
$6 = 0x7fff382c6f10 ""

Note that the subword is the empty string, which seems to be the root cause of the assert. So I thought I would check for the empty string much earlier.

@ampli -- this should be right up your alley -- the crash is in tests.py, here: line 843

def test_he_word_positions(self):
    linkage_testfile(self, Dictionary(lang='he'), ParseOptions(), 'pos')

In this case the empty string is intentional; see line 139:

/**
 * Add a display wordgraph placeholder for a combined morpheme with links
 * that are not discardable.
 * This is needed only when hiding morphology. This is a kind of a hack.
 * It it is not deemed nice, the "hide morphology" mode should just not be
 * used for languages with morphemes which have links that cannot be
 * discarded on that mode (like Hebrew).
 * Possible FIXME: Currently it is also used by w/ in English.
 */
static Gword *wordgraph_link_placeholder(Sentence sent, Gword *w)
{
	Gword *new_word;
	new_word = gword_new(sent, "");
	new_word->status |= WS_PL;
	new_word->label = "PH";
	new_word->start = w->start;
	new_word->end = w->end;
	return new_word;
}

BTW, line 130 starts with a typo. It should be: If it is ...

I think this is the only place an empty-string Gword subword is used.
So if you would like to keep the added assert() (to detect possible other problems), then maybe we can change line 139 to something like:

new_word = gword_new(sent, "PLACE_HOLDER");

To check the problem with the English sentence, I will need instructions on how to repeat it.

  • Possible FIXME: Currently it is also used by w/ in English.

It is comment rot: en morphemes are not combined under !morphology=0 (I think such a combination was temporarily done in some development code).

OK, I will add the change below.

new_word = gword_new(sent, "PLACE_HOLDER");

But

I will need instructions on how to repeat it.

Can't. It's something statistical; it doesn't always happen. I'm seeing two bugs (which might be the same bug): using the any language, if the first character is a UTF-8 quote, then sometimes (rarely) the gword length for it is set to 1, instead of the full UTF-8 length. The other bug is that the token at line 772 of tokenize.c comes up as zero-length. This is far less frequent; I'm investigating it now. It happened on the first word, which was He’d -- note the apostrophe is UTF-8.
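For reference on that last point: the apostrophe in He’d is U+2019, which is three bytes in UTF-8, so a gword length of 1 cuts the character mid-sequence. A minimal sketch of the byte arithmetic (utf8_charlen is a hypothetical helper, not library code):

#include <stdio.h>
#include <string.h>

/* Byte length of the UTF-8 sequence starting at s, decoded from its
 * lead byte. Hypothetical helper, only to illustrate the symptom. */
static int utf8_charlen(const char *s)
{
	unsigned char c = (unsigned char)*s;
	if (c < 0x80) return 1;           /* ASCII */
	if ((c & 0xE0) == 0xC0) return 2;
	if ((c & 0xF0) == 0xE0) return 3; /* U+2019 lands here */
	if ((c & 0xF8) == 0xF0) return 4;
	return -1;                        /* continuation or invalid lead byte */
}

int main(void)
{
	const char *word = "He\u2019d";          /* He’d */
	printf("%zu\n", strlen(word));           /* 6: 'H', 'e', 3 bytes, 'd' */
	printf("%d\n", utf8_charlen(word + 2));  /* 3, not 1 */
	return 0;
}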

None of this reproduces easily, and I don't know why. Maybe some weird threading issue or weird memory corruption, but it's very, very specific: always the same symptoms, always UTF-8 related.

I'm confused by zero-length stems. Very rarely, the *affix is zero-length at line 607, because the char** stem argument has an empty string in it. You seem to accept this as normal, just two lines later, in a debug print.

A sentence that generates this is "Then." (with the any language), but this does NOT happen in the command-line parser, or even through my library setup, when I try to trigger it on purpose. It only happens occasionally; the last few times, I had to chew through 120K sentences before it triggered.

Current hypothesis is that regex is not being used in a thread-safe fashion.

Current hypothesis is that pcre2_match_data *match_data needs to be per-thread, and it's not, which means threads are racing and clobbering it. Making this per-thread seems like it might need extensive changes.

I'm writing a patch now to use tss_create, tss_get, tss_set for pcre2_match_data. This will leave the non-pcre2 implementations thread-unsafe.
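Roughly, the shape of it -- a minimal sketch, assuming one tss key per compiled regex (the helper names here are mine, not necessarily what the patch will use):

#include <stdbool.h>
#include <threads.h>
#define PCRE2_CODE_UNIT_WIDTH 8
#include <pcre2.h>

static tss_t match_data_key;  /* one such key per compiled regex */

/* Destructor runs at thread exit, freeing that thread's match data. */
static void match_data_dtor(void *md)
{
	pcre2_match_data_free((pcre2_match_data *)md);
}

/* Call once, e.g. when the regex is compiled. */
static bool match_data_key_init(void)
{
	return thrd_success == tss_create(&match_data_key, match_data_dtor);
}

/* Each thread lazily creates its own private match data, so no two
 * threads ever share one pcre2_match_data. */
static pcre2_match_data *get_match_data(const pcre2_code *re)
{
	pcre2_match_data *md = tss_get(match_data_key);
	if (NULL == md)
	{
		md = pcre2_match_data_create_from_pattern(re, NULL);
		tss_set(match_data_key, md);
	}
	return md;
}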

I think it can be fixed by:

  • Using new per-regex-lib functions regex_match_alloc() & regex_match_free() in match_regex().
  • Adding a regex_match_data * argument to reg_match().

This will add a malloc() call per match (a sketch of this interface follows below). It also seems possible to use static TLS for the match data, to save the malloc() calls, but I don't know if this is preferable.

  • Using new per-regex-lib functions regex_match_alloc() & regex_match_free() in match_regex().

... I meant in matchspan_regex().
Since match_regex() doesn't currently need the match-span data, it is most probably possible to not allocate match data at all when it is used.
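A sketch of how that proposed interface might look -- the function names are the ones listed above, but the bodies are only my guess at a pcre2-only variant (error checks omitted):

#include <stdlib.h>
#include <stdbool.h>
#define PCRE2_CODE_UNIT_WIDTH 8
#include <pcre2.h>

/* Per-regex-lib wrapper; only the pcre2 variant is sketched here. */
typedef struct
{
	pcre2_match_data *md;
} regex_match_data;

static regex_match_data *regex_match_alloc(const pcre2_code *re)
{
	regex_match_data *rmd = malloc(sizeof(*rmd));
	rmd->md = pcre2_match_data_create_from_pattern(re, NULL);
	return rmd;
}

static void regex_match_free(regex_match_data *rmd)
{
	pcre2_match_data_free(rmd->md);
	free(rmd);
}

/* reg_match() then uses caller-owned match data, so concurrent calls
 * never share state. */
static bool reg_match(const char *s, const pcre2_code *re,
                      regex_match_data *rmd)
{
	int rc = pcre2_match(re, (PCRE2_SPTR)s, PCRE2_ZERO_TERMINATED,
	                     0, 0, rmd->md, NULL);
	return rc >= 0;  /* negative means no match (or an error) */
}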

Look at #1354 for a proposed fix. (It fixes pcre2 only.)

As to the match-span data, see the reg_span() function -- it gets called by matchspan_regex(), which sets the start and end values. These sometimes seemed to get set so that there are zero-length gwords, and sometimes so that UTF-8 chars are cut in the wrong place.
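With pcre2, that span comes out of the match-data ovector, so if two threads share one pcre2_match_data, offsets from different subjects can interleave -- which would explain both the zero-length gwords and the mid-character cuts. A sketch of span extraction with strictly per-call match data (illustrative only, not the library's actual reg_span()):

#include <stdbool.h>
#include <stddef.h>
#define PCRE2_CODE_UNIT_WIDTH 8
#include <pcre2.h>

/* Report the matched span as byte offsets into subject.
 * Illustrative only; not the library's reg_span(). */
static bool match_span(const pcre2_code *re, const char *subject,
                       size_t len, size_t *start, size_t *end)
{
	/* Private match data per call: nothing for another thread to clobber. */
	pcre2_match_data *md = pcre2_match_data_create_from_pattern(re, NULL);
	int rc = pcre2_match(re, (PCRE2_SPTR)subject, len, 0, 0, md, NULL);
	bool matched = (rc >= 0);
	if (matched)
	{
		PCRE2_SIZE *ov = pcre2_get_ovector_pointer(md);
		*start = ov[0];  /* offset of the first matched byte */
		*end   = ov[1];  /* offset just past the match */
	}
	pcre2_match_data_free(md);
	return matched;
}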

Sigh. Apparently, a max of 1024 thread keys can be created, and the Python test exceeds this number. Not sure if keys can be deleted, once created...

Oh, never mind, they can be deleted!
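So each tss_create() can be paired with a tss_delete() when the regex is freed. A minimal sketch of that pairing (names and the dtor are placeholders):

#include <stdlib.h>
#include <threads.h>

static tss_t match_data_key;

static void match_data_dtor(void *md) { free(md); }  /* placeholder */

void regex_match_init(void)
{
	/* Consumes one of the process-wide TSS keys (often capped at 1024). */
	tss_create(&match_data_key, match_data_dtor);
}

void regex_match_fini(void)
{
	/* Releasing the key keeps repeated dict open/close under the cap. */
	tss_delete(match_data_key);
}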

After deleting keys it is much better. However, the Russian dict needs more than 1024 keys; I guess there must be a regex node for each Russian suffix...

Closing, fixed by #1354