Empty subword!?
linas opened this issue
I recently experienced a crash (stack trace below), and so I added this check:
--- a/link-grammar/tokenize/wordgraph.c
+++ b/link-grammar/tokenize/wordgraph.c
@@ -28,6 +28,7 @@ Gword *gword_new(Sentence sent, const char *s)
memset(gword, 0, sizeof(*gword));
assert(NULL != s, "Null-string subword");
+ assert(0 != *s, "Empty-string subword");
gword->subword = string_set_add(s, sent->string_set);
if (NULL != sent->last_word) sent->last_word->chain_next = gword;
As a result of this, tests.py fails. Investigating now.
The crash is this:
assert_failure (cond_str=cond_str@entry=0x7fffecce8be3 "temp_wend>w",
func=func@entry=0x7fffecce9108 <__func__.10> "strip_right",
src_location=src_location@entry=0x7fffecce7cb0 "../../link-grammar/tokenize
/tokenize.c:2096",
fmt=fmt@entry=0x7fffecce8bc6 "Unexpected empty-string word")
at ../../link-grammar/error.c:467
467 DEBUG_TRAP; /* leave stack trace in debugger */
(gdb) bt
#0 assert_failure (cond_str=cond_str@entry=0x7fffecce8be3 "temp_wend>w",
func=func@entry=0x7fffecce9108 <__func__.10> "strip_right",
src_location=src_location@entry=0x7fffecce7cb0 "../../link-grammar/tokenize/tokenize.c:2096",
fmt=fmt@entry=0x7fffecce8bc6 "Unexpected empty-string word")
at ../../link-grammar/error.c:467
#1 0x00007fffeccd0f56 in strip_right (sent=sent@entry=0x7fff38411430,
w=0x7fff382c6f10 "", wend=wend@entry=0x7fff44ff7c78,
stripped=stripped@entry=0x7fff44ff7cd0,
n_stripped=n_stripped@entry=0x7fff44ff7c68, p=p@entry=2,
rootdigit=<optimized out>, classnum=<optimized out>)
at ../../link-grammar/tokenize/tokenize.c:2096
#2 0x00007fffeccd4ea0 in separate_word (sent=sent@entry=0x7fff38411430,
unsplit_word=unsplit_word@entry=0x7fff383e22f0, opts=0x7fff381fce10)
at ../../link-grammar/tokenize/tokenize.c:2577
#3 0x00007fffeccd6155 in separate_sentence (sent=sent@entry=0x7fff38411430,
opts=opts@entry=0x7fff381fce10)
at ../../link-grammar/tokenize/tokenize.c:3116
#4 0x00007fffecc810c0 in sentence_split (sent=sent@entry=0x7fff38411430,
opts=opts@entry=0x7fff381fce10) at ../../link-grammar/api.c:494
#5 0x00007fffecc8158f in sentence_parse (sent=sent@entry=0x7fff38411430,
opts=opts@entry=0x7fff381fce10) at ../../link-grammar/api.c:679
(gdb) print word
$3 = (Gword *) 0x7fff383e22f0
(gdb) print word->start
$4 = 0x7fff382c6d4b " as he saw the figure nearby walking closer to him."
(gdb) print word->end
$5 = 0x7fff382c6d4b " as he saw the figure nearby walking closer to him."
(gdb) print word->subword
$6 = 0x7fff382c6f10 ""
Note that the subword is the empty string, which seems to be the root cause of the assert. So I thought I would check for the empty string much earlier.
@ampli -- this should be right up your alley -- the crash is in tests.py, at line 843:

```python
def test_he_word_positions(self):
    linkage_testfile(self, Dictionary(lang='he'), ParseOptions(), 'pos')
```
In this case the empty string is intentional; see line 139 of link-grammar/linkage/linkage.c (lines 126 to 146 at commit f7b7c16).
BTW, line 130 starts with a typo. It should be: If it is ...
I think this is the only place an empty-string Gword subword is used.
So if you would like to keep the added `assert()` (to detect possible other problems), then maybe we can change line 139 to something like:

```c
new_word = gword_new(sent, "PLACE_HOLDER");
```
To check the problem with the English sentence, I will need instructions on how to repeat it.
> Possible FIXME: Currently it is also used by w/ in English.

This is comment rot, since `en` morphemes are not combined under `!morphology=0` (I think such a combination was temporarily done in development code).
OK, I will add the below:

```c
new_word = gword_new(sent, "PLACE_HOLDER");
```

But I will need instructions on how to repeat it.
Can't. It's something statistical; it doesn't always happen. I'm seeing two bugs (which might be the same bug): using the `any` language, if the first character is a UTF-8 quote, then sometimes (rarely) the gword length on it is set to 1, instead of the full UTF-8 length. The other bug is that the token at line 772 of tokenize.c comes up as zero length. This is far less frequent; I'm investigating it now. It happened on the first word, which was He’d -- note the apostrophe is UTF-8.
None of this reproduces easily, and I don't know why. Maybe some weird threading issue or weird memory corruption, but it's very, very specific: always the same symptoms, UTF-8 related.
I'm confused by zero-length stems. Very rarely, the `*affix` is zero length at line 607, because the `char **stem` argument has an empty string in it. You seem to accept this as normal, in a debug print just two lines later.
A sentence that generates this is "Then." (with the `any` language), but this does NOT happen in the command-line parser, or even through my library setup, when I try to trigger it on purpose. It only happens occasionally; the last few times, I had to chew through 120K sentences before it triggered.
Current hypothesis is that regex is not being used in a thread-safe fashion.
Current hypothesis is that `pcre2_match_data *match_data` needs to be per-thread, and it's not. Which means threads are racing and clobbering this. Making this per-thread seems like it might need extensive changes.
I'm writing a patch now to use `tss_create`, `tss_get`, `tss_set` for `pcre2_match_data`. This will leave the non-pcre2 implementations thread-unsafe.
I think it can be fixed by:
- Using per-regex-lib new functions `regex_match_alloc()` & `regex_match_free()` in `match_regex()`.
- Adding a `regex_match_data *` argument to `reg_match()`.

This will add a `malloc()` call per match. It seems it is also possible to use static TLS for the match data to save `malloc()` calls, but I don't know if this is preferable.
> - Using per-regex-lib new functions `regex_match_alloc()` & `regex_match_free()` in `match_regex()`.

... in `matchspan_regex()`.
Since `match_regex()` doesn't currently need the match-span data, it is most probably possible not to allocate match data at all when it is used.
Look at #1354 for a proposed fix. (It fixes pcre2 only.)
As to match-span data, see the `reg_span()` function -- it gets called by `matchspan_regex()`, which sets `start` and `end` values. These seemed to sometimes get set so that there are zero-length gwords, and sometimes set so that UTF-8 chars are cut in the wrong place.
Sigh. Apparently, a maximum of 1024 thread keys can be created, and the python test exceeds this number. Not sure if keys can be deleted once created...
Oh, never mind, they can be deleted!
After deleting keys it is much better. However, the Russian dict needs more than 1024 keys; I guess there must be a regex node for each Russian suffix...
Closing; fixed by #1354.