microsoft / bistring

Bidirectionally transformed strings

Whitespace at start of string mishandled by SentenceTokenizer?

qtdaniel opened this issue

I'm not sure if this is expected behaviour or a bug but the following code illustrates my uncertainty:

import bistring

pre_split = bistring.bistr(" \tFoo. \t\n \tBar. \t") \
    .sub(r"^\s+", "") \
    .sub(r"\s*\n\s*", "\n") \
    .sub(r"\s+$", "\n")

post_split = bistring.bistr.join(
    [s.text for s in bistring.SentenceTokenizer("en_GB").tokenize(pre_split)]
)

# These should print True but actually print False
print(pre_split == post_split)
print(pre_split.original == post_split.original)

# This should print True and does print True
print(pre_split.modified == post_split.modified)

# This should print False but actually prints True
print(pre_split.original[2:] == post_split.original)

In summary, I was expecting the result of re-joining the tokens produced by SentenceTokenizer to be a bistr identical to the one that existed prior to the splitting. This appears to hold, with the only (known) exception being that whitespace at the start of the first sentence is lost. Whitespace at the end of the string, and whitespace between sentences within the string, are retained as expected.
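For reference, here is roughly how the two original strings come out (hand-traced rather than pasted from a session, so treat the exact reprs as approximate); the leading " \t" is what goes missing:

>>> pre_split.original
' \tFoo. \t\n \tBar. \t'
>>> post_split.original
'Foo. \t\n \tBar. \t'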

Is this expected behaviour?

Behaviour reproduced using Python 3.7, bistring 0.4.0, pyicu 2.6, and icu 68.1 (all installed via conda-forge).

I believe this is expected behaviour. The reason is that the leading/trailing whitespace is mapped to an empty substring. The algorithm for mapping back and forth between the strings tries to find the smallest interval that contains the input interval across the alignment. Since "Foo." is smaller than " \tFoo.", that will be the answer:

>>> print(pre_split[:4])
⮎'Foo.'⮌

The trailing whitespace only worked because some trailing whitespace was preserved by the SentenceTokenizer:

>>> print(pre_split[5:9])
⮎'Bar.'⮌
>>> print(pre_split[5:10])
('Bar. \t' ⇋ 'Bar.\n')

That's usually the desired behaviour, though sometimes it makes sense to get the largest corresponding interval instead (kind of like rounding up instead of down). There's no API for that, but I could add one. It would help to know more about your use case.
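To spell out the rounding analogy with the example above: slicing currently gives the "rounded down" interval, while a hypothetical "rounded up" API would pull in the stripped whitespace. (The second line below just indexes the original string by hand to show what that would look like; it isn't an existing API.)

>>> pre_split[0:4].original      # smallest interval: rounds down
'Foo.'
>>> pre_split.original[0:6]      # what a largest-interval ("round up") API would return
' \tFoo.'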

Also note that in general it's not true that joining the tokens from a tokenization will get you the original string back. E.g. with a WordTokenizer you'll just get 'FooBar' back with no spaces anywhere. It's better to use Tokenization.substring() to join some tokens back together, e.g.

>>> tokens = bistring.WordTokenizer("en_GB").tokenize(pre_split)
>>> print(tokens)
Tokenization((' \tFoo. \t\n \tBar. \t' ⇋ 'Foo.\nBar.\n'), [[0:3]=⮎'Foo'⮌, [5:8]=⮎'Bar'⮌])
>>> print(tokens.substring(0, 2))
('Foo. \t\n \tBar' ⇋ 'Foo.\nBar')

But note that you still lose leading/trailing whitespace, for the same reason.

Thanks for the reply. Sorry I've not been able to look into this again since posting the issue. I do still intend to come back to this and reply with a confident "resolved" or "still uncertain" but that reply may not be quick.

I've only now had a chance to look into this again and understand the main point you're making, i.e. it's unreasonable to assume that tokenization results in tokens that span the entire original text with no gaps or omitted characters. I've now modified my code to account for this reality so will close this issue. Thanks!

@qtdaniel Great!

I'm considering a slight change that's relevant to this issue, basically distinguishing between foo[:4] and foo[0:4]. The first one would include the entire beginning of both the original and modified strings, while the second one would skip whitespace etc. that's been stripped. Similarly for foo[4:] and foo[4:len(foo)].
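For reference, a quick check of how those two spellings behave today, using the pre_split value from the top of the thread; as far as I can tell they are currently identical, which is exactly what the proposal would change:

>>> print(pre_split[0:4])
⮎'Foo.'⮌
>>> pre_split[:4] == pre_split[0:4]
True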

That feels a little strange to me. I would have been surprised if I discovered that foo[:4] did not behave the same as foo[0:4]. I think as long as it is clear that there is no intention for any form of tokenisation to span the entire original string, then retaining normal Python slice semantics would be the safest way to go. For my use-case, this change would not have helped since I need to deal with "missing" whitespace between tokens as well as at the start/end. Perhaps if other people are hitting similar issues it could be solved by providing an alternate joining algorithm like the one I ended up writing for myself, but maybe my use-case is unusual.

True. On the other hand, it's also surprising that foo[:] != foo, because it can slice text off of the beginning and end of the original string.
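A sketch of that surprise with the same pre_split value (the outputs are what I'd expect from the smallest-interval rule, not copied from a session):

>>> print(pre_split)
(' \tFoo. \t\n \tBar. \t' ⇋ 'Foo.\nBar.\n')
>>> print(pre_split[:])
('Foo. \t\n \tBar. \t' ⇋ 'Foo.\nBar.\n')
>>> pre_split[:] == pre_split
False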

Perhaps if other people are hitting similar issues it could be solved by providing an alternate joining algorithm like the one I ended up writing for myself, but maybe my use-case is unusual.

I'm curious what you ended up writing, and whether it would be convenient to have bistring provide it. Is it different than Tokenization.substring()?

An outline of the requirement is illustrated in the attached diagram. We need a process that takes a bistring as input and emits a bistring as output, such that the output bistring has the same "original" string as the input bistring; internally, the process will transform the full bistring, tokenize it, and then transform the individual tokens. We need a join algorithm that can join the modified tokens back together in a way that aligns with the full original string, retaining the modified version of any content that was not part of any token. Note that insertions and deletions can occur at any point in the process. Maybe I've missed something, but I don't think bistring provides such an algorithm right now.

[Attached diagram: bistring split join algorithm]

That looks like a job for BistrBuilder:

>>> bs = bistr('  the _ quick  brown -- fox  ')
>>> bs = bs.strip()
>>> bs = bs.replace('  ', ' ')
>>>
>>> builder = BistrBuilder(bs)
>>>
>>> last = 0
>>> for token in WordTokenizer('en_US').tokenize(bs):
...     builder.skip(token.start - last)
...     builder.append(bistr(token.modified).title('en_US'))
...     last = token.end
...
>>> builder.skip_rest()
>>> bs = builder.build()
>>> print(bs)
('  the _ quick  brown -- fox  ' ⇋ 'The _ Quick Brown -- Fox')
>>> print(bs[0:3])
('the' ⇋ 'The')
>>> print(bs[6:11])
('quick' ⇋ 'Quick')
>>> print(bs[6:7])
('q' ⇋ 'Q')

Yes, that might make things a bit easier but sadly it would mean quite a substantial change to the code we've already developed.

Just to help me understand: wouldn't builder.append(bistr(token.modified).title('en_US')) in your example be better written as builder.append(token.text.title('en_US'))?

No, the usual use of append() is something like bs = bistr(builder.peek(10)); builder.append(bs.title()), so bs.original should be a chunk of the "current" string, which means the previous .modified string.
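To make that pattern concrete, here is a minimal, self-contained sketch of the peek()/append() idiom described above; the printed result at the end is my expectation rather than a captured session:

from bistring import bistr, BistrBuilder

builder = BistrBuilder(bistr('hello world'))

# peek(5) returns the next chunk of the *current* string (the previous .modified);
# wrapping it in bistr() makes that chunk the .original of the piece we append.
chunk = bistr(builder.peek(5))
builder.append(chunk.title())   # consumes 'hello', emits 'Hello'
builder.skip_rest()             # copy the remainder through unchanged

result = builder.build()
print(result)                   # expected: ('hello world' ⇋ 'Hello world')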