microsoft / bistring

Bidirectionally transformed strings

Whitespace at start of string mishandled by SentenceTokenizer?

qtdaniel opened this issue

I'm not sure if this is expected behaviour or a bug but the following code illustrates my uncertainty:

import bistring

pre_split = bistring.bistr(" \tFoo. \t\n \tBar. \t") \
    .sub(r"^\s+", "") \
    .sub(r"\s*\n\s*", "\n") \
    .sub(r"\s+$", "\n")

post_split = bistring.bistr.join(
    [s.text for s in bistring.SentenceTokenizer("en_GB").tokenize(pre_split)]
)

# These should print True but actually print False
print(pre_split == post_split)
print(pre_split.original == post_split.original)

# This should print True and does print True
print(pre_split.modified == post_split.modified)

# This should print False but actually prints True
print(pre_split.original[2:] == post_split.original)

In summary, I was expecting the result of re-joining the tokens produced by SentenceTokenizer to be a bistr identical to the one that existed prior to the splitting. This appears to hold, with the only (known) exception being that whitespace at the start of the first sentence is lost. Whitespace at the end of the string, and whitespace between sentences within the string, are retained as expected.
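For reference, here is roughly how the two original strings come out (hand-traced rather than pasted from a session, so treat the exact reprs as approximate); the leading " \t" is what goes missing:

>>> pre_split.original
' \tFoo. \t\n \tBar. \t'
>>> post_split.original
'Foo. \t\n \tBar. \t'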

Is this expected behaviour?

Behaviour reproduced using Python 3.7, bistring 0.4.0, pyicu 2.6, and icu 68.1 (all installed via conda-forge).

I believe this is expected behaviour. The reason is that the leading/trailing whitespace is mapped to an empty substring. The algorithm for mapping back and forth between the strings tries to find the smallest interval that contains the input interval across the alignment. Since "Foo." is smaller than " \tFoo.", that will be the answer:

>>> print(pre_split[:4])
⮎'Foo.'⮌

The trailing whitespace only worked because some trailing whitespace was preserved by the SentenceTokenizer:

>>> print(pre_split[5:9])
⮎'Bar.'⮌
>>> print(pre_split[5:10])
('Bar. \t' ⇋ 'Bar.\n')

That's usually the desired behaviour, though sometimes it makes sense to get the largest corresponding interval instead (kind of like rounding up instead of down). There's no API for that, but I could add one. It would help to know more about your use case.
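To spell out the rounding analogy with the example above: slicing currently gives the "rounded down" interval, while a hypothetical "rounded up" API would pull in the stripped whitespace. (The second line below just indexes the original string by hand to show what that would look like; it isn't an existing API.)

>>> pre_split[0:4].original      # smallest interval: rounds down
'Foo.'
>>> pre_split.original[0:6]      # what a largest-interval ("round up") API would return
' \tFoo.'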

Also note that in general it's not true that joining the tokens from a tokenization will get you the original string back. E.g. with a WordTokenizer you'll just get 'FooBar' back with no spaces anywhere. It's better to use Tokenization.substring() to join some tokens back together, e.g.

>>> tokens = bistring.WordTokenizer("en_GB").tokenize(pre_split)
>>> print(tokens)
Tokenization((' \tFoo. \t\n \tBar. \t' ⇋ 'Foo.\nBar.\n'), [[0:3]=⮎'Foo'⮌, [5:8]=⮎'Bar'⮌])
>>> print(tokens.substring(0, 2))
('Foo. \t\n \tBar' ⇋ 'Foo.\nBar')

But note that you still lose leading/trailing whitespace, for the same reason.

Thanks for the reply. Sorry I've not been able to look into this again since posting the issue. I do still intend to come back to this and reply with a confident "resolved" or "still uncertain" but that reply may not be quick.

I've only now had a chance to look into this again and understand the main point you're making, i.e. it's unreasonable to assume that tokenization results in tokens that span the entire original text with no gaps or omitted characters. I've now modified my code to account for this reality so will close this issue. Thanks!

@qtdaniel Great!

I'm considering a slight change that's relevant to this issue, basically distinguishing between foo[:4] and foo[0:4]. The first one would include the entire beginning of both the original and modified strings, while the second one would skip whitespace etc. that's been stripped. Similarly for foo[4:] and foo[4:len(foo)].
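For reference, a quick check of how those two spellings behave today, using the pre_split value from the top of the thread; as far as I can tell they are currently identical, which is exactly what the proposal would change:

>>> print(pre_split[0:4])
⮎'Foo.'⮌
>>> pre_split[:4] == pre_split[0:4]
True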

That feels a little strange to me. I would have been surprised if I discovered that foo[:4] did not behave the same as foo[0:4]. I think as long as it is clear that there is no intention for any form of tokenisation to span the entire original string, then retaining normal Python slice semantics would be the safest way to go. For my use-case, this change would not have helped since I need to deal with "missing" whitespace between tokens as well as at the start/end. Perhaps if other people are hitting similar issues it could be solved by providing an alternate joining algorithm like the one I ended up writing for myself, but maybe my use-case is unusual.

True. On the other hand, it's also surprising that foo[:] != foo, because it can slice text off of the beginning and end of the original string.
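A sketch of that surprise with the same pre_split value (the outputs are what I'd expect from the smallest-interval rule, not copied from a session):

>>> print(pre_split)
(' \tFoo. \t\n \tBar. \t' ⇋ 'Foo.\nBar.\n')
>>> print(pre_split[:])
('Foo. \t\n \tBar. \t' ⇋ 'Foo.\nBar.\n')
>>> pre_split[:] == pre_split
False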

Perhaps if other people are hitting similar issues it could be solved by providing an alternate joining algorithm like the one I ended up writing for myself, but maybe my use-case is unusual.

I'm curious what you ended up writing, and whether it would be convenient to have bistring provide it. Is it different than Tokenization.substring()?

An outline of the requirement is illustrated in the attached diagram. We need a process that takes a bistring as input and emits a bistring as output, such that the output bistring has the same "original" string as the input bistring; internally, the process will transform the full bistring, tokenize it, and then transform the individual tokens. We need a join algorithm that can join the modified tokens back together in a way that aligns with the full original string, retaining the modified version of any content that was not part of any token. Note that insertions and deletions can occur at any point in the process. Maybe I've missed something, but I don't think bistring provides such an algorithm right now.

[Attached diagram: bistring split join algorithm]

That looks like a job for BistrBuilder:

>>> bs = bistr('  the _ quick  brown -- fox  ')
>>> bs = bs.strip()
>>> bs = bs.replace('  ', ' ')
>>>
>>> builder = BistrBuilder(bs)
>>>
>>> last = 0
>>> for token in WordTokenizer('en_US').tokenize(bs):
...     builder.skip(token.start - last)
...     builder.append(bistr(token.modified).title('en_US'))
...     last = token.end
...
>>> builder.skip_rest()
>>> bs = builder.build()
>>> print(bs)
('  the _ quick  brown -- fox  ' ⇋ 'The _ Quick Brown -- Fox')
>>> print(bs[0:3])
('the' ⇋ 'The')
>>> print(bs[6:11])
('quick' ⇋ 'Quick')
>>> print(bs[6:7])
('q' ⇋ 'Q')

Yes, that might make things a bit easier but sadly it would mean quite a substantial change to the code we've already developed.

Just to help me understand: wouldn't builder.append(bistr(token.modified).title('en_US')) in your example be better written as builder.append(token.text.title('en_US'))?

No, the usual use of append() is something like bs = bistr(builder.peek(10)); builder.append(bs.title()), so bs.original should be a chunk of the "current" string, which means the previous .modified string.
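To make that pattern concrete, here is a minimal, self-contained sketch of the peek()/append() idiom described above; the printed result at the end is my expectation rather than a captured session:

from bistring import bistr, BistrBuilder

builder = BistrBuilder(bistr('hello world'))

# peek(5) returns the next chunk of the *current* string (the previous .modified);
# wrapping it in bistr() makes that chunk the .original of the piece we append.
chunk = bistr(builder.peek(5))
builder.append(chunk.title())   # consumes 'hello', emits 'Hello'
builder.skip_rest()             # copy the remainder through unchanged

result = builder.build()
print(result)                   # expected: ('hello world' ⇋ 'Hello world')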