When `MRB_UTF8_STRING` is enabled, giving byte characters for `String#index` and `String#split` gives wrong results

Question

dearblue opened this issue 2 months ago

I build with MRUBY_CONFIG=host-debug.
mruby revision is dcee404.

p "①②③④⑤\xe2".size
# => 6
p "①②③④⑤\xe2".unpack1("H*")
# => "e291a0e291a1e291a2e291a3e291a4e2"

p "①②③④⑤\xe2".index("\xa4")
# => 6
## Expected: nil
## mruby-3.3 returns nil

p "①②③④⑤\xe2".split("\xe2")
# => ["", "\x91\xa0", "\x91\xa1", "\x91\xa2", "\x91\xa3", "\x91\xa4"]
## Expected: ["①②③④⑤"]
## mruby-3.3 gives the same wrong result

The String#split method uses the byte offset returned by mrb_memsearch() directly.
The String#index method converts the byte offset returned by mrb_memsearch() with byte2char(), but the converted result is wrong.
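To make the expected behavior concrete, here is a rough sketch in plain Ruby (not mruby's C implementation) of a search that accepts a byte-level match only when it both starts and ends on a character boundary. The helper names `utf8_char_len` and `char_index` are hypothetical, and the UTF-8 decoding is simplified: it does not reject overlong or surrogate sequences, and it treats every invalid or truncated byte as a one-byte character.

```ruby
# Hypothetical helper: length of the UTF-8 sequence starting at bytes[pos].
# An invalid or truncated sequence counts as a single one-byte character.
# Simplified: overlong and surrogate encodings are not rejected.
def utf8_char_len(bytes, pos)
  lead = bytes[pos]
  len = if lead < 0x80 then 1
        elsif lead >= 0xC2 && lead <= 0xDF then 2
        elsif lead >= 0xE0 && lead <= 0xEF then 3
        elsif lead >= 0xF0 && lead <= 0xF4 then 4
        else 1
        end
  return 1 if pos + len > bytes.size
  (1...len).each do |i|
    return 1 unless (bytes[pos + i] & 0xC0) == 0x80
  end
  len
end

# Hypothetical helper: character index of the first match of needle that
# lies on character boundaries of haystack, or nil if there is none.
def char_index(haystack, needle)
  hb = haystack.bytes
  nb = needle.bytes

  # Map each character start (and the end of the string) to a character index.
  boundaries = {}
  pos = 0
  count = 0
  while pos < hb.size
    boundaries[pos] = count
    pos += utf8_char_len(hb, pos)
    count += 1
  end
  boundaries[pos] = count

  (0..hb.size - nb.size).each do |start|
    next unless hb[start, nb.size] == nb          # plain byte-level match
    stop = start + nb.size
    # Reject matches that begin or end in the middle of a character.
    return boundaries[start] if boundaries.key?(start) && boundaries.key?(stop)
  end
  nil
end

s = "①②③④⑤\xe2"
p char_index(s, "\xa4")  # => nil (0xa4 only occurs inside "⑤")
p char_index(s, "\xe2")  # => 5   (only the trailing stray byte matches)
p char_index(s, "③")     # => 2
```

Under these assumptions, a split built on the same boundary check would cut only at the trailing `\xe2`, which is consistent with the expected `["①②③④⑤"]`.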

Just reporting it for now.

Yukihiro "Matz" Matsumoto · Answer 1 · Wed May 15 2024 06:58:13 GMT+0800 (China Standard Time)

I am not sure what the correct behavior is when we search for a byte in a UTF-8 string. Basically it is undefined behavior, which means anything could happen. For example, if we specify ASCII-8BIT encoding for the strings in CRuby, the last `p` line prints ["", "\x91\xA0", "\x91\xA1", "\x91\xA2", "\x91\xA3", "\x91\xA4"].
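For reference, the CRuby comparison can be reproduced roughly as follows. This is a sketch for CRuby only, not mruby: it assumes `String#b` is used to obtain ASCII-8BIT copies, and the variable names are illustrative.

```ruby
# CRuby only: String#b returns a copy tagged as ASCII-8BIT (binary),
# so searching and splitting operate on raw bytes.
s   = "①②③④⑤\xe2".b
sep = "\xe2".b

parts = s.split(sep)
p parts.map { |part| part.unpack1("H*") }
# => ["", "91a0", "91a1", "91a2", "91a3", "91a4"]

pos = s.index("\xa4".b)
p pos
# => 14 (a byte offset, because every byte is a character in ASCII-8BIT)
```

With binary strings the byte-wise result is well defined, which is why the split output matches the one reported for mruby.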

dearblue · Answer 2 · Wed May 15 2024 22:54:52 GMT+0800 (China Standard Time)

Thanks for your comment.

I had assumed that, since the encoding cannot be changed in mruby, everything except illegal bytes is handled in units of UTF-8 characters.
However, I can agree with your opinion.

So if it's not a UTF-8 string, you need to use byte-oriented methods.

Use String#byteindex instead of String#index.
If you give String#split a byte sequence, it will be split as it should be.

😺 Ok, no problem.

Yukihiro "Matz" Matsumoto · Answer 3 · Thu May 16 2024 06:12:23 GMT+0800 (China Standard Time)

Thank you. But `p "①②③④⑤\xe2".index("\xa4")` should still be nil. So I will reopen this issue and fix it soon.