mruby / mruby

Lightweight Ruby

When `MRB_UTF8_STRING` is enabled, passing a lone byte to `String#index` or `String#split` gives wrong results

dearblue opened this issue

I built with `MRUBY_CONFIG=host-debug`.
The mruby revision is dcee404.

p "①②③④⑤\xe2".size
# => 6
p "①②③④⑤\xe2".unpack1("H*")
# => "e291a0e291a1e291a2e291a3e291a4e2"

p "①②③④⑤\xe2".index("\xa4")
# => 6
## Expected: nil
## mruby-3.3 returns nil

p "①②③④⑤\xe2".split("\xe2")
# => ["", "\x91\xa0", "\x91\xa1", "\x91\xa2", "\x91\xa3", "\x91\xa4"]
## Expected: ["①②③④⑤"]
## mruby-3.3 is also wrong

The String#split method uses the result of mrb_memsearch() directly.
The String#index method adjusts the result of mrb_memsearch() with byte2char(), but the result is wrong.
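To make the expected character-unit behavior concrete, here is a minimal sketch. It assumes CRuby, where `each_char` yields a stray invalid byte as its own character. The helper `char_aligned_index` is hypothetical and is not how mruby implements `String#index`; it only illustrates the idea of accepting a byte-level hit when it starts and ends on a character boundary.

```ruby
# Hypothetical helper, not mruby code: a byte search that accepts a hit only
# when it starts and ends on a character boundary.
def char_aligned_index(haystack, needle)
  # Byte offset at which each character (or stray invalid byte) begins,
  # followed by the byte length of the whole string.
  bounds = [0]
  haystack.each_char { |ch| bounds << bounds.last + ch.bytesize }

  hay = haystack.b  # binary copies, so String#index works on raw bytes
  pat = needle.b
  pos = 0
  while (hit = hay.index(pat, pos))
    if bounds.include?(hit) && bounds.include?(hit + pat.bytesize)
      return bounds.index(hit)  # convert byte offset to character index
    end
    pos = hit + 1
  end
  nil
end

s = "①②③④⑤\xe2"
p char_aligned_index(s, "\xa4")  # => nil (the only 0xA4 byte is inside "⑤")
p char_aligned_index(s, "\xe2")  # => 5   (the stray trailing byte)
p char_aligned_index(s, "③")     # => 2
```

With these semantics, a split on `"\xe2"` would cut only at the trailing byte, which matches the expected `["①②③④⑤"]` above.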

Just reporting it for now.

I am not sure what the correct behavior is when we search for a byte in a UTF-8 string. Basically it is undefined behavior, which means anything could happen. For example, if we specify ASCII-8BIT encoding for the strings in CRuby, the last `p` line prints ["", "\x91\xA0", "\x91\xA1", "\x91\xA2", "\x91\xA3", "\x91\xA4"].
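The ASCII-8BIT case can be reproduced in CRuby with a short sketch like the one below. Using `String#b` to obtain the binary copies is an assumption about how the encoding was specified; `force_encoding("ASCII-8BIT")` would be another way.

```ruby
s = "①②③④⑤\xe2"

# String#b returns an ASCII-8BIT copy, so every byte counts as one character
# and split cuts at every 0xE2 byte, including the lead byte of each "①".."⑤".
parts = s.b.split("\xe2".b)
p parts
# => ["", "\x91\xA0", "\x91\xA1", "\x91\xA2", "\x91\xA3", "\x91\xA4"]
```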

Thanks for your comment.

I had assumed that, since the encoding cannot be changed in mruby, everything except illegal bytes is handled in units of UTF-8 characters.
However, I can agree with your opinion.

So if it's not a UTF-8 string, you need to use byte-oriented methods.

  • Use String#byteindex instead of String#index.
  • If you give String#split a byte sequence, it will be split as it should be.
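As a rough illustration of a byte-oriented search, the sketch below runs on CRuby and uses binary copies of the strings so that the result is a byte offset. The `respond_to?` guard is there because `String#byteindex` is only available in newer CRuby releases (3.2 or later, to my knowledge); behavior under mruby is not shown here.

```ruby
s = "①②③④⑤\xe2"

# Binary (ASCII-8BIT) copies: one byte is one character, so the search
# works on raw bytes and the result is a byte offset.
byte_pos = s.b.index("\xa4".b)
p byte_pos  # => 14 (the last byte of "⑤", which is e2 91 a4)

# Where String#byteindex exists, it reports the same byte offset.
p s.b.byteindex("\xa4".b) if s.respond_to?(:byteindex)
```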

😺 Ok, no problem.

Thank you. But `p "①②③④⑤\xe2".index("\xa4")` should still be nil, so I am reopening this issue and will fix it soon.