Bug: incorrect `len()` for UTF-8 Chars
Starshipping opened this issue
Resonance Router commented
Python:
Python 3.11.3 (main, May 3 2023, 23:19:07) [Clang 14.0.3 (clang-1403.0.22.14.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> a = "資源互換檔案格式"
>>> len(a)
8
Starlark Go:
Welcome to Starlark (go.starlark.net)
>>> a = "資源互換檔案格式"
>>> len(a)
24
Alan Donovan commented
Python 3 strings are sequences of Unicode code points, of which "資源互換檔案格式" contains 8. Starlark strings, by contrast, are sequences of UTF-k code units, where k=8 in the Go implementation and k=16 in the Java implementation. In the Go implementation that string contains 24 code units, since each of its Hanzi has a 3-byte UTF-8 encoding. So this is working as intended.
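For anyone who wants to check the arithmetic, here is a minimal Python sketch that counts the same string three ways. The helper names (`code_point_count`, `utf8_code_unit_count`, `utf16_code_unit_count`) are made up for illustration and are not part of Python or Starlark; the UTF-16 figure only approximates what a k=16 implementation would count and was not taken from the Java implementation itself.

```python
# Hypothetical helpers for illustration; not part of any Starlark API.

def code_point_count(s: str) -> int:
    # Python 3 len() counts Unicode code points.
    return len(s)

def utf8_code_unit_count(s: str) -> int:
    # One UTF-8 code unit is one byte (the k=8 case).
    return len(s.encode("utf-8"))

def utf16_code_unit_count(s: str) -> int:
    # One UTF-16 code unit is two bytes (the k=16 case).
    # "utf-16-le" is used so that no byte-order mark is prepended.
    return len(s.encode("utf-16-le")) // 2

a = "資源互換檔案格式"
print(code_point_count(a))       # 8
print(utf8_code_unit_count(a))   # 24
print(utf16_code_unit_count(a))  # 8
```

All eight characters are in the Basic Multilingual Plane, so each takes three bytes in UTF-8 but a single 16-bit unit in UTF-16. A character outside that plane, such as an emoji, would count as 1 code point, 4 UTF-8 code units, and 2 UTF-16 code units.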