Bug: incorrect `len()` for UTF-8 Chars

Question

Starshipping opened this issue a year ago

Python:

Python 3.11.3 (main, May  3 2023, 23:19:07) [Clang 14.0.3 (clang-1403.0.22.14.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> a = "資源互換檔案格式"
>>> len(a)
8

Starlark Go:

Welcome to Starlark (go.starlark.net)
>>> a = "資源互換檔案格式"
>>> len(a)
24

Alan Donovan · Answer 1 · Fri Jun 30 2023 01:35:07 GMT+0800 (China Standard Time)

Python 3 strings are sequences of Unicode code points, of which "資源互換檔案格式" contains 8. Starlark strings, by contrast, are sequences of UTF-k code units, where k=8 in the Go implementation and k=16 in the Java implementation. In the Go implementation that string contains 24 code units, since each of its 8 Hanzi has a 3-byte UTF-8 encoding (8 × 3 = 24). So this is working as intended.
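
To make the arithmetic concrete, here is a minimal Python sketch that counts the same string three ways. The helper names `utf8_len` and `utf16_len` are hypothetical, introduced only for illustration; they are not part of Python or Starlark. The UTF-16 count is an inference from the encoding, not something measured against the Java implementation.

```python
def utf8_len(s: str) -> int:
    """Number of UTF-8 code units (bytes) needed to encode s."""
    return len(s.encode("utf-8"))


def utf16_len(s: str) -> int:
    """Number of UTF-16 code units needed to encode s.

    "utf-16-le" is used so that no byte-order mark is prepended;
    each UTF-16 code unit is 2 bytes.
    """
    return len(s.encode("utf-16-le")) // 2


a = "資源互換檔案格式"

print(len(a))        # code points, as counted by Python 3's len()
print(utf8_len(a))   # UTF-8 code units, the k=8 case described above
print(utf16_len(a))  # UTF-16 code units, the k=16 case described above
```

For this string the three counts are 8, 24, and 8: every character lies in the Basic Multilingual Plane, so each takes three bytes in UTF-8 but a single 16-bit unit in UTF-16.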