jqlang / jq

Command-line JSON processor

Home Page:https://jqlang.github.io/jq/

substring of index/rindex doesn't work for utf8 inputs

ilovenwd opened this issue:

$ echo '正xyz' | jq -Rsr '.[:rindex("x")]'
正xy
whereas it should output:
正

It seems the substring slice counts 正xy as 3 characters (code points),
while rindex counts it as 5 bytes.
This mismatch makes the JSON-P parsing example in the FAQ fail for UTF-8 input:

Q: How can I convert JSON-P (JSONP) to JSON using jq?
A: Assuming that the padding takes the form of a function call:
$ jq -s -R  '.[1+index("("): rindex(")")] | fromjson'
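
The mismatch reported here can be modeled outside jq. The following is a minimal Python sketch, not jq's implementation: `byte_rindex` and `codepoint_slice` are hypothetical helpers that imitate the reported behavior, assuming rindex returns an offset in UTF-8 bytes while the slice counts code points. The JSON-P string is an invented sample.

```python
import json


def byte_rindex(text: str, needle: str) -> int:
    """Last offset of needle counted in UTF-8 bytes (models the reported rindex)."""
    return text.encode("utf-8").rfind(needle.encode("utf-8"))


def byte_index(text: str, needle: str) -> int:
    """First offset of needle counted in UTF-8 bytes (models the reported index)."""
    return text.encode("utf-8").find(needle.encode("utf-8"))


def codepoint_slice(text: str, start: int, end: int) -> str:
    """Substring counted in code points (models the reported slice)."""
    return text[start:end]


text = "正xyz"
offset = byte_rindex(text, "x")          # 3, because 正 occupies three UTF-8 bytes
print(codepoint_slice(text, 0, offset))  # 正xy: a byte offset used as a code point count

# The same mismatch applied to the FAQ's JSON-P recipe (sample input is made up).
jsonp = 'cb({"k":"正"})'
body = codepoint_slice(jsonp, 1 + byte_index(jsonp, "("), byte_rindex(jsonp, ")"))
print(body)                              # {"k":"正"}) with the closing parenthesis left in
try:
    json.loads(body)
except json.JSONDecodeError as err:
    print("not valid JSON:", err.msg)
```

Any non-ASCII character before the closing parenthesis pushes the byte offset past the code point offset, so the slice ends too late and the result no longer parses.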

You're right and I've updated the FAQ so that it uses match. In the example you give, we would have:

echo  '正xyz' | jq1.5 -Rsr '.[: (match("x").offset)]'
正
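
The workaround succeeds because the jq manual documents the offset reported by match as a count of code points, the same unit the slice uses. As a rough analogy only (Python's `re` module, not jq), a regex match start that is counted in code points can be passed straight to a code point slice:

```python
import re

text = "正xyz"
found = re.search("x", text)

# Python reports match positions in code points, so the offset and the
# slice agree on what "position 1" means.
print(found.start())         # 1
print(text[:found.start()])  # 正
```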

Thank you!

match works, but it is confusing that index and slice use different string models (bytes vs. characters).
This is very different from common languages such as C and Python.
I suggest adding another pair of functions, such as indexu/rindexu, that behave exactly as the substring slice does.

I believe the default behavior is worth changing. How important is byte indexing in jq? Defining index for strings as explode | .[$x|explode] would work with UTF-8 strings. If someone needs to convert between strings and bytes, how about adding byte versions of explode and implode?
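
The proposal can be sketched in Python to show the intended semantics. This is an illustration under assumptions, not jq code: `explode` and `implode` mirror the jq builtins of the same name, while `codepoint_index`, `explode_bytes`, and `implode_bytes` are hypothetical names for the suggested code point index and the byte-level pair.

```python
from typing import List, Optional


def explode(text: str) -> List[int]:
    """String to list of code points, like jq's explode."""
    return [ord(ch) for ch in text]


def implode(codepoints: List[int]) -> str:
    """List of code points back to a string, like jq's implode."""
    return "".join(chr(cp) for cp in codepoints)


def codepoint_index(text: str, needle: str) -> Optional[int]:
    """First position of needle in text, counted in code points (hypothetical)."""
    haystack, pattern = explode(text), explode(needle)
    if not pattern:
        return None
    for start in range(len(haystack) - len(pattern) + 1):
        if haystack[start:start + len(pattern)] == pattern:
            return start
    return None


def explode_bytes(text: str) -> List[int]:
    """String to list of UTF-8 byte values (hypothetical byte version of explode)."""
    return list(text.encode("utf-8"))


def implode_bytes(values: List[int]) -> str:
    """List of UTF-8 byte values back to a string (hypothetical byte version of implode)."""
    return bytes(values).decode("utf-8")


text = "正xyz"
position = codepoint_index(text, "x")
print(position)                  # 1
print(implode(explode(text)[:position]))  # 正
print(explode_bytes("正"))       # [230, 173, 163]
```

With the index computed over exploded code points, the offset and the slice share one unit, and callers who do need byte offsets can opt in through the byte-level pair.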

Same issue: #1430.

@itchyny - Changing the semantics of index would only be possible in a "Major Release" of jq, and might never happen.

Rather than tilting at that particular windmill, I would suggest adding a new C-coded built-in function with the desired semantics, not least because the existing implementation of index is ill-suited for finding the first index of anything.

Although there is something to be said for a function with a narrow domain (e.g. codepointOf for JSON strings), it would be more in keeping with jq's existing builtins to be polymorphic, which would suggest a name such as indexOf, though a more distinctive name would no doubt be preferable.

Okay, thanks for the detailed explanation.