substring of index/rindex doesn't work for utf8 inputs
ilovenwd opened this issue · comments
echo '正xyz' | jq -Rsr '.[:rindex("x")]'
正xy
while it should output:
正
It seems substring thinks length(正xy)==3
but rindex thinks length(正xy)==5
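The mismatch can be made visible by measuring the same string both ways. This is a minimal sketch, assuming jq 1.6 or later (for utf8bytelength); the helper name count_both is hypothetical and only used for illustration:

```sh
# Hypothetical helper (not part of jq): print the codepoint length and the
# UTF-8 byte length of the first argument, one per line.
count_both() {
  jq -n --arg s "$1" '$s | length, utf8bytelength'
}
```

Calling count_both '正xy' prints 3 and then 5. Slicing counts codepoints (the first number), whereas the offset returned by rindex in the jq versions discussed here counts bytes (the second).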
This issue makes the JSONP parsing example in the FAQ fail for UTF-8 inputs:
Q: How can I convert JSON-P (JSONP) to JSON using jq?
A: Assuming that the padding takes the form of a function call:
$ jq -s -R '.[1+index("("): rindex(")")] | fromjson'
You're right, and I've updated the FAQ so that it uses match. In the example you give, we would have:
echo '正xyz' | jq1.5 -Rsr '.[: (match("x").offset)]'
正
Thank you!
match works, but it's really confusing that index and slice use different string models (bytes vs. codepoints).
This is very different from common languages like C, Python, etc.
I suggest adding another pair of functions, such as indexu/rindexu, that behave exactly as substring slicing does.
I believe this behavior is worth changing by default. How important is the byte index in jq? Defining index for strings as explode | .[$x|explode] would work with UTF-8 strings. If someone needs to convert between strings and bytes, how about adding byte versions of explode and implode?
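The explode-based definition suggested above can be sketched as follows. The names before_last and rindexu are hypothetical (rindexu borrows the name proposed earlier in this thread and is not a jq builtin), and the sketch has only been reasoned through for a single-character search string:

```sh
# Hypothetical helper: print the part of $1 that precedes the last
# occurrence of $2, using codepoint positions throughout.
before_last() {
  jq -rn --arg s "$1" --arg x "$2" '
    # Search in the codepoint arrays so the result lines up with slicing.
    # Yields null when there is no occurrence.
    def rindexu($x): (explode | .[$x | explode]) | .[-1];
    $s | .[:rindexu($x)]
  '
}
```

With this definition, before_last '正xyz' 'x' prints 正, which is the output the original report expected.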
@itchyny - Changing the semantics of index would only be possible in a "Major Release" of jq, and might never happen.
Rather than tilting at that particular windmill, I would suggest adding a new C-coded built-in function with the desired semantics, not least because the existing implementation of index is ill-suited for finding the first index of anything.
Although there is something to be said for a function with a narrow domain (e.g. codepointOf for JSON strings), it would be more in keeping with jq's existing builtins to be polymorphic, which would suggest a name such as indexOf, though a more distinctive name would no doubt be preferable.
Okay, thanks for the detailed explanation.