Add aliases for `Bytes.indexOf` and `Text.indexOf`
pchiusano opened this issue
There are now much more efficient implementations of these as builtins (see this PR):
Bytes.indexOf : Bytes -> Bytes -> Optional Nat
Text.indexOf : Text -> Text -> Optional Nat
I notice `Text.indexOf` exists but has a slow implementation. So maybe it's as simple as:
replace Text.indexOf ##Text.indexOf
alias.term ##Bytes.indexOf Bytes.indexOf
Plus docs / tests. @runarorama if you want to assign this to @stew go ahead.
`Text.indexOf` looks like it has a bug for multi-byte characters:
1 | > ##Text.indexOf "foo👊🏿" "bar👊🏿foo👊🏿baz👊🏿"
⧩
Some 7
3 | > Text.size "foo👊🏿"
⧩
5
The 👊🏿 character is two codepoints, so the answer should be `Some 5` (three codepoints for `bar` plus two for 👊🏿).
Is this a bug in the `text` package? @stew is just calling through to this function: https://hackage.haskell.org/package/text-2.0.2/docs/Data-Text-Internal-Lazy-Search.html
Or maybe we're just using it wrong?
If it's working as intended, I don't understand what it's doing. It's interpreting "👊🏿" to have length 4, but it's two codepoints and 8 bytes. So 4 of what?
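For reference, the candidate units can be compared directly. The sketch below is only an illustration and is not from the thread: `textSizes` and `fist` are hypothetical names, and it uses nothing beyond `Data.Text`, `Data.Text.Encoding`, and `Data.ByteString`.

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- | Size of a text in three units:
-- (codepoints, UTF-16 code units, UTF-8 bytes).
textSizes :: T.Text -> (Int, Int, Int)
textSizes t =
  ( T.length t                              -- codepoints (Char count)
  , BS.length (TE.encodeUtf16LE t) `div` 2  -- UTF-16 code units (2 bytes each)
  , BS.length (TE.encodeUtf8 t)             -- UTF-8 bytes
  )

-- | The emoji from the report: U+1F44A followed by U+1F3FF,
-- written with decimal escapes.
fist :: T.Text
fist = T.pack "\128074\127999"
```

In ghci, `textSizes fist` evaluates to `(2, 4, 8)`. A length of 4 would therefore be consistent with counting UTF-16 code units, and 3 + 4 = 7 would match the `Some 7` above. Whether that is what the search function actually counts is an assumption here, not something this sketch proves.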
What if you just call it in ghci? Does it still behave incorrectly?
Yeah
λ> indices (Text.pack "foo") (Text.pack "👊🏿foo")
[4]
Is this just due to the fact that Haskell strings are UTF-16?
https://hackage.haskell.org/package/text-2.0.2/docs/src/Data.Text.Lazy.html#breakOn - it looks like the indices are byte offsets.
@stew maybe implement it in terms of `Text.breakOn`: the index is the size of the first element of the pair.
`Text.breakOn`, `Text.breakOnEnd`, and `Text.breakOnAll` would be great builtins to have. We could implement a fast `Text.indexOf` etc. in terms of those.
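A minimal sketch of that suggestion, written against the strict `Data.Text` API rather than the actual Unison runtime code. The name `indexOfViaBreakOn` is hypothetical, and the argument order (needle first) follows the examples in this issue.

```haskell
import qualified Data.Text as T

-- | Codepoint index of the first match, or Nothing if there is none.
-- Assumes a non-empty needle: T.breakOn raises an error on an empty one.
indexOfViaBreakOn :: T.Text -> T.Text -> Maybe Int
indexOfViaBreakOn needle haystack
  | T.null match = Nothing                -- needle does not occur
  | otherwise    = Just (T.length prefix) -- T.length counts codepoints
  where
    -- prefix is everything before the first match; match starts at the needle
    (prefix, match) = T.breakOn needle haystack
```

Because `T.length` counts `Char`s, the result is a codepoint index regardless of how the text is encoded internally. For the example above (`"foo👊🏿"` in `"bar👊🏿foo👊🏿baz👊🏿"`) this yields `Just 5`.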
I like `breakOnAll` and `indexOfEnd` as new builtins. That could be for later though.
`indexOf` seems better than `breakOn`: you can implement `breakOn` using `indexOf`, and `indexOf` can have a direct implementation if we want to make it more efficient. In the cases where you're just `Text.drop`-ing up to that index, you avoid needlessly allocating a prefix you're just discarding.
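The relationship described above can be sketched as follows. This is an illustration in Haskell, not the Unison implementation: `indexOf` here is a deliberately naive, hypothetical stand-in for a codepoint-indexed builtin, and `breakOnVia` and `dropToMatch` are made-up names.

```haskell
import Data.List (findIndex)
import qualified Data.Text as T

-- | Naive stand-in for a codepoint-indexed indexOf (needle first).
-- It tries every suffix of the haystack in turn.
indexOf :: T.Text -> T.Text -> Maybe Int
indexOf needle haystack =
  findIndex (needle `T.isPrefixOf`) (T.tails haystack)

-- | breakOn expressed through indexOf: split at the match,
-- or return the whole haystack and an empty remainder.
breakOnVia :: T.Text -> T.Text -> (T.Text, T.Text)
breakOnVia needle haystack =
  case indexOf needle haystack of
    Just i  -> T.splitAt i haystack
    Nothing -> (haystack, T.empty)

-- | When only the remainder is wanted, drop straight to the match
-- without ever building the prefix.
dropToMatch :: T.Text -> T.Text -> T.Text
dropToMatch needle haystack =
  maybe T.empty (\i -> T.drop i haystack) (indexOf needle haystack)
```

`dropToMatch` returns the same text as the second component of `breakOnVia`, but the caller never receives or holds the discarded prefix.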
Yeah, if we can get `indexOf` to work correctly for text, that's ideal.
@runarorama fyi, don't know if you saw, but the bug has been fixed, so you can add / replace the existing functions.
Now there's a different bug:
> ##Text.indexOf "" "foo"
⬇️
Encountered exception:
Data.Text.Lazy.breakOn: empty input
CallStack ( from HasCallStack ):
error
I can do a pure Unison check for the empty search string, but it really feels like the builtin should be doing this. The correct index of the empty string is `0`.
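The shape of such a check, sketched in Haskell rather than Unison. `withEmptyNeedle` is a hypothetical name, and `search` stands for whatever underlying primitive rejects the empty needle.

```haskell
import qualified Data.Text as T

-- | Wrap a search that rejects the empty needle so that the empty
-- needle is reported at index 0. `search` is a hypothetical stand-in
-- for the underlying primitive (needle first, then haystack).
withEmptyNeedle
  :: (T.Text -> T.Text -> Maybe Int)
  -> T.Text -> T.Text -> Maybe Int
withEmptyNeedle search needle haystack
  | T.null needle = Just 0                 -- `search` is never called here
  | otherwise     = search needle haystack -- defer to the primitive
```

Putting the guard inside the builtin, as suggested, would spare every caller from having to wrap it like this.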
Fixed in unisonweb/unison#4101
Replaced the Unison definitions with the builtins and pushed to main.