unisonweb / base

Unison base libraries

Home Page: https://share.unison-lang.org/@unison/base

Add aliases for `Bytes.indexOf` and `Text.indexOf`

pchiusano opened this issue

There are now much more efficient implementations of these as builtins (see this PR):

Bytes.indexOf : Bytes -> Bytes -> Optional Nat
Text.indexOf : Text -> Text -> Optional Nat

I notice Text.indexOf already exists but has a slow implementation, so this may be as simple as:

replace Text.indexOf ##Text.indexOf
alias.term ##Bytes.indexOf Bytes.indexOf

Plus docs / tests. @runarorama if you want to assign this to @stew go ahead.

commented

Text.indexOf looks like it has a bug for multi-byte characters:

    > ##Text.indexOf "foo👊🏿" "bar👊🏿foo👊🏿baz👊🏿"
      ⧩
      Some 7

    > Text.size "foo👊🏿"
      ⧩
      5

The 👊🏿 character is two codepoints, so the answer should be Some 5.
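
For reference, the expected value can be checked in any language whose strings index by code point. A minimal sketch in Python (used here only for illustration, not the Unison builtin), with the emoji written as its two code points U+1F44A and U+1F3FF:

```python
# Python str indexes by Unicode code point, so str.find reports the
# code-point position that the comment above expects from Text.indexOf.
FIST = "\U0001F44A\U0001F3FF"  # fist emoji plus skin-tone modifier: two code points

needle = "foo" + FIST
haystack = "bar" + FIST + "foo" + FIST + "baz" + FIST

code_point_index = haystack.find(needle)  # "bar" (3) + emoji (2) precede the match
needle_size = len(needle)                 # analogous to Text.size on the needle

print(code_point_index, needle_size)
```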

Is this a bug in the text package? @stew is just calling through to it, specifically this function: https://hackage.haskell.org/package/text-2.0.2/docs/Data-Text-Internal-Lazy-Search.html

Or maybe we're just using it wrong?

commented

If it's working as intended, I don't understand what it's doing. It's interpreting "👊🏿" to have length 4, but it's two codepoints and 8 bytes. But 4 what?

What if you just call it in ghci? Does it still behave incorrectly?
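
One reading that fits the numbers is that 4 counts UTF-16 code units. The sketch below (Python, for illustration only; the helper names `utf16_units` and `utf8_bytes` are made up here) measures the emoji three ways and then measures the prefix before the match from the earlier example in UTF-16 units. It shows what would produce these numbers, not what the text library does internally.

```python
FIST = "\U0001F44A\U0001F3FF"  # fist emoji plus skin-tone modifier

def utf16_units(s: str) -> int:
    """Number of 16-bit code units in the UTF-16 encoding of s (no BOM)."""
    return len(s.encode("utf-16-le")) // 2

def utf8_bytes(s: str) -> int:
    """Number of bytes in the UTF-8 encoding of s."""
    return len(s.encode("utf-8"))

# The same character, measured in three different units.
emoji_code_points = len(FIST)
emoji_utf8_bytes = utf8_bytes(FIST)
emoji_utf16_units = utf16_units(FIST)

# Offset of the match in the earlier example, measured in UTF-16 code units.
haystack = "bar" + FIST + "foo" + FIST + "baz" + FIST
needle = "foo" + FIST
prefix = haystack[: haystack.find(needle)]
offset_in_utf16_units = utf16_units(prefix)

print(emoji_code_points, emoji_utf8_bytes, emoji_utf16_units, offset_in_utf16_units)
```

Under that reading, the `Some 7` above is 3 units for "bar" plus 4 units for the emoji.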

commented

Yeah

λ> indices (Text.pack "foo") (Text.pack "👊🏿foo")
[4]

Is this just because Haskell's Text is UTF-16 internally?

https://hackage.haskell.org/package/text-2.0.2/docs/src/Data.Text.Lazy.html#breakOn - it looks like the indices are byte offsets.

@stew maybe implement it in terms of Text.breakOn: the index is the size of the first element of the pair.
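
A rough sketch of that idea, in Python rather than Unison or Haskell. It assumes `str.partition` as a stand-in for `Text.breakOn` (the two differ in detail: `partition` returns a triple and signals "not found" with an empty separator), and `index_of` is a name chosen here for illustration:

```python
from typing import Optional

def index_of(needle: str, haystack: str) -> Optional[int]:
    """Code-point index of the first match, via a breakOn-style split.

    Assumes a non-empty needle; str.partition raises ValueError otherwise.
    """
    before, sep, _after = haystack.partition(needle)
    if sep == "":
        # partition found no match, so there is no index to report.
        return None
    # The index is the size of the prefix, counted in code points.
    return len(before)

print(index_of("foo", "\U0001F44A\U0001F3FFfoo"))
```

Because the prefix is measured with a code-point length, the two-code-point emoji contributes 2 to the index rather than 4.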

commented

Text.breakOn, Text.breakOnEnd, and Text.breakOnAll would be great builtins to have. We could implement a fast Text.indexOf etc. in terms of those.

I like breakOnAll and indexOfEnd as new builtins. That could be for later though.

indexOf seems better than breakOn - you can implement breakOn using indexOf, and indexOf can have a direct implementation if we want to make it more efficient. In the cases where you're just Text.drop-ing up to that index, you avoid needlessly allocating a prefix you're just discarding.
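
To make the relationship concrete, here is a small Python sketch (illustrative only; `index_of`, `break_on`, and `drop_until` are hypothetical names, and Python's `str.find` stands in for a direct `indexOf` implementation). `break_on` is derived from `index_of`, while `drop_until` uses the index alone and never builds the prefix:

```python
from typing import Optional, Tuple

def index_of(needle: str, haystack: str) -> Optional[int]:
    """Index of the first match, or None if there is none."""
    i = haystack.find(needle)
    return None if i < 0 else i

def break_on(needle: str, haystack: str) -> Tuple[str, str]:
    """Split just before the first match; no match gives (haystack, '')."""
    i = index_of(needle, haystack)
    if i is None:
        return haystack, ""
    return haystack[:i], haystack[i:]

def drop_until(needle: str, haystack: str) -> str:
    """Suffix starting at the first match, without allocating the prefix."""
    i = index_of(needle, haystack)
    return "" if i is None else haystack[i:]

print(break_on("::", "a::b"), drop_until("::", "a::b"))
```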

commented

Yeah, if we can get indexOf to work correctly for text, that's ideal.

@runarorama fyi, don't know if you saw, but the bug has been fixed, so you can add / replace the existing functions.

commented

Now there's a different bug:

> ##Text.indexOf "" "foo"

⬇️

Encountered exception:
Data.Text.Lazy.breakOn: empty input
CallStack (from HasCallStack):
    error

I can do a pure Unison check for the empty search string, but it really feels like the builtin should be doing this. The correct index of the empty string is 0.
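
The guard described here would look roughly like the following, sketched in Python rather than Unison. `raw_index_of` is a hypothetical stand-in for a builtin that fails on an empty search string; it is built on `str.partition`, which raises `ValueError` for an empty separator.

```python
from typing import Optional

def raw_index_of(needle: str, haystack: str) -> Optional[int]:
    """Stand-in for a builtin that throws on an empty needle."""
    before, sep, _after = haystack.partition(needle)  # ValueError if needle == ""
    return len(before) if sep else None

def index_of(needle: str, haystack: str) -> Optional[int]:
    """Wrapper that handles the empty search string before delegating."""
    if needle == "":
        # The empty string matches at the very start of any text.
        return 0
    return raw_index_of(needle, haystack)

print(index_of("", "foo"), index_of("oo", "foo"))
```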

commented

Fixed in unisonweb/unison#4101

Replaced the Unison definitions with the builtins and pushed to main.