Add aliases for `Bytes.indexOf` and `Text.indexOf`

Question

pchiusano opened this issue a year ago

There are now much more efficient implementations of these as builtins (see this PR):

Bytes.indexOf : Bytes -> Bytes -> Optional Nat
Text.indexOf : Text -> Text -> Optional Nat

I notice Text.indexOf already exists but has a slow implementation. So maybe it's as simple as:

replace Text.indexOf ##Text.indexOf
alias.term ##Bytes.indexOf Bytes.indexOf

Plus docs / tests. @runarorama if you want to assign this to @stew go ahead.

Rúnar · Answer 1 · Fri Jun 09 2023 09:07:51 GMT+0800 (China Standard Time)

Text.indexOf looks like it has a bug for multi-byte characters:

    1 | > ##Text.indexOf "foo👊🏿" "bar👊🏿foo👊🏿baz👊🏿"
          ⧩
          Some 7

    3 | > Text.size "foo👊🏿"
          ⧩
          5

The 👊🏿 character is two codepoints, so the prefix "bar👊🏿" is 5 codepoints long and the answer should be Some 5.

Paul Chiusano · Answer 2 · Fri Jun 09 2023 09:19:41 GMT+0800 (China Standard Time)

Is this a bug in the Text package?? @stew is just calling through to that... this function: https://hackage.haskell.org/package/text-2.0.2/docs/Data-Text-Internal-Lazy-Search.html

Or maybe we're just using it wrong?

Rúnar · Answer 3 · Fri Jun 09 2023 09:24:45 GMT+0800 (China Standard Time)

If it's working as intended, I don't understand what it's doing. It's interpreting "👊🏿" as having length 4, but it's two codepoints and 8 bytes. So 4 of what?

Paul Chiusano · Answer 4 · Fri Jun 09 2023 09:25:49 GMT+0800 (China Standard Time)

What if you just call it in ghci? Does it still behave incorrectly?

Rúnar · Answer 5 · Fri Jun 09 2023 09:30:16 GMT+0800 (China Standard Time)

Yeah

λ> indices (Text.pack "foo") (Text.pack "👊🏿foo")
[4]

Is this just due to the fact that Haskell strings are UTF-16?
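One way to make sense of the 4 is to count the emoji in each candidate unit. The sketch below is plain Haskell over String (a list of code points) and is only an illustration: the names utf16Units, utf8Bytes, sizes, and fist are made up here, and nothing in it is taken from the text package.

```haskell
import Data.Char (ord)

-- Number of UTF-16 code units needed to encode one code point.
utf16Units :: Char -> Int
utf16Units c = if ord c > 0xFFFF then 2 else 1

-- Number of UTF-8 bytes needed to encode one code point.
utf8Bytes :: Char -> Int
utf8Bytes c
  | n < 0x80    = 1
  | n < 0x800   = 2
  | n < 0x10000 = 3
  | otherwise   = 4
  where
    n = ord c

-- (code points, UTF-16 code units, UTF-8 bytes) for a whole string.
sizes :: String -> (Int, Int, Int)
sizes s = (length s, sum (map utf16Units s), sum (map utf8Bytes s))

-- The emoji from the thread: U+1F44A followed by the skin tone modifier U+1F3FF.
fist :: String
fist = "\x1F44A\x1F3FF"
```

Under this counting, sizes fist is (2, 4, 8) and sizes ("bar" ++ fist) is (5, 7, 11). The 4 above and the Some 7 reported earlier both match the UTF-16 code unit column, while the expected Some 5 is the code point column.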

Paul Chiusano · Answer 6 · Fri Jun 09 2023 09:33:44 GMT+0800 (China Standard Time)

https://hackage.haskell.org/package/text-2.0.2/docs/src/Data.Text.Lazy.html#breakOn - it looks like the indices are byte offsets.

@stew maybe implement in terms of Text.breakOn, it's the size of the first element of the pair.
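A minimal sketch of that suggestion, written against strict Data.Text rather than the lazy module linked above, and not the actual builtin. The function name indexOf and its needle-first argument order are assumptions chosen to mirror the Unison signature. T.length counts code points, so the result is a code point index.

```haskell
import qualified Data.Text as T

-- Code point index of the first occurrence of needle in haystack.
indexOf :: T.Text -> T.Text -> Maybe Int
indexOf needle haystack
  | T.null needle = Just 0        -- T.breakOn rejects an empty needle, so answer directly
  | T.null rest   = Nothing       -- no match: breakOn returns (haystack, empty)
  | otherwise     = Just (T.length prefix)
  where
    -- Evaluated lazily, so breakOn is never called with an empty needle.
    (prefix, rest) = T.breakOn needle haystack
```

With this definition, searching for "foo" in "bar👊🏿foo" gives Just 5, since the prefix "bar👊🏿" is five code points long.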

Rúnar · Answer 7 · Fri Jun 09 2023 10:00:16 GMT+0800 (China Standard Time)

Text.breakOn, Text.breakOnEnd, and Text.breakOnAll would be great builtins to have. We could implement a fast Text.indexOf etc. in terms of those.

Paul Chiusano · Answer 8 · Fri Jun 09 2023 10:13:23 GMT+0800 (China Standard Time)

I like breakOnAll and indexOfEnd as new builtins. That could be for later though.

indexOf seems better than breakOn - you can implement breakOn using indexOf, and indexOf can have a direct implementation if we want to make it more efficient. In the cases where you're just Text.drop-ing up to that index, you avoid needlessly allocating a prefix you're just discarding.
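To illustrate the direction described here, the following sketch derives a breakOn-style function from indexOf. It uses Haskell's strict Data.Text as a stand-in for Unison Text; indexOf is a deliberately naive placeholder for the builtin, and breakOnVia and dropUntil are hypothetical names.

```haskell
import qualified Data.Text as T
import Data.List (findIndex)

-- Naive placeholder for the builtin: code point index of the first match.
indexOf :: T.Text -> T.Text -> Maybe Int
indexOf needle haystack = findIndex (needle `T.isPrefixOf`) (T.tails haystack)

-- breakOn recovered from indexOf: split at the match, or put
-- everything in the prefix when there is no match.
breakOnVia :: T.Text -> T.Text -> (T.Text, T.Text)
breakOnVia needle haystack =
  case indexOf needle haystack of
    Just i  -> T.splitAt i haystack
    Nothing -> (haystack, T.empty)

-- When only the suffix is wanted, drop straight to the index
-- instead of building a (prefix, suffix) pair and discarding the prefix.
dropUntil :: T.Text -> T.Text -> T.Text
dropUntil needle haystack =
  maybe T.empty (`T.drop` haystack) (indexOf needle haystack)
```

The second function is the case mentioned above: a caller that only wants the text from the match onward never asks for the prefix at all.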

Rúnar · Answer 9 · Fri Jun 09 2023 22:22:48 GMT+0800 (China Standard Time)

Yeah, if we can get indexOf to work correctly for text, that's ideal.

Paul Chiusano · Answer 10 · Thu Jun 15 2023 06:57:09 GMT+0800 (China Standard Time)

@runarorama fyi, don't know if you saw, but the bug has been fixed, so you can add / replace the existing functions.

Rúnar · Answer 11 · Thu Jun 15 2023 23:26:44 GMT+0800 (China Standard Time)

Now there's a different bug:

> ##Text.indexOf "" "foo"

⬇️

Encountered exception:
Data.Text.Lazy.breakOn: empty input
CallStack (from HasCallStack):
    error

I can do a pure Unison check for the empty search string, but it really feels like the builtin should be doing this. The correct index of the empty string is 0.

Rúnar · Answer 12 · Fri Jun 16 2023 00:24:15 GMT+0800 (China Standard Time)

Fixed in unisonweb/unison#4101

Replaced the Unison definitions with the builtins and pushed to main.