haskell / text

Haskell library for space- and time-efficient operations over Unicode text.

Home Page: http://hackage.haskell.org/package/text

Add a function `slice start end = take (end - start) . drop start`?

kindaro opened this issue

Many languages offer a feature that lets you take a substring by indices real quick. It is handy. Here is an example from JavaScript:

> "Hello World".slice (6, 11)
'World'

One frequent use case is when you have a parse tree annotated with locations and you want to recover the literal source of a given syntactic element. With slice, this is done at once.

Can we add such a function to text? Maybe it can have a more efficient implementation than take … . drop …?
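For concreteness, here is a minimal sketch of the function being asked for, written with the existing `Data.Text` API. The name `slice` and its start/end convention are assumptions taken from the issue title; nothing like this is exported by text today.

```haskell
import qualified Data.Text as T
import Data.Text (Text)

-- | Hypothetical helper, not part of the text package.
-- Returns the characters at positions [start, end), counted in code points.
-- A negative start is clamped to 0; an end before start gives the empty text.
slice :: Int -> Int -> Text -> Text
slice start end = T.take (end - s) . T.drop s
  where
    s = max 0 start
```

With this definition, `slice 6 11 (T.pack "Hello World")` gives `"World"`, matching the JavaScript example above.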

When I encounter slice :: Int -> Int -> f a -> f a, I never remember whether the second Int is the number of elements to take or the index of the final element to take: is it slice start len or slice start end? take ... . drop ... does not pose an ambiguity and is only marginally longer.

However, if we extend slice to allow negative numbers mimicking Python, it would offer asymptotic improvements. Currently take (length xs - 2) . drop 1 is O(n) because length is O(n), but slice 1 (-1) could be implemented in O(1).

Currently take (length xs - 2) . drop 1 is O(n) because length is O(n)

We can use dropEnd for this.
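A small sketch of that suggestion, for the "drop the first and last character" case discussed above. The name `trimEnds` is made up for illustration; `T.drop` and `T.dropEnd` are existing `Data.Text` functions.

```haskell
import qualified Data.Text as T
import Data.Text (Text)

-- | Hypothetical helper: remove one character from each end
-- without computing the length of the whole text first.
trimEnds :: Text -> Text
trimEnds = T.dropEnd 1 . T.drop 1
```

For example, `trimEnds (T.pack "[abc]")` gives `"abc"`, and texts shorter than two characters come back empty.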

When I encounter slice :: Int -> Int -> f a -> f a, I never remember whether the second Int is the number of elements to take or the index of the final element to take: is it slice start len or slice start end? take ... . drop ... does not pose an ambiguity and is only marginally longer.

I agree, this is terrible. But convenience also matters. slice is convenient.

  • We should do what Python and JavaScript do. 9 out of 10 programmers have experience with these. Python and JavaScript do absolute indexing, so we should do absolute indexing.
  • A Text is conceptually an array. The simplest thing slice could do is take two indices within the bounds of the array and return a new appropriate array. It would not be possible with C strings but it is possible with Text — we should celebrate this possibility.
  • drop … . take … is list thinking, not array thinking. Or wait, was it take … . drop …?

It hurts me that there is no function out of the box that does what text is good for: taking substrings in constant ~~time~~ space.

what text is good for — taking substrings in constant time.

You can't take a substring of a UTF-8 string in constant time, because characters have encodings of variable lengths. In Python, strings are byte-indexed, i.e., they're really ByteString.

Haskell's vector has a slice that takes start and length instead of start and end.
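To make the difference in convention concrete, here is a sketch of what a start/length variant would look like for `Text`, mirroring the argument order of `Data.Vector.slice`. The name `sliceLen` is hypothetical and only used for illustration.

```haskell
import qualified Data.Text as T
import Data.Text (Text)

-- | Hypothetical helper following the vector convention:
-- the second argument is a length, not an end index.
sliceLen :: Int -> Int -> Text -> Text
sliceLen start len = T.take len . T.drop start
```

Here "World" is `sliceLen 6 5 (T.pack "Hello World")`, whereas the JavaScript-style call would pass 6 and 11. Both functions would share the type `Int -> Int -> Text -> Text`, which is exactly the ambiguity raised earlier in the thread.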

Yes, of course. I meant «constant space» but I guess I was sleepy.

  • For a Haskell String, if you want to get a tail, you can do it in constant space, but you cannot take a general sub-list in constant space.
  • For a null-terminated C string, again you cannot get a sub-string without copying because it must be null-terminated.

Text is different. It stores a pointer to an array together with an offset and a length. It is in a special position here that makes slicing space efficient: the space cost of a slice is a pointer and two numbers. It makes sense to expose this special feature.
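This can be observed directly, as a rough sketch, by pattern matching on the constructor exported from `Data.Text.Internal`. That module is internal, so the layout is not a stable API; the helper name `offsetAndLength` is made up, and the numbers are in code units of whatever encoding the installed text version uses internally.

```haskell
import qualified Data.Text as T
import Data.Text.Internal (Text (..))

-- | Hypothetical helper: the offset and length of a Text value
-- inside its backing array, in internal code units.
offsetAndLength :: Text -> (Int, Int)
offsetAndLength (Text _ off len) = (off, len)
```

For an ASCII text, `T.drop 6` moves the offset forward by 6 and shortens the length by 6, with no new array being allocated for the result.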

If we add slice, users of Python or JavaScript will be surprised to find it to be O(n) instead of the O(1) they expect. index already poses this problem.
In my opinion, Python and JavaScript provide a reason to not add this function, because it would be misleading.

Not saying that this is enough reason to reject slice outright, of course.


In Python, strings are byte-indexed, i.e., they're really ByteString.

It's a little more complicated. Python strings are arrays whose element width is whatever the largest code point in the string requires: see PEP 393.

In JavaScript, strings are sequences of UTF-16 code units, which means slice arguably does the wrong thing if you want to slice on Unicode code points.
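For comparison, a short sketch of how `Data.Text` behaves on a character outside the Basic Multilingual Plane. The value name `emoji` is just for illustration; `T.length` and `T.take` count `Char`s, that is, code points.

```haskell
import qualified Data.Text as T

-- "\128512" is U+1F600, which needs two UTF-16 code units.
emoji :: T.Text
emoji = T.pack "\128512!"

-- Number of code points, not code units.
emojiLength :: Int
emojiLength = T.length emoji

-- The first code point as a whole, never half of a surrogate pair.
firstChar :: T.Text
firstChar = T.take 1 emoji
```

Here `emojiLength` is 2 and `firstChar` is the complete emoji, whereas the same string in JavaScript has length 3 and `slice(0, 1)` returns a lone surrogate.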