Add a function `slice start end = take (end - start) . drop start`?
kindaro opened this issue · comments
Many languages offer a feature that lets you take a substring by indices real quick. It is handy. Here is an example from JavaScript:
> "Hello World".slice (6, 11)
'World'
One frequent use case is when you have a parse tree annotated with locations and you want to recover the literal source of a given syntactic element. With `slice`, this is done at once.
Can we add such a function to `text`? Maybe it can have a more efficient implementation than `take … . drop …`?
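For reference, the proposed function can already be written against the existing `Data.Text` API. This is a minimal sketch, not an existing library function: the name `slice` and the choice to clamp a negative `start` to 0 are assumptions made for illustration. Indices count Unicode code points and the end index is exclusive, as in the JavaScript example.

```haskell
import qualified Data.Text as T

-- | Hypothetical helper (not part of the text API): the characters at
-- indices [start, end), counted in Unicode code points.
-- A negative start is clamped to 0; an empty Text is returned when end <= start.
slice :: Int -> Int -> T.Text -> T.Text
slice start end = T.take (end - lo) . T.drop lo
  where
    lo = max 0 start
```

With this definition, `slice 6 11 (T.pack "Hello World")` evaluates to `"World"`.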
When I encounter `slice :: Int -> Int -> f a -> f a`, I never remember whether the second `Int` is the number of elements to take or the index of the final element to take: is it `slice start len` or `slice start end`? `take ... . drop ...` does not pose an ambiguity and is only marginally longer.
However, if we extend `slice` to allow negative numbers, mimicking Python, it would offer asymptotic improvements. Currently `take (length xs - 2) . drop 1` is O(n) because `length` is O(n), but `slice 1 (-1)` could be implemented in O(1).
> Currently `take (length xs - 2) . drop 1` is O(n) because `length` is O(n)
We can use `dropEnd` for this.
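As a rough sketch of how the two suggestions combine, a Python-style negative end index can be expressed with `dropEnd`, so `length` is never called. The name `sliceFromEnd` is hypothetical, and only the end index is allowed to be negative here; this is an illustration, not a proposed final API.

```haskell
import qualified Data.Text as T

-- | Hypothetical helper: behaves like Python's s[start:end] for start >= 0.
-- A negative end counts back from the end of the text.
sliceFromEnd :: Int -> Int -> T.Text -> T.Text
sliceFromEnd start end t
  | end < 0   = T.drop lo (T.dropEnd (negate end) t)  -- trim the tail, then the head
  | otherwise = T.take (end - lo) (T.drop lo t)       -- ordinary [start, end) slice
  where
    lo = max 0 start
```

Here `sliceFromEnd 1 (-1) (T.pack "Hello")` evaluates to `"ell"`, matching `"Hello"[1:-1]` in Python.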
> When I encounter `slice :: Int -> Int -> f a -> f a`, I never remember whether the second `Int` is the number of elements to take or the index of the final element to take: is it `slice start len` or `slice start end`? `take ... . drop ...` does not pose an ambiguity and is only marginally longer.
I agree, this is terrible. But convenience also matters. `slice` is convenient.
- We should do what Python and JavaScript do. 9 out of 10 programmers have experience with these. Python and JavaScript do absolute indexing, so we should do absolute indexing.
- A `Text` is conceptually an array. The simplest thing `slice` could do is take two indices within the bounds of the array and return a new appropriate array. It would not be possible with C strings but it is possible with `Text`, and we should celebrate this possibility. `drop … . take …` is list thinking, not array thinking. Or wait, was it `take … . drop …`?
It hurts me that there is no function out of the box that does what `text` is good for: taking substrings in constant ~~time~~ space.
> what `text` is good for — taking substrings in constant time.
You can't take a substring of a UTF-8 string in constant time, because characters have encodings of variable lengths. In Python, strings are byte-indexed, i.e., they're really `ByteString`.
Haskell's `vector` has a `slice` that takes start and length instead of start and end.
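To make the two conventions concrete, here is a small sketch that puts the convention into the function name. Both names are hypothetical, and both functions assume non-negative arguments.

```haskell
import qualified Data.Text as T

-- | Start-and-length convention, as in vector's slice (hypothetical name).
sliceLen :: Int -> Int -> T.Text -> T.Text
sliceLen start len = T.take len . T.drop start

-- | Start-and-exclusive-end convention, as in Python and JavaScript
-- (hypothetical name), defined in terms of sliceLen.
sliceTo :: Int -> Int -> T.Text -> T.Text
sliceTo start end = sliceLen start (end - start)
```

The same pair of numbers gives different results: `sliceLen 2 3 (T.pack "abcdefgh")` is `"cde"`, while `sliceTo 2 3 (T.pack "abcdefgh")` is `"c"`.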
Yes, of course. I meant «constant space» but I guess I was sleepy.
- For a Haskell `String`, if you want to get a tail, you can do it in constant space, but you cannot take a general sub-list in constant space.
- For a null-terminated C string, again you cannot get a sub-string without copying because it must be null-terminated.

`Text` is different. It stores a pointer to an array together with an offset and a length. It is in a special position here that makes slicing space efficient: the space cost of a slice is a pointer and two numbers. It makes sense to expose this special feature.
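The constant-space argument can be illustrated with a toy model. The `View` type and its functions below are hypothetical and simplified; they are not the library's actual definition, and they index a fixed-width `Char` array rather than an encoded byte buffer.

```haskell
import Data.Array (Array, listArray, (!))

-- | Toy stand-in for a sliceable string: a shared buffer plus an offset
-- and a length. Hypothetical type, for illustration only.
data View = View
  { viewBuf :: Array Int Char  -- shared buffer, never copied by slicing
  , viewOff :: Int             -- index of the first visible element
  , viewLen :: Int             -- number of visible elements
  }

fromString :: String -> View
fromString s = View (listArray (0, length s - 1) s) 0 (length s)

-- | Slice by [start, end), clamped to the view's bounds.
-- Only a new View record is allocated; the buffer is shared.
sliceView :: Int -> Int -> View -> View
sliceView start end (View buf off len) = View buf (off + lo) (hi - lo)
  where
    lo = max 0 (min start len)
    hi = max lo (min end len)

toString :: View -> String
toString (View buf off len) = [buf ! i | i <- [off .. off + len - 1]]
```

In this model, `toString (sliceView 6 11 (fromString "Hello World"))` is `"World"`, and slicing the result again only adjusts the offset and length. With a variable-width encoding, turning character indices into buffer offsets still requires a scan, which is the time-versus-space distinction raised above.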
If we add `slice`, users of Python or JavaScript will be surprised to find it to be O(n) instead of the O(1) they expect. `index` already poses this problem.
In my opinion, Python and JavaScript provide a reason not to add this function, because it would be misleading.
Not saying that this is enough reason to reject `slice` outright, of course.
> In Python, strings are byte-indexed, i.e., they're really `ByteString`.
It's a little more complicated. Python strings are arrays whose element width is chosen to fit the largest code point in the string: PEP-393
In JavaScript, Strings are sequences of UTF-16 code units, which means `slice` arguably does the wrong thing if you want to slice on Unicode code points.