A better `Data.Text.Lazy.inits`
meooow25 opened this issue · comments
#558 made me notice that while the implementation of inits is overall optimal, […]. I believe Data.List.inits has solved this problem using a queue: https://hackage.haskell.org/package/base-4.19.0.0/docs/src/Data.OldList.html#inits
The same could be adopted here.
- The implementation of lazy text inits is not asymptotically optimal. Unlike for lists, useful internal sharing is possible: by concatenating buffers we can use larger chunks. An implementation that doesn't care to be properly lazy can achieve $O(n)$ easily: inits = map TL.fromStrict . TS.inits . TL.toStrict. But with proper laziness I expect $O(n \log(numChunks))$ is optimal.
- Evaluating the i-th text in the result only takes $O(i)$ time, because map is lazy and doesn't evaluate the earlier prefixes. (EDIT: Sorry, right, I'm tired. Each layer of mapping has to traverse the spine of the layer below, even if it's not evaluating the elements of that layer below. So we generate the entire $O(i \times numChunks)$ inits structure before the first element of the last prefix becomes available. Gross. The set-up in LazyByteString.inits avoids this by not building intermediate lists, but still suffers from the map (take 1) problem below.)
- The idea of the SnocBuilder stuff in List.inits is to ensure stuff like map (take 1) (inits xs) needs only $O(n)$ time instead of $O(n^2)$.
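The queue idea can be sketched as follows. This is a deliberately simplified banker's queue of my own, not the actual Data.OldList code (which uses a SnocBuilder with a more careful rebalancing rule), but it shows why the early elements of every prefix stay cheap to reach:

```haskell
-- A banker's queue: in-order front list, reversed back list, with lengths.
-- Invariant: the back is never longer than the front, so the head of any
-- prefix is reachable without forcing a long reverse.
data Q a = Q !Int [a] !Int [a]

emptyQ :: Q a
emptyQ = Q 0 [] 0 []

-- Append one element; rotate the back onto the front when it would
-- outgrow the front (the real SnocBuilder rebalances differently).
snocQ :: Q a -> a -> Q a
snocQ (Q fl f bl b) x
  | bl + 1 <= fl = Q fl f (bl + 1) (x : b)
  | otherwise    = Q (fl + bl + 1) (f ++ reverse (x : b)) 0 []

toListQ :: Q a -> [a]
toListQ (Q _ f _ b) = f ++ reverse b

-- Emit each prefix straight from the queue: the spine of the result is
-- productive, so this also works on infinite lists, and map (take 1)
-- only forces the cheap front of each prefix.
initsQ :: [a] -> [[a]]
initsQ = go emptyQ
  where
    go q xs = toListQ q : case xs of
      []     -> []
      y : ys -> go (snocQ q y) ys
```

By contrast, a naive style such as inits xs = map reverse (scanl (flip (:)) [] xs) must force a full reverse to deliver even the first character of each prefix.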
An implementation that doesn't care to be properly lazy can achieve
$O(n)$ easily
Sure, I considered it a given that we want to preserve laziness. More specifically, we preserve the chunk boundaries in the result. Also, for now, I haven't distinguished between length and chunk length for simplicity. In the worst case they are equal. Of course, we want to consider this if it matters between competing implementations.
But with proper laziness I expect
$O(n \log (numChunks))$ is optimal.
How?
Evaluating the i-th text in the result only takes
$O(i)$ time, because map is lazy and doesn't evaluate the earlier prefixes.

map still takes time to reach the required position in the list.
Benchmarks of some use cases
import Test.Tasty.Bench
import Data.List
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Internal.Lazy as TL
main :: IO ()
main = defaultMain
[ env (pure t_) $ \t ->
bgroup "lazy text"
[ bench "inits" $ nf TL.inits t
, bench "last . inits" $ nf (last . TL.inits) t
, bench "take 1 . last . inits" $ nf (TL.take 1 . last . TL.inits) t
, bench "map (take 1) . inits" $ nf (map (TL.take 1) . TL.inits) t
]
, env (pure xs_) $ \xs ->
bgroup "list"
[ bench "inits" $ nf inits xs
, bench "last . inits" $ nf (last . inits) xs
, bench "take 1 . last . inits" $ nf (take 1 . last . inits) xs
, bench "map (take 1) . inits" $ nf (map (take 1) . inits) xs
]
]
where
n = 1000
t_ :: TL.Text
t_ = foldr TL.Chunk TL.empty $ replicate n $ T.singleton 'a'
xs_ :: [Int]
xs_ = [1..n]
lazy text
  inits:                 OK  27.7 ms ± 1.1 ms
  last . inits:          OK  3.07 ms ± 235 μs
  take 1 . last . inits: OK  3.00 ms ± 202 μs
  map (take 1) . inits:  OK  3.05 ms ± 176 μs
list
  inits:                 OK  14.4 ms ± 1.3 ms
  last . inits:          OK  14.7 μs ± 1.1 μs
  take 1 . last . inits: OK  8.05 μs ± 456 ns
  map (take 1) . inits:  OK  107 μs ± 10 μs
All of these except for inits can be […]
More specifically, we preserve the chunk boundaries in the result.
Well, if you insist that the chunk boundaries in the i-th output Text agree with the chunk boundaries within the first i characters of the input Text, then we cannot do better. But that's very specific!
But with proper laziness I expect
$O(n \log(numChunks))$ is optimal.

How?
As mentioned, the basic idea is to reduce the total number of chunks in the output by concatenating input chunks. I haven't worked through all of the details, but here's a (rather long, sorry) sketch:
Upper bound:
- Since a concatenation operation on a bunch of chunk-buffers has cost equal to the total length of those chunk-buffers, we can use any chunk-buffer of length k around k times in the output "for free" before the cost of not concatenating it becomes asymptotically relevant.
- Once we've done that for a chunk-buffer C, there are at least k characters of input between C and the current location in the input, which can be broken up into
  - the chunks fully between C and the input chunk holding the current location
  - a number of characters in the input chunk holding the current location
- Since these add up to k, one of them must be at least k/2.
  1. If the total length of the intervening "fully between" chunks is at least k/2, it's efficient to concatenate C with those intermediate chunks.
     - (It might not be very efficient if one of those chunks is large compared to C; I will need to prove this doesn't happen.)
  2. Otherwise, the "current location" chunk has length at least k/2.
     - We should still concatenate C with any intervening "fully between" chunks now, making a new chunk-buffer C', even though it is not as directly efficient for C.
       - (Otherwise we would run into problems with input chunk lengths like [32,16,8,4,2,1,1e10], since I claimed $O(n \log(numChunks))$ and not just $O(n \log(n))$.)
     - But then, when we reach the end of the long "current location" chunk, we will make up for this by performing a reasonably efficient concatenation of C' with that long "current position" chunk.
       - (Performing this merge will also ensure that the lengths of the "active" chunk-buffers always form a monotone sequence, preventing the potentially inefficient concatenation in scenario 1 above.)
       - If that "current position" chunk is very long, this merge might seem inefficient for that "current position" chunk. But that's OK because we only pay for one such inefficient "backwards" merge for each original input chunk.
When following this strategy:
- Each chunk-buffer is used in at most one concatenation event.
  - Consequently, the chunk-buffers that are used but never concatenated are pairwise disjoint.
  - Since we only use chunk-buffers corresponding to input ranges, this means that the total length of all used chunk-buffers is at most n plus the total cost of all concatenations.
- For each used chunk-buffer of length k, there are at most k output LazyTexts which contain that chunk-buffer followed by more than one other chunk. Thus, the total number of chunks in the output is at most 2(n+1) plus the total length of all used chunk-buffers, and hence the output chunks contribute a cost of at most 3n+2 plus the total cost of all concatenations.
- Each original input chunk might suffer one "inefficient backwards" merge when it is first reached.
- Every non-original chunk-buffer, created by concatenation, will be merged into one that is at least 4/3 as long within 2 layers of merging:
  - It may be merged with a longer active chunk-buffer corresponding to an earlier part of the input, doubling in one merge.
  - A chunk like C in scenario 1 grows by a factor of at least 3/2 in one merge.
  - A chunk like C in scenario 2 grows by a factor of at least 2 after two layers of merging.
  - A chunk like C' in scenario 2 grows by a factor of at least 4/3 after one merge.
- Thus, the number of "layers" of merging an original input chunk of length k participates in is at most 1 + 2 * ceil(log(n/k)/log(4/3)).
- The total cost of all concatenations is exactly equal to the sum over all original input chunks of that chunk's length times the number of layers of merging it participates in.
  - Hence, it is at most sum [k_i + 2 * k_i * ceil(log(n/k_i)/log(4/3)) | k_i <- input_chunk_lens].
  - Hence, it is at most sum [3*k_i + 2 * k_i * log(n/k_i)/log(4/3) | k_i <- input_chunk_lens].
  - Since the map \k_i -> 3*k_i + 2 * k_i * log(n/k_i)/log(4/3) is concave, and the sum of all the k_i is n, by Jensen's inequality the above sum is maximized when all the k_i are equal.
  - Thus, the total cost of all concatenations is at most 3*n + 2*n*log(numChunks)/log(4/3).
- By the above estimate on the total number of chunks in the output, this readily implies that the total cost of producing the output is at most 9*n + 2 + 4*n*log(numChunks)/log(4/3). (Okay, there is certainly more cost due to book-keeping in a real implementation. But it should hopefully be obvious that the number of book-keeping steps will not be much larger than the number of chunks in the output.)
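As a quick cross-check of the last two steps (my addition, not part of the original argument): substituting equal chunk lengths $k_i = n/m$, writing $m$ for numChunks, into the concatenation-cost bound gives

$$\sum_{i=1}^{m}\left(3\,\frac{n}{m} + \frac{2\,(n/m)\,\log(m)}{\log(4/3)}\right) = 3n + \frac{2n\log(m)}{\log(4/3)},$$

which is exactly the stated worst-case total of 3*n + 2*n*log(numChunks)/log(4/3).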
Lower bound:
Assign a weight of 1/i to the i-th character from the end of any LazyText in the output (one-indexed). The total of these weights is equal to the sum of the first n harmonic numbers and is roughly n*ln(n)+O(n).
It turns out that for any chunk-buffer used in the output, the total weight of all the characters in the output that are provided by that chunk-buffer is at most the cost of creating and using that chunk-buffer.
Suppose the chunk-buffer has length p>0 and is used q>0 times. Because we are properly lazy, we cannot start using this chunk-buffer before its last character is supposed to have been forced. So in particular the last character in the chunk-buffer's i-th appearance has a weight of at most 1/i. And the j-th-to-last character in the i-th appearance has a weight of at most 1/(i+j-1), one-indexing again. There is at most one valid (i,j) pair that adds up to 2, and it has weight at most 1. There are at most two valid (i,j) pairs that add up to 3, and these also have total weight at most 1. Generally there are at most k-1 valid (i,j) pairs that add up to k, and for any single sum k they provide a total weight of at most 1. Since the valid sums range between 2 and p+q inclusive, this means the total weight provided by a given chunk-buffer is at most p+q-1.
If p>1, then we paid a cost of p to create the chunk-buffer by concatenation for a total cost of p+q, which is greater than our upper bound for the weight. Otherwise, p=1 so that p+q-1=q which is exactly our total cost.
I think the same basic idea should work for smaller numbers of roughly-equally-large chunks, giving each input-chunk equivalent in an output LazyText a weight of 1/(numLaterInputChunkEquivalents+1), but the details of the final calculation will be much messier.
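The harmonic-sum count in the lower bound is easy to sanity-check numerically. The sketch below (my addition; the names are made up for illustration) totals the weights 1/i over all output prefixes of an n-character input and relates them to the closed form (n+1)*H_n - n:

```haskell
-- harmonic n computes the n-th harmonic number H_n = 1 + 1/2 + ... + 1/n.
harmonic :: Int -> Double
harmonic n = sum [1 / fromIntegral i | i <- [1 .. n]]

-- weightTotal n sums, over every output prefix of length p = 1..n, the
-- weights 1/i assigned to the i-th character from the end of that prefix.
-- Each prefix of length p contributes H_p, so the total is H_1 + ... + H_n,
-- which equals (n+1)*H_n - n and grows like n*ln(n) + O(n).
weightTotal :: Int -> Double
weightTotal n = sum [harmonic p | p <- [1 .. n]]
```

Since H_n = ln(n) + O(1), the identity H_1 + ... + H_n = (n+1)*H_n - n gives the claimed n*ln(n) + O(n) growth directly.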
map still takes time to reach the required position in the list.

Yeah, I remembered the problem of map's stupid intermediate lists a little bit after I posted. (The trickier map (take 1) is still the worst-case behavior when just squashing the map layers together like I did in the bytestring PR.)
A better implementation of inits would be welcome! The Data.List implementation seems like a good starting point.
I am unsure about an implementation that concatenates chunks to produce later prefixes faster. For one thing, it can have a negative effect on lazy texts consisting of few big chunks, where the work of re-chaining chunks together is negligible compared to concatenating them. That is arguably the common case.
Is using a queue or a difference list as in OldList and bytestring worth it? Instead, isn't it sufficient to keep only the length as an accumulator, so each prefix can simply be generated as take n? Am I missing something?
inits :: [a] -> [[a]]
inits xs = [take n xs | n <- [0 .. length xs]] -- the [0 .. length xs] list should be fused away.
That seems much simpler! It needs to be modified to support infinite lists however.
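One way to make the take-based version support infinite lists (a sketch of my own; not necessarily the same as the initsT in the GHC issue) is to walk the input's spine instead of calling length up front:

```haskell
-- Emit take n of the original list while walking its spine, so the
-- result is productive: each new cell of the input yields one more prefix.
initsTake :: [a] -> [[a]]
initsTake xs = [] : go 1 xs
  where
    go _ []       = []
    go n (_ : ys) = take n xs : go (n + 1) ys
```

Since take produces its result incrementally, map (take 1) over this still touches only the first cell of each prefix.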
I traced the OldList implementation to https://gitlab.haskell.org/ghc/ghc/-/issues/9345. This take version was considered (search for initsT) but the current version was found to be better.
The same might not apply to lazy text though, so it is worth evaluating.
Thanks for finding that thread!
@treeowl might you remember why the queue implementation of inits was chosen in the end instead of the simpler one relying on take? It is mentioned briefly (as initsT) in the GHC issue https://gitlab.haskell.org/ghc/ghc/-/issues/9345, but then "initsQ2 really is better".
I also dug up the libraries thread https://mail.haskell.org/pipermail/libraries/2014-July/023328.html
@Lysxia , it was chosen because it was (surprisingly to me) faster. As I recall, someone who knows more than I do speculated that this had to do with the way GHC represents closures.
I was also quite surprised for a while, but I found a simple explanation. I was wrongly assuming that Okasaki's banker's queue was somehow incurring unnecessary overhead. In fact, take does more work per element than the banker's queue's toList function: take has to do one comparison on Int# and one match on a cons cell for every element, whereas toList is a sequence of (++) and reverse (each one producing half of the output list), which only have to match on a cons cell at every step (no Int# comparison).
To confirm this, I managed to write an even faster inits by replacing the banker's queue with a mutable array, whose toList avoids the work of consuming a list.

Implementation and benchmarking: https://gist.github.com/Lysxia/3b733d87da8ea056c4e4a27fc047e170

Benchmark results (initsA is the new implementation with arrays, initsQ = Data.List.inits, and initsT uses take):
Plain text output:
benchmarking initsA
time 216.7 ms (210.6 ms .. 221.5 ms)
1.000 R² (0.999 R² .. 1.000 R²)
mean 214.4 ms (210.5 ms .. 216.7 ms)
std dev 4.450 ms (1.862 ms .. 7.362 ms)
variance introduced by outliers: 14% (moderately inflated)
benchmarking initsQ
time 249.9 ms (243.1 ms .. 259.4 ms)
1.000 R² (0.999 R² .. 1.000 R²)
mean 246.9 ms (245.5 ms .. 248.6 ms)
std dev 2.080 ms (1.056 ms .. 3.051 ms)
variance introduced by outliers: 16% (moderately inflated)
benchmarking initsT
time 259.8 ms (252.5 ms .. 266.8 ms)
1.000 R² (0.999 R² .. 1.000 R²)
mean 262.6 ms (260.7 ms .. 264.4 ms)
std dev 2.408 ms (1.423 ms .. 3.065 ms)
variance introduced by outliers: 16% (moderately inflated)
initsA isn't lazy like the others, so it's hard to make a meaningful comparison.

It is lazy like the others. It uses unsafe primitives, but they are entirely invisible to a user of inits, however they choose to force it.
My apologies. I missed one of the unsafePerformIO invocations.
Ah, right...