A better `Data.Text.Lazy.inits`
meooow25 opened this issue · comments
#558 made me notice that while the implementation of inits is overall optimal, […]. I believe Data.List.inits has solved this problem using a queue: https://hackage.haskell.org/package/base-4.19.0.0/docs/src/Data.OldList.html#inits
The same could be adopted here.
- The implementation of lazy text inits is not asymptotically optimal. Unlike for lists, useful internal sharing is possible: by concatenating buffers we can use larger chunks. An implementation that doesn't care to be properly lazy can achieve $O(n)$ easily: inits = map TL.fromStrict . TS.inits . TL.toStrict. But with proper laziness I expect $O(n \log(numChunks))$ is optimal.
- Evaluating the i-th text in the result only takes $O(i)$ time, because map is lazy and doesn't evaluate the earlier prefixes. (EDIT: Sorry, right, I'm tired. Each layer of mapping has to traverse the spine of the layer below, even if it's not evaluating the elements of that layer below. So we generate the entire $O(i \times numChunks)$ inits structure before the first element of the last prefix becomes available. Gross. The set-up in LazyByteString.inits avoids this by not building intermediate lists, but still suffers from the map (take 1) problem below.)
- The idea of the SnocBuilder stuff in List.inits is to ensure stuff like map (take 1) (inits xs) needs only $O(n)$ time instead of $O(n^2)$.
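The queue idea can be sketched as follows. This is a deliberately simplified banker's queue of my own, not the actual Data.OldList code (which uses a SnocBuilder with a more careful rebalancing rule), but it shows why the early elements of every prefix stay cheap to reach:

```haskell
-- A banker's queue: in-order front list, reversed back list, with lengths.
-- Invariant: the back is never longer than the front, so the head of any
-- prefix is reachable without forcing a long reverse.
data Q a = Q !Int [a] !Int [a]

emptyQ :: Q a
emptyQ = Q 0 [] 0 []

-- Append one element; rotate the back onto the front when it would
-- outgrow the front (the real SnocBuilder rebalances differently).
snocQ :: Q a -> a -> Q a
snocQ (Q fl f bl b) x
  | bl + 1 <= fl = Q fl f (bl + 1) (x : b)
  | otherwise    = Q (fl + bl + 1) (f ++ reverse (x : b)) 0 []

toListQ :: Q a -> [a]
toListQ (Q _ f _ b) = f ++ reverse b

-- Emit each prefix straight from the queue: the spine of the result is
-- productive, so this also works on infinite lists, and map (take 1)
-- only forces the cheap front of each prefix.
initsQ :: [a] -> [[a]]
initsQ = go emptyQ
  where
    go q xs = toListQ q : case xs of
      []     -> []
      y : ys -> go (snocQ q y) ys
```

By contrast, a naive style such as inits xs = map reverse (scanl (flip (:)) [] xs) must force a full reverse to deliver even the first character of each prefix.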
An implementation that doesn't care to be properly lazy can achieve
$O(n)$ easily
Sure, I considered it a given that we want to preserve laziness. More specifically, we preserve the chunk boundaries in the result. Also, for now, I haven't distinguished between length and chunk length for simplicity. In the worst case they are equal. Of course, we want to consider this if it matters between competing implementations.
But with proper laziness I expect
$O(n \log (numChunks))$ is optimal.
How?
Evaluating the i-th text in the result only takes
$O(i)$ time, because map is lazy and doesn't evaluate the earlier prefixes.

map still takes time to reach the required position in the list.
Benchmarks of some use cases
import Test.Tasty.Bench
import Data.List
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Internal.Lazy as TL
main :: IO ()
main = defaultMain
[ env (pure t_) $ \t ->
bgroup "lazy text"
[ bench "inits" $ nf TL.inits t
, bench "last . inits" $ nf (last . TL.inits) t
, bench "take 1 . last . inits" $ nf (TL.take 1 . last . TL.inits) t
, bench "map (take 1) . inits" $ nf (map (TL.take 1) . TL.inits) t
]
, env (pure xs_) $ \xs ->
bgroup "list"
[ bench "inits" $ nf inits xs
, bench "last . inits" $ nf (last . inits) xs
, bench "take 1 . last . inits" $ nf (take 1 . last . inits) xs
, bench "map (take 1) . inits" $ nf (map (take 1) . inits) xs
]
]
where
n = 1000
t_ :: TL.Text
t_ = foldr TL.Chunk TL.empty $ replicate n $ T.singleton 'a'
xs_ :: [Int]
xs_ = [1..n]
lazy text
  inits:                 OK  27.7 ms ± 1.1 ms
  last . inits:          OK  3.07 ms ± 235 μs
  take 1 . last . inits: OK  3.00 ms ± 202 μs
  map (take 1) . inits:  OK  3.05 ms ± 176 μs
list
  inits:                 OK  14.4 ms ± 1.3 ms
  last . inits:          OK  14.7 μs ± 1.1 μs
  take 1 . last . inits: OK  8.05 μs ± 456 ns
  map (take 1) . inits:  OK  107 μs ± 10 μs
All of these except for inits can be […]
More specifically, we preserve the chunk boundaries in the result.
Well, if you insist that the chunk boundaries in the i-th output Text agree with the chunk boundaries within the first i characters of the input Text, then we cannot do better. But that's very specific!
But with proper laziness I expect
$O(n \log(numChunks))$ is optimal.

How?
As mentioned, the basic idea is to reduce the total number of chunks in the output by concatenating input chunks. I haven't worked through all of the details, but here's a (rather long, sorry) sketch:
Upper bound:
- Since a concatenation operation on a bunch of chunk-buffers has cost equal to the total length of those chunk-buffers, we can use any chunk-buffer of length k around k times in the output "for free" before the cost of not concatenating it becomes asymptotically relevant.
- Once we've done that for a chunk-buffer C, there are at least k characters of input between C and the current location in the input, which can be broken up into
  - the chunks fully between C and the input chunk holding the current location
  - a number of characters in the input chunk holding the current location
- Since these add up to k, one of them must be at least k/2.
  1. If the total length of the intervening "fully between" chunks is at least k/2, it's efficient to concatenate C with those intermediate chunks.
     - (It might not be very efficient if one of those chunks is large compared to C; I will need to prove this doesn't happen.)
  2. Otherwise, the "current location" chunk has length at least k/2.
     - We should still concatenate C with any intervening "fully between" chunks now, making a new chunk-buffer C', even though it is not as directly efficient for C.
       - (Otherwise we would run into problems with input chunk lengths like [32,16,8,4,2,1,1e10], since I claimed $O(n \log(numChunks))$ and not just $O(n \log(n))$.)
     - But then, when we reach the end of the long "current location" chunk, we will make up for this by performing a reasonably efficient concatenation of C' with that long "current position" chunk.
       - (Performing this merge will also ensure that the lengths of the "active" chunk-buffers always form a monotone sequence, preventing the potentially inefficient concatenation in scenario 1 above.)
       - If that "current position" chunk is very long, this merge might seem inefficient for that "current position" chunk. But that's OK because we only pay for one such inefficient "backwards" merge for each original input chunk.
When following this strategy:
- Each chunk-buffer is used in at most one concatenation event.
  - Consequently, the chunk-buffers that are used but never concatenated are pairwise disjoint.
  - Since we only use chunk-buffers corresponding to input ranges, this means that the total length of all used chunk-buffers is at most n plus the total cost of all concatenations.
- For each used chunk-buffer of length k, there are at most k output LazyTexts which contain that chunk-buffer followed by more than one other chunk. Thus, the total number of chunks in the output is at most 2(n+1) plus the total length of all used chunk-buffers, and hence the output chunks contribute a cost of at most 3n+2 plus the total cost of all concatenations.
- Each original input chunk might suffer one "inefficient backwards" merge when it is first reached.
- Every non-original chunk-buffer, created by concatenation, will be merged into one that is at least 4/3 as long within 2 layers of merging:
  - It may be merged with a longer active chunk-buffer corresponding to an earlier part of the input, doubling in one merge.
  - A chunk like C in scenario 1 grows by a factor of at least 3/2 in one merge.
  - A chunk like C in scenario 2 grows by a factor of at least 2 after two layers of merging.
  - A chunk like C' in scenario 2 grows by a factor of at least 4/3 after one merge.
- Thus, the number of "layers" of merging an original input chunk of length k participates in is at most 1 + 2 * ceil(log(n/k)/log(4/3)).
- The total cost of all concatenations is exactly equal to the sum over all original input chunks of that chunk's length times the number of layers of merging it participates in.
  - Hence, it is at most sum [k_i + 2 * k_i * ceil(log(n/k_i)/log(4/3)) | k_i <- input_chunk_lens].
  - Hence, it is at most sum [3*k_i + 2 * k_i * log(n/k_i)/log(4/3) | k_i <- input_chunk_lens].
  - Since the map \k_i -> 3*k_i + 2 * k_i * log(n/k_i)/log(4/3) is concave, and the sum of all the k_i is n, by Jensen's inequality the above sum is maximized when all the k_i are equal.
  - Thus, the total cost of all concatenations is at most 3*n + 2*n*log(numChunks)/log(4/3).
- By the above estimate on the total number of chunks in the output, this readily implies that the total cost of producing the output is at most 9*n + 2 + 4*n*log(numChunks)/log(4/3). (Okay, there is certainly more cost due to book-keeping in a real implementation. But it should hopefully be obvious that the number of book-keeping steps will not be much larger than the number of chunks in the output.)
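As a quick cross-check of the last two steps (my addition, not part of the original argument): substituting equal chunk lengths $k_i = n/m$, writing $m$ for numChunks, into the concatenation-cost bound gives

$$\sum_{i=1}^{m}\left(3\,\frac{n}{m} + \frac{2\,(n/m)\,\log(m)}{\log(4/3)}\right) = 3n + \frac{2n\log(m)}{\log(4/3)},$$

which is exactly the stated worst-case total of 3*n + 2*n*log(numChunks)/log(4/3).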
Lower bound:
Assign a weight of 1/i to the i-th character from the end of any LazyText in the output (one-indexed). The total of these weights is equal to the sum of the first n harmonic numbers and is roughly n*ln(n)+O(n).
It turns out that for any chunk-buffer used in the output, the total weight of all the characters in the output that are provided by that chunk-buffer is at most the cost of creating and using that chunk-buffer.
Suppose the chunk-buffer has length p>0 and is used q>0 times. Because we are properly lazy, we cannot start using this chunk-buffer before its last character is supposed to have been forced. So in particular the last character in the chunk-buffer's i-th appearance has a weight of at most 1/i. And the j-th-to-last character in the i-th appearance has a weight of at most 1/(i+j-1), one-indexing again. There is at most one valid (i,j) pair that adds up to 2, and it has weight at most 1. There are at most two valid (i,j) pairs that add up to 3, and these also have total weight at most 1. Generally there are at most k-1 valid (i,j) pairs that add up to k, and for any single sum k they provide a total weight of at most 1. Since the valid sums range between 2 and p+q inclusive, this means the total weight provided by a given chunk-buffer is at most p+q-1.
If p>1, then we paid a cost of p to create the chunk-buffer by concatenation for a total cost of p+q, which is greater than our upper bound for the weight. Otherwise, p=1 so that p+q-1=q which is exactly our total cost.
I think the same basic idea should work for smaller numbers of roughly-equally-large chunks, giving each input-chunk equivalent in an output LazyText a weight of 1/(numLaterInputChunkEquivalents+1), but the details of the final calculation will be much messier.
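The harmonic-sum count in the lower bound is easy to sanity-check numerically. The sketch below (my addition; the names are made up for illustration) totals the weights 1/i over all output prefixes of an n-character input and relates them to the closed form (n+1)*H_n - n:

```haskell
-- harmonic n computes the n-th harmonic number H_n = 1 + 1/2 + ... + 1/n.
harmonic :: Int -> Double
harmonic n = sum [1 / fromIntegral i | i <- [1 .. n]]

-- weightTotal n sums, over every output prefix of length p = 1..n, the
-- weights 1/i assigned to the i-th character from the end of that prefix.
-- Each prefix of length p contributes H_p, so the total is H_1 + ... + H_n,
-- which equals (n+1)*H_n - n and grows like n*ln(n) + O(n).
weightTotal :: Int -> Double
weightTotal n = sum [harmonic p | p <- [1 .. n]]
```

Since H_n = ln(n) + O(1), the identity H_1 + ... + H_n = (n+1)*H_n - n gives the claimed n*ln(n) + O(n) growth directly.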
map still takes time to reach the required position in the list.

Yeah, I remembered the problem of map's stupid intermediate lists a little bit after I posted. (The trickier map (take 1) is still the worst-case behavior when just squashing the map layers together like I did in the bytestring PR.)
A better implementation of inits would be welcome! The Data.List implementation seems like a good starting point.
I am unsure about an implementation that concatenates chunks to produce later prefixes faster. For one thing, it can have a negative effect on lazy texts consisting of few big chunks, where the work of re-chaining chunks together is negligible compared to concatenating them. That is arguably the common case.
Is using a queue or a difference list as in OldList and bytestring worth it? Instead, isn't it sufficient to keep only the length as an accumulator, so each prefix can simply be generated as take n? Am I missing something?
inits :: [a] -> [[a]]
inits xs = [take n xs | n <- [0 .. length xs]] -- the [0 .. length xs] list should be fused away.
That seems much simpler! It needs to be modified to support infinite lists however.
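One way to make the take-based version support infinite lists (a sketch of my own; not necessarily the same as the initsT in the GHC issue) is to walk the input's spine instead of calling length up front:

```haskell
-- Emit take n of the original list while walking its spine, so the
-- result is productive: each new cell of the input yields one more prefix.
initsTake :: [a] -> [[a]]
initsTake xs = [] : go 1 xs
  where
    go _ []       = []
    go n (_ : ys) = take n xs : go (n + 1) ys
```

Since take produces its result incrementally, map (take 1) over this still touches only the first cell of each prefix.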
I traced the OldList implementation to https://gitlab.haskell.org/ghc/ghc/-/issues/9345. This take version was considered (search for initsT) but the current version was found to be better.
The same might not apply to lazy text though, so it is worth evaluating.
Thanks for finding that thread!
@treeowl might you remember why the queue implementation of inits was chosen in the end instead of the simpler one relying on take? It is mentioned briefly (as initsT) in the GHC issue https://gitlab.haskell.org/ghc/ghc/-/issues/9345, but then "initsQ2 really is better".
I also dug up the libraries thread https://mail.haskell.org/pipermail/libraries/2014-July/023328.html
@Lysxia , it was chosen because it was (surprisingly to me) faster. As I recall, someone who knows more than I do speculated that this had to do with the way GHC represents closures.
I was also quite surprised for a while, but I found a simple explanation. I was wrongly assuming that Okasaki's banker's queue was somehow incurring unnecessary overhead. In fact, take does more work per element than the banker's queue's toList function: take has to do one comparison on Int# and one match on a cons cell for every element, whereas toList is a sequence of (++) and reverse (each one producing half of the output list), which only have to match on a cons cell at every step (no Int# comparison).
To confirm this, I managed to write an even faster inits by replacing the banker's queue with a mutable array, whose toList avoids the work of consuming a list.

Implementation and benchmarking: https://gist.github.com/Lysxia/3b733d87da8ea056c4e4a27fc047e170

Benchmark results (initsA is the new implementation with arrays, initsQ = Data.List.inits, and initsT uses take):
Plain text output:
benchmarking initsA
time 216.7 ms (210.6 ms .. 221.5 ms)
1.000 R² (0.999 R² .. 1.000 R²)
mean 214.4 ms (210.5 ms .. 216.7 ms)
std dev 4.450 ms (1.862 ms .. 7.362 ms)
variance introduced by outliers: 14% (moderately inflated)
benchmarking initsQ
time 249.9 ms (243.1 ms .. 259.4 ms)
1.000 R² (0.999 R² .. 1.000 R²)
mean 246.9 ms (245.5 ms .. 248.6 ms)
std dev 2.080 ms (1.056 ms .. 3.051 ms)
variance introduced by outliers: 16% (moderately inflated)
benchmarking initsT
time 259.8 ms (252.5 ms .. 266.8 ms)
1.000 R² (0.999 R² .. 1.000 R²)
mean 262.6 ms (260.7 ms .. 264.4 ms)
std dev 2.408 ms (1.423 ms .. 3.065 ms)
variance introduced by outliers: 16% (moderately inflated)
initsA isn't lazy like the others, so it's hard to make a meaningful comparison.

It is lazy like the others. It uses unsafe primitives, but they are entirely invisible to a user of inits, however they choose to force it.
My apologies. I missed one of the unsafePerformIO invocations.
Ah, right...