haskell / text

Right now there are many ways of creating empty Text values that end up as distinct heap objects. This wastes memory, and it would be good if the library guaranteed that they were all represented by Data.Text.empty.

My suggestion is that whenever we detect that we are creating a zero-length Text value, we use Data.Text.empty rather than creating a new heap object.
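To make the suggestion concrete, here is a minimal sketch on a toy type. Txt, emptyTxt and mkTxt are hypothetical stand-ins for illustration only, not the library's definitions, and the String payload stands in for the real array.

```haskell
-- Toy stand-in for a strict Text: payload, offset, length.
-- (Hypothetical type for illustration; not Data.Text.Internal.Text.)
data Txt = Txt !String !Int !Int
  deriving (Eq, Show)

-- The single empty value that every zero-length result should share.
-- NOINLINE keeps it as one top-level closure instead of copying it around.
emptyTxt :: Txt
emptyTxt = Txt "" 0 0
{-# NOINLINE emptyTxt #-}

-- Smart constructor: return the shared value whenever the length is zero,
-- otherwise build a fresh Txt.
mkTxt :: String -> Int -> Int -> Txt
mkTxt arr off len
  | len == 0  = emptyTxt
  | otherwise = Txt arr off len
```

Sharing only holds if every producer goes through mkTxt; code that applies the Txt constructor directly with a zero length still gets its own heap object.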

I'd be happy to give implementing this a go.

Actually it looks like this is already attempted:

text/src/Data/Text/Internal.hs

Line 124 in 197bbcb

text arr off len | len == 0 = empty

But I definitely see a lot of empty Text values on the heap in our work application when inspecting it with ghc-heap. I'll investigate further.

Edit: it looks like this smart constructor is not used consistently, and even where it is used, empty appears to get inlined, which might mean that no sharing occurs.

Not all Text values are created using that. E.g. aeson's key/value parser doesn't (I wasn't aware of the text smart constructor).

> it looks like this smart constructor is not used consistently and even then it looks like empty gets inlined so that might mean that no sharing occurs

That sounds like a quite tricky situation. It would be great to document what's going on somewhere.

There are two variants of empty:

empty is marked INLINE [1] (introduced in fd96b66, and also present in Tom Harper's PhD thesis, https://www.cs.ox.ac.uk/files/3929/dissertation.pdf, but without an explanation)
empty_ is marked NOINLINE (introduced in 6d75e1a)
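For anyone following along, the difference between the two pragmas can be sketched on a toy type. Txt, emptyInl and emptyShared are hypothetical names used only for this illustration; they are not the library's definitions.

```haskell
-- Toy stand-in for a strict Text: payload, offset, length.
-- (Hypothetical type for illustration; not Data.Text.Internal.Text.)
data Txt = Txt !String !Int !Int
  deriving (Eq, Show)

-- Analogue of empty: INLINE [1] tells GHC not to inline before simplifier
-- phase 1, and to be keen to inline from phase 1 onwards. Each call site
-- may then end up with its own copy of the constructor application.
emptyInl :: Txt
emptyInl = Txt "" 0 0
{-# INLINE [1] emptyInl #-}

-- Analogue of empty_: NOINLINE means call sites always refer to this one
-- top-level closure, so the value is shared.
emptyShared :: Txt
emptyShared = Txt "" 0 0
{-# NOINLINE emptyShared #-}
```

The two values are equal as far as any program can observe through Eq; the pragmas only influence whether call sites end up pointing at one closure or at several copies.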

I'm a bit confused why there's a variant that inlines at all. I can't really see an advantage in ever inlining this. It seems to me that it would just lead to code bloat and increased allocations.

Maybe it's explained by it being very old code and no one has had the confidence to touch it.

Perhaps someone else knows if there's a reason to have it?

If you have code that destructures the output Text right away, inlining lets you simplify that step. I could at least imagine a toy benchmark where this pays off. I'm not sure how good or bad that is in practice.

That's a good point! I can imagine something like length or null could be optimised away. I think we could probably add some RULES to mitigate possible regressions.
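A rough sketch of what such rules could look like, on a toy type. Txt, emptyTxt, lenTxt and nullTxt are hypothetical stand-ins; this is not a claim about which rules the library has or should have.

```haskell
-- Toy stand-in for a strict Text: payload, offset, length.
-- (Hypothetical type for illustration; not Data.Text.Internal.Text.)
data Txt = Txt !String !Int !Int
  deriving (Eq, Show)

-- Shared empty value; NOINLINE keeps the name visible for the rules below.
emptyTxt :: Txt
emptyTxt = Txt "" 0 0
{-# NOINLINE emptyTxt #-}

-- Consumers are held back until phase 1 so that the rules get a chance to
-- match before the function bodies are inlined.
lenTxt :: Txt -> Int
lenTxt (Txt _ _ len) = len
{-# INLINE [1] lenTxt #-}

nullTxt :: Txt -> Bool
nullTxt (Txt _ _ len) = len == 0
{-# INLINE [1] nullTxt #-}

-- With optimisation on, these rewrite calls on the shared empty value to
-- constants, recovering what inlining emptyTxt would have given for free.
{-# RULES
"lenTxt/emptyTxt"  lenTxt emptyTxt = 0
"nullTxt/emptyTxt" nullTxt emptyTxt = True
  #-}
```

The rules only cover call sites where emptyTxt appears syntactically after simplification, so they would mitigate rather than fully replace what inlining does.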

I took a quick look at this. I made the changes in my draft MR and peered at the Core diffs with GHC 9.4.
It looks like the current code sometimes leads to empty being floated out, de facto creating a shared empty value for each module, but sometimes doesn't.

The code as it stands often gets optimised into a worker-wrapper that uses unboxed Text. This is helpful because then the functions in Data.Text.Lazy can often go directly from the unboxed representation to a Lazy Text without having to allocate a strict Text at all.

Changing it so that we are more consistent about sharing empty Text in Data.Text leads to some of this unboxing being inhibited, which then leads to worse code in Data.Text.Lazy.

I think one way of avoiding this is to do some manual worker-wrapper on the functions in Data.Text that are also used by Data.Text.Lazy. We can extract a worker that doesn't check for emptiness and returns an unboxed Text. Then wrappers in Data.Text and Data.Text.Lazy can do the appropriate boxing.
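Roughly what I have in mind, sketched on toy types. Txt, LazyTxt, takeWorker, takeTxt and takeLazy are hypothetical names for illustration; the real functions and representations differ.

```haskell
{-# LANGUAGE UnboxedTuples #-}

-- Toy stand-in for a strict Text: payload, offset, length.
data Txt = Txt !String !Int !Int
  deriving (Eq, Show)

-- Toy stand-in for a lazy Text: a chunk list with the strict chunk unpacked,
-- so a Chunk can be built straight from the three fields.
data LazyTxt = Empty | Chunk {-# UNPACK #-} !Txt LazyTxt
  deriving (Eq, Show)

-- Shared empty strict value.
emptyTxt :: Txt
emptyTxt = Txt "" 0 0
{-# NOINLINE emptyTxt #-}

-- Worker: no emptiness check, result returned as an unboxed tuple, so no
-- strict Txt is allocated here.
takeWorker :: Int -> Txt -> (# String, Int, Int #)
takeWorker n (Txt arr off len) = (# arr, off, min (max n 0) len #)
{-# INLINE takeWorker #-}

-- Strict wrapper: boxes the result and shares emptyTxt for zero length.
takeTxt :: Int -> Txt -> Txt
takeTxt n t = case takeWorker n t of
  (# arr, off, len #)
    | len == 0  -> emptyTxt
    | otherwise -> Txt arr off len

-- Lazy wrapper (single chunk for brevity): goes from the unboxed fields
-- to Empty or Chunk directly.
takeLazy :: Int -> Txt -> LazyTxt
takeLazy n t = case takeWorker n t of
  (# arr, off, len #)
    | len == 0  -> Empty
    | otherwise -> Chunk (Txt arr off len) Empty
```

The emptiness check then lives in each wrapper rather than in the shared worker, so the strict side can return the shared empty value without getting in the way of the lazy side.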

Optimise storage of empty Text