Making Nim's strings Copy-On-Write (COW)

Question

Making Nim's strings Copy-On-Write (COW)

Araq opened this issue 4 years ago · comments

Motivation: Make naive string handling code faster

In the ARC/ORC implementation string literals are already not copied and not destroyed. When these strings are potentially mutated nimPrepareStrMutationV2 is called to copy the constant string into a buffer that allows for mutations. Currently the condition (s.p.cap and strlitFlag) == strlitFlag is used to determine whether s.p points to a constant string.

In Delphi/FPC a refcount of -1 is used instead to allow for a more general COW mechanism. This RFC proposes to copy this design as much everyday Nim code is written without the current expensive copies in mind, e.g.:

proc getName(p: Person): string = p.name

This is particularly troublesome for async procs where the snippet env.field = param is introduced so that parameters can be captured inside closures. However, a sink parameter probably has the same benefit.

As a side effect Nim's ARC strings can support the old GC_ref / GC_unref calls.

Expected compat problems

It means that cast[seq[char]](mystr) stops working. We can detect this case in the compiler
and fusion / compat can introduce castToSeq / castToString cast operations that
work for all runtimes.

Alternatives/additions considered but rejected (for now)

Make Nim's strings use C++'s SSO

SSO looks like a rather fragile questionable optimization, so copying short strings around doesn't involve touching the heap, yet long strings are as slow as before to copy? And almost every operation internally needs to distinguish between "is this a long or a shorts string". Also, in Nim filenames/paths are currently not a dedicated type and directly stored as string too; most paths are longer than what is covered by the "short string" representation.

Make Nim's string slices O(1)

People coming from Go (and for some reason also people coming from Python) expect substr and s[1..3] to be faster than they really are. In the old runtime this was even worse as string allocations cause the occasional GC run. In the current new runtime the allocations remain. If the string implementation is changed from

type
  NimStringV2 {.core.} = object
    len: int
    p: ptr NimStrPayload ## can be nil if len == 0.

to

type
  NimStringV2 {.core.} = object
    len, offset: int
    p: ptr NimStrPayload ## can be nil if len == 0.

The slice operation can be done in O(1) time. However, every slice creation would
cause a inc(s.p.refcount) operation and also involve a corresponding destruction step. This
probably means that it's not fast enough in practice and so our effort is better spent on
using more openArray[char] with borrowing rules instead. This also has the benefit that short string slices cannot keep a multi megabyte sized string buffer alive for too long; a problem
big enough in the JVM world that they moved from O(1) slicing to O(N) slicing in a patch release.

Implementation effort

Patch strs_v2.nim
Patch the Nim C code generator so that string literals have a refcount field too.
Run some benchmarks.

Clyybber · Answer 1 · Wed May 06 2020 20:26:45 GMT+0800 (China Standard Time)

I think slices should be of type lent and should copy on write.
For mutable views maybe a mutable equivilalent of lent called view should be introduced.

Clyybber · Answer 2 · Wed May 06 2020 20:32:47 GMT+0800 (China Standard Time)

We could also do this: In all assignments let a = b where a doesn't escape the stackframe, make a lent.
If we were to later on pass a to a proc that takes a as a sink or normal parameter, we could detect that and convert let a = b to a = or =sink assigment. (Maybe only when it can be converted to a =sink assigment? And in all other cases COW instead?)

Then we would benefit from parameters being lent if possible, so maybe some kind of inference as we do for sink now would be possible.

Mamy Ratsimbazafy · Answer 3 · Wed May 06 2020 20:53:14 GMT+0800 (China Standard Time)

I agree with @Clyybber before going all the way to COW we need to use the sink/move/lent optimizations.

Second, there are 2 performance issues with strings:

Copies that you mention
Intermediate strings when doing transformations

I'd argue that COW is solving the wrong problem (problem 1). The copies are often the result of a slice that stay in the same stack frame and can benefit from lent and/or openarray as values (#88).

And the most important performance issue is intermediate allocations. dup is a step in the right direction. We need a way to compose string transformations that minimize the allocations, hence re-use buffers. It can be zero-functional inspired, it can be state-machine based, it can be C++ or D rangesinspired.

COW doesn't solve the problem of allocation of intermediate results.

Sources (i.e. benchmark should be inspired by the kind of code people are trying to write here):

Andreas Rumpf · Answer 4 · Wed May 06 2020 21:00:14 GMT+0800 (China Standard Time)

Let me say it differently: --gc:arc already has COW, but limited. It seems to me there is no additional runtime overhead to make it more general. The cost is a refcount per string, one machine word.

Mamy Ratsimbazafy · Answer 5 · Wed May 06 2020 21:07:16 GMT+0800 (China Standard Time)

It's a very basic type, it should be as simple as possible given that it will be used in many contexts including embedded, async and multithreading.

Does the refcount become atomic with --threads:on?

Andreas Rumpf · Answer 6 · Wed May 06 2020 21:13:35 GMT+0800 (China Standard Time)

As simple as possible is covered by openArray[char] which is C++'s string_view in all but its name.

Does the refcount become atomic with --threads:on?

Unlikely, we'll probably keep doing =deepCopy for threading.

zah · Answer 7 · Thu May 07 2020 01:35:18 GMT+0800 (China Standard Time)

Wouldn't copy-on-write be addressed better in the type system instead of through run-time flags?

A string literal can have a different type that is implicitly convertible to an immutable string, but also triggers a copy when being passed as var string parameter.

With that said, I like the idea of allowing the allocated buffer of strings and sequences to be shared and reference counted without introducing an indirection. Something along the lines of C++'s allocated_shared comes to mind where the ref count appears in the memory just before the normal string value.

The holy grail in the strings vs sequence debates for me is figuring out a way how to make string the same type as seq[byte] (or merely distinct types). The fact that they are different types can lead to a lot of unnecessary copying of buffers in networking and serialization libraries. For the uninitiated, the obvious problem here is that the null terminator of strings is required for the C interop. I dream of a solution where the null terminator will be inserted on-demand when you trigger the cstring conversion, but then the question is how do you ensure that there is a place for it?

One possible answer is that the cstring conversion is a place where you introduce a copy of the string if its buffer cannot be patched in place with a zero past the last character, but the other more crazy solution is that you leave something at the end of the string allocation cell (e.g. the ref count of the flag we are discussing here). If you have left just one byte there, you can temporarily turn it to a null terminator for the duration of the cstring call.

Combining these two ideas, the ref counted strings can store their ref count at the end of the allocated cell for the buffer. The price you'll pay is that the string capacity is slightly decreased. You can check the capacity value to recognize the type of string you are working with, so you can support transitions between the ref-counted mode and the single-owner mode.

Andreas Rumpf · Answer 8 · Thu May 07 2020 16:22:03 GMT+0800 (China Standard Time)

Combining these two ideas, the ref counted strings can store their ref count at the end of the allocated cell for the buffer.

That seems to be very bad for locality / CPU caches.

Andreas Rumpf · Answer 9 · Fri May 08 2020 15:53:33 GMT+0800 (China Standard Time)

For the uninitiated, the obvious problem here is that the null terminator of strings is required for the C interop.

In practice, for many programs the IO primitives offered by C often have APIs that take char* data, size_t len pairs and so don't need the null terminator. The most common exception is filenames and both Rust and C++ use dedicated, non-string types for Path. Then Path can be zero-terminated under the hood whereas cstring(nimstr) can be more expensive and do the alloc+copy for the zero terminator.

Given the other obvious benefits (type safety!), maybe it's time to replace the os and io subsystems with something better.

zah · Answer 10 · Fri May 08 2020 22:57:52 GMT+0800 (China Standard Time)

Well, I'll be happy if seq[byte] and string get the same run-time represenation, one way or another.

Otherwise, to explain the idea above more precisely, the cache locality argument shouldn't be that bad, because:

A) We can consider the ref counting case as more rare and not worth optimising.

To achieve this we only need a fast check for the question "Is this string ref-counted?". We can do this by checking a single bit in the capacity value (all normal allocations will have even capacity values because the memory allocator works with such cells, while the ref counted ones will use odd capacity values). Please notice how in this scheme, it's always possible to switch from the ref counted mode to normal ownership and the entire capacity is recovered when you do this.

A read of the capacity field is not a problem for cache locality, so this problem is gone.

B) As SSO proclaims, most strings are small (smaller than the cache line of a modern processor, which is likely to continue increasing). Thus, quite often, the ref count will also be in the CPU cache anyway.

cooldome · Answer 11 · Wed May 13 2020 03:04:07 GMT+0800 (China Standard Time)

I had a read, feels that sink T, lent T, var T and openarray[T] combined with #178 is more generic solution to the optimization/performance problems.

It does feel odd that string becomes cow while the rest of types don't. Isn't it better to have a generic cow solution that can turn any type into in cow if user wants it?

Jacek Sieka · Answer 12 · Thu May 21 2020 18:00:47 GMT+0800 (China Standard Time)

I dream of a solution where the null terminator will be inserted on-demand when you trigger the cstring conversion, but then the question is how do you ensure that there is a place for it?

C++ dreamed of that too, with c_str being a const method that mutated the string nonetheless in some libraries - these strategies however quickly ran into issues because strings can no longer be consts (ie stored in the readonly section of the executable) and they violated the signature of the function causing trouble downstream for optimizations and downstream analysis tools that now couldn't make other optimizations - one step forwards, two steps backward.

Instead of making string and seq[byte] the same, it seems like one could exploit their differences somehow instead - after all, why go through all the trouble of having two types if you end up making them the same?

any cow solution doesn't just introduce one word of refcount overhead - it also forces every operation to maintain that refcount and introduces complex / costly operations in unexpected places - ie mystring[4] = 'c' is suddenly a potential O(N) operation which is pretty nuts.

SSO is efficient for several reasons:

most strings small and therefore can save not only on the allocation but also because it makes good systemic use of all those bits that for the vast majority of strings don't get used at all, like the higher parts of the length and capacity - this remains a waste with COW
with threads, there is no contention around a cow flag that requires synchronized access - depending on where nim threading goes, this would play a significant role.

borrowing is indeed the systemic solution to this problem - if it was introduced as an optimization, it could analyze away copies so the cost to the programmer in terms of having to deal with syntax would be kept low - the way to approach the problem then would be to design the string to be more amenable to this kind of analysis by constraining what you can do with it.

Jacek Sieka · Answer 13 · Thu May 21 2020 18:12:34 GMT+0800 (China Standard Time)

cstring(nimstr)

reallocating here opens up a problematic memory safety issue as well - who owns the memory of proc x(s: string): cstring = cstring(x)?

zah · Answer 14 · Thu May 21 2020 19:15:50 GMT+0800 (China Standard Time)

strings can no longer be consts (ie stored in the readonly section of the executable)

It's trivial to make the const-section strings zero-terminated

violated the signature of the function

With const_cast existing in the language, I cannot see how any tool can rely on the signature to enable or disable certain optimisations.

it seems like one could exploit their differences

There aren't enough differences to justify the code bloat and the buffer copies needed in conversions that stem from having two different types.

zah · Answer 15 · Thu May 21 2020 19:20:18 GMT+0800 (China Standard Time)

reallocating here opens up a problematic memory safety issue as well - who owns the memory of proc x(s: string): cstring = cstring(x)?

This memory can certainly be freed if you consider the code to be equivalent to:

callSomeCFunction(TempCString(data: createCStringCopy(nimString)))

where TempCString is a type with a destructor that will deallocate the memory. If we follow C++ rules, the destructor will be called at the semi-colon of the enclosing expression (i.e. just after the callSomeCFunction completes)

Antonis Geralis · Answer 16 · Tue Jul 20 2021 04:17:04 GMT+0800 (China Standard Time)

So here is a working implementation https://github.com/planetis-m/dumpster/blob/master/cow/cowstrings.nim in case anyone wants to benchmark it.

And here is a benchmark of std.strings/cowstrings/ssostrings. https://github.com/planetis-m/ssostrings/blob/master/tests/bench1.nim

Also relevant: http://www.gotw.ca/publications/optimizations.htm