Efficient alternate representations for lists of characters (strings).

Question

Efficient alternate representations for lists of characters (strings).

UWN opened this issue 6 years ago · comments

Consider to add an internal string data type as an alternate representation for lists of characters:

Complete strings

The tagged value is a byte-aligned pointer to a null-terminated sequence of UTF-8 characters on the heap (copy stack). This means that the full address-space of 32/64 bits cannot be used but some (1 or two) bits less than that. In many systems this is not a problem since the operating systems already makes similar assumptions anyway.

A list of n characters [C1, ..., CN] would thus occupy between n+1 and n+8 bytes, depending on alignment, whereas the naive representation as a list of chars would take n · (3 · 8) bytes.

In this manner, the "tail" of the strings can be rapidly (constantly) computed which is key for fast parsing with DCGs. Note that SWI7's built-in string requires for sub_string(String, 1, _, 0, Tail) time and space proportional to the length of String.

Then, all operations from unification upwards need to be able to handle both lists of chars and that string type. That is indeed the more challenging part.

Partial strings

This would take the idea one step further. Instead of representing a list of known characters, a partial list of known characters would be represented compactly. E.g. Xs0 = [a,b,c,d|Xs] would be represented as a null-terminated string "abcd" plus padding to the next word which represents Xs. This would make library(pio) very fast!

Just as a remark: Strings that contain the null-character can still be represented as a regular lists of characters.

Mark Thom · Answer 1 · Thu Aug 30 2018 10:26:57 GMT+0800 (China Standard Time)

Should there be a specialized syntactic form for the introduction of partial strings? Something like X = diff_string("abc"), so that X unifies against [a,b,c|Y] (but Y remains free) maybe?

UWN · Answer 2 · Thu Aug 30 2018 23:00:26 GMT+0800 (China Standard Time)

if you want to have something special for it, then use a built-in predicate. Like list_to_compactlist/2 or the like. But not a special functor as diff_string/1.

UWN · Answer 3 · Thu Aug 30 2018 23:33:15 GMT+0800 (China Standard Time)

But most of the time there will be built-ins (like those for library(pio)) that will generate compact representations. Or think of atom_chars/2, number_chars/2.

Another option is to produce a compact representation during garbage collection. But rather leave this for a much later phase.

Kauê Hunnicutt Bazilli · Answer 4 · Sat Jan 06 2024 07:31:03 GMT+0800 (China Standard Time)

Ok, from what I was able to understand by reading this, #95 and the source code at different points in time, it seems that Scryer once did implement partial strings as intended (inline in the heap) using RawBlock. But after redoing the heap representation in 0404c3b it fell back into allocating the strings in the atom table instead of in the heap. The new representation is really nice, but I think it has to be changed again to enable inline heap partial strings.

Ideally we could just interpret a pointer into the heap as either a HeapCellValue or a str buffer, but I don't even know if it is actually "allowed" (in the sense of Undefined Behavior) to do something like this in Rust. It certainly isn't in C¹, because of strict aliasing. I need to research this better.

A better way may be to represent a HeapCellValue as an enum of the current representation and [u8; 8], and then implement a bazillion traits to make it actually work as intended. This would be a big architectural change, but way more safe, if it is possible.

Actually, this would be legal in C because you can alias any type by char* if I remember correctly. ↩

Mark Thom · Answer 5 · Sat Jan 06 2024 07:48:50 GMT+0800 (China Standard Time)

Partial strings were never allocated in the heap, no.. pointers in the heap owned string instances that were allocated with the global allocator.

Kauê Hunnicutt Bazilli · Answer 6 · Sat Jan 06 2024 08:02:49 GMT+0800 (China Standard Time)

Partial strings were never allocated in the heap

This makes more sense, thanks for the correction. I think this just adds to my point that implementing this as intended is hard.

Kauê Hunnicutt Bazilli · Answer 7 · Sat Jan 06 2024 08:30:08 GMT+0800 (China Standard Time)

Or maybe that indicates that inlining the string is the wrong approach. I can see a lot of benefits that an indirection provides:

Avoids all of this questionably legal unsafe stuff.
Can make use of the builtin Rust functionality for strings.
Allows special management if we want to make some fancy stuff latter.
- This is maybe a bad thing. As I understand this is part of the problem we have with the current implementation that uses the atom table, as it can't clean itself on backtracking because it's lifetime isn't connected to the heap. It will also need to be garbage collected separately from the heap, which is more implementation complexity.
- I also think this is maybe not as bad of a thing as it may seem, because we could just use a type that owns the string. For "garbage collection" we could maybe just use a Rc initially and it will cover most cases.
- This indirection also makes things that need to maintain the partial string across backtrackings (like phrase_from_stream/2) not be artificially bad, because it doesn't need to copy the (possibly giant) string from the heap, just the owning reference.
- Even though this means more indirection (maybe even more than 1 layer with things such as Rc), I still think it's worth it because it will still be better than raw lists, and will avoid many difficulties in inlining into the heap.

While I'm at it, I also think it would be good to store the partial string length somewhere, instead of representing it implicitly with a NUL byte. If we use a &str or String in the implementation this would be done automatically. This would also simplify the handling of strings with NUL bytes, but that doesn't really matter that much.

Markus Triska · Answer 8 · Sat Jan 06 2024 15:32:48 GMT+0800 (China Standard Time)

The key advantage of heap allocation is fast reclamation of memory on backtracking, using constant time!

UWN · Answer 9 · Sun Jan 07 2024 04:56:34 GMT+0800 (China Standard Time)

... and the other advantage of null-terminated strings on the heap is that a DCG for parsing does not require auxiliary memory. Not like #1714

Kauê Hunnicutt Bazilli · Answer 10 · Mon Jan 08 2024 12:23:22 GMT+0800 (China Standard Time)

Ok, now that I've researched it a bit I think it's completely legal in Rust to just straight up interpret a part of the heap as a &str. I made a proof of concept here, and it passes Miri. It's actually very simple, but I think implementing the rest of how this would interact with all the other things in the engine (copying of heap cells for example) will be a lot of work.

And maybe this simple implementation is not actually enough, because it ties the lifetime of the &str with the lifetime of the heap, so I think this constraints what you can do with a &mut heap while you have that &str around. Maybe something more like slice::split_at_mut() would be needed here, but it would also complicate the uses of the heap anytime we needed to get a inlined string.

Maybe it would be best to not create a reference here at all and just work with *const u8, which would be kind of unfortunate because we couldn't easily make use of the builtin Rust functionality for strings and because we would need unsafe everywhere we use it.

My proof of concept also has the unfortunate implementation detail that it needs to traverse the string twice, once for making a CStr and another time for UTF-8 validation in the conversion to &str. The traversal in CStr seems to be an implementation detail that isn't really supposed to be there. Ideally they could both be done in a single pass.

Kauê Hunnicutt Bazilli · Answer 11 · Tue Jan 09 2024 01:00:11 GMT+0800 (China Standard Time)

Thinking more about this, I think using *const u8 here is more useful, not only because of the lifetime issues I talked about above, but also because it enables partial strings to contain arbitrary bytes, not just valid UTF-8, which may be useful in a lot of use cases. We could do the UTF-8 check separately for the "raw inline buffers" that need it.

Markus Triska · Answer 12 · Tue Jan 09 2024 01:19:22 GMT+0800 (China Standard Time)

The internal encoding should be fixed to UTF-8, since supporting different internal encoding variants for strings is too error-prone.

Conversion to a Rust string may only be needed in comparatively few places (example use case: predicates from library(os) that are implemented with Rust features) and in such cases a temporary Rust string can be quickly constructed character by character as they occur on the heap. It seems more important to implement the basic Prolog building blocks (unification, reading/writing, term inspection etc.) correctly so that they support this representation.

An interesting design question is: How can the code be structured so that as few places as possible need to take the specialized string representation into account? The more places are affected, the easier it is to accidentally forget that the representation must be handled.

Kauê Hunnicutt Bazilli · Answer 13 · Tue Jan 09 2024 01:24:40 GMT+0800 (China Standard Time)

The internal encoding should be fixed to UTF-8, since supporting different internal encoding variants for strings is too error-prone.

I was more thinking about "partial lists of bytes/octets", not alternative encodings for strings. This would be useful for things that need to operate on raw bytes. I agree that UTF-8 should be the only builtin encoding for strings.

UWN · Answer 14 · Tue Jan 09 2024 01:25:18 GMT+0800 (China Standard Time)

There is one aspect which caught my attention recently. The key element of partial strings is the use of null-terminated UTF-8 strings which begs the question how to represent the zero-character. So far, I believed that resorting to Prolog terms would be the best representation, but another option might be to use Modified UTF-8 which assigns some two-byte encoding to the null character. So that might make things a bit easier.

UWN · Answer 15 · Tue Jan 09 2024 01:27:33 GMT+0800 (China Standard Time)

I was more thinking about "partial lists of bytes/octets"

Compared to using UTF-8 characters, this would reduce their overhead at best by a factor of two. Worth the effort?

Markus Triska · Answer 16 · Tue Jan 09 2024 02:50:02 GMT+0800 (China Standard Time)

Personally, I think the space saving is not worth the effort, but there is a different interesting advantage that a dedicated encoding specifically for octets could yield, if it guarantees that every byte can be processed in the same amount of time: This could be a very useful property for cryptographic routines that must not reveal any properties of secret data in any way, not even by different amounts of times it takes to process it. If processing the byte 0 (for example) takes exactly the same amount of time as any other byte, then such a dedicated internal encoding could be attractive, combining the advantages of same-time-per-byte processing with a very compact internal encoding.

This is only worth considering if the number of places in the Rust parts of Scryer Prolog that have to take such an encoding into account is kept to an acceptably small amount, certainly much smaller than it is currently and where even "normal" strings are not yet correctly implemented (#1969).

Kauê Hunnicutt Bazilli · Answer 17 · Tue Jan 09 2024 03:03:43 GMT+0800 (China Standard Time)

So far, I believed that resorting to Prolog terms would be the best representation, but another option might be to use Modified UTF-8 which assigns some two-byte encoding to the null character. So that might make things a bit easier.

Or we could just store the length of the inlined string somewhere and avoid this and many other problems automatically. Conceptual example for term Xs = [a,b,c,'\x0\',d]:

Null terminated layout:

Cell 0: PartialStringPointer(1)
Cell 1: InlinedBytes(['a','b','c',0,0,0,0,0])       ┬── PartialString
Cell 2: ListPointer(3)                              ┘
Cell 3: Atom('\x0\')                                ┬── Cons
Cell 4: PartialStringPointer(5)                     ┘
Cell 5: InlinedBytes(['d',0,0,0,0,0,0,0])           ┬── PartialString
Cell 6: Atom([])                                    ┘

Length layout:

Cell 0: PartialStringPointer(1)
Cell 1: Integer(5)                                  ┬── PartialString
Cell 2: InlinedBytes(['a','b','c',0,'d',0,0,0])     │
Cell 3: Atom([])                                    ┘

This simplifies this case, but a string with a NUL character is an edge case. However, having a length is very useful in many other places, so that we don't have to traverse the string and/or keep track of the length separately. NUL terminated strings are widely considered to be one of C's greatest defects, and Zig and Rust have builtin slices for this reason. I don't think that an extra 8 bytes of overhead in the representation of the partial string is too much of a cost to pay for this, especially for large strings which are the ones that benefit the most from this.

Personally, I think the space saving is not worth the effort, but there is a different interesting advantage that a dedicated encoding specifically for octets could yield

Another use would be to mmap a file as binary, so that we could efficiently reason about it's contents with DCGs.

UWN · Answer 18 · Tue Jan 09 2024 03:06:26 GMT+0800 (China Standard Time)

just store the length of the inlined string somewhere

There is no such space. That is it would produce garbage all the time, like #1714

Kauê Hunnicutt Bazilli · Answer 19 · Tue Jan 09 2024 03:08:37 GMT+0800 (China Standard Time)

There is no such space. That is it would produce garbage all the time

I don't understand, could you elaborate? Can't we encode it like I showed?

UWN · Answer 20 · Tue Jan 09 2024 03:15:42 GMT+0800 (China Standard Time)

Just look at l/1:

l([]).
l([_|L]) :- l(L).

And now use your representation with "abc". In the first inference, L is "bc". How do you express this with your length layout? You would need some auxiliary datastructure that now contains the new length and "bc". And that costs space.

Whereas the null-terminated version just increments the pointer (and checks that the next element is not \0).

Kauê Hunnicutt Bazilli · Answer 21 · Tue Jan 09 2024 03:18:38 GMT+0800 (China Standard Time)

Ok, that makes complete sense, thanks for the explanation. I still wonder if there is a way to avoid sentinels without space concerns, but with this I'm convinced that inlining the length doesn't make sense.

Kauê Hunnicutt Bazilli · Answer 22 · Tue Jan 09 2024 05:45:15 GMT+0800 (China Standard Time)

I just thought about an optimization: if we have an inlined partial string that contains zero bytes, we don't need two allocations, we can reuse that zero byte as a sentinel. For Xs = [a,b,c,'\x0\',d,e]:

Cell 0: PartialStringPointer(address_of_cell(2))        ┬── PartialString
Cell 1: ListPointer(3)                                  ┘
Cell 2: InlinedBytes(['a','b','c',0,'d','e',0,0])
                      ├─────────┘   ├─────┘
                      Cell 0        Cell 4
Cell 3: Atom('\x0\')                                                        ┬ Cons
Cell 4: PartialStringPointer(                           ┬── PartialString   │ 
            address_of_cell(2) + 4                      │                   │
        )                                               │                   │
Cell 5: Atom([])                                        ┘                   ┘

Here I'm assuming that the "arguments" to "Cons" can be of arbitrary size, not just 2. I don't know how this is actually implemented or if there are any problems with this, but it allows skipping an indirection by having "PartialStringPointer" always be followed by the tail. This could make something #251 work with arbitrary UTF-8 strings (or even arbitrary mmaped binary files, as I talked about above).

UWN · Answer 23 · Tue Jan 09 2024 15:47:54 GMT+0800 (China Standard Time)

(I don't get your point) What happens with l/1? How does this scheme avoid the creation of any auxiliary datastructure?

Kauê Hunnicutt Bazilli · Answer 24 · Tue Jan 09 2024 16:35:36 GMT+0800 (China Standard Time)

How does this scheme avoid the creation of any auxiliary datastructure?

It avoids an indirection and the need for a PartialStringLocation cell.

Indirection:

Cell 0: ListPointer(1) % Here ListPointer points to 2 cells                 
Cell 1: Atom('\x0\')                    ┬── Cons (2 cells)
Cell 2: PartialStringLocation(4)        ┘
Cell 3: PartialStringPointer(           ┬── PartialString (2 cells, pointed at)
            address_of_cell(1) + 4      │
        )                               │
Cell 4: Atom([])                        ┘

No indirection:

Cell 0: ListPointer(1) % Here ListPointer points to an abitrary number of cells,
                       % but 2 "values".
Cell 1: Atom('\x0\')                    ┬── Cons (3 cells, with the partial string in
Cell 2: PartialStringPointer(           │         the cdr represented as 2 cells)
            address_of_cell(1) + 4      │
        )                               │
Cell 3: Atom([])                        ┘

What happens with l/1?

Well, in the example I gave with [a,b,c,'\x0\',d,e], it traverses the buffer until finding the NUL character, then goes to the ['\x0\'|[d,e]], then traverses the buffer again until the final '\x0\'. Is there any problem with that? This doesn't seem to have any problem that the indirection version shouldn't also have, unless I'm completely misunderstanding how all of this works.

Also, the indexes in my example before were wrong, I corrected them.

UWN · Answer 25 · Tue Jan 09 2024 17:23:14 GMT+0800 (China Standard Time)

When does what operation happen? You are not referring to the l/1 example, say where the original string is, where L is (in the first inference), then in the second etc

Markus Triska · Answer 26 · Sun Mar 24 2024 17:11:56 GMT+0800 (China Standard Time)

One general comment about this issue, also directed to authors of other Prolog systems so that they can learn from the experience we gain here: Implementing or retaining an ad hoc workaround about such an elementary design issue may eventually cost more resources than actually solving it. As a recent example, and ca. 6 years after first discussing this issue, we have #2356 which again needed addressing on its own. With partial strings implemented correctly, all the writing can be moved to Prolog and the time can be used instead to further strengthen the core.

Markus Triska · Answer 27 · Mon Mar 25 2024 06:37:11 GMT+0800 (China Standard Time)

With #2370, another issue that seems related to the current representation has now appeared.

@mthom, @bakaq ... please ... save us!

Markus Triska · Answer 28 · Mon Apr 08 2024 13:31:06 GMT+0800 (China Standard Time)

Another such case is #2381.

Efficient alternate representations for lists of characters (strings).

Complete strings

Partial strings

Footnotes