Aliasing rules for `Vec<T>` and other standard containers

Question

Aliasing rules for `Vec<T>` and other standard containers

chorman0773 opened this issue 4 years ago · comments

This was brought up briefly in #258, but I was wondering about the aliasing rules for Standard Containers, partricularily Vec<T,A>.
Right now, a safety invariant for Vec is that the inner pointer is not aliased (this is necessary, because Vec can modify it, and implements DerefMut). However, as was noted, making this an aliasing invariant may be non-trivial. However, notably, Vec<T,A> has an unstable layout, so an implementation could be (Unique<[T]>,usize,A). My question is, would this be a valid implementation? Would an implementation be permitted to treat Vec<T> specially w.r.t. the stacked borrows aliasing model?

(Note: all of the above explicitly references Vec<T,A> but the question also stands without loss of generality, to all standard library containers. IE. can an implementation assume exclusive ownership of the pointers managed by standard containers, however that may extend, even if the container is unused and mem::forgotten.)

Ralf Jung · Answer 1 · Sun Dec 06 2020 19:07:18 GMT+0800 (China Standard Time)

However, notably, Vec<T,A> has an unstable layout

Well... quoting from the docs:

Most fundamentally, Vec is and always will be a (pointer, capacity, length) triplet. No more, no less. The order of these fields is completely unspecified, and you should use the appropriate methods to modify these. The pointer will never be null, so this type is null-pointer-optimized.

So, only the order is unstable.

so a valid implementation could be (Unique<[T]>,usize,A)

Is the slice length here len or cap? If it is cap, shouldn't this be Unique<[MaybeUninit<T>]>?

Regarding your higher-level point, I don't have an answer, but I can try to tease this apart into two questions:

Vec as implemented in rustc uses Unique<T>. What does that mean in terms of aliasing guarantees? In SB it means nothing, but that's only because SB isn't well-suited to make aliasing assumptions without a known size. This is a problem elsewhere as well (#134), so if a "sizeless" way of asserting aliasing surfaces, I will certainly attempt to wire that up with Unique.
Looking beyond rustc itself, what is the "library UB" around aliasing for container data types? This is very hard to even say precisely, and I do not have a clear idea of the trade-offs here. Is there any legitimate code that would be ruled out by saying "all containers may assume their inner pointers are not aliased by user-held pointers"? This seems extremely vague so I am not even sure if it is useful.

One data point: for Vec, we actually do allow the user to keep pointers to some interior data as long as they ensure that the capacity does not change. You can see this witnessed by this test. This could be incompatible with your proposal of Unique<[T]>, if the aliasing requirements made by Unique are sufficiently strong.

Connor Horman · Answer 2 · Sun Dec 06 2020 22:16:30 GMT+0800 (China Standard Time)

Most fundamentally, Vec is and always will be a (pointer, capacity, length) triplet. No more, no less. The order of these fields is completely unspecified, and you should use the appropriate methods to modify these. The pointer will never be null, so this type is null-pointer-optimized.

I'd argue that isn't a stable guarantee, especially since an interpretation that it is wouldn't necessarily be compatible with the efforts in https://github.com/rust-lang/wg-allocators. This information has to be stored somewhere, and, at least for A = Global it must be reconstructible from these (in all other cases it must be reconstructible from those + A), but

Is the slice length here len or cap? If it is cap, shouldn't this be Unique<[MaybeUninit<T>]>?

Either one is reasonable. I actually originally had the question be Unique<[MaybeUninit<T>]> indicating it should be capacity, but having it be length may work better for things like derefence/deref_mut in particular, and also into_boxed_slice.

As for how the aliasing rules work, I'm not worried about the concreate rules for the implementation. lccc defines noalias in terms of the object a pointer is pointing to, which notably, for non-dynamic allocations, is (or should be) one-to-one with the Stacked Borrows model (reference-to-slices simply create a fictional object of type [T;n] where n is the length of the slice, but need not be a constant expression, and point into this).
The higher level rules are more where I am concerned, if I am even permitted to make these assumptions. However, seeing as it is (presumably) possible to calculate the length of a slice for aliasing purposes, defining it in terms of Unique<[T]> should be sufficiently permissive to allow such an implementation.
Thus turning to the rules for Unique, I like a version of the rules I use, that is:

Unique<T> shall be a non-null, well aligned pointer to T
If Unique<T> is not dangling, the region it points to shall not be aliased, as though occupying a Uniq item on the borrow stack.

we actually do allow the user to keep pointers to some interior data as long as they ensure that the capacity does not change

These are notably derivative pointers of the interior, by definition. The definition I use notably permits as many derivative pointers as it wants (provided either only 1 exists, or none have the unique attribute). I believe Stacked Borrows is similarily permissive when reborrowing as/converting to a raw pointer? So this would be well defined if getting the pointer to the array is defined as a reborrow to SharedReadWrite or SharedReadOnly (depending on the type of pointer). Though this may be dependant on #248, I'm not 100% familiar with the model (I will be reading up on it at some point, to attempt to form a relation between Stacked Borrows and the model lccc uses, to hopefully prove the latters correctness).

Is there any legitimate code that would be ruled out by saying "all containers may assume their inner pointers are not aliased by user-held pointers"?

This was a sort-of catch all question, and can certainly be analyzed individually, for each such container. In particular, this is desirable from the implementation-side for String, which is just a special-case of Vec<u8>, and probably for HashSet and HashMap as well. Whether or not any others could benefit from the same, I do not know. As for user code, I don't know of any that would be ruled out by a blanket statement like this, but I'd be open for input if there is. I assume in general, it would be similar things to what is discussed for Box in #258, though whether any actual code for this exists, I do not know.
I can happily settle for a more narrowly tailored statement, for specifically those types I mentioned (and even then, I'm the most attached to allowing this for Vec and String specifically)

Mario Carneiro · Answer 3 · Sun Dec 06 2020 22:38:35 GMT+0800 (China Standard Time)

I'd argue that isn't a stable guarantee, especially since an interpretation that it is wouldn't necessarily be compatible with the efforts in https://github.com/rust-lang/wg-allocators. This information has to be stored somewhere, and, at least for A = Global it must be reconstructible from these (in all other cases it must be reconstructible from those + A)

I wouldn't say it's incompatible; the quoted stability guarantee is only for the Vec<T> that we know and love, which is being ret-conned as Vec<T, Global>. Vec<T, A> for other values of A is a new kind of thing which may not have the same guarantees, but the existing guarantee about Vec<T> behaving as expected implies that Global is necessarily a ZST (and of course it is).

Ralf Jung · Answer 4 · Sun Dec 06 2020 23:11:44 GMT+0800 (China Standard Time)

I'd argue that isn't a stable guarantee

Things like that, stated in the doc, as as close to a stable guarantee as we have for libraries. If this isn't a stable guarantee, I don't know what is.

Connor Horman · Answer 5 · Mon Dec 07 2020 00:05:25 GMT+0800 (China Standard Time)

My question is be, what is that stable guarantee. What does the fact that a type consists of only one pointer and two usizes, but is repr(rust), guarantee. Even if we narrow it to only Vec<T,Global>, I can't see what it supposedly says. Does it say that Vec<T,Global> has the same size and alignment as [usize;3]? In my opinion, at best, this is a logical guarantee.

Lokathor · Answer 6 · Mon Dec 07 2020 00:39:23 GMT+0800 (China Standard Time)

I've seen old unsafe code that relied on Vec being "3x usize", on the grounds that the docs say it's a stable part of the type.

bjorn3 · Answer 7 · Mon Dec 07 2020 01:04:31 GMT+0800 (China Standard Time)

What does the fact that a type consists of only one pointer and two usizes, but is repr(rust), guarantee.

I did say that the guarantee is that decomposing into the raw parts and assembling it back again will always give exactly the original value back. It doesn't have any private fields that would get lost in the translation. Nor does it have any extra validity invariants, only extra safety invariants.

Connor Horman · Answer 8 · Mon Dec 07 2020 01:15:20 GMT+0800 (China Standard Time)

I did say that the guarantee is that decomposing into the raw parts and assembling it back again will always give exactly the original value back

That's what I thought, and originally mentioned in my comment on that ("this information has to be stored somewhere, and, at least for A = Global it must be reconstructible from these").

scottmcm · Answer 9 · Mon Dec 07 2020 09:45:29 GMT+0800 (China Standard Time)

Nor does it have any extra validity invariants, only extra safety invariants.

I largely agree with your post, but I'm not certain about this one. It already has a non-null validity invariant in the current implementation, which is tighter validity than the *const that from_raw_parts takes. I think it would be within its rights, for example, to have a validity invariant that the len and capacity cannot be usize::MAX when the type is size 2, for example.

To me, this statement in the documentation is about saying "no, it's not reference counted" and "no, it's not a SmallVec" and "yes, the elements always stay at the same address through a move" and similar things like that. It's a straightforward Box<[MaybeUninit<T>]> with a length to track which are initialized. But I don't think that it means it has to store literally three fields of those precise types with precisely those values.

Steve Klabnik · Answer 10 · Tue Dec 08 2020 23:53:47 GMT+0800 (China Standard Time)

Historically, we treat these kinds of comments in documentation as guarantees, and I've taken pains to get the libs team to sign off on them.

Elichai Turkel · Answer 11 · Wed Dec 09 2020 01:04:16 GMT+0800 (China Standard Time)

To me, this statement in the documentation is about saying "no, it's not reference counted" and "no, it's not a SmallVec" and "yes, the elements always stay at the same address through a move" and similar things like that. It's a straightforward Box<[MaybeUninit<T>]> with a length to track which are initialized. But I don't think that it means it has to store literally three fields of those precise types with precisely those values.

FWIW

Most fundamentally, Vec is and always will be a (pointer, capacity, length) triplet

Never says that length/capacity have to be usize and have to be valid for all values, so I do think it should be allowed in the future to constrain them for capacity < usize::MAX/size_of::<T>() and length < usize::MAX/size_of::<T>().

Connor Horman · Answer 12 · Thu Feb 18 2021 22:28:54 GMT+0800 (China Standard Time)

I'd like to bring this up again and ask a further question (one that also applies to Box):

If the implementation can treat Vec<T> (or any collection in general) as though it has Unique access to the inner allocation, can an implementation treat a borrow (Uniq, SharedReadWrite, or SharedReadOnly) of the container as the same kind of borrow of the inner? On rustc's impl, I remember seeing something along the lines of "if the Unique is accessed through a shared borrow, only shared access is allowed to the pointee" (or it may have been the contrapositive, that mutable access to the pointee was only allowed through mutable access to the Unique).
If not, can safe or unsafe functions called on a mutable reference to a container still invalidate any borrows of the inner allocation even if it does not necessarily access the allocation or the borrowed portion thereof (IE. Vec::set_len, Vec::resize for slices less than the new size, where no new allocation is necessary).

Ralf Jung · Answer 13 · Mon Feb 22 2021 00:48:22 GMT+0800 (China Standard Time)

On rustc's impl, I remember seeing something along the lines of "if the Unique is accessed through a shared borrow, only shared access is allowed to the pointee"

The type system has to ensure this restriction for soundness reasons. However, Stacked Borrows doesn't do anything like that. It work strictly per-pointer / per-allocation, and in particular "the pointer used to load X" has no effect on the 'content' of X even if X is itself a pointer.

can safe or unsafe functions called on a mutable reference to a container still invalidate any borrows of the inner allocation even if it does not necessarily access the allocation or the borrowed portion thereof (IE. Vec::set_len, Vec::resize for slices less than the new size, where no new allocation is necessary).

To make sure I understand the question -- that would rule out code like this, right? Currently, it is my interpretation of the documentation of Vec that at least some of that code is intended to be allowed. Vec documents that if the backing buffer is big enough, push will not reallocate -- that statement is only useful to clients if push does not invalidate borrows of the existing elements of the vector. So I spent some time to patch Vec until those existing borrows are maintained.

I think this has to necessarily imply "no" to both of your questions, or am I misunderstanding something?

Ralf Jung · Answer 14 · Mon Jul 03 2023 19:41:33 GMT+0800 (China Standard Time)

Closing in favor of #326.