CHERIoT-Platform / cheriot-rtos

The RTOS components for the CHERIoT research platform

Would be nice to support hazard-pointer-like claims on heap objects

nwf-msr opened this issue · comments

We have a notion of claims of heap objects, but they're fairly heavy, requiring a full compartment call to either acquire or drop. We'd like a lighter-weight option for a small number of objects per thread: hazard pointers understood by the heap.

Hazard pointers originate in the world of lock/wait-free data structures, where pointers have a four-stage lifecycle: unpublished, public, withdrawn, and freed. Hazard pointers mediate the transition from withdrawn to freed, deferring it until there are no more hazardous, local copies of once-public pointers. Because everyone involved is presumed cooperative, hazard lists need only be scanned once: a request to free memory must follow that pointer being withdrawn from the shared state, and so it cannot become newly hazardous.
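For reference, the cooperative protocol fits in a few lines. Here is a minimal host-side C model of it; `shared_ptr`, `hazard_read`, `retire`, and the fixed-size hazard array are all illustrative names, not CHERIoT APIs, and freeing is modelled by a counter:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_THREADS 4

static void *_Atomic shared_ptr;            /* the published pointer */
static void *_Atomic hazards[MAX_THREADS];  /* one hazard slot per thread */
static size_t freed_count;                  /* stands in for calling free() */

/* Reader: publish a hazard, then re-check that the pointer is still the
 * published one (otherwise it may already have been withdrawn and our
 * hazard may not have been seen). */
void *hazard_read(int tid)
{
	for (;;)
	{
		void *p = atomic_load(&shared_ptr);
		atomic_store(&hazards[tid], p);
		if (atomic_load(&shared_ptr) == p)
		{
			return p; /* safe: any subsequent retire sees our hazard */
		}
	}
}

void hazard_clear(int tid)
{
	atomic_store(&hazards[tid], NULL);
}

/* Writer: withdraw first, then free only if no hazard names p.  Because p
 * is no longer reachable from shared_ptr, no new hazard can appear, so a
 * single scan suffices. */
bool retire(void *p)
{
	atomic_store(&shared_ptr, NULL); /* withdraw from shared state */
	for (int i = 0; i < MAX_THREADS; i++)
	{
		if (atomic_load(&hazards[i]) == p)
		{
			return false; /* still hazardous: defer the free */
		}
	}
	freed_count++; /* no hazards: would free(p) here */
	return true;
}
```

The single-scan property in `retire` is exactly what breaks down once participants may be uncooperative, as described next.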

The story for potentially uncooperative threads is slightly different, as it may be impossible to ensure that a pointer is completely withdrawn from the shared state (and is not cached somewhere without being declared hazardous) when it comes time to free() it. This is, after all, why we have sweeping revocation! As such, our story needs to be a little different. In particular, we need a ratchet mechanism to advance an object through these states:

  1. allocated and hazard-able (and possibly hazardous)
  2. allocated but no longer hazard-able, with N hazards registered
  3. ...
  4. allocated but no longer hazard-able, with 1 hazard registered
  5. No longer hazardous, and so free (well, quarantined)

Additionally, hazards are intended to be ephemeral. As such, we do not expect them to survive cross-compartment calls. (And anyway, a callee cannot assume that its caller has declared any hazards appropriately, and so must do so itself; if we ensured that hazards persisted across calls, there would be redundant registrations.)

Here is a design sketch. This is going to need to be a fair bit of carefully crafted assembler but I believe it's entirely possible.

There is a Hazard Registry maintained on behalf of the allocator by a privileged library. A Hazard Registry Entry (or just "Hazard") is an optional pair of a capability to a (sub)object and a taint flag. The None branch of the option is indicated by a null cap. Hazard entries are, we imagine, stored in static sealed objects, with handle type SealedHeapHazard, each of which has room for some number of hazards. To avoid linked-list traversal, we imagine that these are collected in a dedicated section, forming a "linker set".

The privileged library exposes a single call to clients (and more to the allocator itself):
void *heap_hazard_register(SealedHeapHazards *hh, size_t ix, void *p), with interrupts deferred,

  • unseals hh,
  • releases the existing hazard in the ix-th entry in hh, if any, and
  • scans all existing hazards, looking for a tainted entry that precludes registering p (see below), and
  • if p is tagged, installs p into that entry and returns p.
    If p is untagged or is precluded by a tainted hazard, this call returns null. The registry entries are write-only from the clients' perspective: it is not possible to retrieve a hazard pointer.
     
    Releasing an untainted hazard is as simple as replacing it with null. Releasing a tainted hazard triggers a full, interrupt-enabled cross-compartment call to the allocator to advance the effort to free the associated object. Importantly, the hazard remains tainted while this call is in progress so that the taint is seen by other calls to heap_hazard_register. Specifically, this advance involves scanning the set of hazards, and either
  • finding no relevant hazards, in which case the object proceeds to quarantine, or
  • propagating taint: tainting some other, relevant hazard, upgrading the pointer in that entry to be the full object,
    before replacing the released hazard with null.

Upon discovering that there are no more claims to an object, the allocator will effectively register this object as a tainted hazard and then immediately release that hazard (that is, scan the hazard table looking for another entry to taint). One could imagine a dedicated SealedHeapHazard chunk of the registry specifically for free()'s use, but it might make more sense to special case this primordial tainting.
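To make the ratchet concrete, here is a rough host-side model of the rules above. Sealing, IRQ deferral, quarantine, and the cross-compartment call are elided; the caller passes the full-object base explicitly (the real allocator would derive it from p); and `hazard_register`, `hazard_free`, and the struct layout are all illustrative, not the proposed API:

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_HAZARDS 8

typedef struct
{
	void *ptr;    /* NULL means no hazard in this entry */
	bool  tainted;
	void *object; /* full-object base for ptr (ptr may be a sub-object) */
} Hazard;

static Hazard registry[MAX_HAZARDS];
static void  *quarantine[MAX_HAZARDS];
static size_t quarantine_count;

/* Advance the effort to free: taint some other relevant hazard (upgrading
 * its pointer to the full object), or, if none remains, quarantine. */
static void advance_free(void *object)
{
	for (size_t i = 0; i < MAX_HAZARDS; i++)
	{
		if (registry[i].ptr != NULL && registry[i].object == object &&
		    !registry[i].tainted)
		{
			registry[i].tainted = true;
			registry[i].ptr     = object; /* upgrade to the full object */
			return;
		}
	}
	quarantine[quarantine_count++] = object; /* no hazards left */
}

static void release_hazard(Hazard *h)
{
	if (h->ptr == NULL)
	{
		return;
	}
	if (h->tainted)
	{
		/* The entry stays tainted during this scan, so concurrent
		 * registrations still see the taint. */
		advance_free(h->object);
	}
	h->ptr     = NULL;
	h->tainted = false;
	h->object  = NULL;
}

/* Model of heap_hazard_register: release the old hazard in slot ix, then
 * refuse p if a tainted hazard covers the same object. */
void *hazard_register(size_t ix, void *p, void *object)
{
	release_hazard(&registry[ix]);
	if (p == NULL) /* stands in for "p is untagged" */
	{
		return NULL;
	}
	for (size_t i = 0; i < MAX_HAZARDS; i++)
	{
		if (registry[i].ptr != NULL && registry[i].tainted &&
		    registry[i].object == object)
		{
			return NULL; /* free() already started on this object */
		}
	}
	registry[ix].ptr     = p;
	registry[ix].tainted = false;
	registry[ix].object  = object;
	return p;
}

/* Model of free()'s primordial taint: register-as-tainted then release
 * collapses to a single taint-or-quarantine step. */
void hazard_free(void *object)
{
	advance_free(object);
}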

Tainted hazards always carry pointers to full heap objects, while untainted ones may carry sub-object pointers; this simplifies the scan above without unduly complicating the initial sweep for hazards.

Commentary welcome.

My biggest concern (which I think I wrote in the design doc for the claims mechanism) is that this makes freeing an O(n) operation in the number of hazard slots, which is likely to be some small number times the number of compartments.

I wonder if there's something clever (or, at least, 'clever') that we can do with a bloom filter to let the allocator know if a given hazard set is likely to contain a given pointer (and, ideally, have a fast path for none have it, though clearing the bloom filter entries is tricky if it's global because recalculating the minimum set requires storing everything).

I am not sure I understand the synchronisation in this model. Currently, the allocator does not run with interrupts disabled at any point (except when reacquiring a lock, which will probably change at some point). As such, I'm not sure how you avoid a race between the allocator seeing that there are no hazards and freeing. I believe this could be addressed by having each hazard table hold a capability to a ptraddr_t owned by the allocator and have the allocator insert the address of an object to be freed into it, as soon as it has found the base address but before it checks hazard pointers.

I'm also not sure how this would scale to multicore systems.

> My biggest concern (which I think I wrote in the design doc for the claims mechanism) is that this makes freeing an O(n) operation in the number of hazard slots, which is likely to be some small number times the number of compartments.

I debated using per-thread state rather than the SealedHeapHazards thing... it might require that our privileged library run with ASR for a little bit to get the current thread structure and the array of hazards therein, and it would need to traverse a collection of threads. If you prefer O(threads) to O(compartments), that could be a way to go.

> I wonder if there's something clever (or, at least, 'clever') that we can do with a bloom filter to let the allocator know if a given hazard set is likely to contain a given pointer (and, ideally, have a fast path for none have it, though clearing the bloom filter entries is tricky if it's global because recalculating the minimum set requires storing everything).

There are "counting bloom filters" that support fast removal of elements so long as none of the internal counters saturate (IIRC). I wonder how well we could do with just a single cheap hash function.

If we were OK with making hazard registration require finding the object header, like claims do, we can use a bit to indicate that there might be a hazard on an object.

> I am not sure I understand the synchronisation in this model. Currently, the allocator does not run with interrupts disabled at any point (except when reacquiring a lock, which will probably change at some point). As such, I'm not sure how you avoid a race between the allocator seeing that there are no hazards and freeing. I believe this could be addressed by having each hazard table hold a capability to a ptraddr_t owned by the allocator and have the allocator insert the address of an object to be freed into it, as soon as it has found the base address but before it checks hazard pointers.

That's the purpose of the "initial taint" in free() and why registering a hazard looks for taint: as soon as that initial taint is registered, it is no longer possible to create a new hazard for a given object.

> I'm also not sure how this would scale to multicore systems.

Indeed, that's tricky, since this builds on IRQ deferral to achieve atomicity.

Some thoughts from waking up in the middle of the night:

The main purpose of the hazard mechanism is for very short-lived claims. If you’re doing a longer-term claim, the cost of calling heap_claim is amortised. As such, we don’t need per-thread per-compartment hazards. We can, in fact, get away with two hazard slots per thread, because that lets us support memcpy between untrusted objects. We could manage with one if we did some bounce buffering but two seems like the minimum for convenience.

We can assume hazard slots are clobbered on every cross-compartment call. You can always reestablish hazards on return.

I think we can implement this fairly easily:

  1. Add a hazard region that is provided to the allocator as an MMIO capability.
  2. In the loader, place a store-only capability (yay, our first use case for these!) to this into each register save area.
  3. Provide a switcher API to load this capability.

This leaves a potential race in the scanning. I suggest that we address this by having the allocator expose a free epoch counter. This is incremented before each scan of the hazard list and after each free. The flow for acquiring a hazard is therefore:

  1. Read the epoch counter.
  2. Insert the capability into your hazard slot.
  3. If the epoch counter is odd, wait until it is even (futex).
  4. Test the tag on the capability.

If the epoch word uses a 16-bit counter, then we can use a priority-inheriting wait, so whichever thread is in the allocator gets a short boost from threads trying to use the hazard mechanism.

On the fast path, this doesn’t require any cross-compartment calls (assuming that we’ve already got the epoch counter; I really should extend MMIO capabilities to allow read-only views so that we can make it an MMIO capability) and we can put an acquire-hazards function in a shared library.
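A single-threaded model of that acquire/free handshake, with the futex wait replaced by an immediate retry, a single hazard slot, and the capability tag modelled as a flag (all names illustrative):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

static atomic_uint free_epoch;  /* odd while the allocator scans/frees */
static void       *hazard_slot; /* one hazard slot, for illustration */
static bool        object_freed; /* stands in for the tag being revoked */

/* Allocator side: bump the epoch to odd before scanning the hazard list,
 * back to even after the free completes. */
bool allocator_try_free(void *object)
{
	atomic_fetch_add(&free_epoch, 1); /* scan begins: epoch now odd */
	bool hazarded = (hazard_slot == object);
	if (!hazarded)
	{
		object_freed = true; /* no hazard: free (tag would be revoked) */
	}
	atomic_fetch_add(&free_epoch, 1); /* done: epoch even again */
	return !hazarded;
}

/* Client side: steps 1-4 of the acquire flow above.  In the real design,
 * step 3 futex-waits on the epoch word; here we just fail so the caller
 * can retry. */
bool hazard_acquire(void *object)
{
	hazard_slot = object;                      /* 2. publish the hazard */
	if ((atomic_load(&free_epoch) & 1) != 0)   /* 1+3. odd: mid-scan */
	{
		hazard_slot = NULL;
		return false; /* would futex-wait for an even epoch, then retry */
	}
	if (object_freed) /* 4. test the tag: free won the race */
	{
		hazard_slot = NULL;
		return false;
	}
	return true;
}

void hazard_release(void)
{
	hazard_slot = NULL;
}
```

If the free raced ahead of the hazard publication, step 4's tag test fails and the caller never dereferences the dead pointer; if the hazard won, the allocator's scan sees it and defers the free.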

The extra costs in free are:

  • Two extra pointer checks (load, ctestsubset) per thread.
  • Two counter updates.
  • Storing hazard states until they are gone and freeing the hazard list later (maybe defer this until an allocation fails?)

There are some potential availability risks here. A thread can keep an object live for an unbounded amount of time with the hazard mechanism. If we zero the hazard pointers on compartment call, then any yielding operation will implicitly zero, so that’s probably fine.

Thoughts?

I'm happy with 2 hazards per thread, presumed clobbered on cross-compartment calls, and for them to be stored in a contiguous array set up by the loader and shared as you describe. It's a little unfortunate to need to go to the switcher for the per-thread cap, but that could plausibly be a privileged library call rather than a full compartment call.

I don't understand how your proposal distinguishes between removing a hazard declaration for an object that is still live and removing one for an object that was freed while the hazard was in place.

Matt pointed me at the "pass the pointer" scheme in https://dl.acm.org/doi/10.1145/3437801.3441596 which we might want to adapt. It relies on atomics, which we can emulate readily enough, and on having a copy of the hazard pointers ("handover") that's used to walk objects closer to free. However, as with almost all discussions of hazard pointers, it assumes cooperation, in the sense that one only frees something that has been retired from global visibility, something on which we cannot depend. Declaring a hazard will inherently be more complex as a result: we need to do something to ensure progress, which probably means preventing the same object(s) from being juggled between hazard slots forever, and that in turn probably means either linear time (to examine object headers or other hazard slots) or linear space (an expanded revocation bitmap).

There is a small, bounded number of objects whose lifetime is prolonged by the hazard mechanism (2 * number of threads). As such, my plan (this is the part I haven’t implemented yet) is to have the allocator maintain a list of objects that it has not freed because they have hazards. When it sees an object on the hazard list that someone is trying to free, it copies the pointer to the other list. On every malloc or free, if this list is not empty, we try to recheck any queued objects and free them. This is quadratic in the worst case, but with a very small n.
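That park-and-recheck loop might look roughly like this (hazard lookup reduced to a two-slot table, quarantine modelled as a counter; all names are illustrative):

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_PENDING 16 /* bounded by 2 * number-of-threads in the design */

static void  *hazards[2]; /* stand-in for the per-thread hazard slots */
static void  *pending[MAX_PENDING];
static size_t pending_count;
static size_t freed_count; /* stands in for "moved to quarantine" */

static bool object_has_hazard(void *object)
{
	return hazards[0] == object || hazards[1] == object;
}

/* Called from malloc/free when the pending list is non-empty: retry every
 * parked object, freeing those whose hazards have gone away. */
void drain_pending(void)
{
	size_t kept = 0;
	for (size_t i = 0; i < pending_count; i++)
	{
		if (object_has_hazard(pending[i]))
		{
			pending[kept++] = pending[i]; /* still hazarded: keep waiting */
		}
		else
		{
			freed_count++; /* hazard gone: quarantine it */
		}
	}
	pending_count = kept;
}

void heap_free(void *object)
{
	if (object_has_hazard(object))
	{
		pending[pending_count++] = object; /* a hazard holds it: defer */
	}
	else
	{
		freed_count++; /* no hazard: free immediately */
	}
	drain_pending(); /* opportunistically retry earlier deferrals */
}
```

Each drain is linear in the pending list, and the list is rescanned per malloc/free, giving the quadratic worst case noted above, with n capped at twice the thread count.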