rust-lang / miri

An interpreter for Rust's mid-level intermediate representation

Adding FFI support to Miri, by emarteca

RalfJung opened this issue · comments

This issue is specifically tracking the effort by @emarteca and @maurer to add FFI support to Miri. There's a design document here, but it does not go into a ton of the actual implementation details.

Echoing what I said on Zulip:

If this means changes to the core interpreter engine, I am concerned. That engine is carefully structured to prioritize correctness and readability over basically everything else (with some concession to performance, as Miri would probably be >10x slower otherwise).
If this is "just" loads of code in a new subdir in Miri that we can easily disable if it starts causing problems, then that's fine.

@maurer replied

the main changes would be to add another value type (for foreign pointers), and another memory type (a wrapper for Rust objects with a parallel native representation).

add another value type (for foreign pointers)

A new value type would be a Big Deal and is basically not an option. But I also don't think you will need it. We already have different types of memory (data and functions, and we will hopefully have vtables soon). But they are all identified by an AllocId so they can all use the same value type for their pointers. Shouldn't that also work for foreign pointers?
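To make the suggestion concrete, here is a minimal sketch of one pointer type serving several kinds of allocation through a shared ID. Every name in it (`AllocKind`, `AllocTable`, the `Foreign` variant, `describe`) is hypothetical and chosen for illustration; it is not the interpreter's actual data model.

```rust
use std::collections::HashMap;

/// Opaque allocation identifier; every pointer refers to one of these.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct AllocId(u64);

/// What an `AllocId` can stand for. `Foreign` is the hypothetical addition.
#[derive(Debug, PartialEq)]
enum AllocKind {
    /// Ordinary data with bytes the program can read.
    Memory(Vec<u8>),
    /// A function: it has an identity but no readable bytes.
    Function(&'static str),
    /// Hypothetical: memory owned by native code, known only by address.
    Foreign { base: usize },
}

/// A single pointer type, regardless of what it points to.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Pointer {
    alloc_id: AllocId,
    offset: usize,
}

#[derive(Default)]
struct AllocTable {
    next: u64,
    allocs: HashMap<AllocId, AllocKind>,
}

impl AllocTable {
    /// Registers an allocation of any kind and returns a pointer to its start.
    fn insert(&mut self, kind: AllocKind) -> Pointer {
        let id = AllocId(self.next);
        self.next += 1;
        self.allocs.insert(id, kind);
        Pointer { alloc_id: id, offset: 0 }
    }

    /// The kind is discovered by looking up the ID, not from the pointer type.
    fn describe(&self, ptr: Pointer) -> String {
        let target = match self.allocs.get(&ptr.alloc_id) {
            Some(AllocKind::Memory(bytes)) => format!("data ({} bytes)", bytes.len()),
            Some(AllocKind::Function(name)) => format!("function {}", name),
            Some(AllocKind::Foreign { base }) => format!("foreign memory at {:#x}", base),
            None => "dangling".to_string(),
        };
        format!("{} + {}", target, ptr.offset)
    }
}
```

The point of the sketch is only that adding a new kind of allocation extends the table, while `Pointer` stays unchanged.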

The new memory type seems indeed pretty fundamental. I would be interested to see your design there before you spend a lot of effort implementing it. The key challenges I see are:

  • Doing this with minimal changes to the part of the interpreter that is inside rustc, i.e., as much as possible should be abstracted via the Machine trait so that it can be done locally in Miri without touching the core engine.
  • Relatedly, how do these foreign allocations support get_ptr_alloc? Functions and vtables do not contain any data that the Rust program can actually read, so this is easy for them. But your allocations will obviously actually have data in them, so it should be possible to have an AllocRef to them. And that seems hard to do without invasive changes?

The new memory type seems indeed pretty fundamental. I would be interested to see your design there before you spend a lot of effort implementing it.

I was thinking we'd extend the MiriMemoryKind enum with a new kind (something like CInternal). This would be the kind of memory used to store pointers to C. Then, when memory of this kind is read, we can dispatch the read to the corresponding C memory.
Essentially, I was thinking this kind would be how we'd distinguish the memory as corresponding to C memory. If this is too invasive of a change, we can likely do this a different way (such as through the AllocExtra).

Doing this with minimal changes to the part of the interpreter that is inside rustc, i.e., as much as possible should be abstracted via the Machine trait so that it can be done locally in Miri without touching the core engine.

Agreed. So far I haven't touched the rustc internals at all.

Relatedly, how do these foreign allocations support get_ptr_alloc?

Good point -- I'm not sure. I've messed around with this a bit and ran into some problems trying to get it to work. I think it'll depend a lot on how we end up representing the pointers internally (which is currently being discussed on Zulip).

Some suggestions/discussion on pointer representation from Zulip:

@oli-obk says:

I wanted to give direct access to the underlying miri allocation slice to the C code.
Wrt size and alignment, do we ever need to know these for allocations from FFI code? We can just use the most permissive alignment and the size can be to the end of the address space.

...
@maurer

I am still concerned about this "use the same allocator to create pointers" strategy though, since it will mean that Rust Allocations and the C FFI allocation will need to interleave, which makes the "just have a really big allocation encompassing all of virtual memory" strategy not function
(since Rust objects would end up inside it)

@oli-obk

Hmm, good point. We could still maintain a map instead of just a range, but that's expensive
But maybe it's the least expensive strategy?
Everything else has overhead, too, after all. This strategy is simple and cheap until you need to deref a pointer from miri. Then you need to search the map to check if it's a miri alloc and otherwise do the FFI deref
With strict provenance, this is rarely an issue, as Rust code will almost always have the provenance for cheap lookup available
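A minimal sketch of the map-based lookup described here, assuming allocations are tracked by base address in an ordered map. `AddressMap`, `Owner`, `register`, and `owner_of` are hypothetical names for illustration only.

```rust
use std::collections::BTreeMap;

#[derive(Debug, PartialEq)]
enum Owner {
    /// The address lies inside the interpreter allocation starting at `base`.
    Miri { base: usize },
    /// No interpreter allocation covers the address; treat it as native memory.
    Foreign,
}

/// Maps the base address of each interpreter allocation to its size in bytes.
#[derive(Default)]
struct AddressMap {
    by_base: BTreeMap<usize, usize>,
}

impl AddressMap {
    fn register(&mut self, base: usize, size: usize) {
        self.by_base.insert(base, size);
    }

    /// Finds the last allocation starting at or before `addr` and checks
    /// whether `addr` falls inside it. Each lookup costs O(log n).
    fn owner_of(&self, addr: usize) -> Owner {
        match self.by_base.range(..=addr).next_back() {
            Some((&base, &size)) if addr - base < size => Owner::Miri { base },
            _ => Owner::Foreign,
        }
    }
}
```

The search only runs when a pointer arrives without usable provenance; a pointer that already carries its allocation ID can skip it.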

@maurer

Another possible strategy would be to use mmap on Linux/OSX and VirtualAlloc on Windows and friends to grab a large region of virtual address space for the Miri allocator to select from. As long as we never actually touch it, that address space won't cost anything, but the mapping will prevent the native heap / loaded libraries from intersecting.
e.g. we reserve 4GB at some random location, and if the pointer is in that span, you look it up as a Miri object, and if not, we assume it's a native pointer and use the special AllocId
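A minimal sketch of the range check this strategy enables, assuming the region has already been reserved. The reservation itself (mmap or VirtualAlloc) is deliberately left out, and `ReservedRange`, `PtrClass`, and `classify` are hypothetical names.

```rust
/// The span of virtual address space set aside for the interpreter's allocator.
#[derive(Clone, Copy, Debug)]
struct ReservedRange {
    base: usize,
    len: usize,
}

#[derive(Debug, PartialEq)]
enum PtrClass {
    /// Inside the reserved span: look it up as an interpreter object.
    Miri { offset: usize },
    /// Outside the span: assume it is a native pointer.
    Native,
}

impl ReservedRange {
    /// Constant-time classification: one subtraction and one comparison.
    fn classify(&self, addr: usize) -> PtrClass {
        match addr.checked_sub(self.base) {
            Some(offset) if offset < self.len => PtrClass::Miri { offset },
            _ => PtrClass::Native,
        }
    }
}
```

Compared with searching a per-allocation map, the check does not depend on how many allocations exist, at the price of fixing the size of the region up front.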

Looks like the design document moved to https://hackmd.io/eFY7Jyl6QGeGKQlJvBp6pw. (I haven't had the time to look at it yet.) Thanks for putting it on hackmd!

I was thinking we'd extend the MiriMemoryKind enum with a new kind (something like CInternal).

How will that be enough? Memory kinds only serve to make sure the allocator and deallocator match up. They still all have the same representation in the interpreter, the same Allocation type. You will need a completely different representation, won't you? No tracking of pointers and uninit, for example.
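A simplified sketch of the role a memory kind plays, assuming a two-variant kind and a toy `Allocation`; the names and the error text are illustrative, not Miri's actual definitions.

```rust
/// Which allocator produced the memory (simplified, hypothetical variants).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum MemoryKind {
    Rust,
    C,
}

/// Every kind shares the same representation; the kind is only a tag.
struct Allocation {
    bytes: Vec<u8>,
    kind: MemoryKind,
}

/// The only place the kind matters: the deallocator must match the allocator.
/// Returns the number of bytes freed on success.
fn deallocate(alloc: Allocation, expected: MemoryKind) -> Result<usize, String> {
    if alloc.kind != expected {
        return Err(format!(
            "deallocating {:?} memory with the {:?} deallocator",
            alloc.kind, expected
        ));
    }
    Ok(alloc.bytes.len())
}
```

Nothing in the sketch changes how the bytes are stored or read, which is why a new kind alone would not give foreign memory a different representation.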

I think it'll depend a lot on how we end up representing the pointers internally

What even is the design space here? Pointers are an absolute address plus an Option<Provenance>, I think that is fairly fixed and I would have severe reservations about changing it.

How will that be enough? Memory kinds only serve to make sure the allocator and deallocator match up. They still all have the same representation in the interpreter, the same Allocation type. You will need a completely different representation, won't you? No tracking of pointers and uninit, for example.
What even is the design space here? Pointers are an absolute address plus an Option<Provenance>, I think that is fairly fixed and I would have severe reservations about changing it.

Sorry, I should've updated this issue WRT the discussion on Zulip. The design doc on HackMD is up-to-date though -- here is the relevant section.
We're not planning to add any new value types or memory types anymore.

The main high-level idea so far (from discussion with @RalfJung, @maurer, @oli-obk, and a few others on Zulip, and taken from the design doc):

Distinguishing Miri memory from external/foreign memory

In order to distinguish between foreign memory and Miri memory (i.e., for a given location in memory, determine whether it corresponds to C memory or Miri memory), we propose to reserve a large section of virtual memory for the Miri allocator. This ensures there is no overlap between the memory used by the Miri allocator and the memory used by external calls. Using this, we can tell if the memory is Miri-internal by looking at its AllocId.

If the return of a C call is a Miri pointer (i.e., if it is contained in the memory range reserved by the Miri allocator), then we will need to handle this case: likely we'll need to create the corresponding Allocation object (if it doesn't already exist), and we'll have to track that this Miri memory is shared with C (as discussed in the design doc).
If the return of a C call is a C pointer (i.e., if it is in the foreign range of memory), then maybe we can represent it with a placeholder AllocId that indicates that it is a foreign pointer.

Any advice/ideas on this proposed plan are welcome :)
In particular:

  • How should we go about reserving a section of memory for the Miri allocator?
  • Is it possible to use the same AllocId for all foreign pointers, and what are the pros/cons of this idea?
  • Do the machine hooks we propose to use (memory_read, ptr_get_alloc) seem reasonable (any other hooks we should be using instead / in addition)?
  • Are there any situations in which we will actually need to maintain two versions of some shared memory? (i.e., a Miri version and a C version)
    • And, following this, if so, we need to design a plan for memory synchronization. Advice appreciated!

If the return of a C call is a Miri pointer (i.e., if it is contained in the memory range reserved by the Miri allocator)

I don't understand. When you get data back from C it is an untyped raw bag of bytes. How do you even know whether something in there is a pointer?

And when you pass a pointer from Miri to C, how are changes that C is making reflected back on the Miri side? The document says to just pass through the pointer, but that makes no sense to me -- Miri memory carries extra metadata, like which bytes are initialized and pointer provenance. You cannot just give C a pointer to the bytes part of a Miri Allocation and ignore those other parts of it.

If the return of a C call is a C pointer (i.e., if it is in the foreign range of memory), then maybe we can represent it with a placeholder AllocId that indicates that it is a foreign pointer.

What happens when a Miri pointer is written there, what do we do with the provenance metadata? What happens when Miri wants to write uninitialized bytes there?

Fundamentally, a 'byte' of memory that we can pass to/from C is just a u8. But a byte of memory in Miri is much closer to the AbstractByte type defined here. (A byte in C is very similar to what it is in Miri, but all those extra details are lost when C is compiled to assembly -- similar to what happens in Rust.) That mismatch in information content has to be accounted for somehow, and I don't see your plan taking this into account.
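A minimal sketch of that mismatch, assuming a simplified `AbstractByte` with just an uninitialized state and an optional provenance per byte. The `to_native` and `from_native` helpers and the filler value are hypothetical and stand in for whatever conversion an FFI boundary would need.

```rust
/// Stand-in for pointer provenance; the real notion is richer.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Provenance(u64);

/// One byte of interpreter memory (simplified).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum AbstractByte {
    /// No value has been written yet.
    Uninit,
    /// A concrete value, possibly a fragment of a pointer with provenance.
    Init(u8, Option<Provenance>),
}

/// Lossy export: uninitialized bytes become `filler`, provenance is dropped.
fn to_native(bytes: &[AbstractByte], filler: u8) -> Vec<u8> {
    bytes
        .iter()
        .map(|b| match b {
            AbstractByte::Uninit => filler,
            AbstractByte::Init(value, _) => *value,
        })
        .collect()
}

/// Import: all the interpreter can tell is that each byte has some value.
fn from_native(bytes: &[u8]) -> Vec<AbstractByte> {
    bytes.iter().map(|&b| AbstractByte::Init(b, None)).collect()
}
```

A round trip through `to_native` and `from_native` turns `Uninit` into an initialized byte and erases every `Some(Provenance)`, which is exactly the information a plan has to account for.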

When you get data back from C it is an untyped raw bag of bytes. How do you even know whether something in there is a pointer?

We'll know if a pointer to C memory is directly returned, since the function returning it will have the return type specified as a pointer. You're right though, that we won't know what's in that C memory: the main problem (as I understand it) is that we won't know if that pointer is to memory that contains another pointer to somewhere else in C memory.
Copy-pasting what I said on Zulip, I wonder if we can use the fact that we'll know if a given pointer is from C: and then if it is from C, we can dispatch some int2ptr cast explicitly. I think this would only be needed in chained pointer accesses where the root is a C pointer, and the intermediate pointers are not stored in Miri (so Miri won't know their type), right?

So a case like:

let c_ptr: *const *const i32 = some_c_function();
// Miri knows c_ptr is a pointer

unsafe { **c_ptr }; // uh oh, Miri doesn't know *c_ptr is a pointer
                    // so it'll currently auto-cast *c_ptr to a pointer with int2ptr

What about provenance?

We've updated the doc with a proposed plan on how to deal with provenance.
We have two potential ideas, of which we prefer this one -- at a high level, the idea is to have one provenance value (similar to the Wildcard provenance) specifically for pointers to C memory, and for any Miri pointers that are exposed to C. This would only result in a provenance precision loss for data that is exposed to C.

Thanks for the feedback and for looking at this with us! :)

the main problem (as I understand it) is that we won't know if that pointer is to memory that contains another pointer to somewhere else in C memory.

We don't know which part of that memory even is a pointer. There could be pointers to other C memory there, or pointers to Miri memory, and we will have no clue. We cannot use the types to guide us because types are allowed to be wrong in C.

I plan to remove the ptr_from_addr_transmute hook soon, that can help simplify the interpreter a lot. So you shouldn't rely on it for your plans.

We've updated the doc with a proposed plan on how to deal with provenance.

I can't quite figure out what you are even proposing here, sorry. It seems to zoom in on a tiny aspect of the problem when you haven't sketched out the larger architecture of the approach yet.

  • When a Miri pointer is passed to C, how can C make any sense of that memory?
void print(int *x) { printf("%d\n", *x); }
void print2(int **x) { printf("%d\n", **x); }
  • What happens when C writes to that memory?
void set(int *x, int val) { *x = val; }
void set2(int **x, int val) { **x = val; }
  • What about C writing a pointer to that memory?
void setptr(int **x, int *val) { *x = val; }
void setptr2(int ***x, int *val) { **x = val; }

None of these even involve C creating any new allocations. Let's first solve the "simpler" cases. :)

Please start with the first case and describe in detail what exactly happens: given the print argument as a Miri Pointer, what exactly do we do? How does the Miri machine prepare its state to be able to pass a pointer to C that can make all these operations meaningful? Does Miri do any kind of "post-processing" after C returns?

I don't think you will need a "sync list", and Miri already contains what is basically an implementation of PNVI-ae-udi. I don't quite understand which problem that part of the proposal solves, but it seems too complicated on the one hand while also not talking about the real issue I am seeing on the other hand.^^ Mainly, what exactly does the pointer given to C even point to, and how does Miri prepare that memory so that when C does reads and writes, it all makes sense?

similar to the Wildcard provenance

I think wildcard provenance is indeed the key -- it's not just similar to wildcard provenance, it's the same thing! (So indeed, having both FFI support for pointers and strict provenance will just not work.)
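A rough sketch of how wildcard provenance could behave, assuming a toy machine with a flat list of allocations and an "exposed" set. `Machine`, `Alloc`, `expose`, and `resolve` are hypothetical names; this is a simplification for illustration and not Miri's implementation.

```rust
use std::collections::BTreeSet;

#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct AllocId(u64);

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Provenance {
    /// The pointer is known to belong to one specific allocation.
    Concrete(AllocId),
    /// The pointer may refer to any allocation that has been exposed.
    Wildcard,
}

struct Alloc {
    id: AllocId,
    base: usize,
    size: usize,
}

#[derive(Default)]
struct Machine {
    allocs: Vec<Alloc>,
    exposed: BTreeSet<AllocId>,
}

impl Machine {
    /// Called when a pointer escapes the interpreter's tracking, for example
    /// (hypothetically) when it is handed to native code.
    fn expose(&mut self, id: AllocId) {
        self.exposed.insert(id);
    }

    /// Decides which allocation an access at `addr` may use, if any.
    fn resolve(&self, addr: usize, prov: Provenance) -> Option<AllocId> {
        let hit = self
            .allocs
            .iter()
            .find(|a| addr >= a.base && addr - a.base < a.size)?;
        match prov {
            Provenance::Concrete(id) if id == hit.id => Some(id),
            Provenance::Wildcard if self.exposed.contains(&hit.id) => Some(hit.id),
            _ => None,
        }
    }
}
```

In this model a wildcard pointer only reaches allocations that were exposed earlier, so the precision loss stays limited to memory that actually crossed the boundary.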

This project isn't moving forward any more, so I will close this issue and open a new one instead to track just having more FFI support: #3491.

I should also say thanks a lot to @emarteca and @maurer for the initial work here, this was great to get us started. :) Once the initial support via libffi is in place, it's much easier for me to extend that support -- now it's mostly "just" regular Miri programming, as most of the dark FFI magic is done. I hope. ;)