rust-vmm / vm-memory

Virtual machine's guest memory crate


Performance degradation after fixing #95

andreeaflorescu opened this issue · comments

When implementing the fix for #95, we introduced a performance degradation on (some?) glibc builds. On Firecracker, running iperf3, we observed a performance degradation of 5% for glibc builds.
On Cloud Hypervisor, the performance degradation is significantly worse. Observed impact is up to 50%. More details here: cloud-hypervisor/cloud-hypervisor#1258

Opening this issue so that we can decide on the next steps for fixing the performance degradation.

My 2-cents: I wouldn't want to introduce this fix only for x86_64 musl builds & aarch64 glibc & musl builds, because we cannot know for sure what glibc versions people are using out there, and hence we cannot know for sure that glibc is doing the right thing (i.e. optimizing memcpy at a higher granularity than 1 byte).

I would say that the underlying problem, which is the reason for both the performance degradation and the bug, is that we lose type information about the object that needs to be written/read. I would rather have us work on improving the interface so that type information doesn't need to be sort-of inferred (by checking alignments & reading/writing in the largest possible chunks). I would need to do some experiments before proposing a solution here.

Another thing that I believe is of paramount importance at this moment is to add performance testing to vm-memory. Pretty much every other function in vm-memory is on the critical performance path. We should make sure to not introduce regression here as we continue the development.

CC: @sboeuf @rbradford @sameo @alexandruag @serban300 @bonzini

we lose type information about the object that needs to be written/read

Why is that causing performance degradation? The underlying object is likely &[u8], but because the memcpy-like code does not care about the underlying type, it is still able to copy in 64-bit chunks. The problem is simply that the glibc memcpy is more sophisticated.
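To make the point concrete, here is a toy sketch of copying a byte slice in pointer-sized chunks regardless of the logical type of the data (illustrative only; `copy_in_chunks` is a made-up name, not vm-memory code):

```rust
use std::convert::TryInto;

/// Copy `src` into `dst` in pointer-sized chunks, falling back to single
/// bytes for the tail -- the kind of thing a memcpy-like routine does
/// without needing any type information about the object being copied.
fn copy_in_chunks(dst: &mut [u8], src: &[u8]) {
    assert_eq!(dst.len(), src.len());
    const CHUNK: usize = std::mem::size_of::<usize>();
    let mut i = 0;
    while i + CHUNK <= src.len() {
        // One word-sized load and store per chunk.
        let word = usize::from_ne_bytes(src[i..i + CHUNK].try_into().unwrap());
        dst[i..i + CHUNK].copy_from_slice(&word.to_ne_bytes());
        i += CHUNK;
    }
    // Remaining tail bytes, copied one at a time.
    while i < src.len() {
        dst[i] = src[i];
        i += 1;
    }
}
```

A real memcpy additionally uses vector registers, alignment fix-ups, and size-specialized paths, which is why glibc's version wins.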

Pretty much every other function in vm-memory is on the critical performance path. We should make sure to not introduce regression here as we continue the development.

Agreed. Though a better implementation of virtio-net could avoid the copies by doing recv and send directly into memory.

@bonzini We're looking at using .read_from() and .write_to() but that still involve at least one "vm-memory copy" (see cloud-hypervisor/cloud-hypervisor#1265)

@bonzini We're looking at using .read_from() and .write_to() but that still involve at least one "vm-memory copy" (see cloud-hypervisor/cloud-hypervisor#1265)

This doesn't work due to TAP API limitations.

I looked at this issue some more and I'd like to suggest that it comes down to a limitation of the current vm-memory API.

The read_obj() / write_obj() functions (or their slice variants) are the only "safe" (in terms of bounds checking) way of writing to or reading from the guest memory. In the case of updating the queue descriptors in virtio, you do need a fully atomic write. In the case of filling the allocated buffer with data from the network device, we don't need any atomicity. In fact, as per the spec, the guest will not try to access it until after the queue has been updated and the memory fenced.

So we either need write_obj_atomic() vs write_obj(), or write_obj() vs write_obj_volatile(). Unless I'm missing something, there is now no API that lets you copy into the guest RAM without making any memory guarantees (which is the way it worked before).
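A hypothetical shape for that split might look like the following (the method names come from this comment; `Region` and both implementations are made up for illustration and are not the vm-memory API):

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Toy stand-in for a guest memory region.
struct Region {
    mem: Vec<u8>,
}

impl Region {
    /// Bulk copy with no atomicity guarantee -- fine for filling buffers
    /// the guest will only look at after the queue update is fenced.
    fn write_obj(&mut self, addr: usize, data: &[u8]) {
        self.mem[addr..addr + data.len()].copy_from_slice(data);
    }

    /// Atomic store of a u32 at a naturally aligned guest address, e.g. a
    /// virtio queue field the guest may read concurrently.
    fn write_obj_atomic_u32(&mut self, addr: usize, val: u32) {
        assert_eq!(addr % 4, 0, "guest address must be naturally aligned");
        let ptr = self.mem[addr..].as_mut_ptr();
        if ptr.align_offset(4) == 0 {
            // Host mapping is aligned: one atomic store.
            let atom = unsafe { &*(ptr as *const AtomicU32) };
            atom.store(val, Ordering::Release);
        } else {
            // Fallback for a misaligned host mapping (shouldn't happen
            // for page-aligned guest memory mappings).
            self.mem[addr..addr + 4].copy_from_slice(&val.to_ne_bytes());
        }
    }
}
```

The design point is only that the caller, not the implementation, knows which guarantee each access needs.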

But the atomicity is not the reason for the performance decrease; the current code is faster than a naive byte-by-byte copy. The reason performance got worse is that the system memcpy does not guarantee atomicity; even if it did, the new code would likely still be slower.

@bonzini My argument is that we don't need atomicity all the time, only for specific operations, which is why there need to be atomic and non-atomic versions of the API.

I looked at this issue some more and I'd like to suggest that it comes down to a limitation of the current vm-memory API.

The read_obj() / write_obj() functions (or their slice variants) are the only "safe" (in terms of bounds checking) way of writing to or reading from the guest memory. In the case of updating the queue descriptors in virtio, you do need a fully atomic write. In the case of filling the allocated buffer with data from the network device, we don't need any atomicity. In fact, as per the spec, the guest will not try to access it until after the queue has been updated and the memory fenced.

So we either need write_obj_atomic() vs write_obj(), or write_obj() vs write_obj_volatile(). Unless I'm missing something, there is now no API that lets you copy into the guest RAM without making any memory guarantees (which is the way it worked before).

I agree on this direction.
We may add read/write_{u8,u16,u32,u64} for naturally aligned addresses. For those naturally aligned accesses, we can assume they won't cross a region boundary, and thus get rid of the heavy try_access().

So we have two classes of interfaces:

  1. read/write_{u8,u16,u32,u64} for naturally aligned accesses, with a simple fast path, ensuring atomic access.
  2. stream-oriented access via read/write_xxx(), which handles the cross-region-boundary case but doesn't ensure atomic access.

When analyzing the disassembled code, calling try_access() for every guest memory access really is a little heavy. We should build a fast path for those accesses which can never cross a region boundary.
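The fast path for class 1 could be as simple as the sketch below (hypothetical code, not the vm-memory implementation; `MemRegion` and `read_u32_fast` are made-up names):

```rust
use std::convert::TryInto;

/// Toy stand-in for a guest memory region mapped into the VMM.
struct MemRegion {
    start: u64,
    len: u64,
    mem: Vec<u8>,
}

/// Fast path: a naturally aligned u32 access can never straddle two
/// (page-aligned) regions, so one containment check replaces the general
/// per-access walk that try_access() performs.
fn read_u32_fast(regions: &[MemRegion], addr: u64) -> Option<u32> {
    if addr % 4 != 0 {
        return None; // unaligned: the caller falls back to the slow path
    }
    let r = regions
        .iter()
        .find(|r| addr >= r.start && addr + 4 <= r.start + r.len)?;
    let off = (addr - r.start) as usize;
    Some(u32::from_ne_bytes(r.mem[off..off + 4].try_into().unwrap()))
}
```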

My argument is that we don't need atomicity all the time only for specific operations and hence why there needs to be atomic and non-atomic versions of the API.

No, that would only be true if non-atomic versions provided additional value. Right now that value would be speed, but that should not be the case with a properly optimized VolatileMemory copy, since glibc memcpy provides both atomicity and speed.

@jiangliu It should not call try_access() any more after #95 went in, though.

Hi! try_access is still invoked often, as every method from the Bytes implementation for GuestMemory uses try_access or calls another method that does. Also, while memcpy implementations are fast, they don't appear to be as fast as using something like a single read(_volatile) or write(_volatile) for some T. Unfortunately, we can't use those anymore for read_obj and write_obj on a GuestMemoryMmap, because the object can span adjacent regions, which are arbitrarily placed in the VMM process address space :(

What do you think about enforcing (via the GuestMemoryMmap implementation) that regions which are adjacent in the guest physical address space are also adjacent in the VMM process address space? This can be done using an initial mmap that only reserves a virtual address range, and then subsequent mmaps with MAP_FIXED within that range to place each particular region right where we want it. This introduces the limitation of having to declare the maximum amount of memory available to the guest (the size of the initial mmap), but hopefully that's not a concern.

With this in place, we only have to check that a memory access takes place within a valid guest physical address range, before doing it in one go. Moreover, the valid ranges (potentially including multiple adjacent regions) can be precomputed to speed up future validations every time the guest memory layout changes.
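The precomputation step could look something like this sketch (hypothetical helper names; ranges are `(guest_start, len)` pairs):

```rust
/// Merge regions that are adjacent in guest physical address space into
/// maximal contiguous ranges. With adjacency also enforced in the host
/// address space, an access is valid iff it fits inside one merged range.
fn merge_ranges(mut regions: Vec<(u64, u64)>) -> Vec<(u64, u64)> {
    regions.sort_by_key(|&(start, _)| start);
    let mut merged: Vec<(u64, u64)> = Vec::new();
    for (start, len) in regions {
        if let Some(last) = merged.last_mut() {
            if last.0 + last.1 == start {
                // Region begins exactly where the previous one ends: extend.
                last.1 += len;
                continue;
            }
        }
        merged.push((start, len));
    }
    merged
}

/// Single containment check against the precomputed ranges.
fn access_valid(ranges: &[(u64, u64)], addr: u64, size: u64) -> bool {
    ranges.iter().any(|&(s, l)| addr >= s && addr + size <= s + l)
}
```

Note how an access crossing an original region boundary still validates in one check, which is exactly what the per-region walk cannot do.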

Hi again! Here goes another wall of text :D Now is a good time to polish and clear up some aspects of the vm-memory interface, while thinking about optimizing the implementation. What we're looking at is that there are multiple ways of achieving the same thing (though not all of them are always applicable), some operations don't have clearly specified semantics (with respect to atomicity, volatility, etc.), and certain design aspects hinge on assumptions which are worth validating again.

For example, GuestMemory allows specific accesses through its implementation of the Bytes interface, or by directly working with VolatileSlice (via GuestMemory::get_slice) and the related abstractions. The code paths are disjoint to a non-negligible extent, and the latter cannot be exclusively relied upon because it doesn't allow cross-region accesses. This adds (undesirable IMO) nuances to the interface and how a T: GuestMemory object is used.

It looks like the primitives we're looking for are a set of methods similar to what Bytes exposes right now, together with some Array<T> and Ref<T> abstractions, kinda like what we already have, but that also transparently work across regions and across parts of the guest memory that are backed by something other than a host memory area. These two will be implementation-specific, so it seems natural to have them as associated types. Different implementations will be able to simplify things as much as their constraints allow. I've started building a prototype for GuestMemoryMmap (which also attempts to implement the above comment), and was wondering if people have any thoughts, concerns, or are trying things out as well.

In terms of validating assumptions, I wanted to start by asking what use cases are we targeting by allowing that certain guest memory ranges don't have to be backed by memory on the host (i.e. get_host_address may return HostAddressNotAvailable)?

Currently we are using the vm-memory's interfaces in three typical ways:

  1. get_atomic_ref() for atomic access to specific fields. This ensures atomicity.
  2. read_obj()/write_obj() for a whole object. This doesn't ensure atomicity.
  3. read()/write() for byte stream based access. This doesn't ensure atomicity.
For case 1, the access should not cross a region boundary. But the way to use get_atomic_ref() is a little complex because GuestMemory doesn't expose the interface directly: https://github.com/cloud-hypervisor/vm-virtio/blob/dragonball/src/queue.rs#L791

For cases 2 and 3, atomicity is not ensured. We should optimize for case 3, because it may be used to copy bulk data for net/blk/fs devices.

So it may help to

  1. add atomic interfaces to GuestMemory
  2. explicitly state which interfaces ensure atomicity and which don't.

This is the solution we use with Cloud Hypervisor to mitigate the performance drop: we only use the slower alignment-checked write for copies <= the size of usize:

cloud-hypervisor@708e9aa

This means the same API can be used for small updates that must be atomic (like those for updating virtio queue offsets) and for large bulk copies where there is no expectation of that behaviour.
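The dispatch idea can be sketched roughly like this (illustrative only, not the actual cloud-hypervisor commit; `write_small_aligned` is a made-up stand-in for the alignment-checked path):

```rust
use std::convert::TryInto;

/// Size-based dispatch: writes no larger than usize take the slower
/// alignment-checked path (the cases that must be atomic, like virtio
/// queue index updates); anything larger goes through a plain bulk copy.
fn write(dst: &mut [u8], src: &[u8]) {
    if src.len() <= std::mem::size_of::<usize>() {
        write_small_aligned(dst, src);
    } else {
        // Bulk path: one plain memcpy-style copy, no atomicity expected.
        dst[..src.len()].copy_from_slice(src);
    }
}

/// Stand-in for the alignment-checked path: a 4-byte, 4-aligned copy
/// becomes a single store; everything else is copied byte by byte.
fn write_small_aligned(dst: &mut [u8], src: &[u8]) {
    if src.len() == 4 && dst.as_ptr().align_offset(4) == 0 {
        let v = u32::from_ne_bytes(src.try_into().unwrap());
        dst[..4].copy_from_slice(&v.to_ne_bytes());
    } else {
        for (d, s) in dst.iter_mut().zip(src) {
            *d = *s;
        }
    }
}
```

The size check is what keeps a single API usable for both small atomic updates and large copies with no such expectation.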

I like @rbradford's solution very much, possibly extended to 16 bytes.

That's a cool implementation! I think we should clear up some things at the interface level as well; opened #102 and would greatly appreciate if ppl can take a look.