x86 VM + SMP Issue
chrisguikema opened this issue
Hello,
A while back, I made this PR that fixed an issue we were having with passthrough interrupts on an x86 SMP VM configuration. We've tried both MCS and the round-robin scheduler, and the issue is present with both.
Here is a simplified version of our configuration (the actual version gives multiple hardware resources and cores to each VM).
Each VM is isolated on its own core with a fault handler:
Dumping all tcbs on Core 1!
Name               State             IP                   Prio  Core
---------------------------------------------------------------------
vm0:control        blocked on reply  0x40e363             100   1
vm0:fault_handler  blocked on recv   0x40b1cb             255   1
idle_thread        idle              0xffffffff80806781   0     1
We noticed that one of the VMs would begin to experience time drift. Naturally, I assumed it was a problem with the Time Server or the HPET emulation layer. However, I noticed that calls to init_timer_completed() would frequently return 0. The Time Server guards all shared resources with a mutex, so this would only make sense if the Time Server notification was being received multiple times.
In his comments on the closed PR, @kent-mcleod mentioned that pending events would be delivered multiple times. This matches the behavior I'm currently seeing. However, when I remove that patch, the VMs print sporadically and passthrough interrupts don't work at all, which breaks our system.
So I began to look into the problem a bit. I'm not an expert in the (Bound) Notification path, so please correct me if I'm wrong.
Specifically for SMP VMs, there are two places in the kernel where the Notification is delivered:
- When the VMEnter syscall is triggered:
if (syscall == SysVMEnter && NODE_STATE(ksCurThread)->tcbArch.tcbVCPU) {
    vcpu_update_state_sysvmenter(NODE_STATE(ksCurThread)->tcbArch.tcbVCPU);
    if (NODE_STATE(ksCurThread)->tcbBoundNotification
        && notification_ptr_get_state(NODE_STATE(ksCurThread)->tcbBoundNotification) == NtfnState_Active) {
        completeSignal(NODE_STATE(ksCurThread)->tcbBoundNotification, NODE_STATE(ksCurThread));
        setRegister(NODE_STATE(ksCurThread), msgInfoRegister, SEL4_VMENTER_RESULT_NOTIF);
        /* Any guest state that we should return is in the same
         * register position as sent to us, so we can just return
         * and let the user pick up the values they put in */
        restore_user_context();
- In VMCheckBoundNotification:
word_t badge = notification_ptr_get_ntfnMsgIdentifier(ntfnPtr);
notification_ptr_set_state(ntfnPtr, NtfnState_Idle);
setThreadState(tcb, ThreadState_Running);
setRegister(tcb, badgeRegister, badge);
Arch_leaveVMAsyncTransfer(tcb);
/* In the process of performing Arch_leaveVMAsyncTransfer we will have
 * had to switch the active VMCS. As a result we might as well try and
 * run this tcb if it is permitted instead of switching VMCS contexts
 * back and forth */
if (tcb != NODE_STATE(ksCurThread)) {
    possibleSwitchTo(tcb);
}
The first case is triggered when the Bound Notification is in the Active state, which in our case occurs when the VMM is active, either initializing the VM or handling a VM exit / interrupt. Here, the VMEnter call immediately returns with the SEL4_VMENTER_RESULT_NOTIF fault set, which triggers the vm_run loop.
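For reference, the shape of that loop as a minimal sketch (handle_notification and handle_vm_exit are hypothetical stand-ins; the real loop is libsel4vm's vm_run, excerpted in diff form further down):

#include <sel4/sel4.h>

/* hypothetical handlers standing in for the real VMM logic */
extern void handle_notification(seL4_Word badge);
extern void handle_vm_exit(void);

void vm_event_loop(void)
{
    seL4_Word badge = 0;
    while (1) {
        /* seL4_VMEnter resumes the guest; it returns either because the
         * guest faulted or because a bound notification was signalled */
        seL4_Word fault = seL4_VMEnter(&badge);
        if (fault == SEL4_VMENTER_RESULT_NOTIF) {
            handle_notification(badge);
        } else {
            handle_vm_exit();
        }
    }
}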
The second case is triggered when a thread with a VCPU is running in its VM state, and the notification occurs on a different core than the VM, which happens a lot in our use case.
I'm assuming that the Notification is set to Active so that any additional interrupts or signals cause the badge to be updated. In this case, the possibleSwitchTo call happens after the VM is switched to its VMM state, which should trigger the same exit with the SEL4_VMENTER_RESULT_NOTIF fault set.
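For what it's worth, the badge accumulation on an Active notification happens in sendSignal(); paraphrased from the kernel's src/object/notification.c:

case NtfnState_Active: {
    /* a signal arrived while the notification is already Active:
     * no new wake-up, just OR the badge into the stored identifier */
    word_t badge2 = notification_ptr_get_ntfnMsgIdentifier(ntfnPtr);
    badge2 |= badge;
    notification_ptr_set_ntfnMsgIdentifier(ntfnPtr, badge2);
    break;
}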
To test this, I added some print statements in each of these cases. I know printing affects the timing of the system, but I don't have access to a debugger, so it's the best I can do.
To start, there seems to be a 1:1 correlation between the kernel and userspace prints, with the badge values lining up.
Initializing guest to start running at 0x10400000
slowpath: vm1:control completing signal with badge: 0x4080000000000000
vcpu_run@vm.c:268 vm1: vcpu 0 notification exit with badge: 0x4080000000000000
slowpath: vm1:control completing signal with badge: 0x4080000000000000
vcpu_run@vm.c:268 vm1: vcpu 0 notification exit with badge: 0x4080000000000000
slowpath: vm1:control completing signal with badge: 0x4080000000000000
vcpu_run@vm.c:268 vm1: vcpu 0 notification exit with badge: 0x4080000000000000
But then I started to notice these events, where the prints in VMCheckBoundNotification would appear without the corresponding print in the vm_run loop.
VMCheckBoundNotification: vm1:control completing signal with badge: 0x4080000000000000
VMCheckBoundNotification: vm1:control completing signal with badge: 0x4080000000000000
VMCheckBoundNotification: vm1:control completing signal with badge: 0x4080000000000000
vcpu_run@vm.c:268 vm1: vcpu 0 notification exit with badge: 0x4080000000000000
slowpath: vm1:control completing signal with badge: 0x4080000000000000
vcpu_run@vm.c:268 vm1: vcpu 0 notification exit with badge: 0x4080000000000000
VMCheckBoundNotification: vm1:control completing signal with badge: 0x4080000000000000
I can only think of a few reasons why this would be happening:
- The VM thread isn't getting rescheduled
- The msgInfoRegister register isn't being set with the SEL4_VMENTER_RESULT_NOTIF value
- Some other data race condition elsewhere I'm just missing
The first doesn't seem to be the case, as VMCheckBoundNotification only triggers when the thread state is RunningVM, and the function sets the thread state to Running; the state only goes back to RunningVM when the thread calls the VMEnter syscall again. My "fix" was to leave the Notification in the Active state, which would trigger the VMEnter conditional and return with a notification fault, even if the vm_run loop had already been triggered by the VMCheckBoundNotification function.
For the second case, I explicitly set the SEL4_VMENTER_RESULT_NOTIF flag in the VMCheckBoundNotification function, but that didn't work either.
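Concretely, the experiment was roughly the following (a sketch of my local change, not upstream code):

/* in VMCheckBoundNotification(), after the badge register is set: */
setRegister(tcb, badgeRegister, badge);
setRegister(tcb, msgInfoRegister, SEL4_VMENTER_RESULT_NOTIF);   /* the experiment */
Arch_leaveVMAsyncTransfer(tcb);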
That leaves the third option: a data race that I just don't know about.
Did I miss something in my investigation? Has anyone else run into this problem? I'll reply to this issue if I find a fix.
Could you show your kernel and vm.c diffs? It's a bit unclear where exactly the prints are done.
(For performance you want your IRQ to arrive on the same CPU core as the task handling it, though that doesn't solve the bug.)
Here's where the prints are:
diff --git a/src/arch/x86/c_traps.c b/src/arch/x86/c_traps.c
index 33df38ea8..de671ffa5 100644
--- a/src/arch/x86/c_traps.c
+++ b/src/arch/x86/c_traps.c
@@ -108,6 +108,7 @@ void NORETURN slowpath(syscall_t syscall)
vcpu_update_state_sysvmenter(NODE_STATE(ksCurThread)->tcbArch.tcbVCPU);
if (NODE_STATE(ksCurThread)->tcbBoundNotification
&& notification_ptr_get_state(NODE_STATE(ksCurThread)->tcbBoundNotification) == NtfnState_Active) {
+    /* printf("%s: %s completing signal with badge: 0x%llx\n", __func__, TCB_PTR_DEBUG_PTR(NODE_STATE(ksCurThread))->tcbName, notification_ptr_get_ntfnMsgIdentifier(NODE_STATE(ksCurThread)->tcbBoundNotification)); */
diff --git a/src/arch/x86/object/vcpu.c b/src/arch/x86/object/vcpu.c
index 8deacd8b1..2ae94f2ce 100644
--- a/src/arch/x86/object/vcpu.c
+++ b/src/arch/x86/object/vcpu.c
@@ -1466,8 +1468,12 @@ void VMCheckBoundNotification(tcb_t *tcb)
}
#endif
+ /* printf("%s: %s completing signal with badge: 0x%lx [%d]\n", __func__, TCB_PTR_DEBUG_PTR(tcb)->tcbName, badge, (int)getRegister(tcb, msgInfoRegister)); */
For the vm.c diff: we made some updates to allow for multicore guests, hence the function is called vcpu_run instead of vm_run, but the basic event loop is identical.
diff --git a/libsel4vm/src/arch/x86/vm.c b/libsel4vm/src/arch/x86/vm.c
index 7b09952..03ad812 100644
--- a/libsel4vm/src/arch/x86/vm.c
+++ b/libsel4vm/src/arch/x86/vm.c
@@ -259,12 +258,15 @@ int vcpu_run(vm_vcpu_t *vcpu)
}
vm_update_guest_state_from_interrupt(vcpu, int_message);
}
} else {
seL4_Wait(vcpu->host_endpoint, &badge);
fault = SEL4_VMENTER_RESULT_NOTIF;
}
if (fault == SEL4_VMENTER_RESULT_NOTIF) {
+ /* ZF_LOGE("%s: vcpu %d notification exit with badge: 0x%lx", vcpu->vm->get_name(), vcpu->vcpu_id, badge); */
I recommend keeping counts of each event and printing the totals now and then.
Sharing a UART between multiple cores is unreliable, in the sense that output from one core may be overwritten by output from another, depending on the details (my UART driver checked for FIFO empty before it would start writing anything, which reduced the problems with concurrent writes a lot).
Multiple notifications are coalesced, so counting the number of events is not a good idea if you want a reliable timer server.
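As a quick illustration of the coalescing semantics (a minimal sketch against the libsel4 API; the badged caps are assumed to have been minted from the same notification with badges 0x1 and 0x2):

#include <sel4/sel4.h>

/* ntfn_b1 / ntfn_b2: caps to the same notification, minted with
 * badges 0x1 and 0x2; ntfn: the cap we wait on */
void coalescing_demo(seL4_CPtr ntfn_b1, seL4_CPtr ntfn_b2, seL4_CPtr ntfn)
{
    seL4_Word badge = 0;
    seL4_Signal(ntfn_b1);    /* Idle -> Active, stored badge = 0x1 */
    seL4_Signal(ntfn_b2);    /* already Active: stored badge |= 0x2 */
    seL4_Wait(ntfn, &badge); /* wakes once, badge == 0x3: two events, one wake-up */
}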
> The VM thread isn't getting rescheduled

But then it should stay in ThreadState_Running and we would still see only one print from VMCheckBoundNotification.
> The msgInfoRegister register isn't being set with the SEL4_VMENTER_RESULT_NOTIF value

This can be verified by printing something on every wake-up.
> Some other data race condition elsewhere I'm just missing

Maybe, or an unexpected corner case. handleIPI() is called either when trying to take the big kernel lock or when receiving an IRQ. Could you print out which way the call was made in your prints?
vcpu_run@vm.c:264 vm1:core0 has gotten 20 notifications
vcpu_run@vm.c:264 vm1:core0 has gotten 30 notifications
vm1:control: notification exit 50 times
vcpu_run@vm.c:264 vm1:core0 has gotten 40 notifications
I put a num_notification variable in the vm_run function, and added the same variable to the tcbArch struct in the kernel. I updated the prints to fire every 10th event in the VM loop and every 50th in the kernel, so that the terminal isn't spammed and any "misses" are easier to see.
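The VMM side of the counting looks roughly like this (a sketch of my local vm.c change, not libsel4vm API):

/* file-scope counter in vm.c */
static unsigned long num_notification = 0;

/* inside vcpu_run, after seL4_VMEnter / seL4_Wait returns: */
if (fault == SEL4_VMENTER_RESULT_NOTIF) {
    num_notification++;
    if (num_notification % 10 == 0) {
        ZF_LOGE("%s: vcpu %d has gotten %lu notifications",
                vcpu->vm->get_name(), vcpu->vcpu_id, num_notification);
    }
}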
Also, SEL4_VMENTER_RESULT_NOTIF is 0, so really, by default the VMM will check for the notification exit.
I'm just failing to see a state where I'd expect to see this behavior.
> I'm just failing to see a state where I'd expect to see this behavior.

Agreed.
To me handleVmexit() seems suspicious; maybe add some printfs there to see if the weirdness starts after a certain path is taken. In particular, add a check for pending notifications to handleVmxFault() to see if it overwrites the notification.
In handleVmexit, the thread state would still be ThreadState_RunningVM, as that isn't updated until handleVmxFault, which is the last function called in handleVmexit. Additionally, the thread does not claim the big kernel lock until after the EXTERNAL_INTERRUPT check. The external interrupt could be either a passthrough interrupt or an IPI triggered to check the bound notification, in which case the BKL has already been claimed.
One other thing: we're using MSIs, so I was able to route the interrupts directly to the same core as the expected interrupt handler in the guest. That removes the VMCheckBoundNotification path, and the system works. But I'm still experiencing quite a bit of choppiness on the serial console. The TimeServer interrupts are generated on a different core than the VMs, so the VMCheckBoundNotification path is still being triggered.
I'm hoping we can push up an example application showcasing the problem soon.
> In handleVmexit, the thread state would still be ThreadState_RunningVM, as that isn't updated until handleVmxFault

That is not true: VMCheckBoundNotification can have run before that, when handleVmexit attempts to take the global lock, if the VM exit wasn't triggered by the IPI but races with it. By the time control returns to handleVmexit, it seems to me that it will just blissfully overwrite the registers and you will lose a signal.
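Sketched as a timeline (a hypothetical interleaving, reconstructed from the code paths quoted above):

/*
 * Core A (VM thread)                        Core B (signaller)
 * ------------------                        ------------------
 * VM exit (not IPI-triggered)
 * handleVmexit() begins                     seL4_Signal() on the bound ntfn
 * NODE_LOCK_SYS: spins on the BKL           sends remote-call IPI to core A
 *   clh_lock_acquire() polls the IPI
 *   handleIPI() -> VMCheckBoundNotification():
 *     state := ThreadState_Running
 *     badgeRegister := badge              <- signal "delivered"
 * BKL acquired; handleVmexit() resumes
 * handleVmxFault() overwrites the
 * registers for fault delivery            <- signal lost
 */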
Please add a check for pending notifications to handleVmxFault() to see if it overwrites the notification. You can add an assertion that the task state is ThreadState_RunningVM to handleVmexit to check whether I'm right.
Incidentally, I intend to change the code so that if an IPI is handled when trying to take the lock, we return to user space instead of continuing, as that seems too surprising and subtle a behaviour to have. It's not worth the small performance gain of avoiding trapping twice.
Ah, you may be right!
diff --git a/src/arch/x86/object/vcpu.c b/src/arch/x86/object/vcpu.c
index 8deacd8b1..a85142e30 100644
--- a/src/arch/x86/object/vcpu.c
+++ b/src/arch/x86/object/vcpu.c
@@ -1317,6 +1317,8 @@ exception_t handleVmexit(void)
NODE_LOCK_SYS;
+ assert(thread_state_get_tsType(NODE_STATE(ksCurThread)->tcbState) == ThreadState_RunningVM);
+
if (!vcpuThreadUsingFPU(NODE_STATE(ksCurThread))) {
/* since this vcpu does not currently own the fpu state, check if the kernel should
* switch the fpu owner or not. We switch if the guest performed and unimplemented device
During runtime:
seL4 failed assertion 'thread_state_get_tsType(NODE_STATE(ksCurThread)->tcbState) == ThreadState_RunningVM' at /host/kernel/src/arch/x86/object/vcpu.c:1320 in function handleVmexit
halting...
Kernel entry via Syscall, number: 1, Call
Cap type: 10, Invocation tag: 20
Now the question needs to be, how do we address this?
> Now the question needs to be, how do we address this?

Let's try a restore_user_context() after both handleIPI() calls in lock.h and see if that solves it for you. You could test without the affinity fix and see if everything behaves as it should.
That didn't work. Both seemingly got stuck in an x64_sys_send_recv syscall.
Bummer, that's unexpected. Is it broken immediately, or does it break after a while?
It seems to break right away.
Here is the scheduler dump:
Dumping all tcbs on Core 1!
Name               State     IP          Prio  Core
----------------------------------------------------
vm0:control        running   0x40eb7e    100   1
And the objdump:
000000000040eb25 <x64_sys_send_recv>:
....
40eb70: 48 8b 7d e0 mov -0x20(%rbp),%rdi
40eb74: 48 8b 75 d0 mov -0x30(%rbp),%rsi
40eb78: 48 89 c2 mov %rax,%rdx
40eb7b: 48 89 e3 mov %rsp,%rbx
40eb7e: 0f 05 syscall
Oh, and here are my restore_user_context additions. Was this what you were expecting?
diff --git a/include/smp/lock.h b/include/smp/lock.h
index 9f9717e3d..d90044374 100644
--- a/include/smp/lock.h
+++ b/include/smp/lock.h
@@ -76,6 +76,7 @@ static inline void *sel4_atomic_exchange(void *ptr, bool_t
* are async and could be delayed. 'handleIPI' may not return
* based on value of the 'irqPath'. */
handleIPI(CORE_IRQ_TO_IRQT(cpu, irq_remote_call_ipi), irqPath);
+ restore_user_context();
}
arch_pause();
@@ -112,6 +113,7 @@ static inline void FORCE_INLINE clh_lock_acquire(word_t cpu, bool_t irqPath)
handleIPI(CORE_IRQ_TO_IRQT(cpu, irq_remote_call_ipi), irqPath);
/* We do not need to perform a memory release here as we would have only modified
* local state that we do not need to make visible */
+ restore_user_context();
}
arch_pause();
}
Yes, it is, but... oops, we can only do the restore after we have released the lock.
So set an IPI flag and check it at the end of clh_lock_acquire(); if it is set, release the lock and do the user restore there.
Like this?
diff --git a/include/smp/lock.h b/include/smp/lock.h
index 9f9717e3d..47711d93c 100644
--- a/include/smp/lock.h
+++ b/include/smp/lock.h
@@ -61,6 +61,7 @@ static inline void *sel4_atomic_exchange(void *ptr, bool_t
irqPath, word_t cpu, int memorder)
{
clh_qnode_t *prev;
+    bool_t is_ipi = false;
if (memorder == __ATOMIC_RELEASE || memorder == __ATOMIC_ACQ_REL) {
__atomic_thread_fence(__ATOMIC_RELEASE);
@@ -76,6 +77,7 @@ static inline void *sel4_atomic_exchange(void *ptr, bool_t
* are async and could be delayed. 'handleIPI' may not return
* based on value of the 'irqPath'. */
handleIPI(CORE_IRQ_TO_IRQT(cpu, irq_remote_call_ipi), irqPath);
+ is_ipi = true;
}
arch_pause();
@@ -87,12 +89,17 @@ static inline void *sel4_atomic_exchange(void *ptr, bool_t
__atomic_thread_fence(__ATOMIC_SEQ_CST);
}
+ if (is_ipi) {
+ restore_user_context();
+ }
+
return prev;
}
static inline void FORCE_INLINE clh_lock_acquire(word_t cpu, bool_t irqPath)
{
clh_qnode_t *prev;
+    bool_t is_ipi = false;
big_kernel_lock.node_owners[cpu].node->value = CLHState_Pending;
prev = sel4_atomic_exchange(&big_kernel_lock.head, irqPath, cpu, __ATOMIC_ACQ_REL);
@@ -112,12 +119,17 @@ static inline void FORCE_INLINE clh_lock_acquire(word_t cpu, bool_t irqPath)
handleIPI(CORE_IRQ_TO_IRQT(cpu, irq_remote_call_ipi), irqPath);
/* We do not need to perform a memory release here as we would have only modified
* local state that we do not need to make visible */
+ is_ipi = true;
}
arch_pause();
}
/* make sure no resource access passes from this point */
__atomic_thread_fence(__ATOMIC_ACQUIRE);
+
+ if (is_ipi) {
+ restore_user_context();
+ }
}
The VM does print, and FWIW, before it locks up, it does seem to avoid the choppy printing I was seeing.
No, sel4_atomic_exchange needs to set the same flag (or add a return code). And before restoring the user context, call clh_lock_release().
I'm sorry, but I'm not too familiar with the BKL. Can you send me a diff of what you're thinking? I can't get it to work.
So the diff below is not exactly what I said, but the main thing is that once we request the lock we are committed and need to finish getting it, so we also need to release it before we return to user space. We didn't do anything visible to other cores in sel4_atomic_exchange yet, so we can just return there.
diff --git a/include/smp/lock.h b/include/smp/lock.h
index 9f9717e3d..ddd2c4df5 100644
--- a/include/smp/lock.h
+++ b/include/smp/lock.h
@@ -76,6 +76,9 @@ static inline void *sel4_atomic_exchange(void *ptr, bool_t
* are async and could be delayed. 'handleIPI' may not return
* based on value of the 'irqPath'. */
handleIPI(CORE_IRQ_TO_IRQT(cpu, irq_remote_call_ipi), irqPath);
+ /* Go back to user space: */
+ restore_user_context();
+ UNREACHABLE();
}
arch_pause();
@@ -94,6 +97,7 @@ static inline void FORCE_INLINE clh_lock_acquire(word_t cpu, bool_t irqPath)
{
clh_qnode_t *prev;
big_kernel_lock.node_owners[cpu].node->value = CLHState_Pending;
+ bool_t got_ipi = false;
prev = sel4_atomic_exchange(&big_kernel_lock.head, irqPath, cpu, __ATOMIC_ACQ_REL);
@@ -110,6 +114,8 @@ static inline void FORCE_INLINE clh_lock_acquire(word_t cpu, bool_t irqPath)
* are async and could be delayed. 'handleIPI' may not return
* based on value of the 'irqPath'. */
handleIPI(CORE_IRQ_TO_IRQT(cpu, irq_remote_call_ipi), irqPath);
+ /* We are committed to get the lock, can't return here, we have to wait till we get it: */
+ got_ipi = true;
/* We do not need to perform a memory release here as we would have only modified
* local state that we do not need to make visible */
}
@@ -118,6 +124,13 @@ static inline void FORCE_INLINE clh_lock_acquire(word_t cpu, bool_t irqPath)
/* make sure no resource access passes from this point */
__atomic_thread_fence(__ATOMIC_ACQUIRE);
+ if (got_ipi) {
+ /* Release the lock we just got: */
+ clh_lock_release(cpu);
+ /* Go back to user space: */
+ restore_user_context();
+ UNREACHABLE();
+ }
}
static inline void FORCE_INLINE clh_lock_release(word_t cpu)
Alright, I applied that patch, and I'm getting the same behavior. Each VM is stuck at the syscall instruction.
What if, in the handleVmexit function, when the state is Running, we just keep the notification state Active? Then when the VM attempts to enter its VM state, it will immediately exit and handle the interrupt, and any subsequent notifications will be added to the badge instead of being dropped.
Worth a try, to see if it works at least.
I really want to fix all similar bugs by exiting the kernel after an IPI, but I'll keep working on it in #871.
Okay, so that kind of ended up working. My final solution was to add a case where we handle both a fault and a notification, and to keep the notification Active during this edge case.
Unfortunately, that doesn't fix the time-drift issue, but it does let us avoid breaking out of the VMEnter every time a notification is received in the VM state. Once your PR is merged, I'll integrate those commits and see if it fixes the problem without the extra state.
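Roughly, the shape of the change (a sketch of my local patch; rearm_bound_notification is a hypothetical helper, not upstream code):

/* Sketch: when delivering a VMX fault to a thread whose bound
 * notification was already consumed by VMCheckBoundNotification,
 * re-arm the notification so the next SysVMEnter still sees it as
 * Active and returns SEL4_VMENTER_RESULT_NOTIF instead of losing
 * the badge. */
static void rearm_bound_notification(tcb_t *tcb, word_t badge)
{
    notification_t *ntfnPtr = tcb->tcbBoundNotification;
    if (ntfnPtr) {
        /* merge the consumed badge back into the stored identifier */
        notification_ptr_set_ntfnMsgIdentifier(ntfnPtr,
            notification_ptr_get_ntfnMsgIdentifier(ntfnPtr) | badge);
        notification_ptr_set_state(ntfnPtr, NtfnState_Active);
    }
}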
Thanks for your help! I'll leave this issue open for others to review / comment, since my fix may not be the best approach here.