x86 VM + SMP Issue
chrisguikema opened this issue
Hello,
A while back, I made this PR that fixed an issue we were having with passthrough interrupts on an x86 SMP VM configuration. We've tried both MCS and the round-robin scheduler, and the issue is present with both.
Here is a simplified version of our configuration (the actual version gives multiple hardware resources and cores to each VM).
Each VM is isolated on its own core with a fault handler:
Dumping all tcbs on Core 1!
Name               State             IP                   Prio  Core
---------------------------------------------------------------------
vm0:control        blocked on reply  0x40e363             100   1
vm0:fault_handler  blocked on recv   0x40b1cb             255   1
idle_thread        idle              0xffffffff80806781   0     1
We noticed that one of the VMs would begin to experience time drift. Naturally, I assumed it was a problem with the Time Server or the HPET emulation layer. However, I noticed that calls to init_timer_completed() would frequently return 0. The Time Server guards all shared resources with a mutex, so this would only make sense if the Time Server notification was being received multiple times.
In his comments on the closed PR, @kent-mcleod mentioned that pending events would be delivered multiple times. This matches the behavior I'm currently seeing. However, when I remove that patch, the VMs print sporadically and passthrough interrupts don't work at all, which breaks our system.
So I began to look into the problem a bit. I'm not an expert in the (Bound) Notification path, so please correct me if I'm wrong.
Specifically for SMP VMs, there are two places in the kernel where the Notification is delivered:
- When the VMEnter syscall is triggered:
if (syscall == SysVMEnter && NODE_STATE(ksCurThread)->tcbArch.tcbVCPU) {
    vcpu_update_state_sysvmenter(NODE_STATE(ksCurThread)->tcbArch.tcbVCPU);
    if (NODE_STATE(ksCurThread)->tcbBoundNotification
        && notification_ptr_get_state(NODE_STATE(ksCurThread)->tcbBoundNotification) == NtfnState_Active) {
        completeSignal(NODE_STATE(ksCurThread)->tcbBoundNotification, NODE_STATE(ksCurThread));
        setRegister(NODE_STATE(ksCurThread), msgInfoRegister, SEL4_VMENTER_RESULT_NOTIF);
        /* Any guest state that we should return is in the same
         * register position as sent to us, so we can just return
         * and let the user pick up the values they put in */
        restore_user_context();
- In VMCheckBoundNotification:
word_t badge = notification_ptr_get_ntfnMsgIdentifier(ntfnPtr);
notification_ptr_set_state(ntfnPtr, NtfnState_Idle);
setThreadState(tcb, ThreadState_Running);
setRegister(tcb, badgeRegister, badge);
Arch_leaveVMAsyncTransfer(tcb);
/* In the process of performing Arch_leaveVMAsyncTransfer we will have
 * had to switch the active VMCS. As a result we might as well try and
 * run this tcb if it is permitted instead of switching VMCS contexts
 * back and forth */
if (tcb != NODE_STATE(ksCurThread)) {
    possibleSwitchTo(tcb);
}
The first case is triggered when the Bound Notification is in the Active state, which in our case occurs when the VMM is active, either initializing the VM or handling a VM exit / interrupt. Here, the VMEnter call immediately returns with the SEL4_VMENTER_RESULT_NOTIF fault set, which triggers the vm_run loop.
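For reference, the shape of that loop as a minimal sketch (handle_notification and handle_vm_exit are hypothetical stand-ins; the real loop is libsel4vm's vm_run, excerpted in diff form further down):

#include <sel4/sel4.h>

/* hypothetical handlers standing in for the real VMM logic */
extern void handle_notification(seL4_Word badge);
extern void handle_vm_exit(void);

void vm_event_loop(void)
{
    seL4_Word badge = 0;
    while (1) {
        /* seL4_VMEnter resumes the guest; it returns either because the
         * guest faulted or because a bound notification was signalled */
        seL4_Word fault = seL4_VMEnter(&badge);
        if (fault == SEL4_VMENTER_RESULT_NOTIF) {
            handle_notification(badge);
        } else {
            handle_vm_exit();
        }
    }
}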
The second case is triggered when a thread with a VCPU is running in its VM state, and the notification occurs on a different core than the VM, which happens a lot in our use case.
I'm assuming that the Notification is set to Active so that any additional interrupts or signals cause the badge to be updated. In this case, the possibleSwitchTo call happens after the VM is switched to its VMM state, which should trigger the same exit with the SEL4_VMENTER_RESULT_NOTIF fault set.
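For what it's worth, the badge accumulation on an Active notification happens in sendSignal(); paraphrased from the kernel's src/object/notification.c:

case NtfnState_Active: {
    /* a signal arrived while the notification is already Active:
     * no new wake-up, just OR the badge into the stored identifier */
    word_t badge2 = notification_ptr_get_ntfnMsgIdentifier(ntfnPtr);
    badge2 |= badge;
    notification_ptr_set_ntfnMsgIdentifier(ntfnPtr, badge2);
    break;
}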
To test this, I added some print statements in each of these cases. I know printing affects the timing of the system, but I don't have access to a debugger, so it's the best I can do.
To start, there seems to be a 1:1 correlation between the kernel and userspace prints, with the badge values lining up.
Initializing guest to start running at 0x10400000
slowpath: vm1:control completing signal with badge: 0x4080000000000000
vcpu_run@vm.c:268 vm1: vcpu 0 notification exit with badge: 0x4080000000000000
slowpath: vm1:control completing signal with badge: 0x4080000000000000
vcpu_run@vm.c:268 vm1: vcpu 0 notification exit with badge: 0x4080000000000000
slowpath: vm1:control completing signal with badge: 0x4080000000000000
vcpu_run@vm.c:268 vm1: vcpu 0 notification exit with badge: 0x4080000000000000
But then I started to notice these events, where the prints in VMCheckBoundNotification would appear without the corresponding print in the vm_run loop.
VMCheckBoundNotification: vm1:control completing signal with badge: 0x4080000000000000
VMCheckBoundNotification: vm1:control completing signal with badge: 0x4080000000000000
VMCheckBoundNotification: vm1:control completing signal with badge: 0x4080000000000000
vcpu_run@vm.c:268 vm1: vcpu 0 notification exit with badge: 0x4080000000000000
slowpath: vm1:control completing signal with badge: 0x4080000000000000
vcpu_run@vm.c:268 vm1: vcpu 0 notification exit with badge: 0x4080000000000000
VMCheckBoundNotification: vm1:control completing signal with badge: 0x4080000000000000
I can only think of a few reasons why this would be happening:
- The VM thread isn't getting rescheduled
- The msgInfoRegister register isn't being set with the SEL4_VMENTER_RESULT_NOTIF value
- Some other data race condition elsewhere I'm just missing
The first doesn't seem to be the case, as VMCheckBoundNotification only triggers when the thread state is RunningVM, and the function sets the thread state to Running; the state only goes back to RunningVM when the thread calls the VMEnter syscall again. My "fix" was to leave the Notification in the Active state, which would trigger the VMEnter conditional and return with a notification fault, even if the vm_run loop had already been triggered by the VMCheckBoundNotification function.
For the second case, I explicitly set the SEL4_VMENTER_RESULT_NOTIF flag in the VMCheckBoundNotification function, but that didn't work either.
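Concretely, the experiment was roughly the following (a sketch of my local change, not upstream code):

/* in VMCheckBoundNotification(), after the badge register is set: */
setRegister(tcb, badgeRegister, badge);
setRegister(tcb, msgInfoRegister, SEL4_VMENTER_RESULT_NOTIF);   /* the experiment */
Arch_leaveVMAsyncTransfer(tcb);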
That leaves the third option: a data race that I just don't know about.
Did I miss something in my investigation? Has anyone else run into this problem? I'll reply to this issue if I find a fix.
Could you show your kernel and vm.c diffs? It's a bit unclear where exactly the prints are done.
(For performance you want your IRQ to arrive on the same CPU core as the task handling it, though that doesn't solve the bug.)
Here's where the prints are:
diff --git a/src/arch/x86/c_traps.c b/src/arch/x86/c_traps.c
index 33df38ea8..de671ffa5 100644
--- a/src/arch/x86/c_traps.c
+++ b/src/arch/x86/c_traps.c
@@ -108,6 +108,7 @@ void NORETURN slowpath(syscall_t syscall)
vcpu_update_state_sysvmenter(NODE_STATE(ksCurThread)->tcbArch.tcbVCPU);
if (NODE_STATE(ksCurThread)->tcbBoundNotification
&& notification_ptr_get_state(NODE_STATE(ksCurThread)->tcbBoundNotification) == NtfnState_Active) {
+    /* printf("%s: %s completing signal with badge: 0x%llx\n", __func__, TCB_PTR_DEBUG_PTR(NODE_STATE(ksCurThread))->tcbName, notification_ptr_get_ntfnMsgIdentifier(NODE_STATE(ksCurThread)->tcbBoundNotification)); */
diff --git a/src/arch/x86/object/vcpu.c b/src/arch/x86/object/vcpu.c
index 8deacd8b1..2ae94f2ce 100644
--- a/src/arch/x86/object/vcpu.c
+++ b/src/arch/x86/object/vcpu.c
@@ -1466,8 +1468,12 @@ void VMCheckBoundNotification(tcb_t *tcb)
}
#endif
+ /* printf("%s: %s completing signal with badge: 0x%lx [%d]\n", __func__, TCB_PTR_DEBUG_PTR(tcb)->tcbName, badge, (int)getRegister(tcb, msgInfoRegister)); */
For the vm.c diff: we made some updates to allow for multicore guests, hence the function is called vcpu_run instead of vm_run, but the basic event loop is identical.
diff --git a/libsel4vm/src/arch/x86/vm.c b/libsel4vm/src/arch/x86/vm.c
index 7b09952..03ad812 100644
--- a/libsel4vm/src/arch/x86/vm.c
+++ b/libsel4vm/src/arch/x86/vm.c
@@ -259,12 +258,15 @@ int vcpu_run(vm_vcpu_t *vcpu)
}
vm_update_guest_state_from_interrupt(vcpu, int_message);
}
} else {
seL4_Wait(vcpu->host_endpoint, &badge);
fault = SEL4_VMENTER_RESULT_NOTIF;
}
if (fault == SEL4_VMENTER_RESULT_NOTIF) {
+ /* ZF_LOGE("%s: vcpu %d notification exit with badge: 0x%lx", vcpu->vm->get_name(), vcpu->vcpu_id, badge); */
I recommend keeping counts of each event and printing the totals now and then.
Sharing a UART between multiple cores is unreliable, in the sense that output from one core may be overwritten by output from another, depending on the details (my UART driver checked for FIFO empty before it would start writing anything, which reduced the problems with concurrent writes a lot).
Multiple notifications are coalesced, so counting the number of events is not a good idea if you want a reliable timer server.
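As a quick illustration of the coalescing semantics (a minimal sketch against the libsel4 API; the badged caps are assumed to have been minted from the same notification with badges 0x1 and 0x2):

#include <sel4/sel4.h>

/* ntfn_b1 / ntfn_b2: caps to the same notification, minted with
 * badges 0x1 and 0x2; ntfn: the cap we wait on */
void coalescing_demo(seL4_CPtr ntfn_b1, seL4_CPtr ntfn_b2, seL4_CPtr ntfn)
{
    seL4_Word badge = 0;
    seL4_Signal(ntfn_b1);    /* Idle -> Active, stored badge = 0x1 */
    seL4_Signal(ntfn_b2);    /* already Active: stored badge |= 0x2 */
    seL4_Wait(ntfn, &badge); /* wakes once, badge == 0x3: two events, one wake-up */
}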
> The VM thread isn't getting rescheduled

But then it should stay in ThreadState_Running and we would still see only one print from VMCheckBoundNotification.
> The msgInfoRegister register isn't being set with the SEL4_VMENTER_RESULT_NOTIF value

This can be verified by printing something on every wake-up.
> Some other data race condition elsewhere I'm just missing

Maybe, or an unexpected corner case. handleIPI() is called either when trying to take the big kernel lock or when receiving an IRQ. Could you print out which way the call was made in your prints?
vcpu_run@vm.c:264 vm1:core0 has gotten 20 notifications
vcpu_run@vm.c:264 vm1:core0 has gotten 30 notifications
vm1:control: notification exit 50 times
vcpu_run@vm.c:264 vm1:core0 has gotten 40 notifications
I put a num_notification variable in the vm_run function, and added the same variable to the tcbArch struct in the kernel. I updated the prints to fire every 10th event in the VM loop and every 50th in the kernel, so that the terminal isn't spammed and any "misses" are easier to see.
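The VMM side of the counting looks roughly like this (a sketch of my local vm.c change, not libsel4vm API):

/* file-scope counter in vm.c */
static unsigned long num_notification = 0;

/* inside vcpu_run, after seL4_VMEnter / seL4_Wait returns: */
if (fault == SEL4_VMENTER_RESULT_NOTIF) {
    num_notification++;
    if (num_notification % 10 == 0) {
        ZF_LOGE("%s: vcpu %d has gotten %lu notifications",
                vcpu->vm->get_name(), vcpu->vcpu_id, num_notification);
    }
}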
Also, SEL4_VMENTER_RESULT_NOTIF is 0, so really, by default the VMM will check for the notification exit.
I'm just failing to see a state where I'd expect to see this behavior.
> I'm just failing to see a state where I'd expect to see this behavior.

Agreed.
To me handleVmexit() seems suspicious; maybe add some printfs there to see if the weirdness starts after a certain path is taken. In particular, add a check for pending notifications to handleVmxFault() to see if it overwrites the notification.
In handleVmexit, the thread state would still be ThreadState_RunningVM, as that isn't updated until handleVmxFault, which is the last function called in handleVmexit. Additionally, the thread does not claim the big kernel lock until after the EXTERNAL_INTERRUPT check. The external interrupt could be either a passthrough interrupt or an IPI triggered to check the bound notification, in which case the BKL has already been claimed.
One other thing: we're using MSIs, so I was able to route the interrupts directly to the same core as the expected interrupt handler in the guest. That removes the VMCheckBoundNotification path, and the system works. But I'm still experiencing quite a bit of choppiness on the serial console. The TimeServer interrupts are generated on a different core than the VMs, so the VMCheckBoundNotification path is still being triggered.
I'm hoping we can push up an example application showcasing the problem soon.
> In handleVmexit, the thread state would still be ThreadState_RunningVM, as that isn't updated until handleVmxFault

That is not true: VMCheckBoundNotification can have run before that, when handleVmexit attempts to take the global lock, if the VM exit wasn't triggered by the IPI but races with it. By the time control returns to handleVmexit, it seems to me that it will just blissfully overwrite the registers and you will lose a signal.
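Sketched as a timeline (a hypothetical interleaving, reconstructed from the code paths quoted above):

/*
 * Core A (VM thread)                        Core B (signaller)
 * ------------------                        ------------------
 * VM exit (not IPI-triggered)
 * handleVmexit() begins                     seL4_Signal() on the bound ntfn
 * NODE_LOCK_SYS: spins on the BKL           sends remote-call IPI to core A
 *   clh_lock_acquire() polls the IPI
 *   handleIPI() -> VMCheckBoundNotification():
 *     state := ThreadState_Running
 *     badgeRegister := badge              <- signal "delivered"
 * BKL acquired; handleVmexit() resumes
 * handleVmxFault() overwrites the
 * registers for fault delivery            <- signal lost
 */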
Please add a check for pending notifications to handleVmxFault() to see if it overwrites the notification. You can add an assertion that the task state is ThreadState_RunningVM to handleVmexit to check whether I'm right.
Incidentally, I intend to change the code so that if an IPI is handled when trying to take the lock, we return to user space instead of continuing, as that seems too surprising and subtle a behaviour to have. It's not worth the small performance gain of avoiding trapping twice.
Ah, you may be right!
diff --git a/src/arch/x86/object/vcpu.c b/src/arch/x86/object/vcpu.c
index 8deacd8b1..a85142e30 100644
--- a/src/arch/x86/object/vcpu.c
+++ b/src/arch/x86/object/vcpu.c
@@ -1317,6 +1317,8 @@ exception_t handleVmexit(void)
NODE_LOCK_SYS;
+ assert(thread_state_get_tsType(NODE_STATE(ksCurThread)->tcbState) == ThreadState_RunningVM);
+
if (!vcpuThreadUsingFPU(NODE_STATE(ksCurThread))) {
/* since this vcpu does not currently own the fpu state, check if the kernel should
* switch the fpu owner or not. We switch if the guest performed and unimplemented device
During runtime:
seL4 failed assertion 'thread_state_get_tsType(NODE_STATE(ksCurThread)->tcbState) == ThreadState_RunningVM' at /host/kernel/src/arch/x86/object/vcpu.c:1320 in function handleVmexit
halting...
Kernel entry via Syscall, number: 1, Call
Cap type: 10, Invocation tag: 20
Now the question needs to be, how do we address this?
> Now the question needs to be, how do we address this?

Let's try a restore_user_context() after both handleIPI() calls in lock.h and see if that solves it for you. You could test without the affinity fix and see if everything behaves as it should.
That didn't work. Both seemingly got stuck in an x64_sys_send_recv syscall.
Bummer, that's unexpected. Is it broken immediately, or does it break after a while?
It seems to break right away.
Here is the scheduler dump:
Dumping all tcbs on Core 1!
Name               State     IP          Prio  Core
----------------------------------------------------
vm0:control        running   0x40eb7e    100   1
And the objdump:
000000000040eb25 <x64_sys_send_recv>:
....
40eb70: 48 8b 7d e0 mov -0x20(%rbp),%rdi
40eb74: 48 8b 75 d0 mov -0x30(%rbp),%rsi
40eb78: 48 89 c2 mov %rax,%rdx
40eb7b: 48 89 e3 mov %rsp,%rbx
40eb7e: 0f 05 syscall
Oh, and here are my restore_user_context additions. Was this what you were expecting?
diff --git a/include/smp/lock.h b/include/smp/lock.h
index 9f9717e3d..d90044374 100644
--- a/include/smp/lock.h
+++ b/include/smp/lock.h
@@ -76,6 +76,7 @@ static inline void *sel4_atomic_exchange(void *ptr, bool_t
* are async and could be delayed. 'handleIPI' may not return
* based on value of the 'irqPath'. */
handleIPI(CORE_IRQ_TO_IRQT(cpu, irq_remote_call_ipi), irqPath);
+ restore_user_context();
}
arch_pause();
@@ -112,6 +113,7 @@ static inline void FORCE_INLINE clh_lock_acquire(word_t cpu, bool_t irqPath)
handleIPI(CORE_IRQ_TO_IRQT(cpu, irq_remote_call_ipi), irqPath);
/* We do not need to perform a memory release here as we would have only modified
* local state that we do not need to make visible */
+ restore_user_context();
}
arch_pause();
}
Yes, it is, but... oops, we can only do the restore after we have released the lock.
So set an IPI flag and check it at the end of clh_lock_acquire(); if it is set, release the lock and do the user restore there.
Like this?
diff --git a/include/smp/lock.h b/include/smp/lock.h
index 9f9717e3d..47711d93c 100644
--- a/include/smp/lock.h
+++ b/include/smp/lock.h
@@ -61,6 +61,7 @@ static inline void *sel4_atomic_exchange(void *ptr, bool_t
irqPath, word_t cpu, int memorder)
{
clh_qnode_t *prev;
+    bool_t is_ipi = false;
if (memorder == __ATOMIC_RELEASE || memorder == __ATOMIC_ACQ_REL) {
__atomic_thread_fence(__ATOMIC_RELEASE);
@@ -76,6 +77,7 @@ static inline void *sel4_atomic_exchange(void *ptr, bool_t
* are async and could be delayed. 'handleIPI' may not return
* based on value of the 'irqPath'. */
handleIPI(CORE_IRQ_TO_IRQT(cpu, irq_remote_call_ipi), irqPath);
+ is_ipi = true;
}
arch_pause();
@@ -87,12 +89,17 @@ static inline void *sel4_atomic_exchange(void *ptr, bool_t
__atomic_thread_fence(__ATOMIC_SEQ_CST);
}
+ if (is_ipi) {
+ restore_user_context();
+ }
+
return prev;
}
static inline void FORCE_INLINE clh_lock_acquire(word_t cpu, bool_t irqPath)
{
clh_qnode_t *prev;
+    bool_t is_ipi = false;
big_kernel_lock.node_owners[cpu].node->value = CLHState_Pending;
prev = sel4_atomic_exchange(&big_kernel_lock.head, irqPath, cpu, __ATOMIC_ACQ_REL);
@@ -112,12 +119,17 @@ static inline void FORCE_INLINE clh_lock_acquire(word_t cpu, bool_t irqPath)
handleIPI(CORE_IRQ_TO_IRQT(cpu, irq_remote_call_ipi), irqPath);
/* We do not need to perform a memory release here as we would have only modified
* local state that we do not need to make visible */
+ is_ipi = true;
}
arch_pause();
}
/* make sure no resource access passes from this point */
__atomic_thread_fence(__ATOMIC_ACQUIRE);
+
+ if (is_ipi) {
+ restore_user_context();
+ }
}
The VM does print, and FWIW, before it locks up, it does seem to avoid the choppy printing I was seeing.
No, sel4_atomic_exchange needs to set the same flag (or add a return code). And before restoring the user context, call clh_lock_release().
I'm sorry, but I'm not too familiar with the BKL. Can you send me a diff of what you're thinking? I can't get it to work.
So the diff below is not exactly what I said, but the main thing is that once we request the lock we are committed and need to finish getting it, so we also need to release it before we return to user space. We didn't do anything visible to other cores in sel4_atomic_exchange yet, so we can just return there.
diff --git a/include/smp/lock.h b/include/smp/lock.h
index 9f9717e3d..ddd2c4df5 100644
--- a/include/smp/lock.h
+++ b/include/smp/lock.h
@@ -76,6 +76,9 @@ static inline void *sel4_atomic_exchange(void *ptr, bool_t
* are async and could be delayed. 'handleIPI' may not return
* based on value of the 'irqPath'. */
handleIPI(CORE_IRQ_TO_IRQT(cpu, irq_remote_call_ipi), irqPath);
+ /* Go back to user space: */
+ restore_user_context();
+ UNREACHABLE();
}
arch_pause();
@@ -94,6 +97,7 @@ static inline void FORCE_INLINE clh_lock_acquire(word_t cpu, bool_t irqPath)
{
clh_qnode_t *prev;
big_kernel_lock.node_owners[cpu].node->value = CLHState_Pending;
+ bool_t got_ipi = false;
prev = sel4_atomic_exchange(&big_kernel_lock.head, irqPath, cpu, __ATOMIC_ACQ_REL);
@@ -110,6 +114,8 @@ static inline void FORCE_INLINE clh_lock_acquire(word_t cpu, bool_t irqPath)
* are async and could be delayed. 'handleIPI' may not return
* based on value of the 'irqPath'. */
handleIPI(CORE_IRQ_TO_IRQT(cpu, irq_remote_call_ipi), irqPath);
+ /* We are committed to get the lock, can't return here, we have to wait till we get it: */
+ got_ipi = true;
/* We do not need to perform a memory release here as we would have only modified
* local state that we do not need to make visible */
}
@@ -118,6 +124,13 @@ static inline void FORCE_INLINE clh_lock_acquire(word_t cpu, bool_t irqPath)
/* make sure no resource access passes from this point */
__atomic_thread_fence(__ATOMIC_ACQUIRE);
+ if (got_ipi) {
+ /* Release the lock we just got: */
+ clh_lock_release(cpu);
+ /* Go back to user space: */
+ restore_user_context();
+ UNREACHABLE();
+ }
}
static inline void FORCE_INLINE clh_lock_release(word_t cpu)
Alright, I applied that patch, and I'm getting the same behavior. Each VM is stuck at the syscall instruction.
What if, in the handleVmexit function, when the state is Running, we just keep the notification state Active? Then when the VM attempts to enter its VM state, it will immediately exit and handle the interrupt, and any subsequent notifications will be added to the badge instead of being dropped.
Worth a try, to see if it works at least.
I really want to fix all similar bugs by exiting the kernel after an IPI, but I'll keep working on it in #871.
Okay, so that kind of ended up working. My final solution was to add a case where we handle both a fault and a notification, and to keep the notification Active during this edge case.
Unfortunately, that doesn't fix the time-drift issue, but it does let us avoid breaking out of the VMEnter every time a notification is received in the VM state. Once your PR is merged, I'll integrate those commits and see if it fixes the problem without the extra state.
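Roughly, the shape of the change (a sketch of my local patch; rearm_bound_notification is a hypothetical helper, not upstream code):

/* Sketch: when delivering a VMX fault to a thread whose bound
 * notification was already consumed by VMCheckBoundNotification,
 * re-arm the notification so the next SysVMEnter still sees it as
 * Active and returns SEL4_VMENTER_RESULT_NOTIF instead of losing
 * the badge. */
static void rearm_bound_notification(tcb_t *tcb, word_t badge)
{
    notification_t *ntfnPtr = tcb->tcbBoundNotification;
    if (ntfnPtr) {
        /* merge the consumed badge back into the stored identifier */
        notification_ptr_set_ntfnMsgIdentifier(ntfnPtr,
            notification_ptr_get_ntfnMsgIdentifier(ntfnPtr) | badge);
        notification_ptr_set_state(ntfnPtr, NtfnState_Active);
    }
}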
Thanks for your help! I'll leave this issue open for others to review / comment, since my fix may not be the best approach here.