Bareflank / hypervisor

lightweight hypervisor SDK written in C++ with support for Windows, Linux and UEFI

uefi shell frozen during start_vmm()

nicholasdkunes opened this issue · comments

UEFI shell freezes up during start_vmm()
Seems to call SFS->OpenVolume(), and read_file() on the kernel/extension0 file correctly.
I believe it fails somewhere in start_vmm.

Vendor is Intel, loaded plenty of UEFI HVs before on my PC, so it definitely isn't a configuration issue. I'll try to debug this more, but hopefully you're aware of the issue.

HYPERVISOR_BUILD_LOADER enabled
HYPERVISOR_BUILD_VMMCTL enabled
HYPERVISOR_BUILD_MICROKERNEL enabled
HYPERVISOR_BUILD_EXAMPLES enabled
HYPERVISOR_BUILD_EFI enabled
HYPERVISOR_TARGET_ARCH GenuineIntel
HYPERVISOR_CXX_LINKER ld.lld
HYPERVISOR_EFI_LINKER lld-link
HYPERVISOR_EFI_FS0 X:/
HYPERVISOR_PAGE_SIZE BSL_PAGE_SIZE
HYPERVISOR_PAGE_SHIFT 12
HYPERVISOR_SERIAL_PORT 0x03F8
HYPERVISOR_MAX_ELF_FILE_SIZE 0x800000
HYPERVISOR_MAX_SEGMENTS 3
HYPERVISOR_MAX_EXTENSIONS 1
HYPERVISOR_MAX_VMS 16
HYPERVISOR_MAX_PPS 128
HYPERVISOR_MAX_VPS 256
HYPERVISOR_MAX_VPS_PER_VM HYPERVISOR_MAX_PPS
HYPERVISOR_MAX_VPSS_PER_VP 2
HYPERVISOR_MAX_VPSS HYPERVISOR_MAX_VPS * HYPERVISOR_MAX_VPSS_PER_VP
HYPERVISOR_DEBUG_RING_SIZE 0x7FF0
HYPERVISOR_DIRECT_MAP_ADDR 0x0000400000000000
HYPERVISOR_MK_STACK_ADDR 0x0000008000000000
HYPERVISOR_MK_STACK_SIZE 0x8000
HYPERVISOR_MK_CODE_ADDR 0x0000028000000000
HYPERVISOR_MK_CODE_SIZE HYPERVISOR_MAX_ELF_FILE_SIZE
HYPERVISOR_MK_MAP_ADDR 0x0000038000000000
HYPERVISOR_MK_MAP_SIZE 0x1000000000
HYPERVISOR_EXT_STACK_ADDR 0x0000308000000000
HYPERVISOR_EXT_STACK_SIZE 0x8000
HYPERVISOR_EXT_CODE_ADDR 0x0000328000000000
HYPERVISOR_EXT_CODE_SIZE HYPERVISOR_MAX_ELF_FILE_SIZE
HYPERVISOR_EXT_TLS_ADDR 0x0000338000000000
HYPERVISOR_EXT_TLS_SIZE 0x2000
HYPERVISOR_EXT_PAGE_POOL_ADDR 0x0000358000000000
HYPERVISOR_EXT_PAGE_POOL_SIZE 0x1000000000
HYPERVISOR_EXT_HEAP_POOL_ADDR 0x0000368000000000
HYPERVISOR_EXT_HEAP_POOL_SIZE 0x1000000000
HYPERVISOR_HUGE_POOL_SIZE 0x10000
HYPERVISOR_PAGE_POOL_SIZE 0x2000000

Ok, verified the issue is in start_vmm() and not load_images_and_start; it finds bareflank_kernel/bareflank_extension0 just fine. Will update if I find the source.

I've logged start_vmm step by step. This is my output; it gets at least this far:
20210417_023320

Ok, this will be my last new comment but I'll keep updating it with findings.

Edit1:

Issue now must be in start_vmm_per_cpu or lower, will start logging output there to see where the failure is.

Edit2:

ok, here is the source dialed down:

    mk_stack_offs = (HYPERVISOR_MK_STACK_SIZE + HYPERVISOR_PAGE_SIZE) * cpu;
    mk_stack_virt = (g_mk_stack_virt + mk_stack_offs);

    if (alloc_mk_stack(0U, &g_mk_stack[cpu])) {
        bferror("alloc_mk_stack failed");
        goto alloc_mk_stack_failed;
    }
    else {
        bferror("alloc_mk_stack success");
    }

    ret = alloc_and_copy_mk_state(
        g_mk_root_page_table, &g_mk_elf_file, &g_mk_stack[cpu], mk_stack_virt, &g_mk_state[cpu]);

    if (ret) {
        bferror("alloc_and_copy_mk_state failed");
        goto alloc_and_copy_mk_state_failed;
    }
    else {
        bferror("alloc_and_copy_mk_state success");
    }

This image shows the output: it never gets past alloc_and_copy_mk_state() on CPU 0.

20210417_024618
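
The stack offset arithmetic at the top of that excerpt can be checked by hand. Below is a minimal sketch, assuming that g_mk_stack_virt equals HYPERVISOR_MK_STACK_ADDR from the configuration above and that the page size is 4096 bytes (HYPERVISOR_PAGE_SHIFT is 12). The helper name mk_stack_virt_for_cpu is hypothetical and is not part of the loader.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the build configuration quoted earlier in this issue. */
#define HYPERVISOR_PAGE_SIZE     UINT64_C(0x1000) /* assumed: 1 << HYPERVISOR_PAGE_SHIFT */
#define HYPERVISOR_MK_STACK_ADDR UINT64_C(0x0000008000000000)
#define HYPERVISOR_MK_STACK_SIZE UINT64_C(0x8000)
#define HYPERVISOR_MAX_PPS       UINT64_C(128)

/* Hypothetical helper: mirrors the two assignments at the top of the excerpt. */
static uint64_t
mk_stack_virt_for_cpu(uint64_t cpu)
{
    uint64_t mk_stack_offs;

    assert(cpu < HYPERVISOR_MAX_PPS);

    /* Each CPU's slot is one stack plus one extra page. */
    mk_stack_offs = (HYPERVISOR_MK_STACK_SIZE + HYPERVISOR_PAGE_SIZE) * cpu;
    return HYPERVISOR_MK_STACK_ADDR + mk_stack_offs;
}
```

Under those assumptions CPU 0 gets 0x8000000000 and CPU 1 gets 0x8000009000, so consecutive stacks sit 0x9000 bytes apart, leaving a one-page gap between them.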

In alloc_and_copy_mk_state(), there is a call to enable_hve() (I'm on Intel, so it resolves to the Intel version). It never gets past enable_hve().

Here is how far it gets in enable_hve():

int64_t
enable_hve(struct state_save_t *const state)
{
    uint64_t cr4;
    uint64_t phys;
    uint64_t revision_id;

    cr4 = intrinsic_scr4();
    if ((cr4 & CR4_VMXE) != 0) {
        bferror("VT-x is already running. Is another hypervisor running?");
        return LOADER_FAILURE;
    }
    else {
        bferror("VT-x not running - success");
    }

    phys = platform_virt_to_phys(state->hve_page);
    if (((uint64_t)0) == phys) {
        bferror("platform_virt_to_phys failed");
        return LOADER_FAILURE;
    }
    else {
        bferror("platform_virt_to_phys success");
    }

    revision_id = intrinsic_rdmsr(MSR_IA32_VMX_BASIC) & VMX_BASIC_REVISION_ID;
    ((uint32_t *)state->hve_page)[0] = ((uint32_t)revision_id);

    intrinsic_lcr4(cr4 | CR4_VMXE);

    if (intrinsic_vmxon(&phys)) {
        bferror("intrinsic_vmxon failed");
        goto intrinsic_vmxon_failure;
    }
    else {
        bferror("intrinsic_vmxon success");
    }

The last output I see is "platform_virt_to_phys success". Therefore it must fail somewhere between that point and the fail/success message for intrinsic_vmxon().

ok, officially sourced the problem :) took a while but here you go, hopefully this helps us find a solution.

in enable_hve() intel variant

    revision_id = intrinsic_rdmsr(MSR_IA32_VMX_BASIC) & VMX_BASIC_REVISION_ID;
    bferror("got here-1");

    ((uint32_t *)state->hve_page)[0] = ((uint32_t)revision_id);
    bferror("got here-2");


    intrinsic_lcr4(cr4 | CR4_VMXE);
    bferror("got here-3");

    if (intrinsic_vmxon(&phys)) {
        bferror("intrinsic_vmxon failed");
        goto intrinsic_vmxon_failure;
    }
    else {
        bferror("intrinsic_vmxon success");
    }

last output seen is got here-3, therefore I assume it is failing in the intrinsic vmxon.
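
When VMXON faults instead of returning an error, it helps to rule out its documented preconditions one at a time. The sketch below checks a few of them on plain values, so it can be run anywhere: CR4.VMXE must be set, IA32_FEATURE_CONTROL must be locked with VMX enabled outside SMX, the VMXON region must be 4 KiB aligned, and its first dword must hold the revision ID from IA32_VMX_BASIC. The struct, enum and function names are hypothetical, the constants are written from the Intel SDM rather than taken from Bareflank's headers, and the sketch deliberately leaves out the CR0/CR4 fixed-bit requirement that comes up later in this thread.

```c
#include <stdint.h>

#define CR4_VMXE                UINT64_C(0x2000)     /* CR4 bit 13 */
#define FEATURE_CONTROL_LOCK    UINT64_C(0x1)        /* IA32_FEATURE_CONTROL bit 0 */
#define FEATURE_CONTROL_VMXON   UINT64_C(0x4)        /* bit 2: VMXON allowed outside SMX */
#define VMX_BASIC_REVISION_ID   UINT64_C(0x7FFFFFFF) /* IA32_VMX_BASIC bits 30:0 */
#define VMXON_REGION_ALIGN_MASK UINT64_C(0xFFF)      /* region must be 4 KiB aligned */

enum vmxon_check_t {
    VMXON_CHECK_OK = 0,
    VMXON_CHECK_VMXE_CLEAR,
    VMXON_CHECK_FEATURE_CONTROL,
    VMXON_CHECK_REGION_UNALIGNED,
    VMXON_CHECK_REVISION_MISMATCH
};

/* Hypothetical snapshot of the values that matter right before VMXON. */
struct vmxon_inputs_t {
    uint64_t cr4;
    uint64_t feature_control;    /* value of IA32_FEATURE_CONTROL (MSR 0x3A) */
    uint64_t vmx_basic;          /* value of IA32_VMX_BASIC (MSR 0x480) */
    uint64_t region_phys;        /* physical address passed to VMXON */
    uint32_t region_first_dword; /* first 4 bytes of the VMXON region */
};

/* Returns the first precondition that is not met, or VMXON_CHECK_OK. */
static enum vmxon_check_t
check_vmxon_inputs(const struct vmxon_inputs_t *in)
{
    const uint64_t required = FEATURE_CONTROL_LOCK | FEATURE_CONTROL_VMXON;

    if ((in->cr4 & CR4_VMXE) == 0) {
        return VMXON_CHECK_VMXE_CLEAR;
    }
    if ((in->feature_control & required) != required) {
        return VMXON_CHECK_FEATURE_CONTROL;
    }
    if ((in->region_phys & VMXON_REGION_ALIGN_MASK) != 0) {
        return VMXON_CHECK_REGION_UNALIGNED;
    }
    if (in->region_first_dword != (uint32_t)(in->vmx_basic & VMX_BASIC_REVISION_ID)) {
        return VMXON_CHECK_REVISION_MISMATCH;
    }
    return VMXON_CHECK_OK;
}
```

In a loader, the inputs would come from the same intrinsics used in enable_hve(), and logging the result before executing VMXON narrows down which requirement is not being met.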

@nicholasdkunes Wow... great job tracking this down. What are your system specs? Also, did you compile this from Windows or Linux?

@rianquinn Compiled with ninja on Windows, didn't have to change anything to get compilation to work. The instructions worked out of the box.

What type of specs do you need? I would assume only processor matters at the UEFI level. I'll update this comment with some detailed specs but feel free to source more from me if I missed something.

Host Name:                 DESKTOP-NMI9P2R
OS Name:                   Microsoft Windows 10 Pro
OS Version:                10.0.19042 N/A Build 19042
OS Manufacturer:           Microsoft Corporation
OS Configuration:          Standalone Workstation
OS Build Type:             Multiprocessor Free
Registered Owner:          [redacted]
Registered Organization:   N/A
Product ID:               [redacted]
Original Install Date:     2/8/2021, 10:01:51 PM
System Boot Time:          4/17/2021, 8:40:57 AM
System Manufacturer:       Micro-Star International Co., Ltd.
System Model:              MS-7C75
System Type:               x64-based PC
Processor(s):              1 Processor(s) Installed.
                           [01]: Intel64 Family 6 Model 165 Stepping 5 GenuineIntel ~3792 Mhz
BIOS Version:              American Megatrends Inc. 2.10, 5/21/2020
Windows Directory:         C:\WINDOWS
System Directory:          C:\WINDOWS\system32
Boot Device:               \Device\HarddiskVolume1
System Locale:             en-us;English (United States)
Input Locale:              en-us;English (United States)
Time Zone:                 (UTC-08:00) Pacific Time (US & Canada)
Total Physical Memory:     32,690 MB
Available Physical Memory: 29,024 MB
Virtual Memory: Max Size:  37,554 MB
Virtual Memory: Available: 32,517 MB
Virtual Memory: In Use:    5,037 MB
Page File Location(s):     C:\pagefile.sys
Domain:                    WORKGROUP
Logon Server:              [redacted]
Hotfix(s):                 12 Hotfix(s) Installed.
                           [01]: KB4601554
                           [02]: KB4562830
                           [03]: KB4570334
                           [04]: KB4577266
                           [05]: KB4577586
                           [06]: KB4580325
                           [07]: KB4586864
                           [08]: KB4589212
                           [09]: KB4593175
                           [10]: KB4598481
                           [11]: KB5001330
                           [12]: KB5001405
Network Card(s):           1 NIC(s) Installed.
                           [01]: Realtek PCIe 2.5GbE Family Controller
                                 Connection Name: Ethernet
                                 DHCP Enabled:    Yes
                                 DHCP Server:     10.0.2.1
                                 IP address(es)
                                 [01]: 10.0.2.238
                                 [02]: fe80::157f:bd21:cfc3:2bf6
Hyper-V Requirements:      VM Monitor Mode Extensions: Yes
                           Virtualization Enabled In Firmware: Yes
                           Second Level Address Translation: Yes
                           Data Execution Prevention Available: Yes

Also, processor info:


C:\WINDOWS\system32>wmic cpu get caption, deviceid, name, numberofcores, maxclockspeed, status
Caption                                DeviceID  MaxClockSpeed  Name                                       NumberOfCores
Intel64 Family 6 Model 165 Stepping 5  CPU0      3792           Intel(R) Core(TM) i7-10700K CPU @ 3.80GHz  8

I'm still looking into this, and I'm not fully educated in intel VT-x yet as I'm still learning so please take this lightly.

Excerpt from another VT-x HV, I see in Bareflank you guys do set CR4.VMXE = 1, but I don't see anyplace where CR0 is checked/set. I'm probably wrong here, but why not bring it up.

    // In order to enter the VMX-mode, the bits in CR0 and CR4 have to be
    // certain values as indicated by the FIXED0 and FIXED1 MSRs. The rule is
    // summarized as below:
    //
    //        IA32_VMX_CRx_FIXED0 IA32_VMX_CRx_FIXED1 Meaning
    // Bit X  1                   (Always 1)          The bit X of CRx is fixed to 1
    // Bit X  0                   1                   The bit X of CRx is flexible
    // Bit X  (Always 0)          0                   The bit X of CRx is fixed to 0
    //
    // See: A.7 VMX-FIXED BITS IN CR0
    //
    // "Ensure the current processor operating mode meets the required CR0 fixed
    //  bits (...). Other required CR0 fixed bits can be detected through the
    //  IA32_VMX_CR0_FIXED0 and IA32_VMX_CR0_FIXED1 MSRs."
    // See: 31.5 VMM SETUP & TEAR DOWN
    //
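
The rule in that comment reduces to two masks: a bit set in FIXED0 must be 1 in the control register, and a bit clear in FIXED1 must be 0. Here is a minimal sketch of a check built on that rule, written as pure functions on values so it does not need the MSR intrinsics. The function names are hypothetical and do not come from Bareflank or from the hypervisor quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* Bits that FIXED0 requires to be 1 but that are currently 0 in CRx. */
static uint64_t
cr_bits_missing(uint64_t cr, uint64_t fixed0)
{
    return fixed0 & ~cr;
}

/* Bits that FIXED1 requires to be 0 but that are currently 1 in CRx. */
static uint64_t
cr_bits_forbidden(uint64_t cr, uint64_t fixed1)
{
    return cr & ~fixed1;
}

/* Returns nonzero when CRx satisfies both fixed-bit MSRs. */
static int
cr_is_vmx_compatible(uint64_t cr, uint64_t fixed0, uint64_t fixed1)
{
    /* A bit cannot be fixed to 1 and fixed to 0 at the same time. */
    assert((fixed0 & ~fixed1) == 0);

    return cr_bits_missing(cr, fixed0) == 0 && cr_bits_forbidden(cr, fixed1) == 0;
}
```

As an illustration with made-up values (not read from any machine in this thread): with FIXED0 = 0x80000021 and FIXED1 = 0xFFFFFFFF, a CR0 of 0x80000011 is missing bit 5, so cr_bits_missing() returns 0x20 and the register would have to be adjusted before VMXON.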

@nicholasdkunes it is definitely possible that we are not setting CR0 or CR4 correctly in the UEFI loader, and those fixed registers are NOT being checked right now. These checks should be in https://github.com/Bareflank/hypervisor/blob/master/loader/src/x64/intel/check_for_hve_support.c

The UEFI loader should be reading these MSRs and setting CR0 and CR4 correctly before this check is run, so there are two issues here. That code is brand new, so it's not a huge surprise that something is missing here. I can add this code, or you are more than welcome to add it yourself and provide a PR. It is up to you.

If you have time to add it, go for it. I'd appreciate it.

OK, update, I added intrin for load cr0, and set the flags per VT-x standard. I am now able to call vmxon successfully, yet the HV fails elsewhere now. I shall start debugging the next issue and try to fix it.

Do you want me to open a new ticket about general UEFI issues and resolve them as I go? Or shall we just keep this ticket open.

This is the solution:

    #define IA32_VMX_CR0_FIXED0 0x486
    #define IA32_VMX_CR0_FIXED1 0x487

    uint64_t cr0;
    bferror("attempting intrin lcr0");

    /* Clear every bit that FIXED1 reports as fixed to 0, */
    /* then set every bit that FIXED0 reports as fixed to 1. */
    cr0 = intrinsic_scr0();
    cr0 &= intrinsic_rdmsr(IA32_VMX_CR0_FIXED1);
    cr0 |= intrinsic_rdmsr(IA32_VMX_CR0_FIXED0);
    intrinsic_lcr0(cr0);

I do this immediately after Bareflank sets VMXE.

New failure is at

    if (demote(g_mk_args[cpu], g_mk_state[cpu], g_root_vp_state[cpu])) {
        bferror("demote failed");
        goto demote_failed;
    }

Can you send me the before and after values of CR0 and CR4 for that code? That will help me determine which CR0 and CR4 bits the CPU thinks have to be enabled, which will also tell me which feature might be causing the issue.

20210417_194528~2

Here is a more accurate photo, ignore the previous. I was printing pre CR4 twice. These values are correct.

Pre CR0 is wrong for some reason, this is the true value, I had a bug in my output: 0x80010013

Everything else is correct, triple checked...

Take note that it does not page fault on demote, and that demote just fails. The shell still functions after the failure.

@nicholasdkunes Ok, so the only two bits that are flipped are Numeric Error and VMXE. I am not sure why NE is not turned on by UEFI, but either way, that is a simple fix, and like I said, we are clearly missing some code in both the generic portion of the loader WRT checks, and the loader itself.
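
Comparing the two register values is a one-line XOR, and naming the changed bits makes the result easier to read than a photo of hex values. The sketch below is an illustration only. The "before" value 0x80010013 is the corrected pre-CR0 value reported above; the "after" value 0x80010033 is inferred from the statement that NE was the only CR0 bit that changed, since the photo itself is not reproduced here. The table and function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct cr_bit_t {
    unsigned bit;
    const char *name;
};

/* Architecturally defined CR0 flags. */
static const struct cr_bit_t g_cr0_bits[] = {
    {0, "PE"},  {1, "MP"},  {2, "EM"},  {3, "TS"},  {4, "ET"},  {5, "NE"},
    {16, "WP"}, {18, "AM"}, {29, "NW"}, {30, "CD"}, {31, "PG"},
};

/* Returns the flag name for a CR0 bit position, or "reserved". */
static const char *
cr0_bit_name(unsigned bit)
{
    for (size_t i = 0; i < sizeof(g_cr0_bits) / sizeof(g_cr0_bits[0]); ++i) {
        if (g_cr0_bits[i].bit == bit) {
            return g_cr0_bits[i].name;
        }
    }
    return "reserved";
}

/* Mask of the bits that differ between two register values. */
static uint64_t
cr_changed_bits(uint64_t before, uint64_t after)
{
    return before ^ after;
}

/* Prints one line per CR0 bit that differs between the two values. */
static void
print_cr0_changes(uint64_t before, uint64_t after)
{
    uint64_t changed = cr_changed_bits(before, after);

    for (unsigned bit = 0; bit < 64; ++bit) {
        if (changed & (UINT64_C(1) << bit)) {
            printf("CR0 bit %u (%s): %d -> %d\n", bit, cr0_bit_name(bit),
                   (int)((before >> bit) & 1), (int)((after >> bit) & 1));
        }
    }
}
```

Calling print_cr0_changes(0x80010013, 0x80010033) prints "CR0 bit 5 (NE): 0 -> 1", which matches the Numeric Error observation.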

If demote is failing, it means that the microkernel found another issue, that is likely unrelated to this. The best way to determine what the issue is would be to hook up serial to another machine. Since your machine is not halting, but instead is returning with the error, it is possible that we might be able to dump the ring buffer on failure, so I will look into that too.

Your 10700 is a CPU we have to support on our end as well, so this is pretty important. Can you tell me what motherboard you are using? I might try to replicate your system for further testing on my end.

@nicholasdkunes Turns out the NE bit was a regression. We had this same issue in version 2 of Bareflank and had code in there to manually enable the NE bit.

@nicholasdkunes Ok, in my personal branch (which has a lot of changes that will be pushed soon), I added the CR0 and CR4 bits (both the UEFI part and the loader's general sanity checks) and verified that they are working. From there, on my Dell XPS I was able to repro the issue. There were a number of Dell UEFI specific issues that I had to sort out, and I also added the dump function to the loader for UEFI so that if an error occurs, you will get the microkernel's output, which is helpful.

Basically, on Intel, we are getting a VM entry failure, which means that the VMCS guest state is not correct. This shouldn't be too hard to track down as the same code is working great from Linux and Windows, so there is some UEFI specific state that is not correct, above and beyond the NE bit that was not set.

Once I sort this out I will report back.

Hi @rianquinn, thanks for the help on this. It's helping me understand VT-x a bit more as I'm new to this.

Can you explain what it means for the VMCS guest state to be incorrect?

@nicholasdkunes That's awesome. The whole point of this project is to provide people with the ability to create their own hypervisors, as well as teach people how to write hypervisors. We even have several GSoC projects that we have been awarded designed to do exactly this, so that makes me really happy to hear that you are learning as you go with this.

Read up on Chapter 26 in the Intel Manual. It has a series of checks that all need to pass before an entry will occur. I think the issue is with the TR guest register. I noticed an issue with this when trying to implement "stop" for UEFI, which I never did as we actually need to turn the EFI stuff into a device driver if we want to do that, and it's just not on my priority list. But when I was doing that research, I noticed that (even on AMD), TR is not set at all. This leads to the unusable bit being set, which Chapter 26 states is not valid. This would also be a big difference between UEFI and Windows/Linux as Windows/Linux definitely has this set up.

What is puzzling to me is that we should be seeing this issue with the old stuff (i.e. v2.x). @connojd Might know more if that is actually something we are seeing or not (maybe we fixed it downstream). It could also be possible that we are simply not touching something in UEFI with the new code (as it is very basic, even compared to the old stuff) that is causing UEFI to leave TR unused. You really only need to set TR if you plan to use interrupts.

Either way, I am pretty sure that TR is the issue. I will investigate more on what we can do here to fix this. There are many options (we could just set TR to whatever we want honestly as it would not be used), but there might be a way to get UEFI to set TR for us.
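
For reference, the VM-entry checks that apply to the guest TR can be written down as a small predicate. This is a sketch of the rules as the Intel SDM's guest segment register checks describe them (TI flag clear in the selector, busy TSS type, S = 0, P = 1, reserved bits zero, granularity consistent with the limit, unusable bit clear). It is not a complete list of VM-entry checks and it is not Bareflank code. The macro and function names are hypothetical, and the access-rights layout is the 32-bit VMCS format.

```c
#include <stdbool.h>
#include <stdint.h>

#define SELECTOR_TI       UINT32_C(0x00000004) /* selector bit 2: table indicator */
#define AR_TYPE_MASK      UINT32_C(0x0000000F) /* access rights bits 3:0 */
#define AR_S              UINT32_C(0x00000010) /* bit 4: descriptor type */
#define AR_P              UINT32_C(0x00000080) /* bit 7: present */
#define AR_RESERVED_11_8  UINT32_C(0x00000F00)
#define AR_G              UINT32_C(0x00008000) /* bit 15: granularity */
#define AR_UNUSABLE       UINT32_C(0x00010000) /* bit 16: segment unusable */
#define AR_RESERVED_31_17 UINT32_C(0xFFFE0000)

#define TYPE_TSS16_BUSY UINT32_C(3)
#define TYPE_TSS_BUSY   UINT32_C(11)

/* Returns true when the guest TR fields would pass the checks listed above. */
static bool
guest_tr_is_valid(uint16_t selector, uint32_t access_rights, uint32_t limit, bool ia32e_guest)
{
    uint32_t type = access_rights & AR_TYPE_MASK;

    if (selector & SELECTOR_TI) {
        return false; /* TR must reference the GDT, not the LDT */
    }
    if (access_rights & AR_UNUSABLE) {
        return false; /* the case discussed in this thread */
    }
    if (type != TYPE_TSS_BUSY && !(type == TYPE_TSS16_BUSY && !ia32e_guest)) {
        return false; /* must be a busy TSS; the 16-bit form only outside IA-32e mode */
    }
    if ((access_rights & AR_S) || !(access_rights & AR_P)) {
        return false; /* system descriptor that is present */
    }
    if (access_rights & (AR_RESERVED_11_8 | AR_RESERVED_31_17)) {
        return false;
    }
    if ((limit & UINT32_C(0xFFF)) != UINT32_C(0xFFF) && (access_rights & AR_G)) {
        return false; /* a 0 in limit bits 11:0 requires G = 0 */
    }
    if ((limit & UINT32_C(0xFFF00000)) != 0 && !(access_rights & AR_G)) {
        return false; /* a 1 in limit bits 31:20 requires G = 1 */
    }
    return true;
}
```

An access-rights value of 0x8B (present, busy 64-bit TSS) with limit 0x67 passes, while a TR that was never loaded and is therefore marked unusable (0x10000) does not.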

@nicholasdkunes and @connojd SimpleVisor sets up the TSS manually here:
https://github.com/ionescu007/SimpleVisor/blob/master/uefi/shvos.c#L158

So basically, they had the same issue that we are having. So... again, why does the old stuff not have this same issue?

Either way, I will add this same logic. It will likely take a little bit of time as this is a fair amount of GDT crap to write, but it should at least address the TR issue, and hopefully get UEFI booting on Intel. It is interesting how this is also not an issue on AMD, but again, I am really puzzled as to why this is the first time I have seen this, as it should have been an issue on the old stuff. I would love to see what the TR register is when we boot from UEFI using the old stuff.

@rianquinn Wow thanks for the explanations. Super helpful. I joined the slack the other day and you pointed to someone the awesome virtualization repo, and I have been heavily studying that for the past two days. It's what helped me solve the CR0 issue! I'm proud of that fix, even though it may seem small to you big guys but it was my first helpful solution and a good start to understanding entering the VM state on Intel.

In regards to why old Bareflank doesn't show support for Intel, it's more than likely because old Bareflank required "Extended APIs" to support EFI, which you already know, but it may be useful to check out this old revision of EAPIs before deprecation...

https://github.com/Bareflank/extended_apis/tree/7df24afb39c3dac466ca50a59a8efa1995ebb387

OK, so, per your request I've looked into 26.

26.2.3 Checks on Host Segment and Descriptor-Table Registers:

"The selector fields for ... TR cannot be 0000H."

So in this case, the Host TR is not Zero. The Guest TR is however. To truly understand virtualization, you should be able to explain the difference. What is the difference between the host and the guest?

How do you differentiate between the host TR and the guest TR before virtualization occurs? I figured running the intrinsic STR for Intel would provide the Host TR, which is 0x0 from my tests before VMXON.

From my limited understanding, the TR (task register) segment selector points to the task state segment (TSS) in the global descriptor table (GDT).

The TSS stores the processor register state amongst other things. This is very important for context switching like if an interrupt fires, we can store the state of the processor and execute the ISR, then jump back to our previous execution area and re-load the context we were in prior to the interrupt firing.

In virtualization, the host (from what I understand) should have its own GDT, its own TSS, and its own TR, while VT-x allows the guest OS to access its GDT and TSS via its own TR. Since we are creating the context for the CPU, must we also create a TSS entry in the GDT and point the guest's TR segment selector to that TSS entry? Hopefully I'm close to understanding that concept...

Regardless, to answer your question, the host machine interacts with the hardware in the case of our UEFI hypervisor. While the guest is technically (at least on Bareflank) on the same machine as the host, the guest only has access to the virtualized system, provided by the host via virtualized memory & virtualized CPU.

@nicholasdkunes The host is the "hypervisor". In our case, this is our microkernel. So it is not UEFI, nor is it Linux, Windows or anything else. Everything that is in the "host" state of the VMCS is our microkernel. This is not the case for other hypervisors that use Windows/Linux to implement the hypervisor like KVM. The host has to have its own state, so the job of the loader is to create this state and then transition into it so that the microkernel can run. Before it does that, it saves off the current state so that it can put it into the Guest State.

The guest is the state associated with whatever you are going to launch into. In this case, it is UEFI. Once the hypervisor is running, UEFI will be in a VM, and therefore it is the guest, which is why we save off the state before loading the microkernel. Once the hypervisor is running, we will launch back into UEFI and continue from there, but from a Guest point of view.

Give this a try:
https://github.com/rianquinn/hypervisor

It solves all of the UEFI issues on the couple of test systems that I have. The main changes were:

  • Add CR0/CR4 checks to the loader
  • Set up CR0 and CR4 properly from UEFI
  • Create a new GDT with a custom TSS in it for each core and load TR as needed before transitioning to the microkernel.
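
The third change hinges on encoding a TSS descriptor correctly. In 64-bit mode a TSS descriptor is 16 bytes, so it takes two consecutive 8-byte GDT slots. Below is a minimal sketch of that encoding as a pure function. It is not the code from the linked branch; the struct and function names are hypothetical and the layout is written from the Intel SDM's description of system descriptors.

```c
#include <assert.h>
#include <stdint.h>

/* A 64-bit TSS descriptor occupies two consecutive 8-byte GDT entries. */
struct tss_descriptor_t {
    uint64_t low;
    uint64_t high;
};

/* Encodes an available (not busy) 64-bit TSS descriptor with DPL 0 and G = 0. */
static struct tss_descriptor_t
make_tss_descriptor(uint64_t base, uint32_t limit)
{
    struct tss_descriptor_t d;

    assert(limit <= UINT32_C(0xFFFFF)); /* the limit field is 20 bits wide */

    d.low = (uint64_t)(limit & UINT32_C(0xFFFF))             /* limit[15:0]  -> bits 15:0  */
          | ((base & UINT64_C(0xFFFFFF)) << 16)              /* base[23:0]   -> bits 39:16 */
          | (UINT64_C(0x89) << 40)                           /* P=1, DPL=0, S=0, type=9 */
          | ((uint64_t)((limit >> 16) & UINT32_C(0xF)) << 48) /* limit[19:16] -> bits 51:48 */
          | (((base >> 24) & UINT64_C(0xFF)) << 56);         /* base[31:24]  -> bits 63:56 */
    d.high = base >> 32;                                     /* base[63:32]; upper dword reserved */
    return d;
}

/* Selector for a GDT index with TI = 0 and RPL = 0, as LTR expects. */
static uint16_t
gdt_selector(unsigned index)
{
    return (uint16_t)(index << 3);
}
```

For example, a TSS at base 0x12345678 with limit 0x67 encodes to a low quadword of 0x1200893456780067 and a high quadword of 0. Loading that selector with LTR changes the descriptor type from 9 (available) to 11 (busy), which is the type VM entry requires for the guest TR.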

Please note that my master branch has a LOT of code changes above and beyond that as I work to get a pretty large PR together that I will hopefully be able to push by the end of the week. This also includes the nested paging example.

> The host is the "hypervisor". In our case, this is our microkernel. So it is not UEFI, nor is it Linux, Windows or anything else. Everything that is in the "host" state of the VMCS is our microkernel. This is not the case for other hypervisors that use Windows/Linux to implement the hypervisor like KVM. The host has to have its own state, so the job of the loader is to create this state and then transition into it so that the microkernel can run. Before it does that, it saves off the current state so that it can put it into the Guest State.

So let me get this straight: Bareflank is a single machine that virtualizes itself for the purpose of virtualizing itself. The hypervisor is the microkernel, which is loaded by the UEFI module. The UEFI module creates the host state and enters that state for MK execution.

If UEFI is virtualized in this case, if I booted an OS via UEFI, would the underlying OS be virtualized too? I'd imagine so, as it's booted by the virtualized UEFI.

Thanks for the info, and thanks for the commit. I've been trying to implement it myself, in my working copy I was able to implement CR0/CR4 checks & setting them up properly. I created a new GDT with a custom TSS in it, but was failing to load TR, kept getting a fault. I'll see what you did differently and learn from it. Thanks a lot.

Is there a way that I can keep asking questions about Bareflank and how it works and how HVs work in general to continue this conversation? If so let me know, I know you're busy and not a 24/7 support guy, but if you're willing to converse it would be helpful.

There are a bunch of BSL errors on compiling your project to test immediately, will look into seeing what's wrong in a bit.

I changed the cmake to pull down your BSL-master instead of Bareflank BSL, and that resolved the BSL errors.

@rianquinn It started, weird con out but it started. Thanks for all the help. Time to spend the rest of my night understanding what the difference is so I don't run into this again ever!

20210419_190334

Edit: Ah god, I didn't check to see if it halted after start. It did. I'll look into it...

@nicholasdkunes Absolutely, you can ask whatever questions you want. I suggest getting onto our slack channel instead of using the issue tracker as there are a lot of people on slack that can help in addition to myself. It's an awesome community of experts from all sorts of projects.

The weird output is the color codes. If you are going to use UEFI, I suggest turning off colors. Use "make info" to see all of your configuration options (ninja info if you are on Windows). Color is one of them.

When you say that it halted after start, what do you mean? Without a proper extension to implement the proper emulation, you cannot really do much with the default extension outside of simple things like "ls" from UEFI after the hypervisor has started. Trying to start Windows/Linux for example will halt the system for sure. I am slowly working on an example extension that will implement the needed support for UEFI to load an OS like Windows/Linux, but I suspect it will take a couple of weeks. It is not simple as you need to emulate CR0/CR4 access including the TLB operations that occur, and you also need to properly handle INIT/SIPI for multicore.

Finally, I would get yourself serial support. Trying to work on a hypervisor without proper debugging is a total nightmare. Ideally, your motherboard should have a serial device that you can connect to with another machine. If it doesn't, you can use a PCI serial device, but you would need to sort out how to get the serial device to start. The other option is you could port XUE, which is in our MicroV/mono branch right now, which would get you USB debugging. Personally... if you are really going to dive into this and learn... make sure you have a motherboard with a native serial port (i.e., no PCI device needed). It is a million times easier.

In this case, your halted system likely has a nice output containing debugging information about what caused the halt... you just need to get access to it, and serial is really the easiest way.

There is still a lot to learn regarding the UEFI side of this, so I'm happy to wait the weeks till Bareflank has OS support (if-ever). I do have native serial and have been meaning to try it. I'll get it setup and start debugging properly. Thanks for the tips.

You can close this issue, I believe it is solved, as bareflank successfully starts in UEFI now :)