Cannot reproduce the snapshot for HEVD fuzzer

Question

mkubx opened this issue 2 years ago · comments

Hello. I followed the instructions here and installed W10 inside Hyper-V VM, 4GB RAM, 1 CPU. I also disabled KVA and used lockmem.exe when doing the memory dump (!bdump).

Still, when I run the fuzzer (I modified only ExGenRandom to reflect my Windows version), I am getting the following errors:

Master

Seeded with 367867526817988208
Iterating through the corpus..
Sorting through the 1 entries..
Running server on tcp://localhost:31337..
#0 cov: 0 (+0) corp: 0 (0.0b) exec/s: -nan (1 nodes) lastcov: 2.0s crash: 0 timeout: 0 cr3: 0 uptime: 2.0s
Could not receive size (-1)
Receive failed
#0 cov: 0 (+0) corp: 0 (0.0b) exec/s: -nan (0 nodes) lastcov: 3.0s crash: 0 timeout: 0 cr3: 0 uptime: 3.0s

Fuzzer node

Initializing the debugger instance.. (this takes a bit of time)
Setting debug register status to zero.
Setting debug register status to zero.
Dialing to tcp://localhost:31337/..
VirtTranslate failed for GVA:0xfffff80711b78600

Here is the dump of my image.

I also ran some sanity checks to make sure the code was not swapped out, as mentioned in #41; for instance, KERNELBASE!DeviceIoControl disassembles correctly. How can I debug this issue? I tried creating images 10-15 times, with more or less the same result. Also, I have no problem running the fuzzer with the snapshot you provided. Thanks.

Axel Souchet · Answer 1 · Wed Sep 28 2022 23:57:18 GMT+0800 (China Standard Time)

Hey @mkubx,

thanks for the detailed report! I haven't had a chance to look at things in detail yet, but here are a few ideas if you'd like to try to debug this yourself (I will take a look at it this week or this weekend):

When encountering this kind of error message, the best approach is to open the dump in windbg and check things there. In this case, we can clearly see there's no memory at that address:

kd> db 0xfffff80711b78600
fffff807`11b78600  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????
fffff807`11b78610  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????
fffff807`11b78620  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????
fffff807`11b78630  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????
fffff807`11b78640  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????
fffff807`11b78650  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????
fffff807`11b78660  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????
fffff807`11b78670  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

kd> !address 0xfffff80711b78600
Mapping user range ...
Mapping system range ...
Mapping non addressable range ...
Mapping page tables...
Mapping hyperspace...
Mapping HAL reserved range...
Mapping User Probe Area...
Mapping system shared page...
Mapping system cache working set...
Mapping loader mappings...
Mapping system PTEs...
Mapping system paged pool...
Mapping session space...
Mapping dynamic system space...
Mapping PFN database...
Mapping non paged pool...
Mapping VAD regions...
Mapping module regions...
Mapping process, thread, and stack regions...
Mapping system cache regions...


Usage:                  Module
Base Address:           fffff807`11af0000
End Address:            fffff807`11b7c000
Region Size:            00000000`0008c000
VA Type:                BootLoaded
Module name:            HEVD.sys
Module path:            [\??\C:\Users\user\Downloads\hevd-unpacked\driver\vulnerable\x64\HEVD.sys]

If you would like to understand how the code ends up accessing 0xfffff80711b78600, you should generate a rip trace (--backend bxcpu --trace-type rip) and then symbolize the trace w/ symbolizer. This way you can figure out very precisely how you end up where you end up.

Hope this helps for now!

Cheers

mkubx · Answer 2 · Thu Sep 29 2022 18:59:57 GMT+0800 (China Standard Time)

Hey @0vercl0k, thank you for your time and support. I still have no idea what's happening and I am curious to figure it out, here is what I got so far.

0xfffff80711b78600 is not mapped even on the live machine.
I tracked the values with the same IOCTL in my debugger as used during fuzzing and it exited successfully without error.
But there was a difference compared with my trace logs after the syscall happened:

ntdll!NtDeviceIoControlFile+0x3:
0033:00007ff8`273ad0d3 b807000000      mov     eax,7
kd> 
ntdll!NtDeviceIoControlFile+0x8:
0033:00007ff8`273ad0d8 f604250803fe7f01 test    byte ptr [SharedUserData+0x308 (00000000`7ffe0308)],1
kd> 
ntdll!NtDeviceIoControlFile+0x10:
0033:00007ff8`273ad0e0 7503            jne     ntdll!NtDeviceIoControlFile+0x15 (00007ff8`273ad0e5)
kd> 
ntdll!NtDeviceIoControlFile+0x12:
0033:00007ff8`273ad0e2 0f05            syscall
kd> 
ntdll!NtDeviceIoControlFile+0x14:
0033:00007ff8`273ad0e4 c3              ret
kd> 
KERNELBASE!DeviceIoControl+0x6b:
0033:00007ff8`24eeb0bb 0f1f440000      nop     dword ptr [rax+rax]
kd> r rax
rax=0000000000000000
kd> t
KERNELBASE!DeviceIoControl+0x70:
0033:00007ff8`24eeb0c0 3d03010000      cmp     eax,103h
kd> 
KERNELBASE!DeviceIoControl+0x75:
0033:00007ff8`24eeb0c5 7520            jne     KERNELBASE!DeviceIoControl+0x97 (00007ff8`24eeb0e7)
kd> 
KERNELBASE!DeviceIoControl+0x97:
0033:00007ff8`24eeb0e7 85c0            test    eax,eax
kd> 
KERNELBASE!DeviceIoControl+0x99:
0033:00007ff8`24eeb0e9 0f8808010000    js      KERNELBASE!DeviceIoControl+0x1a7 (00007ff8`24eeb1f7)
kd> 
KERNELBASE!DeviceIoControl+0x9f:
0033:00007ff8`24eeb0ef 488b8c24a0000000 mov     rcx,qword ptr [rsp+0A0h]
kd> 
KERNELBASE!DeviceIoControl+0xa7:
0033:00007ff8`24eeb0f7 4885c9          test    rcx,rcx
kd> 
KERNELBASE!DeviceIoControl+0xaa:
0033:00007ff8`24eeb0fa 7406            je      KERNELBASE!DeviceIoControl+0xb2 (00007ff8`24eeb102)
kd> 
KERNELBASE!DeviceIoControl+0xac:
0033:00007ff8`24eeb0fc 8b442458        mov     eax,dword ptr [rsp+58h]
kd> 
KERNELBASE!DeviceIoControl+0xb0:
0033:00007ff8`24eeb100 8901            mov     dword ptr [rcx],eax
kd> 
KERNELBASE!DeviceIoControl+0xb2:
0033:00007ff8`24eeb102 b801000000      mov     eax,1
kd> 
KERNELBASE!DeviceIoControl+0xb7:
0033:00007ff8`24eeb107 488b5c2470      mov     rbx,qword ptr [rsp+70h]
kd> 
KERNELBASE!DeviceIoControl+0xbc:
0033:00007ff8`24eeb10c 4883c460        add     rsp,60h
kd> 
KERNELBASE!DeviceIoControl+0xc0:
0033:00007ff8`24eeb110 5f              pop     rdi
kd> 
KERNELBASE!DeviceIoControl+0xc1:
0033:00007ff8`24eeb111 c3              ret
kd> 
KERNEL32!DeviceIoControlImplementation+0x81:
0033:00007ff8`267f5611 0f1f440000      nop     dword ptr [rax+rax]
kd> 
KERNEL32!DeviceIoControlImplementation+0x86:
0033:00007ff8`267f5616 488b5c2450      mov     rbx,qword ptr [rsp+50h]

In my symbolized trace, I can see:

ntdll!NtDeviceIoControlFile+0x10
ntdll!NtDeviceIoControlFile+0x12
nt!KiSystemCall64+0x0
nt!KiSystemCall64+0x3
nt!KiSystemCall64+0xc
nt!KiSystemCall64+0x15
nt!KiSystemCall64+0x17
...
nt!IopSynchronousServiceTail+0x12d
nt!IopSynchronousServiceTail+0x137
nt!IopSynchronousServiceTail+0x13d
nt!IopSynchronousServiceTail+0x140
nt!IopSynchronousServiceTail+0x146
GetNameByOffset failed with hr=80004005
ioctl.bin.trace:1090: Symbolization of 0 failed ('0x'), skipping

So it looks like the fuzzer makes an invalid dereference inside the syscall. I used a new state for this output, but the error looks identical.

trace file
symbolized trace file

Axel Souchet · Answer 3 · Fri Sep 30 2022 00:04:34 GMT+0800 (China Standard Time)

Of course, no worries at all, thanks for giving the tool a shot :)

Hehe what you described seems like an interesting investigation - it doesn't seem to look like anything I've seen which is exciting. I'm happy to take it from there to try to figure out what happens.

Will try to have a look at this this weekend :)

Cheers

mkubx · Answer 4 · Fri Sep 30 2022 02:05:59 GMT+0800 (China Standard Time)

Thank you! Meanwhile I'll play with different targets to see if there is any difference. I wonder whether I am missing something stupid, since it was working for everyone else here. My setup is pretty ordinary: I just created a brand new VM in Hyper-V, then followed the instructions. Also, if you want me to upload the VM or something else, I am happy to assist you.

mkubx · Answer 5 · Fri Sep 30 2022 22:30:56 GMT+0800 (China Standard Time)

I got it working, though I'm still not sure what went wrong. I used a different Windows image (Microsoft Windows [Version 10.0.19044.1288] instead of the previous Microsoft Windows [Version 10.0.19044.2006]) and enabled generation 2 when I created the new VM. Other than this, everything was the same.

Then I took a dump twice, and it came out at around 3.5GB (instead of the usual 1.5GB-1.8GB). I am happy that your fuzzer finally works (it's amazing and thanks for making it public); we can close the issue if you want. Thanks again for your help, I really appreciate it!

Axel Souchet · Answer 6 · Fri Sep 30 2022 23:59:34 GMT+0800 (China Standard Time)

Awesome, I'm glad to hear. I'll keep this issue open until I try to figure out what happened if that's okay - there might be something interesting to learn or maybe an opportunity to fix a bug / improve the tool :) Don't hesitate to reach out if you encounter any other issues though! Cheers

Axel Souchet · Answer 7 · Sat Oct 01 2022 09:25:23 GMT+0800 (China Standard Time)

All right, I had some time to check out this issue after work. The TL;DR is that it seems like HEVD's PAGE section is paged out in your dump.

Here is the longer version of the story; maybe it's also useful for others who want to debug issues they encounter w/ wtf. Put on your seat belt, let's go!

First, I downloaded the provided mem.dmp and regs.json, and I was able to reproduce the issue by launching:

> cd 33
> echo A > foo.bin
> wtf.exe run --name hevd --state=. --input foo.bin --backend=bxcpu
...
Translation of GVA 0xfffff80711b789c0 failed

Okay, cool. One thing that is important to understand is that this message is actually printed by wtf itself, not bxcpu or any other component. wtf has logic to interact with virtual and physical memory (VirtRead* / VirtWrite* / PhysRead* / PhysWrite*).
For this to work wtf also needs to know how to parse page tables and do virtual to physical translation. The error above tells us that wtf wasn't able to translate 0xfffff80711b789c0 to a physical address.
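To make the translation step a bit more concrete, here is a minimal sketch of how a 48-bit x64 virtual address splits into its four page-table indices under standard 4-level paging with 4KB pages. The names `PageIndices_t` and `SplitGva` are hypothetical and are not taken from wtf's source; a real walker would also read each table entry from physical memory and give up as soon as one cannot be followed, for instance when its present bit is clear.

```cpp
#include <cstdint>

// Hypothetical helper type (not part of wtf): the four table indices and the
// page offset encoded in an x64 virtual address (4-level paging, 4KB pages).
struct PageIndices_t {
  uint64_t Pml4;   // bits 47..39
  uint64_t Pdpt;   // bits 38..30
  uint64_t Pd;     // bits 29..21
  uint64_t Pt;     // bits 20..12
  uint64_t Offset; // bits 11..0
};

// Each level consumes 9 bits of the address; the low 12 bits are the offset
// inside the final 4KB page.
constexpr PageIndices_t SplitGva(const uint64_t Gva) {
  return {(Gva >> 39) & 0x1ff, (Gva >> 30) & 0x1ff, (Gva >> 21) & 0x1ff,
          (Gva >> 12) & 0x1ff, Gva & 0xfff};
}
```

For the address in the error message, `SplitGva(0xfffff80711b789c0)` gives the indices 0x1f0, 0x1c, 0x8d and 0x178 with a page offset of 0x9c0; if the walk cannot be completed at any of these levels, the translation fails.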
Why? Well, let's find out. The first thing to check in those cases is to load up the mem.dmp in Windbg and try to read at this address, to see whether this is potentially a bug in wtf or not:

> windbgx -z mem.dmp
...
kd> db 0xfffff80711b789c0
fffff807`11b789c0  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????
fffff807`11b789d0  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

Okay so at this point, Windbg agrees w/ wtf which is good for wtf. The second logical step is to grab a RIP trace of the testcase to be able to follow what happens precisely, so let's do that:

> wtf.exe run --name hevd --state=. --input foo.bin --backend=bxcpu --trace-type=rip

The RIP trace is basically a list of RIP addresses written into a file without any symbolization, which is not super useful. To make it more useful, you need to symbolize it; I created symbolizer to do just that.

> symbolizer.exe --input foo.bin.trace --crash-dump mem.dmp > foo.bin.trace.symbolized.txt

Now, you can open foo.bin.trace.symbolized.txt in a text editor and follow the flow along. It's always good to skim through the file first to get an idea of where the flow goes. In this case, if you skip to the end you can see something weird:

nt!IopSynchronousServiceTail+0xf6
nt!IopSynchronousServiceTail+0x100
GetNameByOffset failed with hr=80004005
foo.bin.trace:1090: Symbolization of 0 failed ('0x'), skipping
[1 / 1] foo.bin.trace done
Completed symbolization of 1k addresses (1 failed) in 0s across 1 files.

As you can see, it seems an error was hit during symbolization of an address which is not expected. Because the symbolized file and the raw file are basically one address per line, it's easy to go to the same line number to see what's going on; here's what I found:

0xfffff8070be75843
0xfffff8070be75846
0xfffff8070be75850
0x

Okay, makes sense. What probably happens is that wtf hits an unrecoverable issue when doing the virt->phys translation and terminates the process on the spot. The trace file's buffer is never flushed, so some data is missing.

Cool, so let's just update BochscpuBackend_t::BeforeExecutionHook's code to add a fflush call:

#ifdef WINDOWS
__declspec(safebuffers)
#endif
    void BochscpuBackend_t::BeforeExecutionHook(
        /*void *Context, */ uint32_t, void *Ins) {
  // ...

      fmt::print(TraceFile_, "{:#x}\n", Rip);
      fflush(TraceFile_);

Recompile, re-generate a trace, re-symbolize it and bingo we get a trace that doesn't stop abruptly now. Here is the end of the symbolized trace now:

HEVD+0x855c3
HEVD+0x855c5
HEVD+0x855cb
HEVD+0x855cd
HEVD+0x855d3
HEVD+0x855d5
HEVD+0x855d7
HEVD+0x855d9
HEVD+0x855db
HEVD+0x855e0
HEVD+0x855e7
HEVD+0x855ea
nt!DbgPrintEx+0x0

Okay, so the hevd fuzzing module sets a breakpoint on DbgPrintEx that dumps the format string and basically nops those calls:

if (!g_Backend->SetBreakpoint("nt!DbgPrintEx", [](Backend_t *Backend) {
    const Gva_t FormatPtr = Backend->GetArgGva(2);
    const std::string &Format = Backend->VirtReadString(FormatPtr);
    DebugPrint("DbgPrintEx: {}", Format);
    Backend->SimulateReturnFromFunction(0);
  })) {
    DebugPrint("Failed to SetBreakpoint DbgPrintEx\n");
    return false;
}

What happens here is that Backend->VirtReadString() wants to read the format string passed to DbgPrintEx but fails to do so, because the virtual address it receives doesn't appear to be mapped in the page tables.
Based on the code above, the problematic argument is the 3rd one (GetArgGva(2)), which is passed in the r8 register; let's check what the caller code looks like in Windbg:

kd> ub fffff80711b755f0
HEVD+0x855d3:
fffff807`11b755db ba03000000      mov     edx,3
fffff807`11b755e0 4c8d05d9330000  lea     r8,[HEVD+0x889c0 (fffff807`11b789c0)]
fffff807`11b755e7 8d4a4a          lea     ecx,[rdx+4Ah]
fffff807`11b755ea ff1518caf7ff    call    qword ptr [HEVD+0x2008 (fffff807`11af2008)]

kd> dqs fffff807`11af2008
fffff807`11af2008  fffff807`0bb7d1d0 nt!DbgPrintEx

Okay, so we clearly see the call that will invoke DbgPrintEx, and we can see the r8 register being loaded with 0xfffff80711b789c0, which is the address that doesn't get translated.
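As a cross-check of the disassembly, the two targets Windbg shows can be recomputed by hand: both the lea and the call use RIP-relative addressing, where the target is the address of the next instruction plus a signed 32-bit displacement. The sketch below is only an illustration; `RipRelativeTarget` is a hypothetical helper, and the displacements are read from the instruction bytes in the listing (d9 33 00 00 and 18 ca f7 ff, little-endian).

```cpp
#include <cstdint>

// Hypothetical helper: target of a RIP-relative operand. RIP points at the
// *next* instruction when the displacement is applied, hence InsAddr + InsLen.
constexpr uint64_t RipRelativeTarget(const uint64_t InsAddr,
                                     const uint64_t InsLen,
                                     const int32_t Disp) {
  // Sign-extend the displacement to 64 bits before the (modular) addition.
  return InsAddr + InsLen + static_cast<uint64_t>(static_cast<int64_t>(Disp));
}

// lea r8,[rip+0x33d9]: 7 bytes long, located at fffff807`11b755e0.
constexpr uint64_t FormatStringGva =
    RipRelativeTarget(0xfffff80711b755e0ULL, 7, 0x33d9);

// call qword ptr [rip-0x835e8]: 6 bytes long, located at fffff807`11b755ea.
// The displacement bytes 18 ca f7 ff encode 0xfff7ca18, i.e. -0x835e8.
constexpr uint64_t DbgPrintExIatSlot =
    RipRelativeTarget(0xfffff80711b755eaULL, 6, -0x835e8);
```

`FormatStringGva` comes out as 0xfffff80711b789c0 and `DbgPrintExIatSlot` as 0xfffff80711af2008, matching what Windbg printed.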
Uh, interesting. Let's dump the PE header of HEVD and check where this address is supposed to be pointing at:

kd> !address fffff807`11b78970
...
VA Type:                BootLoaded
Module name:            HEVD.sys
Module path:            [\??\C:\Users\user\Downloads\hevd-unpacked\driver\vulnerable\x64\HEVD.sys]

kd> ? fffff807`11b78970 - hevd
Evaluate expression: 559472 = 00000000`00088970

kd> !dh -a hevd
...
SECTION HEADER #5
    PAGE name
    4BB5 virtual size
   85000 virtual address
    4C00 size of raw data
    1A00 file pointer to raw data
       0 file pointer to relocation table
       0 file pointer to line numbers
       0 number of relocations
       0 number of line numbers
60000020 flags
         Code
         (no align specified)
         Execute Read

Okay, so the pointer actually points into the PAGE section, which is a section that can be paged out to disk as the name suggests. This is probably what happens here: the string the pointer refers to is paged out, its content is NOT in the dump, and so wtf bails out. On a real system, accessing this address would trigger a page fault, and the page fault handler would page that memory back in from disk and carry on, but wtf doesn't have any disk so that flow is not possible.

Anyways, we can also dig a bit deeper. Looking at my own build of HEVD, I can read how the IOCTL handler behaves when it receives the 0xdeadbeef IOCTL code; it ends up executing the code below:
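The arithmetic behind that conclusion can be written down as a small sketch. The constants are copied from the `!address` and `!dh -a hevd` output above; the helper names (`ToRva`, `RvaInSection`, `HasNotPagedFlag`) are hypothetical. The flag values are the standard PE section characteristics; treating the absence of the not-paged bit as a sign that the section is pageable is an assumption that fits this dump, not a check wtf performs.

```cpp
#include <cstdint>

// Values copied from the Windbg output above.
constexpr uint64_t HevdBase = 0xfffff80711af0000ULL; // module base address
constexpr uint64_t PageSectionRva = 0x85000;         // "virtual address"
constexpr uint64_t PageSectionVirtualSize = 0x4BB5;  // "virtual size"
constexpr uint32_t PageSectionFlags = 0x60000020;    // "flags"

// PE section characteristic bits (IMAGE_SCN_* in the PE format).
constexpr uint32_t ScnCntCode = 0x00000020;     // IMAGE_SCN_CNT_CODE
constexpr uint32_t ScnMemNotPaged = 0x08000000; // IMAGE_SCN_MEM_NOT_PAGED
constexpr uint32_t ScnMemExecute = 0x20000000;  // IMAGE_SCN_MEM_EXECUTE
constexpr uint32_t ScnMemRead = 0x40000000;     // IMAGE_SCN_MEM_READ

// Hypothetical helpers for illustration only.
constexpr uint64_t ToRva(const uint64_t Gva, const uint64_t Base) {
  return Gva - Base;
}

constexpr bool RvaInSection(const uint64_t Rva, const uint64_t SectionRva,
                            const uint64_t VirtualSize) {
  return Rva >= SectionRva && (Rva - SectionRva) < VirtualSize;
}

constexpr bool HasNotPagedFlag(const uint32_t Flags) {
  return (Flags & ScnMemNotPaged) != 0;
}
```

With these, `ToRva(0xfffff80711b789c0, HevdBase)` is 0x889c0, which `RvaInSection` places inside PAGE (0x85000 up to 0x89BB5), and `HasNotPagedFlag(PageSectionFlags)` is false: the flags decode to Code, Execute and Read only.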

PAGE:00000001400855D7                 mov     edx, 3
PAGE:00000001400855DC                 lea     r8, aInvalidIoctlCo ; "[-] Invalid IOCTL Code: 0x%X\n"
PAGE:00000001400855E3                 lea     ecx, [rdx+4Ah]
PAGE:00000001400855E6                 call    cs:__imp_DbgPrintEx
PAGE:00000001400855EC                 mov     esi, 0C0000010h

That makes sense, the driver basically tells the user that they sent a busted code. It also sets an error code that we can also see in the code from the dump:

kd> u HEVD+0x855ea
HEVD+0x855ea:
fffff807`11b755ea ff1518caf7ff    call    qword ptr [HEVD+0x2008 (fffff807`11af2008)]
fffff807`11b755f0 be100000c0      mov     esi,0C0000010h
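For reference, the value moved into esi, 0xC0000010, is STATUS_INVALID_DEVICE_REQUEST. The sketch below shows how such NTSTATUS values are usually classified; the `cmp eax,103h` and `test eax,eax` / `js` pair in the KERNELBASE!DeviceIoControl trace earlier in this thread appear to be doing these same two checks. The function and constant names are hypothetical stand-ins for the Windows definitions, so the block does not depend on any Windows header.

```cpp
#include <cstdint>

// Well-known NTSTATUS values, redefined here to keep the sketch standalone.
constexpr uint32_t StatusSuccess = 0x00000000;
constexpr uint32_t StatusPending = 0x00000103;
constexpr uint32_t StatusInvalidDeviceRequest = 0xC0000010;

// Equivalent of NT_SUCCESS(): a status counts as success when, read as a
// signed 32-bit value, it is non-negative, i.e. its top bit is clear.
constexpr bool NtSuccess(const uint32_t Status) {
  return (Status & 0x80000000u) == 0;
}

// The top two bits of an NTSTATUS hold the severity; 3 means error.
constexpr uint32_t NtSeverity(const uint32_t Status) { return Status >> 30; }
```

Here `NtSuccess(StatusInvalidDeviceRequest)` is false with a severity of 3 (error), while `StatusPending` (0x103) still counts as a success code, which is why callers typically compare against it separately.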

Cool, everything checks out!

That class of issue is pretty annoying. It happens a LOT in user-mode and that's why I made lockmem. For kernel-mode I actually don't even know how to mimic that same behavior (I've done some experiments on the fbl_kern branch but they haven't panned out yet). Interestingly, I haven't run into that issue in kernel-mode yet; I wonder how much memory you configured your VM with? I usually set mine to 4GB.

In theory, you can also disable the pagefile (which is the disk file where data gets written to) but I haven't played much with this myself.

Anyways, thanks again for the actionable report!

Cheers

mkubx · Answer 8 · Mon Oct 03 2022 22:32:50 GMT+0800 (China Standard Time)

Thank you very much for the detailed writeup! :) I was using 4GB, and since the issue reoccurred on a new VM, I ended up disabling the pagefile. This helped.

Is there the same issue with KVM? How can I take the snapshot on Linux? I know it's probably not supported yet, but what are the options?

Axel Souchet · Answer 9 · Mon Oct 03 2022 23:44:18 GMT+0800 (China Standard Time)

Interesting! Something that might be confusing is that every backend is meant to run Windows code; KVM included. Dumping / running / fuzzing Linux code is not supported.

If you're really interested in that, there's been some work in #102 but it is experimental. I have personally not tried it out.

Cheers

Axel Souchet · Answer 10 · Sat Oct 22 2022 23:45:08 GMT+0800 (China Standard Time)

I'll close this as the original issue was fixed!

Cheers