Segfault in many newer Ubuntu binaries

invliD opened this issue 2 years ago

Sebastian Brückner commented 2 years ago

Description

Starting with Ubuntu impish (21.10), many common binaries segfault more or less immediately. The most prominent is bash, since that's the default command for the ubuntu images, but others such as apt-get crash as well. dash (sh) is fine in all versions I tested. Ubuntu hirsute (21.04) and earlier seem to work for all binaries.

Steps to reproduce

Start a runsc container with the ubuntu:impish or ubuntu:jammy images with the default command. It will not start. Override the command to sh. Attach to the container and run bash. It will segfault.

runsc version

runsc version release-20220228.0
spec: 1.0.2-dev

docker version (if using docker)

I am using containerd:

containerd github.com/containerd/containerd 1.5.5-0ubuntu3~20.04.2

uname

Linux <redacted> 5.4.0-91-generic #102-Ubuntu SMP Fri Nov 5 16:31:28 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

kubectl (if using Kubernetes)

Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.4", GitCommit:"e87da0bd6e03ec3fea7933c4b5263d151aafd07c", GitTreeState:"clean", BuildDate:"2021-02-18T16:12:00Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.14", GitCommit:"57a3aa3f13699cf3db9c52d228c18db94fa81876", GitTreeState:"clean", BuildDate:"2021-12-15T14:47:10Z", GoVersion:"go1.15.15", Compiler:"gc", Platform:"linux/amd64"}

repo state (if built from source)

No response

runsc debug logs (if available)

runsc.log.20220329-214112.988202.create.log
runsc.log.20220329-214113.064374.start.log
runsc.log.20220329-214113.068508.gofer.log
runsc.log.20220329-214113.252275.state.log
runsc.log.20220329-214113.256368.wait.log
runsc.log.20220329-214113.288286.kill.log
runsc.log.20220329-214113.324176.kill.log
runsc.log.20220329-214113.372256.delete.log

Fabricio Voznika · Answer 1 · Thu Mar 31 2022 02:02:19 GMT+0800 (China Standard Time)

Hmm, I can run both ubuntu:impish and ubuntu:jammy without problems. Do you have the runsc.log.*.boot log? It would be the most useful one for finding out why it's segfaulting. The rest of the logs look alright. I'm using containerd v1.5.4, but that is likely not the problem.

Sebastian Brückner · Answer 2 · Thu Mar 31 2022 09:18:15 GMT+0800 (China Standard Time)

Oh, I didn't realize the logs from the Kubernetes pause container were relevant as well. Here's the boot log from that container (from a new run):
runsc.log.20220330-181033.174692.boot.log

Sebastian Brückner · Answer 3 · Mon Apr 04 2022 06:23:30 GMT+0800 (China Standard Time)

I have some more information. Apparently, running these images works fine on all Intel CPUs I have (newest generation being Skylake). These segfaults only seem to happen on my AMD CPUs (all AMD Ryzen 9 5950X).

Fabricio Voznika · Answer 4 · Tue Apr 05 2022 23:41:16 GMT+0800 (China Standard Time)

hmm...interesting. Does the SIGSEGV also happen when you use the ptrace platform? Are you running KVM with bare-metal or nested virtualization?

Michael Pratt · Answer 5 · Tue Apr 05 2022 23:54:26 GMT+0800 (China Standard Time)

The faulting instruction is c5 7d e7 b7 40 30 00 vmovntdq YMMWORD PTR [rdi+0x3040],ymm14. It is surrounded by other similar vmovntdq instructions and is in libc, so I assume this is memcpy (or similar).

That fault address, 0x55e64c532000, matches rdi + 0x3040, which is a good sign. Interestingly, this address is exactly past the end of the heap VMA. The prior instructions were writing lower addresses, so this looks like an attempt to write something without quite enough space on the heap.
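The address arithmetic above can be cross-checked with a small C sketch. The value of rdi is not given in the thread; the one used in the note below is back-computed from the fault address and the 0x3040 displacement, so treat it as an assumption, as is the 4 KiB page size. The helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Effective address of a store of the form disp(%base). */
uint64_t store_addr(uint64_t base, int64_t disp)
{
	return base + (uint64_t)disp;
}

/* Nonzero if addr is the first byte of a page. page_size must be a power of two. */
int is_page_start(uint64_t addr, uint64_t page_size)
{
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
	return (addr & (page_size - 1)) == 0;
}
```

With the assumed rdi of 0x55e64c52efc0, store_addr(rdi, 0x3040) yields 0x55e64c532000, and is_page_start reports that this is the first byte of a page, which is consistent with a store landing immediately after the last page of the heap VMA.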

Michael Pratt · Answer 6 · Tue Apr 05 2022 23:56:44 GMT+0800 (China Standard Time)

> Are you running KVM with bare-metal or nested virtualization?

@fvoznika As an aside, it might be nice to make a best-effort guess at this and log it for future reference, e.g., by checking whether the hypervisor bit is set in CPUID.
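A minimal sketch of such a check, written in C for illustration only (gVisor itself is written in Go, and none of these function names come from its code base). It relies on CPUID leaf 1 reporting the "hypervisor present" flag in ECX bit 31, and on `__get_cpuid` from the GCC/Clang `<cpuid.h>` header. The result is only a guess: a hypervisor may choose to hide the bit.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* CPUID leaf 1, ECX bit 31: set when a hypervisor advertises itself. */
#define CPUID_1_ECX_HYPERVISOR (UINT32_C(1) << 31)

/* Pure helper so the bit test can be checked without executing CPUID. */
int hypervisor_bit_set(uint32_t leaf1_ecx)
{
	return (leaf1_ecx & CPUID_1_ECX_HYPERVISOR) != 0;
}

/* Returns 1 if the bit is set, 0 if clear, -1 if CPUID leaf 1 is unavailable. */
int running_under_hypervisor(void)
{
#if defined(__x86_64__) || defined(__i386__)
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
		return -1;
	return hypervisor_bit_set(ecx);
#else
	return -1; /* not an x86 build */
#endif
}

/* Hypothetical logging hook: records the guess for later debugging. */
void log_virtualization_guess(FILE *out)
{
	assert(out != NULL);
	switch (running_under_hypervisor()) {
	case 1:
		fprintf(out, "CPUID hypervisor bit set: likely nested virtualization\n");
		break;
	case 0:
		fprintf(out, "CPUID hypervisor bit clear: likely bare metal\n");
		break;
	default:
		fprintf(out, "CPUID leaf 1 unavailable: cannot guess\n");
		break;
	}
}
```

Calling log_virtualization_guess(stderr) once at startup would be enough to leave a hint in the debug logs.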

Sebastian Brückner · Answer 7 · Wed Apr 06 2022 09:11:46 GMT+0800 (China Standard Time)

> Does the SIGSEGV also happen when you use the ptrace platform?

It does not; it only appears on the kvm platform.

> Are you running KVM with bare-metal or nested virtualization?

Kubernetes, containerd, and thus gVisor are all running on bare metal.

Andrei Vagin · Answer 8 · Thu Apr 07 2022 06:52:32 GMT+0800 (China Standard Time)

   0x00007feb7a5c7554 <__memmove_avx_unaligned_erms+1396>:	vmovntdq %ymm0,(%rdi)
   0x00007feb7a5c7558 <__memmove_avx_unaligned_erms+1400>:	vmovntdq %ymm1,0x20(%rdi)
   0x00007feb7a5c755d <__memmove_avx_unaligned_erms+1405>:	vmovntdq %ymm2,0x40(%rdi)
   0x00007feb7a5c7562 <__memmove_avx_unaligned_erms+1410>:	vmovntdq %ymm3,0x60(%rdi)
   0x00007feb7a5c7567 <__memmove_avx_unaligned_erms+1415>:	vmovntdq %ymm4,0x1000(%rdi)
   0x00007feb7a5c756f <__memmove_avx_unaligned_erms+1423>:	vmovntdq %ymm5,0x1020(%rdi)
   0x00007feb7a5c7577 <__memmove_avx_unaligned_erms+1431>:	vmovntdq %ymm6,0x1040(%rdi)
   0x00007feb7a5c757f <__memmove_avx_unaligned_erms+1439>:	vmovntdq %ymm7,0x1060(%rdi)
   0x00007feb7a5c7587 <__memmove_avx_unaligned_erms+1447>:	vmovntdq %ymm8,0x2000(%rdi)
   0x00007feb7a5c758f <__memmove_avx_unaligned_erms+1455>:	vmovntdq %ymm9,0x2020(%rdi)
   0x00007feb7a5c7597 <__memmove_avx_unaligned_erms+1463>:	vmovntdq %ymm10,0x2040(%rdi)
   0x00007feb7a5c759f <__memmove_avx_unaligned_erms+1471>:	vmovntdq %ymm11,0x2060(%rdi)
   0x00007feb7a5c75a7 <__memmove_avx_unaligned_erms+1479>:	vmovntdq %ymm12,0x3000(%rdi)
   0x00007feb7a5c75af <__memmove_avx_unaligned_erms+1487>:	vmovntdq %ymm13,0x3020(%rdi)
   0x00007feb7a5c75b7 <__memmove_avx_unaligned_erms+1495>:	vmovntdq %ymm14,0x3040(%rdi)
   0x00007feb7a5c75bf <__memmove_avx_unaligned_erms+1503>:	vmovntdq %ymm15,0x3060(%rdi)
   0x00007feb7a5c75c7 <__memmove_avx_unaligned_erms+1511>:	sub    $0xffffffffffffff80,%rdi
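Reading the listing above: each pass issues sixteen 32-byte non-temporal stores, four per page across four consecutive 4 KiB-spaced regions, and then advances rdi by 0x80 (the `sub $0xffffffffffffff80,%rdi` is an add of 128). The following C sketch models only that offset pattern. The macro and function names are made up for illustration and are not glibc's.

```c
#include <assert.h>
#include <stdint.h>

#define VEC_SIZE       32u    /* bytes per ymm store */
#define VECS_PER_PAGE  4u     /* stores issued into each page-spaced region */
#define NUM_PAGES      4u     /* regions touched per pass */
#define PAGE_STRIDE    4096u  /* distance between regions (0x1000) */

/* Displacement from rdi used when storing register ymm<n>, n in [0, 15]. */
uint32_t ymm_store_offset(unsigned n)
{
	assert(n < VECS_PER_PAGE * NUM_PAGES);
	return (n / VECS_PER_PAGE) * PAGE_STRIDE + (n % VECS_PER_PAGE) * VEC_SIZE;
}

/* One past the highest byte offset written during a single pass. */
uint32_t pass_extent(void)
{
	return ymm_store_offset(VECS_PER_PAGE * NUM_PAGES - 1u) + VEC_SIZE;
}
```

For ymm14 this gives 0x3040, matching the faulting store, and pass_extent() is 0x3080, so one pass assumes the destination is writable for at least that many bytes starting at rdi.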

Andrei Vagin · Answer 9 · Fri Apr 08 2022 05:27:54 GMT+0800 (China Standard Time)

Here is a small reproducer:

# cat /tmp/sysinfo2.c
#include <sys/sysinfo.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/mman.h>
#include <string.h>

#include <sys/types.h>
#include <unistd.h>
#include <sys/wait.h>
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/user.h>

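/* glibc-internal variable, printed and optionally overridden below. */
extern long __x86_shared_cache_size_half;

int main()
{
	/* Zero-initialized source buffer. */
	struct sysinfo *info = calloc(1, sizeof(*info));
	/* One anonymous read/write page used as the destination. */
	void *addr = mmap(0, 4096, PROT_WRITE| PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
	fprintf(stderr, "__x86_shared_cache_size_half = %lx\n", __x86_shared_cache_size_half);
	/* Uncommenting the next line makes the program exit with 0. */
	//__x86_shared_cache_size_half =	0x100000;
	/* Small copy to the start of the page; this is the call that faults. */
	memmove(addr /*+ 4096 - sizeof(*info)*/, info, 128);
	_exit(0);
	return 0;
}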

# /tmp/runsc --platform kvm --network none --strace --debug-log runsc.log --debug do /tmp/sysinfo2; echo $?
__x86_shared_cache_size_half = 0
139

I0407 14:18:58.995744       1 strace.go:640] [   1:   1] sysinfo2 X mmap(0x0, 0x1000, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, 0x0 host:[1], 0x0) = 139816759091200 (0x7f29a0426000) (154.773µs)
I0407 14:18:58.996567       1 strace.go:593] [   1:   1] sysinfo2 E write(0x2 host:[3], 0x7f327d7c7140 "__x86_shared_cache_size_half = 0\n", 0x21)
I0407 14:18:58.996789       1 strace.go:631] [   1:   1] sysinfo2 X write(0x2 host:[3], ..., 0x21) = 33 (0x21) (128.943µs)
D0407 14:18:58.997326       1 task_run.go:295] [   1:   1] Unhandled user fault: addr=7f29a0425ff0 ip=431088 access=-w- sig=11 err=bad address
D0407 14:18:58.997611       1 task_log.go:87] [   1:   1] Registers:
D0407 14:18:58.997747       1 task_log.go:94] [   1:   1] Cs       = 0000000000000033
D0407 14:18:58.997827       1 task_log.go:94] [   1:   1] Ds       = 0000000000000000
D0407 14:18:58.997906       1 task_log.go:94] [   1:   1] Eflags   = 0000000000011282
D0407 14:18:58.998027       1 task_log.go:94] [   1:   1] Es       = 0000000000000000
D0407 14:18:58.998103       1 task_log.go:94] [   1:   1] Fs       = 0000000000000000
D0407 14:18:58.998178       1 task_log.go:94] [   1:   1] Fs_base  = 00000000004cb3c0
D0407 14:18:58.998254       1 task_log.go:94] [   1:   1] Gs       = 0000000000000000
D0407 14:18:58.998328       1 task_log.go:94] [   1:   1] Gs_base  = 0000000000000000
D0407 14:18:58.998402       1 task_log.go:94] [   1:   1] Orig_rax = 00007f29a0426000
D0407 14:18:58.998478       1 task_log.go:94] [   1:   1] R10      = 0000000000000000
D0407 14:18:58.998551       1 task_log.go:94] [   1:   1] R11      = 0000000000001246
D0407 14:18:58.998624       1 task_log.go:94] [   1:   1] R12      = 0000000000000001
D0407 14:18:58.998697       1 task_log.go:94] [   1:   1] R13      = 00007f327d7c9448
D0407 14:18:58.998772       1 task_log.go:94] [   1:   1] R14      = 00000000004bf8d0
D0407 14:18:58.998888       1 task_log.go:94] [   1:   1] R15      = 0000000000000001
D0407 14:18:58.998995       1 task_log.go:94] [   1:   1] R8       = 00007f29a0426070
D0407 14:18:58.999077       1 task_log.go:94] [   1:   1] R9       = 00007f299ff59940
D0407 14:18:58.999177       1 task_log.go:94] [   1:   1] Rax      = 00007f29a0426000
D0407 14:18:58.999253       1 task_log.go:94] [   1:   1] Rbp      = 00007f327d7c9270
D0407 14:18:58.999326       1 task_log.go:94] [   1:   1] Rbx      = 00007f327d7c9458
D0407 14:18:58.999400       1 task_log.go:94] [   1:   1] Rcx      = 0000000000000000
D0407 14:18:58.999473       1 task_log.go:94] [   1:   1] Rdi      = 00007f29a0426070
D0407 14:18:58.999547       1 task_log.go:94] [   1:   1] Rdx      = ffffffffffffff70
D0407 14:18:58.999621       1 task_log.go:94] [   1:   1] Rip      = 0000000000431088
D0407 14:18:58.999696       1 task_log.go:94] [   1:   1] Rsi      = 00000000004cc6b0
D0407 14:18:58.999769       1 task_log.go:94] [   1:   1] Rsp      = 00007f327d7c9258
D0407 14:18:58.999874       1 task_log.go:94] [   1:   1] Ss       = 000000000000002b

  431021:       49 29 f1                sub    %rsi,%r9
  431024:       49 39 d1                cmp    %rdx,%r9
  431027:       73 09                   jae    431032 <__memmove_ssse3+0x2a02>
  431029:       49 39 c9                cmp    %rcx,%r9
  43102c:       0f 82 be 00 00 00       jb     4310f0 <__memmove_ssse3+0x2ac0>
  431032:       f3 0f 6f 46 f0          movdqu -0x10(%rsi),%xmm0
  431037:       f3 0f 6f 4e e0          movdqu -0x20(%rsi),%xmm1
  43103c:       f3 0f 6f 56 d0          movdqu -0x30(%rsi),%xmm2
  431041:       f3 0f 6f 5e c0          movdqu -0x40(%rsi),%xmm3
  431046:       f3 0f 6f 66 b0          movdqu -0x50(%rsi),%xmm4
  43104b:       f3 0f 6f 6e a0          movdqu -0x60(%rsi),%xmm5
  431050:       f3 0f 6f 76 90          movdqu -0x70(%rsi),%xmm6
  431055:       f3 0f 6f 7e 80          movdqu -0x80(%rsi),%xmm7
  43105a:       48 8d 76 80             lea    -0x80(%rsi),%rsi
  43105e:       48 81 ea 80 00 00 00    sub    $0x80,%rdx
  431065:       66 0f e7 47 f0          movntdq %xmm0,-0x10(%rdi)
  43106a:       66 0f e7 4f e0          movntdq %xmm1,-0x20(%rdi)
  43106f:       66 0f e7 57 d0          movntdq %xmm2,-0x30(%rdi)
  431074:       66 0f e7 5f c0          movntdq %xmm3,-0x40(%rdi)
  431079:       66 0f e7 67 b0          movntdq %xmm4,-0x50(%rdi)
  43107e:       66 0f e7 6f a0          movntdq %xmm5,-0x60(%rdi)
  431083:       66 0f e7 77 90          movntdq %xmm6,-0x70(%rdi)
  431088:       66 0f e7 7f 80          movntdq %xmm7,-0x80(%rdi)
  43108d:       48 8d 7f 80             lea    -0x80(%rdi),%rdi
  431091:       73 9f                   jae    431032 <__memmove_ssse3+0x2a02>
  431093:       48 83 fa c0             cmp    $0xffffffffffffffc0,%rdx
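The logged fault address can be tied back to the register dump and the mmap result: the instruction at ip=431088 is `movntdq %xmm7,-0x80(%rdi)`, and Rdi - 0x80 = 0x7f29a0425ff0, which is the reported fault address. A small C sketch of the bounds check, with a made-up helper name:

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero if [addr, addr + len) lies entirely inside [start, start + size). */
int range_in_mapping(uint64_t addr, uint64_t len, uint64_t start, uint64_t size)
{
	assert(len <= size);
	return addr >= start && addr - start <= size - len;
}
```

Using the values from the log (mapping start 0x7f29a0426000, size 0x1000, 16-byte stores), the seven stores at -0x10(%rdi) through -0x70(%rdi) fall inside the mapping, while the store at -0x80(%rdi) starts 16 bytes below it, which matches the "Unhandled user fault" line.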

If we uncomment "__x86_shared_cache_size_half = 0x100000", the program exits with 0.
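One way to read this result is that the copy routine compares the length against a cache-derived threshold and switches to the non-temporal path for "large" copies. The C sketch below is a deliberately simplified model of that decision under that assumption. It is not glibc's code, and the exact comparison glibc performs may differ.

```c
#include <stddef.h>

/*
 * Simplified model (assumption): a copy is treated as "large", and takes
 * the non-temporal store path, when its length reaches the threshold.
 */
int takes_large_copy_path(size_t len, size_t threshold)
{
	return len >= threshold;
}
```

Under this model, a threshold of 0 sends even the 128-byte copy from the reproducer down the large-copy path, whereas a threshold of 0x100000 keeps it on the ordinary path, which lines up with the two observed exit codes.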

Andrei Vagin · Answer 10 · Fri Apr 08 2022 06:58:35 GMT+0800 (China Standard Time)

The problem is that we don't properly handle CPUID(0x80000006): L2 cache information (Intel).
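For reference, a C sketch of how a program could query and decode that leaf. This is not gVisor's code (gVisor is written in Go), and the struct and function names are made up. It assumes the commonly documented ECX layout for leaf 0x80000006: L2 size in KiB in bits 31:16, an encoded associativity in bits 15:12, and the line size in bytes in bits 7:0.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

struct l2_info {
	uint32_t size_kb;    /* ECX bits 31:16 */
	uint32_t assoc_code; /* ECX bits 15:12, an encoded value rather than a way count */
	uint32_t line_size;  /* ECX bits 7:0, in bytes */
};

/* Decode the ECX value returned by CPUID leaf 0x80000006. */
struct l2_info decode_l2(uint32_t ecx)
{
	struct l2_info info;

	info.size_kb = ecx >> 16;
	info.assoc_code = (ecx >> 12) & 0xf;
	info.line_size = ecx & 0xff;
	return info;
}

/* Returns 1 and fills *out on success, 0 if the leaf is not available. */
int query_l2(struct l2_info *out)
{
#if defined(__x86_64__) || defined(__i386__)
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid(0x80000006u, &eax, &ebx, &ecx, &edx))
		return 0;
	*out = decode_l2(ecx);
	return 1;
#else
	(void)out;
	return 0; /* not an x86 build */
#endif
}
```

If a sandbox answers this leaf with zeros, decode_l2 reports a 0 KiB L2 cache, which may be how a zero cache size ends up in a libc that derives its copy thresholds from CPUID.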