Segfault in many newer Ubuntu binaries
invliD opened this issue · comments
Description
Starting with Ubuntu impish (21.10), many common binaries segfault more or less immediately. The most prominent is bash, since that's the default command for the ubuntu images, but others like apt-get crash as well. dash (sh) is fine in all versions I tested. Ubuntu hirsute (21.04) and earlier seem to work for all binaries.
Steps to reproduce
Start a runsc container with the ubuntu:impish or ubuntu:jammy image and the default command. It will not start. Override the command to sh, attach to the container, and run bash. It will segfault.
runsc version
runsc version release-20220228.0
spec: 1.0.2-dev
docker version (if using docker)
I am using containerd:
containerd github.com/containerd/containerd 1.5.5-0ubuntu3~20.04.2
uname
Linux <redacted> 5.4.0-91-generic #102-Ubuntu SMP Fri Nov 5 16:31:28 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
kubectl (if using Kubernetes)
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.4", GitCommit:"e87da0bd6e03ec3fea7933c4b5263d151aafd07c", GitTreeState:"clean", BuildDate:"2021-02-18T16:12:00Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.14", GitCommit:"57a3aa3f13699cf3db9c52d228c18db94fa81876", GitTreeState:"clean", BuildDate:"2021-12-15T14:47:10Z", GoVersion:"go1.15.15", Compiler:"gc", Platform:"linux/amd64"}
repo state (if built from source)
No response
runsc debug logs (if available)
runsc.log.20220329-214112.988202.create.log
runsc.log.20220329-214113.064374.start.log
runsc.log.20220329-214113.068508.gofer.log
runsc.log.20220329-214113.252275.state.log
runsc.log.20220329-214113.256368.wait.log
runsc.log.20220329-214113.288286.kill.log
runsc.log.20220329-214113.324176.kill.log
runsc.log.20220329-214113.372256.delete.log
Hmm, I can run both ubuntu:impish and ubuntu:jammy without problems. Do you have the runsc.log.*.boot log? That would be the most interesting one for finding out why it's segfaulting. The rest of the logs look all right. I'm using containerd v1.5.4, but that is likely not the problem.
Oh, I didn't realize the logs from kubernetes' pause container are relevant as well. Here's the boot log from that container (from a new run):
runsc.log.20220330-181033.174692.boot.log
I have some more information. Apparently, running these images works fine on all Intel CPUs I have (newest generation being Skylake). These segfaults only seem to happen on my AMD CPUs (all AMD Ryzen 9 5950X).
Hmm... interesting. Does the SIGSEGV also happen when you use the ptrace platform? Are you running KVM with bare-metal or nested virtualization?
The faulting instruction is c5 7d e7 b7 40 30 00 vmovntdq YMMWORD PTR [rdi+0x3040],ymm14. It is surrounded by other similar vmovntdq instructions and is in libc, so I assume this is memcpy (or similar).
The fault address, 0x55e64c532000, matches rdi + 0x3040, which is a good sign. Interestingly, this address is exactly one byte past the end of the heap VMA. The prior instructions were writing to lower addresses, so this looks like an attempt to write something without quite enough space on the heap.
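The address arithmetic in this comment can be checked with a few lines of C. This is only a sketch: the fault address and the 0x3040 displacement come from the comment above, while the 4 KiB page size and the helper names (`implied_rdi`, `is_page_start`, `page_of`) are assumptions made for illustration.

```c
#include <stdint.h>

/* Assumption: 4 KiB pages, the usual x86-64 base page size. */
#define PAGE_SZ 0x1000ULL

/* Values quoted in the comment above. */
static const uint64_t FAULT_ADDR = 0x55e64c532000ULL;
static const uint64_t STORE_DISP = 0x3040ULL; /* vmovntdq [rdi+0x3040] */

/* Destination base register implied by the fault address. */
uint64_t implied_rdi(void)
{
    return FAULT_ADDR - STORE_DISP;
}

/* Returns 1 if addr is the first byte of a page, 0 otherwise. */
int is_page_start(uint64_t addr)
{
    return (addr % PAGE_SZ) == 0;
}

/* Page frame number that contains addr. */
uint64_t page_of(uint64_t addr)
{
    return addr / PAGE_SZ;
}
```

Under these assumptions rdi works out to 0x55e64c52efc0, the preceding store at rdi + 0x3020 lands in the last 32 bytes of the previous page, and the faulting store begins exactly on the next page boundary, which fits the observation that the address is just past the end of the VMA.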
Are you running KVM with bare-metal or nested virtualization?
@fvoznika As an aside, it might be nice to try to best-effort guess this and log it for future reference. e.g., check if the hypervisor bit is set in CPUID.
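A best-effort check of the hypervisor bit could look roughly like the sketch below. It is written in C to match the reproducer later in this thread (gVisor itself is Go) and assumes the `<cpuid.h>` header shipped with GCC and Clang. The function names are hypothetical. The bit tested is CPUID leaf 1, ECX bit 31, which is reserved for hypervisors to indicate their presence.

```c
#include <stdint.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* CPUID leaf 1, ECX bit 31: set when a hypervisor is present. */
#define CPUID_ECX_HYPERVISOR (1u << 31)

/* Pure helper: tests the hypervisor bit in a leaf-1 ECX value. */
int ecx_has_hypervisor_bit(uint32_t ecx)
{
    return (ecx & CPUID_ECX_HYPERVISOR) != 0;
}

/*
 * Returns 1 if the hypervisor bit is set, 0 if it is clear,
 * and -1 if CPUID leaf 1 cannot be queried (e.g. non-x86 build).
 */
int running_under_hypervisor(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return -1;
    return ecx_has_hypervisor_bit(ecx);
#else
    return -1;
#endif
}
```

A result of 1 suggests nested virtualization when the KVM platform is in use; a result of 0 suggests bare metal. A hypervisor can hide this bit, so it is only a hint and worth logging as such.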
Does the SIGSEGV also happen when you use the ptrace platform?
It does not, it only appears on the kvm platform.
Are you running KVM with bare-metal or nested virtualization?
Kubernetes, containerd, and thus gVisor are all running on bare metal.
0x00007feb7a5c7554 <__memmove_avx_unaligned_erms+1396>: vmovntdq %ymm0,(%rdi)
0x00007feb7a5c7558 <__memmove_avx_unaligned_erms+1400>: vmovntdq %ymm1,0x20(%rdi)
0x00007feb7a5c755d <__memmove_avx_unaligned_erms+1405>: vmovntdq %ymm2,0x40(%rdi)
0x00007feb7a5c7562 <__memmove_avx_unaligned_erms+1410>: vmovntdq %ymm3,0x60(%rdi)
0x00007feb7a5c7567 <__memmove_avx_unaligned_erms+1415>: vmovntdq %ymm4,0x1000(%rdi)
0x00007feb7a5c756f <__memmove_avx_unaligned_erms+1423>: vmovntdq %ymm5,0x1020(%rdi)
0x00007feb7a5c7577 <__memmove_avx_unaligned_erms+1431>: vmovntdq %ymm6,0x1040(%rdi)
0x00007feb7a5c757f <__memmove_avx_unaligned_erms+1439>: vmovntdq %ymm7,0x1060(%rdi)
0x00007feb7a5c7587 <__memmove_avx_unaligned_erms+1447>: vmovntdq %ymm8,0x2000(%rdi)
0x00007feb7a5c758f <__memmove_avx_unaligned_erms+1455>: vmovntdq %ymm9,0x2020(%rdi)
0x00007feb7a5c7597 <__memmove_avx_unaligned_erms+1463>: vmovntdq %ymm10,0x2040(%rdi)
0x00007feb7a5c759f <__memmove_avx_unaligned_erms+1471>: vmovntdq %ymm11,0x2060(%rdi)
0x00007feb7a5c75a7 <__memmove_avx_unaligned_erms+1479>: vmovntdq %ymm12,0x3000(%rdi)
0x00007feb7a5c75af <__memmove_avx_unaligned_erms+1487>: vmovntdq %ymm13,0x3020(%rdi)
0x00007feb7a5c75b7 <__memmove_avx_unaligned_erms+1495>: vmovntdq %ymm14,0x3040(%rdi)
0x00007feb7a5c75bf <__memmove_avx_unaligned_erms+1503>: vmovntdq %ymm15,0x3060(%rdi)
0x00007feb7a5c75c7 <__memmove_avx_unaligned_erms+1511>: sub $0xffffffffffffff80,%rdi
Here is a small reproducer:
# cat /tmp/sysinfo2.c
#include <sys/sysinfo.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/mman.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>
#include <sys/wait.h>
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/user.h>
extern long __x86_shared_cache_size_half;
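int main()
{
    /* Zero-initialized source buffer. */
    struct sysinfo *info = calloc(sizeof(*info), 1);
    /* Destination: one anonymous, private, read-write page. */
    void *addr = mmap(0, 4096, PROT_WRITE | PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);

    fprintf(stderr, "__x86_shared_cache_size_half = %lx\n", __x86_shared_cache_size_half);
    /* Uncomment to override the value glibc computed at startup. */
    //__x86_shared_cache_size_half = 0x100000;
    /* Copy 128 bytes to the start of the page; this is the call that faults. */
    memmove(addr /*+ 4096 - sizeof(*info)*/, info, 128);
    _exit(0);
    return 0;
}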
# /tmp/runsc --platform kvm --network none --strace --debug-log runsc.log --debug do /tmp/sysinfo2; echo $?
__x86_shared_cache_size_half = 0
139
I0407 14:18:58.995744 1 strace.go:640] [ 1: 1] sysinfo2 X mmap(0x0, 0x1000, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, 0x0 host:[1], 0x0) = 139816759091200 (0x7f29a0426000) (154.773µs)
I0407 14:18:58.996567 1 strace.go:593] [ 1: 1] sysinfo2 E write(0x2 host:[3], 0x7f327d7c7140 "__x86_shared_cache_size_half = 0\n", 0x21)
I0407 14:18:58.996789 1 strace.go:631] [ 1: 1] sysinfo2 X write(0x2 host:[3], ..., 0x21) = 33 (0x21) (128.943µs)
D0407 14:18:58.997326 1 task_run.go:295] [ 1: 1] Unhandled user fault: addr=7f29a0425ff0 ip=431088 access=-w- sig=11 err=bad address
D0407 14:18:58.997611 1 task_log.go:87] [ 1: 1] Registers:
D0407 14:18:58.997747 1 task_log.go:94] [ 1: 1] Cs = 0000000000000033
D0407 14:18:58.997827 1 task_log.go:94] [ 1: 1] Ds = 0000000000000000
D0407 14:18:58.997906 1 task_log.go:94] [ 1: 1] Eflags = 0000000000011282
D0407 14:18:58.998027 1 task_log.go:94] [ 1: 1] Es = 0000000000000000
D0407 14:18:58.998103 1 task_log.go:94] [ 1: 1] Fs = 0000000000000000
D0407 14:18:58.998178 1 task_log.go:94] [ 1: 1] Fs_base = 00000000004cb3c0
D0407 14:18:58.998254 1 task_log.go:94] [ 1: 1] Gs = 0000000000000000
D0407 14:18:58.998328 1 task_log.go:94] [ 1: 1] Gs_base = 0000000000000000
D0407 14:18:58.998402 1 task_log.go:94] [ 1: 1] Orig_rax = 00007f29a0426000
D0407 14:18:58.998478 1 task_log.go:94] [ 1: 1] R10 = 0000000000000000
D0407 14:18:58.998551 1 task_log.go:94] [ 1: 1] R11 = 0000000000001246
D0407 14:18:58.998624 1 task_log.go:94] [ 1: 1] R12 = 0000000000000001
D0407 14:18:58.998697 1 task_log.go:94] [ 1: 1] R13 = 00007f327d7c9448
D0407 14:18:58.998772 1 task_log.go:94] [ 1: 1] R14 = 00000000004bf8d0
D0407 14:18:58.998888 1 task_log.go:94] [ 1: 1] R15 = 0000000000000001
D0407 14:18:58.998995 1 task_log.go:94] [ 1: 1] R8 = 00007f29a0426070
D0407 14:18:58.999077 1 task_log.go:94] [ 1: 1] R9 = 00007f299ff59940
D0407 14:18:58.999177 1 task_log.go:94] [ 1: 1] Rax = 00007f29a0426000
D0407 14:18:58.999253 1 task_log.go:94] [ 1: 1] Rbp = 00007f327d7c9270
D0407 14:18:58.999326 1 task_log.go:94] [ 1: 1] Rbx = 00007f327d7c9458
D0407 14:18:58.999400 1 task_log.go:94] [ 1: 1] Rcx = 0000000000000000
D0407 14:18:58.999473 1 task_log.go:94] [ 1: 1] Rdi = 00007f29a0426070
D0407 14:18:58.999547 1 task_log.go:94] [ 1: 1] Rdx = ffffffffffffff70
D0407 14:18:58.999621 1 task_log.go:94] [ 1: 1] Rip = 0000000000431088
D0407 14:18:58.999696 1 task_log.go:94] [ 1: 1] Rsi = 00000000004cc6b0
D0407 14:18:58.999769 1 task_log.go:94] [ 1: 1] Rsp = 00007f327d7c9258
D0407 14:18:58.999874 1 task_log.go:94] [ 1: 1] Ss = 000000000000002b
431021: 49 29 f1 sub %rsi,%r9
431024: 49 39 d1 cmp %rdx,%r9
431027: 73 09 jae 431032 <__memmove_ssse3+0x2a02>
431029: 49 39 c9 cmp %rcx,%r9
43102c: 0f 82 be 00 00 00 jb 4310f0 <__memmove_ssse3+0x2ac0>
431032: f3 0f 6f 46 f0 movdqu -0x10(%rsi),%xmm0
431037: f3 0f 6f 4e e0 movdqu -0x20(%rsi),%xmm1
43103c: f3 0f 6f 56 d0 movdqu -0x30(%rsi),%xmm2
431041: f3 0f 6f 5e c0 movdqu -0x40(%rsi),%xmm3
431046: f3 0f 6f 66 b0 movdqu -0x50(%rsi),%xmm4
43104b: f3 0f 6f 6e a0 movdqu -0x60(%rsi),%xmm5
431050: f3 0f 6f 76 90 movdqu -0x70(%rsi),%xmm6
431055: f3 0f 6f 7e 80 movdqu -0x80(%rsi),%xmm7
43105a: 48 8d 76 80 lea -0x80(%rsi),%rsi
43105e: 48 81 ea 80 00 00 00 sub $0x80,%rdx
431065: 66 0f e7 47 f0 movntdq %xmm0,-0x10(%rdi)
43106a: 66 0f e7 4f e0 movntdq %xmm1,-0x20(%rdi)
43106f: 66 0f e7 57 d0 movntdq %xmm2,-0x30(%rdi)
431074: 66 0f e7 5f c0 movntdq %xmm3,-0x40(%rdi)
431079: 66 0f e7 67 b0 movntdq %xmm4,-0x50(%rdi)
43107e: 66 0f e7 6f a0 movntdq %xmm5,-0x60(%rdi)
431083: 66 0f e7 77 90 movntdq %xmm6,-0x70(%rdi)
431088: 66 0f e7 7f 80 movntdq %xmm7,-0x80(%rdi)
43108d: 48 8d 7f 80 lea -0x80(%rdi),%rdi
431091: 73 9f jae 431032 <__memmove_ssse3+0x2a02>
431093: 48 83 fa c0 cmp $0xffffffffffffffc0,%rdx
If we uncomment "__x86_shared_cache_size_half = 0x100000", the program exits with 0.
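One plausible reading of this result, sketched below as a deliberately simplified model and not glibc's actual code: the copy routine compares the length against a cache-size threshold and switches to the non-temporal (movntdq) path for anything at or above it. With a threshold of 0, even a 128-byte copy would take the path meant for very large copies. The enum and the function `choose_copy_path` are hypothetical names.

```c
#include <stddef.h>

/* Hypothetical, simplified dispatch. This is NOT glibc's implementation. */
enum copy_path {
    COPY_PATH_REGULAR,      /* ordinary cached stores */
    COPY_PATH_NON_TEMPORAL  /* streaming stores intended for very large copies */
};

/*
 * Picks a path by comparing the copy length with the threshold.
 * Assumption: the threshold plays the role of __x86_shared_cache_size_half.
 */
enum copy_path choose_copy_path(size_t n, long shared_cache_size_half)
{
    if (n >= (size_t)shared_cache_size_half)
        return COPY_PATH_NON_TEMPORAL;
    return COPY_PATH_REGULAR;
}
```

In this model, `choose_copy_path(128, 0)` selects the non-temporal path, while `choose_copy_path(128, 0x100000)` selects the regular one. That matches the observed behavior: the crash occurs when the variable is 0 and goes away when it is forced to 0x100000.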
The problem is that we don't handle CPUID(0x80000006): L2 cache information (Intel) properly.
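For reference, here is a small sketch of how the L2 fields of that leaf are laid out in ECX: bits 31:16 hold the L2 size in KiB, bits 15:12 an associativity code, and bits 7:0 the line size in bytes. The struct and function names are hypothetical, and the query helper assumes the `<cpuid.h>` header from GCC or Clang. It reads what the (virtual) CPU reports and does not reproduce gVisor's emulation.

```c
#include <stdint.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* Decoded L2 fields of CPUID leaf 0x80000006, register ECX. */
struct l2_info {
    uint32_t size_kb;    /* ECX[31:16]: L2 size in KiB */
    uint32_t assoc_code; /* ECX[15:12]: encoded associativity, not a way count */
    uint32_t line_size;  /* ECX[7:0]:   cache line size in bytes */
};

/* Pure helper: splits a raw ECX value into its L2 fields. */
struct l2_info decode_l2(uint32_t ecx)
{
    struct l2_info info;

    info.size_kb = ecx >> 16;
    info.assoc_code = (ecx >> 12) & 0xf;
    info.line_size = ecx & 0xff;
    return info;
}

/* Fills *out and returns 0 on success, or -1 if the leaf is unavailable. */
int query_l2(struct l2_info *out)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(0x80000006u, &eax, &ebx, &ecx, &edx))
        return -1;
    *out = decode_l2(ecx);
    return 0;
#else
    (void)out;
    return -1;
#endif
}
```

If the sandbox returns zeros for this leaf, `size_kb` decodes to 0, which would be consistent with the `__x86_shared_cache_size_half = 0` printed by the reproducer.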