Tensorflow Serving dumps core (SIGILL) with the KVM driver

dimpavloff opened this issue 6 years ago · comments

Dimitar Pavlov commented 6 years ago

Is this a BUG REPORT or FEATURE REQUEST? (choose one):
Bug report

Please provide the following details:

Environment:

Minikube version: v0.24.1
OS: Ubuntu 16.04.3 LTS (Xenial Xerus)
VM Driver: kvm2
ISO version: v0.23.6

What happened:
Followed the instructions in https://github.com/google/kubeflow/blob/master/user_guide.md , one of which creates a k8s Deployment with Tensorflow Serving. The container is crash looping with only the following output:

Illegal instruction (core dumped)

What you expected to happen:
The same steps work fine within Minikube when run with the virtualbox driver, so it would be nice for this to also work with kvm2 without having to rebuild the Tensorflow Serving binary.

How to reproduce it (as minimally and precisely as possible):
Within a Minikube with the kvm2 driver, run:

docker run -it --rm gcr.io/kubeflow/model-server@sha256:f9f61f821fac2a84ad00a4ab834b54cdf14902c9c2d72cb4aab0c57db42f0540 bash -c "/usr/bin/tensorflow_model_server --port=9000 --model_name=inception --model_base_path=gs://cloud-ml-dev_jlewi/tmp/inception || sleep 1"

(The flags do not matter, since the binary crashes during startup.)

Anything else do we need to know:
I enabled core dumps and ran the binary under gdb. Here is the session, including the backtrace from the core file:

Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/bin/tensorflow_model_server...(no debugging symbols found)...done.
(gdb) b _start
Breakpoint 1 at 0x4c23e0
(gdb) run
Starting program: /usr/bin/tensorflow_model_server 
warning: Error disabling address space randomization: Operation not permitted
During startup program terminated with signal SIGILL, Illegal instruction.
(gdb) core-file /tmp/dumps/core-tensorflow_mode-4-0-0-465-1514910625 
warning: core file may not match specified executable file.
[New LWP 465]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/bin/tensorflow_model_server'.
Program terminated with signal SIGILL, Illegal instruction.
#0  0x000000000224b282 in tensorflow::Env::RegisterFileSystem(std::string const&, std::function<tensorflow::FileSystem* ()>) ()
(gdb) bt
#0  0x000000000224b282 in tensorflow::Env::RegisterFileSystem(std::string const&, std::function<tensorflow::FileSystem* ()>) ()
#1  0x00000000004cdb00 in tensorflow::register_file_system::Register<tensorflow::RetryingGcsFileSystem>::Register(tensorflow::Env*, std::string const&) ()
#2  0x0000000000419900 in _GLOBAL__sub_I__ZN10tensorflow13GcsFileSystemC2Ev ()
#3  0x000000000245f0ad in __libc_csu_init ()
#4  0x00007fbe6c1677bf in __libc_start_main (main=0x4179d0 <main>, argc=1, argv=0x7fff233b4b48, init=0x245f060 <__libc_csu_init>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fff233b4b38)
    at ../csu/libc-start.c:247
#5  0x00000000004c2409 in _start ()

I tried setting a breakpoint at the entrypoint and stepping through the code, but it seems the crash happens before the breakpoint is reached.

mozilla/DeepSpeech#912 looks like a similar issue, which makes me think this is likely to do with the CPU virtualisation. The fact that the same steps work with the virtualbox driver (and on my host Linux machine) probably supports this. Here are the abridged contents of /proc/cpuinfo on my host machine and in minikube:

host:

processor       : 7
vendor_id       : GenuineIntel
cpu family      : 6
model           : 94
model name      : Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
stepping        : 3
microcode       : 0x39
cpu MHz         : 799.929
cache size      : 8192 KB
physical id     : 0
siblings        : 8
core id         : 3
cpu cores       : 4
apicid          : 7
initial apicid  : 7
fpu             : yes
fpu_exception   : yes
cpuid level     : 22
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp
bugs            :
bogomips        : 6816.60
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

minikube:

processor       : 7
vendor_id       : GenuineIntel
cpu family      : 6
model           : 6
model name      : QEMU Virtual CPU version 2.5+
stepping        : 3
microcode       : 0x1
cpu MHz         : 3407.998
cache size      : 4096 KB
physical id     : 7
siblings        : 1
core id         : 0
cpu cores       : 1
apicid          : 7
initial apicid  : 7
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pse36 clflush mmx fxsr sse sse2 syscall nx lm rep_good nopl eagerfpu pni vmx cx16 x2apic hypervisor lahf_lm tpr_shadow vnmi flexpriority ept vpid
bugs            :
bogomips        : 6815.99
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:
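
To make the difference between the two listings easier to see, here is a minimal Python sketch that diffs the flags lines. The HOST and GUEST strings are shortened excerpts of the two listings above, and SIMD_FLAGS is only an illustrative selection of extensions. The report does not identify which instruction actually triggers the SIGILL, so treat the result as a hint rather than a diagnosis.

```python
# Sketch: list CPU flags present on the host but missing in the minikube guest.
# HOST and GUEST are shortened excerpts of the /proc/cpuinfo listings above.
# SIMD_FLAGS is an illustrative selection (an assumption), not a list of what
# the tensorflow_model_server binary is known to require.


def parse_flags(cpuinfo_text):
    """Return the flags from the first 'flags' line of /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


HOST = "flags : fpu sse sse2 pni ssse3 fma sse4_1 sse4_2 avx f16c avx2 bmi1 bmi2"
GUEST = "flags : fpu sse sse2 pni vmx cx16 x2apic hypervisor lahf_lm"

SIMD_FLAGS = ["ssse3", "sse4_1", "sse4_2", "avx", "avx2", "fma"]

# Flags the host advertises that the guest does not.
missing = parse_flags(HOST) - parse_flags(GUEST)
missing_simd = [flag for flag in SIMD_FLAGS if flag in missing]

print("missing in guest:", " ".join(sorted(missing)))
print("missing SIMD extensions:", " ".join(missing_simd))
```

On a real machine the same function can be fed the output of open("/proc/cpuinfo").read(), once on the host and once inside the minikube VM. With the excerpts above, every extension in SIMD_FLAGS is reported as missing from the guest.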

Thanks!

fejta-bot · Answer 1 · Tue Apr 03 2018 01:43:51 GMT+0800 (China Standard Time)

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

fejta-bot · Answer 2 · Thu May 03 2018 02:00:04 GMT+0800 (China Standard Time)

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

Michelle Casbon · Answer 3 · Thu May 31 2018 23:49:56 GMT+0800 (China Standard Time)

👍 to seeing a fix for this issue

fejta-bot · Answer 4 · Sun Jul 01 2018 00:05:42 GMT+0800 (China Standard Time)

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close