Node terminated unexpectedly
florianb opened this issue · comments
Environment
- Elixir & Erlang/OTP versions (elixir --version): Erlang/OTP 26 [erts-14.1.1] [source] [64-bit] [smp:6:6] [ds:6:6:10] [async-threads:1] [jit:ns] [x86_64-pc-linux-gnu]
- Operating system: Linux d77d09d47a64 6.1.0-20-amd64 # 1 SMP PREEMPT_DYNAMIC Debian 6.1.85-1 (2024-04-11) x86_64 GNU/Linux
- How have you started Livebook (mix phx.server, livebook CLI, Docker, etc): Docker livebook-dev/livebook:latest
- Livebook version (use git rev-parse HEAD if running with mix): 0.12.1
- Browsers that reproduce this bug (the more the merrier): Firefox (it's a browser independent issue)
- Include what is logged in the browser console: no errors here
- Include what is logged to the server console:
Mai 06 12:07:37 livebook docker[23169]: [Livebook] Application running at http://localhost:8080/
Mai 06 12:07:58 livebook docker[23169]:
Mai 06 12:07:58 livebook docker[23169]: 10:07:58.064 [debug] Downloading NIF from https://github.com/elixir-nx/explorer/releases/download/v0.8.2/libexplorer-v0.8.2-nif-2.15-x86_64-unknown-linux-gnu.so.tar.gz
Mai 06 12:08:01 livebook docker[23169]:
Mai 06 12:08:01 livebook docker[23169]: 10:08:01.168 [debug] NIF cached at /home/livebook/.cache/rustler_precompiled/precompiled_nifs/libexplorer-v0.8.2-nif-2.15-x86_64-unknown-linux-gnu.so.tar.gz and extracted to /home/livebook/.cache/mix/installs/elixir-1.15.7-erts-14.1.1/444655ddd876e676876661841d5d38eb/_build/dev/lib/explorer/priv/native/libexplorer-v0.8.2-nif-2.15-x86_64-unknown-linux-gnu.so
Current behavior
When running a setup task (here I'm running the "Data transform with Explorer" setup), the server disconnects after the following output:
==> explorer
Compiling 25 files (.ex)
11:17:24.711 [debug] Copying NIF from cache and extracting to /home/livebook/.cache/mix/installs/elixir-1.15.7-erts-14.1.1/444655ddd876e676876661841d5d38eb/_build/dev/lib/explorer/priv/native/libexplorer-v0.8.2-nif-2.15-x86_64-unknown-linux-gnu.so
I am running the Docker container as a dedicated user livebook (1001) with a systemd unit:
[Unit]
Description=Livebook Server
After=docker.service
Requires=docker.service
[Service]
User=livebook
Group=livebook
Restart=always
Environment=LIVEBOOK_DEBUG=true
ExecStartPre=-/usr/bin/docker stop livebook
ExecStartPre=-/usr/bin/docker rm livebook
ExecStartPre=/usr/bin/docker pull ghcr.io/livebook-dev/livebook
ExecStart=/usr/bin/docker run --name livebook -u 1001:1001 -e "LIVEBOOK_PASSWORD=" -p 8080:8080 -p 8081:8081 -v /home/livebook:/data ghcr.io/livebook-dev/livebook
ExecStop=/usr/bin/docker stop livebook
[Install]
WantedBy=multi-user.target
Creating a folder (123) from the web ui works as intended:
I have no name!@f0aa748746e0:/$ ls -la data
total 16
drwxr-xr-x 4 1001 1001 4096 May 6 11:25 .
drwxr-xr-x 1 root root 4096 May 6 11:22 ..
drwxr-xr-x 2 1001 1001 4096 May 6 11:25 123
The mix folder seems to be created with the correct permissions:
I have no name!@f0aa748746e0:/$ ls -la /home/livebook/
total 28
drwxrwxrwx 1 root root 4096 May 6 11:26 .
drwxr-xr-x 1 root root 4096 Nov 9 11:15 ..
drwxr-xr-x 4 1001 1001 4096 May 6 11:26 .cache
drwxr-xr-x 3 1001 1001 4096 May 6 11:26 .hex
drwxr-xr-x 3 1001 1001 4096 May 6 11:22 .local
drwxrwxrwx 1 root root 4096 Nov 9 11:15 .mix
Also the cached NIF seems to be laid out as expected:
I have no name!@f0aa748746e0:/$ ls -la /home/livebook/.cache/mix/installs/elixir-1.15.7-erts-14.1.1/444655ddd876e676876661841d5d38eb/_build/dev/lib/explorer/priv/native/
total 64844
drwxr-xr-x 2 1001 1001 4096 May 6 11:26 .
drwxr-xr-x 3 1001 1001 4096 May 6 11:26 ..
-rwxr-xr-x 1 1001 1001 66391768 Apr 22 12:40 libexplorer-v0.8.2-nif-2.15-x86_64-unknown-linux-gnu.so
Any hint that helps find out why this doesn't work is much appreciated. I wonder why it crashes without further notice.
Let me know how I can help - thanks a lot!
What happens if you build a similar Docker image (matching Elixir/Erlang versions) with iex instead and then run Mix.install [:explorer]? Because my gut feeling says it is not a Livebook issue per se.
Thanks for the quick response - I don't think it's a Livebook issue either - I wondered if there's a way to find out why it crashes.
I will try to create a Livebook instance in a different base container and come back as soon as I have new details.
Is it the whole Docker container that crashes, or only the notebook runtime?
Only the notebook runtime - the container keeps running and I can immediately reconnect from the GUI. Unfortunately the logs stay silent and, interestingly, there are empty log lines.
Either a segmentation fault, or the OS somehow killing it because it thinks it is running out of memory?
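One way to tell these cases apart is to look at the kernel log of the Docker host right after the runtime disconnects, since a killed or crashed process usually leaves a line there even when the application log is empty. A minimal sketch, assuming a Linux host where dmesg is readable; the helper name filter_crash_lines and the pattern list are illustrative and not exhaustive:

```shell
# Hypothetical helper: keep only kernel log lines that typically accompany
# an OOM kill, a segmentation fault, or an illegal instruction.
filter_crash_lines() {
  grep -Ei 'out of memory|oom-kill|killed process|segfault|invalid opcode|general protection'
}

# Example usage (may require root):
#   dmesg | filter_crash_lines
```

On Linux an illegal instruction usually shows up as a "trap invalid opcode" line naming the process and the shared object involved, while an OOM kill is logged as "Out of memory: Killed process ...".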
Here is the log of two subsequent reconnects & setup:
Mai 06 16:07:47 livebook docker[24339]: 14:07:47.322 [debug] HANDLE EVENT "queue_cell_evaluation" in LivebookWeb.SessionLive
Mai 06 16:07:47 livebook docker[24339]: Parameters: %{"cell_id" => "setup", "disable_dependencies_cache" => false}
Mai 06 16:07:47 livebook docker[24339]: 14:07:47.323 [debug] Replied in 194µs
Mai 06 16:07:48 livebook docker[24339]:
Mai 06 16:07:48 livebook docker[24339]: 14:07:48.541 [debug] Copying NIF from cache and extracting to /home/livebook/.cache/mix/installs/elixir-1.15.7-erts-14.1.1/444655ddd876e676876661841d5d38eb/_build/dev/lib/explorer/priv/native/libexplorer-v0.8.2-nif-2.15-x86_64-unknown-linux-gnu.so
Mai 06 16:07:51 livebook docker[24339]: 14:07:51.891 [debug] HANDLE EVENT "queue_cell_evaluation" in LivebookWeb.SessionLive
Mai 06 16:07:51 livebook docker[24339]: Parameters: %{"cell_id" => "setup", "disable_dependencies_cache" => false}
Mai 06 16:07:51 livebook docker[24339]: 14:07:51.891 [debug] Replied in 182µs
Mai 06 16:07:53 livebook docker[24339]:
Mai 06 16:07:53 livebook docker[24339]: 14:07:53.100 [debug] Copying NIF from cache and extracting to /home/livebook/.cache/mix/installs/elixir-1.15.7-erts-14.1.1/444655ddd876e676876661841d5d38eb/_build/dev/lib/explorer/priv/native/libexplorer-v0.8.2-nif-2.15-x86_64-unknown-linux-gnu.so
The container runs in a VM with 12 GB of memory and there's currently no load in the container.
I have the exact same issue happening to me using livebook-dev/livebook:latest (I also tried edge, same issue).
In my case I'm running it in my TrueNAS instance.
That's what I suspected, something going on in Docker's emulation layer. :( Does the OS version in the NIF match the system you are currently running? What is your host OS?
My CPU is a Ryzen 5 1600:
processor : 11
vendor_id : AuthenticAMD
cpu family : 23
model : 8
model name : AMD Ryzen 5 1600 Six-Core Processor
stepping : 2
microcode : 0x800820d
cpu MHz : 3569.248
cache size : 512 KB
physical id : 0
siblings : 12
core id : 6
cpu cores : 6
apicid : 13
initial apicid : 13
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate ssbd ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca sev sev_es
bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso div0
bogomips : 6399.93
TLB size : 2560 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 43 bits physical, 48 bits virtual
power management: ts ttp tm hwpstate cpb eff_freq_ro [13] [14]
But I have my TrueNAS instance running on top of Proxmox, so the CPU that the Docker image sees is a QEMU Virtual CPU version 2.5+:
processor : 2
vendor_id : AuthenticAMD
cpu family : 15
model : 107
model name : QEMU Virtual CPU version 2.5+
stepping : 1
microcode : 0x1000065
cpu MHz : 3205.846
cache size : 512 KB
physical id : 0
siblings : 12
core id : 2
cpu cores : 12
apicid : 2
initial apicid : 2
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx lm rep_good nopl cpuid extd_apicid tsc_known_freq pni ssse3 cx16 sse4_1 sse4_2 x2apic popcnt aes hypervisor lahf_lm cmp_legacy 3dnowprefetch vmmcall
bugs : fxsave_leak sysret_ss_attrs null_seg swapgs_fence amd_e400 spectre_v1 spectre_v2
bogomips : 6411.69
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:
I will see if I can change that to something that doesn't crash.
Ok, FYI, after I changed the CPU from KVM64 to host, installing explorer started working again.
I'm not sure if @florianb's issue is the same as mine, but if it is, then this is not a Livebook issue.
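To see what the KVM64 CPU model hides from the guest, the two flags lines above can be compared directly. A minimal sketch in POSIX shell; the helper name missing_flags is hypothetical, and both arguments are assumed to be single-space-separated flag strings as printed in /proc/cpuinfo:

```shell
# Hypothetical helper: print every flag that appears in the first list
# (e.g. the host) but not in the second (e.g. the guest), one per line.
missing_flags() {
  for flag in $1; do
    case " $2 " in
      *" $flag "*) ;;                 # present in both lists
      *) printf '%s\n' "$flag" ;;     # only in the first list
    esac
  done
}

# Example usage, with the flag strings captured on each machine via
#   grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2
#   missing_flags "$host_flags" "$guest_flags"
```

For the two dumps above, the host-only flags include avx, avx2, fma, bmi1 and bmi2. Which of these the precompiled NIF actually relies on is not established in this thread.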
@sezaru - awesome finding! Indeed - it seems like the precompiled NIFs make use of some CPU features. I ended up using x86-64-v3 to preserve live migration (we're using PVE as well).
So I guess the issue can be closed!
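For anyone who wants to verify inside the container that the chosen CPU model really exposes the x86-64-v3 feature set, here is a minimal sketch. The helper name supports_x86_64_v3 is hypothetical, and it checks only the flags that v3 adds on top of v2, under the names used in /proc/cpuinfo:

```shell
# Flags that the x86-64-v3 level adds on top of v2, as named in /proc/cpuinfo.
# LZCNT is reported as "abm"; OSXSAVE is not listed there, so "xsave" stands in.
V3_FLAGS="avx avx2 bmi1 bmi2 f16c fma abm movbe xsave"

# Hypothetical helper: succeed only if every v3 flag appears in the given
# single-space-separated flag string; print each missing flag otherwise.
supports_x86_64_v3() {
  status=0
  for flag in $V3_FLAGS; do
    case " $1 " in
      *" $flag "*) ;;
      *) printf 'missing: %s\n' "$flag"; status=1 ;;
    esac
  done
  return "$status"
}

# Example usage:
#   supports_x86_64_v3 "$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2)"
```

Applied to the dumps earlier in this thread, the Ryzen 5 1600 host passes, while the QEMU Virtual CPU guest is reported as missing all nine flags.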
@josevalim do you want me to file a PR adding some kind of warning to the docs about this issue? For sure it's not a dedicated Livebook issue, but other LB users might run into this as well.
Wouldn't this be more of an Explorer kind of issue? Are we assuming too many CPU features?
I think it's an issue with NIFs in general, and what boggles me is the fact that I got no feedback from the stack. "Illegal instruction" is something to work with, but yeah, you're right - this should be pushed downwards.
I just assumed that this repo is one likely sink for this issue. But we might add a hint later, after more people have stumbled over this. In the meantime I will try to find out where the CPU restrictions sneaked in; maybe it's something that should be handled by default in rustler.
Of course, it would be best if the NIF could indicate such issues before it is loaded.