livebook-dev / livebook

Automate code & data workflows with interactive Elixir notebooks

Home Page: https://livebook.dev

Node terminated unexpectedly

florianb opened this issue

Environment

  • Elixir & Erlang/OTP versions (elixir --version): Erlang/OTP 26 [erts-14.1.1] [source] [64-bit] [smp:6:6] [ds:6:6:10] [async-threads:1] [jit:ns] [x86_64-pc-linux-gnu]
  • Operating system: Linux d77d09d47a64 6.1.0-20-amd64 # 1 SMP PREEMPT_DYNAMIC Debian 6.1.85-1 (2024-04-11) x86_64 GNU/Linux
  • How have you started Livebook (mix phx.server, livebook CLI, Docker, etc): Docker livebook-dev/livebook:latest
  • Livebook version (use git rev-parse HEAD if running with mix): 0.12.1
  • Browsers that reproduce this bug (the more the merrier): Firefox (it's a browser-independent issue)
  • Include what is logged in the browser console: no errors here
  • Include what is logged to the server console:
Mai 06 12:07:37 livebook docker[23169]: [Livebook] Application running at http://localhost:8080/
Mai 06 12:07:58 livebook docker[23169]: 
Mai 06 12:07:58 livebook docker[23169]: 10:07:58.064 [debug] Downloading NIF from https://github.com/elixir-nx/explorer/releases/download/v0.8.2/libexplorer-v0.8.2-nif-2.15-x86_64-unknown-linux-gnu.so.tar.gz
Mai 06 12:08:01 livebook docker[23169]: 
Mai 06 12:08:01 livebook docker[23169]: 10:08:01.168 [debug] NIF cached at /home/livebook/.cache/rustler_precompiled/precompiled_nifs/libexplorer-v0.8.2-nif-2.15-x86_64-unknown-linux-gnu.so.tar.gz and extracted to /home/livebook/.cache/mix/installs/elixir-1.15.7-erts-14.1.1/444655ddd876e676876661841d5d38eb/_build/dev/lib/explorer/priv/native/libexplorer-v0.8.2-nif-2.15-x86_64-unknown-linux-gnu.so

Current behavior

When running a setup task (here I'm running the "Data transform with Explorer" setup), the server disconnects after the following output:

==> explorer
Compiling 25 files (.ex)

11:17:24.711 [debug] Copying NIF from cache and extracting to /home/livebook/.cache/mix/installs/elixir-1.15.7-erts-14.1.1/444655ddd876e676876661841d5d38eb/_build/dev/lib/explorer/priv/native/libexplorer-v0.8.2-nif-2.15-x86_64-unknown-linux-gnu.so

I am running the Docker container as a dedicated user, livebook (UID 1001), via a systemd unit:

[Unit]
Description=Livebook Server
After=docker.service
Requires=docker.service

[Service]
User=livebook
Group=livebook
Restart=always
Environment=LIVEBOOK_DEBUG=true
ExecStartPre=-/usr/bin/docker stop livebook
ExecStartPre=-/usr/bin/docker rm livebook
ExecStartPre=/usr/bin/docker pull ghcr.io/livebook-dev/livebook
ExecStart=/usr/bin/docker run --name livebook -u 1001:1001 -e "LIVEBOOK_PASSWORD=" -p 8080:8080 -p 8081:8081 -v /home/livebook:/data ghcr.io/livebook-dev/livebook
ExecStop=/usr/bin/docker stop livebook

[Install]
WantedBy=multi-user.target

Creating a folder (123) from the web UI works as intended:

I have no name!@f0aa748746e0:/$ ls -la data
total 16
drwxr-xr-x 4 1001 1001 4096 May  6 11:25 .
drwxr-xr-x 1 root root 4096 May  6 11:22 ..
drwxr-xr-x 2 1001 1001 4096 May  6 11:25 123

The mix folder seems to be created with the correct permissions:

I have no name!@f0aa748746e0:/$ ls -la /home/livebook/
total 28
drwxrwxrwx 1 root root 4096 May  6 11:26 .
drwxr-xr-x 1 root root 4096 Nov  9 11:15 ..
drwxr-xr-x 4 1001 1001 4096 May  6 11:26 .cache
drwxr-xr-x 3 1001 1001 4096 May  6 11:26 .hex
drwxr-xr-x 3 1001 1001 4096 May  6 11:22 .local
drwxrwxrwx 1 root root 4096 Nov  9 11:15 .mix

Also the cached NIF seems to be laid out as expected:

I have no name!@f0aa748746e0:/$ ls -la /home/livebook/.cache/mix/installs/elixir-1.15.7-erts-14.1.1/444655ddd876e676876661841d5d38eb/_build/dev/lib/explorer/priv/native/
total 64844
drwxr-xr-x 2 1001 1001     4096 May  6 11:26 .
drwxr-xr-x 3 1001 1001     4096 May  6 11:26 ..
-rwxr-xr-x 1 1001 1001 66391768 Apr 22 12:40 libexplorer-v0.8.2-nif-2.15-x86_64-unknown-linux-gnu.so

Any hint on how to find out why this doesn't work is much appreciated. I wonder why it crashes without any further output.

Let me know how I can help. Thanks a lot!

What happens if you build a similar Docker image (matching Elixir/Erlang versions) with iex instead and then run Mix.install [:explorer]? My gut feeling says it is not a Livebook issue per se.

Thanks for the quick response. I don't think it's a Livebook issue either; I was wondering if there's a way to find out why it crashes.

I will try to set up a Livebook instance in a different base container and come back as soon as I have new details.

Is it the whole Docker container that crashes, or only the notebook runtime?

Only the notebook runtime. The container keeps running and I can immediately trigger a reconnect from the GUI. Unfortunately the logs stay silent, and interestingly they contain empty log lines.

Either a segmentation fault or the OS somehow killing it because it thinks it is running out of memory?
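One way to tell these cases apart is to look at how the process ended, since a segmentation fault, an out-of-memory kill and an illegal instruction each show up as a different signal. The following is a minimal Python sketch, not part of Livebook; the helper names describe_exit and run_and_describe are made up for illustration, and it assumes you can start the failing command yourself (for example from a shell inside the container).

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a process exit status into a short human-readable reason.

    subprocess reports death-by-signal as a negative return code, while
    shells conventionally report it as 128 + signal number.
    """
    if returncode < 0:
        signum = -returncode
    elif returncode > 128:
        signum = returncode - 128
    else:
        return f"exited with status {returncode}"
    try:
        return f"terminated by {signal.Signals(signum).name}"
    except ValueError:
        # Status above 128 that does not map to a known signal number.
        return f"exit status {returncode} (no matching signal)"


def run_and_describe(command):
    """Run a command (a list of arguments) and describe how it ended."""
    completed = subprocess.run(command)
    return describe_exit(completed.returncode)


# Harmless demonstration: a child Python process that exits cleanly.
print(run_and_describe([sys.executable, "-c", "pass"]))
```

With such a helper, SIGSEGV would point to a segmentation fault, SIGKILL is the signal the kernel's OOM killer sends, and SIGILL means the CPU was asked to execute an instruction it does not support.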

Here is the log of two consecutive reconnect-and-setup attempts:

Mai 06 16:07:47 livebook docker[24339]: 14:07:47.322 [debug] HANDLE EVENT "queue_cell_evaluation" in LivebookWeb.SessionLive
Mai 06 16:07:47 livebook docker[24339]:   Parameters: %{"cell_id" => "setup", "disable_dependencies_cache" => false}
Mai 06 16:07:47 livebook docker[24339]: 14:07:47.323 [debug] Replied in 194µs
Mai 06 16:07:48 livebook docker[24339]: 
Mai 06 16:07:48 livebook docker[24339]: 14:07:48.541 [debug] Copying NIF from cache and extracting to /home/livebook/.cache/mix/installs/elixir-1.15.7-erts-14.1.1/444655ddd876e676876661841d5d38eb/_build/dev/lib/explorer/priv/native/libexplorer-v0.8.2-nif-2.15-x86_64-unknown-linux-gnu.so
Mai 06 16:07:51 livebook docker[24339]: 14:07:51.891 [debug] HANDLE EVENT "queue_cell_evaluation" in LivebookWeb.SessionLive
Mai 06 16:07:51 livebook docker[24339]:   Parameters: %{"cell_id" => "setup", "disable_dependencies_cache" => false}
Mai 06 16:07:51 livebook docker[24339]: 14:07:51.891 [debug] Replied in 182µs
Mai 06 16:07:53 livebook docker[24339]: 
Mai 06 16:07:53 livebook docker[24339]: 14:07:53.100 [debug] Copying NIF from cache and extracting to /home/livebook/.cache/mix/installs/elixir-1.15.7-erts-14.1.1/444655ddd876e676876661841d5d38eb/_build/dev/lib/explorer/priv/native/libexplorer-v0.8.2-nif-2.15-x86_64-unknown-linux-gnu.so

The container runs in a VM with 12 GB of memory, and there is currently no load in the container.

I have the exact same issue happening to me using livebook-dev/livebook:latest (I also tried edge, same issue).

In my case I'm running it in my TrueNAS instance.

So I opened a shell in my Docker container, started iex, and ran Mix.install. The error I get is "Illegal instruction".

That's what I suspected: something is going on in Docker's emulation layer. :( Does the OS version in the NIF match the system you are currently running? What is your host OS?

My CPU is a Ryzen 5 1600:

processor       : 11
vendor_id       : AuthenticAMD
cpu family      : 23
model           : 8
model name      : AMD Ryzen 5 1600 Six-Core Processor
stepping        : 2
microcode       : 0x800820d
cpu MHz         : 3569.248
cache size      : 512 KB
physical id     : 0
siblings        : 12
core id         : 6
cpu cores       : 6
apicid          : 13
initial apicid  : 13
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate ssbd ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca sev sev_es
bugs            : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso div0
bogomips        : 6399.93
TLB size        : 2560 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 43 bits physical, 48 bits virtual
power management: ts ttp tm hwpstate cpb eff_freq_ro [13] [14]

But my TrueNAS instance runs on top of Proxmox, so the CPU that the Docker container sees is a QEMU Virtual CPU version 2.5+:

processor       : 2
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 107
model name      : QEMU Virtual CPU version 2.5+
stepping        : 1
microcode       : 0x1000065
cpu MHz         : 3205.846
cache size      : 512 KB
physical id     : 0
siblings        : 12
core id         : 2
cpu cores       : 12
apicid          : 2
initial apicid  : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx lm rep_good nopl cpuid extd_apicid tsc_known_freq pni ssse3 cx16 sse4_1 sse4_2 x2apic popcnt aes hypervisor lahf_lm cmp_legacy 3dnowprefetch vmmcall
bugs            : fxsave_leak sysret_ss_attrs null_seg swapgs_fence amd_e400 spectre_v1 spectre_v2
bogomips        : 6411.69
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:
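
The two flag lists above can be compared mechanically. The following is a rough Python sketch (the function names are made up for illustration) that pulls the flags line out of /proc/cpuinfo text and reports what the host offers but the guest does not. The sample strings are abbreviated excerpts of the two lists quoted above, not the full lists.

```python
def parse_flags(cpuinfo_text):
    """Return the feature flags of the first processor entry as a set."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_in_guest(host_text, guest_text):
    """List flags present on the host but absent in the guest, sorted."""
    return sorted(parse_flags(host_text) - parse_flags(guest_text))


# Abbreviated excerpts of the flag lists quoted above (not the full lists).
HOST = "flags : fpu sse2 ssse3 fma sse4_1 sse4_2 movbe popcnt aes xsave avx f16c abm bmi1 avx2 bmi2"
GUEST = "flags : fpu sse2 ssse3 sse4_1 sse4_2 popcnt aes"

print(missing_in_guest(HOST, GUEST))
```

On a real system the two inputs would be the full contents of /proc/cpuinfo, read once on the host and once inside the VM or container.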

I will see if I can change that to something that doesn't crash.

OK, FYI: after I changed the CPU type from KVM64 to host, installing Explorer started working again.

I'm not sure if @florianb's issue is the same as mine, but if it is, then this is not a Livebook issue.

@sezaru - awesome finding! Indeed, it seems the precompiled NIFs rely on CPU features that the virtual CPU did not expose. I ended up using the x86-64-v3 CPU type to preserve live migration (we're using PVE as well).

So I guess the issue can be closed!
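
To see which level a given (virtual) CPU reaches, its flags can be compared against the feature sets of the x86-64 microarchitecture levels. The sketch below is an illustration in Python, not something taken from Livebook or Explorer. It assumes the Linux /proc/cpuinfo flag names, where LZCNT is reported as abm and xsave serves as a stand-in for OSXSAVE; the function names are hypothetical. Whether the Explorer NIF needs exactly this level is not established in this thread.

```python
# Feature sets per level, using Linux /proc/cpuinfo flag names (assumption:
# "abm" stands for LZCNT and "xsave" is used as a proxy for OSXSAVE).
LEVELS = [
    ("x86-64-v1", {"lm", "cmov", "cx8", "fpu", "fxsr", "mmx", "syscall", "sse2"}),
    ("x86-64-v2", {"cx16", "lahf_lm", "popcnt", "sse4_1", "sse4_2", "ssse3"}),
    ("x86-64-v3", {"avx", "avx2", "bmi1", "bmi2", "f16c", "fma", "abm", "movbe", "xsave"}),
]


def flags_from_cpuinfo(text):
    """Extract the flags of the first processor entry from cpuinfo text."""
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def microarch_level(flags):
    """Return the highest level whose features (and all lower levels') are present."""
    reached = None
    for name, required in LEVELS:
        if not required <= flags:
            break
        reached = name
    return reached


# Flags of the QEMU virtual CPU exactly as quoted earlier in this thread.
GUEST_FLAGS = (
    "fpu de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 "
    "clflush mmx fxsr sse sse2 ht syscall nx lm rep_good nopl cpuid "
    "extd_apicid tsc_known_freq pni ssse3 cx16 sse4_1 sse4_2 x2apic popcnt "
    "aes hypervisor lahf_lm cmp_legacy 3dnowprefetch vmmcall"
)

print(microarch_level(set(GUEST_FLAGS.split())))  # prints x86-64-v2
```

On a live system the flags could be obtained with flags_from_cpuinfo(open("/proc/cpuinfo").read()) instead of the hard-coded string.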

@josevalim do you want me to file a PR adding some kind of warning to the docs about this issue? It's certainly not a Livebook-specific issue, but other Livebook users might run into it as well.

Wouldn't this be more of an Explorer issue? Are we assuming too many CPU features?

I think it's an issue with NIFs in general, and what boggles me is that I got no feedback from the stack. "Illegal instruction" is something to work with, but yeah, you're right - this should be pushed further down.

I just assumed that this repo is one likely place for this issue to surface. But we might add a hint later, once more people have stumbled over this. In the meantime I will try to find out where the CPU restrictions sneaked in; maybe it's something that should be handled by default in rustler.

Of course, it would be best if the NIF could indicate such issues before it is loaded.