beacon-biosignals / Ray.jl

Julia API for Ray

segfault when running on raycluster

glennmoy opened this issue

This is running on an in-house Ray cluster, using a custom-built Docker image.

Sanitised logs:

$ ray job submit --address http://localhost:8265 --working-dir cluster -- julia script.jl
2023-10-27 10:55:05,590 DEBUG utils.py:654 -- Using API server address http://localhost:8265.
Job submission server address: http://localhost:8265
2023-10-27 10:55:05,611 INFO dashboard_sdk.py:338 -- Uploading package gcs://_ray_pkg_4e780a0653e96063.zip.
2023-10-27 10:55:05,611 INFO packaging.py:520 -- Creating a file package for local directory 'cluster'.

-------------------------------------------------------
Job 'raysubmit_XEPP81D8hJfQ7kjf' submitted successfully
-------------------------------------------------------

Next steps
  Query the logs of the job:
    ray job logs raysubmit_XEPP81D8hJfQ7kjf
  Query the status of the job:
    ray job status raysubmit_XEPP81D8hJfQ7kjf
  Request the job to be stopped:
    ray job stop raysubmit_XEPP81D8hJfQ7kjf

Tailing logs until the job exits (disable with --no-wait):

[431] signal (11.1): Segmentation fault
in expression starting at /tmp/ray/session_2023-10-27_10-54-34_483415_1/runtime_resources/working_dir_files/_ray_pkg_4e780a0653e96063/script.jl:7
unknown function (ip: 0x7efd3d3c1ce6)
_ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE12_M_constructIPcEEvT_S7_St20forward_iterator_tag.constprop.0 at /usr/local/share/julia-depot/ab14e38af3/dev/Ray/build/bin/julia_core_worker_lib.so (unknown line)
_ZN5jlcxx6detail11CallFunctorIvJNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_S7_S7_iN3ray5JobIDES7_RKS7_EE5applyEPKvNS_13WrappedCppPtrESF_SF_SF_iSF_SF_SF_ at /usr/local/share/julia-depot/ab14e38af3/dev/Ray/build/bin/julia_core_worker_lib.so (unknown line)
initialize_driver at /usr/local/share/julia-depot/4c99afcd78/packages/CxxWrap/5IZvn/src/CxxWrap.jl:624 [inlined]
#init#13 at /usr/local/share/julia-depot/4c99afcd78/dev/<package>/dev/Ray/src/runtime.jl:99
init at /usr/local/share/julia-depot/4c99afcd78/dev/<package>/dev/Ray/src/runtime.jl:41 [inlined]
#compute_features_ray#16 at /usr/local/share/julia-depot/4c99afcd78/dev/<Package>/src/<Package>.jl:209
compute_features_ray at /usr/local/share/julia-depot/4c99afcd78/dev/<Package>/src/<Package>.jl:207
unknown function (ip: 0x7efca02c1e32)
_jl_invoke at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2940
jl_apply at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/julia.h:1880 [inlined]
do_call at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/interpreter.c:126
eval_value at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/interpreter.c:226
eval_stmt_value at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/interpreter.c:177 [inlined]
eval_body at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/interpreter.c:624
jl_interpret_toplevel_thunk at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/interpreter.c:762
jl_toplevel_eval_flex at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/toplevel.c:912
jl_toplevel_eval_flex at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/toplevel.c:856
ijl_toplevel_eval_in at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/toplevel.c:971
eval at ./boot.jl:370 [inlined]
include_string at ./loading.jl:1903
_jl_invoke at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2940
_include at ./loading.jl:1963
include at ./Base.jl:457
jfptr_include_35036.clone_1 at /usr/local/julia/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2940
exec_options at ./client.jl:307
_start at ./client.jl:522
jfptr__start_40034.clone_1 at /usr/local/julia/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2940
jl_apply at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/julia.h:1880 [inlined]
true_main at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/jlapi.c:573
jl_repl_entrypoint at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/jlapi.c:717
main at julia (unknown line)
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x4010b8)
Allocations: 24165798 (Pool: 24148945; Big: 16853); GC: 35
Segmentation fault (core dumped)

---------------------------------------
Job 'raysubmit_XEPP81D8hJfQ7kjf' failed
---------------------------------------

Status message: Job failed due to an application error, last available logs (truncated to 20,000 chars):
_jl_invoke at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2940
jl_apply at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/julia.h:1880 [inlined]
true_main at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/jlapi.c:573
jl_repl_entrypoint at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/jlapi.c:717
main at julia (unknown line)
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x4010b8)
Allocations: 24165798 (Pool: 24148945; Big: 16853); GC: 35
Segmentation fault (core dumped)

I've narrowed it down to the changes in caddc0e.

I can confirm that 25b0670 is successful while the previous commit 80a7b79 fails.

I have no idea why slurping the return value of `parse_ray_args_from_raylet_out` would cause this:

    # we use session_dir here instead of logs_dir since logs_dir can be set to
    # "" to disable file logging without using env var
-    args = parse_ray_args_from_raylet_out(session_dir)
-    gcs_address = args[3]
-    node_ip_address = args[4]
+    raylet, store, gcs_address, node_ip_address, node_port = parse_ray_args_from_raylet_out(session_dir)
+    # gcs_address = args[3]
+    # node_ip_address = args[4]
     # @info "args: $args"
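
For reference, here is a minimal sketch of why the two forms in the diff look like they should be interchangeable in plain Julia. It assumes the function returns an indexable collection of at least five elements; `fake_parse_args` and its values are hypothetical stand-ins, since the real return type of `parse_ray_args_from_raylet_out` isn't shown here.

```julia
# Hypothetical stand-in for parse_ray_args_from_raylet_out; the real
# function's return type and values are not shown in this issue.
fake_parse_args() = ["raylet.sock", "plasma.sock", "127.0.0.1:6379", "127.0.0.1", 12345]

# Form 1: keep the whole collection and index into it.
args = fake_parse_args()
gcs_by_index = args[3]
ip_by_index = args[4]

# Form 2: destructure the collection into named variables.
raylet, store, gcs_address, node_ip_address, node_port = fake_parse_args()

# Both forms bind the same values at the Julia level.
@assert gcs_by_index == gcs_address
@assert ip_by_index == node_ip_address
```

If that assumption holds, the values themselves are identical, so the difference is presumably in something other than what gets passed on (that part is a guess, not something the logs confirm).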