beacon-biosignals / Ray.jl

Julia API for Ray


Improve handling of tasks killed due to low memory

omus opened this issue · comments

While running some multi-node benchmarks I noticed this failure occur at the end of the run on the driver:

ERROR: LoadError: TaskFailedException

    nested task error: ArgumentError: 'Int32' iterates 'Int32' values, which doesn't satisfy the Tables.jl `AbstractRow` interface
    Stacktrace:
     [1] invalidtable(#unused#::Int32, #unused#::Int32)
       @ Tables ~/.julia/packages/Tables/AcRIE/src/tofromdatavalues.jl:41
     [2] iterate
       @ ~/.julia/packages/Tables/AcRIE/src/tofromdatavalues.jl:47 [inlined]
     [3] buildcolumns
       @ ~/.julia/packages/Tables/AcRIE/src/fallbacks.jl:209 [inlined]
     [4] _columns
       @ ~/.julia/packages/Tables/AcRIE/src/fallbacks.jl:274 [inlined]
     [5] columns
       @ ~/.julia/packages/Tables/AcRIE/src/fallbacks.jl:258 [inlined]
     [6] DataFrames.DataFrame(x::Int32; copycols::Bool)
       @ DataFrames ~/.julia/packages/DataFrames/LteEl/src/other/tables.jl:57
     [7] append!(df::DataFrames.DataFrame, table::Int32; cols::Symbol, promote::Bool)
       @ DataFrames ~/.julia/packages/DataFrames/LteEl/src/other/tables.jl:71
    ...

...and 6 more exceptions.

Stacktrace:
 [1] sync_end(c::Channel{Any})
   @ Base ./task.jl:445
 [2] macro expansion
   @ ./task.jl:477 [inlined]
 [3] reduce_results(ray_objects::Vector{Any})
    ...
in expression starting at /tmp/ray/session_2023-09-22_07-13-21_412745_8/runtime_resources/working_dir_files/_ray_pkg_0a43fd8298c1f456/migration.jl:5
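For context, the `ArgumentError` itself is just what DataFrames/Tables.jl report when handed a bare integer instead of a table, which suggests `Ray.get` returned a raw value for these objects rather than the expected result (see the metadata discussion below). A minimal reproduction outside of Ray:

```julia
using DataFrames

# Minimal reproduction of the ArgumentError above (the value 22 is illustrative):
# appending anything that isn't a table -- here a bare Int32 -- hits the same
# Tables.jl fallback and fails with the identical message.
df = DataFrame()
append!(df, Int32(22))
```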

Looking at the Ray dashboard for this job I saw 7 failed tasks with this error:

Error Type: OUT_OF_MEMORY
Task was killed due to the node running low on memory.
Memory on the node (IP: 10.0.18.21, ID: c1e1fd57171f3c092f46167a78eb2d5007398539847a38edfe7f1829) where the task (task ID: 98937e2987c30bec7128cff4045939da2639045b1f000000, name=v0_4_4.process_segment, pid=162, memory used=3.10GB) was running was 51.31GB / 54.00GB (0.950157), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: a4340f66d387df4707564760c0913885d662d19424f2ba87c991daf3) because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip 10.0.18.21`. To see the logs of the worker, use `ray logs worker-a4340f66d387df4707564760c0913885d662d19424f2ba87c991daf3*out -ip 10.0.18.21. Top 10 memory users:
PID	MEM(GB)	COMMAND
161	4.24	/usr/local/julia/bin/julia -Cnative -J/usr/local/julia/lib/julia/sys.so -g1 -e using Ray; start_work...
159	3.53	/usr/local/julia/bin/julia -Cnative -J/usr/local/julia/lib/julia/sys.so -g1 -e using Ray; start_work...
163	3.52	/usr/local/julia/bin/julia -Cnative -J/usr/local/julia/lib/julia/sys.so -g1 -e using Ray; start_work...
432	3.42	/usr/local/julia/bin/julia -Cnative -J/usr/local/julia/lib/julia/sys.so -g1 -e using Ray; start_work...
423	3.33	/usr/local/julia/bin/julia -Cnative -J/usr/local/julia/lib/julia/sys.so -g1 -e using Ray; start_work...
402	3.31	/usr/local/julia/bin/julia -Cnative -J/usr/local/julia/lib/julia/sys.so -g1 -e using Ray; start_work...
164	3.26	/usr/local/julia/bin/julia -Cnative -J/usr/local/julia/lib/julia/sys.so -g1 -e using Ray; start_work...
503	3.25	/usr/local/julia/bin/julia -Cnative -J/usr/local/julia/lib/julia/sys.so -g1 -e using Ray; start_work...
158	3.13	/usr/local/julia/bin/julia -Cnative -J/usr/local/julia/lib/julia/sys.so -g1 -e using Ray; start_work...
162	3.10	/usr/local/julia/bin/julia -Cnative -J/usr/local/julia/lib/julia/sys.so -g1 -e using Ray; start_work...
Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. Set max_retries to enable retry when the task crashes due to OOM. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.
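For completeness, the knobs mentioned at the end of that message are read by the raylet, so as far as I can tell they need to be set in the environment of whatever runs `ray start`, not just in the Julia driver. A rough sketch of launching a head node from Julia with a higher kill threshold:

```julia
# Sketch only: tune the memory monitor when starting the head node. The kill
# threshold defaults to 0.95; per the message above, setting
# RAY_memory_monitor_refresh_ms to 0 disables worker killing entirely.
cmd = addenv(`ray start --head`,
             "RAY_memory_usage_threshold" => "0.98",
             "RAY_memory_monitor_refresh_ms" => "250")
run(cmd)
```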

It would be better if more details about this failure were reported back to the driver. Additionally, I'm not sure why these tasks were not rescheduled.

> Additionally, I'm not sure why these tasks were not rescheduled.

That's disappointing; I accidentally triggered an OOM locally and tasks were rescheduled, so I was hoping it would Just Work™.

We've also hit this again on an internal benchmark (running on a single large EC2 instance), where consistent memory pressure causes Ray to flail and keep recreating workers until something finally snaps and tasks stop being rescheduled and just fail. So it's possible these visible failures are only the tip of the iceberg, and there's a lot more OOM killing/rescheduling happening that we're not aware of.

With the change in #180 I saw the following warnings while re-running the MWE for this failure:

┌ Warning: Unhandled RayObject.Metadata: 22
└ @ Ray ~/.julia/dev/Ray/src/ray_serializer.jl:89

This revealed the connection to the OUT_OF_MEMORY error shown in the description: the metadata value that Ray.jl currently leaves unhandled appears to be the error code the raylet attaches to the killed task's result object.

I'll start by adding support to Ray.jl for processing these metadata error codes from the raylet.
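Roughly the direction I have in mind (a sketch only; the hook point in the serializer and names like `RayTaskError`/`check_metadata` are hypothetical, and it assumes the metadata on error objects is just the ASCII-encoded `ErrorType` value, such as the `22` in the warning above, which lines up with the OUT_OF_MEMORY failures):

```julia
# Sketch: treat error metadata as an ErrorType code and throw a Julia exception
# instead of handing the raw value back to user code.
struct RayTaskError <: Exception
    error_type::Int
end

function Base.showerror(io::IO, e::RayTaskError)
    print(io, "RayTaskError: task failed with ErrorType ", e.error_type)
    e.error_type == 22 && print(io, " (OUT_OF_MEMORY)")
end

function check_metadata(metadata::Vector{UInt8})
    # Error objects appear to carry the ErrorType enum value as ASCII digits
    # (e.g. "22"); anything that doesn't parse is treated as non-error metadata here.
    code = tryparse(Int, String(copy(metadata)))
    code === nothing || throw(RayTaskError(code))
    return nothing
end
```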

While digging into this I've noticed that when the metadata indicates an error, the data sometimes includes more details (depending on the metadata error code). Unfortunately, this data seems to be MessagePack-encoded and to contain pickled Python data.

I'll need to look into how non-Python languages deal with this, but we may just have to use a heuristic that extracts the useful string data.
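In the meantime, a crude heuristic along these lines might be enough to surface the embedded message text (purely a sketch; none of this is Ray.jl or Ray API):

```julia
# Pull out human-readable runs from an opaque msgpack/pickle error payload so at
# least the message text can be shown to the user.
function extract_strings(data::Vector{UInt8}; minlen::Int=8)
    strs = String[]
    buf = UInt8[]
    for byte in data
        if 0x20 <= byte <= 0x7e            # printable ASCII
            push!(buf, byte)
        else
            length(buf) >= minlen && push!(strs, String(copy(buf)))
            empty!(buf)
        end
    end
    length(buf) >= minlen && push!(strs, String(buf))
    return strs
end
```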

Okay, after doing some spelunking in the source code:

So passing max_retries through with a non-zero value, as in #213, should unlock the OOM retry behavior!
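For anyone else hitting this, a sketch of what that could look like once #213 lands (I'm assuming `max_retries` ends up as a keyword on `Ray.submit_task`; check the PR for the final interface, and `segment` here is a stand-in for your own arguments):

```julia
using Ray

# Hypothetical usage once #213 is in: a non-zero max_retries lets the raylet
# reschedule the task after an OOM kill instead of surfacing the failure.
obj = Ray.submit_task(process_segment, (segment,); max_retries=3)
result = Ray.get(obj)
```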