Improve handling of tasks killed due to low memory
omus opened this issue
While running some multi-node benchmarks I noticed this failure occur at the end of the run on the driver:
```
ERROR: LoadError: TaskFailedException
nested task error: ArgumentError: 'Int32' iterates 'Int32' values, which doesn't satisfy the Tables.jl `AbstractRow` interface
Stacktrace:
[1] invalidtable(#unused#::Int32, #unused#::Int32)
@ Tables ~/.julia/packages/Tables/AcRIE/src/tofromdatavalues.jl:41
[2] iterate
@ ~/.julia/packages/Tables/AcRIE/src/tofromdatavalues.jl:47 [inlined]
[3] buildcolumns
@ ~/.julia/packages/Tables/AcRIE/src/fallbacks.jl:209 [inlined]
[4] _columns
@ ~/.julia/packages/Tables/AcRIE/src/fallbacks.jl:274 [inlined]
[5] columns
@ ~/.julia/packages/Tables/AcRIE/src/fallbacks.jl:258 [inlined]
[6] DataFrames.DataFrame(x::Int32; copycols::Bool)
@ DataFrames ~/.julia/packages/DataFrames/LteEl/src/other/tables.jl:57
[7] append!(df::DataFrames.DataFrame, table::Int32; cols::Symbol, promote::Bool)
@ DataFrames ~/.julia/packages/DataFrames/LteEl/src/other/tables.jl:71
...
...and 6 more exceptions.
Stacktrace:
[1] sync_end(c::Channel{Any})
@ Base ./task.jl:445
[2] macro expansion
@ ./task.jl:477 [inlined]
[3] reduce_results(ray_objects::Vector{Any})
...
in expression starting at /tmp/ray/session_2023-09-22_07-13-21_412745_8/runtime_resources/working_dir_files/_ray_pkg_0a43fd8298c1f456/migration.jl:5
```
Looking at the Ray dashboard for this job I saw 7 failed tasks with this error:
```
Error Type: OUT_OF_MEMORY
Task was killed due to the node running low on memory.
Memory on the node (IP: 10.0.18.21, ID: c1e1fd57171f3c092f46167a78eb2d5007398539847a38edfe7f1829) where the task (task ID: 98937e2987c30bec7128cff4045939da2639045b1f000000, name=v0_4_4.process_segment, pid=162, memory used=3.10GB) was running was 51.31GB / 54.00GB (0.950157), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: a4340f66d387df4707564760c0913885d662d19424f2ba87c991daf3) because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip 10.0.18.21`. To see the logs of the worker, use `ray logs worker-a4340f66d387df4707564760c0913885d662d19424f2ba87c991daf3*out -ip 10.0.18.21`. Top 10 memory users:
PID MEM(GB) COMMAND
161 4.24 /usr/local/julia/bin/julia -Cnative -J/usr/local/julia/lib/julia/sys.so -g1 -e using Ray; start_work...
159 3.53 /usr/local/julia/bin/julia -Cnative -J/usr/local/julia/lib/julia/sys.so -g1 -e using Ray; start_work...
163 3.52 /usr/local/julia/bin/julia -Cnative -J/usr/local/julia/lib/julia/sys.so -g1 -e using Ray; start_work...
432 3.42 /usr/local/julia/bin/julia -Cnative -J/usr/local/julia/lib/julia/sys.so -g1 -e using Ray; start_work...
423 3.33 /usr/local/julia/bin/julia -Cnative -J/usr/local/julia/lib/julia/sys.so -g1 -e using Ray; start_work...
402 3.31 /usr/local/julia/bin/julia -Cnative -J/usr/local/julia/lib/julia/sys.so -g1 -e using Ray; start_work...
164 3.26 /usr/local/julia/bin/julia -Cnative -J/usr/local/julia/lib/julia/sys.so -g1 -e using Ray; start_work...
503 3.25 /usr/local/julia/bin/julia -Cnative -J/usr/local/julia/lib/julia/sys.so -g1 -e using Ray; start_work...
158 3.13 /usr/local/julia/bin/julia -Cnative -J/usr/local/julia/lib/julia/sys.so -g1 -e using Ray; start_work...
162 3.10 /usr/local/julia/bin/julia -Cnative -J/usr/local/julia/lib/julia/sys.so -g1 -e using Ray; start_work...
Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. Set max_retries to enable retry when the task crashes due to OOM. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.
```
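The kill policy that message describes (once node usage crosses the threshold, kill the most recently scheduled task) can be sketched in a few lines. This is a hypothetical simplification for illustration — the `pick_victim` helper and its task list are made up, not Ray internals:

```python
def pick_victim(tasks, used_gb, total_gb, threshold=0.95):
    """Return the task to kill, or None if usage is under the threshold.

    `tasks` is a list of (task_name, schedule_time) tuples; Ray's memory
    monitor kills the most recently scheduled task first.
    """
    if used_gb / total_gb < threshold:
        return None
    return max(tasks, key=lambda t: t[1])[0]

tasks = [("process_segment_a", 1.0), ("process_segment_b", 5.0)]

# With the numbers from the message above (51.31 GB / 54.00 GB ~= 0.95),
# the newest task is chosen as the victim.
print(pick_victim(tasks, 51.31, 54.00))  # process_segment_b
```

Note that the victim is chosen by scheduling recency, not by memory usage, which is why the killed task (pid 162, 3.10 GB) was not the largest consumer in the table above.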
It would be better if more details in this case were reported back to the driver. Additionally, I'm not sure why these tasks were not rescheduled.
That's disappointing; I accidentally triggered an OOM locally and tasks were rescheduled, so I was hoping it would Just Work™.
We've also hit this again on an internal benchmark (running on a single large EC2 instance), where consistent memory pressure causes Ray to flail, repeatedly recreating workers, until eventually something snaps and tasks fail without being rescheduled. So it's possible these observed failures are the tip of the iceberg and there's a lot more OOM killing/rescheduling happening that we're not aware of.
With the change in #180 I saw the following warnings while re-running the MWE for this failure:
```
┌ Warning: Unhandled RayObject.Metadata: 22
└ @ Ray ~/.julia/dev/Ray/src/ray_serializer.jl:89
```
This revealed a connection between the unprocessed metadata and the OUT_OF_MEMORY error shown in the description.
I'll start by adding support to Ray.jl for processing metadata exceptions from the Raylet.
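As I understand it, Ray marks an error result by setting the object's metadata to the ASCII digits of an `ErrorType` enum value, which is why the unhandled metadata above is `22`. A minimal sketch of detecting these, assuming that encoding (the `RayTaskError` type and the mapping of `22` to OUT_OF_MEMORY are inferred from this issue's logs, not taken from the Ray sources):

```python
# Assumed from the warning above: metadata value 22 showed up alongside
# the OUT_OF_MEMORY failures. Other ErrorType values are omitted.
ERROR_TYPE_NAMES = {22: "OUT_OF_MEMORY"}

class RayTaskError(Exception):
    """Hypothetical error type raised when an object holds an error marker."""

def check_metadata(metadata: bytes):
    """Raise RayTaskError if the object's metadata marks it as an error."""
    if not metadata or not metadata.isdigit():
        return  # plain data object, nothing to do
    code = int(metadata)
    name = ERROR_TYPE_NAMES.get(code, f"UNKNOWN({code})")
    raise RayTaskError(f"task failed with error type {name}")

try:
    check_metadata(b"22")
except RayTaskError as e:
    print(e)  # task failed with error type OUT_OF_MEMORY
```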
While digging into this I've noticed that when the metadata indicates an error, the data sometimes includes more details (depending on the metadata error code). Unfortunately, this data appears to be MessagePack data containing pickled Python objects:
- https://github.com/ray-project/ray/blob/ray-2.5.1/python/ray/_private/serialization.py#L312-L313
- https://github.com/ray-project/ray/blob/ray-2.5.1/python/ray/_private/serialization.py#L219-L235
- https://github.com/ray-project/ray/blob/ray-2.5.1/python/ray/includes/serialization.pxi#L163-L193
I'll need to look into how non-Python languages deal with this, but we may just have to use a heuristic which extracts the useful string data.
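One possible shape for such a heuristic: rather than implementing the pickle protocol outside Python, scan the raw payload for long runs of printable ASCII, which usually recovers the human-readable error message. A sketch with a fabricated payload (the surrounding `\x80\x05\x95…` bytes stand in for pickle framing):

```python
import re

def printable_strings(data: bytes, min_len: int = 8):
    """Extract runs of printable ASCII at least `min_len` bytes long."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [m.decode("ascii") for m in re.findall(pattern, data)]

# Fabricated example payload: pickle-like framing bytes around a message.
payload = b"\x80\x05\x95\x00Task was killed due to low memory\x94\x00"
print(printable_strings(payload))  # ['Task was killed due to low memory']
```

This loses any structure in the original Python exception, but for surfacing an error message to a Julia driver that may be good enough.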
Okay, after doing some spelunking in the source code:
- there's a separate counter for OOM retries in the task manager
- its value is -1 by default, which is "retry indefinitely"
- it can be set via the `RAY_task_oom_retries` environment variable if we want to limit retries to some finite number
- the counter is set to 0 if `max_retries=0` in the submitted task (which we are currently hard coding)
So passing `max_retries` through with some non-zero value, as in #213, should unlock the OOM retry behavior!
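The interaction described above can be sketched as follows. This is reconstructed from the discussion in this issue, not copied from the Ray sources, and `oom_retries_for` is a made-up helper name:

```python
import os

def oom_retries_for(max_retries: int) -> int:
    """OOM retry budget for a task submitted with the given max_retries.

    -1 means "retry indefinitely" (the default when the env var is unset).
    """
    if max_retries == 0:
        # No retries requested at submission: OOM retries are zeroed too,
        # which is the behavior Ray.jl was seeing while hard coding
        # max_retries=0.
        return 0
    return int(os.environ.get("RAY_task_oom_retries", "-1"))

print(oom_retries_for(0))  # 0  -> OOM-killed tasks are never rescheduled
print(oom_retries_for(3))  # -1 -> retry OOM-killed tasks indefinitely
```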