Improve handling of tasks killed due to low memory
omus opened this issue
While running some multi-node benchmarks I noticed this failure occur at the end of the run on the driver:
```
ERROR: LoadError: TaskFailedException
nested task error: ArgumentError: 'Int32' iterates 'Int32' values, which doesn't satisfy the Tables.jl `AbstractRow` interface
Stacktrace:
[1] invalidtable(#unused#::Int32, #unused#::Int32)
@ Tables ~/.julia/packages/Tables/AcRIE/src/tofromdatavalues.jl:41
[2] iterate
@ ~/.julia/packages/Tables/AcRIE/src/tofromdatavalues.jl:47 [inlined]
[3] buildcolumns
@ ~/.julia/packages/Tables/AcRIE/src/fallbacks.jl:209 [inlined]
[4] _columns
@ ~/.julia/packages/Tables/AcRIE/src/fallbacks.jl:274 [inlined]
[5] columns
@ ~/.julia/packages/Tables/AcRIE/src/fallbacks.jl:258 [inlined]
[6] DataFrames.DataFrame(x::Int32; copycols::Bool)
@ DataFrames ~/.julia/packages/DataFrames/LteEl/src/other/tables.jl:57
[7] append!(df::DataFrames.DataFrame, table::Int32; cols::Symbol, promote::Bool)
@ DataFrames ~/.julia/packages/DataFrames/LteEl/src/other/tables.jl:71
...
...and 6 more exceptions.
Stacktrace:
[1] sync_end(c::Channel{Any})
@ Base ./task.jl:445
[2] macro expansion
@ ./task.jl:477 [inlined]
[3] reduce_results(ray_objects::Vector{Any})
...
in expression starting at /tmp/ray/session_2023-09-22_07-13-21_412745_8/runtime_resources/working_dir_files/_ray_pkg_0a43fd8298c1f456/migration.jl:5
```
Looking at the Ray dashboard for this job I saw 7 failed tasks with this error:
```
Error Type: OUT_OF_MEMORY
Task was killed due to the node running low on memory.
Memory on the node (IP: 10.0.18.21, ID: c1e1fd57171f3c092f46167a78eb2d5007398539847a38edfe7f1829) where the task (task ID: 98937e2987c30bec7128cff4045939da2639045b1f000000, name=v0_4_4.process_segment, pid=162, memory used=3.10GB) was running was 51.31GB / 54.00GB (0.950157), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: a4340f66d387df4707564760c0913885d662d19424f2ba87c991daf3) because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip 10.0.18.21`. To see the logs of the worker, use `ray logs worker-a4340f66d387df4707564760c0913885d662d19424f2ba87c991daf3*out -ip 10.0.18.21`. Top 10 memory users:
PID MEM(GB) COMMAND
161 4.24 /usr/local/julia/bin/julia -Cnative -J/usr/local/julia/lib/julia/sys.so -g1 -e using Ray; start_work...
159 3.53 /usr/local/julia/bin/julia -Cnative -J/usr/local/julia/lib/julia/sys.so -g1 -e using Ray; start_work...
163 3.52 /usr/local/julia/bin/julia -Cnative -J/usr/local/julia/lib/julia/sys.so -g1 -e using Ray; start_work...
432 3.42 /usr/local/julia/bin/julia -Cnative -J/usr/local/julia/lib/julia/sys.so -g1 -e using Ray; start_work...
423 3.33 /usr/local/julia/bin/julia -Cnative -J/usr/local/julia/lib/julia/sys.so -g1 -e using Ray; start_work...
402 3.31 /usr/local/julia/bin/julia -Cnative -J/usr/local/julia/lib/julia/sys.so -g1 -e using Ray; start_work...
164 3.26 /usr/local/julia/bin/julia -Cnative -J/usr/local/julia/lib/julia/sys.so -g1 -e using Ray; start_work...
503 3.25 /usr/local/julia/bin/julia -Cnative -J/usr/local/julia/lib/julia/sys.so -g1 -e using Ray; start_work...
158 3.13 /usr/local/julia/bin/julia -Cnative -J/usr/local/julia/lib/julia/sys.so -g1 -e using Ray; start_work...
162 3.10 /usr/local/julia/bin/julia -Cnative -J/usr/local/julia/lib/julia/sys.so -g1 -e using Ray; start_work...
Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. Set max_retries to enable retry when the task crashes due to OOM. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.
```
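The kill policy that message describes (once node usage crosses the threshold, kill the most recently scheduled task) can be sketched in a few lines. This is a hypothetical simplification for illustration — the `pick_victim` helper and its task list are made up, not Ray internals:

```python
def pick_victim(tasks, used_gb, total_gb, threshold=0.95):
    """Return the task to kill, or None if usage is under the threshold.

    `tasks` is a list of (task_name, schedule_time) tuples; Ray's memory
    monitor kills the most recently scheduled task first.
    """
    if used_gb / total_gb < threshold:
        return None
    return max(tasks, key=lambda t: t[1])[0]

tasks = [("process_segment_a", 1.0), ("process_segment_b", 5.0)]

# With the numbers from the message above (51.31 GB / 54.00 GB ~= 0.95),
# the newest task is chosen as the victim.
print(pick_victim(tasks, 51.31, 54.00))  # process_segment_b
```

Note that the victim is chosen by scheduling recency, not by memory usage, which is why the killed task (pid 162, 3.10 GB) was not the largest consumer in the table above.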
It would be better if more details in this case were reported back to the driver. Additionally, I'm not sure why these tasks were not rescheduled.
That's disappointing; I accidentally triggered an OOM locally and tasks were rescheduled, so I was hoping it would Just Work™.
We've also hit this again on an internal benchmark (running on a single large EC2 instance), where consistent memory pressure causes Ray to flail, repeatedly recreating workers, until eventually something snaps and tasks fail without being rescheduled. So it's possible these observed failures are the tip of the iceberg and there's a lot more OOM killing/rescheduling happening that we're not aware of.
With the change in #180 I saw the following warnings while re-running the MWE for this failure:
```
┌ Warning: Unhandled RayObject.Metadata: 22
└ @ Ray ~/.julia/dev/Ray/src/ray_serializer.jl:89
```
This revealed a connection between the unprocessed metadata and the OUT_OF_MEMORY error shown in the description.
I'll start by adding support to Ray.jl for processing metadata exceptions from the Raylet.
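As I understand it, Ray marks an error result by setting the object's metadata to the ASCII digits of an `ErrorType` enum value, which is why the unhandled metadata above is `22`. A minimal sketch of detecting these, assuming that encoding (the `RayTaskError` type and the mapping of `22` to OUT_OF_MEMORY are inferred from this issue's logs, not taken from the Ray sources):

```python
# Assumed from the warning above: metadata value 22 showed up alongside
# the OUT_OF_MEMORY failures. Other ErrorType values are omitted.
ERROR_TYPE_NAMES = {22: "OUT_OF_MEMORY"}

class RayTaskError(Exception):
    """Hypothetical error type raised when an object holds an error marker."""

def check_metadata(metadata: bytes):
    """Raise RayTaskError if the object's metadata marks it as an error."""
    if not metadata or not metadata.isdigit():
        return  # plain data object, nothing to do
    code = int(metadata)
    name = ERROR_TYPE_NAMES.get(code, f"UNKNOWN({code})")
    raise RayTaskError(f"task failed with error type {name}")

try:
    check_metadata(b"22")
except RayTaskError as e:
    print(e)  # task failed with error type OUT_OF_MEMORY
```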
While digging into this I've noticed that when the metadata indicates an error, the data sometimes includes more details (depending on the metadata error code). Unfortunately, this data appears to be MessagePack data containing pickled Python objects:
- https://github.com/ray-project/ray/blob/ray-2.5.1/python/ray/_private/serialization.py#L312-L313
- https://github.com/ray-project/ray/blob/ray-2.5.1/python/ray/_private/serialization.py#L219-L235
- https://github.com/ray-project/ray/blob/ray-2.5.1/python/ray/includes/serialization.pxi#L163-L193
I'll need to look into how non-Python languages deal with this, but we may just have to use a heuristic which extracts the useful string data.
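One possible shape for such a heuristic: rather than implementing the pickle protocol outside Python, scan the raw payload for long runs of printable ASCII, which usually recovers the human-readable error message. A sketch with a fabricated payload (the surrounding `\x80\x05\x95…` bytes stand in for pickle framing):

```python
import re

def printable_strings(data: bytes, min_len: int = 8):
    """Extract runs of printable ASCII at least `min_len` bytes long."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [m.decode("ascii") for m in re.findall(pattern, data)]

# Fabricated example payload: pickle-like framing bytes around a message.
payload = b"\x80\x05\x95\x00Task was killed due to low memory\x94\x00"
print(printable_strings(payload))  # ['Task was killed due to low memory']
```

This loses any structure in the original Python exception, but for surfacing an error message to a Julia driver that may be good enough.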
Okay, after doing some spelunking in the source code:
- there's a separate counter for OOM retries in the task manager
- its value is -1 by default, which is "retry indefinitely"
- it can be set via the `RAY_task_oom_retries` environment variable if we want to limit retries to some finite number
- the counter is set to 0 if `max_retries=0` in the submitted task (which we are currently hard coding)
So passing `max_retries` through with some non-zero value, as in #213, should unlock the OOM retry behavior!
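The interaction described above can be sketched as follows. This is reconstructed from the discussion in this issue, not copied from the Ray sources, and `oom_retries_for` is a made-up helper name:

```python
import os

def oom_retries_for(max_retries: int) -> int:
    """OOM retry budget for a task submitted with the given max_retries.

    -1 means "retry indefinitely" (the default when the env var is unset).
    """
    if max_retries == 0:
        # No retries requested at submission: OOM retries are zeroed too,
        # which is the behavior Ray.jl was seeing while hard coding
        # max_retries=0.
        return 0
    return int(os.environ.get("RAY_task_oom_retries", "-1"))

print(oom_retries_for(0))  # 0  -> OOM-killed tasks are never rescheduled
print(oom_retries_for(3))  # -1 -> retry OOM-killed tasks indefinitely
```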