ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.

Home Page: https://ray.io

[Core] Allow us to configure new memory and cpu config upon subsequent retries

raghumdani opened this issue

Description

ray.remote accepts memory and CPU resource settings along with max_retries and retry_exceptions. In our experience, the most common cause of task failures is running out of memory (OOM). If such a task is retried with the same memory config, it will most likely fail again. A way to change the resource config on subsequent retries of a task would therefore be tremendously useful. We already have a workaround that does this on our side, but a generic solution would benefit other Ray users and simplify our code.

Use case

We will be able to overcome out-of-memory errors by giving tasks more memory on subsequent retries.
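
For illustration, the kind of caller-side workaround mentioned in the description might look roughly like the sketch below. It is not the author's actual workaround and not a Ray API: `run_with_growing_memory`, `growth_factor`, and the `run` callable are hypothetical names, and the built-in `MemoryError` stands in for whatever exception signals an OOM failure. The helper only decides how much memory to request on each attempt; submitting the task is left to the caller.

```python
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def run_with_growing_memory(
    run: Callable[[int], T],
    base_memory: int,
    max_retries: int = 3,
    growth_factor: float = 2.0,
    retry_on: Tuple[Type[BaseException], ...] = (MemoryError,),
) -> T:
    """Call run(memory_bytes), requesting more memory after each failure.

    Hypothetical helper: `run` is expected to submit the task with the
    given memory request and block until the result is available.
    """
    memory = base_memory
    for attempt in range(max_retries + 1):
        try:
            return run(memory)
        except retry_on:
            if attempt == max_retries:
                # Retry budget exhausted: surface the last failure.
                raise
            # Ask for more memory on the next attempt.
            memory = int(memory * growth_factor)
    raise ValueError("max_retries must be non-negative")


# Stand-in for a remote task that needs at least 4 GiB to succeed.
def fake_task(memory_bytes: int) -> str:
    if memory_bytes < 4 * 1024**3:
        raise MemoryError("simulated OOM")
    return "done"


if __name__ == "__main__":
    # Starts at 1 GiB and doubles the request after every simulated OOM.
    print(run_with_growing_memory(fake_task, base_memory=1024**3))
```

With Ray, `run` could presumably be something like `lambda mem: ray.get(my_task.options(memory=mem).remote(arg))` with `retry_on=(ray.exceptions.OutOfMemoryError,)`, where `my_task` and `arg` are placeholders. Built-in support would remove the need for this wrapper and keep the retry logic inside Ray's scheduler.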