[Core] Allow us to configure new memory and cpu config upon subsequent retries
raghumdani opened this issue · comments
Description
`ray.remote` takes `memory`, `num_cpus`, `max_retries`, and `retry_exceptions`. We have seen that the most common cause of task failures is OOM errors. If we retry such a task with the same memory config, it will fail again. Hence, a feature to change the resource config on subsequent retries of a task would be tremendously useful. We already have a workaround that does this on our side, but this is a generic improvement that would benefit Ray users and simplify our code.
Use case
We would be able to overcome out-of-memory errors by retrying failed tasks with a larger memory allocation on each subsequent retry.
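To make the request concrete, here is a minimal sketch of the kind of user-side workaround this feature would replace. It is not our actual code and not an existing Ray API: `retry_with_more_memory`, its parameters, and `fake_task` are hypothetical names, and the built-in `MemoryError` stands in for whatever exception signals an OOM in a real deployment.

```python
from typing import Callable, List, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_more_memory(
    run: Callable[[int], T],
    base_memory: int,
    growth: float = 2.0,
    max_retries: int = 3,
    retry_on: Tuple[Type[BaseException], ...] = (MemoryError,),
) -> T:
    """Call run(memory_bytes), growing the memory request after each OOM-like failure.

    Hypothetical helper: `run` is expected to submit the task with the given
    memory request and block until the result is available.
    """
    memory = base_memory
    for attempt in range(max_retries + 1):
        try:
            return run(memory)
        except retry_on:
            if attempt == max_retries:
                raise  # retries exhausted, surface the last failure
            memory = int(memory * growth)  # ask for more memory next time
    raise AssertionError("unreachable")


# Stand-in for a remote task: fails until it is given at least 3 GiB.
GiB = 1024 ** 3
attempts: List[int] = []


def fake_task(memory: int) -> str:
    attempts.append(memory)
    if memory < 3 * GiB:
        raise MemoryError("simulated OOM")
    return "done"


result = retry_with_more_memory(fake_task, base_memory=1 * GiB)

# With Ray, `run` might look like this (assumption: `my_task` is a function
# decorated with @ray.remote, and Ray's own retries are disabled so the
# helper controls them):
#   run = lambda mem: ray.get(my_task.options(memory=mem, max_retries=0).remote(arg))
#   retry_on = (ray.exceptions.OutOfMemoryError,)
```

In this sketch the simulated task is attempted with 1 GiB, 2 GiB, and then 4 GiB before it succeeds. Built-in support would let the same policy be declared once on the task instead of being wrapped around every call site.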