ray-project / ray

Description

Goal: Given a list of Ray remote task futures (`ObjectRef`s), I want to check whether each one has errored without calling `ray.get` on every element individually.

This feature is offered by similar async execution APIs:
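For illustration, here is what the requested behavior looks like in Python's standard `concurrent.futures`. This choice of library is an assumption and not necessarily one of the APIs the issue had in mind. `Future.exception()` returns the exception a finished task raised, or `None` on success, without re-raising it. The task functions `ok` and `boom` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def ok():
    # Hypothetical task that succeeds.
    return 42


def boom():
    # Hypothetical task that fails.
    raise ValueError("task failed")


with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(ok), pool.submit(boom)]
    # Block until both futures have finished (successfully or not).
    done, not_done = wait(futures)

# Future.exception() returns the raised exception (or None on success)
# without re-raising it, so no per-future try/except is needed.
errors = [f.exception() for f in futures]
```

The Ray equivalent would presumably be a way to ask an `ObjectRef`, or the output of `ray.wait`, whether the task errored without fetching the result.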

Current workaround

We have a "check for failure" function in Ray Train, which may incur some unnecessary overhead to fetch objects:

ray/python/ray/train/_internal/utils.py

Lines 49 to 58 in fa61109

    
for object_ref in finished:
    # Everything in finished has either failed or completed
    # successfully.
    try:
        ray.get(object_ref)
    except RayActorError as exc:
        failed_actor_rank = remote_values.index(object_ref)
        logger.info(f"Worker {failed_actor_rank} has failed.")
        return False, exc
    except Exception as exc:

Use case

I am implementing a control loop that checks the status of some actor tasks every N seconds. I want to know as soon as possible when any of these actor tasks has failed so that I can trigger error handling. To do this, I run an "error check" in a loop with a short wait on each iteration:

while True:
    ready, remaining = ray.wait(tasks, num_returns=len(tasks), timeout=0.01)

    # I want to be able to collect errored tasks without calling ray.get.
    # I want to distinguish successful tasks vs. errored tasks from the output from ray.wait.
    errors = []
    for task in ready:
        try:
            ray.get(task)
        except Exception as e:
            errors.append(e)
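As a rough sketch, the workaround above can be wrapped in a helper that splits the finished refs into succeeded and errored ones. The name `split_by_outcome` and its `get` parameter are hypothetical, not part of Ray. The getter is passed in so the helper can be exercised without a cluster; with Ray you would pass `ray.get`. It still fetches every object, which is exactly the overhead this issue wants to avoid.

```python
from typing import Any, Callable, List, Sequence, Tuple


def split_by_outcome(
    ready: Sequence[Any],
    get: Callable[[Any], Any],
) -> Tuple[List[Any], List[Tuple[Any, Exception]]]:
    """Partition finished refs into succeeded refs and (ref, exception) pairs.

    `get` is assumed to return the result for a ref or raise the task's error
    (for example, `ray.get`).
    """
    succeeded: List[Any] = []
    failed: List[Tuple[Any, Exception]] = []
    for ref in ready:
        try:
            # Still fetches the value; the result is discarded.
            get(ref)
        except Exception as exc:
            failed.append((ref, exc))
        else:
            succeeded.append(ref)
    return succeeded, failed


# Intended use inside the loop above (assumes `ray` is imported and initialized):
#     succeeded, failed = split_by_outcome(ready, ray.get)
```

Returning `(ref, exception)` pairs keeps the association between each error and the task that produced it, which the plain `errors` list in the loop above loses.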

cc: @jjyao @rkooo567


[core] Check if a ray task has errored without calling `ray.get` on it
