[data][train] Bug in SplitCoordinator: "assert self._output_iterator is not None"
raulchen opened this issue
Hao Chen commented
This bug happens occasionally; it looks like a race condition.
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/data/_internal/block_batching/iter_batches.py", line 271, in prefetch_batches_locally
    next_block_ref_and_metadata = next(block_ref_iter)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/data/_internal/util.py", line 898, in __next__
    return next(self.it)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/data/_internal/iterator/stream_split_iterator.py", line 79, in gen_blocks
    cur_epoch = ray.get(
ray.exceptions.RayTaskError(AssertionError): ray::SplitCoordinator.start_epoch() (pid=96843, ip=172.24.101.168, actor_id=4c22650eb39c06073f62b14408000000, repr=<ray.data._internal.iterator.stream_split_iterator.SplitCoordinator object at 0x79550c01bf40>)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/data/_internal/iterator/stream_split_iterator.py", line 201, in start_epoch
    epoch_id = self._barrier(split_idx)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/data/_internal/iterator/stream_split_iterator.py", line 280, in _barrier
    assert self._output_iterator is not None
AssertionError
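
For illustration only, here is a toy sketch of why an assertion like the one above can fail intermittently when it depends on call ordering. This is not Ray's code and it does not claim to be the actual root cause: `ToyCoordinator`, `begin`, `end`, `barrier`, and `replay` are all hypothetical names. It only shows that the same `assert` passes under one interleaving of calls and fails under another.

```python
class ToyCoordinator:
    """Hypothetical stand-in; NOT Ray's SplitCoordinator."""

    def __init__(self):
        # No iterator exists until an epoch has been started.
        self._output_iterator = None

    def begin(self):
        # One caller starts an epoch and creates the shared iterator.
        self._output_iterator = iter([1, 2, 3])

    def end(self):
        # Another caller tears the epoch down and clears the iterator.
        self._output_iterator = None

    def barrier(self):
        # Same shape of check as in the traceback: it assumes that
        # begin() has run and end() has not run yet.
        assert self._output_iterator is not None
        return True


def replay(order):
    """Apply the named calls in the given order and report the outcome."""
    coord = ToyCoordinator()
    try:
        for step in order:
            getattr(coord, step)()
    except AssertionError:
        return "AssertionError"
    return "ok"


# The usual ordering: the check runs while the iterator exists.
usual = replay(["begin", "barrier", "end"])

# The rare ordering: the iterator is cleared before the check runs.
rare = replay(["begin", "end", "barrier"])
```

If the real failure is an ordering problem of this kind, it would be consistent with the bug appearing only occasionally, since it would need a specific interleaving of concurrent calls to the actor.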