FileNotFoundError: [Errno 2] Failed to open local file 'dataset/lfw_multifaces-ingestion/Albert_Costa_0001.jpg'
vbsantos opened this issue · comments
Description
I am encountering a FileNotFoundError when trying to run the job_lfw job using Ray on a Kubernetes cluster. The error occurs after the dataset has been downloaded, when Ray tries to open a local file (during a second round of processing) that apparently does not exist. I am new to the Python and Kubernetes ecosystems, so I apologize if this is a basic error.
Details
- The collection in Qdrant was not created.
- The dataset downloads without issues.
- The error occurs when running `kubectl apply -f kubernetes/job_lfw.yaml`.
- The Kubernetes cluster on Linode was configured with 3 dedicated nodes, each with 32 GB of RAM and 8 CPU cores.
- When the first job encounters this error, all subsequent jobs fail to run because they are unable to communicate with Ray.
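One detail in the traceback below that may matter: the failing read task runs on a worker at ip=10.2.0.153, while the driver connected to the cluster at 10.2.1.35, i.e. a different node. Since `dataset/lfw_multifaces-ingestion/...` is a relative path, each process resolves it against its own working directory and its own pod's local disk, so a file downloaded on one pod may simply not exist where the read task runs. A small stdlib-only sketch of the distinction (the `/data/...` path is a made-up example, not from this repo):

```python
from pathlib import Path

# A relative path is resolved against the current working directory of
# whichever process opens it -- on a multi-node Ray cluster that is the
# worker pod, not the driver pod that downloaded and unzipped the dataset.
rel = Path("dataset/lfw_multifaces-ingestion/Albert_Costa_0001.jpg")
print(rel.is_absolute())   # False
print(rel.resolve())       # depends on the cwd of the calling process

# An absolute path on shared storage (NFS, a mounted volume, or an
# object store like S3) names the same file on every node, which avoids
# this class of mismatch. "/data" here is purely illustrative.
shared = Path("/data/lfw_multifaces-ingestion/Albert_Costa_0001.jpg")
print(shared.is_absolute())  # True
```

This is only a hypothesis about the failure mode, but it would explain why the download succeeds yet the read fails on another pod.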
Error Details
The complete error is as follows:
2024-07-06 17:40:19,695 INFO cli.py:36 -- Job submission server address: http://rayjob-lfw-raycluster-2cv5x-head-svc.default.svc.cluster.local:8265
2024-07-06 17:40:20,438 SUCC cli.py:60 -- ---------------------------------------------
2024-07-06 17:40:20,438 SUCC cli.py:61 -- Job 'rayjob-lfw-t4mxs' submitted successfully
2024-07-06 17:40:20,438 SUCC cli.py:62 -- ---------------------------------------------
2024-07-06 17:40:20,438 INFO cli.py:274 -- Next steps
2024-07-06 17:40:20,438 INFO cli.py:275 -- Query the logs of the job:
2024-07-06 17:40:20,438 INFO cli.py:277 -- ray job logs rayjob-lfw-t4mxs
2024-07-06 17:40:20,438 INFO cli.py:279 -- Query the status of the job:
2024-07-06 17:40:20,438 INFO cli.py:281 -- ray job status rayjob-lfw-t4mxs
2024-07-06 17:40:20,438 INFO cli.py:283 -- Request the job to be stopped:
2024-07-06 17:40:20,439 INFO cli.py:285 -- ray job stop rayjob-lfw-t4mxs
2024-07-06 17:40:20,444 INFO cli.py:292 -- Tailing logs until the job exits (disable with --no-wait):
Downlaod dataset vilsonrodrigues/lfw/lfw_multifaces-ingestion.zip
lfw_multifaces-ingestion.zip: 0%| | 0.00/69.1M [00:00<?, ?B/s]
lfw_multifaces-ingestion.zip: 15%|█▌ | 10.5M/69.1M [00:00<00:00, 89.1MB/s]
lfw_multifaces-ingestion.zip: 46%|████▌ | 31.5M/69.1M [00:00<00:00, 138MB/s]
lfw_multifaces-ingestion.zip: 76%|███████▌ | 52.4M/69.1M [00:00<00:00, 158MB/s]
lfw_multifaces-ingestion.zip: 100%|██████████| 69.1M/69.1M [00:00<00:00, 160MB/s]
lfw_multifaces-ingestion.zip: 100%|██████████| 69.1M/69.1M [00:00<00:00, 149MB/s]
Unzip dataset
Load images with Ray Data
2024-07-06 17:40:24,611 INFO worker.py:1329 -- Using address 10.2.1.35:6379 set in the environment variable RAY_ADDRESS
2024-07-06 17:40:24,611 INFO worker.py:1458 -- Connecting to existing Ray cluster at address: 10.2.1.35:6379...
2024-07-06 17:40:24,618 INFO worker.py:1633 -- Connected to Ray cluster. View the dashboard at 10.2.1.35:8265
Start map batch processing
Batch map process finish
2024-07-06 17:40:25,241 INFO streaming_executor.py:93 -- Executing DAG InputDataBuffer[Input] -> ActorPoolMapOperator[ReadImage->Map(parse_filename)->MapBatches(<lambda>)->MapBatches(UltraLightORTBatchPredictor)] -> ActorPoolMapOperator[MapBatches(BatchFacePostProcessing)] -> ActorPoolMapOperator[MapBatches(MobileFaceNetORTBatchPredictor)]
2024-07-06 17:40:25,242 INFO streaming_executor.py:94 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2024-07-06 17:40:25,242 INFO streaming_executor.py:96 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`
2024-07-06 17:40:25,596 INFO actor_pool_map_operator.py:106 -- ReadImage->Map(parse_filename)->MapBatches(<lambda>)->MapBatches(UltraLightORTBatchPredictor): Waiting for 1 pool actors to start...
2024-07-06 17:40:28,497 INFO actor_pool_map_operator.py:106 -- MapBatches(BatchFacePostProcessing): Waiting for 1 pool actors to start...
2024-07-06 17:40:29,404 INFO actor_pool_map_operator.py:106 -- MapBatches(MobileFaceNetORTBatchPredictor): Waiting for 1 pool actors to start...
Running 0: 0%| | 0/65 [00:00<?, ?it/s]
Running: 2.3/5.0 CPU, 0.0/0.0 GPU, 0.0 MiB/1.15 GiB object_store_memory: 0%| | 0/65 [00:00<?, ?it/s]
Running: 2.3/5.0 CPU, 0.0/0.0 GPU, 4.1 MiB/1.15 GiB object_store_memory: 0%| | 0/65 [00:00<?, ?it/s]
Running: 2.3/5.0 CPU, 0.0/0.0 GPU, 16.53 MiB/1.15 GiB object_store_memory: 0%| | 0/65 [00:04<?, ?it/s]
Running: 2.3/5.0 CPU, 0.0/0.0 GPU, 17.5 MiB/1.15 GiB object_store_memory: 0%| | 0/65 [00:04<?, ?it/s]
Running: 2.3/5.0 CPU, 0.0/0.0 GPU, 7.73 MiB/1.15 GiB object_store_memory: 0%| | 0/65 [00:04<?, ?it/s]
Running: 2.3/5.0 CPU, 0.0/0.0 GPU, 4.17 MiB/1.15 GiB object_store_memory: 0%| | 0/65 [00:05<?, ?it/s]
Running: 2.3/5.0 CPU, 0.0/0.0 GPU, 4.17 MiB/1.15 GiB object_store_memory: 2%|▏ | 1/65 [00:05<05:27, 5.12s/it]
Running: 2.3/5.0 CPU, 0.0/0.0 GPU, 4.09 MiB/1.15 GiB object_store_memory: 2%|▏ | 1/65 [00:05<05:27, 5.12s/it]
Running: 2.3/5.0 CPU, 0.0/0.0 GPU, 16.46 MiB/1.15 GiB object_store_memory: 2%|▏ | 1/65 [00:07<05:27, 5.12s/it]
Running: 2.3/5.0 CPU, 0.0/0.0 GPU, 17.49 MiB/1.15 GiB object_store_memory: 2%|▏ | 1/65 [00:07<05:27, 5.12s/it]
Running: 2.3/5.0 CPU, 0.0/0.0 GPU, 7.9 MiB/1.15 GiB object_store_memory: 2%|▏ | 1/65 [00:07<05:27, 5.12s/it]
Running: 2.3/5.0 CPU, 0.0/0.0 GPU, 4.16 MiB/1.15 GiB object_store_memory: 2%|▏ | 1/65 [00:08<05:27, 5.12s/it]
Running: 2.3/5.0 CPU, 0.0/0.0 GPU, 4.16 MiB/1.15 GiB object_store_memory: 3%|▎ | 2/65 [00:08<04:15, 4.05s/it]
Running: 2.3/5.0 CPU, 0.0/0.0 GPU, 4.08 MiB/1.15 GiB object_store_memory: 3%|▎ | 2/65 [00:08<04:15, 4.05s/it]
Running: 2.3/5.0 CPU, 0.0/0.0 GPU, 16.47 MiB/1.15 GiB object_store_memory: 3%|▎ | 2/65 [00:10<04:15, 4.05s/it]
(log lines omitted)
Running: 1.3/5.0 CPU, 0.0/0.0 GPU, 16.62 MiB/1.15 GiB object_store_memory: 98%|█████████▊| 64/65 [03:28<00:03, 3.17s/it]
Running: 1.0/5.0 CPU, 0.0/0.0 GPU, 3.38 MiB/1.15 GiB object_store_memory: 98%|█████████▊| 64/65 [03:28<00:03, 3.17s/it]
Running: 1.0/5.0 CPU, 0.0/0.0 GPU, 3.46 MiB/1.15 GiB object_store_memory: 98%|█████████▊| 64/65 [03:29<00:03, 3.17s/it]
Running: 1.0/5.0 CPU, 0.0/0.0 GPU, 3.46 MiB/1.15 GiB object_store_memory: 100%|██████████| 65/65 [03:29<00:00, 3.26s/it]
Running: 0.0/5.0 CPU, 0.0/0.0 GPU, 0.0 MiB/1.15 GiB object_store_memory: 100%|██████████| 65/65 [03:29<00:00, 3.26s/it]
2024-07-06 17:44:00,933 WARNING plan.py:567 -- Warning: The Ray cluster currently does not have any available CPUs. The Dataset job will hang unless more CPUs are freed up. A common reason is that cluster resources are used by Actors or Tune trials; see the following link for more details: https://docs.ray.io/en/master/data/dataset-internals.html#datasets-and-tune
2024-07-06 17:44:02,479 INFO streaming_executor.py:93 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[ReadImage->Map(parse_filename)->MapBatches(<lambda>)]
2024-07-06 17:44:02,480 INFO streaming_executor.py:94 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2024-07-06 17:44:02,480 INFO streaming_executor.py:96 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`
Running 0: 0%| | 0/65 [00:00<?, ?it/s]
Running: 0.0/5.0 CPU, 0.0/0.0 GPU, 0.0 MiB/1.15 GiB object_store_memory: 0%| | 0/65 [00:00<?, ?it/s]
Running: 5.0/5.0 CPU, 0.0/0.0 GPU, 5.07 MiB/1.15 GiB object_store_memory: 0%| | 0/65 [00:00<?, ?it/s]
Running: 4.0/5.0 CPU, 0.0/0.0 GPU, 17.51 MiB/1.15 GiB object_store_memory: 0%| | 0/65 [00:00<?, ?it/s]
Running: 4.0/5.0 CPU, 0.0/0.0 GPU, 17.51 MiB/1.15 GiB object_store_memory: 2%|▏ | 1/65 [00:00<00:26, 2.39it/s]
Running: 5.0/5.0 CPU, 0.0/0.0 GPU, 5.13 MiB/1.15 GiB object_store_memory: 2%|▏ | 1/65 [00:00<00:26, 2.39it/s]
Running: 4.0/5.0 CPU, 0.0/0.0 GPU, 17.51 MiB/1.15 GiB object_store_memory: 2%|▏ | 1/65 [00:00<00:26, 2.39it/s]
Running: 4.0/5.0 CPU, 0.0/0.0 GPU, 17.51 MiB/1.15 GiB object_store_memory: 3%|▎ | 2/65 [00:00<00:25, 2.43it/s]
Running: 5.0/5.0 CPU, 0.0/0.0 GPU, 5.03 MiB/1.15 GiB object_store_memory: 3%|▎ | 2/65 [00:00<00:25, 2.43it/s]
Running: 4.0/5.0 CPU, 0.0/0.0 GPU, 17.51 MiB/1.15 GiB object_store_memory: 3%|▎ | 2/65 [00:01<00:25, 2.43it/s]
Running: 4.0/5.0 CPU, 0.0/0.0 GPU, 17.51 MiB/1.15 GiB object_store_memory: 5%|▍ | 3/65 [00:01<00:22, 2.76it/s]
Running: 5.0/5.0 CPU, 0.0/0.0 GPU, 5.12 MiB/1.15 GiB object_store_memory: 5%|▍ | 3/65 [00:01<00:22, 2.76it/s]
Running: 4.0/5.0 CPU, 0.0/0.0 GPU, 17.48 MiB/1.15 GiB object_store_memory: 5%|▍ | 3/65 [00:01<00:22, 2.76it/s]
Running: 4.0/5.0 CPU, 0.0/0.0 GPU, 17.48 MiB/1.15 GiB object_store_memory: 6%|▌ | 4/65 [00:01<00:32, 1.86it/s]
Running: 5.0/5.0 CPU, 0.0/0.0 GPU, 5.13 MiB/1.15 GiB object_store_memory: 6%|▌ | 4/65 [00:02<00:32, 1.86it/s]
Running: 4.0/5.0 CPU, 0.0/0.0 GPU, 17.48 MiB/1.15 GiB object_store_memory: 6%|▌ | 4/65 [00:02<00:32, 1.86it/s]
Running: 4.0/5.0 CPU, 0.0/0.0 GPU, 17.48 MiB/1.15 GiB object_store_memory: 8%|▊ | 5/65 [00:02<00:27, 2.19it/s]
Running: 5.0/5.0 CPU, 0.0/0.0 GPU, 5.1 MiB/1.15 GiB object_store_memory: 8%|▊ | 5/65 [00:02<00:27, 2.19it/s]
Running: 5.0/5.0 CPU, 0.0/0.0 GPU, 5.1 MiB/1.15 GiB object_store_memory: 100%|██████████| 5/5 [00:02<00:00, 2.19it/s]
Traceback (most recent call last):
File "python/ray/_raylet.pyx", line 345, in ray._raylet.StreamingObjectRefGenerator._next_sync
File "python/ray/_raylet.pyx", line 4533, in ray._raylet.CoreWorker.try_read_next_object_ref_stream
File "python/ray/_raylet.pyx", line 443, in ray._raylet.check_status
ray.exceptions.ObjectRefStreamEndOfStreamError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/execution/interfaces/physical_operator.py", line 80, in on_waitable_ready
meta = ray.get(next(self._streaming_gen))
File "python/ray/_raylet.pyx", line 300, in ray._raylet.StreamingObjectRefGenerator.__next__
File "python/ray/_raylet.pyx", line 351, in ray._raylet.StreamingObjectRefGenerator._next_sync
StopIteration
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ray/job_hf.py", line 153, in <module>
df = ds.to_pandas()
File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/dataset.py", line 4242, in to_pandas
count = self.count()
File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/dataset.py", line 2498, in count
[get_num_rows.remote(block) for block in self.get_internal_block_refs()]
File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/dataset.py", line 4799, in get_internal_block_refs
blocks = self._plan.execute().get_blocks()
File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/plan.py", line 591, in execute
blocks = execute_to_legacy_block_list(
File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/execution/legacy_compat.py", line 119, in execute_to_legacy_block_list
block_list = _bundles_to_block_list(bundles)
File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/execution/legacy_compat.py", line 357, in _bundles_to_block_list
for ref_bundle in bundles:
File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/execution/interfaces/executor.py", line 37, in __next__
return self.get_next()
File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/execution/streaming_executor.py", line 129, in get_next
raise item
File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/execution/streaming_executor.py", line 187, in run
while self._scheduling_loop_step(self._topology) and not self._shutdown:
File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/execution/streaming_executor.py", line 235, in _scheduling_loop_step
process_completed_tasks(topology)
File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 333, in process_completed_tasks
active_tasks[ref].on_waitable_ready()
File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/execution/interfaces/physical_operator.py", line 88, in on_waitable_ready
ex = ray.get(block_ref)
File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
return fn(*args, **kwargs)
File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/worker.py", line 2547, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(FileNotFoundError): ray::ReadImage->Map(parse_filename)->MapBatches(<lambda>)() (pid=233, ip=10.2.0.153)
File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/execution/operators/map_operator.py", line 405, in _map_task
for b_out in map_transformer.apply_transform(iter(blocks), ctx):
File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 122, in apply_transform
iter = transform_fn(iter, ctx)
File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 263, in __call__
first = next(block_iter, None)
File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 345, in __call__
for data in iter:
File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 171, in __call__
yield from self._row_fn(input, ctx)
File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/planner/plan_udf_map_op.py", line 232, in transform_fn
for row in rows:
File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 223, in __call__
for block in blocks:
File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 207, in __call__
yield from self._block_fn(input, ctx)
File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/planner/plan_read_op.py", line 122, in do_read
yield from read_task()
File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/datasource/datasource.py", line 225, in __call__
yield from result
File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/datasource/file_based_datasource.py", line 610, in read_task_fn
yield from make_async_gen(
File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/util.py", line 769, in make_async_gen
raise next_item
File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/util.py", line 746, in execute_computation
for item in fn(thread_safe_generator):
File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/datasource/file_based_datasource.py", line 589, in read_files
with _open_file_with_retry(
File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/datasource/file_based_datasource.py", line 1000, in _open_file_with_retry
raise e from None
File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/datasource/file_based_datasource.py", line 982, in _open_file_with_retry
return open_file()
File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/datasource/file_based_datasource.py", line 591, in <lambda>
lambda: open_input_source(fs, read_path, **open_stream_args),
File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/datasource/file_based_datasource.py", line 253, in _open_input_source
return filesystem.open_input_stream(path, buffer_size=buffer_size, **open_args)
File "pyarrow/_fs.pyx", line 812, in pyarrow._fs.FileSystem.open_input_stream
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 113, in pyarrow.lib.check_status
FileNotFoundError: [Errno 2] Failed to open local file 'dataset/lfw_multifaces-ingestion/Albert_Costa_0001.jpg'. Detail: [errno 2] No such file or directory
2024-07-06 17:44:06,427 ERR cli.py:68 -- -----------------------------
2024-07-06 17:44:06,427 ERR cli.py:69 -- Job 'rayjob-lfw-t4mxs' failed
2024-07-06 17:44:06,427 ERR cli.py:70 -- -----------------------------
2024-07-06 17:44:06,427 INFO cli.py:83 -- Status message: Job entrypoint command failed with exit code 1, last available logs (truncated to 20,000 chars):
File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/datasource/file_based_datasource.py", line 982, in _open_file_with_retry
return open_file()
File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/datasource/file_based_datasource.py", line 591, in <lambda>
lambda: open_input_source(fs, read_path, **open_stream_args),
File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/datasource/file_based_datasource.py", line 253, in _open_input_source
return filesystem.open_input_stream(path, buffer_size=buffer_size, **open_args)
File "pyarrow/_fs.pyx", line 812, in pyarrow._fs.FileSystem.open_input_stream
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 113, in pyarrow.lib.check_status
FileNotFoundError: [Errno 2] Failed to open local file 'dataset/lfw_multifaces-ingestion/Albert_Costa_0001.jpg'. Detail: [errno 2] No such file or directory
Could you please help me understand why this error is occurring and how to resolve it? Any guidance or suggestions would be greatly appreciated.
Thank you in advance for your assistance and support.
Hi Vinícius, can you run the job_hf.py script locally in Python (outside of Kubernetes)?
This looks like a hardware limitation error:
2024-07-06 17:44:00,933 WARNING plan.py:567 -- Warning: The Ray cluster currently does not have any available CPUs. The Dataset job will hang unless more CPUs are freed up. A common reason is that cluster resources are used by Actors or Tune trials; see the following link for more details: https://docs.ray.io/en/master/data/dataset-internals.html#datasets-and-tune
In the Kubernetes Job you can increase the hardware specifications that Ray is allowed to use; remember to also update the environment variables that limit Ray's use of the hardware.
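The warning quoted above can also be sanity-checked from inside the job script before the second pipeline starts. A minimal sketch, assuming dicts shaped like what `ray.available_resources()` returns; all numbers below are hypothetical, not read from this cluster:

```python
# Hedged sketch of a pre-flight check one could add to job_hf.py before
# launching the follow-up Dataset stage. In a real job the dict would come
# from ray.available_resources(); these values are made up for illustration.
def has_cpu_headroom(available: dict, needed: float = 1.0) -> bool:
    """True if the cluster still has at least `needed` CPUs free."""
    return available.get("CPU", 0.0) >= needed

# While the three actor pools hold every CPU, a new stage has nothing to
# schedule on -- the situation the plan.py warning describes:
print(has_cpu_headroom({"CPU": 0.0}))  # False

# With larger pod requests/limits (and matching Ray settings), it can run:
print(has_cpu_headroom({"CPU": 3.0}))  # True
```

Failing fast on such a check would at least surface the resource shortage directly instead of the job hanging or dying later in the pipeline.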