vilsonrodrigues / face-recognition

A scalable face recognition system

FileNotFoundError: [Errno 2] Failed to open local file 'dataset/lfw_multifaces-ingestion/Albert_Costa_0001.jpg'

vbsantos opened this issue · comments

Description

I am encountering a FileNotFoundError when trying to run the job_lfw job using Ray on a Kubernetes cluster. The error occurs after the dataset has been downloaded and a first processing pass has finished: during the second processing pass, Ray tries to open a local file that apparently does not exist. I am new to the Python and Kubernetes ecosystem, so I apologize if this is a basic error.

Details

  • The collection in Qdrant was not created.
  • The dataset downloads without issues.
  • This error occurs when running the command kubectl apply -f kubernetes/job_lfw.yaml.
  • The Kubernetes cluster on Linode was configured with 3 dedicated nodes, each having 32GB of RAM and 8 CPU cores.
  • When the first job encounters this error, all subsequent jobs fail to run as they are unable to communicate with Ray.
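
If it helps with debugging, below is a small diagnostic sketch I put together (my own code, not something from this repository) to check whether the unzipped dataset directory is actually visible from every node in the Ray cluster. I am assuming the relative path from the error message is the one the job reads from:

import os
import socket

import ray
from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

# Relative path copied from the error message (assumption: this is what the job uses).
DATASET_DIR = "dataset/lfw_multifaces-ingestion"

ray.init(address="auto")  # attach to the existing Ray cluster

@ray.remote(num_cpus=0)
def check_path(path: str):
    # Runs on the node it is pinned to and reports whether the path exists there.
    return socket.gethostname(), os.path.isdir(path)

refs = []
for node in ray.nodes():
    if not node["Alive"]:
        continue
    pin = NodeAffinitySchedulingStrategy(node_id=node["NodeID"], soft=False)
    refs.append(check_path.options(scheduling_strategy=pin).remote(DATASET_DIR))

for host, found in ray.get(refs):
    print(f"{host}: dataset directory {'found' if found else 'MISSING'}")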

Error Details

The complete error is as follows:

2024-07-06 17:40:19,695	INFO cli.py:36 -- Job submission server address: http://rayjob-lfw-raycluster-2cv5x-head-svc.default.svc.cluster.local:8265
2024-07-06 17:40:20,438	SUCC cli.py:60 -- ---------------------------------------------
2024-07-06 17:40:20,438	SUCC cli.py:61 -- Job 'rayjob-lfw-t4mxs' submitted successfully
2024-07-06 17:40:20,438	SUCC cli.py:62 -- ---------------------------------------------
2024-07-06 17:40:20,438	INFO cli.py:274 -- Next steps
2024-07-06 17:40:20,438	INFO cli.py:275 -- Query the logs of the job:
2024-07-06 17:40:20,438	INFO cli.py:277 -- ray job logs rayjob-lfw-t4mxs
2024-07-06 17:40:20,438	INFO cli.py:279 -- Query the status of the job:
2024-07-06 17:40:20,438	INFO cli.py:281 -- ray job status rayjob-lfw-t4mxs
2024-07-06 17:40:20,438	INFO cli.py:283 -- Request the job to be stopped:
2024-07-06 17:40:20,439	INFO cli.py:285 -- ray job stop rayjob-lfw-t4mxs
2024-07-06 17:40:20,444	INFO cli.py:292 -- Tailing logs until the job exits (disable with --no-wait):
Downlaod dataset vilsonrodrigues/lfw/lfw_multifaces-ingestion.zip

lfw_multifaces-ingestion.zip:   0%|          | 0.00/69.1M [00:00<?, ?B/s]
lfw_multifaces-ingestion.zip:  15%|█▌        | 10.5M/69.1M [00:00<00:00, 89.1MB/s]
lfw_multifaces-ingestion.zip:  46%|████▌     | 31.5M/69.1M [00:00<00:00, 138MB/s] 
lfw_multifaces-ingestion.zip:  76%|███████▌  | 52.4M/69.1M [00:00<00:00, 158MB/s]
lfw_multifaces-ingestion.zip: 100%|██████████| 69.1M/69.1M [00:00<00:00, 160MB/s]
lfw_multifaces-ingestion.zip: 100%|██████████| 69.1M/69.1M [00:00<00:00, 149MB/s]
Unzip dataset
Load images with Ray Data
2024-07-06 17:40:24,611	INFO worker.py:1329 -- Using address 10.2.1.35:6379 set in the environment variable RAY_ADDRESS
2024-07-06 17:40:24,611	INFO worker.py:1458 -- Connecting to existing Ray cluster at address: 10.2.1.35:6379...
2024-07-06 17:40:24,618	INFO worker.py:1633 -- Connected to Ray cluster. View the dashboard at 10.2.1.35:8265
Start map batch processing
Batch map process finish
2024-07-06 17:40:25,241	INFO streaming_executor.py:93 -- Executing DAG InputDataBuffer[Input] -> ActorPoolMapOperator[ReadImage->Map(parse_filename)->MapBatches(<lambda>)->MapBatches(UltraLightORTBatchPredictor)] -> ActorPoolMapOperator[MapBatches(BatchFacePostProcessing)] -> ActorPoolMapOperator[MapBatches(MobileFaceNetORTBatchPredictor)]
2024-07-06 17:40:25,242	INFO streaming_executor.py:94 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2024-07-06 17:40:25,242	INFO streaming_executor.py:96 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`
2024-07-06 17:40:25,596	INFO actor_pool_map_operator.py:106 -- ReadImage->Map(parse_filename)->MapBatches(<lambda>)->MapBatches(UltraLightORTBatchPredictor): Waiting for 1 pool actors to start...
2024-07-06 17:40:28,497	INFO actor_pool_map_operator.py:106 -- MapBatches(BatchFacePostProcessing): Waiting for 1 pool actors to start...
2024-07-06 17:40:29,404	INFO actor_pool_map_operator.py:106 -- MapBatches(MobileFaceNetORTBatchPredictor): Waiting for 1 pool actors to start...

Running 0:   0%|          | 0/65 [00:00<?, ?it/s]
Running: 2.3/5.0 CPU, 0.0/0.0 GPU, 0.0 MiB/1.15 GiB object_store_memory:   0%|          | 0/65 [00:00<?, ?it/s]
Running: 2.3/5.0 CPU, 0.0/0.0 GPU, 4.1 MiB/1.15 GiB object_store_memory:   0%|          | 0/65 [00:00<?, ?it/s]
Running: 2.3/5.0 CPU, 0.0/0.0 GPU, 16.53 MiB/1.15 GiB object_store_memory:   0%|          | 0/65 [00:04<?, ?it/s]
Running: 2.3/5.0 CPU, 0.0/0.0 GPU, 17.5 MiB/1.15 GiB object_store_memory:   0%|          | 0/65 [00:04<?, ?it/s] 
Running: 2.3/5.0 CPU, 0.0/0.0 GPU, 7.73 MiB/1.15 GiB object_store_memory:   0%|          | 0/65 [00:04<?, ?it/s]
Running: 2.3/5.0 CPU, 0.0/0.0 GPU, 4.17 MiB/1.15 GiB object_store_memory:   0%|          | 0/65 [00:05<?, ?it/s]
Running: 2.3/5.0 CPU, 0.0/0.0 GPU, 4.17 MiB/1.15 GiB object_store_memory:   2%|▏         | 1/65 [00:05<05:27,  5.12s/it]
Running: 2.3/5.0 CPU, 0.0/0.0 GPU, 4.09 MiB/1.15 GiB object_store_memory:   2%|▏         | 1/65 [00:05<05:27,  5.12s/it]
Running: 2.3/5.0 CPU, 0.0/0.0 GPU, 16.46 MiB/1.15 GiB object_store_memory:   2%|▏         | 1/65 [00:07<05:27,  5.12s/it]
Running: 2.3/5.0 CPU, 0.0/0.0 GPU, 17.49 MiB/1.15 GiB object_store_memory:   2%|▏         | 1/65 [00:07<05:27,  5.12s/it]
Running: 2.3/5.0 CPU, 0.0/0.0 GPU, 7.9 MiB/1.15 GiB object_store_memory:   2%|▏         | 1/65 [00:07<05:27,  5.12s/it]  
Running: 2.3/5.0 CPU, 0.0/0.0 GPU, 4.16 MiB/1.15 GiB object_store_memory:   2%|▏         | 1/65 [00:08<05:27,  5.12s/it]
Running: 2.3/5.0 CPU, 0.0/0.0 GPU, 4.16 MiB/1.15 GiB object_store_memory:   3%|▎         | 2/65 [00:08<04:15,  4.05s/it]
Running: 2.3/5.0 CPU, 0.0/0.0 GPU, 4.08 MiB/1.15 GiB object_store_memory:   3%|▎         | 2/65 [00:08<04:15,  4.05s/it]
Running: 2.3/5.0 CPU, 0.0/0.0 GPU, 16.47 MiB/1.15 GiB object_store_memory:   3%|▎         | 2/65 [00:10<04:15,  4.05s/it]

(omitted)

Running: 1.3/5.0 CPU, 0.0/0.0 GPU, 16.62 MiB/1.15 GiB object_store_memory:  98%|█████████▊| 64/65 [03:28<00:03,  3.17s/it]
Running: 1.0/5.0 CPU, 0.0/0.0 GPU, 3.38 MiB/1.15 GiB object_store_memory:  98%|█████████▊| 64/65 [03:28<00:03,  3.17s/it] 
Running: 1.0/5.0 CPU, 0.0/0.0 GPU, 3.46 MiB/1.15 GiB object_store_memory:  98%|█████████▊| 64/65 [03:29<00:03,  3.17s/it]
Running: 1.0/5.0 CPU, 0.0/0.0 GPU, 3.46 MiB/1.15 GiB object_store_memory: 100%|██████████| 65/65 [03:29<00:00,  3.26s/it]
Running: 0.0/5.0 CPU, 0.0/0.0 GPU, 0.0 MiB/1.15 GiB object_store_memory: 100%|██████████| 65/65 [03:29<00:00,  3.26s/it] 
                                                                                                                        
2024-07-06 17:44:00,933	WARNING plan.py:567 -- Warning: The Ray cluster currently does not have any available CPUs. The Dataset job will hang unless more CPUs are freed up. A common reason is that cluster resources are used by Actors or Tune trials; see the following link for more details: https://docs.ray.io/en/master/data/dataset-internals.html#datasets-and-tune
2024-07-06 17:44:02,479	INFO streaming_executor.py:93 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[ReadImage->Map(parse_filename)->MapBatches(<lambda>)]
2024-07-06 17:44:02,480	INFO streaming_executor.py:94 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2024-07-06 17:44:02,480	INFO streaming_executor.py:96 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`

Running 0:   0%|          | 0/65 [00:00<?, ?it/s]
Running: 0.0/5.0 CPU, 0.0/0.0 GPU, 0.0 MiB/1.15 GiB object_store_memory:   0%|          | 0/65 [00:00<?, ?it/s]
Running: 5.0/5.0 CPU, 0.0/0.0 GPU, 5.07 MiB/1.15 GiB object_store_memory:   0%|          | 0/65 [00:00<?, ?it/s]
Running: 4.0/5.0 CPU, 0.0/0.0 GPU, 17.51 MiB/1.15 GiB object_store_memory:   0%|          | 0/65 [00:00<?, ?it/s]
Running: 4.0/5.0 CPU, 0.0/0.0 GPU, 17.51 MiB/1.15 GiB object_store_memory:   2%|▏         | 1/65 [00:00<00:26,  2.39it/s]
Running: 5.0/5.0 CPU, 0.0/0.0 GPU, 5.13 MiB/1.15 GiB object_store_memory:   2%|▏         | 1/65 [00:00<00:26,  2.39it/s] 
Running: 4.0/5.0 CPU, 0.0/0.0 GPU, 17.51 MiB/1.15 GiB object_store_memory:   2%|▏         | 1/65 [00:00<00:26,  2.39it/s]
Running: 4.0/5.0 CPU, 0.0/0.0 GPU, 17.51 MiB/1.15 GiB object_store_memory:   3%|▎         | 2/65 [00:00<00:25,  2.43it/s]
Running: 5.0/5.0 CPU, 0.0/0.0 GPU, 5.03 MiB/1.15 GiB object_store_memory:   3%|▎         | 2/65 [00:00<00:25,  2.43it/s] 
Running: 4.0/5.0 CPU, 0.0/0.0 GPU, 17.51 MiB/1.15 GiB object_store_memory:   3%|▎         | 2/65 [00:01<00:25,  2.43it/s]
Running: 4.0/5.0 CPU, 0.0/0.0 GPU, 17.51 MiB/1.15 GiB object_store_memory:   5%|▍         | 3/65 [00:01<00:22,  2.76it/s]
Running: 5.0/5.0 CPU, 0.0/0.0 GPU, 5.12 MiB/1.15 GiB object_store_memory:   5%|▍         | 3/65 [00:01<00:22,  2.76it/s] 
Running: 4.0/5.0 CPU, 0.0/0.0 GPU, 17.48 MiB/1.15 GiB object_store_memory:   5%|▍         | 3/65 [00:01<00:22,  2.76it/s]
Running: 4.0/5.0 CPU, 0.0/0.0 GPU, 17.48 MiB/1.15 GiB object_store_memory:   6%|▌         | 4/65 [00:01<00:32,  1.86it/s]
Running: 5.0/5.0 CPU, 0.0/0.0 GPU, 5.13 MiB/1.15 GiB object_store_memory:   6%|▌         | 4/65 [00:02<00:32,  1.86it/s] 
Running: 4.0/5.0 CPU, 0.0/0.0 GPU, 17.48 MiB/1.15 GiB object_store_memory:   6%|▌         | 4/65 [00:02<00:32,  1.86it/s]
Running: 4.0/5.0 CPU, 0.0/0.0 GPU, 17.48 MiB/1.15 GiB object_store_memory:   8%|▊         | 5/65 [00:02<00:27,  2.19it/s]
Running: 5.0/5.0 CPU, 0.0/0.0 GPU, 5.1 MiB/1.15 GiB object_store_memory:   8%|▊         | 5/65 [00:02<00:27,  2.19it/s]  
Running: 5.0/5.0 CPU, 0.0/0.0 GPU, 5.1 MiB/1.15 GiB object_store_memory: 100%|██████████| 5/5 [00:02<00:00,  2.19it/s] 
                                                                                                                      
Traceback (most recent call last):
  File "python/ray/_raylet.pyx", line 345, in ray._raylet.StreamingObjectRefGenerator._next_sync
  File "python/ray/_raylet.pyx", line 4533, in ray._raylet.CoreWorker.try_read_next_object_ref_stream
  File "python/ray/_raylet.pyx", line 443, in ray._raylet.check_status
ray.exceptions.ObjectRefStreamEndOfStreamError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/execution/interfaces/physical_operator.py", line 80, in on_waitable_ready
    meta = ray.get(next(self._streaming_gen))
  File "python/ray/_raylet.pyx", line 300, in ray._raylet.StreamingObjectRefGenerator.__next__
  File "python/ray/_raylet.pyx", line 351, in ray._raylet.StreamingObjectRefGenerator._next_sync
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ray/job_hf.py", line 153, in <module>
    df = ds.to_pandas()
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/dataset.py", line 4242, in to_pandas
    count = self.count()
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/dataset.py", line 2498, in count
    [get_num_rows.remote(block) for block in self.get_internal_block_refs()]
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/dataset.py", line 4799, in get_internal_block_refs
    blocks = self._plan.execute().get_blocks()
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/plan.py", line 591, in execute
    blocks = execute_to_legacy_block_list(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/execution/legacy_compat.py", line 119, in execute_to_legacy_block_list
    block_list = _bundles_to_block_list(bundles)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/execution/legacy_compat.py", line 357, in _bundles_to_block_list
    for ref_bundle in bundles:
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/execution/interfaces/executor.py", line 37, in __next__
    return self.get_next()
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/execution/streaming_executor.py", line 129, in get_next
    raise item
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/execution/streaming_executor.py", line 187, in run
    while self._scheduling_loop_step(self._topology) and not self._shutdown:
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/execution/streaming_executor.py", line 235, in _scheduling_loop_step
    process_completed_tasks(topology)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 333, in process_completed_tasks
    active_tasks[ref].on_waitable_ready()
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/execution/interfaces/physical_operator.py", line 88, in on_waitable_ready
    ex = ray.get(block_ref)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/worker.py", line 2547, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(FileNotFoundError): ray::ReadImage->Map(parse_filename)->MapBatches(<lambda>)() (pid=233, ip=10.2.0.153)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/execution/operators/map_operator.py", line 405, in _map_task
    for b_out in map_transformer.apply_transform(iter(blocks), ctx):
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 122, in apply_transform
    iter = transform_fn(iter, ctx)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 263, in __call__
    first = next(block_iter, None)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 345, in __call__
    for data in iter:
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 171, in __call__
    yield from self._row_fn(input, ctx)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/planner/plan_udf_map_op.py", line 232, in transform_fn
    for row in rows:
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 223, in __call__
    for block in blocks:
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 207, in __call__
    yield from self._block_fn(input, ctx)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/planner/plan_read_op.py", line 122, in do_read
    yield from read_task()
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/datasource/datasource.py", line 225, in __call__
    yield from result
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/datasource/file_based_datasource.py", line 610, in read_task_fn
    yield from make_async_gen(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/util.py", line 769, in make_async_gen
    raise next_item
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/_internal/util.py", line 746, in execute_computation
    for item in fn(thread_safe_generator):
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/datasource/file_based_datasource.py", line 589, in read_files
    with _open_file_with_retry(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/datasource/file_based_datasource.py", line 1000, in _open_file_with_retry
    raise e from None
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/datasource/file_based_datasource.py", line 982, in _open_file_with_retry
    return open_file()
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/datasource/file_based_datasource.py", line 591, in <lambda>
    lambda: open_input_source(fs, read_path, **open_stream_args),
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/datasource/file_based_datasource.py", line 253, in _open_input_source
    return filesystem.open_input_stream(path, buffer_size=buffer_size, **open_args)
  File "pyarrow/_fs.pyx", line 812, in pyarrow._fs.FileSystem.open_input_stream
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 113, in pyarrow.lib.check_status
FileNotFoundError: [Errno 2] Failed to open local file 'dataset/lfw_multifaces-ingestion/Albert_Costa_0001.jpg'. Detail: [errno 2] No such file or directory
2024-07-06 17:44:06,427	ERR cli.py:68 -- -----------------------------
2024-07-06 17:44:06,427	ERR cli.py:69 -- Job 'rayjob-lfw-t4mxs' failed
2024-07-06 17:44:06,427	ERR cli.py:70 -- -----------------------------
2024-07-06 17:44:06,427	INFO cli.py:83 -- Status message: Job entrypoint command failed with exit code 1, last available logs (truncated to 20,000 chars):
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/datasource/file_based_datasource.py", line 982, in _open_file_with_retry
    return open_file()
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/datasource/file_based_datasource.py", line 591, in <lambda>
    lambda: open_input_source(fs, read_path, **open_stream_args),
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/data/datasource/file_based_datasource.py", line 253, in _open_input_source
    return filesystem.open_input_stream(path, buffer_size=buffer_size, **open_args)
  File "pyarrow/_fs.pyx", line 812, in pyarrow._fs.FileSystem.open_input_stream
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 113, in pyarrow.lib.check_status
FileNotFoundError: [Errno 2] Failed to open local file 'dataset/lfw_multifaces-ingestion/Albert_Costa_0001.jpg'. Detail: [errno 2] No such file or directory

Could you please help me understand why this error is occurring and how to resolve it? Any guidance or suggestions would be greatly appreciated.

Thank you in advance for your assistance and support.

Hi Vinícius, can you run the job_hf.py script locally with Python (outside of Kubernetes)?

It appears to be a hardware limitation error:

2024-07-06 17:44:00,933 WARNING plan.py:567 -- Warning: The Ray cluster currently does not have any available CPUs. The Dataset job will hang unless more CPUs are freed up. A common reason is that cluster resources are used by Actors or Tune trials; see the following link for more details: https://docs.ray.io/en/master/data/dataset-internals.html#datasets-and-tune

In the Kubernetes Job you can increase the hardware specifications that Ray is allowed to use; remember to also increase the environment variables that limit Ray's use of the hardware.
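
As a quick sanity check (a minimal sketch, assuming you can attach to the running cluster from the head pod or any pod with RAY_ADDRESS set), you can print how much CPU Ray actually registers; the progress bars in your log show only 5.0 CPU in total, even though the three nodes have 8 cores each:

import ray

ray.init(address="auto")  # attach to the running Ray cluster

# Total resources the cluster has registered vs. what is currently free.
print("cluster_resources:  ", ray.cluster_resources())
print("available_resources:", ray.available_resources())

If the CPU count reported here is far below the 3 x 8 cores of the Linode nodes, the limit is coming from the RayJob/RayCluster spec or from those environment variables, not from the machines themselves.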