huggingface / distil-whisper

Distilled variant of Whisper for speech recognition. 6x faster, 50% smaller, within 1% word error rate.

[Issue] latest run_pseudo_labelling.py

ckcraig01 opened this issue · comments

Dear Author,

Thanks for your great work. Pseudo-labelling worked fine with the previous implementation (from about a month ago), but after updating the codebase to the latest main branch and following the README, I ran into the issue below:

  1. My command line:

accelerate launch run_pseudo_labelling.py \
--model_name_or_path "openai/whisper-medium" \
--dataset_name "mozilla-foundation/common_voice_16_1" \
--dataset_config_name "zh-TW" \
--dataset_split_name "train+validation+test" \
--text_column_name "sentence" \
--id_column_name "path" \
--output_dir "./common_voice_16_1_zh_tw_pseudo_labelled" \
--wandb_project "distil-whisper-labelling" \
--per_device_eval_batch_size 8 \
--dtype "bfloat16" \
--attn_implementation "sdpa" \
--logging_steps 500 \
--max_label_length 256 \
--concatenate_audio \
--preprocessing_batch_size 500 \
--preprocessing_num_workers 8 \
--dataloader_num_workers 8 \
--language "zh" \
--task "transcribe" \
--return_timestamps \
--streaming False \
--generation_num_beams 1

  2. Error message:
04/01/2024 09:12:23 - INFO - __main__ - ***** Running Labelling *****
04/01/2024 09:12:23 - INFO - __main__ -   Instantaneous batch size per device = 8
04/01/2024 09:12:23 - INFO - __main__ -   Total eval batch size (w. parallel & distributed) = 16
04/01/2024 09:12:23 - INFO - __main__ -   Predict labels with timestamps = True
Evaluating train...:   0%|                                                                                                                                           | 0/52 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "/myenv/distil_whisper/lib/python3.11/site-packages/accelerate/utils/operations.py", line 155, in send_to_device
    return tensor.to(device, non_blocking=non_blocking)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/myenv/distil_whisper/lib/python3.11/site-packages/transformers/feature_extraction_utils.py", line 229, in to
    if torch.is_floating_point(v):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: is_floating_point(): argument 'input' (position 1) must be Tensor, not list

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/alghome/craig.hsin/framework/distil-whisper/training/run_pseudo_labelling.py", line 1027, in <module>
    main()
  File "/alghome/craig.hsin/framework/distil-whisper/training/run_pseudo_labelling.py", line 1012, in main
    eval_step_with_save(split=split)
  File "/alghome/craig.hsin/framework/distil-whisper/training/run_pseudo_labelling.py", line 900, in eval_step_with_save
    for step, batch in enumerate(batches):
  File "/myenv/distil_whisper/lib/python3.11/site-packages/tqdm/std.py", line 1169, in __iter__
    for obj in iterable:
  File "/myenv/distil_whisper/lib/python3.11/site-packages/accelerate/data_loader.py", line 461, in __iter__
    current_batch = send_to_device(current_batch, self.device)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/myenv/distil_whisper/lib/python3.11/site-packages/accelerate/utils/operations.py", line 157, in send_to_device
    return tensor.to(device)
           ^^^^^^^^^^^^^^^^^
  File "/myenv/distil_whisper/lib/python3.11/site-packages/transformers/feature_extraction_utils.py", line 229, in to
    if torch.is_floating_point(v):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: is_floating_point(): argument 'input' (position 1) must be Tensor, not list
Traceback (most recent call last):
  File "/myenv/distil_whisper/lib/python3.11/site-packages/accelerate/utils/operations.py", line 155, in send_to_device
    return tensor.to(device, non_blocking=non_blocking)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/myenv/distil_whisper/lib/python3.11/site-packages/transformers/feature_extraction_utils.py", line 229, in to
    if torch.is_floating_point(v):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: is_floating_point(): argument 'input' (position 1) must be Tensor, not list

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/alghome/craig.hsin/framework/distil-whisper/training/run_pseudo_labelling.py", line 1027, in <module>
    main()
  File "/alghome/craig.hsin/framework/distil-whisper/training/run_pseudo_labelling.py", line 1012, in main
    eval_step_with_save(split=split)
  File "/alghome/craig.hsin/framework/distil-whisper/training/run_pseudo_labelling.py", line 900, in eval_step_with_save
    for step, batch in enumerate(batches):
  File "/myenv/distil_whisper/lib/python3.11/site-packages/tqdm/std.py", line 1181, in __iter__
    for obj in iterable:
  File "/myenv/distil_whisper/lib/python3.11/site-packages/accelerate/data_loader.py", line 461, in __iter__
    current_batch = send_to_device(current_batch, self.device)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/myenv/distil_whisper/lib/python3.11/site-packages/accelerate/utils/operations.py", line 157, in send_to_device
    return tensor.to(device)
           ^^^^^^^^^^^^^^^^^
  File "/myenv/distil_whisper/lib/python3.11/site-packages/transformers/feature_extraction_utils.py", line 229, in to
    if torch.is_floating_point(v):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: is_floating_point(): argument 'input' (position 1) must be Tensor, not list
Exception in thread Thread-3 (_pin_memory_loop):
Traceback (most recent call last):
  File "/myenv/distil_whisper/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "/myenv/distil_whisper/lib/python3.11/threading.py", line 982, in run
    self._target(*self._args, **self._kwargs)
  File "/myenv/distil_whisper/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 54, in _pin_memory_loop
    do_one_step()
  File "/myenv/distil_whisper/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 31, in do_one_step
    r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/myenv/distil_whisper/lib/python3.11/multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/myenv/distil_whisper/lib/python3.11/site-packages/torch/multiprocessing/reductions.py", line 355, in rebuild_storage_fd
    fd = df.detach()
         ^^^^^^^^^^^
  File "/myenv/distil_whisper/lib/python3.11/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
         ^^^^^^^^^^^^^^^^^^^
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /alghome/craig.hsin/framework/distil-whisper/training/wandb/offline-run-20240401_091200-0oe1zyh2
wandb: Find logs at: ./wandb/offline-run-20240401_091200-0oe1zyh2/logs
[2024-04-01 09:12:31,572] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 2703427) of binary: /myenv/distil_whisper/bin/python
Traceback (most recent call last):
  File "/myenv/distil_whisper/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/myenv/distil_whisper/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/myenv/distil_whisper/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1048, in launch_command
    multi_gpu_launcher(args)
  File "/myenv/distil_whisper/lib/python3.11/site-packages/accelerate/commands/launch.py", line 702, in multi_gpu_launcher
    distrib_run.run(args)
  File "/myenv/distil_whisper/lib/python3.11/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/myenv/distil_whisper/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/myenv/distil_whisper/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
run_pseudo_labelling.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-04-01_09:12:31
  host      : alg4
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 2703428)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-04-01_09:12:31
  host      : alg4
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2703427)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

  3. Some of my environment info:

    Name          Version       Build        Channel
    python        3.11.8        h955ad1f_0
    torch         2.1.1+cu118   pypi_0       pypi
    transformers  4.39.1        pypi_0       pypi
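
In case it helps narrow things down, here is a minimal sketch of what I suspect send_to_device is hitting, assuming one of the collated columns is still a plain Python list rather than a tensor (the "labels" key below is only an illustration; I have not confirmed which column it actually is):

import torch
from transformers.feature_extraction_utils import BatchFeature

# Sketch of the suspected failure mode: BatchFeature.to() calls
# torch.is_floating_point() on every value in the batch, and that function
# only accepts tensors, so a list-valued entry raises the TypeError above.
batch = BatchFeature({
    "input_features": torch.zeros(1, 80, 3000),  # a tensor: handled fine
    "labels": [[50258, 50260, 50359]],           # hypothetical list-valued entry
})

batch.to("cpu")  # TypeError: is_floating_point(): argument 'input' (position 1) must be Tensor, not list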

Could you provide some suggestions on how I could proceed with the investigation? Thanks.