Fine-tuning the model on Kaggle fails with `ValueError: Type fp16 is not supported.`

Question

mohamed-em2m opened this issue a month ago

mohamed-em2m commented a month ago

Is there an existing issue / discussion for this?

I have searched the existing issues / discussions

Is there an existing answer for this in FAQ?

I have searched FAQ

Current Behavior

max_steps is given, it will override any value given in num_train_epochs
Traceback (most recent call last):
File "/kaggle/working/MiniCPM-V/finetune/finetune.py", line 328, in <module>
train()
File "/kaggle/working/MiniCPM-V/finetune/finetune.py", line 318, in train
trainer.train()
File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1885, in train
return inner_training_loop(
File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2045, in _inner_training_loop
model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1291, in prepare
result = self._prepare_deepspeed(*args)
File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1758, in _prepare_deepspeed
engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
File "/opt/conda/lib/python3.10/site-packages/deepspeed/__init__.py", line 181, in initialize
engine = DeepSpeedEngine(args=args,
File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 240, in __init__
self._do_sanity_check()
File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1032, in _do_sanity_check
raise ValueError("Type fp16 is not supported.")
ValueError: Type fp16 is not supported.
[2024-06-19 00:17:44,932] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 244) of binary: /opt/conda/bin/python3.10
Traceback (most recent call last):
File "/opt/conda/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

finetune.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2024-06-19_00:17:44
host : d90c1cf96f39
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 244)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
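
The traceback shows the error being raised inside DeepSpeed's `_do_sanity_check` while the engine is initialized. One possible cause, which nothing in this report confirms, is that the device the job runs on is not treated as fp16-capable, for example because no CUDA GPU is visible to the process or the selected GPU is an older model. The sketch below only prints what PyTorch can see so the output can be attached to the issue. `describe_device` is a hypothetical helper name and is not part of MiniCPM-V, Accelerate, or DeepSpeed.

```python
import importlib.util


def describe_device():
    """Return a small dict describing the accelerator PyTorch can see, if any."""
    info = {
        "torch_installed": False,
        "cuda_available": False,
        "device_name": None,
        "compute_capability": None,
    }
    # Avoid an ImportError when PyTorch is missing from the environment.
    if importlib.util.find_spec("torch") is None:
        return info

    import torch

    info["torch_installed"] = True
    if torch.cuda.is_available():
        info["cuda_available"] = True
        info["device_name"] = torch.cuda.get_device_name(0)
        # (major, minor) tuple reported by the CUDA driver for device 0.
        info["compute_capability"] = torch.cuda.get_device_capability(0)
    return info


print(describe_device())
```

If `cuda_available` is `False`, checking that a GPU accelerator is selected in the Kaggle notebook settings would be a reasonable first step. Otherwise the device name and compute capability are useful details to include in the report.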

Expected Behavior

No response

Steps To Reproduce

No response

Environment

- OS:
- Python:
- Transformers:
- PyTorch:
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`):
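
The environment fields above were left blank. As a minimal sketch, the values could be gathered with a short script like the one below. `collect_environment` is a hypothetical helper name, and including DeepSpeed and Accelerate is an assumption based on the packages that appear in the traceback, not a requirement of the issue template.

```python
import platform
import sys
from importlib import metadata


def collect_environment():
    """Collect OS, Python, and package versions for an issue report."""

    def version_of(package):
        # Report missing packages instead of raising.
        try:
            return metadata.version(package)
        except metadata.PackageNotFoundError:
            return "not installed"

    return {
        "OS": platform.platform(),
        "Python": sys.version.split()[0],
        "Transformers": version_of("transformers"),
        "PyTorch": version_of("torch"),
        "DeepSpeed": version_of("deepspeed"),
        "Accelerate": version_of("accelerate"),
    }


for key, value in collect_environment().items():
    print(f"- {key}: {value}")
```

The CUDA version can still be obtained with the `python -c` command listed in the template.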

Anything else?

No response
