OpenGVLab / LLaMA-Adapter

[ICLR 2024] Fine-tuning LLaMA to follow Instructions within 1 Hour and 1.2M Parameters

error when pretraining llama_adapter_v2_multimodal

adda1221 opened this issue

[08:15:21.504933] read dataset config from configs/data/pretrain/EN.yaml
[08:15:21.513275] DATASET CONFIG:
[08:15:21.513295] {'META': ['/HOME/llama-adapter/datasets/cc3m.csv']}
[08:18:21.093524] /HOME/llama-adapter/datasets/cc3m.csv: len 3318333
[08:18:22.476513] total length: 3318333
[08:18:23.899807] <data.dataset.PretrainDataset object at 0x7f16d0076790>
[08:18:23.899933] Sampler_train = <util.misc.DistributedSubEpochSampler object at 0x7f16d00760d0>
[08:18:24.745975] Start training for 400 epochs
[08:18:24.753625] log_dir: ./output
Traceback (most recent call last):
File "main_pretrain.py", line 202, in
main(args)
File "main_pretrain.py", line 171, in main
train_stats = train_one_epoch(
File "/HOME/llama-adapter/llama_adapter_v2_multimodal/engine_pretrain.py", line 31, in train_one_epoch
for data_iter_step, (examples, labels, example_mask, imgs) in enumerate(metric_logger.log_every(data_loader, print_freq, header)):
File "/HOME/llama-adapter/llama_adapter_v2_multimodal/util/misc.py", line 149, in log_every
for obj in iterable:
File "/root/miniconda3/envs/llama_adapter_v2/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 441, in iter
return self._get_iterator()
File "/root/miniconda3/envs/llama_adapter_v2/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 388, in _get_iterator
return _MultiProcessingDataLoaderIter(self)
File "/root/miniconda3/envs/llama_adapter_v2/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1084, in init
self._reset(loader, first_iter=True)
File "/root/miniconda3/envs/llama_adapter_v2/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1117, in _reset
self._try_put_index()
File "/root/miniconda3/envs/llama_adapter_v2/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1351, in _try_put_index
index = self._next_index()
File "/root/miniconda3/envs/llama_adapter_v2/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 623, in _next_index
return next(self._sampler_iter) # may raise StopIteration
File "/root/miniconda3/envs/llama_adapter_v2/lib/python3.8/site-packages/torch/utils/data/sampler.py", line 244, in iter
sampler_iter = iter(self.sampler)
File "/HOME/llama-adapter/llama_adapter_v2_multimodal/util/misc.py", line 380, in iter
g.manual_seed(self.seed + self.epoch // self.split_epoch)
AttributeError: 'DistributedSubEpochSampler' object has no attribute 'epoch'

How can I solve this?

Hi, it seems that your experiment was not launched in distributed mode. Specifically, the epoch attribute is expected to be set here:

data_loader_train.sampler.set_epoch(epoch)
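This call only happens when distributed training is active; without it the sampler never gets an epoch attribute. To make the failure mode concrete, here is a small, self-contained sketch using a hypothetical EpochSeededSampler (made up for illustration, not the repo's actual DistributedSubEpochSampler, but it fails in the same way when set_epoch is never called):

import torch

# Hypothetical sampler whose __iter__ depends on an epoch attribute that only
# exists after set_epoch() has been called, mirroring the traceback above.
class EpochSeededSampler(torch.utils.data.Sampler):
    def __init__(self, data_source, seed=0):
        self.data_source = data_source
        self.seed = seed

    def set_epoch(self, epoch):
        # Normally called once per epoch by the training loop.
        self.epoch = epoch

    def __iter__(self):
        g = torch.Generator()
        # Raises AttributeError: ... has no attribute 'epoch' if set_epoch was skipped.
        g.manual_seed(self.seed + self.epoch)
        return iter(torch.randperm(len(self.data_source), generator=g).tolist())

    def __len__(self):
        return len(self.data_source)

sampler = EpochSeededSampler(range(10))
sampler.set_epoch(0)  # comment this line out to reproduce the AttributeError
print(list(sampler))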

However, args.distributed appears to be False in your run. I would guess that you launched the script with something like python main_pretrain.py ..., which does not initialize the distributed environment, so set_epoch is never reached. Try launching with torchrun or another distributed launcher instead. Here is a tutorial.
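For example (illustrative only, since the exact flags depend on your hardware and on the arguments documented in the repo's README), a single-node launch would look roughly like torchrun --nproc_per_node=<num_gpus> main_pretrain.py <your usual arguments>. torchrun sets the RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR, and MASTER_PORT environment variables, which the script's distributed initialization typically reads, so args.distributed ends up True and set_epoch is called at the start of every epoch.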