A question about Pipeline Parallel's data_iter
When I use the Pipeline Parallel feature, I run into an error related to the `data_iter`. Here is my code:
```python
import contextlib

import deepspeed
import torch
from omegaconf import OmegaConf
from torch.utils.data import DataLoader

train_dataloader = DataLoader(
    datasets['train'],
    shuffle=True,
    drop_last=True,
    batch_size=ds_cfg.train_micro_batch_size_per_gpu,
    generator=g,
    collate_fn=collate_func,
)
engine, _, _, _ = deepspeed.initialize(model=model,
                                       config=OmegaConf.to_container(ds_cfg),
                                       model_parameters=[p for p in model.parameters() if p.requires_grad])

for step in range(cfg.run_cfg.max_epoch * num_update_steps_per_epoch):
    with (torch.cuda.amp.autocast(dtype=model_dtype, cache_enabled=False)
          if model_dtype != torch.float32 else contextlib.nullcontext()):
        # one train_batch() call runs a full pipeline schedule over the micro-batches
        loss = engine.train_batch(data_iter=train_dataloader)
```
The `collate_fn` is custom; it returns a `Tuple[Tensor, Tensor, Tensor, Tensor]`, like:

```python
return (new_batch['image'], data_dict['input_ids'], data_dict['labels'], data_dict['attention_mask'])
```
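
For reference, the collate function is roughly shaped like this (a simplified sketch; the real one also builds `new_batch` and `data_dict` from the image processor and tokenizer, which I omit here):

```python
from typing import Dict, List, Tuple

import torch
from torch import Tensor

def collate_func(samples: List[Dict]) -> Tuple[Tensor, Tensor, Tensor, Tensor]:
    # Simplified sketch: stack per-sample tensors into batched tensors.
    images = torch.stack([s['image'] for s in samples])                   # [B, 3, 224, 224]
    input_ids = torch.stack([s['input_ids'] for s in samples])            # [B, 256]
    labels = torch.stack([s['labels'] for s in samples])                  # [B, 256]
    attention_mask = torch.stack([s['attention_mask'] for s in samples])  # [B, 256]
    return (images, input_ids, labels, attention_mask)
```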
I print the size of each item, like:

```python
train_iter = iter(train_dataloader)
debug_data = next(train_iter)
for d in debug_data:
    print(d.size())
    print("===" * 16)
```

which outputs:

```
torch.Size([1, 3, 224, 224])
================================================
torch.Size([1, 256])
================================================
torch.Size([1, 256])
================================================
torch.Size([1, 256])
================================================
```
However, I find that the input of the first pipe layer only contains `next(iter(train_dataloader))[0]`, i.e. `new_batch['image']`, which causes:

```
ValueError: not enough values to unpack (expected 4, got 1)
```

This makes me confused. How can I solve it? Thanks for the help!
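
My current guess (unverified): `PipelineEngine.train_batch` seems to treat each batch as an `(inputs, labels)` pair, feeding `batch[0]` to the first stage and keeping `batch[1]` for the loss, which would explain why only `new_batch['image']` reaches the first pipe layer. If that is right, repacking the collate output into a nested pair like the sketch below might be what the engine expects (grouping `attention_mask` with the inputs is my own assumption):

```python
def pipeline_collate_func(samples):
    # Hypothetical wrapper around my collate_func that repacks the flat
    # 4-tuple into an (inputs, labels) pipeline batch.
    images, input_ids, labels, attention_mask = collate_func(samples)
    # batch[0]: tuple of tensors fed to the first pipe stage;
    # batch[1]: tensor(s) passed to the loss_fn on the last stage.
    return ((images, input_ids, attention_mask), labels)
```

Is that the intended usage?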