A question about Pipeline Parallel's data_iter
When I use the Pipeline Parallel feature, I run into an error related to the `data_iter`. Here is my code:
```python
import contextlib

import deepspeed
import torch
from omegaconf import OmegaConf
from torch.utils.data import DataLoader

train_dataloader = DataLoader(
    datasets['train'],
    shuffle=True,
    drop_last=True,
    batch_size=ds_cfg.train_micro_batch_size_per_gpu,
    generator=g,
    collate_fn=collate_func,
)
engine, _, _, _ = deepspeed.initialize(model=model,
                                       config=OmegaConf.to_container(ds_cfg),
                                       model_parameters=[p for p in model.parameters() if p.requires_grad])

for step in range(cfg.run_cfg.max_epoch * num_update_steps_per_epoch):
    with (torch.cuda.amp.autocast(dtype=model_dtype, cache_enabled=False)
          if model_dtype != torch.float32 else contextlib.nullcontext()):
        # one train_batch() call runs a full pipeline schedule over the micro-batches
        loss = engine.train_batch(data_iter=train_dataloader)
```
The `collate_fn` is custom; it returns a `Tuple[Tensor, Tensor, Tensor, Tensor]`, like:

```python
return (new_batch['image'], data_dict['input_ids'], data_dict['labels'], data_dict['attention_mask'])
```
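
For reference, the collate function is roughly shaped like this (a simplified sketch; the real one also builds `new_batch` and `data_dict` from the image processor and tokenizer, which I omit here):

```python
from typing import Dict, List, Tuple

import torch
from torch import Tensor

def collate_func(samples: List[Dict]) -> Tuple[Tensor, Tensor, Tensor, Tensor]:
    # Simplified sketch: stack per-sample tensors into batched tensors.
    images = torch.stack([s['image'] for s in samples])                   # [B, 3, 224, 224]
    input_ids = torch.stack([s['input_ids'] for s in samples])            # [B, 256]
    labels = torch.stack([s['labels'] for s in samples])                  # [B, 256]
    attention_mask = torch.stack([s['attention_mask'] for s in samples])  # [B, 256]
    return (images, input_ids, labels, attention_mask)
```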
I print the size of each item, like:

```python
train_iter = iter(train_dataloader)
debug_data = next(train_iter)
for d in debug_data:
    print(d.size())
    print("===" * 16)
```

which outputs:

```
torch.Size([1, 3, 224, 224])
================================================
torch.Size([1, 256])
================================================
torch.Size([1, 256])
================================================
torch.Size([1, 256])
================================================
```
However, I find that the input of the first pipe layer only contains `next(iter(train_dataloader))[0]`, i.e. `new_batch['image']`, which causes:

```
ValueError: not enough values to unpack (expected 4, got 1)
```

This makes me confused. How can I solve it? Thanks for the help!
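
My current guess (unverified): `PipelineEngine.train_batch` seems to treat each batch as an `(inputs, labels)` pair, feeding `batch[0]` to the first stage and keeping `batch[1]` for the loss, which would explain why only `new_batch['image']` reaches the first pipe layer. If that is right, repacking the collate output into a nested pair like the sketch below might be what the engine expects (grouping `attention_mask` with the inputs is my own assumption):

```python
def pipeline_collate_func(samples):
    # Hypothetical wrapper around my collate_func that repacks the flat
    # 4-tuple into an (inputs, labels) pipeline batch.
    images, input_ids, labels, attention_mask = collate_func(samples)
    # batch[0]: tuple of tensors fed to the first pipe stage;
    # batch[1]: tensor(s) passed to the loss_fn on the last stage.
    return ((images, input_ids, attention_mask), labels)
```

Is that the intended usage?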