karpathy / llm.c

LLM training in simple, raw C/CUDA

Repository from GitHub: https://github.com/karpathy/llm.c

Dataloader shuffle_rng logic bug under multi-gpu settings?

codedeft opened this issue

It seems that each process in a multi-GPU training run uses a different seed for data shuffling (42 + process_rank, as seen in dataloader.h line 172). This gives each process a different random permutation of shard_indices, as well as of intra_shard_indices, and can lead to overlapping data loads across processes.
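
To make the concern concrete, here is a minimal, self-contained sketch of the behavior being described. The RNG and shuffle helpers below are hypothetical stand-ins (llm.c has its own Mersenne Twister and permutation code in the repo); the point is only that seeding with 42 + process_rank gives every rank its own shard permutation, rather than one shared permutation that ranks could slice disjointly.

```c
#include <stdio.h>
#include <stdint.h>

// hypothetical stand-in RNG (xorshift64*), not the actual llm.c rand code
static uint64_t rng_state;
static void rng_seed(uint64_t seed) { rng_state = seed; }
static uint32_t rng_u32(void) {
    rng_state ^= rng_state >> 12;
    rng_state ^= rng_state << 25;
    rng_state ^= rng_state >> 27;
    return (uint32_t)((rng_state * 0x2545F4914F6CDD1DULL) >> 32);
}

// Fisher-Yates shuffle of the shard indices
static void shuffle(int *idx, int n) {
    for (int i = n - 1; i > 0; i--) {
        int j = (int)(rng_u32() % (uint32_t)(i + 1));
        int tmp = idx[i]; idx[i] = idx[j]; idx[j] = tmp;
    }
}

int main(void) {
    const int num_shards = 8;
    for (int process_rank = 0; process_rank < 2; process_rank++) {
        int shard_indices[8];
        for (int i = 0; i < num_shards; i++) shard_indices[i] = i;
        rng_seed(42 + process_rank);  // per-rank seed, as in the issue
        shuffle(shard_indices, num_shards);
        printf("rank %d permutation:", process_rank);
        for (int i = 0; i < num_shards; i++) printf(" %d", shard_indices[i]);
        printf("\n");
    }
    // Because each rank shuffles independently, rank 0 and rank 1 can both
    // visit the same shard early in an epoch; with a shared seed, each rank
    // could instead take a disjoint, strided slice of one common permutation.
    return 0;
}
```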

Is this expected? I would have expected the random seed for data shuffling to be the same across all processes.