pytorch / text

Models, data loaders and abstractions for language processing, powered by PyTorch

Home Page: https://pytorch.org/text

Does DataLoader(shuffle=True) really shuffle DBpedia dataset correctly?

fujidaiti opened this issue

According to the docs, the DBpedia dataset has 14 classes (labels) and 40,000 texts per class in the training split. Hence, if I create batches with DataLoader(shuffle=True) as follows:

import torchtext.datasets as d
from torch.utils.data.dataloader import DataLoader

train = DataLoader(
    d.DBpedia(split="train", root=".cache"),  # DBpedia training split
    batch_size=10000,
    shuffle=True,  # expecting each batch to mix samples from all 14 classes
)

the labels should be roughly uniformly distributed within each batch. In practice, however, each batch seems to contain only a few distinct labels:

for labels, texts in train:
    # number of distinct labels in the batch; close to 14 would indicate a good shuffle
    print(len(set(labels.tolist())))

The output of the above code is:

1
1
1
2
2
2
2
3
3
3
3
4
4
3
3
.
.
.
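
I wonder if this is related to how the dataset is exposed (I'm not sure, and it probably depends on the torchtext version): if DBpedia comes back as an iterable-style datapipe rather than a map-style dataset, DataLoader cannot build a global permutation of indices, so "shuffle" would at best be a local, buffer-based shuffle. A quick check I can run:

import torch
import torchtext.datasets as d

ds = d.DBpedia(split="train", root=".cache")
# If this is an IterDataPipe / IterableDataset subclass, DataLoader has no
# __getitem__ to index into, so it cannot perform a true global shuffle.
print(type(ds))
print(isinstance(ds, torch.utils.data.IterableDataset))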

How can I fix this? Or is my implementation wrong?
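
A possible workaround I'm considering (just a sketch, assuming the dataset is an IterDataPipe and that the machine has enough memory; the buffer_size and the materialized list below are my own choices, not from the docs):

from torch.utils.data import DataLoader
import torchtext.datasets as d

# Option A: add an explicit shuffle datapipe with a much larger buffer.
# The default shuffle buffer (10,000 elements) is small compared to the
# 40,000 samples per class, so a buffered shuffle only mixes a handful of
# neighboring classes at a time.
dp = d.DBpedia(split="train", root=".cache").shuffle(buffer_size=200_000)
train = DataLoader(dp, batch_size=10000, shuffle=True)

# Option B: materialize the samples into a list (a map-style dataset), so that
# shuffle=True draws a true global permutation, at the cost of holding the
# whole split in memory.
samples = list(d.DBpedia(split="train", root=".cache"))
train = DataLoader(samples, batch_size=10000, shuffle=True)

But I'm not sure whether this is the intended way to shuffle these datasets.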

P.S.
Interactive code is available on Google Colab.