pytorch / text

Models, data loaders and abstractions for language processing, powered by PyTorch

Home Page: https://pytorch.org/text

Does DataLoader(shuffle=True) really shuffle DBpedia dataset correctly?

fujidaiti opened this issue

According to the docs, the DBpedia dataset has 14 classes (labels) and 40,000 texts per class in the training split. Hence, if I create batches with DataLoader(shuffle=True) as follows:

import torchtext.datasets as d
from torch.utils.data.dataloader import DataLoader

train = DataLoader(
    d.DBpedia(split="train", root=".cache"),  # DBpedia training split
    batch_size=10000,
    shuffle=True,  # expecting each batch to mix samples from all 14 classes
)

the labels should be roughly uniformly distributed within each batch. In practice, however, each batch seems to contain only a few distinct labels:

for labels, texts in train:
    # number of distinct labels in the batch; close to 14 would indicate a good shuffle
    print(len(set(labels.tolist())))

The output of the above code is:

1
1
1
2
2
2
2
3
3
3
3
4
4
3
3
.
.
.
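
I wonder if this is related to how the dataset is exposed (I'm not sure, and it probably depends on the torchtext version): if DBpedia comes back as an iterable-style datapipe rather than a map-style dataset, DataLoader cannot build a global permutation of indices, so "shuffle" would at best be a local, buffer-based shuffle. A quick check I can run:

import torch
import torchtext.datasets as d

ds = d.DBpedia(split="train", root=".cache")
# If this is an IterDataPipe / IterableDataset subclass, DataLoader has no
# __getitem__ to index into, so it cannot perform a true global shuffle.
print(type(ds))
print(isinstance(ds, torch.utils.data.IterableDataset))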

How can I fix this? Or is my implementation wrong?
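
A possible workaround I'm considering (just a sketch, assuming the dataset is an IterDataPipe and that the machine has enough memory; the buffer_size and the materialized list below are my own choices, not from the docs):

from torch.utils.data import DataLoader
import torchtext.datasets as d

# Option A: add an explicit shuffle datapipe with a much larger buffer.
# The default shuffle buffer (10,000 elements) is small compared to the
# 40,000 samples per class, so a buffered shuffle only mixes a handful of
# neighboring classes at a time.
dp = d.DBpedia(split="train", root=".cache").shuffle(buffer_size=200_000)
train = DataLoader(dp, batch_size=10000, shuffle=True)

# Option B: materialize the samples into a list (a map-style dataset), so that
# shuffle=True draws a true global permutation, at the cost of holding the
# whole split in memory.
samples = list(d.DBpedia(split="train", root=".cache"))
train = DataLoader(samples, batch_size=10000, shuffle=True)

But I'm not sure whether this is the intended way to shuffle these datasets.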

P.S.
Interactive code is available on Google Colab.