There seems to be something strange when the data is loading

Question

MellowMemories opened this issue 6 months ago · comments

Looking at this official configuration provided by USB: every training epoch runs 1024 iterations, and each iteration uses 8 labeled images. So each epoch would need 1024 * 8 = 8192 labeled images.
However, this configuration has only 400 labeled images. I don't understand how this can work. Is it reasonable?
By the way, I am a complete novice in deep learning and semi-supervised learning.
Thanks a lot!

Hao · Answer 1 · Tue Feb 20 2024 23:25:36 GMT+0800 (China Standard Time)

DistributedSampler replicates the data to fill the training iterations in one epoch.
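To make the replication idea concrete, here is a minimal sketch in plain Python. It is not USB's actual sampler code: the function name `replicated_indices` and the "repeat shuffled passes, then truncate" strategy are assumptions chosen only to illustrate how 400 labeled samples can fill 1024 * 8 = 8192 draws.

```python
import math
import random
from collections import Counter


def replicated_indices(dataset_size, num_samples, seed=0):
    """Return num_samples indices by concatenating shuffled passes over the dataset.

    Hypothetical helper for illustration; not taken from USB or PyTorch.
    """
    rng = random.Random(seed)
    indices = []
    passes = math.ceil(num_samples / dataset_size)
    for _ in range(passes):
        perm = list(range(dataset_size))
        rng.shuffle(perm)  # a fresh random order for every pass
        indices.extend(perm)
    return indices[:num_samples]  # truncate the last, partial pass


# Numbers from the question: 400 labeled images, 1024 iterations of 8 images.
indices = replicated_indices(dataset_size=400, num_samples=1024 * 8)
counts = Counter(indices)
print(len(indices), min(counts.values()), max(counts.values()))
```

Under this scheme every labeled image is drawn 20 or 21 times per "epoch", since 8192 = 20 * 400 + 192.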

MellowMemories · Answer 2 · Tue Feb 20 2024 23:54:58 GMT+0800 (China Standard Time)

I am training in a single-machine, single-GPU environment. With the default configuration, I am using DistributedSampler to load data. However, the expected behavior is to iterate through n labeled data points and ulabel_ratio * n unlabeled data points within one epoch. Upon inspecting the code of DistributedSampler, I found that it iterates through batch_size * train_iters_per_epoch labeled data points in a single epoch, which far exceeds the number of labeled data points I have. Does this imply that some data points are being reused? Is this reasonable in a semi-supervised classification task? From my limited understanding, all data points should only be iterated over once in a single epoch. Thank you very much for your time and attention.

Aurélien GAUFFRE · Answer 3 · Wed Feb 28 2024 23:34:10 GMT+0800 (China Standard Time)

In semi-supervised learning, deciding what counts as an "epoch" is tricky. Classical semi-supervised methods, as implemented in this USB package, use batches that contain both labeled and unlabeled examples in a fixed ratio (often called $\mu$ in the literature, or 'uratio' in USB, set to 1 in your case). Because of this mixing, the balance of labeled and unlabeled data in your batches generally doesn't match the balance of the original data set. So when you try to complete an "epoch", you'll inevitably go over some data points more than once, whether they're labeled or not, just to make sure the model sees everything. Even one of the FixMatch creators mentioned that they more or less arbitrarily defined an epoch based roughly on how many unlabeled examples there are in CIFAR-100, which shows that you should not pay too much attention to that definition. This is also why the semi-supervised literature usually speaks of a number of "steps" rather than "epochs".

PS: I may be wrong, but I believe USB's definition of one epoch as 1024 steps everywhere might originate from that original FixMatch choice on CIFAR-100.
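A small sketch of why the batch ratio and the data set ratio disagree. The helper `passes_per_epoch` is hypothetical, and the unlabeled set size of 49,600 is an assumed number for illustration only; the 400 labeled images, batch size 8, uratio 1 and 1024 steps come from this thread.

```python
def passes_per_epoch(num_labeled, num_unlabeled, batch_size, uratio, steps):
    """Return how many times each subset is traversed in `steps` iterations.

    Each step draws `batch_size` labeled and `batch_size * uratio` unlabeled samples.
    """
    labeled_seen = batch_size * steps
    unlabeled_seen = batch_size * uratio * steps
    return labeled_seen / num_labeled, unlabeled_seen / num_unlabeled


lb_passes, ulb_passes = passes_per_epoch(
    num_labeled=400,
    num_unlabeled=49_600,  # assumed size, not from the thread
    batch_size=8,
    uratio=1,
    steps=1024,
)
print(f"labeled passes: {lb_passes:.2f}, unlabeled passes: {ulb_passes:.2f}")
# labeled passes: 20.48, unlabeled passes: 0.17
```

With these numbers one "epoch" of 1024 steps covers the labeled set about 20 times but only a fraction of the unlabeled set, so no single step count is one pass over both.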

MellowMemories · Answer 4 · Wed Feb 28 2024 23:59:19 GMT+0800 (China Standard Time)

Thank you very much for clarifying my doubts.

I now have a clear understanding of the code organization and program execution flow in this repository, and I have read through all the recent papers on semi-supervised learning. I have gained a preliminary understanding of the methods used in the field of semi-supervised learning: supervised loss + auxiliary loss + pseudo-labeling loss. Building upon this foundation, the 'USB' code has done an excellent job abstracting the workflow for semi-supervised learning. You and your team have done great work.

Regarding data loading, with my own practice and your guidance, I believe I have grasped it quite well. Currently, I have divided my dataset into training, validation, and test sets in a ratio of 7:1:2. In the training set, 20% of the data is labeled and 80% is unlabeled. Since the 'train_step' function of the program loads data with the labeled data as the reference point, all I need to do is divide the size of my labeled data by 'train_batch_size' to obtain 'num_train_iters'. This ensures that each labeled sample is used only once per epoch. I am now pursuing my own work based on this data loading method.
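A rough sketch of that calculation. The total data set size of 10,000 and the batch size of 8 are assumed values for illustration, the helper name is hypothetical, and rounding up with `math.ceil` is my assumption for the case where the labeled set is not a multiple of the batch size.

```python
import math


def iters_for_one_labeled_pass(num_labeled, train_batch_size):
    """Number of iterations needed to draw every labeled sample once."""
    return math.ceil(num_labeled / train_batch_size)


total = 10_000                          # assumed data set size
num_train = total * 7 // 10             # 7:1:2 split -> 7000 training samples
num_labeled = num_train * 2 // 10       # 20% of the training set is labeled
num_unlabeled = num_train - num_labeled # remaining 80% is unlabeled

num_train_iters = iters_for_one_labeled_pass(num_labeled, train_batch_size=8)
print(num_labeled, num_unlabeled, num_train_iters)
```

With these assumed numbers, 175 iterations cover the 1400 labeled samples once; an unlabeled-to-labeled batch ratio of 4 would then draw 8 * 4 * 175 = 5600 unlabeled samples, which also matches the unlabeled set size.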

Once again, thank you for your explanations!

Zahra'a Hamwi · Answer 5 · Tue Mar 05 2024 06:06:29 GMT+0800 (China Standard Time)

Thank you for opening this issue; it has enlightened me. As someone new to the field, I'm having difficulty understanding the execution flow within this repository, particularly how the label ratio is used in training the SSL algorithms and how to choose the num_labels parameter. Is there any intuition behind this?
It would be immensely helpful if you could provide a screenshot of the configuration used in the example you mentioned in your comment.

Additionally, I'm curious about your preferred method for running the code. Did you rely on the notebooks such as Beginner_Example.ipynb or Custom_Dataset.ipynb found in the notebooks folder, or is there a better approach?

Any guidance you can offer would be greatly appreciated. Thanks a lot.

github-actions · Answer 6 · Sat May 04 2024 16:28:06 GMT+0800 (China Standard Time)

Stale issue message