Will the training framework do upsampling when train-num-samples is far more than the amount of actual data?
zwsjink opened this issue
In the DataComp paper, only about 14M samples remain after the image-based filtering and CLIP score thresholding steps at the medium scale. However, train-num-samples equals 128M / 5 epochs = 25,600,000 samples per epoch, which is more than 14M, so I suppose open_clip will upsample the data. Am I correct?
By "upsampling" I mean that some data will appear more than once within an individual epoch.
Hey @zwsjink. Yes, that will happen (regardless of the sampling strategy, by the pigeonhole principle). Also note that the training script samples datapoints with replacement.
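As a back-of-the-envelope check of the numbers from the question (a sketch, assuming the ~14M figure is the pool size after filtering):

```python
# How often each sample is expected to appear per "epoch", on average,
# given the counts quoted in the question above.
train_num_samples = 128_000_000 // 5   # samples seen per epoch = 25,600,000
dataset_size = 14_000_000              # approx. pool size after filtering

expected_repeats = train_num_samples / dataset_size
print(f"{expected_repeats:.2f}")  # ~1.83 appearances per sample per epoch
```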
I see. So when you say "with replacement", do you mean this
parser.add_argument(
"--dataset-resampled",
default=False,
action="store_true",
help="Whether to use sampling with replacement for webdataset shard selection."
)
option, so that we not only shuffle between shards but also within shards to achieve more randomness?
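(For reference, that flag behaves like any argparse store_true option, defaulting to False until it is passed on the command line; a minimal standalone check using the definition quoted above:)

```python
import argparse

# Same flag definition as in the open_clip snippet quoted above.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--dataset-resampled",
    default=False,
    action="store_true",
    help="Whether to use sampling with replacement for webdataset shard selection.",
)

# Passing the flag flips the value from its False default to True.
print(parser.parse_args([]).dataset_resampled)                       # False
print(parser.parse_args(["--dataset-resampled"]).dataset_resampled)  # True
```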
Yes, every time a worker needs a new shard, they sample uniformly from the pool of all shards. The samples in a shard are used to fill a shuffle buffer, from which samples are drawn at random.
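Conceptually, that pipeline can be sketched as follows (a minimal stdlib illustration of the idea, not the actual webdataset implementation; the function and sample names are made up):

```python
import random

def resampled_stream(shards, shuffle_buffer_size, rng):
    """Infinite sample stream: pick shards uniformly with replacement,
    push their samples through a shuffle buffer, yield at random."""
    buffer = []
    while True:
        shard = rng.choice(shards)          # shard chosen with replacement
        for sample in shard:                # fill the shuffle buffer
            buffer.append(sample)
            if len(buffer) >= shuffle_buffer_size:
                idx = rng.randrange(len(buffer))
                yield buffer.pop(idx)       # draw a random buffered sample

# Example: 3 tiny "shards" of 4 samples each
rng = random.Random(0)
shards = [[f"s{i}_{j}" for j in range(4)] for i in range(3)]
stream = resampled_stream(shards, shuffle_buffer_size=6, rng=rng)
batch = [next(stream) for _ in range(8)]
print(batch)
```

Because shards are drawn with replacement, a sample can recur before every other sample has been seen once, which is exactly the upsampling effect discussed above.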
Thanks for the clarification. :D
Sorry to bother you, but what is the difference between the two checkpoints I get after training? I see one named epoch_5.pt and another named epoch_latest.pt. The original config sets the number of training epochs to 5, so I believe these two checkpoints should be the same, but the file sizes look different.
That's a bit of a quirk of open_clip; the checkpoints should be the same.
Closing for now, feel free to re-open if you have other questions!