mlfoundations / datacomp

DataComp: In search of the next generation of multimodal datasets

Home Page: http://datacomp.ai/


Will the training framework do upsampling when train-num-samples is far larger than the amount of actual data?

zwsjink opened this issue · comments

In the DataComp paper, only about 14M samples are left after the image-based filtering & CLIP score thresholding step at the medium scale. However, train-num-samples is 128M / 5 epochs = 25,600,000 per epoch, which is larger than 14M, so I suppose open_clip will upsample the data. Am I correct?

By "upsampling", I mean that some data will appear more than once within an individual epoch.
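For a quick sanity check of the arithmetic, here is a small back-of-the-envelope calculation in Python (the 128M budget and 5 epochs come from the config discussed above; the ~14M filtered-pool size is the approximate figure quoted above):

total_seen_samples = 128_000_000   # medium-scale sample budget (samples seen during training)
epochs = 5
filtered_pool = 14_000_000         # approximate pool size after filtering (from the paper)

samples_per_epoch = total_seen_samples // epochs          # 25,600,000
avg_views_per_sample = samples_per_epoch / filtered_pool  # how often each sample is seen per epoch
print(samples_per_epoch, round(avg_views_per_sample, 2))  # 25600000 1.83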

Hey @zwsjink. Yes, that will happen (regardless of the sampling strategy, by the pigeonhole principle). Also note that the training script samples datapoints with replacement.
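To make the "with replacement" point concrete, here is a minimal, hypothetical simulation with toy sizes (not the actual training code): when more samples are drawn per epoch than exist in the pool, at least one sample must repeat, and with uniform sampling with replacement repeats typically appear even sooner.

import random
from collections import Counter

pool_size = 14          # stand-in for the ~14M filtered samples
draws_per_epoch = 25    # stand-in for the ~25.6M samples seen per epoch

# Sampling with replacement: every draw picks uniformly from the full pool.
epoch = [random.randrange(pool_size) for _ in range(draws_per_epoch)]

counts = Counter(epoch)
repeated = sum(1 for c in counts.values() if c > 1)
print(f"{repeated} of {len(counts)} distinct samples were seen more than once")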

I see. So when you say "with replacement", do you mean this

parser.add_argument(
        "--dataset-resampled",
        default=False,
        action="store_true",
        help="Whether to use sampling with replacement for webdataset shard selection."
    )

option, so that we shuffle not only between shards but also within shards to achieve more randomness?

Yes, every time a worker needs a new shard, they sample uniformly from the pool of all shards. The samples in a shard are used to fill a shuffle buffer, from which samples are drawn at random.
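For intuition, here is a simplified sketch of that pattern (uniform shard re-sampling plus a bounded shuffle buffer). The function name resampled_stream, the read_shard reader, and buffer_size are illustrative only; this is not the actual open_clip/webdataset code.

import random

def resampled_stream(shards, read_shard, buffer_size=1000):
    """Yield samples by repeatedly picking a shard uniformly with replacement
    and drawing samples at random from a bounded shuffle buffer."""
    buffer = []
    while True:
        # Whenever a worker needs a new shard, pick one uniformly from the full pool.
        shard = random.choice(shards)
        for sample in read_shard(shard):   # read_shard yields the samples in a shard
            buffer.append(sample)
            if len(buffer) >= buffer_size:
                # Swap a random element to the end and pop it, i.e. draw at random.
                idx = random.randrange(len(buffer))
                buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
                yield buffer.pop()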

Thanks for the clarification. :D

Sorry to bother you, but what is the difference between the two checkpoints I get after training? I see one named epoch_5.pt and another named epoch_latest.pt. The original config sets the number of training epochs to 5, so I believe these two checkpoints should be the same, but the file sizes look different.

That's a bit of a quirk of open_clip; the checkpoints should be the same.

Closing for now, feel free to re-open if you have other questions!