mlfoundations / datacomp

DataComp: In search of the next generation of multimodal datasets

Home Page: http://datacomp.ai/


Will the training framework do upsampling when train-num-samples is far larger than the amount of actual data?

zwsjink opened this issue · comments

In the DataComp paper, only about 14M samples are left after the image-based filtering & CLIP score thresholding step at the medium scale. However, train-num-samples is 128M / 5 epochs = 25,600,000 per epoch, which is larger than 14M, so I suppose open_clip will upsample the data. Am I correct?

By "upsampling", I mean that some data will appear more than once within an individual epoch.
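For a quick sanity check of the arithmetic, here is a small back-of-the-envelope calculation in Python (the 128M budget and 5 epochs come from the config discussed above; the ~14M filtered-pool size is the approximate figure quoted above):

total_seen_samples = 128_000_000   # medium-scale sample budget (samples seen during training)
epochs = 5
filtered_pool = 14_000_000         # approximate pool size after filtering (from the paper)

samples_per_epoch = total_seen_samples // epochs          # 25,600,000
avg_views_per_sample = samples_per_epoch / filtered_pool  # how often each sample is seen per epoch
print(samples_per_epoch, round(avg_views_per_sample, 2))  # 25600000 1.83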

Hey @zwsjink. Yes, that will happen (regardless of the sampling strategy, by the pigeonhole principle). Also note that the training script samples datapoints with replacement.
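To make the "with replacement" point concrete, here is a minimal, hypothetical simulation with toy sizes (not the actual training code): when more samples are drawn per epoch than exist in the pool, at least one sample must repeat, and with uniform sampling with replacement repeats typically appear even sooner.

import random
from collections import Counter

pool_size = 14          # stand-in for the ~14M filtered samples
draws_per_epoch = 25    # stand-in for the ~25.6M samples seen per epoch

# Sampling with replacement: every draw picks uniformly from the full pool.
epoch = [random.randrange(pool_size) for _ in range(draws_per_epoch)]

counts = Counter(epoch)
repeated = sum(1 for c in counts.values() if c > 1)
print(f"{repeated} of {len(counts)} distinct samples were seen more than once")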

I see. So when you say "with replacement", do you mean this

parser.add_argument(
        "--dataset-resampled",
        default=False,
        action="store_true",
        help="Whether to use sampling with replacement for webdataset shard selection."
    )

option, so that we shuffle not only between shards but also within shards to achieve more randomness?

Yes, every time a worker needs a new shard, they sample uniformly from the pool of all shards. The samples in a shard are used to fill a shuffle buffer, from which samples are drawn at random.
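For intuition, here is a simplified sketch of that pattern (uniform shard re-sampling plus a bounded shuffle buffer). The function name resampled_stream, the read_shard reader, and buffer_size are illustrative only; this is not the actual open_clip/webdataset code.

import random

def resampled_stream(shards, read_shard, buffer_size=1000):
    """Yield samples by repeatedly picking a shard uniformly with replacement
    and drawing samples at random from a bounded shuffle buffer."""
    buffer = []
    while True:
        # Whenever a worker needs a new shard, pick one uniformly from the full pool.
        shard = random.choice(shards)
        for sample in read_shard(shard):   # read_shard yields the samples in a shard
            buffer.append(sample)
            if len(buffer) >= buffer_size:
                # Swap a random element to the end and pop it, i.e. draw at random.
                idx = random.randrange(len(buffer))
                buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
                yield buffer.pop()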

Thanks for the clarification. :D

Sorry to bother you, but what is the difference between the two checkpoints I get after training? I see one named epoch_5.pt and another named epoch_latest.pt. The original config sets the number of training epochs to 5, so I believe these two checkpoints should be the same, but the file sizes look different.

That's a bit of a quirk of open_clip; the checkpoints should be the same.

Closing for now, feel free to re-open if you have other questions!