facebookresearch / MetaCLIP

ICLR2024 Spotlight: curation/training code, metadata, distribution and pre-trained models for MetaCLIP; CVPR 2024: MoDE: CLIP Data Experts via Clustering

Confusion regarding construction of 400M dataset

varadgunjal opened this issue

Based on the details provided in the MetaCLIP paper, I understand that the magic number t = 20k is a threshold used to limit the number of texts/pairs for each entry. Entries with fewer than t pairs (tail entries) retain all associated pairs, while entries with more than t pairs (head entries) are sub-sampled to t pairs.
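To make sure I'm reading the algorithm right, here is a minimal sketch of how I picture that sub-sampling (my own illustration in Python, not the released code; the bookkeeping of entries/pairs is simplified):

```python
import random

def balance(pairs_per_entry, t=20_000, seed=0):
    """Tail entries (<= t pairs) keep everything; head entries are
    randomly down-sampled to exactly t pairs. Illustrative only."""
    rng = random.Random(seed)
    balanced = {}
    for entry, pairs in pairs_per_entry.items():
        if len(pairs) <= t:
            balanced[entry] = list(pairs)           # tail: keep all pairs
        else:
            balanced[entry] = rng.sample(pairs, t)  # head: cap at t pairs
    return balanced
```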

It is also mentioned that there are ~16K entries in the 500K query list with counts >= 20K, and they account for 5.35B of the 5.67B total matches. This implies that the remaining ~484K entries have counts < 20K and account for 5.67B - 5.35B ≈ 320M matches. I have checked these numbers against the entry_counts_400m.json file (https://github.com/facebookresearch/MetaCLIP/blob/main/metaclip/entry_counts_400m.json) and they line up.

Based on these two pieces of information, I understand that we would take 20K pairs for each of the 16K entries with counts >= 20K, i.e. 20K * 16K ≈ 320M samples, and then add all matches for the remaining ~484K entries, which from above is ≈ 320M. This gives a dataset of size 320M + 320M = 640M.
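For reference, the back-of-envelope arithmetic I'm doing (rounded numbers taken from the figures above):

```python
t = 20_000
head_entries = 16_000                    # entries with count >= 20K
head_pairs   = head_entries * t          # ≈ 0.32B after capping each at t
tail_pairs   = 5.67e9 - 5.35e9           # ≈ 0.32B matches for the ~484K tail entries
print((head_pairs + tail_pairs) / 1e6)   # ≈ 640.0 (million), not 400M
```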

Is my understanding and the calculation above correct? And if yes, how is this sub-sampled down to 400M?

Thanks for confirming and checking that the paper matches the released distribution.

Let me explain as follows:
(1) 400M is a (lucky) result of running curation, not the goal; the goal of curation is to ensure the quantity/quality of the pool, not a particular resulting training-set size (which cannot be controlled precisely, even though people want comparable training scales);
(2) as a result, when we want to ablate at 400M (for ablation purposes only), we need to increase the pool size if the resulting set is smaller than 400M, and decrease it otherwise. We did so by changing the number of (uniformly shuffled) shards (we collect a bit more than needed) to hit 400M precisely. We ran the curation algorithm multiple times on different numbers of shards to estimate OpenAI CLIP's original pool size (which we don't know);
(3) now to your math, one important detail: one text can match multiple entries, so you cannot infer "5.67B - 5.35B (counts of entry matches) ≈ 320M (count of pairs/texts)"; multiple entries' 20K matches may point to the same pair out of the 400M. We assume the 400M has no duplicates (not stated clearly in the CLIP paper, but obviously the case); see the toy example below;
(4) if our paper has any errors or imprecise terms, do let us know, thanks.
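A toy example of (3), not our curation code, just to show why summing per-entry match counts over-counts unique texts:

```python
# toy illustration: entry-match counts vs. unique texts
texts = [
    "a photo of a golden retriever dog",   # matches "dog" and "golden retriever"
    "a cat sleeping on a sofa",            # matches "cat" and "sofa"
    "dog and cat playing together",        # matches "dog" and "cat"
]
entries = ["dog", "cat", "sofa", "golden retriever"]

match_counts = {e: sum(e in t for t in texts) for e in entries}
total_entry_matches = sum(match_counts.values())   # 6 entry matches ...
unique_texts = len(texts)                          # ... but only 3 unique texts
print(total_entry_matches, unique_texts)
```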

Thanks! This explanation makes sense. A few other follow-up questions -

(1) Was any simple text preprocessing / image preprocessing / deduplication done on the samples from the CC shards before running curation on them?
(2) In substring matching, do you worry about matching case? Based on the code here (def substr_matching(text, metadata): ...), it seems like you match the query's case as-is, but I just wanted to double-check whether everything is converted to lowercase beforehand.
(3) Was the performance verified across 400M sets curated from multiple 1.6B pools (collected from different random CC shards)?

Thanks for the reply; those are indeed good questions. Overall, our goal is to be extremely simple and stay as raw as possible: we didn't run any preprocessing unless it was required for legal reasons or specified by the OpenAI CLIP paper.

Answers:
(1) We ran curation/balancing twice: once before image downloading, with a much higher t, to save Internet traffic/storage, and once before training (re-calibration) with t=20k for 400M or t=170k for 2.5B. We only run image dedup and text dedup in between these two balancing passes; we are confident there is no magic in these steps. Tuning the preprocessing might improve performance further, but we didn't do so, to keep things simple and more general/task-agnostic.
(2) Matching the original case is very important to ensure the quality of texts/captions ("DOG" has a higher chance of coming from a bad source such as a spam email), so please don't convert to lowercase (people are lured by quantity, but quality matters first in this paper); see the sketch below.
(3) No, re-collection is expensive and we only have one collection for 400M (from a 1.6B pool); but we will report multiple balancing seeds from the 1.6B pool in an updated version of the paper.
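A minimal sketch of what (2) implies (illustrative only, not the released substr_matching implementation): the text is checked against the entries without lowercasing, so all-caps spam-style captions do not get matched to a lowercase entry.

```python
def has_match(text, entries):
    """Case-sensitive substring check against the metadata entries.
    Illustrative only: text.lower() is NOT applied, so "DOG"-style
    shouting captions are not matched to the entry "dog"."""
    return any(entry in text for entry in entries)

entries = ["dog", "golden retriever"]
print(has_match("a photo of a golden retriever dog", entries))  # True
print(has_match("BUY DOG FOOD NOW!!!", entries))                # False: case preserved
```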

Hope these answered your questions; let us know if you have any more.

Yes this was very helpful. Thank you!

One follow-up question: can you share the exact text sources used for constructing the 500K query set? More specifically, for the "uni-grams from the English version of Wikipedia occurring at least 100 times", can you point me to a link for the Wikipedia text dump you used?
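For context, this is roughly what I would do given a plain-text English Wikipedia dump (my own guess at the procedure; the file name and the simple tokenization are assumptions, which is exactly why I'm asking about the source you used):

```python
from collections import Counter
import re

# Assumed: a plain-text English Wikipedia dump, one article per line.
# Both the path and the word-level tokenization below are my guesses.
counts = Counter()
with open("enwiki_plain.txt", encoding="utf-8") as f:
    for line in f:
        counts.update(re.findall(r"[A-Za-z0-9']+", line))

unigrams = sorted(w for w, c in counts.items() if c >= 100)
print(len(unigrams), "uni-grams occurring at least 100 times")
```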