facebookresearch / MetaCLIP

ICLR2024 Spotlight: curation/training code, metadata, distribution and pre-trained models for MetaCLIP; CVPR 2024: MoDE: CLIP Data Experts via Clustering

Questions about curating from scratch

simon-ging opened this issue

Dear authors,

First of all, thanks for this very interesting paper and code release.

I am working on building a small dataset with your pipeline (from CommonCrawl using queries) and came across the following questions:

  1. How do you do NSFW filtering?
  2. How do you deduplicate?

Any pointers about what your process looks like would help a lot in reproducing your pipeline.

Thanks,

Thanks for your interest. We use our internal NSFW filters and dedup system. You may consider open-source solutions such as the ones from DataComp (be aware that they rely on OpenAI CLIP, so the result may not be strictly "from scratch").
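
For anyone looking for a starting point on the dedup question: below is a minimal sketch of exact-duplicate removal by hashing raw image bytes. It is not the internal system mentioned above and not DataComp's method; the function name `dedup_exact` and the `(image_bytes, caption)` pair layout are assumptions made for illustration.

```python
import hashlib
from typing import Iterable, Iterator, Tuple


def dedup_exact(pairs: Iterable[Tuple[bytes, str]]) -> Iterator[Tuple[bytes, str]]:
    """Yield (image_bytes, caption) pairs, keeping only the first
    occurrence of each byte-identical image.

    Hypothetical helper for illustration; the pair layout is an assumption.
    """
    seen = set()  # SHA-256 digests of images already emitted
    for image_bytes, caption in pairs:
        digest = hashlib.sha256(image_bytes).hexdigest()
        if digest in seen:
            continue  # same bytes seen before: drop this pair
        seen.add(digest)
        yield image_bytes, caption


if __name__ == "__main__":
    sample = [(b"img-A", "a cat"), (b"img-B", "a dog"), (b"img-A", "cat again")]
    for _, caption in dedup_exact(sample):
        print(caption)
```

This only catches byte-identical files. Images that were resized or re-encoded hash differently, so near-duplicate removal would need something like perceptual hashing or embedding similarity on top.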