Questions about curating from scratch
simon-ging opened this issue · comments
S. Ging commented
Dear authors,
First of all, thanks for this very interesting paper and code release.
I am working on building a small dataset with your pipeline (from CommonCrawl using queries) and came across the following questions:
- How do you do NSFW filtering?
- How do you deduplicate?
Any pointers on what your process looks like would help a lot in reproducing your pipeline.
Thanks,
Hu Xu commented
Thanks for your interest. We use our internal NSFW filters and dedup system. You may consider open-source solutions such as the ones from DataComp (be aware that they use OpenAI CLIP, so the result may not be strictly from scratch).
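Since the internal dedup system mentioned above is not public, here is a minimal sketch of one possible starting point, not the authors' method: exact-duplicate removal by hashing raw image bytes. The function name `dedup_pairs` and the `(image_bytes, caption)` pair layout are assumptions made for illustration. This only catches byte-identical images; resized or re-encoded copies would need a near-duplicate approach instead.

```python
import hashlib


def dedup_pairs(pairs):
    """Keep the first (image_bytes, caption) pair for each distinct image.

    Hypothetical helper: images are compared by the SHA-256 digest of their
    raw bytes, so only exact byte-level duplicates are removed.
    """
    seen = set()
    unique = []
    for image_bytes, caption in pairs:
        digest = hashlib.sha256(image_bytes).hexdigest()
        if digest in seen:
            continue  # same bytes already kept earlier, skip this pair
        seen.add(digest)
        unique.append((image_bytes, caption))
    return unique


# Toy data standing in for downloaded images and their alt-text.
pairs = [
    (b"\x89PNG-a", "a cat"),
    (b"\x89PNG-b", "a dog"),
    (b"\x89PNG-a", "same cat, different page"),
]
deduped = dedup_pairs(pairs)
print(len(deduped))
```

Keeping the first occurrence is an arbitrary choice; a real pipeline might prefer the pair with the better caption.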