Questions about curating from scratch

simon-ging opened this issue 9 months ago

Dear authors,

First of all, thanks for this very interesting paper and code release.

I am working on building a small dataset with your pipeline (from CommonCrawl, using queries) and came across the following questions:

How do you do NSFW filtering?
How do you deduplicate?

Any pointers on what your process looks like would help a lot in reproducing your pipeline.

Thanks,

Hu Xu · Answer 1 · Sat Oct 14 2023 02:19:12 GMT+0800 (China Standard Time)

Thanks for your interest. We use our internal NSFW filters and dedup system. You may want to consider open-source solutions such as the ones from DataComp (be aware that they use OpenAI CLIP, so the result may not be strictly "from scratch").
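
For readers who only need a starting point for the dedup question, here is a minimal sketch of exact deduplication over (URL, caption) pairs. It is not the internal system mentioned above, which is not described in this thread, and it is not DataComp's method either. The function names `normalize_text` and `dedup_pairs` and the choice of hashing the URL together with a whitespace- and case-normalized caption are assumptions made for illustration only.

```python
import hashlib
from typing import Iterable, Iterator, Tuple


def normalize_text(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different captions match."""
    return " ".join(text.lower().split())


def dedup_pairs(pairs: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, str]]:
    """Yield each (url, text) pair once, keeping the first occurrence.

    Hypothetical helper: a pair counts as a duplicate when its URL and
    normalized caption both match an earlier pair.
    """
    seen = set()
    for url, text in pairs:
        key = hashlib.sha1(
            f"{url}\t{normalize_text(text)}".encode("utf-8")
        ).hexdigest()
        if key in seen:
            continue  # exact duplicate of an earlier pair, skip it
        seen.add(key)
        yield url, text


if __name__ == "__main__":
    sample = [
        ("http://example.com/1.jpg", "A cat"),
        ("http://example.com/1.jpg", "a  cat "),  # same pair after normalization
        ("http://example.com/2.jpg", "A cat"),    # different URL, kept
    ]
    for url, text in dedup_pairs(sample):
        print(url, text)
```

This only catches exact matches on URL and caption. Near-duplicate images served from different URLs would need an image-level approach (for example perceptual hashing or embedding similarity), which is outside the scope of this sketch.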