EleutherAI / the-pile

Scripts to dedup and filter Common Crawl?

shangw-nvidia opened this issue · comments

Hi,

I noticed that the download URL for `CommonCrawlDataset` is http://eaidata.bmk.sh/data/pile_cc_filtered_deduped.jsonl.zst. In other words, is this CC dataset already deduplicated and filtered? However, https://github.com/leogao2/commoncrawl_downloader, which is linked in the README, doesn't seem to include the deduplication and filtering scripts. Where can I find out exactly how deduplication and filtering for Pile CC were done?
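For context, I'm currently streaming records out of that file like this (a minimal sketch; it assumes the third-party `zstandard` package, and the record schema is my guess rather than something documented in this repo):

```python
import io
import json

import zstandard as zstd

# Stream-decompress the .jsonl.zst file and parse one JSON record per line.
with open("pile_cc_filtered_deduped.jsonl.zst", "rb") as fh:
    reader = zstd.ZstdDecompressor().stream_reader(fh)
    for line in io.TextIOWrapper(reader, encoding="utf-8"):
        record = json.loads(line)
        # e.g. record["text"] -- the exact field names depend on the release
```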

Thanks!

Additional question: it seems like the_pile/pile.py only downloads and interleaves the data from the various sources. processing_scripts contains many processing scripts; however, how do we know which script is supposed to be run on which data source, and how are those scripts supposed to be run?
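To make sure I'm reading pile.py correctly: my mental model of the interleaving step is roughly the toy sketch below (this is not the repo's actual code; the source names and weights are hypothetical):

```python
import random

def interleave(datasets, weights, seed=0):
    """Yield (source, document) pairs, sampling each source by weight
    until every iterator is exhausted."""
    rng = random.Random(seed)
    iters = {name: iter(ds) for name, ds in datasets.items()}
    names = list(iters)
    wts = [weights[n] for n in names]
    while iters:
        name = rng.choices(names, weights=wts, k=1)[0]
        try:
            yield name, next(iters[name])
        except StopIteration:
            # Drop an exhausted source and keep sampling from the rest.
            i = names.index(name)
            del names[i], wts[i]
            del iters[name]

# Hypothetical usage:
# for source, doc in interleave({"cc": cc_docs, "pubmed": pm_docs},
#                               {"cc": 0.7, "pubmed": 0.3}):
#     ...
```

Is that roughly what pile.py does, and is the per-source processing (the scripts in processing_scripts) expected to have already been run before this step?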