EleutherAI / the-pile

Scripts to dedup and filter Common Crawl?

shangw-nvidia opened this issue · comments

Hi,

I noticed that the download URL for `CommonCrawlDataset` is http://eaidata.bmk.sh/data/pile_cc_filtered_deduped.jsonl.zst. In other words, is this CC dataset already deduplicated and filtered? However, https://github.com/leogao2/commoncrawl_downloader, which is linked in the README, doesn't seem to include the deduplication and filtering scripts. Where can I find out exactly how deduplication and filtering for Pile CC were done?
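For context, I'm currently streaming records out of that file like this (a minimal sketch; it assumes the third-party `zstandard` package, and the record schema is my guess rather than something documented in this repo):

```python
import io
import json

import zstandard as zstd

# Stream-decompress the .jsonl.zst file and parse one JSON record per line.
with open("pile_cc_filtered_deduped.jsonl.zst", "rb") as fh:
    reader = zstd.ZstdDecompressor().stream_reader(fh)
    for line in io.TextIOWrapper(reader, encoding="utf-8"):
        record = json.loads(line)
        # e.g. record["text"] -- the exact field names depend on the release
```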

Thanks!

Additional question: it seems like the_pile/pile.py only downloads and interleaves the data from the various sources. processing_scripts contains many processing scripts; however, how do we know which script is supposed to be run on which data source, and how are those scripts supposed to be run?
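To make sure I'm reading pile.py correctly: my mental model of the interleaving step is roughly the toy sketch below (this is not the repo's actual code; the source names and weights are hypothetical):

```python
import random

def interleave(datasets, weights, seed=0):
    """Yield (source, document) pairs, sampling each source by weight
    until every iterator is exhausted."""
    rng = random.Random(seed)
    iters = {name: iter(ds) for name, ds in datasets.items()}
    names = list(iters)
    wts = [weights[n] for n in names]
    while iters:
        name = rng.choices(names, weights=wts, k=1)[0]
        try:
            yield name, next(iters[name])
        except StopIteration:
            # Drop an exhausted source and keep sampling from the rest.
            i = names.index(name)
            del names[i], wts[i]
            del iters[name]

# Hypothetical usage:
# for source, doc in interleave({"cc": cc_docs, "pubmed": pm_docs},
#                               {"cc": 0.7, "pubmed": 0.3}):
#     ...
```

Is that roughly what pile.py does, and is the per-source processing (the scripts in processing_scripts) expected to have already been run before this step?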