EleutherAI / pile_dedupe

Pile Deduplication Code

Pile Dedupe

Prerequisites

Download the Pile distribution. The relevant files are in the train directory.

Install

git clone https://github.com/EleutherAI/pile_dedupe.git
cd pile_dedupe
pip install -r requirements.txt
ln -s PILE_LOCATION pile

Usage

| Step | Overview | Details |
| --- | --- | --- |
| 1 | Prerequisites | Download the Pile. |
| 2 | Install | Clone the repo, install requirements, symlink to the location of the downloaded train directory. |
| 3 | Generate Minhashes | `python generate_minhashes.py --process_count PROCESS_COUNT` Recommend one process per logical core. |
| 4 | Verify Minhashes (Optional) | `python working_with_minhashes.py` |
| 5 | Dedupe Pile | `python dedupe.py --lsh_threshold LSH_THRESHOLD` It's fairly safe to leave lsh_threshold at its default (0.5) if you don't mind a bit of extra dedupe. |
| 6 | Inspect duplicates | `python working_with_duplicates.py --inspect_duplicates` |
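
Steps 3 and 5 build on MinHash signatures and a similarity threshold. The sketch below is not the code in this repository; it is a minimal, standard-library illustration of the general technique, included to show what a threshold such as 0.5 is compared against. The function names, the 5-word shingle size, and the 128-value signature length are all assumptions chosen for the example.

```python
import hashlib

NUM_PERM = 128        # signature length (illustrative choice, not taken from the repo)
LSH_THRESHOLD = 0.5   # same value as the default mentioned in step 5


def shingles(text, n=5):
    """Return the set of word n-grams (shingles) in the text."""
    words = text.split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _seeded_hash(seed, shingle):
    """Hash a shingle to a 64-bit integer; each seed acts as a different hash function."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(text, num_perm=NUM_PERM):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    shingle_set = shingles(text)
    return [min(_seeded_hash(seed, s) for s in shingle_set) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the old river bank"
similarity = estimated_jaccard(minhash_signature(doc_a), minhash_signature(doc_b))
print(f"estimated similarity: {similarity:.2f}")
print("treated as duplicates:", similarity >= LSH_THRESHOLD)
```

An LSH index avoids comparing every pair of documents by grouping signatures into bands and only checking documents that collide in at least one band. Under this reading, a lower threshold marks more pairs as duplicates, which matches the note in step 5 about "a bit of extra dedupe".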

I'm Done - Give Me A Generator

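from yield_deduped_pile import yield_deduped_pile

pile_directory = "pile"
duplicates_directory = "pile_duplicates"
# The call returns a generator, so nothing is read until you iterate over it.
for item in yield_deduped_pile(pile_directory, duplicates_directory):
    ...  # process each deduplicated item here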
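
For readers unfamiliar with the pattern, the following is a hypothetical sketch of how a deduplicating generator can work and how to preview its output lazily. It does not reproduce yield_deduped_pile or the on-disk format of pile_duplicates, neither of which is described here; `yield_unique` and the index-based duplicate set are assumptions made for illustration only.

```python
from itertools import islice


def yield_unique(documents, duplicate_indices):
    """Yield each document whose position is not listed in duplicate_indices.

    Hypothetical helper: the real duplicate bookkeeping may look different.
    """
    for index, document in enumerate(documents):
        if index not in duplicate_indices:
            yield document


docs = ["first", "second", "copy of first", "third"]
duplicates = {2}  # positions flagged as duplicates by an earlier pass

# islice pulls only the first two items, so the rest of the stream is never read.
preview = list(islice(yield_unique(docs, duplicates), 2))
print(preview)
```

Because a generator produces items one at a time, the same pattern scales to datasets that do not fit in memory.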

Further Documentation

Each source file begins with a description of what it does.

License: MIT License

