togethercomputer/RedPajama-Data Issues
Exact dedup details (Updated)
regarding to deduplication (Updated 5)
Other language data (Updated 4)
Thresholds for all quality signals (Updated 2)
what does the prefix "rps_" mean? (Closed 2)
Spanish artifact building error (Updated 2)
About the final result (Updated 2)
Unavailable Parameters (Updated)
what's the specific meaning of dsir? (Updated 4)
possibly missing shard from host (Closed 2)
Are shards randomly created? (Closed 1)
Low Data Downloading Speed (Closed 1)
Train a new wikiref model (Closed 1)
Token counts (Updated 2)
regarding to quality classifier (Updated 2)
How is the SHA1 digest computed? (Closed 2)
Invalid argument when running cc_net (Updated 2)
Executing V2 issues (Updated 6)
Issue on book datasets download (Updated 2)
Failed building wheel for cc-net (Closed 2)
ArXiv cleaning issue (Closed 1)
cc_net processing local wet file (Closed 1)
New Features (Updated)
Specifying arxiv dates (Updated 1)
Understanding the quality filter (Updated 5)
Fine tuning RedPajama Model (Updated 1)