huggingface/datatrove Issues
Support for Llama 3 Tokenizer (Closed, 2)
Support for Batch Processing (Closed, 3)
Standard paradigm for grouping data (Updated, 6)
Exact deduplication (Updated, 3)
Fastwarc reader (Updated, 1)
Small issue on typeshelper (Closed, 1)
URL dedup of two datasets (Closed, 1)
Memory overhead in multiprocessing (Updated, 2)
Log progress (Closed)
Periodical logging of stats (Updated, 2)
OpensearchWriter (Updated, 2)
PineconeWriter (Updated)
[Feature] Packing (Closed, 1)
Local fasttext model (Closed, 3)
Bug in url filter (Closed, 12)
Unreadable log (Closed, 1)
Need help for url filter (Closed, 2)
Potential issues in substring dedup (Updated, 10)
LocalExecutor Speedup (Closed, 2)