google-research/deduplicate-text-datasets Issues

called `Result::unwrap()` on an `Err` value: Os { code: 2, kind: NotFound, message: "No such file or directory" }
Updated 14 days ago, 1 comment
[Question] An error with the same repo guideline
Updated 2 months ago, 5 comments
Distributed running
Updated 2 months ago
Does finish_dedup_wiki40b.py have a bug?
Updated 2 months ago
Can this tool process Chinese?
Updated 2 months ago, 1 comment
[Bug] Out of range error when counting occurrences on a custom suffix array
Closed 2 months ago, 1 comment
Accessing the duplicates and their counts
Closed 2 years ago, 13 comments
Question: Upper Bound
Closed 3 months ago, 1 comment
Count_occurrence does not work with tokenizer?
Closed 3 months ago, 2 comments
question about wstring_equal function
Closed 3 months ago, 2 comments
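Could you provide a pure Python version? I believe many researchers do not have permission to install gcc on their servers.
Closed 4 months ago, 1 comment
When I use the tokenizer, I obtain many patterns that span across the data, which is quite strange.
Updated 4 months ago, 4 comments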
customized dataset deduplication
Closed 4 months ago, 1 comment
Where is the data?
Closed 4 months ago, 1 comment
Adjust TensorFlow version to fix cuDNN, cuFFT, cuBLAS errors.
Closed 5 months ago, 4 comments
What is the input format for a custom dataset?
Closed 5 months ago, 1 comment
How to deduplicate Hugging Face datasets
Closed 5 months ago, 7 comments
cargo build error. Could you upload the Cargo.lock file?
Updated 5 months ago, 2 comments
Incomplete Sentences
Closed 5 months ago, 1 comment
remove_ex in finish_dedup_wiki40b
Closed 5 months ago, 1 comment
How to restore the result data after deduplication (remove invisible characters)
Closed 5 months ago, 1 comment
Fix to issue #17 limits cmd_merge to be single-threaded
Closed 5 months ago, 3 comments
Error when running the code
Closed 5 months ago, 15 comments
Retain one instance per duplicate
Closed 9 months ago, 2 comments
RAM crash when using the collect method
Updated 8 months ago, 2 comments
Implementation of NearDup (approximate match)
Closed a year ago, 1 comment
[Paper Question] Why use w-shingles over k-shingles?
Closed a year ago, 1 comment
Simple test
Closed a year ago
Off-by-1 error in `collect`?
Updated 2 years ago
question about deduplication cluster size
Closed 2 years ago, 2 comments
one bug when I use
Closed 2 years ago, 2 comments
Should newline char be removed
Closed 2 years ago, 1 comment
Unexpected behavior with ending symbols
Closed 2 years ago, 2 comments
"failed to fill whole buffer" errors
Closed 2 years ago, 2 comments
Error with table size not being divisible by text size
Closed 2 years ago, 7 comments
Can the tool run on plain text files?
Closed 2 years ago, 20 comments
false positives
Closed 2 years ago, 1 comment
How to dedup between two datasets?
Closed 3 years ago, 7 comments
How to dedup substrings in one dataset?
Closed 3 years ago, 9 comments
Error on self deduplication
Closed 3 years ago, 10 comments
Why not use Simhash?
Closed 3 years ago, 3 comments