Stargazers: 1063
Watchers: 13
Issues: 41
Forks: 105
google-research/deduplicate-text-datasets Issues
called `Result::unwrap()` on an `Err` value: Os { code: 2, kind: NotFound, message: "No such file or directory" }
Updated
14 days ago
Comments count
1
[Question] An error with the same repo guideline
Updated
2 months ago
Comments count
5
Distributed running
Updated
2 months ago
Is something wrong with finish_dedup_wiki40b.py?
Updated
2 months ago
Can this tool process Chinese?
Updated
2 months ago
Comments count
1
[Bug] Out of range error when counting occurrences on a custom suffix array
Closed
2 months ago
Comments count
1
Accessing the duplicates and their counts
Closed
2 years ago
Comments count
13
Question: Upper Bound
Closed
3 months ago
Comments count
1
Count_occurrence does not work with tokenizer?
Closed
3 months ago
Comments count
2
question about wstring_equal function
Closed
3 months ago
Comments count
2
Could you provide a pure Python version? I believe many researchers do not have permission to install gcc on their servers.
Closed
4 months ago
Comments count
1
When I use the tokenizer, I obtain many patterns that span across the data, which is quite strange.
Updated
4 months ago
Comments count
4
customized dataset deduplication
Closed
4 months ago
Comments count
1
Where is the data?
Closed
4 months ago
Comments count
1
Adjust TensorFlow version to fix cuDNN, cuFFT, cuBLAS errors.
Closed
5 months ago
Comments count
4
What is the input dataset format for a custom dataset?
Closed
5 months ago
Comments count
1
How to deduplicate Hugging Face datasets
Closed
5 months ago
Comments count
7
cargo build error. Could you upload the Cargo.lock file?
Updated
5 months ago
Comments count
2
Incomplete Sentences
Closed
5 months ago
Comments count
1
remove_ex in finish_dedup_wiki40b
Closed
5 months ago
Comments count
1
How to restore the result data after deduplication (remove invisible characters)
Closed
5 months ago
Comments count
1
Fix to issue #17 limits cmd_merge to be single-threaded
Closed
5 months ago
Comments count
3
Error when running the code
Closed
5 months ago
Comments count
15
Retain one instance per duplicate
Closed
9 months ago
Comments count
2
RAM crash when using the collect method
Updated
8 months ago
Comments count
2
Implementation of NearDup (approximate match)
Closed
a year ago
Comments count
1
[Paper Question] Why use w-shingles over k-shingles?
Closed
a year ago
Comments count
1
Simple test
Closed
a year ago
Off-by-1 error in `collect`?
Updated
2 years ago
question about deduplication cluster size
Closed
2 years ago
Comments count
2
one bug when I use
Closed
2 years ago
Comments count
2
Should newline char be removed
Closed
2 years ago
Comments count
1
Unexpected behavior with ending symbols
Closed
2 years ago
Comments count
2
"failed to fill whole buffer" errors
Closed
2 years ago
Comments count
2
Error with table size not being divisible by text size
Closed
2 years ago
Comments count
7
Can the tool run on plain text files?
Closed
2 years ago
Comments count
20
false positives
Closed
2 years ago
Comments count
1
How to dedup between two datasets?
Closed
3 years ago
Comments count
7
How to dedup substring in one dataset?
Closed
3 years ago
Comments count
9
Error on self deduplication
Closed
3 years ago
Comments count
10
Why not use Simhash?
Closed
3 years ago
Comments count
3