EleutherAI/dps
Data processing system for polyglot
Stargazers: 83
Watchers: 6
Issues: 31
Forks: 23
EleutherAI/dps Issues
Bug in the function `remove_repeated_text`
Closed
8 months ago
[ja] `.filter` is used instead of `.map` for non-filter methods
Updated
9 months ago
Comments count
1
Chinese dedup memory error
Updated
9 months ago
Comments count
1
[ja] replace Japanese PII
Updated
10 months ago
[ja] reduce emoticon
Closed
10 months ago
Comments count
1
[ja] spam word filter
Updated
10 months ago
[ja] refactor MinHashLSH-based near deduplication method
Updated
10 months ago
Japanese pre-processing - remove text with low rate of Japanese stopwords
Closed
a year ago
Comments count
4
Refactor RDD process to Dataframe process
Updated
a year ago
Need to ignore null or empty text during Korean text processing
Updated
a year ago
Improve Korean preprocessing algorithm
Closed
a year ago
Add pre-processing for Japanese texts
Closed
a year ago
Replace html2text from Beautifulsoup
Closed
a year ago
Comments count
1
Task consideration
Closed
a year ago
Comments count
3
Implement minhash dedup module
Closed
a year ago
Add huggingface tokenizers for data length statistics
Closed
a year ago
Add job to separate train and validate data
Closed
a year ago
Add statistics by data category
Closed
a year ago
Add Toxic text labeler
Closed
a year ago
Add Text length Stats for datasets
Closed
a year ago
MassiveText Quality Filtering
Closed
a year ago
Comments count
3
Add function for processing empty string
Closed
a year ago
Update additional preprocess function
Closed
a year ago
Comments count
1
Remove `soynlp` library
Closed
a year ago
Add normalize `?,:"!` in common preprocess job
Closed
a year ago
Add general text refinement job
Closed
2 years ago
Comments count
2
Add scripts to run hadoop cluster
Updated
2 years ago
Add requirements-dev.txt
Closed
2 years ago
Add guides to run dps jobs
Closed
2 years ago
Add job to build newspaper dataset in long-text data form
Updated
2 years ago