Stargazers: 37
Watchers: 5
Issues: 20
Forks: 6
llm-jp/llm-jp-corpus Issues

Python 3.10 has some issues in downloading data from huggingface: use Python 3.9
Closed 7 months ago. Comments: 1

Introduce category-based filtering to Wikipedia
Closed 8 months ago.

Exclude the Book3 portion in the Pile dataset
Closed 9 months ago. Comments: 1

Quantitatively assess the quality of the filtering
Closed 9 months ago.

Use Hojichar for better filtering of Japanese mC4
Closed 9 months ago.

Apply ethical filtering to Japanese Wikipedia
Closed 10 months ago.

Create a validation split
Closed 10 months ago. Comments: 1

Add the `token_ids` field
Closed 10 months ago. Comments: 1

Apply ethical filtering to the Japanese mC4 dataset
Closed 10 months ago. Comments: 1

Apply filtering to the Stack dataset
Closed 10 months ago. Comments: 1

Improve Wikipedia text extraction
Closed 10 months ago. Comments: 1

Expired links to Wikipedia dumps
Closed 10 months ago. Comments: 1

Construct the corpus ver. 1
Closed a year ago. Comments: 1

Use the HF datasets library for tokenization
Closed a year ago. Comments: 1

Use Python 3.11
Closed a year ago.

Determine the license
Closed a year ago. Comments: 2

Use the tokenizer provided by the Fugaku project
Closed a year ago. Comments: 1

Improve dataset downloading for memory efficiency
Closed a year ago.

Japanese Wikipedia
Closed a year ago.

English Wikipedia
Closed a year ago.