Stargazers: 56 · Watchers: 3 · Issues: 26 · Forks: 7
sambanova/generative_data_prep Issues
- Apply Chat Template Error · Updated a month ago
- Support user assistant chat dataset format · Updated a month ago
- Empty Articles become Sequences of all Padding tokens · Closed a year ago · 1 comment
- Child processes killed silently, causes code to hang · Closed 9 months ago · 2 comments
- Progress Bar · Closed 6 months ago · 1 comment
- Tracking dataset metrics · Closed 9 months ago · 1 comment
- Greedy Truncate Right Bug · Closed 9 months ago · 1 comment
- tokenization time remaining not working · Closed 2 months ago · 1 comment
- Cannot launch generative_data_prep from different directory · Closed 8 months ago · 1 comment
- Add pytests for EOS and BOS tokens, and ensuring Llama tokenizer works properly OOB · Closed a month ago · 1 comment
- Improve the error handling for invalid jsonl line file input · Updated a month ago
- README examples default to --output_path of empty directory so that they do not fail out · Updated a month ago
- Specify how to pick the number of training splits · Closed a month ago · 1 comment
- Create FAQ section · Updated a month ago · 1 comment
- Prompt_prefix not interpreted correctly · Updated a month ago
- No Protections for runtime (wall clock time) or RAM usage · Updated a month ago · 1 comment
- Tokenization Is Not Optimal, should use batched encoding · Updated a month ago · 1 comment
- Documentation About Input Packing Config · Updated a month ago · 1 comment
- Creation of jsonl files · Closed a month ago · 3 comments
- apply chat_template · Closed a month ago · 1 comment
- Add pytests to ensure that tokenizer metrics are correct, including for larger datasets · Updated 5 months ago
- Token Metrics Incorrect For Large Datasets · Closed 5 months ago · 1 comment
- KeyError during balancing if fewer lines in input file than splits · Closed 6 months ago · 1 comment
- Bug when running python -m generative_data_prep data_prep · Closed 7 months ago
- Update padding tokens to use padding token from tokenizer if it exists · Closed 9 months ago · 1 comment