Stargazers: 56 · Watchers: 3 · Issues: 26 · Forks: 7
sambanova/generative_data_prep Issues
- Apply Chat Template Error · Updated a month ago
- Support user assistant chat dataset format · Updated a month ago
- Empty Articles become Sequences of all Padding tokens · Closed a year ago · 1 comment
- Child processes killed silently, causes code to hang · Closed 9 months ago · 2 comments
- Progress Bar · Closed 6 months ago · 1 comment
- Tracking dataset metrics · Closed 9 months ago · 1 comment
- Greedy Truncate Right Bug · Closed 9 months ago · 1 comment
- tokenization time remaining not working · Closed 2 months ago · 1 comment
- Cannot launch generative_data_prep from different directory · Closed 8 months ago · 1 comment
- Add pytests for EOS and BOS tokens, and ensuring Llama tokenizer works properly OOB · Closed a month ago · 1 comment
- Improve the error handling for invalid jsonl line file input · Updated a month ago
- README examples default to --output_path of empty directory so that they do not fail out · Updated a month ago
- Specify how to pick the number of training splits · Closed a month ago · 1 comment
- Create FAQ section · Updated a month ago · 1 comment
- Prompt_prefix not interpreted correctly · Updated a month ago
- No Protections for runtime (wall clock time) or RAM usage · Updated a month ago · 1 comment
- Tokenization Is Not Optimal, should use batched encoding · Updated a month ago · 1 comment
- Documentation About Input Packing Config · Updated a month ago · 1 comment
- Creation of jsonl files · Closed a month ago · 3 comments
- apply chat_template · Closed a month ago · 1 comment
- Add pytests to ensure that tokenizer metrics are correct, including for larger datasets · Updated 5 months ago
- Token Metrics Incorrect For Large Datasets · Closed 5 months ago · 1 comment
- KeyError during balancing if fewer lines in input file than splits · Closed 6 months ago · 1 comment
- Bug when running python -m generative_data_prep data_prep · Closed 7 months ago
- Update padding tokens to use padding token from tokenizer if it exists · Closed 9 months ago · 1 comment