How to load a dataset from the output of a tokenizer?
Jeronymous opened this issue · comments
I planned to use datatrove to apply my tokenizer so that the data is ready to use with nanotron.
I am using DocumentTokenizer[Merger], which produces *.ds and *ds.index binary files, but from what I understand, nanotron expects datasets (with "input_ids" keys).
I see that things like ParquetWriter cannot be piped after DocumentTokenizer.
Am I missing a piece?
Are there helpers to convert .ds files into parquet files (or something loadable with datasets) for a given context size?
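In case it helps, here is a rough sketch of how such a conversion could be done by hand. It assumes the .ds file is a flat binary stream of little-endian uint16 token ids (which I believe is datatrove's default for vocabularies smaller than 2**16; check your tokenizer's vocab size, since larger vocabs use uint32). The function name and record layout are illustrative, not part of either library:

```python
import numpy as np

def ds_to_records(ds_path, context_size, dtype=np.uint16):
    # Assumption: the .ds file is a flat stream of token ids with the
    # given dtype; memmap avoids loading the whole file into RAM.
    tokens = np.memmap(ds_path, dtype=dtype, mode="r")
    n_chunks = len(tokens) // context_size  # trailing remainder is dropped
    for i in range(n_chunks):
        chunk = tokens[i * context_size : (i + 1) * context_size]
        # Emit records in the {"input_ids": [...]} shape that a
        # datasets-style loader expects.
        yield {"input_ids": chunk.astype(np.int64).tolist()}
```

The resulting records could then be fed to `datasets.Dataset.from_generator` or written out with pyarrow as parquet. Note this ignores document boundaries from the .index file, so context windows may span documents.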
Hi, I pinged the nanotron team internally and they are working on moving support for datatrove's .ds files into the public repo :)