huggingface / datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.

How to load a dataset with the output of a tokenizer?

Jeronymous opened this issue

I plan to use datatrove to apply my tokenizer so that the data is ready to use with nanotron.
I am using DocumentTokenizer[Merger], which produces *.ds and *.ds.index binary files (a minimal sketch of the pipeline is below). However, from what I understand, nanotron expects datasets (with "input_ids" keys).
I also see that writers like ParquetWriter cannot be piped after DocumentTokenizer.
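
For reference, here is roughly what my pipeline looks like (a sketch only: the input folder and file paths are illustrative, and parameter names such as `tokenizer_name_or_path` may differ across datatrove versions):

```python
# Minimal sketch: tokenize a JSONL corpus into datatrove's .ds format.
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.tokens import DocumentTokenizer

executor = LocalPipelineExecutor(
    pipeline=[
        JsonlReader("data/raw"),  # hypothetical input folder of JSONL shards
        DocumentTokenizer(
            output_folder="data/tokenized",  # where *.ds / *.ds.index files land
            tokenizer_name_or_path="gpt2",   # any HF tokenizer name; an assumption here
        ),
    ],
    tasks=4,  # number of parallel tasks
)
executor.run()
```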

Am I missing a piece?
Are there any helpers to convert .ds files into parquet files (or something loadable with datasets) for a given context size?
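
In case it helps, here is roughly what I had in mind for the conversion (a sketch only: it assumes the .ds file is a flat little-endian stream of uint16 token ids, i.e. a tokenizer vocabulary under 65536 tokens, and the file paths are illustrative):

```python
# Sketch: turn a datatrove .ds token stream into a parquet file of
# fixed-length "input_ids" rows. Assumes uint16 token ids; adjust the
# dtype for larger vocabularies.
import numpy as np
from datasets import Dataset

context_size = 2048
tokens = np.fromfile("data/tokenized/00000.ds", dtype="<u2")  # hypothetical path

# Truncate to a multiple of context_size and reshape into fixed-length rows,
# dropping the trailing partial chunk.
n_rows = len(tokens) // context_size
chunks = tokens[: n_rows * context_size].reshape(n_rows, context_size)

ds = Dataset.from_dict({"input_ids": chunks.tolist()})
ds.to_parquet("data/tokenized/00000.parquet")
```

Note that this ignores document boundaries (which the *.ds.index file records), so fixed-size sequences may span documents.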

Hi, I pinged the nanotron team internally and they are working on moving support for datatrove's .ds files to the public repo :)