huggingface / datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.

How to load a dataset with the output of a tokenizer?

Jeronymous opened this issue

I plan to use datatrove to apply my tokenizer so that the data is ready to use with nanotron.
I am using DocumentTokenizer[Merger], which produces *.ds and *.ds.index binary files (a minimal sketch of the pipeline is below). However, from what I understand, nanotron expects datasets (with "input_ids" keys).
I also see that writers like ParquetWriter cannot be piped after DocumentTokenizer.
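
For reference, here is roughly what my pipeline looks like (a sketch only: the input folder and file paths are illustrative, and parameter names such as `tokenizer_name_or_path` may differ across datatrove versions):

```python
# Minimal sketch: tokenize a JSONL corpus into datatrove's .ds format.
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.tokens import DocumentTokenizer

executor = LocalPipelineExecutor(
    pipeline=[
        JsonlReader("data/raw"),  # hypothetical input folder of JSONL shards
        DocumentTokenizer(
            output_folder="data/tokenized",  # where *.ds / *.ds.index files land
            tokenizer_name_or_path="gpt2",   # any HF tokenizer name; an assumption here
        ),
    ],
    tasks=4,  # number of parallel tasks
)
executor.run()
```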

Am I missing a piece?
Are there any helpers to convert .ds files into parquet files (or something loadable with datasets) for a given context size?
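
In case it helps, here is roughly what I had in mind for the conversion (a sketch only: it assumes the .ds file is a flat little-endian stream of uint16 token ids, i.e. a tokenizer vocabulary under 65536 tokens, and the file paths are illustrative):

```python
# Sketch: turn a datatrove .ds token stream into a parquet file of
# fixed-length "input_ids" rows. Assumes uint16 token ids; adjust the
# dtype for larger vocabularies.
import numpy as np
from datasets import Dataset

context_size = 2048
tokens = np.fromfile("data/tokenized/00000.ds", dtype="<u2")  # hypothetical path

# Truncate to a multiple of context_size and reshape into fixed-length rows,
# dropping the trailing partial chunk.
n_rows = len(tokens) // context_size
chunks = tokens[: n_rows * context_size].reshape(n_rows, context_size)

ds = Dataset.from_dict({"input_ids": chunks.tolist()})
ds.to_parquet("data/tokenized/00000.parquet")
```

Note that this ignores document boundaries (which the *.ds.index file records), so fixed-size sequences may span documents.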

Hi, I pinged the nanotron team internally and they are working on moving support for datatrove's .ds files to the public repo :)