huggingface / datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.

Fastwarc reader

jordane95 opened this issue

Can we add a new WARC reader based on FastWARC?

It is reported to be much more efficient than warcio.
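
To make the request concrete, here is a minimal sketch of what reading a WARC file with FastWARC could look like. It is not datatrove's reader: the function names `to_document` and `iter_fastwarc_documents` are hypothetical, and the returned dict shape (`text`, `id`, `metadata`) is only an assumption about what a pipeline document might need. The FastWARC calls used are `ArchiveIterator` and `WarcRecordType` from `fastwarc.warc`.

```python
from typing import Iterator, Optional


def to_document(url: Optional[str], record_id: Optional[str],
                date: Optional[str], payload: bytes) -> dict:
    """Turn raw WARC record fields into a plain dict (hypothetical document shape)."""
    return {
        # Undecodable bytes are replaced rather than raising, since crawl data is messy.
        "text": payload.decode("utf-8", errors="replace"),
        "id": record_id,
        "metadata": {"url": url, "date": date},
    }


def iter_fastwarc_documents(path: str) -> Iterator[dict]:
    """Yield one document per HTTP response record in a WARC file (hypothetical helper)."""
    # Imported lazily so that FastWARC stays an optional dependency.
    from fastwarc.warc import ArchiveIterator, WarcRecordType

    with open(path, "rb") as stream:
        # Only response records are yielded; other record types are skipped.
        for record in ArchiveIterator(stream, record_types=WarcRecordType.response):
            yield to_document(
                url=record.headers.get("WARC-Target-URI"),
                record_id=record.record_id,
                date=record.headers.get("WARC-Date"),
                payload=record.reader.read(),
            )
```

With FastWARC installed, `for doc in iter_fastwarc_documents("example.warc.gz"): ...` would iterate over the response records of a local file; the path here is a placeholder. Keeping the record-to-document conversion in its own function means the same conversion could sit behind either a warcio-based or a FastWARC-based reader.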