huggingface / datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.

Spark support

jordane95 opened this issue

I'm wondering whether it would be possible to add support for other popular large-scale data processing frameworks such as Spark, since most operations are compatible with Spark's map operation. This would greatly improve the efficiency and scalability of the processing pipeline when working with large datasets.
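To make the idea concrete, here is a rough sketch of how per-document steps could be chained and handed to Spark one partition at a time. It does not use datatrove's actual API: the dict-based document, the two steps, `PIPELINE`, `run_partition`, and `run_on_spark` are all hypothetical names for illustration. The only real library calls are PySpark's `SparkContext.getOrCreate`, `parallelize`, `mapPartitions`, and `collect`.

```python
from typing import Callable, Iterable, Iterator, List

# Hypothetical document type: a plain dict with a "text" field.
# This is an assumption for illustration, not datatrove's Document class.
Doc = dict
Step = Callable[[Iterable[Doc]], Iterator[Doc]]


def normalize_whitespace(docs: Iterable[Doc]) -> Iterator[Doc]:
    """Map-style step: collapse runs of whitespace in each document."""
    for doc in docs:
        yield {**doc, "text": " ".join(doc["text"].split())}


def min_words_filter(docs: Iterable[Doc], min_words: int = 3) -> Iterator[Doc]:
    """Filter-style step: drop documents with fewer than min_words words."""
    for doc in docs:
        if len(doc["text"].split()) >= min_words:
            yield doc


# Hypothetical pipeline: an ordered list of steps, each consuming and
# producing a stream of documents.
PIPELINE: List[Step] = [normalize_whitespace, min_words_filter]


def run_partition(docs: Iterable[Doc]) -> Iterator[Doc]:
    """Chain every step lazily over one partition's documents."""
    stream: Iterator[Doc] = iter(docs)
    for step in PIPELINE:
        stream = step(stream)
    return stream


def run_on_spark(docs: List[Doc], num_partitions: int = 2) -> List[Doc]:
    """Run the same pipeline on Spark, one partition at a time."""
    # Imported lazily so the steps above can be tested without PySpark.
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    rdd = sc.parallelize(docs, num_partitions)
    # mapPartitions passes each partition's iterator to run_partition.
    return rdd.mapPartitions(run_partition).collect()
```

Because each step only needs an iterator of documents, `run_partition` can be checked locally on a plain list before involving a cluster. Steps that need cross-partition state, such as deduplication, would not fit this pattern as-is.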

Is there any update on this? @guipenedo