huggingface / datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.

Don't use np.tofile/np.fromfile when interacting with fs

hynky1999 opened this issue

Problem

NumPy requires the file object to implement fileno when using np.tofile/np.fromfile; however, s3fs doesn't implement fileno in its implementation of AbstractFileSystem.
Since we use np.tofile in sentence deduplication, writing signatures to S3 raises an error:

io.UnsupportedOperation: fileno
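
The failure can probably be reproduced without S3 at all. The sketch below is an assumption-laden stand-in, not the datatrove code: it uses io.BytesIO in place of an s3fs file object, since both lack a real OS file descriptor, and the helper name tofile_fails_without_fileno is hypothetical.

```python
import io

import numpy as np


def tofile_fails_without_fileno() -> bool:
    """Return True if ndarray.tofile rejects a file object that has no fileno."""
    arr = np.arange(4, dtype=np.uint64)
    # io.BytesIO is only a stand-in for a remote file object (assumption):
    # like the s3fs case described above, it has no OS-level descriptor.
    buf = io.BytesIO()
    try:
        arr.tofile(buf)
    except io.UnsupportedOperation:
        return True
    return False


failed = tofile_fails_without_fileno()
```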

Fix

Use struct.pack instead of the NumPy implementation.

struct is at least one order of magnitude slower. A simpler alternative is to read the file data directly and pass it to np.frombuffer.
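
For what it's worth, the two approaches should produce the same bytes when the byte order and item size match, so choosing between them does not change the on-disk format. A minimal check, assuming little-endian unsigned 64-bit values (the values and dtype are illustrative, not taken from datatrove):

```python
import struct

import numpy as np

values = [1, 2, 3, 2**40]

# "<4Q" = little-endian, four unsigned 64-bit integers.
packed_with_struct = struct.pack("<4Q", *values)

# "<u8" = little-endian unsigned 64-bit, the matching NumPy dtype.
packed_with_numpy = np.array(values, dtype="<u8").tobytes()

same_bytes = packed_with_struct == packed_with_numpy
```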

Ahhh, I wasn't aware of the speed implications.
Let's go with ndarray.tobytes / np.frombuffer then.
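
A minimal sketch of that approach, assuming the signatures are a flat array of a known dtype. The helper names write_array and read_array are hypothetical, and io.BytesIO again stands in for a file object opened through fsspec/s3fs:

```python
import io

import numpy as np


def write_array(arr: np.ndarray, f) -> None:
    """Write raw array bytes using only f.write, so no fileno is needed."""
    f.write(arr.tobytes())


def read_array(f, dtype) -> np.ndarray:
    """Read all bytes using only f.read and reinterpret them as dtype."""
    return np.frombuffer(f.read(), dtype=dtype)


# Stand-in for a remote file object (assumption): any object with
# write/read methods should work the same way.
buf = io.BytesIO()
original = np.array([1, 2, 3, 2**40], dtype=np.uint64)
write_array(original, buf)

buf.seek(0)
restored = read_array(buf, np.uint64)
```

Note that np.frombuffer returns a read-only view when given a bytes object, so call .copy() on the result if it needs to be modified. The dtype and shape are not stored in the bytes and have to be known by the reader.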