huggingface / datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.

Error when running minhash

jordane95 opened this issue

Have you ever seen this bug when running minhash? What might be the cause?

"""
Traceback (most recent call last):
  File "/opt/conda/envs/datatrove/lib/python3.10/site-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/output/datatrove/src/datatrove/executor/base.py", line 77, in _run_for_rank
  File "/output/datatrove/src/datatrove/pipeline/base.py", line 122, in __call__
  File "/output/datatrove/src/datatrove/pipeline/dedup/minhash.py", line 378, in run
  File "/output/datatrove/src/datatrove/pipeline/dedup/minhash.py", line 139, in read_sigs
AssertionError: Hash order error. f.tell()=13504008, min_hash=167858917, sigdata=(62530469, 17634173, 42943397, 32616677, 9946320, 97645252, 33852496, 40487027, 13335797, 46577224, 99341049, 65232832, 98314078), last=(168841106, 1007538214, 22584412, 260559064, 494471935, 336374100, 632602342, 773108968, 87337671, 1064337302, 10811556, 112410251, 404805684)
The above exception was the direct cause of the following exception:
  File "/output/datatrove/examples/minhash_deduplication_mp.py", line 157, in main
  File "/output/datatrove/src/datatrove/executor/local.py", line 112, in run
    stats = list(
    raise value
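
For anyone trying to read the assertion message: below is a minimal sketch of the kind of check that could produce it. It assumes that signatures are expected to arrive in ascending lexicographic order and that the first hash of each signature must not fall below min_hash. The helper order_violations is hypothetical and is not datatrove's code; only the three values are taken from the error above.

```python
from typing import List, Optional, Tuple

Sig = Tuple[int, ...]


def order_violations(sigdata: Sig, last: Optional[Sig], min_hash: int) -> List[str]:
    """Hypothetical helper (not part of datatrove): list the ordering rules a signature breaks."""
    problems = []
    # Assumed rule 1: the first hash must be at least the requested lower bound.
    if sigdata[0] < min_hash:
        problems.append("first hash below min_hash")
    # Assumed rule 2: signatures are sorted, so each one compares >= the previous one.
    # Python compares tuples lexicographically, element by element.
    if last is not None and sigdata < last:
        problems.append("signature smaller than previous one")
    return problems


# Values copied from the AssertionError message in this issue.
min_hash = 167858917
sigdata = (62530469, 17634173, 42943397, 32616677, 9946320, 97645252, 33852496,
           40487027, 13335797, 46577224, 99341049, 65232832, 98314078)
last = (168841106, 1007538214, 22584412, 260559064, 494471935, 336374100, 632602342,
        773108968, 87337671, 1064337302, 10811556, 112410251, 404805684)

violations = order_violations(sigdata, last, min_hash)
print(violations)
```

Under these assumptions both conditions fail for the reported values: the first hash of sigdata is smaller than min_hash, and sigdata compares lower than last. The traceback alone does not say why the file contents are out of order.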