huggingface / datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.

Error when running exact_substrings

jordane95 opened this issue

I followed the instructions in the code and used the script in this repo to build the suffix array and generate the byte ranges. However, I get the following error when running step 3.

(/home/user/env/datatrove) dev-dialogue-gpu-8k# python exact_substrings_test.py 
2024-02-01 11:50:52.260 | INFO     | datatrove.utils.logging:add_task_logger:24 - Launching pipeline for rank=0
2024-02-01 11:50:52.261 | INFO     | datatrove.utils.logging:log_pipeline:37 - 
--- 🛠️ PIPELINE 🛠
🫂 - DEDUP: 🪞 - exact-substrings stage 3
2024-02-01 11:50:52.262 | INFO     | datatrove.pipeline.dedup.exact_substrings:get_sequence_bytes_offset:182 - self.rank=0, -> self.sequence_bytes_offset[self.rank]=0
2024-02-01 11:50:52.387 | INFO     | datatrove.pipeline.readers.base:read_files_shard:95 - Reading input file part-00000-sample.jsonl
part-00000-sample.jsonl
2024-02-01 11:50:52.385 | ERROR    | datatrove.executor.base:_run_for_rank:74 - One or more duplicate ranges have not been used
Traceback (most recent call last):

  File "<string>", line 1, in <module>
  File "/home/user/env/datatrove/lib/python3.10/site-packages/multiprocess/forkserver.py", line 273, in main
    code = _serve_one(child_r, fds,
           │          │        └ [14, 15, 16, 19, 20, 21]
           │          └ 9
           └ <function _serve_one at 0x7f7cdf776e60>
  File "/home/user/env/datatrove/lib/python3.10/site-packages/multiprocess/forkserver.py", line 312, in _serve_one
    code = spawn._main(child_r, parent_sentinel)
           │     │     │        └ 5
           │     │     └ 9
           │     └ <function _main at 0x7f7cdf776170>
           └ <module 'multiprocess.spawn' from '/home/user/env/datatrove/lib/python3.10/site-packages/multiprocess/...
  File "/home/user/env/datatrove/lib/python3.10/site-packages/multiprocess/spawn.py", line 129, in _main
    return self._bootstrap(parent_sentinel)
           │    │          └ 5
           │    └ <function BaseProcess._bootstrap at 0x7f7cdfa89cf0>
           └ <ForkServerProcess name='ForkServerPoolWorker-5' parent=1625393 started daemon>
  File "/home/user/env/datatrove/lib/python3.10/site-packages/multiprocess/process.py", line 314, in _bootstrap
    self.run()
    │    └ <function BaseProcess.run at 0x7f7cdfa89360>
    └ <ForkServerProcess name='ForkServerPoolWorker-5' parent=1625393 started daemon>
  File "/home/user/env/datatrove/lib/python3.10/site-packages/multiprocess/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
    │    │        │    │        │    └ {}
    │    │        │    │        └ <ForkServerProcess name='ForkServerPoolWorker-5' parent=1625393 started daemon>
    │    │        │    └ (<multiprocess.queues.SimpleQueue object at 0x7f7cdf4ec1f0>, <multiprocess.queues.SimpleQueue object at 0x7f7c3cd22170>, None...
    │    │        └ <ForkServerProcess name='ForkServerPoolWorker-5' parent=1625393 started daemon>
    │    └ <function worker at 0x7f7c3cd1b490>
    └ <ForkServerProcess name='ForkServerPoolWorker-5' parent=1625393 started daemon>
  File "/home/user/env/datatrove/lib/python3.10/site-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
                    │     │       └ {}
                    │     └ (1,)
                    └ functools.partial(<bound method LocalPipelineExecutor._launch_run_for_rank of <datatrove.executor.local.LocalPipelineExecutor...

  File "/home/user/code/datatrove/src/datatrove/executor/local.py", line 46, in _launch_run_for_rank
    return self._run_for_rank(rank, local_rank)
           │    │             │     └ 1
           │    │             └ 1
           │    └ <function PipelineExecutor._run_for_rank at 0x7f7cde9b84c0>
           └ <datatrove.executor.local.LocalPipelineExecutor object at 0x7f7c3cd22350>

> File "/home/user/code/datatrove/src/datatrove/executor/base.py", line 62, in _run_for_rank
    deque(pipelined_data, maxlen=0)
    │     └ <generator object DedupReader.run at 0x7f7c3cd6c970>
    └ <class 'collections.deque'>

  File "/home/user/code/datatrove/src/datatrove/pipeline/dedup/exact_substrings.py", line 344, in run
    assert self.exhausted_ranges, "One or more duplicate ranges have not been used"
           │    └ False
           └ 🫂 - DEDUP: 🪞 - exact-substrings stage 3

AssertionError: One or more duplicate ranges have not been used
2024-02-01 11:50:52.441 | ERROR    | datatrove.executor.base:_run_for_rank:74 - One or more duplicate ranges have not been used
Traceback (most recent call last):

  File "<string>", line 1, in <module>
  File "/home/user/env/datatrove/lib/python3.10/site-packages/multiprocess/forkserver.py", line 273, in main
    code = _serve_one(child_r, fds,
           │          │        └ [13, 14, 15, 16, 19, 20]
           │          └ 9
           └ <function _serve_one at 0x7f7cdf776e60>
  File "/home/user/env/datatrove/lib/python3.10/site-packages/multiprocess/forkserver.py", line 312, in _serve_one
    code = spawn._main(child_r, parent_sentinel)
           │     │     │        └ 5
           │     │     └ 9
           │     └ <function _main at 0x7f7cdf776170>
           └ <module 'multiprocess.spawn' from '/home/user/env/datatrove/lib/python3.10/site-packages/multiprocess/...
  File "/home/user/env/datatrove/lib/python3.10/site-packages/multiprocess/spawn.py", line 129, in _main
    return self._bootstrap(parent_sentinel)
           │    │          └ 5
           │    └ <function BaseProcess._bootstrap at 0x7f7cdfa89cf0>
           └ <ForkServerProcess name='ForkServerPoolWorker-4' parent=1625393 started daemon>
  File "/home/user/env/datatrove/lib/python3.10/site-packages/multiprocess/process.py", line 314, in _bootstrap
    self.run()
    │    └ <function BaseProcess.run at 0x7f7cdfa89360>
    └ <ForkServerProcess name='ForkServerPoolWorker-4' parent=1625393 started daemon>
  File "/home/user/env/datatrove/lib/python3.10/site-packages/multiprocess/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
    │    │        │    │        │    └ {}
    │    │        │    │        └ <ForkServerProcess name='ForkServerPoolWorker-4' parent=1625393 started daemon>
    │    │        │    └ (<multiprocess.queues.SimpleQueue object at 0x7f7cdf4ec1f0>, <multiprocess.queues.SimpleQueue object at 0x7f7c3cd22170>, None...
    │    │        └ <ForkServerProcess name='ForkServerPoolWorker-4' parent=1625393 started daemon>
    │    └ <function worker at 0x7f7c3cd1b490>
    └ <ForkServerProcess name='ForkServerPoolWorker-4' parent=1625393 started daemon>
  File "/home/user/env/datatrove/lib/python3.10/site-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
                    │     │       └ {}
                    │     └ (2,)
                    └ functools.partial(<bound method LocalPipelineExecutor._launch_run_for_rank of <datatrove.executor.local.LocalPipelineExecutor...

  File "/home/user/code/datatrove/src/datatrove/executor/local.py", line 46, in _launch_run_for_rank
    return self._run_for_rank(rank, local_rank)
           │    │             │     └ 2
           │    │             └ 2
           │    └ <function PipelineExecutor._run_for_rank at 0x7f7cde9b84c0>
           └ <datatrove.executor.local.LocalPipelineExecutor object at 0x7f7c3cd22350>

> File "/home/user/code/datatrove/src/datatrove/executor/base.py", line 62, in _run_for_rank
    deque(pipelined_data, maxlen=0)
    │     └ <generator object DedupReader.run at 0x7f7c3cd6c970>
    └ <class 'collections.deque'>

  File "/home/user/code/datatrove/src/datatrove/pipeline/dedup/exact_substrings.py", line 344, in run
    assert self.exhausted_ranges, "One or more duplicate ranges have not been used"
           │    └ False
           └ 🫂 - DEDUP: 🪞 - exact-substrings stage 3

AssertionError: One or more duplicate ranges have not been used
2024-02-01 11:50:52.464 | ERROR    | datatrove.executor.base:_run_for_rank:74 - One or more duplicate ranges have not been used
Traceback (most recent call last):

  File "<string>", line 1, in <module>
  File "/home/user/env/datatrove/lib/python3.10/site-packages/multiprocess/forkserver.py", line 273, in main
    code = _serve_one(child_r, fds,
           │          │        └ [12, 13, 14, 15, 16, 19]
           │          └ 9
           └ <function _serve_one at 0x7f7cdf776e60>
  File "/home/user/env/datatrove/lib/python3.10/site-packages/multiprocess/forkserver.py", line 312, in _serve_one
    code = spawn._main(child_r, parent_sentinel)
           │     │     │        └ 5
           │     │     └ 9
           │     └ <function _main at 0x7f7cdf776170>
           └ <module 'multiprocess.spawn' from '/home/user/env/datatrove/lib/python3.10/site-packages/multiprocess/...
  File "/home/user/env/datatrove/lib/python3.10/site-packages/multiprocess/spawn.py", line 129, in _main
    return self._bootstrap(parent_sentinel)
           │    │          └ 5
           │    └ <function BaseProcess._bootstrap at 0x7f7cdfa89cf0>
           └ <ForkServerProcess name='ForkServerPoolWorker-3' parent=1625393 started daemon>
  File "/home/user/env/datatrove/lib/python3.10/site-packages/multiprocess/process.py", line 314, in _bootstrap
    self.run()
    │    └ <function BaseProcess.run at 0x7f7cdfa89360>
    └ <ForkServerProcess name='ForkServerPoolWorker-3' parent=1625393 started daemon>
  File "/home/user/env/datatrove/lib/python3.10/site-packages/multiprocess/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
    │    │        │    │        │    └ {}
    │    │        │    │        └ <ForkServerProcess name='ForkServerPoolWorker-3' parent=1625393 started daemon>
    │    │        │    └ (<multiprocess.queues.SimpleQueue object at 0x7f7cdf4ec1f0>, <multiprocess.queues.SimpleQueue object at 0x7f7c3cd22170>, None...
    │    │        └ <ForkServerProcess name='ForkServerPoolWorker-3' parent=1625393 started daemon>
    │    └ <function worker at 0x7f7c3cd1b490>
    └ <ForkServerProcess name='ForkServerPoolWorker-3' parent=1625393 started daemon>
  File "/home/user/env/datatrove/lib/python3.10/site-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
                    │     │       └ {}
                    │     └ (3,)
                    └ functools.partial(<bound method LocalPipelineExecutor._launch_run_for_rank of <datatrove.executor.local.LocalPipelineExecutor...

  File "/home/user/code/datatrove/src/datatrove/executor/local.py", line 46, in _launch_run_for_rank
    return self._run_for_rank(rank, local_rank)
           │    │             │     └ 3
           │    │             └ 3
           │    └ <function PipelineExecutor._run_for_rank at 0x7f7cde9b84c0>
           └ <datatrove.executor.local.LocalPipelineExecutor object at 0x7f7c3cd22350>

> File "/home/user/code/datatrove/src/datatrove/executor/base.py", line 62, in _run_for_rank
    deque(pipelined_data, maxlen=0)
    │     └ <generator object DedupReader.run at 0x7f7c3cd6c970>
    └ <class 'collections.deque'>

  File "/home/user/code/datatrove/src/datatrove/pipeline/dedup/exact_substrings.py", line 344, in run
    assert self.exhausted_ranges, "One or more duplicate ranges have not been used"
           │    └ False
           └ 🫂 - DEDUP: 🪞 - exact-substrings stage 3

AssertionError: One or more duplicate ranges have not been used
2024-02-01 11:50:52.495 | INFO     | datatrove.executor.local:_launch_run_for_rank:51 - 1/4 tasks completed.
multiprocess.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/user/env/datatrove/lib/python3.10/site-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/user/code/datatrove/src/datatrove/executor/local.py", line 46, in _launch_run_for_rank
    return self._run_for_rank(rank, local_rank)
  File "/home/user/code/datatrove/src/datatrove/executor/base.py", line 75, in _run_for_rank
    raise e
  File "/home/user/code/datatrove/src/datatrove/executor/base.py", line 62, in _run_for_rank
    deque(pipelined_data, maxlen=0)
  File "/home/user/code/datatrove/src/datatrove/pipeline/dedup/exact_substrings.py", line 344, in run
    assert self.exhausted_ranges, "One or more duplicate ranges have not been used"
AssertionError: One or more duplicate ranges have not been used
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/user/code/datatrove/examples/exact_substrings_test.py", line 96, in <module>
    run_step_3()
  File "/home/user/code/datatrove/examples/exact_substrings_test.py", line 91, in run_step_3
    print(executor_3.run())
  File "/home/user/code/datatrove/src/datatrove/executor/local.py", line 80, in run
    stats = list(
  File "/home/user/env/datatrove/lib/python3.10/site-packages/multiprocess/pool.py", line 873, in next
    raise value
AssertionError: One or more duplicate ranges have not been used

Again, I think this bug relates to the corner case where one worker is idle and does nothing in the preceding for loop that would change the exhausted_ranges status...
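
To illustrate the corner case I suspect, here is a toy sketch. It is not datatrove's actual code: run_buggy, run_guarded, fails and the "one range per document" rule are all made up for illustration. It only shows how a flag that is updated solely inside a loop stays False when a worker has nothing to iterate over, and one possible guard (which is not necessarily what the real fix does).

```python
def run_buggy(documents, ranges):
    """Hypothetical stand-in for a dedup stage, NOT datatrove's implementation."""
    remaining = list(ranges)
    exhausted = False  # only ever updated inside the loop below
    for doc in documents:
        if remaining:
            remaining.pop(0)  # pretend each document uses up one range
        exhausted = not remaining
        yield doc
    # An idle worker (no documents) reaches this line with exhausted == False.
    assert exhausted, "One or more duplicate ranges have not been used"


def run_guarded(documents, ranges):
    """Same toy stage, with the flag initialised from the ranges themselves."""
    remaining = list(ranges)
    exhausted = not remaining  # no ranges assigned means nothing is left over
    for doc in documents:
        if remaining:
            remaining.pop(0)
        exhausted = not remaining
        yield doc
    assert exhausted, "One or more duplicate ranges have not been used"


def fails(generator):
    """Drain the generator and report whether the final assertion fired."""
    try:
        list(generator)
    except AssertionError:
        return True
    return False


print(fails(run_buggy([], [])))    # True: idle worker trips the assertion
print(fails(run_guarded([], [])))  # False: idle worker passes
```

In this sketch the guarded version still fails when ranges are genuinely left over, e.g. one document but two ranges, so the check keeps its purpose.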

Fixed by PR #73