huggingface / datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.

Add support for skipping documents

rantav opened this issue

Readers have support for limit, but it'd also be useful to add support for skip.
Does that support already exist?

If not, I'm happy to contribute it.
What might be even more useful is a more generic pipeline step that can be used anywhere along the pipeline, not just in the readers. It would typically be used right after a reader, but not necessarily.
Something like:

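from datatrove.data import DocumentsPipeline
from datatrove.pipeline.base import PipelineStep


class Skipper(PipelineStep):
    def __init__(self, skip: int = 0):
        """
        @param skip: How many documents to skip
        """
        super().__init__()
        self.skip = skip

    def run(self, data: DocumentsPipeline, rank: int = 0, world_size: int = 1) -> DocumentsPipeline:
        skipped = 0
        for d in data:
            # Drop the first `skip` documents, then pass everything else through
            if skipped >= self.skip:
                yield d
            skipped += 1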
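
For reference, the same skip logic can be sketched without datatrove at all, as a plain generator built on itertools.islice. The function name skip_documents and the string "documents" are made up for illustration; this is only a rough equivalent of the proposed step, not code from the library.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def skip_documents(data: Iterable[T], skip: int = 0) -> Iterator[T]:
    """Lazily drop the first `skip` items of `data` and yield the rest."""
    # islice consumes and discards the first `skip` items, then streams the
    # remainder one at a time, so nothing is loaded into memory up front.
    yield from islice(data, skip, None)


# Hypothetical stream of five "documents"
docs = (f"doc-{i}" for i in range(5))
remaining = list(skip_documents(docs, skip=2))
print(remaining)  # ['doc-2', 'doc-3', 'doc-4']
```

If skip is larger than the number of documents, the generator simply yields nothing.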

Hi!
I like the idea of adding skip support to readers, but can you give an example use case where it would make sense to have a Skipper after a non-reader block?

Thanks @guipenedo, the only use case I have in mind right now is skipping while reading, i.e. adding this ability to the different readers.
I thought it would be a bit more modular and generic to implement a Skipper, but I don't have a clear use case other than the one mentioned, when reading.
I was inspired by the SamplerFilter, a feature that could also have been implemented inside a reader but was implemented as a separate step. A Skipper could behave in a similar fashion.

Indeed, that's a good point regarding the SamplerFilter. I have used a SamplerFilter after other filtering steps in the past, and it would have some similarities with the skip.
In any case, I think people might want to skip data on a specific source (when you chain a few readers together, for example), so I think the easiest approach would really be to add it to the BaseReader. Is this something you would be willing to PR?
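
To illustrate the per-source point, here is a rough sketch of skip living on the reader next to limit. ToyReader is a hypothetical stand-in, not datatrove's BaseReader, and it assumes each reader forwards upstream documents before emitting its own and that limit=None means "no limit".

```python
from itertools import islice
from typing import Iterable, Iterator, Optional


class ToyReader:
    """Hypothetical reader with per-source skip and limit (not datatrove's API)."""

    def __init__(self, source: Iterable[str], skip: int = 0, limit: Optional[int] = None):
        self.source = source
        self.skip = skip
        self.limit = limit  # assumption: None means "read everything"

    def run(self, data: Optional[Iterable[str]] = None) -> Iterator[str]:
        # Forward whatever earlier readers produced, untouched
        if data is not None:
            yield from data
        # Skip and limit apply only to this reader's own source
        stop = None if self.limit is None else self.skip + self.limit
        yield from islice(self.source, self.skip, stop)


reader_a = ToyReader([f"a{i}" for i in range(5)], skip=2, limit=2)
reader_b = ToyReader([f"b{i}" for i in range(3)], skip=1)

# Chain the two readers: reader_b receives reader_a's output as upstream data
chained = list(reader_b.run(reader_a.run()))
print(chained)  # ['a2', 'a3', 'b1', 'b2']
```

A standalone Skipper placed after both readers would instead skip from the combined stream, which is why a reader-level option is the easier way to target one source.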

Yes, happy to do it, I'll work on it

OK, here's a first attempt: #167