Add support for skipping documents

Question

rantav opened this issue a month ago

Readers have support for limit, but it'd also be useful to add support for skip.
Does that support already exist?

If not, I'm happy to contribute it.
It might be even more useful to add a generic pipeline step that can be used anywhere along the pipeline, not just in the readers. It would typically be placed right after a reader, but not necessarily.
Something like:

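from datatrove.data import DocumentsPipeline
from datatrove.pipeline.base import PipelineStep


class Skipper(PipelineStep):
    def __init__(self, skip: int = 0):
        """
        @param skip: How many documents to skip
        """
        super().__init__()
        self.skip = skip

    def run(self, data: DocumentsPipeline, rank: int = 0, world_size: int = 1) -> DocumentsPipeline:
        # Drop the first `skip` documents, then pass the rest through unchanged.
        skipped = 0
        for d in data:
            if skipped >= self.skip:
                yield d
            skipped += 1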
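
For illustration only, the same skipping logic can be sketched without any datatrove dependency, using itertools.islice from the standard library. The function name skip_documents is hypothetical and is not part of datatrove; it only shows the intended semantics (drop the first N items lazily, yield the rest).

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def skip_documents(data: Iterable[T], skip: int = 0) -> Iterator[T]:
    """Lazily drop the first `skip` items of `data` and yield the rest.

    Hypothetical helper for illustration; not a datatrove API.
    """
    if skip < 0:
        raise ValueError("skip must be non-negative")
    # islice(data, skip, None) consumes and discards the first `skip` items,
    # then yields everything that follows without buffering.
    return islice(data, skip, None)
```

If skip is larger than the number of available documents, the result is simply empty, which matches the behavior of the Skipper class in this post.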

Guilherme Penedo · Answer 1 · Thu May 02 2024 17:16:47 GMT+0800 (China Standard Time)

Hi!
I like the idea of adding skip support to readers, but can you give an example use case where it would make sense to have a Skipper after a non-reader block?

Ran Tavory · Answer 2 · Thu May 02 2024 19:48:56 GMT+0800 (China Standard Time)

Thanks @guipenedo, the only use case I have in mind right now is skipping while reading, i.e. adding this ability to the different readers.
I thought it would be a bit more modular and generic to implement a Skipper, but I don't have a clear use case for it other than reading.
I was inspired by the SamplerFilter, a feature that could also have been implemented inside a reader but was implemented as a separate step. A Skipper could behave in a similar fashion.

Guilherme Penedo · Answer 3 · Thu May 02 2024 19:52:33 GMT+0800 (China Standard Time)

That's a good point regarding the SamplerFilter. I have used a SamplerFilter after other filtering steps in the past, and it would indeed have some similarities with skip.
In any case, I think people might want to skip data on a specific source (when you chain a few readers together, for example), so the easiest approach would really be to add it to the BaseReader. Is this something you would be willing to PR?
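
To make the per-source point concrete, here is a rough, library-free sketch. The helper read_with_skip and its skip/limit parameters are hypothetical stand-ins for a reader, not the actual BaseReader interface, and it assumes skip is applied before limit and that limit=None means no limit.

```python
from itertools import chain, islice
from typing import Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


def read_with_skip(source: Iterable[T], skip: int = 0, limit: Optional[int] = None) -> Iterator[T]:
    """Hypothetical reader stand-in: skip the first `skip` items, then yield at most `limit`."""
    # Assumption: `limit` counts documents yielded after skipping.
    stop = None if limit is None else skip + limit
    return islice(source, skip, stop)


source_a = ["a0", "a1", "a2", "a3"]
source_b = ["b0", "b1", "b2"]

# Skip configured per source: each reader drops its own leading documents.
per_source = list(chain(
    read_with_skip(source_a, skip=2),
    read_with_skip(source_b, skip=1, limit=1),
))

# A single skip applied after chaining only affects the start of the first source.
after_chain = list(islice(chain(source_a, source_b), 2, None))
```

With a single skip step placed after the chained readers, there is no way to drop documents from the start of the second source, which is why a per-reader option is the simpler fit here.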

Ran Tavory · Answer 4 · Thu May 02 2024 20:16:05 GMT+0800 (China Standard Time)

Yes, happy to do it. I'll work on it.

Ran Tavory · Answer 5 · Fri May 03 2024 04:21:33 GMT+0800 (China Standard Time)

OK, here's a first attempt: #167