Add support for skipping documents
rantav opened this issue · comments
Readers have support for limit, but it'd also be useful to add support for skip.
Does that support already exist?
If not then I'm happy to contribute it.
What might be even more useful is to add a more generic pipeline step which can be used anywhere along the pipeline, not just in the readers. It would typically be used right after a reader, but not necessarily.
Something like:
class Skipper(PipelineStep):
    def __init__(self, skip: int = 0):
        """
        @param skip: How many documents to skip
        """
        self.skip = skip

    def run(self, data: DocumentsPipeline, rank: int = 0, world_size: int = 1) -> DocumentsPipeline:
        skipped = 0
        for d in data:
            if skipped >= self.skip:
                yield d
            skipped += 1
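For illustration, here is the same skipping logic as a standalone sketch that runs without the library. The function name skip_documents and the string "documents" are assumptions made for this example only; it uses itertools.islice instead of a manual counter.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def skip_documents(data: Iterable[T], skip: int = 0) -> Iterator[T]:
    """Yield every item of `data` except the first `skip` ones (hypothetical helper)."""
    # islice(data, skip, None) lazily discards the first `skip` items,
    # then passes the rest through unchanged.
    return islice(data, skip, None)


# Stand-in for a stream of documents coming out of a reader.
docs = (f"doc-{i}" for i in range(5))
remaining = list(skip_documents(docs, skip=2))
print(remaining)  # ['doc-2', 'doc-3', 'doc-4']
```

If skip is larger than the number of documents, the result is simply empty. A negative skip raises ValueError in islice.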
Hi!
I like the idea of adding skip support to readers, but can you give an example use case where it would make sense to have a Skipper after a non-reader block?
Thanks @guipenedo, the only use case I have in mind right now is to skip while reading, e.g. add this ability to the different readers.
I thought it would be a bit more modular and generic to implement a Skipper, but I don't have a clear use case other than, as mentioned, when reading.
I was inspired by the SamplerFilter, a feature that could also be implemented inside a reader but was implemented as a separate step. A Skipper could behave in a similar fashion.
Indeed that's a good point regarding the SamplerFilter. I have used a SamplerFilter after other filtering steps in the past, but indeed it would have some similarities with the skip.
In any case I think people might want to skip data on a specific source (when you chain a few readers together for example) so I think the easiest approach would really be to add it to the BaseReader. Is this something you would be willing to PR?
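To make the per-source argument concrete, here is a rough sketch of skip and limit applied to each source before the sources are chained. It does not use the library's BaseReader; read_source is a hypothetical stand-in, and treating limit = -1 as "no limit" is an assumption of this example.

```python
from itertools import chain, islice
from typing import Iterable, Iterator


def read_source(docs: Iterable[str], skip: int = 0, limit: int = -1) -> Iterator[str]:
    """Hypothetical reader stand-in: drop `skip` docs, then yield at most `limit`."""
    # A negative limit is treated as "no limit" (an assumption of this sketch).
    stop = None if limit < 0 else skip + limit
    return islice(docs, skip, stop)


# Skip applies to the first source only; the second source is read in full.
source_a = read_source(["a0", "a1", "a2", "a3"], skip=1, limit=2)
source_b = read_source(["b0", "b1"])
combined = list(chain(source_a, source_b))
print(combined)  # ['a1', 'a2', 'b0', 'b1']
```

A single Skipper step placed after the chained readers could only skip from the front of the combined stream, which is why a per-reader parameter fits this case better.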
Yes, happy to do it, I'll work on it
ok, here's a first attempt #167