libffcv / ffcv

FFCV: Fast Forward Computer Vision (and other ML workloads!)

Home Page: https://ffcv.io

Indexing (to Subset) Loader Class without having to generate beton files again

meghbhalerao opened this issue

commented

Hi, does anyone know if there is a way to get a certain subset of images and corresponding labels from .beton files? For example, with the standard PyTorch Dataset class I can use the Subset class from torch.utils.data and simply do subset_data = Subset(whole_trainset, subset_idxs). If I have a Loader object in ffcv, is there a way to do the same?

The worst case would be to generate the .beton files again for the subset given by the indices, but I was wondering if there is a way to index the Loader object directly.

Thanks and please let me know if anything is unclear.

Hi! The Loader object takes an indices argument that should do what you want.
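A minimal sketch of what that could look like, assuming FFCV is installed and a .beton file already exists. The helper name make_subset_loader, the batch size, and the sequential order are illustrative assumptions and not part of the original answer; only the indices argument comes from it.

```python
def make_subset_loader(beton_path, subset_idxs, batch_size=32):
    """Build an FFCV Loader restricted to the given sample indices.

    Hypothetical helper: batch_size and the sequential order are
    illustrative defaults, not values taken from this thread.
    """
    # Imported lazily so the sketch can be read without FFCV installed.
    from ffcv.loader import Loader, OrderOption

    return Loader(
        beton_path,
        batch_size=batch_size,
        order=OrderOption.SEQUENTIAL,
        indices=list(subset_idxs),  # only these samples are iterated
    )
```

Calling make_subset_loader(filepath, list_of_subset_idxs) would then play the role of Subset(whole_trainset, subset_idxs) without regenerating the .beton file.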

commented

Thanks @andrewilyas. When I create the loader this way:
loader_1 = Loader(filepath, indices = list_of_subset_idxs)
it works and I am able to iterate over just the subset.
However, say I have an existing Loader object called loader_2 and I want to index it. I do the following:
loader_2_subset = setattr(loader_2, 'indices', list_of_subset_idxs)
This does not work: iterating through the loader still goes through the whole dataset.
Am I doing something wrong?
Please let me know and thanks for your time.

I think it's a bit tough to do that, since a lot of pre-loading happens inside the initialization of the Loader class. I can't think of a use case where one can't just re-initialize the Loader class, though. Is there a specific use case where that's necessary?
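To illustrate the general point with a toy example: the class below is a hypothetical stand-in, not FFCV's actual implementation. It only assumes that the traversal order is computed once in __init__, which is enough to show why changing the attribute afterwards has no effect. Note also that setattr always returns None, so its result cannot be used as a new loader.

```python
class ToyLoader:
    """Hypothetical stand-in for a loader; not FFCV's implementation."""

    def __init__(self, data, indices=None):
        self.data = data
        self.indices = list(range(len(data))) if indices is None else list(indices)
        # The traversal order is computed once, at construction time.
        self._order = list(self.indices)

    def __iter__(self):
        return (self.data[i] for i in self._order)


loader = ToyLoader(["a", "b", "c", "d"])

# setattr returns None, so this does not produce a new loader object.
result = setattr(loader, "indices", [0, 2])

# The attribute changed, but the precomputed order did not.
after_setattr = list(loader)

# Re-initializing with the subset gives the intended behaviour.
rebuilt = list(ToyLoader(loader.data, indices=[0, 2]))
```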

commented

My use case is as follows: I have already instantiated a Loader object (called obj), and then I do some processing using obj that returns a set of indices. I now want to reuse the same obj but iterate only through this subset.
Of course I could just re-instantiate it with indices = subset_indices and all the parameters I passed initially, but setting the indices attribute with setattr would make for cleaner code. The workaround I am using is described in issue #316, but as I mentioned there, it seems to have some problems.
This would just make my codebase more convenient and easier to use, for my purposes.
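One way to keep re-instantiation tidy, sketched here under assumptions, is to bind the constructor arguments once with functools.partial and pass only indices when a subset is needed. DummyLoader and the file name "train.beton" are hypothetical placeholders so the sketch runs without FFCV; in practice the real Loader class and its arguments would take their place.

```python
from functools import partial


class DummyLoader:
    """Hypothetical stand-in so the sketch runs without FFCV installed."""

    def __init__(self, fname, batch_size, indices=None):
        self.fname = fname
        self.batch_size = batch_size
        self.indices = indices


# Bind the shared arguments once ("train.beton" is a placeholder path).
make_loader = partial(DummyLoader, "train.beton", batch_size=32)

full = make_loader()                      # whole dataset
subset = make_loader(indices=[0, 5, 9])   # same settings, subset only
```

This still rebuilds the loader, so any setup cost in the constructor is paid again, but the parameters are written down in only one place.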