Add helper function to unify how we subsample data

Question

isaac-chung opened this issue a month ago

Piggy-backing off @imenelydiaker 's comment:

When a dataset split has more than 2048 samples, it is often subsampled like this, without stratifying by label:

self.dataset["test"] = (
    self.dataset["test"].shuffle(seed=self.seed).select(range(N_SAMPLES))
)

I've primarily seen this in classification datasets, but I imagine this would apply to other tasks as well.

A helper function could make contributions that involve larger datasets easier, and would unify how we subsample data. Example usage:

self.dataset['test'] = stratified_subsampling(self.dataset['test'], N_SAMPLES)
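For illustration only, here is a minimal, library-independent sketch of what such a helper could do. It assumes the split is a plain list of dicts with a "label" key (not a datasets.Dataset), and the `label` and `seed` parameters are hypothetical additions, not part of the proposal above.

```python
import random
from collections import defaultdict


def stratified_subsampling(rows, n_samples, label="label", seed=42):
    """Return n_samples rows whose label proportions roughly match `rows`.

    Hypothetical sketch: `rows` is assumed to be a list of dicts.
    """
    if len(rows) <= n_samples:
        # Nothing to subsample.
        return list(rows)

    rng = random.Random(seed)

    # Group rows by label value.
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label]].append(row)

    # Largest-remainder allocation: give each label its proportional share,
    # rounded down, then hand the leftover slots to the largest remainders.
    total = len(rows)
    quotas = {k: n_samples * len(v) / total for k, v in by_label.items()}
    counts = {k: int(q) for k, q in quotas.items()}
    leftover = n_samples - sum(counts.values())
    by_remainder = sorted(quotas, key=lambda k: quotas[k] - counts[k], reverse=True)
    for k in by_remainder[:leftover]:
        counts[k] += 1

    # Sample without replacement inside each label group, then mix the groups.
    sample = []
    for k, group in by_label.items():
        sample.extend(rng.sample(group, counts[k]))
    rng.shuffle(sample)
    return sample
```

With a fixed seed the result is reproducible, which matters if scores are compared across runs.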

Imene Kerboua · Answer 1 · Tue Apr 23 2024 15:07:23 GMT+0800 (China Standard Time)

Totally agree. We could use train_test_split from datasets as the core of the function, or write that core ourselves. There may be better options here.

I'd add a variable to choose on what column to stratify (I assume the method is defined in AbsTask):

self.stratified_subsampling(self.dataset, label, n_samples)

Should the method be defined in AbsTask?

Isaac Chung · Answer 2 · Tue Apr 23 2024 16:09:58 GMT+0800 (China Standard Time)

Yes, that's a good spot. I noticed that .select is being used in more than one task. I can open a PR for this.
I think for now we can point new contributions to this when needed. As for the existing uses of .shuffle or .select, replacing them may require re-running results, and there are a lot of tasks to go through.

How should we handle that?

Imene Kerboua · Answer 3 · Tue Apr 23 2024 16:19:42 GMT+0800 (China Standard Time)

We should totally point new contributors towards this, yes!

That said, I think it's possible to replace the old subsampling code, since there aren't many uses of it at the moment. We should also be able to run evals again on multilingual-e5-small and paraphrase-multilingual-MiniLM-L12-v2.

If you want, we can split the work: you open a PR proposing the new subsampling function, and I'll update the tasks using the old version and run the evaluations again. wdyt?

Isaac Chung · Answer 4 · Tue Apr 23 2024 16:30:10 GMT+0800 (China Standard Time)

Ya let's do that 💪

Kenneth Enevoldsen · Answer 5 · Tue Apr 23 2024 17:28:47 GMT+0800 (China Standard Time)

Wonderful suggestion @isaac-chung and @imenelydiaker!