Add helper function to unify how we subsample data
isaac-chung opened this issue · comments
Piggy-backing off @imenelydiaker's comment: when a dataset split is larger than 2048 samples, subsampling is often done in this manner, without stratification by label:
self.dataset["test"] = (
self.dataset["test"].shuffle(seed=self.seed).select(range(N_SAMPLES))
)
I've primarily seen this in classification datasets, but I imagine this would apply to other tasks as well.
A helper function can be introduced to ease the contribution effort when dealing with larger datasets, and to unify how we subsample data. Example usage:
self.dataset['test'] = stratified_subsampling(self.dataset['test'], N_SAMPLES)
Totally agree, we can use `train_test_split` from `datasets`? Or we can write the core of the function ourselves. There may exist better options here.
I'd add a parameter to choose which column to stratify on (I assume the method is defined in `AbsTask`):
self.stratified_subsampling(self.dataset, label, n_samples)
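For illustration, here is a minimal sketch of what such a helper could look like. It operates on a plain list of dicts rather than a `datasets.Dataset`, and the function and parameter names are hypothetical; the real helper would more likely delegate to `datasets`' built-in `train_test_split(..., stratify_by_column=...)`:

```python
import random
from collections import defaultdict

def stratified_subsampling(rows, label_column, n_samples, seed=42):
    """Subsample `rows` down to ~n_samples while preserving label proportions.

    Hypothetical sketch over a list of dicts; a real implementation in mteb
    would work on a `datasets.Dataset` instead.
    """
    # Group rows by their label value.
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_column]].append(row)

    rng = random.Random(seed)
    fraction = n_samples / len(rows)
    sampled = []
    # Take the same fraction from each label group (at least one row each).
    for label_rows in by_label.values():
        rng.shuffle(label_rows)
        k = max(1, round(len(label_rows) * fraction))
        sampled.extend(label_rows[:k])

    # Shuffle the combined sample and trim rounding overshoot.
    rng.shuffle(sampled)
    return sampled[:n_samples]
```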
Should the method be defined in `AbsTask`?
Yes, that's a good spot. I noticed that `.select` is being used in more than one task. I can open a PR for this.
I think for now we can point new contributions to this when needed. As for the previous usages of `.shuffle` or `.select`, replacing them may require re-running results, and there are a lot to go through. How should we handle that?
We should totally point new contributors towards this yes!
Although I think it's possible to replace the old subsampling code, there is not a lot of it at the moment. We should also be able to run evals again on `multilingual-e5-small` and `paraphrase-multilingual-MiniLM-L12-v2`.
If you want, we can split the work: you open a PR with the new subsampling function proposal, and I'll update the tasks using the old version and run evaluations again, wdyt?
Ya let's do that 💪
Wonderful suggestion @isaac-chung and @imenelydiaker!