embeddings-benchmark / mteb

MTEB: Massive Text Embedding Benchmark

Home Page: https://arxiv.org/abs/2210.07316

Add helper function to unify how we subsample data

isaac-chung opened this issue

Piggy-backing off @imenelydiaker 's comment:

When a dataset split has more than 2048 examples, it is often subsampled like this, without stratifying by label:

self.dataset["test"] = (
    self.dataset["test"].shuffle(seed=self.seed).select(range(N_SAMPLES))
)

I've primarily seen this in classification datasets, but I imagine this would apply to other tasks as well.
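To make concrete what stratification adds over a plain shuffle-and-select, here is a minimal, library-free sketch. The name `stratified_indices` and its parameters are hypothetical, not part of mteb or datasets; it only picks row indices so that each label keeps roughly its original share of the sample.

```python
import random
from collections import defaultdict


def stratified_indices(labels, n_samples, seed=42):
    """Pick n_samples row indices so each label keeps roughly its original share.

    Hypothetical helper for illustration; not an existing mteb function.
    """
    total = len(labels)
    if n_samples >= total:
        # Nothing to subsample: keep every row.
        return list(range(total))

    rng = random.Random(seed)

    # Group row indices by their label.
    by_label = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_label[lab].append(idx)

    # Largest-remainder allocation: start from the floor of each label's
    # proportional quota, then give the leftover slots to the labels with
    # the largest fractional parts.
    quotas = {lab: len(idxs) * n_samples / total for lab, idxs in by_label.items()}
    counts = {lab: int(q) for lab, q in quotas.items()}
    leftover = n_samples - sum(counts.values())
    by_remainder = sorted(quotas, key=lambda lab: quotas[lab] - counts[lab], reverse=True)
    for lab in by_remainder[:leftover]:
        counts[lab] += 1

    # Sample without replacement inside each label group.
    picked = []
    for lab, idxs in by_label.items():
        picked.extend(rng.sample(idxs, counts[lab]))
    rng.shuffle(picked)
    return picked
```

Assuming the label column is called `label`, the returned indices could then be passed to the existing call, e.g. `self.dataset["test"].select(stratified_indices(self.dataset["test"]["label"], N_SAMPLES))`. Note that very rare labels can receive zero samples under this allocation.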

A helper function can be introduced to help ease the contribution efforts when dealing with larger datasets, and to help unify how we subsample data. Example usage:

self.dataset['test'] = stratified_subsampling(self.dataset['test'], N_SAMPLES)

Totally agree. We could use train_test_split from datasets, or write the core of the function ourselves. There may be better options here.

I'd add a variable to choose on what column to stratify (I assume the method is defined in AbsTask):

self.stratified_subsampling(self.dataset, label, n_samples)

Should the method be defined in AbsTask?

Yes that's a good spot. I noticed that .select is being used in more than 1 task. I can open a PR for this.
I think for now we can point new contributions to this when needed. As for the previous usage of .shuffle or .select, changing it may require re-running results, and there are a lot to go through.

How should we handle that?

We should totally point new contributors towards this yes!

That said, I think it's possible to replace the old subsampling calls, since there are not a lot of them for the moment. We should also be able to run evals again on multilingual-e5-small and paraphrase-multilingual-MiniLM-L12-v2.

If you want, we can split the work: you open a PR proposing the new subsampling function, and I'll update the tasks using the old version and run the evaluations again. wdyt?

Ya let's do that 💪

Wonderful suggestion @isaac-chung and @imenelydiaker!