embeddings-benchmark / mteb

MTEB: Massive Text Embedding Benchmark

Home Page: https://arxiv.org/abs/2210.07316


Slow loading for datasets with a high number of language pairs

loicmagne opened this issue

commented

I'm opening this issue to discuss changes to how some datasets are loaded

I wrote more details about this issue here #386 (comment). The TL;DR is that when a dataset has a high number of subsets (in our case these are often language tags), they are currently loaded iteratively, one subset at a time, with a load_dataset call:

for lang in self.langs:
    self.dataset[lang] = datasets.load_dataset(
        name=lang,
        **self.metadata_dict["dataset"],
    )

This turns out to be far slower than loading the same amount of data in one chunk. The reason is partly that a network request must be made for each file, but mostly that the datasets lib has a constant per-dataset overhead in which it checks for newer versions of each file. Importantly, even when the dataset is cached, loading it is still slow: in PR #330, loading a dataset with 250 subsets takes 15 minutes from the network and 12 minutes from cache
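As an aside, when everything is already cached, one partial mitigation is the HF_DATASETS_OFFLINE environment variable, which skips the per-file freshness checks entirely (a sketch; note that any non-cached load will then raise instead of downloading):

import os

os.environ["HF_DATASETS_OFFLINE"] = "1"  # must be set before `datasets` is first imported
import datasets  # load_dataset will now resolve everything from the local cache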

This issue has been there from the very beginning, but it isn't too noticeable when the number of subsets/languages is low, which is the case for most datasets. However, for crosslingual tasks like bitext mining, where the number of language pairs can be very large, some datasets can take hours to load (#386, #330), so it would be relevant to solve this in the context of MMTEB and the effort to speed things up (#381)

I ran some experiments with this dataset https://huggingface.co/datasets/loicmagne/open-subtitles-bitext-mining, which contains 1759 language pairs (= subsets), where each subset contains 1000 sentence pairs, for ~300MB of data in total. Loading it the iterative way from the network takes around an hour (I don't have the exact number, but it's long)

Solution 1

From huggingface/datasets#6800 there is a way to load multiple subsets all at once with load_dataset:

data_files = "data/*.jsonl"
ds = load_dataset("loicmagne/open-subtitles-250-bitext-mining", data_files=data_files, split="train")

or using this loading config:

  - config_name: all
    data_files: data/*.jsonl
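
With such a config committed to the dataset repo, all subsets can then be fetched in one call (a sketch; "all" is just the config_name chosen above):

from datasets import load_dataset

ds = load_dataset("loicmagne/open-subtitles-250-bitext-mining", "all", split="train")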

When doing so, the loading time drops to 17 minutes. The issue is that the result no longer contains the name of the subset:

>>> ds
DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2'],
        num_rows: 247809
    })
})

The only way I have found to fix this is to add a "lang" feature to each row, which slightly increases the size of the dataset

Solution 2

17 minutes is already a good speedup, but it's still far more than it should take to load 350MB of data. This is because the dataset is split into several files, one per language. The second solution is to merge all the data into a single file, in my case a single .jsonl file where each row looks like {'sentence1': ..., 'sentence2': ..., 'lang': ...}. This brings the loading time of the dataset from the network down to 30 seconds.
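For illustration, a minimal sketch of such a merge script, assuming the data/<lang>.jsonl layout of the example dataset:

import json
from pathlib import Path

# Merge every per-language file into one jsonl, tagging each row with its subset name
with open("merged.jsonl", "w") as out:
    for path in sorted(Path("data").glob("*.jsonl")):
        lang = path.stem  # e.g. "en-fr", taken from the file name
        for line in path.open():
            row = json.loads(line)
            row["lang"] = lang
            out.write(json.dumps(row, ensure_ascii=False) + "\n")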

The drawback of this method is that, since all the subsets are merged, the dataset no longer supports the subset features of the HuggingFace Hub, like visualizing each subset with the dataset viewer or downloading one subset at a time.

Recovering each subset

One remaining issue with these two solutions is that the resulting dataset looks like this:

Dataset({
    features: ['sentence1', 'sentence2', 'lang'],
    num_rows: 247809
})

In the case of bitext mining, we need to recover each subset by grouping rows with the same lang key. I haven't found any way to do this natively in HF datasets, so you have to filter each language one by one, like this:

ds_split = {}
for lang in ds.unique("lang"):
    ds_split[lang] = ds.filter(lambda x: x['lang'] == lang)

This is again very slow, taking around an hour. Luckily there is a recent Polars integration with HF Datasets (huggingface/datasets#6531) which allows performing fast queries on a dataset; in our case we can use the group_by operation to split the rows by language:

ds_split = ds.to_polars().group_by('lang')

This operation takes 0.3s on my dataset, but requires adding the polars library as a dependency
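Note that group_by returns an iterable of (key, frame) pairs rather than a dict, so materializing the per-language mapping could look like this (a sketch; in recent polars versions the group key is a tuple, hence the unpacking):

ds_split = {}
for (lang,), frame in ds.to_polars().group_by("lang"):
    ds_split[lang] = frame

If a datasets.Dataset is needed downstream, the same integration also added Dataset.from_polars to convert each frame back.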

Summary

There is a solution to bring the loading time of datasets with a high subset count from multiple hours to <1min. This solution requires a specific dataset format and loading procedure, so it should be an opt-in thing that we only activate on the specific datasets that require it. There are two drawbacks that I can identify:

  • It requires adding polars as a dependency
  • The speedup only works when loading all the subsets, so when evaluating on specific subsets you would over-download data

This would unlock two PRs (#386, #330) and speed up some existing tasks. Let me know WDYT and if you have any suggestions
@KennethEnevoldsen

Thanks for this @loicmagne. My thoughts:

  1. Seems like there is a general issue here, and it might be worth looking into a PR to datasets if that makes sense (that would be the best place to fix it)
    • I am actually unsure if the problem is HF or the loading script?
  2. Solution 2: the existing datasets used for mteb are already not built for viewing (no dataset sheet etc.), so that solution seems reasonable.
    • I don't think the polars dependency is a big issue (I haven't had any compatibility issues with it)
    • Over-downloading is only really a problem for truly large datasets (which we might want to avoid anyway)
    • This would also allow us to remove "trust_remote_code", which is probably reasonable for security (I can't seem to find any way to have a multilingual dataset on HF without it, which seems problematic?)

It would take quite a while to reformat all MTEB datasets to this format, though. A solution would be to just do it for bitext mining, where the overhead is by far the largest.

commented
  1. Seems like there is a general issue here, and it might be worth looking into a PR to datasets if that makes sense (that would be the best place to fix it)
  • I am actually unsure if the problem is HF or the loading script?

I think there is some overhead on the HF datasets side that could be reduced, although I don't know the details of the internals; I'll try to look into it

It would take quite a while to reformat all MTEB datasets to this format, though. A solution would be to just do it for bitext mining, where the overhead is by far the largest.

Yes, I didn't plan to reformat every dataset; in fact, for datasets with <10 languages the speedup would probably be negligible. I was thinking of implementing it as an alternative loading method for CrosslingualTask that could be optionally toggled, and only using it on the few datasets where the speedup is significant (a sketch of that toggle is below)
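A sketch of what that opt-in could look like (the attribute and helper names are purely illustrative, not an existing mteb API):

class CrosslingualTask:
    fast_loading = False  # opt-in, enabled only on tasks stored in the merged format

    def load_data(self, **kwargs):
        if self.fast_loading:
            self._load_merged_then_split()  # one load_dataset call + polars group_by
        else:
            self._load_per_language()  # current behaviour: one call per subset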

Super nice investigation @loicmagne!

Just thinking out loud here, but where do we think the slowdown occurs? I would imagine the number of network requests should be quite reasonable. E.g., take:

Importantly, even when the dataset is cached, loading it is still slow: in PR #330, loading a dataset with 250 subsets takes 15 minutes from the network and 12 minutes from cache

This indicates that it spends (12 minutes · 60 s/min) / 250 subsets ≈ 3 seconds per subset. Perhaps this means all the calls are being made synchronously, but even so, 3 seconds seems like a lot for a round-trip?

I could imagine slowdowns happening at:

  1. Network latency (to/from huggingface)

    • Shouldn't make this large a difference. A typical round-trip should be less than 100 ms. However,
  2. Huggingface-hub processing/selection of file/CDN

    • I could imagine latency on their storage solution being pretty high, so this seems reasonable to me. It doesn't match that loading all subsets in one chunk is much faster, though (which supports the network latency argument).
  3. Processing of the dataset on huggingface before download (e.g. a dataset loading script on huggingface)

    • From what I can see, your example dataset doesn't have such a file, so this doesn't explain the slowdown here.
  4. Processing of the dataset locally (if anything happens here?)

If, indeed, network latency is a large part of the problem, perhaps we could solve it using strategic async code?
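
For what it's worth, a sketch of what that could look like (load_dataset itself is synchronous, so threads rather than asyncio; the function and parameter names are illustrative):

from concurrent.futures import ThreadPoolExecutor

import datasets

def load_subsets_parallel(path: str, langs: list[str], max_workers: int = 16) -> dict:
    # Run the per-subset load_dataset calls concurrently so their
    # round-trip latencies overlap instead of accumulating
    def load_one(lang: str):
        return lang, datasets.load_dataset(path, name=lang)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(load_one, langs))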

With that said, we have discussed adding an MTEB mirror of some datasets anyway, so adding mirrors of slow/large datasets that conform to a fast format seems quite reasonable. Then we "just" have to ensure that they stay sufficiently aligned with the source (i.e. a cache-invalidation problem).

Again, thanks a ton for your investigation and a very nice report!

Async code might not be a bad idea, actually - it should be quick to try out.

We improved speed in datasets 2.19 btw, see comments at huggingface/datasets#6800 :)

A bit late, but my thoughts are: go for solution 2 and use polars.
I think it's okay to have a language feature; it would maybe require an update of BitextMiningAbsTask? Or we can just handle it in the data_transform function of the dataset.

The issue also occurs with Flores: it contains more than 40K language pairs, since it's a collection of English texts translated into 200 languages (200² pairs).

From this I would probably suggest the following:

  1. Set a lower bound on datasets to 2.19 (optional really, but the decrease in download time is significant)
  2. Create a script that downloads all current bitext tasks, adds the "lang" column, and re-uploads them to the mteb org (this is something I have already considered we do for all tasks); see the sketch after this list
  3. Redo the loading function to use solution 1, adding polars as a dependency for the group_by
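
A rough sketch of step 2 (the repo names, split, and per-subset layout are assumptions, not the final script):

import datasets

def reupload_with_lang_column(source_repo: str, langs: list[str], target_repo: str) -> None:
    # Download every subset, tag each row with its subset name, merge, and push
    merged = []
    for lang in langs:
        ds = datasets.load_dataset(source_repo, name=lang, split="train")
        merged.append(ds.add_column("lang", [lang] * len(ds)))
    datasets.concatenate_datasets(merged).push_to_hub(target_repo)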

let me know what you guys think

commented

I ran experiments with the latest 2.19 datasets release and it's indeed very good; I think it makes solution 2 unnecessary, since merging files together no longer provides a significant speedup. I would go with solution 1, simply adding a "lang" column to every current file

From this I would probably suggest the following:

1. Set a lower bound on datasets to 2.19 (optional really, but the decrease in download time is significant)

2. Create a script that downloads all current bitext tasks, adds the "lang" column, and re-uploads them to the mteb org (this is something I have already considered we do for all tasks)

3. Redo the loading function to use solution 1, adding polars as a dependency for the group_by

let me know what you guys think

Sounds great, I'll start the PR 👍

Wonderful @loicmagne! I agree, let's go for that

commented

I think it'd be useful to do the same for some multilingual tasks btw; the Massive classification datasets, for example, take 5 minutes to load