huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools

Home Page:https://huggingface.co/docs/datasets

Caching map result of DatasetDict.

MostHumble opened this issue · comments

Hi!

I'm currently using the map function to tokenize a somewhat large dataset, so I need to rely on the cache to save ~25 minutes per run.

Changing num_proc triggers recomputation of the map. I'm not sure why, or whether this is expected behavior.

The docstring here says that already cached shards are loaded sequentially:

num_proc (`int`, *optional*, defaults to `None`):
Max number of processes when generating cache. Already cached shards are loaded sequentially.

It also seems like I could pass in a fingerprint and load the cached result directly:

if new_fingerprint is None:
    # we create a unique hash from the function,
    # current dataset file and the mapping args
    transform = format_transform_for_fingerprint(Dataset._map_single)
    kwargs_for_fingerprint = format_kwargs_for_fingerprint(Dataset._map_single, (), dataset_kwargs)
    kwargs_for_fingerprint["fingerprint_name"] = "new_fingerprint"
    new_fingerprint = update_fingerprint(self._fingerprint, transform, kwargs_for_fingerprint)
else:
    validate_fingerprint(new_fingerprint)
dataset_kwargs["new_fingerprint"] = new_fingerprint

if self.cache_files:
    if cache_file_name is None:
        cache_file_name = self._get_cache_file_path(new_fingerprint)
    dataset_kwargs["cache_file_name"] = cache_file_name

def load_processed_shard_from_cache(shard_kwargs):
    """Load a processed shard from cache if it exists, otherwise throw an error."""

Environment Setup:

  • Python 3.11.9
  • datasets 2.19.1 (conda-forge)
  • Linux 6.1.83-1.el9.elrepo.x86_64

MRE (minimal reproducible example)

raw_datasets and tokenize_function are fixed (identical between the two calls below); a rough sketch of what they look like follows.
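(The dataset and tokenizer names here are just placeholders; the real ones don't change anything about the caching behavior.)

from datasets import load_dataset
from transformers import AutoTokenizer

# placeholder dataset and checkpoint, stand-ins for the real ones
raw_datasets = load_dataset('wikitext', 'wikitext-2-raw-v1')
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

def tokenize_function(examples):
    # tokenize each line independently ("line_by_line")
    return tokenizer(examples['text'], truncation=True, max_length=512)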

tokenized_datasets = raw_datasets.map(
    tokenize_function,
    batched=True,
    num_proc=9,
    remove_columns=['text'],
    load_from_cache_file=True,
    desc="Running tokenizer on dataset line_by_line",
)


tokenized_datasets = raw_datasets.map(
    tokenize_function,
    batched=True,
    num_proc=5,
    remove_columns=['text'],
    load_from_cache_file=True,
    desc="Running tokenizer on dataset line_by_line",
)
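For now, a possible workaround might be to pin the cache files explicitly via cache_file_names (the paths below are made up). I'm not sure it fully avoids the recomputation when num_proc changes, since cached shards appear to be suffixed per process, but at least the cache location would stop depending on the auto-computed fingerprint:

tokenized_datasets = raw_datasets.map(
    tokenize_function,
    batched=True,
    num_proc=5,
    remove_columns=['text'],
    load_from_cache_file=True,
    cache_file_names={split: f'/tmp/tok_cache/{split}.arrow' for split in raw_datasets},  # made-up paths
    desc="Running tokenizer on dataset line_by_line",
)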