huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools

Home Page:https://huggingface.co/docs/datasets

Caching map result of DatasetDict.

MostHumble opened this issue · comments

Hi!

I'm currently using the map function to tokenize a somewhat large dataset, so I need to rely on the cache to save ~25 minutes per run.

Changing num_proc triggers recomputation of the map. I'm not sure why, or whether this is expected behavior.

The docstring here says that already cached shards are loaded sequentially:

num_proc (`int`, *optional*, defaults to `None`):
Max number of processes when generating cache. Already cached shards are loaded sequentially.

It also seems like I could pass in a fingerprint and load the cached result directly:

if new_fingerprint is None:
    # we create a unique hash from the function,
    # current dataset file and the mapping args
    transform = format_transform_for_fingerprint(Dataset._map_single)
    kwargs_for_fingerprint = format_kwargs_for_fingerprint(Dataset._map_single, (), dataset_kwargs)
    kwargs_for_fingerprint["fingerprint_name"] = "new_fingerprint"
    new_fingerprint = update_fingerprint(self._fingerprint, transform, kwargs_for_fingerprint)
else:
    validate_fingerprint(new_fingerprint)
dataset_kwargs["new_fingerprint"] = new_fingerprint

if self.cache_files:
    if cache_file_name is None:
        cache_file_name = self._get_cache_file_path(new_fingerprint)
    dataset_kwargs["cache_file_name"] = cache_file_name

def load_processed_shard_from_cache(shard_kwargs):
    """Load a processed shard from cache if it exists, otherwise throw an error."""

Environment Setup:

  • Python 3.11.9
  • datasets 2.19.1 (conda-forge)
  • Linux 6.1.83-1.el9.elrepo.x86_64

MRE (minimal reproducible example)

raw_datasets and tokenize_function are fixed (identical between the two calls below); a rough sketch of what they look like follows.
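(The dataset and tokenizer names here are just placeholders; the real ones don't change anything about the caching behavior.)

from datasets import load_dataset
from transformers import AutoTokenizer

# placeholder dataset and checkpoint, stand-ins for the real ones
raw_datasets = load_dataset('wikitext', 'wikitext-2-raw-v1')
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

def tokenize_function(examples):
    # tokenize each line independently ("line_by_line")
    return tokenizer(examples['text'], truncation=True, max_length=512)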

tokenized_datasets = raw_datasets.map(
    tokenize_function,
    batched=True,
    num_proc=9,
    remove_columns=['text'],
    load_from_cache_file=True,
    desc="Running tokenizer on dataset line_by_line",
)


tokenized_datasets = raw_datasets.map(
    tokenize_function,
    batched=True,
    num_proc=5,
    remove_columns=['text'],
    load_from_cache_file=True,
    desc="Running tokenizer on dataset line_by_line",
)
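For now, a possible workaround might be to pin the cache files explicitly via cache_file_names (the paths below are made up). I'm not sure it fully avoids the recomputation when num_proc changes, since cached shards appear to be suffixed per process, but at least the cache location would stop depending on the auto-computed fingerprint:

tokenized_datasets = raw_datasets.map(
    tokenize_function,
    batched=True,
    num_proc=5,
    remove_columns=['text'],
    load_from_cache_file=True,
    cache_file_names={split: f'/tmp/tok_cache/{split}.arrow' for split in raw_datasets},  # made-up paths
    desc="Running tokenizer on dataset line_by_line",
)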