huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools

Home Page: https://huggingface.co/docs/datasets

Regression bug: `NonMatchingSplitsSizesError` for (possibly) overwritten dataset

finiteautomata opened this issue

Describe the bug

While trying to load the dataset https://huggingface.co/datasets/pysentimiento/spanish-tweets-small, I get this error:

---------------------------------------------------------------------------
NonMatchingSplitsSizesError               Traceback (most recent call last)
<ipython-input-1-d6a3c721d3b8> in <cell line: 3>()
      1 from datasets import load_dataset
      2 
----> 3 ds = load_dataset("pysentimiento/spanish-tweets-small")

3 frames
/usr/local/lib/python3.10/dist-packages/datasets/load.py in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, ignore_verifications, keep_in_memory, save_infos, revision, token, use_auth_token, task, streaming, num_proc, storage_options, **config_kwargs)
   2150 
   2151     # Download and prepare data
-> 2152     builder_instance.download_and_prepare(
   2153         download_config=download_config,
   2154         download_mode=download_mode,

/usr/local/lib/python3.10/dist-packages/datasets/builder.py in download_and_prepare(self, output_dir, download_config, download_mode, verification_mode, ignore_verifications, try_from_hf_gcs, dl_manager, base_path, use_auth_token, file_format, max_shard_size, num_proc, storage_options, **download_and_prepare_kwargs)
    946                         if num_proc is not None:
    947                             prepare_split_kwargs["num_proc"] = num_proc
--> 948                         self._download_and_prepare(
    949                             dl_manager=dl_manager,
    950                             verification_mode=verification_mode,

/usr/local/lib/python3.10/dist-packages/datasets/builder.py in _download_and_prepare(self, dl_manager, verification_mode, **prepare_split_kwargs)
   1059 
   1060         if verification_mode == VerificationMode.BASIC_CHECKS or verification_mode == VerificationMode.ALL_CHECKS:
-> 1061             verify_splits(self.info.splits, split_dict)
   1062 
   1063         # Update the info object with the splits.

/usr/local/lib/python3.10/dist-packages/datasets/utils/info_utils.py in verify_splits(expected_splits, recorded_splits)
     98     ]
     99     if len(bad_splits) > 0:
--> 100         raise NonMatchingSplitsSizesError(str(bad_splits))
    101     logger.info("All the splits matched successfully.")
    102 

NonMatchingSplitsSizesError: [{'expected': SplitInfo(name='train', num_bytes=82649695458, num_examples=597433111, shard_lengths=None, dataset_name=None), 'recorded': SplitInfo(name='train', num_bytes=3358310095, num_examples=24898932, shard_lengths=[3626991, 3716991, 4036990, 3506990, 3676990, 3716990, 2616990], dataset_name='spanish-tweets-small')}]

I think I overwrote this dataset at some point, so the split sizes recorded in its metadata no longer match the actual data (the error above expects ~597M examples but records only ~24.9M); this might be related to #6271.

It loads fine as late as 2.10.0, but fails from 2.13.0 onwards.
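
In the meantime, a possible workaround, assuming the mismatch comes from stale cached split metadata, is to force a fresh download so the cached info is rebuilt:

from datasets import load_dataset

# Assumption on my part: the recorded split sizes in the local cache are
# stale; forcing a re-download rebuilds the cache from the current data.
ds = load_dataset(
    "pysentimiento/spanish-tweets-small",
    download_mode="force_redownload",
)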

Steps to reproduce the bug

from datasets import load_dataset

ds = load_dataset("pysentimiento/spanish-tweets-small")

You can run it in this notebook
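
If the data itself is intact and only the recorded sizes are outdated, skipping the split verification also lets the load go through; a minimal sketch using load_dataset's verification_mode parameter:

from datasets import load_dataset

# Skip the size/checksum verification entirely; the data still loads,
# it just is not validated against the recorded split metadata.
ds = load_dataset(
    "pysentimiento/spanish-tweets-small",
    verification_mode="no_checks",
)

This only sidesteps the check, of course; the recorded split sizes still need to be fixed at the source.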

Expected behavior

The dataset loads without any error.

Environment info

  • datasets version: 2.13.0
  • Platform: Linux-6.1.58+-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.20.3
  • PyArrow version: 14.0.2
  • Pandas version: 2.0.3