Regression bug: `NonMatchingSplitsSizesError` for (possibly) overwritten dataset
finiteautomata opened this issue · comments
Describe the bug
While trying to load the dataset https://huggingface.co/datasets/pysentimiento/spanish-tweets-small, I get this error:
---------------------------------------------------------------------------
NonMatchingSplitsSizesError Traceback (most recent call last)
<ipython-input-1-d6a3c721d3b8> in <cell line: 3>()
1 from datasets import load_dataset
2
----> 3 ds = load_dataset("pysentimiento/spanish-tweets-small")
3 frames
/usr/local/lib/python3.10/dist-packages/datasets/load.py in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, ignore_verifications, keep_in_memory, save_infos, revision, token, use_auth_token, task, streaming, num_proc, storage_options, **config_kwargs)
2150
2151 # Download and prepare data
-> 2152 builder_instance.download_and_prepare(
2153 download_config=download_config,
2154 download_mode=download_mode,
/usr/local/lib/python3.10/dist-packages/datasets/builder.py in download_and_prepare(self, output_dir, download_config, download_mode, verification_mode, ignore_verifications, try_from_hf_gcs, dl_manager, base_path, use_auth_token, file_format, max_shard_size, num_proc, storage_options, **download_and_prepare_kwargs)
946 if num_proc is not None:
947 prepare_split_kwargs["num_proc"] = num_proc
--> 948 self._download_and_prepare(
949 dl_manager=dl_manager,
950 verification_mode=verification_mode,
/usr/local/lib/python3.10/dist-packages/datasets/builder.py in _download_and_prepare(self, dl_manager, verification_mode, **prepare_split_kwargs)
1059
1060 if verification_mode == VerificationMode.BASIC_CHECKS or verification_mode == VerificationMode.ALL_CHECKS:
-> 1061 verify_splits(self.info.splits, split_dict)
1062
1063 # Update the info object with the splits.
/usr/local/lib/python3.10/dist-packages/datasets/utils/info_utils.py in verify_splits(expected_splits, recorded_splits)
98 ]
99 if len(bad_splits) > 0:
--> 100 raise NonMatchingSplitsSizesError(str(bad_splits))
101 logger.info("All the splits matched successfully.")
102
NonMatchingSplitsSizesError: [{'expected': SplitInfo(name='train', num_bytes=82649695458, num_examples=597433111, shard_lengths=None, dataset_name=None), 'recorded': SplitInfo(name='train', num_bytes=3358310095, num_examples=24898932, shard_lengths=[3626991, 3716991, 4036990, 3506990, 3676990, 3716990, 2616990], dataset_name='spanish-tweets-small')}]
I think I updated this dataset at some point, so this might be related to #6271.
It works fine as late as 2.10.0, but not from 2.13.0 onwards.
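As a quick sanity check on the numbers in the error message, here is a small sketch in plain Python (no `datasets` needed; the variable names are my own). It checks that the recorded shard lengths add up to the recorded example count and compares the recorded sizes against the expected ones. A result like this would be consistent with the expected values being stale metadata from an earlier, much larger version of the dataset, though that is an assumption on my part.

```python
# Numbers copied from the NonMatchingSplitsSizesError message above.
expected_examples = 597_433_111
expected_bytes = 82_649_695_458
recorded_examples = 24_898_932
recorded_bytes = 3_358_310_095
recorded_shard_lengths = [
    3626991, 3716991, 4036990, 3506990, 3676990, 3716990, 2616990,
]

# Do the shards that were actually written add up to the recorded total?
shards_consistent = sum(recorded_shard_lengths) == recorded_examples

# How large is the recorded split relative to what the metadata expects?
example_ratio = recorded_examples / expected_examples
byte_ratio = recorded_bytes / expected_bytes

print(f"shards sum to recorded total: {shards_consistent}")
print(f"recorded / expected examples: {example_ratio:.2%}")
print(f"recorded / expected bytes:    {byte_ratio:.2%}")
```

The recorded split is internally consistent and comes out at roughly 4% of the expected size, both in examples and in bytes.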
Steps to reproduce the bug
from datasets import load_dataset
ds = load_dataset("pysentimiento/spanish-tweets-small")
You can run it in this notebook
Expected behavior
Load the dataset without any error
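A possible workaround sketch, not a confirmed fix: the traceback shows that `verify_splits` only runs when `verification_mode` is `BASIC_CHECKS` or `ALL_CHECKS`, so passing `verification_mode="no_checks"` to `load_dataset` should skip the size comparison. The helper name `load_without_split_checks` is hypothetical, and this only sidesteps the check; it does not correct whatever metadata produces the expected sizes.

```python
def load_without_split_checks(repo_id: str = "pysentimiento/spanish-tweets-small"):
    """Hypothetical helper: load a Hub dataset while skipping split-size verification."""
    # Imported lazily so that defining the helper does not require `datasets`.
    from datasets import load_dataset

    # "no_checks" corresponds to VerificationMode.NO_CHECKS, which bypasses
    # the verify_splits call shown in the traceback.
    return load_dataset(repo_id, verification_mode="no_checks")
```

Calling `load_without_split_checks()` downloads the full dataset, so it is best tried in the same notebook environment used to reproduce the error.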
Environment info
- datasets version: 2.13.0
- Platform: Linux-6.1.58+-x86_64-with-glibc2.35
- Python version: 3.10.12
- Huggingface_hub version: 0.20.3
- PyArrow version: 14.0.2
- Pandas version: 2.0.3