Fail to load "stas/c4-en-10k" dataset since version 2.16
guch8017 opened this issue · comments
Describe the bug
After updating the datasets library to version 2.16+ (I tested 2.16, 2.19.0, and 2.19.1), loading the stas/c4-en-10k dataset with the following code
from datasets import load_dataset, Dataset
dataset = load_dataset('stas/c4-en-10k')
raises a UnicodeDecodeError:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/*/conda3/envs/watermark/lib/python3.10/site-packages/datasets/load.py", line 2523, in load_dataset
builder_instance = load_dataset_builder(
File "/home/*/conda3/envs/watermark/lib/python3.10/site-packages/datasets/load.py", line 2195, in load_dataset_builder
dataset_module = dataset_module_factory(
File "/home/*/conda3/envs/watermark/lib/python3.10/site-packages/datasets/load.py", line 1846, in dataset_module_factory
raise e1 from None
File "/home/*/conda3/envs/watermark/lib/python3.10/site-packages/datasets/load.py", line 1798, in dataset_module_factory
can_load_config_from_parquet_export = "DEFAULT_CONFIG_NAME" not in f.read()
File "/home/*/conda3/envs/watermark/lib/python3.10/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
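The failing byte is telling: 0x8b is the second byte of the gzip magic number (b'\x1f\x8b'), and it is never a valid UTF-8 start byte, which is why the decode fails at position 1 exactly as in the traceback. A minimal sketch of the same failure in isolation:

```python
import gzip

# gzip-compress some text, then try to read it back as UTF-8 plain text
compressed = gzip.compress(b"# coding=utf-8\n")
print(compressed[:2])  # gzip magic number: b'\x1f\x8b'

try:
    compressed.decode("utf-8")
except UnicodeDecodeError as e:
    # 0x8b at position 1 is an invalid UTF-8 start byte,
    # matching the traceback above
    print(e)
```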
I found that fs.open loads a gzip file and then parses it as plain text with the UTF-8 decoder:

import gzip
from huggingface_hub import HfFileSystem

fs = HfFileSystem()
with fs.open("datasets/stas/c4-en-10k/c4-en-10k.py", "rb") as f:
    data = f.read()  # data is gzip bytes beginning with b'\x1f\x8b\x08\x00\x00\tn\x88\x00...'
data2 = gzip.decompress(data).decode("utf-8")  # data2 is what we want: '# coding=utf-8\n# Copyright 2020 The HuggingFace Datasets...'
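As a workaround-style sketch of the idea above: since a gzip payload always starts with the two magic bytes b'\x1f\x8b', a defensive read can detect them and decompress before decoding. Note that read_text_maybe_gzip is a hypothetical helper for illustration, not part of datasets or huggingface_hub:

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"

def read_text_maybe_gzip(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode raw bytes as text, transparently decompressing gzip payloads."""
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return raw.decode(encoding)

# Plain text passes through unchanged; gzip bytes are decompressed first
assert read_text_maybe_gzip(b"# coding=utf-8\n") == "# coding=utf-8\n"
assert read_text_maybe_gzip(gzip.compress(b"# coding=utf-8\n")) == "# coding=utf-8\n"
```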
Steps to reproduce the bug
- Install datasets between version 2.16 and 2.19
- Use the datasets.load_dataset method to load the stas/c4-en-10k dataset.
Expected behavior
Load dataset normally.
Environment info
Platform = Linux-5.4.0-159-generic-x86_64-with-glibc2.35
Python = 3.10.14
Datasets = 2.19
I am not able to reproduce the error with datasets 2.19.1:
In [1]: from datasets import load_dataset; ds = load_dataset("stas/c4-en-10k", streaming=True); item = next(iter(ds["train"])); item
Out[1]: {'text': 'Beginners BBQ Class Taking Place in Missoula!\nDo you want to get better at making delicious BBQ? You will have the opportunity, put this on your calendar now. Thursday, September 22nd join World Class BBQ Champion, Tony Balay from Lonestar Smoke Rangers. He will be teaching a beginner level class for everyone who wants to get better with their culinary skills.\nHe will teach you everything you need to know to compete in a KCBS BBQ competition, including techniques, recipes, timelines, meat selection and trimming, plus smoker and fire information.\nThe cost to be in the class is $35 per person, and for spectators it is free. Included in the cost will be either a t-shirt or apron and you will be tasting samples of each meat that is prepared.'}
In [2]: from datasets import load_dataset; ds = load_dataset("stas/c4-en-10k", download_mode="force_redownload"); ds
Downloading data: 100%|██████████████████████████████████████████████████| 13.3M/13.3M [00:00<00:00, 18.7MB/s]
Generating train split: 100%|██████████████████████████████████████████████████| 10000/10000 [00:00<00:00, 78548.55 examples/s]
Out[2]:
DatasetDict({
train: Dataset({
features: ['text'],
num_rows: 10000
})
})
Looking at your error traceback, I notice that the code line numbers do not correspond to those of datasets 2.19.1.
Additionally, I can't reproduce the issue with HfFileSystem:
In [1]: from huggingface_hub import HfFileSystem
In [2]: fs = HfFileSystem()
In [3]: with fs.open("datasets/stas/c4-en-10k/c4-en-10k.py", "rb") as f:
...: data = f.read()
...:
In [4]: data[:20]
Out[4]: b'# coding=utf-8\n# Cop'
Could you please verify the datasets and huggingface_hub versions you are indeed using?
import datasets; print(datasets.__version__)
import huggingface_hub; print(huggingface_hub.__version__)
Thanks for your reply! After I updated datasets from version 2.15.0 back to 2.19.1, everything seems to work well. Sorry for bothering you!