Fail to load "stas/c4-en-10k" dataset since version 2.16
guch8017 opened this issue · comments
Describe the bug
After updating the datasets library to version 2.16+ (I tested 2.16, 2.19.0, and 2.19.1), loading the stas/c4-en-10k dataset with the following code
from datasets import load_dataset, Dataset
dataset = load_dataset('stas/c4-en-10k')
raises a UnicodeDecodeError:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/*/conda3/envs/watermark/lib/python3.10/site-packages/datasets/load.py", line 2523, in load_dataset
builder_instance = load_dataset_builder(
File "/home/*/conda3/envs/watermark/lib/python3.10/site-packages/datasets/load.py", line 2195, in load_dataset_builder
dataset_module = dataset_module_factory(
File "/home/*/conda3/envs/watermark/lib/python3.10/site-packages/datasets/load.py", line 1846, in dataset_module_factory
raise e1 from None
File "/home/*/conda3/envs/watermark/lib/python3.10/site-packages/datasets/load.py", line 1798, in dataset_module_factory
can_load_config_from_parquet_export = "DEFAULT_CONFIG_NAME" not in f.read()
File "/home/*/conda3/envs/watermark/lib/python3.10/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
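The failing byte is telling: 0x8b is the second byte of the gzip magic number (b'\x1f\x8b'), and it is never a valid UTF-8 start byte, which is why the decode fails at position 1 exactly as in the traceback. A minimal sketch of the same failure in isolation:

```python
import gzip

# gzip-compress some text, then try to read it back as UTF-8 plain text
compressed = gzip.compress(b"# coding=utf-8\n")
print(compressed[:2])  # gzip magic number: b'\x1f\x8b'

try:
    compressed.decode("utf-8")
except UnicodeDecodeError as e:
    # 0x8b at position 1 is an invalid UTF-8 start byte,
    # matching the traceback above
    print(e)
```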
I found that fs.open loads a gzip file and then parses it as plain text with the UTF-8 decoder:

import gzip
from huggingface_hub import HfFileSystem

fs = HfFileSystem()
with fs.open("datasets/stas/c4-en-10k/c4-en-10k.py", "rb") as f:
    data = f.read()  # data is gzip bytes beginning with b'\x1f\x8b\x08\x00\x00\tn\x88\x00...'
data2 = gzip.decompress(data).decode("utf-8")  # data2 is what we want: '# coding=utf-8\n# Copyright 2020 The HuggingFace Datasets...'
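As a workaround-style sketch of the idea above: since a gzip payload always starts with the two magic bytes b'\x1f\x8b', a defensive read can detect them and decompress before decoding. Note that read_text_maybe_gzip is a hypothetical helper for illustration, not part of datasets or huggingface_hub:

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"

def read_text_maybe_gzip(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode raw bytes as text, transparently decompressing gzip payloads."""
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return raw.decode(encoding)

# Plain text passes through unchanged; gzip bytes are decompressed first
assert read_text_maybe_gzip(b"# coding=utf-8\n") == "# coding=utf-8\n"
assert read_text_maybe_gzip(gzip.compress(b"# coding=utf-8\n")) == "# coding=utf-8\n"
```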
Steps to reproduce the bug
- Install datasets between version 2.16 and 2.19
- Use the datasets.load_dataset method to load the stas/c4-en-10k dataset.
Expected behavior
Load dataset normally.
Environment info
Platform = Linux-5.4.0-159-generic-x86_64-with-glibc2.35
Python = 3.10.14
Datasets = 2.19
I am not able to reproduce the error with datasets 2.19.1:
In [1]: from datasets import load_dataset; ds = load_dataset("stas/c4-en-10k", streaming=True); item = next(iter(ds["train"])); item
Out[1]: {'text': 'Beginners BBQ Class Taking Place in Missoula!\nDo you want to get better at making delicious BBQ? You will have the opportunity, put this on your calendar now. Thursday, September 22nd join World Class BBQ Champion, Tony Balay from Lonestar Smoke Rangers. He will be teaching a beginner level class for everyone who wants to get better with their culinary skills.\nHe will teach you everything you need to know to compete in a KCBS BBQ competition, including techniques, recipes, timelines, meat selection and trimming, plus smoker and fire information.\nThe cost to be in the class is $35 per person, and for spectators it is free. Included in the cost will be either a t-shirt or apron and you will be tasting samples of each meat that is prepared.'}
In [2]: from datasets import load_dataset; ds = load_dataset("stas/c4-en-10k", download_mode="force_redownload"); ds
Downloading data: 100%|██████████████████████████████████████████████████| 13.3M/13.3M [00:00<00:00, 18.7MB/s]
Generating train split: 100%|██████████████████████████████████████████████████| 10000/10000 [00:00<00:00, 78548.55 examples/s]
Out[2]:
DatasetDict({
train: Dataset({
features: ['text'],
num_rows: 10000
})
})
Looking at your error traceback, I notice that the code line numbers do not correspond to those of datasets 2.19.1.
Additionally, I can't reproduce the issue with HfFileSystem:
In [1]: from huggingface_hub import HfFileSystem
In [2]: fs = HfFileSystem()
In [3]: with fs.open("datasets/stas/c4-en-10k/c4-en-10k.py", "rb") as f:
...: data = f.read()
...:
In [4]: data[:20]
Out[4]: b'# coding=utf-8\n# Cop'
Could you please verify the datasets and huggingface_hub versions you are indeed using?
import datasets; print(datasets.__version__)
import huggingface_hub; print(huggingface_hub.__version__)
Thanks for your reply! After I updated datasets from version 2.15.0 back to 2.19.1, everything seems to work well. Sorry for bothering you!