huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools

Home Page: https://huggingface.co/docs/datasets

load_dataset() should load all subsets, if no specific subset is specified

windmaple opened this issue

Feature request

Currently, load_dataset() forces users to specify a subset (config) when a dataset defines several. For example:

from datasets import load_dataset
dataset = load_dataset("m-a-p/COIG-CQIA")

ValueError                                Traceback (most recent call last)
<ipython-input-10-c0cb49385da6> in <cell line: 2>()
      1 from datasets import load_dataset
----> 2 dataset = load_dataset("m-a-p/COIG-CQIA")

3 frames
/usr/local/lib/python3.10/dist-packages/datasets/builder.py in _create_builder_config(self, config_name, custom_features, **config_kwargs)
    582                     if not config_kwargs:
    583                         example_of_usage = f"load_dataset('{self.dataset_name}', '{self.BUILDER_CONFIGS[0].name}')"
--> 584                         raise ValueError(
    585                             "Config name is missing."
    586                             f"\nPlease pick one among the available configs: {list(self.builder_configs.keys())}"

ValueError: Config name is missing.
Please pick one among the available configs: ['chinese_traditional', 'coig_pc', 'exam', 'finance', 'douban', 'human_value', 'logi_qa', 'ruozhiba', 'segmentfault', 'wiki', 'wikihow', 'xhs', 'zhihu']
Example of usage:
	`load_dataset('coig-cqia', 'chinese_traditional')`

This means a single load_dataset() call cannot return all the subsets at once. I guess one workaround is to manually specify the subset files, like in here, which is clumsy.
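For reference, another possible workaround is to loop over the config names and load each subset separately. The following is only a minimal sketch: it assumes `get_dataset_config_names` from `datasets` returns the same config names listed in the error above, and `load_all_subsets` with its `list_configs` / `load_one` parameters are hypothetical names made up for illustration, not part of the library.

```python
from typing import Callable, Dict, Optional


def load_all_subsets(
    path: str,
    list_configs: Optional[Callable] = None,
    load_one: Optional[Callable] = None,
) -> Dict[str, object]:
    """Load every config (subset) of a dataset into a dict keyed by config name.

    `list_configs` and `load_one` are hypothetical hooks; when omitted they
    default to `datasets.get_dataset_config_names` and `datasets.load_dataset`.
    """
    if list_configs is None or load_one is None:
        # Imported lazily so the helper can also be exercised with stand-ins.
        from datasets import get_dataset_config_names, load_dataset

        list_configs = list_configs or get_dataset_config_names
        load_one = load_one or load_dataset

    # One load call per config name, e.g. load_dataset(path, "ruozhiba").
    return {name: load_one(path, name) for name in list_configs(path)}
```

With the defaults, `load_all_subsets("m-a-p/COIG-CQIA")` should return one entry per config, so a single subset would be available as `subsets["ruozhiba"]`. This still issues one load per subset, so it does not address the underlying request.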

Motivation

Ideally, if no subset is specified, the API should just try to load all subsets. This would make it much easier to handle datasets with subsets.

Your contribution

Not sure, since I'm not familiar with the library source.