clovaai / donut

Official Implementation of OCR-free Document Understanding Transformer (Donut) and Synthetic Document Generator (SynthDoG), ECCV 2022

Home Page: https://arxiv.org/abs/2111.15664


dataset script missing error

segaranp opened this issue · comments

Hi,

I'm using this project with my own custom dataset. I created sample data in the dataset folder as specified in the README.md, with a folder for each split (train/validation/test) and the metadata files.

Then I ran this command:

python train.py --config config/train_cord.yaml --pretrained_model_name_or_path "naver-clova-ix/donut-base" --dataset_name_or_paths 'C:\ocr\2\donut\dataset' --exp_version "test_experiment"

But I'm getting this error:

  File "C:\ocr\2\donut\train.py", line 176, in <module>
    train(config)
  File "C:\ocr\2\donut\train.py", line 104, in train
    DonutDataset(
  File "C:\ocr\2\donut\donut\util.py", line 64, in __init__
    self.dataset = load_dataset(dataset_name_or_path, split=self.split)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\thaba\miniconda3\Lib\site-packages\datasets\load.py", line 2129, in load_dataset
    builder_instance = load_dataset_builder(
                       ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\thaba\miniconda3\Lib\site-packages\datasets\load.py", line 1815, in load_dataset_builder
    dataset_module = dataset_module_factory(
                     ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\thaba\miniconda3\Lib\site-packages\datasets\load.py", line 1508, in dataset_module_factory
    raise FileNotFoundError(
FileNotFoundError: Couldn't find a dataset script at C:\ocr\2\donut\C\C.py or any data file in the same directory. Couldn't find 'C' on the Hugging Face Hub either: FileNotFoundError: Dataset 'C' doesn't exist on the Hub. If the repo is private or gated, make sure to log in with `huggingface-cli login`.

Does anyone know how to resolve this?

This will help: https://huggingface.co/docs/datasets/image_load

The easiest way is to add a README.md to your dataset folder whose YAML front matter declares the splits, so that the directory/file structure looks like this:

```yaml
configs:
- config_name: default
  data_files:
  - split: train
    path: "train"
  - split: test
    path: "test"
  - split: validation
    path: "validation"
```
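As a minimal sketch (assuming a dataset root named `dataset`; the folder name is a placeholder, not from the thread), the expected layout and README.md front matter can be generated with a few lines of Python:

```python
import pathlib

root = pathlib.Path("dataset")  # hypothetical dataset root folder

# YAML front matter declaring the three splits, as described above
front_matter = """---
configs:
- config_name: default
  data_files:
  - split: train
    path: "train"
  - split: test
    path: "test"
  - split: validation
    path: "validation"
---
"""

# Create one subfolder per split; image/metadata files go inside these
for split in ("train", "test", "validation"):
    (root / split).mkdir(parents=True, exist_ok=True)

# Write the README.md with the split configuration
(root / "README.md").write_text(front_matter)
```

With this structure in place, `datasets.load_dataset` should resolve the local folder instead of trying to look the path up as a Hub dataset name.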