clovaai / donut

Official Implementation of OCR-free Document Understanding Transformer (Donut) and Synthetic Document Generator (SynthDoG), ECCV 2022

Home Page: https://arxiv.org/abs/2111.15664


dataset script missing error

segaranp opened this issue · comments

Hi,

I'm using this project with my own custom dataset. I created sample data in the dataset folder as specified in the README.md, with a folder for each split (train/validation/test) and the metadata files.

Then I ran this command:

python train.py --config config/train_cord.yaml --pretrained_model_name_or_path "naver-clova-ix/donut-base" --dataset_name_or_paths 'C:\ocr\2\donut\dataset' --exp_version "test_experiment"

But I'm getting this error:

  File "C:\ocr\2\donut\train.py", line 176, in <module>
    train(config)
  File "C:\ocr\2\donut\train.py", line 104, in train
    DonutDataset(
  File "C:\ocr\2\donut\donut\util.py", line 64, in __init__
    self.dataset = load_dataset(dataset_name_or_path, split=self.split)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\thaba\miniconda3\Lib\site-packages\datasets\load.py", line 2129, in load_dataset
    builder_instance = load_dataset_builder(
                       ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\thaba\miniconda3\Lib\site-packages\datasets\load.py", line 1815, in load_dataset_builder
    dataset_module = dataset_module_factory(
                     ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\thaba\miniconda3\Lib\site-packages\datasets\load.py", line 1508, in dataset_module_factory
    raise FileNotFoundError(
FileNotFoundError: Couldn't find a dataset script at C:\ocr\2\donut\C\C.py or any data file in the same directory. Couldn't find 'C' on the Hugging Face Hub either: FileNotFoundError: Dataset 'C' doesn't exist on the Hub. If the repo is private or gated, make sure to log in with `huggingface-cli login`.

Does anyone know how to resolve this?

This will help: https://huggingface.co/docs/datasets/image_load

The easiest way is to add a README.md to your dataset folder whose YAML front matter declares the splits, so that the directory/file structure looks like this:

```yaml
configs:
- config_name: default
  data_files:
  - split: train
    path: "train"
  - split: test
    path: "test"
  - split: validation
    path: "validation"
```
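As a minimal sketch (assuming a dataset root named `dataset`; the folder name is a placeholder, not from the thread), the expected layout and README.md front matter can be generated with a few lines of Python:

```python
import pathlib

root = pathlib.Path("dataset")  # hypothetical dataset root folder

# YAML front matter declaring the three splits, as described above
front_matter = """---
configs:
- config_name: default
  data_files:
  - split: train
    path: "train"
  - split: test
    path: "test"
  - split: validation
    path: "validation"
---
"""

# Create one subfolder per split; image/metadata files go inside these
for split in ("train", "test", "validation"):
    (root / split).mkdir(parents=True, exist_ok=True)

# Write the README.md with the split configuration
(root / "README.md").write_text(front_matter)
```

With this structure in place, `datasets.load_dataset` should resolve the local folder instead of trying to look the path up as a Hub dataset name.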