allenai / dolma

Data and tools for generating and inspecting OLMo pre-training data.

Home Page: https://allenai.github.io/dolma/

"Tokenizer name or path must be provided" error

RohitRathore1 opened this issue

I am running into an issue while tokenizing the Wikipedia dataset in the step below. My tokenizer file is in the root of this repository, at the relative path dolma/EleutherAI/allenai_eleuther-ai-gpt-neox-20b-pii-special.json.

The command, its parsed config, and the traceback are as follows:

~/dolma$ dolma tokens \
>     --documents "wikipedia/example0/documents/*.gz" \
>     --tokenizer_name_or_path "EleutherAI/allenai_eleuther-ai-gpt-neox-20b-pii-special.json" \
>     --destination wikipedia/example0/tokens \
>     --processes 16
batch_size: 10000
debug: false
destination: wikipedia/example0/tokens
documents:
- wikipedia/example0/documents/*.gz
dryrun: false
dtype: uint16
files_per_process: null
max_size: 1073741824
processes: 16
ring_size: 8
seed: 3920
tokenizer:
  bos_token_id: null
  eos_token_id: null
  name_or_path: null
  pad_token_id: null
  segment_before_tokenization: false
tokenizer_name_or_path: dolma/EleutherAI/allenai_eleuther-ai-gpt-neox-20b-pii-special.json
work_dir:
  input: null
  output: null
Traceback (most recent call last):
  File "/home/TeAmP0is0N/anaconda3/envs/dolma/bin/dolma", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/TeAmP0is0N/dolma/python/dolma/cli/__main__.py", line 91, in main
    return cli.run_from_args(args=args, config=config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/TeAmP0is0N/dolma/python/dolma/cli/__init__.py", line 190, in run_from_args
    return cls.run(parsed_config)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/TeAmP0is0N/dolma/python/dolma/cli/tokenizer.py", line 181, in run
    raise DolmaConfigError("Tokenizer name or path must be provided.")
dolma.core.errors.DolmaConfigError: Tokenizer name or path must be provided.
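
Judging from the traceback, the tokens command only validates the nested tokenizer.name_or_path field, even though the deprecated top-level tokenizer_name_or_path is parsed (it shows up in the config dump above). A sketch of what the check at python/dolma/cli/tokenizer.py line 181 presumably looks like, reconstructed from the traceback rather than the actual source:

# Guess at the validation in dolma's tokenizer CLI; only the raise
# statement is verbatim from the traceback above.
if parsed_config.tokenizer.name_or_path is None:
    raise DolmaConfigError("Tokenizer name or path must be provided.")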

Hi @RohitRathore1! I ran into the same issue, and I think the flag should be tokenizer.name_or_path rather than tokenizer_name_or_path.

Here's my quick fix:

dolma tokens \
    --documents "wikipedia/example0/documents/*.gz" \
    --tokenizer.name_or_path "EleutherAI/gpt-neox-20b" \
    --tokenizer.bos_token_id 0 \
    --destination wikipedia/example0/tokens \
    --processes 16

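For reference, the same settings can also be written as a YAML config file and passed with dolma's -c flag (a sketch based on the config dump above, assuming -c works for tokens as it does for other dolma commands; the values are illustrative):

# tokens.yaml -- run with: dolma -c tokens.yaml tokens
documents:
  - wikipedia/example0/documents/*.gz
destination: wikipedia/example0/tokens
processes: 16
tokenizer:
  name_or_path: EleutherAI/gpt-neox-20b
  bos_token_id: 0
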
Hi @koalazf99, thanks! Yes, you are right: we should use tokenizer.name_or_path, and there are some typos in the documentation. By the way, can you verify this? After running the dolma tokens command I got the following output:

dolma tokens \
>     --documents "wikipedia/example0/documents/*.gz" \
>     --tokenizer.name_or_path "EleutherAI/allenai_eleuther-ai-gpt-neox-20b-pii-special.json" \
>     --tokenizer.bos_token_id 0 \
>     --destination wikipedia/example0/tokens \
>     --processes 16
batch_size: 10000
debug: false
destination: wikipedia/example0/tokens
documents:
- wikipedia/example0/documents/*.gz
dryrun: false
dtype: uint16
files_per_process: null
max_size: 1073741824
processes: 16
ring_size: 8
seed: 3920
tokenizer:
  bos_token_id: 0
  eos_token_id: null
  name_or_path: EleutherAI/allenai_eleuther-ai-gpt-neox-20b-pii-special.json
  pad_token_id: null
  segment_before_tokenization: false
tokenizer_name_or_path: null
work_dir:
  input: null
  output: null
files: 0.00f [00:00, ?f/s]
2024-02-04 09:07:49,914 WARNING SpawnPoolWorker-15.dolma.dolma.tokenizer.executor pad_token_id not provided, using eos_token_id
[... the same executor warning repeated by the other 15 workers ...]
2024-02-04 09:07:49,987 WARNING SpawnPoolWorker-15.dolma.dolma.tokenizer.tokenizer No pad token ID provided; using EOS token ID None.
[... the same tokenizer warning repeated by the other 15 workers ...]
memmaps: 16.0m [00:00, 20.0m/s]
tokens: 0.00t [00:00, ?t/s]
documents: 0.00d [00:00, ?d/s]
files: 1.00f [00:00, 1.25f/s]
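
To sanity-check the result, one way is to open one of the produced token files with numpy. A minimal sketch, assuming the outputs under wikipedia/example0/tokens are raw uint16 token arrays as the dtype: uint16 config above suggests (the exact filenames and on-disk format may differ):

import glob
import numpy as np

# Pick one of the token files written by `dolma tokens`.
paths = sorted(glob.glob("wikipedia/example0/tokens/*"))

# Assumption: the file is a raw uint16 memmap with no header; if it turns
# out to be a real .npy file, use np.load(paths[0], mmap_mode="r") instead.
tokens = np.memmap(paths[0], dtype=np.uint16, mode="r")
print(f"{paths[0]}: {tokens.size} tokens; first 16: {tokens[:16].tolist()}")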

Those warnings are expected when pad_token_id is not provided. You probably want to add --tokenizer.pad_token_id 1 when calling the CLI.
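
Putting it all together, the full command with the pad token set explicitly would look like this (bos/eos 0 and pad 1 match the GPT-NeoX-20B vocabulary, where ID 0 is <|endoftext|> and ID 1 is <|padding|>; adjust for other tokenizers):

dolma tokens \
    --documents "wikipedia/example0/documents/*.gz" \
    --tokenizer.name_or_path "EleutherAI/allenai_eleuther-ai-gpt-neox-20b-pii-special.json" \
    --tokenizer.bos_token_id 0 \
    --tokenizer.eos_token_id 0 \
    --tokenizer.pad_token_id 1 \
    --destination wikipedia/example0/tokens \
    --processes 16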