Tokenizer name or path must be found error
RohitRathore1 opened this issue · comments
I am running into an issue while tokenizing the Wikipedia dataset mentioned in the following step. My tokenizer file is in the root of this repository, at the relative path dolma/EleutherAI/allenai_eleuther-ai-gpt-neox-20b-pii-special.json.
The command, the resolved configuration, and the traceback are as follows:
~/dolma$ dolma tokens \
> --documents "wikipedia/example0/documents/*.gz" \
> --tokenizer_name_or_path "EleutherAI/allenai_eleuther-ai-gpt-neox-20b-pii-special.json" \
> --destination wikipedia/example0/tokens \
> --processes 16
batch_size: 10000
debug: false
destination: wikipedia/example0/tokens
documents:
- wikipedia/example0/documents/*.gz
dryrun: false
dtype: uint16
files_per_process: null
max_size: 1073741824
processes: 16
ring_size: 8
seed: 3920
tokenizer:
bos_token_id: null
eos_token_id: null
name_or_path: null
pad_token_id: null
segment_before_tokenization: false
tokenizer_name_or_path: dolma/EleutherAI/allenai_eleuther-ai-gpt-neox-20b-pii-special.json
work_dir:
input: null
output: null
Traceback (most recent call last):
File "/home/TeAmP0is0N/anaconda3/envs/dolma/bin/dolma", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/TeAmP0is0N/dolma/python/dolma/cli/__main__.py", line 91, in main
return cli.run_from_args(args=args, config=config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/TeAmP0is0N/dolma/python/dolma/cli/__init__.py", line 190, in run_from_args
return cls.run(parsed_config)
^^^^^^^^^^^^^^^^^^^^^^
File "/home/TeAmP0is0N/dolma/python/dolma/cli/tokenizer.py", line 181, in run
raise DolmaConfigError("Tokenizer name or path must be provided.")
dolma.core.errors.DolmaConfigError: Tokenizer name or path must be provided.
Hi @RohitRathore1! I ran into the same issue, and I think we should change tokenizer_name_or_path to tokenizer.name_or_path.
Here's my quick fix:
dolma tokens \
--documents "wikipedia/example0/documents/*.gz" \
--tokenizer.name_or_path "EleutherAI/gpt-neox-20b" \
--tokenizer.bos_token_id 0 \
--destination wikipedia/example0/tokens \
--processes 16
Hi @koalazf99, thanks! Yes, you are right: we should use tokenizer.name_or_path, and there are some typos in the documentation. By the way, can you verify this? After running the dolma tokens command with that change, I got these results:
dolma tokens \
> --documents "wikipedia/example0/documents/*.gz" \
> --tokenizer.name_or_path "EleutherAI/allenai_eleuther-ai-gpt-neox-20b-pii-special.json" \
> --tokenizer.bos_token_id 0 \
> --destination wikipedia/example0/tokens \
> --processes 16
batch_size: 10000
debug: false
destination: wikipedia/example0/tokens
documents:
- wikipedia/example0/documents/*.gz
dryrun: false
dtype: uint16
files_per_process: null
max_size: 1073741824
processes: 16
ring_size: 8
seed: 3920
tokenizer:
bos_token_id: 0
eos_token_id: null
name_or_path: EleutherAI/allenai_eleuther-ai-gpt-neox-20b-pii-special.json
pad_token_id: null
segment_before_tokenization: false
tokenizer_name_or_path: null
work_dir:
input: null
output: null
files: 0.00f [00:00, ?f/s] 2024-02-04 09:07:49,914 WARNING SpawnPoolWorker-15.dolma.dolma.tokenizer.executor pad_token_id not provided, using eos_token_id
2024-02-04 09:07:49,916 WARNING SpawnPoolWorker-13.dolma.dolma.tokenizer.executor pad_token_id not provided, using eos_token_id
2024-02-04 09:07:49,917 WARNING SpawnPoolWorker-16.dolma.dolma.tokenizer.executor pad_token_id not provided, using eos_token_id
2024-02-04 09:07:49,919 WARNING SpawnPoolWorker-3.dolma.dolma.tokenizer.executor pad_token_id not provided, using eos_token_id
2024-02-04 09:07:49,919 WARNING SpawnPoolWorker-5.dolma.dolma.tokenizer.executor pad_token_id not provided, using eos_token_id
2024-02-04 09:07:49,920 WARNING SpawnPoolWorker-12.dolma.dolma.tokenizer.executor pad_token_id not provided, using eos_token_id
2024-02-04 09:07:49,920 WARNING SpawnPoolWorker-1.dolma.dolma.tokenizer.executor pad_token_id not provided, using eos_token_id
2024-02-04 09:07:49,920 WARNING SpawnPoolWorker-11.dolma.dolma.tokenizer.executor pad_token_id not provided, using eos_token_id
2024-02-04 09:07:49,921 WARNING SpawnPoolWorker-9.dolma.dolma.tokenizer.executor pad_token_id not provided, using eos_token_id
2024-02-04 09:07:49,921 WARNING SpawnPoolWorker-14.dolma.dolma.tokenizer.executor pad_token_id not provided, using eos_token_id
2024-02-04 09:07:49,921 WARNING SpawnPoolWorker-6.dolma.dolma.tokenizer.executor pad_token_id not provided, using eos_token_id
2024-02-04 09:07:49,958 WARNING SpawnPoolWorker-2.dolma.dolma.tokenizer.executor pad_token_id not provided, using eos_token_id
2024-02-04 09:07:49,967 WARNING SpawnPoolWorker-10.dolma.dolma.tokenizer.executor pad_token_id not provided, using eos_token_id
2024-02-04 09:07:49,983 WARNING SpawnPoolWorker-4.dolma.dolma.tokenizer.executor pad_token_id not provided, using eos_token_id
2024-02-04 09:07:49,987 WARNING SpawnPoolWorker-15.dolma.dolma.tokenizer.tokenizer No pad token ID provided; using EOS token ID None.
2024-02-04 09:07:49,994 WARNING SpawnPoolWorker-8.dolma.dolma.tokenizer.executor pad_token_id not provided, using eos_token_id
2024-02-04 09:07:50,030 WARNING SpawnPoolWorker-14.dolma.dolma.tokenizer.tokenizer No pad token ID provided; using EOS token ID None.
2024-02-04 09:07:50,032 WARNING SpawnPoolWorker-5.dolma.dolma.tokenizer.tokenizer No pad token ID provided; using EOS token ID None.
2024-02-04 09:07:50,034 WARNING SpawnPoolWorker-2.dolma.dolma.tokenizer.tokenizer No pad token ID provided; using EOS token ID None.
2024-02-04 09:07:50,034 WARNING SpawnPoolWorker-7.dolma.dolma.tokenizer.executor pad_token_id not provided, using eos_token_id
2024-02-04 09:07:50,035 WARNING SpawnPoolWorker-13.dolma.dolma.tokenizer.tokenizer No pad token ID provided; using EOS token ID None.
2024-02-04 09:07:50,037 WARNING SpawnPoolWorker-12.dolma.dolma.tokenizer.tokenizer No pad token ID provided; using EOS token ID None.
2024-02-04 09:07:50,041 WARNING SpawnPoolWorker-9.dolma.dolma.tokenizer.tokenizer No pad token ID provided; using EOS token ID None.
2024-02-04 09:07:50,043 WARNING SpawnPoolWorker-3.dolma.dolma.tokenizer.tokenizer No pad token ID provided; using EOS token ID None.
2024-02-04 09:07:50,047 WARNING SpawnPoolWorker-16.dolma.dolma.tokenizer.tokenizer No pad token ID provided; using EOS token ID None.
2024-02-04 09:07:50,049 WARNING SpawnPoolWorker-11.dolma.dolma.tokenizer.tokenizer No pad token ID provided; using EOS token ID None.
files: 0.00f [00:00, ?f/s] 2024-02-04 09:07:50,056 WARNING SpawnPoolWorker-1.dolma.dolma.tokenizer.tokenizer No pad token ID provided; using EOS token ID None.
2024-02-04 09:07:50,056 WARNING SpawnPoolWorker-10.dolma.dolma.tokenizer.tokenizer No pad token ID provided; using EOS token ID None.
2024-02-04 09:07:50,062 WARNING SpawnPoolWorker-4.dolma.dolma.tokenizer.tokenizer No pad token ID provided; using EOS token ID None.
2024-02-04 09:07:50,067 WARNING SpawnPoolWorker-6.dolma.dolma.tokenizer.tokenizer No pad token ID provided; using EOS token ID None.
2024-02-04 09:07:50,084 WARNING SpawnPoolWorker-8.dolma.dolma.tokenizer.tokenizer No pad token ID provided; using EOS token ID None.
2024-02-04 09:07:50,130 WARNING SpawnPoolWorker-7.dolma.dolma.tokenizer.tokenizer No pad token ID provided; using EOS token ID None.
memmaps: 16.0m [00:00, 20.0m/s]
tokens: 0.00t [00:00, ?t/s]/s]
documents: 0.00d [00:00, ?d/s]
files: 1.00f [00:00, 1.25f/s]s]
Those warnings are expected if you do not provide pad_token_id. You probably want to add --tokenizer.pad_token_id 1 when calling the CLI.
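For reference, here is a sketch of the quick-fix command from earlier in this thread with that flag added. It assumes the Hugging Face tokenizer EleutherAI/gpt-neox-20b and the same Wikipedia paths used above. The args array and the command -v guard are only illustrative scaffolding, so the snippet prints the command instead of failing when dolma is not installed.

```shell
#!/usr/bin/env bash
# Arguments from the quick fix earlier in the thread, plus the suggested pad token ID.
# The glob stays quoted so that dolma, not the shell, expands it.
args=(
  tokens
  --documents "wikipedia/example0/documents/*.gz"
  --tokenizer.name_or_path "EleutherAI/gpt-neox-20b"
  --tokenizer.bos_token_id 0
  --tokenizer.pad_token_id 1
  --destination wikipedia/example0/tokens
  --processes 16
)

if command -v dolma >/dev/null 2>&1; then
  dolma "${args[@]}"
else
  # dolma is not on PATH: show the command that would have run.
  printf 'dolma %s\n' "${args[*]}"
fi
```

With pad_token_id set explicitly, the "pad_token_id not provided" warnings should no longer appear.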