google / sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.

What is the appropriate vocabulary size (vocab_size) setting?

2088208 opened this issue · comments

import sentencepiece as sp

sp.SentencePieceTrainer.train(
    input='./data/corpus.txt',
    model_prefix='tokenizer',
    vocab_size=3000,
    character_coverage=1.0,
    model_type='bpe',
)
Given the training call above, what is an appropriate setting for vocab_size? Should it be determined from the training corpus, or by some other criterion?
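
One practical sanity check, sketched below under the assumption that ./data/corpus.txt from the call above is the training file: vocab_size cannot be smaller than the character inventory the model must cover (plus the default <unk>, <s>, </s> symbols), and the trainer rejects values larger than the number of distinct pieces it can actually extract from the corpus, so inspecting the corpus first gives rough lower and upper bounds. This is a sketch, not an official SentencePiece recommendation.

# Rough corpus statistics to help pick vocab_size (illustrative only).
with open('./data/corpus.txt', encoding='utf-8') as f:
    text = f.read()

unique_chars = len(set(text))      # lower bound: with character_coverage=1.0,
                                   # every distinct character needs a piece
approx_words = len(text.split())   # very rough indicator of corpus size

print('unique characters:', unique_chars)
print('whitespace-delimited tokens:', approx_words)

# vocab_size must be at least unique_chars plus the three default special
# symbols; if it is set higher than the corpus can support, training aborts
# with a "Vocabulary size too high" error that suggests a maximum value.

In practice the value is a modeling choice constrained by these bounds: small, single-domain corpora often work with a few thousand pieces, while larger or multilingual corpora commonly use vocabularies in the tens of thousands, balanced against the embedding size of the downstream model.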