What is the appropriate vocabulary length setting?
2088208 opened this issue
2088208 commented
import sentencepiece as sp

sp.SentencePieceTrainer.train(
    input='./data/corpus.txt',
    model_prefix='tokenizer',
    vocab_size=3000,
    character_coverage=1.0,
    model_type='bpe',
)
Given the call above, what is an appropriate setting for vocab_size? Is it something that should be determined from the training corpus, and if so, how?
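One corpus-dependent constraint worth noting: with character_coverage=1.0, SentencePiece keeps every distinct character in the corpus as a piece, and it will refuse to train if vocab_size is smaller than that character inventory plus the special tokens (&lt;unk&gt;, &lt;s&gt;, &lt;/s&gt;). The helper below is a hypothetical sketch (not part of the SentencePiece API) that estimates this lower bound from raw text:

```python
def min_vocab_size(text: str, n_special: int = 3) -> int:
    """Rough lower bound on vocab_size for character_coverage=1.0:
    every distinct non-whitespace character becomes a piece, plus
    the special tokens <unk>, <s>, </s> (n_special, an assumption
    matching SentencePiece's defaults)."""
    chars = {c for c in text if not c.isspace()}
    return len(chars) + n_special


# Example: a tiny corpus with 7 distinct non-space characters.
sample = "hello world"
print(min_vocab_size(sample))  # 7 unique chars + 3 specials = 10
```

In practice vocab_size is usually set well above this floor (common choices range from a few thousand for small corpora to 32k or more for large ones), trading off sequence length against model embedding size.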