google / sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.

What is the appropriate vocabulary size (vocab_size) setting?

2088208 opened this issue · comments

import sentencepiece as sp

sp.SentencePieceTrainer.train(
    input='./data/corpus.txt',
    model_prefix='tokenizer',
    vocab_size=3000,
    character_coverage=1.0,
    model_type='bpe',
)
Given the training call above, what is an appropriate setting for vocab_size? Should it be determined from the training corpus, or by some other criterion?
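
One practical sanity check, sketched below under the assumption that ./data/corpus.txt from the call above is the training file: vocab_size cannot be smaller than the character inventory the model must cover (plus the default <unk>, <s>, </s> symbols), and the trainer rejects values larger than the number of distinct pieces it can actually extract from the corpus, so inspecting the corpus first gives rough lower and upper bounds. This is a sketch, not an official SentencePiece recommendation.

# Rough corpus statistics to help pick vocab_size (illustrative only).
with open('./data/corpus.txt', encoding='utf-8') as f:
    text = f.read()

unique_chars = len(set(text))      # lower bound: with character_coverage=1.0,
                                   # every distinct character needs a piece
approx_words = len(text.split())   # very rough indicator of corpus size

print('unique characters:', unique_chars)
print('whitespace-delimited tokens:', approx_words)

# vocab_size must be at least unique_chars plus the three default special
# symbols; if it is set higher than the corpus can support, training aborts
# with a "Vocabulary size too high" error that suggests a maximum value.

In practice the value is a modeling choice constrained by these bounds: small, single-domain corpora often work with a few thousand pieces, while larger or multilingual corpora commonly use vocabularies in the tens of thousands, balanced against the embedding size of the downstream model.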