Max Sequence Length
Hossein-1991 opened this issue · comments
Hossein Salahshoor Gavalan commented
Hi,
The Max Sequence Length for the all-MiniLM-L6-v2
model is 256. What does that mean? Does it mean the total number of tokens must be at most 256? If that is the case, then the kind of tokenizer we use will be important, am I right?
Maarten Grootendorst commented
It means that the model can handle at most 256 tokens; any text beyond that is truncated. A tokenizer is already integrated within all-MiniLM-L6-v2
that handles the tokenization for you when creating the embeddings. The tokenizer you pass to KeyBERT is only used for generating the candidate keywords and keyphrases that will be compared to the input document.
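To make the truncation concrete, here is a minimal sketch. It uses a toy whitespace tokenizer purely for illustration; the model's real tokenizer is a subword (WordPiece-style) tokenizer, which typically produces *more* tokens than whitespace splitting, so real documents hit the 256-token limit sooner than a word count suggests.

```python
MAX_SEQ_LENGTH = 256  # all-MiniLM-L6-v2's sequence limit

def toy_tokenize(text):
    # Stand-in for the model's integrated subword tokenizer
    # (illustration only, not the actual tokenization).
    return text.split()

def truncate_for_model(text, max_len=MAX_SEQ_LENGTH):
    # Keep only the first max_len tokens; the rest never
    # influence the embedding.
    tokens = toy_tokenize(text)
    kept = tokens[:max_len]
    dropped = len(tokens) - len(kept)
    return kept, dropped

doc = "word " * 300  # 300 whitespace-separated tokens
kept, dropped = truncate_for_model(doc)
print(len(kept), dropped)  # → 256 44
```

In sentence-transformers you can inspect this limit via the model's `max_seq_length` attribute after loading the model; the embedding tokenizer and the KeyBERT candidate-extraction tokenizer remain two separate things.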