MaartenGr / KeyBERT

Minimal keyword extraction with BERT

Home Page: https://MaartenGr.github.io/KeyBERT/

Max Sequence Length

Hossein-1991 opened this issue · comments

Hi,

The max sequence length for the all-MiniLM-L6-v2 model is 256. What does that mean? Does it mean the total number of tokens must be at most 256? If that is the case, then the kind of tokenizer we use becomes important, am I right?

It means that the model can handle at most 256 tokens; any text beyond that will be truncated. There is already a tokenizer integrated within all-MiniLM-L6-v2 that performs the tokenization for you when creating the embeddings. The tokenizer that you pass to KeyBERT, on the other hand, is only used for generating the candidate keywords and keyphrases that are compared against the input document.
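To make the truncation behavior concrete, here is a minimal sketch. It uses a toy whitespace split as a stand-in for the model's real subword tokenizer (so the actual token counts for a given text will differ), but the principle is the same: tokens past the maximum sequence length are simply dropped before embedding.

```python
def truncate_tokens(text: str, max_seq_length: int = 256) -> list[str]:
    """Illustrative only: whitespace split stands in for the model's
    subword tokenizer. Tokens beyond max_seq_length are discarded,
    which is what happens internally for all-MiniLM-L6-v2 at 256."""
    tokens = text.split()
    return tokens[:max_seq_length]

# A document of 300 "words" loses everything after token 256.
doc = " ".join(f"word{i}" for i in range(300))
kept = truncate_tokens(doc)
print(len(kept))       # 256
print(kept[-1])        # word255 -- word256 onward never reach the model
```

In practice the real subword tokenizer often produces more tokens than there are words, so a document can hit the 256-token limit well before its 256th word; the embedding then only reflects the beginning of the document.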