MaartenGr / KeyBERT

Minimal keyword extraction with BERT

Home Page: https://MaartenGr.github.io/KeyBERT/

Max Sequence Length

Hossein-1991 opened this issue · comments

Hi,

The max sequence length for the all-MiniLM-L6-v2 model is 256. What does that mean? Does it mean the total number of tokens must be at most 256? If that is the case, then the kind of tokenizer we use becomes important, am I right?

It means that the model can handle at most 256 tokens; any text beyond that will be truncated. There is already a tokenizer integrated within all-MiniLM-L6-v2 that performs the tokenization for you when creating the embeddings. The tokenizer that you pass to KeyBERT, on the other hand, is only used for generating the candidate keywords and keyphrases that are compared against the input document.
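To make the truncation behavior concrete, here is a minimal sketch. It uses a toy whitespace split as a stand-in for the model's real subword tokenizer (so the actual token counts for a given text will differ), but the principle is the same: tokens past the maximum sequence length are simply dropped before embedding.

```python
def truncate_tokens(text: str, max_seq_length: int = 256) -> list[str]:
    """Illustrative only: whitespace split stands in for the model's
    subword tokenizer. Tokens beyond max_seq_length are discarded,
    which is what happens internally for all-MiniLM-L6-v2 at 256."""
    tokens = text.split()
    return tokens[:max_seq_length]

# A document of 300 "words" loses everything after token 256.
doc = " ".join(f"word{i}" for i in range(300))
kept = truncate_tokens(doc)
print(len(kept))       # 256
print(kept[-1])        # word255 -- word256 onward never reach the model
```

In practice the real subword tokenizer often produces more tokens than there are words, so a document can hit the 256-token limit well before its 256th word; the embedding then only reflects the beginning of the document.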