MaartenGr / KeyBERT

Minimal keyword extraction with BERT

Home Page: https://MaartenGr.github.io/KeyBERT/


How can I use KeyBERT if I have tokenized Chinese documents myself?

shuxian12 opened this issue · comments

commented

I have seen #45 and the API documentation, but I only found examples of supplying a custom tokenizer before extracting keywords.
I'm wondering whether I can use extract_embedding without passing a vectorizer, i.e., pass in Chinese documents that I have already tokenized myself.
Does this work, or does the current model require a tokenizer?
If possible, could you give me an example?

Thanks a lot~

The model does expect a tokenizer to be used. However, you could pre-tokenize the documents yourself, join the tokens with spaces, and then use a tokenizer that simply splits the documents on those spaces.
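
A minimal sketch of that approach (not from the original thread): the tokens produced by your own Chinese tokenizer are joined with spaces, and a scikit-learn `CountVectorizer` whose tokenizer is a plain whitespace split is passed to `extract_keywords`, so KeyBERT does not re-tokenize the text. The example document and the multilingual model name below are illustrative assumptions.

```python
# Minimal sketch, assuming the tokens come from your own Chinese
# tokenizer (e.g. jieba); the document and model name are illustrative.
from keybert import KeyBERT
from sklearn.feature_extraction.text import CountVectorizer

# Pre-tokenized document: join your tokens back together with spaces.
tokens = ["我", "喜欢", "使用", "自然", "语言", "处理", "技术"]
doc = " ".join(tokens)

# A vectorizer whose tokenizer is a plain whitespace split, so the
# token boundaries you produced are preserved as-is.
vectorizer = CountVectorizer(
    tokenizer=lambda text: text.split(),  # split only on whitespace
    token_pattern=None,  # silence the unused token_pattern warning
)

# A multilingual sentence-transformers model is a reasonable choice
# for Chinese; any embedding model supported by KeyBERT works here.
kw_model = KeyBERT(model="paraphrase-multilingual-MiniLM-L12-v2")
keywords = kw_model.extract_keywords(doc, vectorizer=vectorizer)
print(keywords)
```

Since the vectorizer is passed explicitly, KeyBERT's default tokenization (and its English stop-word default) is bypassed, and the candidate keywords are exactly the whitespace-separated tokens you supplied.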