How can I use KeyBERT if I have tokenized Chinese documents myself?
shuxian12 opened this issue
Xian commented
I have seen #45 and the API documentation, but I could only find examples that use a custom tokenizer before extracting keywords.
I'm wondering if I could use extract_embedding without providing a vectorizer, that is, pass in Chinese documents that have already been tokenized.
Does this work, or does the current model require a tokenizer?
If possible, could you give me an example?
Thanks a lot~
Maarten Grootendorst commented
The model does expect a tokenizer to be used. However, you could pre-tokenize the documents yourself, join the tokens with a space, and then use a tokenizer that simply splits the documents on spaces.
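A minimal sketch of that approach, assuming jieba as the external Chinese tokenizer and a multilingual sentence-transformers model (both are just illustrative choices; any tokenizer and embedding model would work):

```python
from sklearn.feature_extraction.text import CountVectorizer
from keybert import KeyBERT
import jieba  # assumption: jieba is one common choice of Chinese tokenizer

# Tokenize the document yourself, then join the tokens with spaces.
doc = "自然语言处理是人工智能领域的一个重要方向"
pre_tokenized = " ".join(jieba.lcut(doc))

# A vectorizer that only splits on spaces, so your tokenization is preserved.
vectorizer = CountVectorizer(
    tokenizer=lambda text: text.split(" "),
    token_pattern=None,  # silence the warning about the unused default pattern
)

# assumption: any multilingual embedding model works; this is one example
kw_model = KeyBERT(model="paraphrase-multilingual-MiniLM-L12-v2")
keywords = kw_model.extract_keywords(pre_tokenized, vectorizer=vectorizer)
print(keywords)  # list of (keyword, similarity score) tuples
```

Since the vectorizer's tokenizer is just a split on spaces, KeyBERT never re-tokenizes the text, so the candidate keywords are exactly the tokens you produced yourself.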