How can I use KeyBERT if I have tokenized Chinese documents myself?
shuxian12 opened this issue
Xian commented
I have seen #45 and the API documentation, but I could only find examples that use a custom tokenizer before extracting keywords.
I'm wondering if I could use extract_embedding without providing a vectorizer, that is, pass in Chinese documents that have already been tokenized.
Does this work, or does the current model require a tokenizer?
If possible, could you give me an example?
Thanks a lot~
Maarten Grootendorst commented
The model does expect a tokenizer to be used. However, you could pre-tokenize the documents yourself, join the tokens with a space, and then use a tokenizer that simply splits the documents on spaces.
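A minimal sketch of that approach, assuming jieba as the external Chinese tokenizer and a multilingual sentence-transformers model (both are just illustrative choices; any tokenizer and embedding model would work):

```python
from sklearn.feature_extraction.text import CountVectorizer
from keybert import KeyBERT
import jieba  # assumption: jieba is one common choice of Chinese tokenizer

# Tokenize the document yourself, then join the tokens with spaces.
doc = "自然语言处理是人工智能领域的一个重要方向"
pre_tokenized = " ".join(jieba.lcut(doc))

# A vectorizer that only splits on spaces, so your tokenization is preserved.
vectorizer = CountVectorizer(
    tokenizer=lambda text: text.split(" "),
    token_pattern=None,  # silence the warning about the unused default pattern
)

# assumption: any multilingual embedding model works; this is one example
kw_model = KeyBERT(model="paraphrase-multilingual-MiniLM-L12-v2")
keywords = kw_model.extract_keywords(pre_tokenized, vectorizer=vectorizer)
print(keywords)  # list of (keyword, similarity score) tuples
```

Since the vectorizer's tokenizer is just a split on spaces, KeyBERT never re-tokenizes the text, so the candidate keywords are exactly the tokens you produced yourself.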