MaartenGr / KeyBERT

Minimal keyword extraction with BERT

Home Page: https://MaartenGr.github.io/KeyBERT/

Question about KeyLLM + KeyBERT

lfoppiano opened this issue

I have two questions about KeyLLM + KeyBERT (https://maartengr.github.io/KeyBERT/guides/keyllm.html#5-efficient-keyllm-keybert).

Basically, did I understand correctly that KeyBERT is used to fetch the candidate keywords, which are then refined with the LLM using clustering?

How does it handle the list of documents? I have to process something like a batch of 6000 abstracts; should I manage them myself, or can I send them all in directly?

I do something like this:

```python
# `works` is my list of records; `model` is a SentenceTransformer and
# `kw_model` is a KeyBERT instance set up elsewhere in my pipeline.
abstracts = [work.get('abstract') or "" for work in works]
embeddings_abstracts = model.encode(abstracts, convert_to_tensor=True)
keywords_abstracts = kw_model.extract_keywords(abstracts, embeddings=embeddings_abstracts, threshold=0.7)

titles = [work.get('title') or "" for work in works]
embeddings_titles = model.encode(titles, convert_to_tensor=True)
keywords_titles = kw_model.extract_keywords(titles, embeddings=embeddings_titles, threshold=0.7)
```

Calculating the embeddings should not be needed if I use the standard Sentence-BERT model, right?

> Basically, did I understand correctly that KeyBERT is used to fetch the candidate keywords, which are then refined with the LLM using clustering?

That is correct! Do note that clustering will not always be performed; whether it happens depends on the threshold.
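In other words, something along these lines, which is roughly what the linked guide shows (a minimal sketch; the OpenAI backend, the placeholder key, and the 0.75 threshold are just examples, the exact backend constructor may differ between KeyBERT/openai versions, and `abstracts` is your list of documents):

```python
import openai
from keybert import KeyBERT
from keybert.llm import OpenAI

# Example LLM backend; any backend from keybert.llm should work here.
client = openai.OpenAI(api_key="sk-...")  # placeholder key
llm = OpenAI(client)

# KeyBERT extracts the candidate keywords, which the LLM then refines.
kw_model = KeyBERT(llm=llm)

# Documents whose embeddings are at least `threshold` similar are grouped,
# and only one representative per group is passed to the LLM.
keywords = kw_model.extract_keywords(abstracts, threshold=0.75)
```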

> How does it handle the list of documents? I have to process something like a batch of 6000 abstracts; should I manage them myself, or can I send them all in directly?

If they are just abstracts, I would advise sending them in directly without doing any work beforehand. Titles are much less descriptive than abstracts, so I would focus mainly on the abstracts.

> Calculating the embeddings should not be needed if I use the standard Sentence-BERT model, right?

If you do not want to calculate the embeddings yourself, you can let KeyBERT do it instead. Either way, the embeddings need to be calculated.

OK thanks!

Just to be sure, when you say "use KeyBERT", that means to:

  1. pass the llm to KeyBERT when instantiating it, and
  2. not pass the embeddings when calling kw_model.extract_keywords()

correct?

Thank you

Yes, that's correct! You could also just pass the embeddings; there should not be a difference between the two approaches.
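Concretely, both of these should produce the same keywords (a sketch reusing the `model`, `kw_model`, and `abstracts` variables from your snippet above):

```python
# Option 1: let KeyBERT compute the embeddings internally
keywords = kw_model.extract_keywords(abstracts, threshold=0.7)

# Option 2: pre-compute the embeddings and pass them in explicitly
embeddings = model.encode(abstracts, convert_to_tensor=True)
keywords = kw_model.extract_keywords(abstracts, embeddings=embeddings, threshold=0.7)
```

Pre-computing is mainly useful if you want to reuse the embeddings for something else.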

Thanks!

One more question: is there a way to know how many clusters are generated?

In our use case, we don't know yet which threshold value is best, but we would at least like to see how many clusters there are and/or how big they are.

You can see how many clusters were created after running KeyLLM by finding the lists that have the same keywords, so essentially by counting the number of documents that have the exact same keywords.
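For example, something like this would give you the number of clusters and their sizes (a sketch; it assumes `keywords` is the list returned by `extract_keywords`, with one keyword list per document):

```python
from collections import Counter

# Documents that were clustered together receive the exact same keywords,
# so grouping by the keyword tuple recovers the clusters and their sizes.
cluster_sizes = Counter(tuple(kws) for kws in keywords)

print(f"Number of clusters: {len(cluster_sizes)}")
for kws, size in cluster_sizes.most_common(10):
    print(size, kws)
```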

I see, yes, that is the solution I'm using; I was hoping for something more efficient 😉
Thanks anyway!