MaartenGr / KeyBERT

Minimal keyword extraction with BERT

Home Page: https://MaartenGr.github.io/KeyBERT/

Question about KeyLLM + KeyBERT

lfoppiano opened this issue

I have two questions about KeyLLM + KeyBERT (https://maartengr.github.io/KeyBERT/guides/keyllm.html#5-efficient-keyllm-keybert).

Basically, did I understand correctly that KeyBERT is used to fetch the candidate keywords, which are then refined with the LLM using clustering?

How does it handle the list of documents? I have to process something like a batch of 6000 abstracts; should I manage them myself, or can I send them all in directly?

I do something like this:

```python
# `works` is my list of records; `model` is a SentenceTransformer and
# `kw_model` is a KeyBERT instance set up elsewhere in my pipeline.
abstracts = [work.get('abstract') or "" for work in works]
embeddings_abstracts = model.encode(abstracts, convert_to_tensor=True)
keywords_abstracts = kw_model.extract_keywords(abstracts, embeddings=embeddings_abstracts, threshold=0.7)

titles = [work.get('title') or "" for work in works]
embeddings_titles = model.encode(titles, convert_to_tensor=True)
keywords_titles = kw_model.extract_keywords(titles, embeddings=embeddings_titles, threshold=0.7)
```

Calculating the embeddings should not be needed if I use the standard Sentence-BERT model, right?

> Basically, did I understand correctly that KeyBERT is used to fetch the candidate keywords, which are then refined with the LLM using clustering?

That is correct! Do note that clustering will not always be performed; whether it happens depends on the threshold.
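In other words, something along these lines, which is roughly what the linked guide shows (a minimal sketch; the OpenAI backend, the placeholder key, and the 0.75 threshold are just examples, the exact backend constructor may differ between KeyBERT/openai versions, and `abstracts` is your list of documents):

```python
import openai
from keybert import KeyBERT
from keybert.llm import OpenAI

# Example LLM backend; any backend from keybert.llm should work here.
client = openai.OpenAI(api_key="sk-...")  # placeholder key
llm = OpenAI(client)

# KeyBERT extracts the candidate keywords, which the LLM then refines.
kw_model = KeyBERT(llm=llm)

# Documents whose embeddings are at least `threshold` similar are grouped,
# and only one representative per group is passed to the LLM.
keywords = kw_model.extract_keywords(abstracts, threshold=0.75)
```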

> How does it handle the list of documents? I have to process something like a batch of 6000 abstracts; should I manage them myself, or can I send them all in directly?

If they are just abstracts, I would advise sending them in directly without doing any work beforehand. Titles are much less descriptive than abstracts, so I would focus mainly on the abstracts.

> Calculating the embeddings should not be needed if I use the standard Sentence-BERT model, right?

If you do not want to calculate the embeddings yourself, you can let KeyBERT do it instead. Either way, the embeddings need to be calculated.

OK thanks!

Just to be sure, when you say "use KeyBERT", that means to:

  1. pass the llm to KeyBERT when instantiating it, and
  2. not pass the embeddings when calling kw_model.extract_keywords()

correct?

Thank you

Yes, that's correct! You could also just pass the embeddings; there should not be a difference between the two approaches.
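Concretely, both of these should produce the same keywords (a sketch reusing the `model`, `kw_model`, and `abstracts` variables from your snippet above):

```python
# Option 1: let KeyBERT compute the embeddings internally
keywords = kw_model.extract_keywords(abstracts, threshold=0.7)

# Option 2: pre-compute the embeddings and pass them in explicitly
embeddings = model.encode(abstracts, convert_to_tensor=True)
keywords = kw_model.extract_keywords(abstracts, embeddings=embeddings, threshold=0.7)
```

Pre-computing is mainly useful if you want to reuse the embeddings for something else.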

Thanks!

One more question: is there a way to know how many clusters are generated?

In our use case, we don't know yet which threshold value is best, but we would at least like to see how many clusters there are and/or how big they are.

You can see how many clusters were created after running KeyLLM by finding the lists that have the same keywords, so essentially by counting the number of documents that have the exact same keywords.
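For example, something like this would give you the number of clusters and their sizes (a sketch; it assumes `keywords` is the list returned by `extract_keywords`, with one keyword list per document):

```python
from collections import Counter

# Documents that were clustered together receive the exact same keywords,
# so grouping by the keyword tuple recovers the clusters and their sizes.
cluster_sizes = Counter(tuple(kws) for kws in keywords)

print(f"Number of clusters: {len(cluster_sizes)}")
for kws, size in cluster_sizes.most_common(10):
    print(size, kws)
```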

I see, yes, that is the solution I'm using; I was hoping for something more efficient 😉
Thanks anyway!