MaartenGr / KeyBERT

Minimal keyword extraction with BERT

Home Page: https://MaartenGr.github.io/KeyBERT/


KeyLLM keyword extraction issue

ksachdeva11 opened this issue · comments

KeyLLM seems to be extracting keywords that are not even present in the document. I am following the steps described in this article - https://towardsdatascience.com/introducing-keyllm-keyword-extraction-with-llms-39924b504813

I am using Mistral 7B model.

from ctransformers import AutoModelForCausalLM

# Set gpu_layers to the number of layers to offload to GPU. Set to 0 if no GPU acceleration is available on your system.
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-Instruct-v0.1-GGUF",
    model_file="mistral-7b-instruct-v0.1.Q4_K_M.gguf",
    model_type="mistral",
    gpu_layers=50,
    hf=True
)

from transformers import AutoTokenizer, pipeline

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

# Pipeline
generator = pipeline(
    model=model, tokenizer=tokenizer,
    task='text-generation',
    max_new_tokens=50,
    repetition_penalty=1.1
)
from keybert.llm import TextGeneration
from keybert import KeyLLM

# The article defines a prompt with a [DOCUMENT] placeholder; a minimal example:
prompt = """
I have the following document:
[DOCUMENT]

Extract the keywords that best describe this document, separated by commas.
"""
# Load it in KeyLLM
llm = TextGeneration(generator, prompt=prompt)
kw_model = KeyLLM(llm)

documents = [
"As discussed above, for the training set, finer-grained instances in the training set are generally better than coarser-grained ones. This preference does not apply to classification time, i.e. the use of the classifier in the field. We should go ahead and predict the sentiment of whatever text we are given, be it a sentence or a chapter.",
"I received my package!",
"You clearly want to know what is being complained about and what is being liked."
]

keywords = kw_model.extract_keywords(documents)
keywords

Output -

[['discussed',
  'above',
  'finer-grained',
  'instances',
  'training',
  'set',
  'better',
  'coarser-grained',
  'preference',
  'applies',
  'classification',
  'time',
  'field',
  'predict',
  'sentiment',
  'text',
  'sentence',
  'chapter.'],
 ['package',
  'received',
  'delivery',
  'shipment',
  'mail',
  'courier',
  'product',
  'order',
  'online',
  'store'],
 ['complained',
  'liked',
  'want',
  'know',
  'clear',
  'understand',
  'specific',
  'detail',
  'issue',
  'problem',
  'feedback',
  'opinion',
  'satisfaction',
  'enjoyment',
  'appreciation',
  'preference',
  'dislike',
  'dissatisfaction',
  'negative',
  'positive',
  'favorable',
  'unf']]

It seems to be extracting similar words even though they are not present in the original document. Could this be a model-specific issue?

Thank you for sharing this! The LLM indeed plays a role in the type of keywords extracted, whether or not they are present in the original document. However, the main culprit here is the prompt itself. By tweaking the prompt, you can ask the LLM to only extract keywords that are literally found in the text rather than coming up with different ones.

I would advise looking at the documentation here, which illustrates this with an example.
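As an illustration, a stricter prompt along these lines could be passed to the LLM wrapper. This is only a sketch: the `[DOCUMENT]` placeholder is the one KeyBERT substitutes with each input text, and the exact wording in the official documentation may differ.

```python
# Sketch of a stricter prompt; [DOCUMENT] is the placeholder KeyBERT
# replaces with each input document. Exact wording may differ from the docs.
prompt = """
I have the following document:
[DOCUMENT]

Extract the keywords that best describe the topic of the text.
Make sure to only extract keywords that literally appear in the text.
Use the following format separated by commas:
<keywords>
"""
```

This prompt string would then be passed via `TextGeneration(generator, prompt=prompt)` as in the code above.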

Got it, thank you for your quick response!

Hi Maarten,

For some reason, when using check_vocab to keep only the words that appear in the documents, with the exact same code as in the documentation, I receive different results. Here is what I get:

[[], [], ['Meta released', "LLaMA's model"]]

Is there anything that can explain that result?

@Bolive84 Could you share your full code? Without it, it is difficult to say what exactly is happening here.

Hi @MaartenGr, thanks for your reply. The code I use is the one provided in the tutorial (I am just masking my API key for security reasons):

import openai
from keybert.llm import OpenAI
from keybert import KeyLLM

# Create your LLM
openai.api_key = "xxxx"  # masked

prompt = """
I have the following document:
[DOCUMENT]

Based on the information above, extract the keywords that best describe the topic of the text.
Make sure to only extract keywords that appear in the text.
Use the following format separated by commas:
<keywords>
"""
llm = OpenAI()

# Load it in KeyLLM
kw_model = KeyLLM(llm)

# Extract keywords
keywords = kw_model.extract_keywords(documents, check_vocab=True)
keywords

@Bolive84 It might just be that OpenAI tends not to extract the exact keywords that appear in the text. Could you try with and without check_vocab=True to see the difference in output?
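To make the difference concrete, here is a rough sketch of the effect of a vocabulary check (not KeyLLM's actual implementation; the document and keyword strings are illustrative): keywords the LLM returns that do not occur verbatim in the document are dropped, which is why lists can come back empty, or much shorter, when the model paraphrases.

```python
# Hypothetical LLM output for one document; strings here are illustrative.
document = "Meta released the LLaMA model to researchers."
llm_keywords = ["Meta released", "LLaMA's model", "open source", "AI"]

# Without a vocabulary check, everything the LLM returned is kept.
without_check = llm_keywords

# With a check, only keywords found verbatim in the document survive.
with_check = [kw for kw in llm_keywords if kw.lower() in document.lower()]

print(without_check)  # all four keywords
print(with_check)     # ['Meta released']
```

Since "LLaMA's model", "open source", and "AI" do not appear verbatim in the document, only "Meta released" survives the check, mirroring the near-empty lists reported above.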