MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.

Home Page:https://maartengr.github.io/BERTopic/

Representation on very large documents with LLMs cuts out its own prompts issue

arminpasalic opened this issue · comments

Hello Maarten!

Love the module. I'm currently using BERTopic on multilingual data for my thesis. However, I could use some assistance in getting an LLM to work for the representations. I somehow got it running, but it's not quite as effective as I'd hoped.

My aim is to use an LLM for the representation of KEYWORDS and REPRESENTATIVE DOCUMENTS, but these documents are very, very long. I followed the guide on your official page, but I couldn't get the input to truncate or limit its length. All documents are articles, in 6 separate languages.
For my embeddings, I split each text into chunks and averaged the chunk embeddings for the initial (custom) embedding stage.
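
Roughly, that chunk-and-average step looks like the sketch below (the multilingual sentence-transformers checkpoint and the chunk size are placeholders, not my exact setup):

from sentence_transformers import SentenceTransformer
import numpy as np

# Placeholder multilingual (XLM-RoBERTa-based) encoder; swap in the model actually used
model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")

def embed_long_document(text, chunk_size=256):
    # Split the article into fixed-size word chunks
    words = text.split()
    chunks = [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]
    # Encode each chunk and average into a single document embedding
    chunk_embeddings = model.encode(chunks)
    return np.mean(chunk_embeddings, axis=0)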

So I used the Zephyr-7B approach with LlamaCPP, and it almost worked, but not quite... It somehow cuts off the representations and names for the topics, though it still gives a bit of valuable insight. I also tried Mistral, but that did not work at all.
I also can't figure out how to use a custom prompt in this scenario, compared to what your guide shows with the prompt variable.

Code (edit, added more 10:39)

import numpy as np

# Filtered data to test
first_40_rows = combined_df.head(40)

# Extract the 'text' column and convert it to a list
docs = first_40_rows['cleaned_text'].tolist()

# Stack the custom chunk-averaged RoBERTa 'embeddings' column vertically
embeddings = np.vstack(first_40_rows['embeddings'].to_numpy())

pip install llama-cpp-python --quiet

from bertopic import BERTopic
from bertopic.representation import LlamaCPP
from bertopic.vectorizers import ClassTfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP
from hdbscan import HDBSCAN

# Use llama.cpp to load in a 4-bit quantized version of Zephyr 7B Alpha
representation_model = LlamaCPP(
    "/work/Master/zephyr-7b-alpha.Q4_K_M.gguf"
)

# Use Danish stopwords with CountVectorizer
vectorizer_model = CountVectorizer(stop_words=danish_stopwords, ngram_range=(1, 3), min_df=2)

# Remaining setup
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
hdbscan_model = HDBSCAN(min_cluster_size=2, metric='euclidean', cluster_selection_method='eom', prediction_data=True)
ctfidf_model = ClassTfidfTransformer()

# Initialize BERTopic with custom models and the updated vectorizer
topic_model = BERTopic(
    embedding_model=model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    ctfidf_model=ctfidf_model,
    verbose=True,
    top_n_words=20,
    representation_model=representation_model
)

# Fit the BERTopic model
topics, probs = topic_model.fit_transform(docs, embeddings)

df_topic['Representation'][1]
['\n"Israeli attacks on civilian Palestinians in Gaza and Leb',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '']

df_topic['Representation'][2]
[' "Israeli attacks on Gazan children"\n\nQ: Can you',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '']

A couple of things that might help here.

First, make sure that you use a prompt template that fits with the LLM that you are using. Like the example in the documentation, simply use the prompt variable:

prompt = """<|system|>You are a helpful, respectful and honest assistant for labeling topics..</s>
<|user|>
I have a topic that contains the following documents:
[DOCUMENTS]

The topic is described by the following keywords: '[KEYWORDS]'.

Based on the information about the topic above, please create a short label of this topic. Make sure you to only return the label and nothing more.</s>
<|assistant|>"""

# Use llama.cpp to load in a 4-bit quantized version of Zephyr 7B Alpha
representation_model = LlamaCPP(
    "/work/Master/zephyr-7b-alpha.Q4_K_M.gguf",
    prompt=prompt
)

Second, as shown in the documentation, you can truncate the input documents so that they do not exceed certain token limits. I think this would help your use case, since your documents are quite long.
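
A sketch of what that truncation could look like, assuming the LlamaCPP representation accepts the nr_docs, doc_length, and tokenizer arguments described in the LLM documentation (the values here are illustrative):

# Truncate each representative document before it is inserted into the prompt
representation_model = LlamaCPP(
    "/work/Master/zephyr-7b-alpha.Q4_K_M.gguf",
    prompt=prompt,
    nr_docs=4,               # number of representative documents per topic
    doc_length=100,          # maximum length per document
    tokenizer="whitespace"   # count length in whitespace-separated words
)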

Hi Maarten! Thanks for your quick reply.

I have done the following, and I think this can be a sufficient approach for my multilingual data: I increased the context size of the Zephyr model (code below).
However, I am wondering if I can "select" which topics are passed on to the LLM for representation fine-tuning. Let's say I only want the top 50 topics to be fine-tuned. As I understand it, if I change the clustering approach from HDBSCAN to k-means, representative docs would not be available? I'm 'scared' of ending up with too many topics per country (multilingual), and of spending too many LLM resources on the 'least' interesting topics.

from llama_cpp import Llama

# Use llama.cpp to load in a quantized LLM with a larger context window
llm = Llama(
    model_path='/work/Master/zephyr-7b-alpha.Q4_K_M.gguf',
    n_ctx=32768,
    stop=["Q:", "\n"]
)

prompt = """<|system|>You are a helpful, respectful and honest assistant for labeling topics..</s>
<|user|>
I have a topic that contains the following documents:
[DOCUMENTS]

The topic is described by the following keywords: '[KEYWORDS]'.

Based on the information about the topic above, please create a short label of this topic in English of max 5 words. Make sure you to only return the English label of max 5 words and nothing more.</s>
<|assistant|>"""
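
The pre-loaded model and prompt are then handed to BERTopic roughly as in the sketch below (it assumes the LlamaCPP representation also accepts an already-instantiated Llama object, as the documentation shows):

from bertopic import BERTopic
from bertopic.representation import LlamaCPP

# Wrap the pre-loaded llama.cpp model and the custom prompt in a representation model
representation_model = LlamaCPP(llm, prompt=prompt)

topic_model = BERTopic(representation_model=representation_model, verbose=True)
topics, probs = topic_model.fit_transform(docs, embeddings)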

I have done the following, and I think this can be a sufficient approach for my multilingual data: I increased the context size of the Zephyr model (code below).

Do note that although you might have increased the context size, truncation is seldom a bad idea and something I would definitely recommend doing, as the embedding models themselves typically also have a limited context size.

However, I am wondering if I can "select" which topics are passed on to the LLM for representation fine-tuning. Let's say I only want the top 50 topics to be fine-tuned. As I understand it, if I change the clustering approach from HDBSCAN to k-means, representative docs would not be available? I'm 'scared' of ending up with too many topics per country (multilingual), and of spending too many LLM resources on the 'least' interesting topics.

Each topic is passed to the LLM individually; the topics are not dependent on one another and are all processed sequentially. If you want to skip certain topics, you would have to change the source code, and these representation models are seldom slow enough to warrant processing only certain topics.

For instance, the LLM is called once for each topic, so only a very limited number of times, and it shouldn't take much wall time to do so.

Also, representative documents are always available, as they are derived from the c-TF-IDF representations, which are computed regardless of the representation or clustering models.
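
For example, after fitting, the keywords and representative documents of a topic can be inspected directly (the topic id below is just an illustration):

# Inspect the c-TF-IDF keywords and representative documents of a topic
topic_id = 1  # example topic id
topic_model.get_topic(topic_id)                # list of (keyword, c-TF-IDF score) tuples
topic_model.get_representative_docs(topic_id)  # representative documents for this topic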