MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.

Home Page:https://maartengr.github.io/BERTopic/


Unable to Display More Than 10 Words per Topic in BERTopic Despite top_n_words Setting

Hanqingxu123 opened this issue · comments

Hello,

I've been using BERTopic for topic modeling and encountered an issue where I cannot display more than 10 words per topic, even though I've explicitly set top_n_words to 15. This issue persists across both topic extraction and visualization phases, where the expected outcome is to display the top 15 words per topic. Below is a summary of my setup and the encountered issue:

Setup:
I have initialized the BERTopic model with custom settings including various models for embedding, representation, dimensionality reduction, and clustering. I've set top_n_words to 15 with the intention to extract and visualize the top 15 words for each topic:

from bertopic import BERTopic

topic_model = BERTopic(
    embedding_model=embedding_model,
    representation_model=representation_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    ctfidf_model=ctfidf_model,
    top_n_words=15
)

print(topic_model.top_n_words) # This correctly outputs 15

Issue:
After fitting the model with fit_transform, when attempting to extract or visualize the top words for each topic, it seems that the model only displays up to 10 words per topic, despite top_n_words being set to 15. This limitation occurs both when extracting topic words and scores and during visualization with visualize_barchart.

For example, extracting topic words and scores:

topic_number = 0
topic_words_and_scores = topic_model.get_topic(topic_number)

print("Number of words in topic:", len(topic_words_and_scores))
for word, score in topic_words_and_scores:
    print(f"{word}: {score}")
11

And visualizing the topics with:

topic_model.visualize_barchart(n_words=15, height=500, width=350, top_n_topics=3)
In both cases, only 10 words are displayed per topic, not the 15 as expected based on the top_n_words setting.
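As a quick sanity check (a minimal sketch, not part of the original report; it assumes the fitted topic_model above and uses get_topics(), which returns the stored (word, score) pairs for every topic), the number of words actually kept per topic can be printed like this:

# Sketch: count how many (word, score) pairs are stored for each topic
for topic_id, words in topic_model.get_topics().items():
    print(f"Topic {topic_id}: {len(words)} words stored")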

Questions:

1. Is there a known limitation or bug that prevents displaying more than 10 words per topic, despite the top_n_words setting?
2. Are there additional steps or settings required to ensure that the top 15 words are extracted and visualized per topic?
3. Could there be any internal overrides or defaults that limit the number of words displayed per topic to 10, which I might not be aware of?
I would greatly appreciate any guidance or suggestions you could offer to resolve this issue.

Thank you.

You might need to set top_n_words in your representation_model. Could you share your full code?
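For example, something along these lines (a minimal sketch; the concrete values are worked out further down in this thread):

# Sketch: the representation model keeps its own word count, so it has to be raised explicitly
representation_model = MaximalMarginalRelevance(diversity=0.1, top_n_words=15)
topic_model = BERTopic(representation_model=representation_model, top_n_words=30)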

Thank you for your assistance. I'll provide the complete code; I need your help to understand why I'm unable to retrieve more than the top ten topic words and their scores for each topic. The same ten-word limit also applies to the subsequent visualizations and to the dynamic topics-over-time extraction.

import pandas as pd
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from bertopic.vectorizers import ClassTfidfTransformer
import nltk
from nltk.stem import WordNetLemmatizer
import plotly.graph_objects as go
import string
import re
from bertopic.representation import MaximalMarginalRelevance
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.cluster import KMeans

# Load the English spaCy model

nlp = spacy.load('en_core_web_sm')
STOP_WORDS = nlp.Defaults.stop_words

# Load the data

data = pd.read_csv('AIb.csv')

# Define a function to clean the text
def clean_text(text):
    # Convert the text to lowercase
    text = text.lower()
    # Patterns are lowercase because the text has already been lowercased
    text = re.sub(r'\bartificial intelligence\b', 'ai', text)
    text = re.sub(r'\bconvolutional neural network\b', 'cnn', text)
    text = re.sub(r'\blarge language models\b', 'llms', text)

    # Lemmatize, keeping "data" as-is and dropping punctuation and whitespace tokens
    doc = nlp(text)
    text = ' '.join(token.lemma_ if token.text != "data" else token.text
                    for token in doc if not token.is_punct and not token.is_space)

    # Remove stop words
    text = ' '.join(word for word in text.split() if word not in STOP_WORDS)

    return text

# Clean the text data

data['document'] = data['document'].apply(clean_text)

timestamps = data.PY.to_list()

embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embeddings = embedding_model.encode(data['document'], show_progress_bar=False)

# Step 2 - Reduce dimensionality
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine')  # random_state=42

# Step 3 - Cluster reduced embeddings
# cluster_model = KMeans(n_clusters=15)
hdbscan_model = HDBSCAN(min_cluster_size=15, metric='euclidean', cluster_selection_method='eom', prediction_data=True, min_samples=5)
from sklearn.feature_extraction.text import CountVectorizer

# Step 4 - Tokenize topics
vectorizer_model = CountVectorizer(ngram_range=(1, 3), stop_words="english", min_df=2)

# Step 5 - Create topic representation
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)
representation_model = MaximalMarginalRelevance(diversity=0.1)
reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings)
topic_model = BERTopic(
    embedding_model=embedding_model,
    representation_model=representation_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    ctfidf_model=ctfidf_model,
    calculate_probabilities=True,
    nr_topics="auto",
    top_n_words=15
)
topics, probs = topic_model.fit_transform(data['document'], embeddings)

topic_number = 0
topic_words_and_scores = topic_model.get_topic(topic_number)

print("Number of words in topic:", len(topic_words_and_scores))
for word, score in topic_words_and_scores:
    print(f"{word}: {score}")

# Reduce outliers
new_topics = topic_model.reduce_outliers(data['document'], topics, strategy="c-tf-idf", threshold=0.1)
new_topics = topic_model.reduce_outliers(data['document'], new_topics, strategy="distributions")
topic_model.update_topics(data['document'], topics=new_topics)

visualize_topics1 = topic_model.visualize_topics()

visualize_topics1.write_html('visualize_topics.html')

embeddings = embedding_model.encode(data['document'], show_progress_bar=False)

# Run the visualization with the original embeddings

topic_model.visualize_documents(data['document'], reduced_embeddings=reduced_embeddings, hide_document_hover=True, hide_annotations=True).write_html("主题文档分布.html")

topics_over_time = topic_model.topics_over_time(data['document'], timestamps,
                                                global_tuning=True,
                                                evolution_tuning=True,
                                                nr_bins=29)

weights_list = []

for topic in topics_over_time['Topic']:
    topic_words_weights = topic_model.get_topic(topic)
    if topic_words_weights:
        weights_str = "; ".join([f"{word} ({weight:.4f})" for word, weight in topic_words_weights])
        weights_list.append(weights_str)
    else:
        weights_list.append("")

topics_over_time['Word Weights'] = weights_list

topics_over_time.to_csv("topics_over_time_with_weights.csv", index=False)
topic_model.visualize_topics_over_time(topics_over_time).write_html('动态主题演变.html')

a=topic_model.visualize_barchart(n_words=15, height=500, width=350, top_n_topics=3)

a.write_html('主题-主题词分布.html')

topic_model.visualize_heatmap().write_html("热力图.html")
fig1 = topic_model.visualize_hierarchy()

fig1.write_html("层级结构图.html")
fig2= topic_model.visualize_term_rank()
fig2.write_html("术语排行.html")
c=topic_model.get_document_info(data['document'])
c.to_csv("lis人工智能.csv")

You also need to increase the top_n_words value in MaximalMarginalRelevance. The reason is that MMR takes a number of input keywords, for example 15, and filters that down to a more diverse subset of 10. Here, the 15 relates to the top_n_words parameter in BERTopic (BERTopic(top_n_words=15)) and the 10 refers to the value in MMR (MaximalMarginalRelevance(top_n_words=10)). Make sure that the former is always bigger than the latter, otherwise no diversification will be applied.

representation_model = MaximalMarginalRelevance(diversity=0.1, top_n_words=10)


Thank you for your previous advice. I understand that to achieve diversification, the top_n_words parameter in BERTopic needs to be greater than the value in MMR (MaximalMarginalRelevance). I've tried to configure my code following your guidance, but I'm not sure if I've implemented it correctly. In my BERTopic initialization I've set top_n_words=15, but I can't find a way in my code to directly set the top_n_words value in MMR to 10. How can I set top_n_words=10 in MMR? Is there example code or a more detailed step-by-step instruction that could help me complete this configuration? I'm worried that I may not have correctly understood or implemented your suggestion.

I'm not sure if I understand correctly, but you can just do this:

mmr = MaximalMarginalRelevance(diversity=0.1, top_n_words=10)
topic_model = BERTopic(top_n_words=30, representation_model=mmr)
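If the model is already fitted, recent BERTopic versions can also apply the new representation without refitting, via update_topics (a sketch, assuming the topic_model and documents from the code above):

# Sketch: refresh topic representations on an already-fitted model
mmr = MaximalMarginalRelevance(diversity=0.1, top_n_words=10)
topic_model.update_topics(data['document'], representation_model=mmr, top_n_words=30)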


I made changes to my code based on your recent suggestions, and here are the output results.
representation_model = MaximalMarginalRelevance(diversity=0.1, top_n_words=10)

topic_model = BERTopic(
    embedding_model=embedding_model,
    representation_model=representation_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    ctfidf_model=ctfidf_model,
    top_n_words=30
)
Whether I try to print the top fifteen topic words for a specific topic or visualize the top fifteen words for each topic, neither works; I still can't get past the threshold of ten. The output is:
30
Number of words in topic: 10
research: 0.07231625152831453
interdisciplinary: 0.0357705831667299
social: 0.02861962498142216
transdisciplinary: 0.02603533844803727
development: 0.025841769865464017
future: 0.023463745399213854
impact: 0.021701800390241003
energy: 0.021572137571230877
global: 0.01934053713282768
environmental: 0.019191544088425334
[Visualization screenshot]

I've just solved it. It turns out that I needed to set it to 15 in representation_model = MaximalMarginalRelevance(top_n_words=15, diversity=0.1) to make it work. Thank you for your valuable advice.
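For anyone hitting the same limit, the combination that finally produced fifteen words per topic in this thread was (a minimal recap sketch; all other model arguments as in the full code above):

# Recap: BERTopic keeps 30 candidate words per topic, and MMR diversifies them down to the
# 15 words that are then returned by get_topic and shown by visualize_barchart
representation_model = MaximalMarginalRelevance(top_n_words=15, diversity=0.1)
topic_model = BERTopic(representation_model=representation_model, top_n_words=30)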