MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.

Home Page: https://maartengr.github.io/BERTopic/

Blank labels in the 2D documents visualization

mahmawad opened this issue · comments

[Attached image]

When I run topic modeling in a .py script, I get this issue.

Thanks for sharing but I am not familiar with your .py script. I will need a bit more information to understand what is happening here. Could you share your full code along with the version of BERTopic you are using?

Thank you for replying.

It is a normal import and setup for Llama 2, and then I save the visualization using the write_html function:
from sentence_transformers import SentenceTransformer

# Pre-calculate embeddings
embedding_model = SentenceTransformer("all-mpnet-base-v2")
embeddings = embedding_model.encode(df_articles['PreprocessedText'].tolist(), show_progress_bar=True)


# ft = api.load('fasttext-wiki-news-subwords-300')
# 

# In[18]:


from umap import UMAP
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)


# In[19]:


reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings)


# In[20]:


from hdbscan import HDBSCAN

hdbscan_model = HDBSCAN(metric='euclidean', cluster_selection_method='eom', prediction_data=True,min_cluster_size=15)


# In[20]:


from sklearn.cluster import KMeans

#cluster_model = KMeans(n_clusters=6, random_state=42)
cluster_model = KMeans(random_state=42,n_clusters=11)


# In[21]:


from sklearn.feature_extraction.text import CountVectorizer

# Custom list of words to exclude
custom_exclude_words = ["world", "automotive", "post",'first','new','car','cars','vehicle','vehicles','say']
# Note: only the custom words are used as stop words here; the standard English stop words are not merged in
vectorizer_model = CountVectorizer(stop_words=custom_exclude_words, min_df=3)


# In[22]:


from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance, TextGeneration

# KeyBERT
keybert = KeyBERTInspired()

# MMR
mmr = MaximalMarginalRelevance(diversity=0.3)

# Text generation with Llama 2 (`generator` and `prompt` come from the Llama 2 setup, which is not shown here)
llama2 = TextGeneration(generator, prompt=prompt)

# All representation models
representation_model = {
    "KeyBERT": keybert,
    "Llama2": llama2,
    "MMR": mmr,
}


# In[2]:

"""
import torch
print(torch.cuda.memory_summary(device=None, abbreviated=False))
torch.cuda.empty_cache()

"""
# In[23]:


topics_inp=df_articles['PreprocessedText'].tolist()


# In[24]:


from bertopic import BERTopic

topic_model = BERTopic(
    # Pipeline models
    embedding_model=embedding_model,
    vectorizer_model=vectorizer_model,
    umap_model=umap_model,
    hdbscan_model=cluster_model,  # KMeans is passed in place of HDBSCAN
    representation_model=representation_model,
    # ctfidf_model=ctfidf_model,

    # Hyperparameters
    top_n_words=10,
    verbose=True,
)

topics, probs = topic_model.fit_transform(topics_inp,embeddings)


# In[27]:


#topic_model.merge_topics(df_articles['PreprocessedText'].tolist(),[5,2])


# In[ ]:


# use one of the other topic representations, like KeyBERTInspired
#keybert_topic_labels = {topic: " | ".join(list(zip(*values))[0][:4]) for topic, values in topic_model.topic_aspects_["Llama2"].items()}
#topic_model.set_topic_labels(keybert_topic_labels)


# In[28]:


llama2_labels = [label[0][0].split("\n")[0] for label in topic_model.get_topics(full=True)["Llama2"].values()]
topic_model.set_topic_labels(llama2_labels)


# In[29]:


topic_model.get_topic_info()


# In[30]:


# Visualize the documents in 2-dimensional space; the hover text shows the preprocessed documents
# NOTE: You can hide the hover with `hide_document_hover=True`, which is especially helpful if you have a large dataset
viss = topic_model.visualize_documents(topics_inp, custom_labels=True, hide_annotations=False, hide_document_hover=False)
path_file = r"/home/amahmoud/workspace/vis_two_week_visul.html"
viss.write_html(path_file)

Could you check what labels you set in llama2_labels? It may be that Llama 2 did not create labels for all topics.
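
One quick way to check this is to look for empty entries in `llama2_labels` before calling `set_topic_labels`. The sketch below is only an illustration: the helper names and the example strings are hypothetical, not part of BERTopic or the script above. It also shows one way a blank could arise: taking `split("\n")[0]` of a generated text that starts with a newline yields an empty string.

```python
def find_blank_labels(labels):
    """Return the positions of labels that are empty or whitespace-only."""
    return [i for i, label in enumerate(labels) if not str(label).strip()]


def first_nonempty_line(text):
    """Return the first line of `text` that contains non-whitespace characters."""
    for line in str(text).splitlines():
        if line.strip():
            return line.strip()
    return ""


def fill_blank_labels(labels, fallback_prefix="Topic"):
    """Replace blank labels with a placeholder built from the list position."""
    return [
        str(label).strip() or f"{fallback_prefix} {i}"
        for i, label in enumerate(labels)
    ]


# Hypothetical example values, not taken from the issue
example_labels = ["Electric vehicle sales", "", "  ", "Battery supply chain"]
print(find_blank_labels(example_labels))
print(fill_blank_labels(example_labels))

# A generated text that starts with a newline gives an empty first line
generated = "\nElectric vehicle sales\nSome extra explanation"
print(repr(generated.split("\n")[0]))
print(first_nonempty_line(generated))
```

If the first call reports any positions for the real `llama2_labels`, those topics would be the ones showing up without a label in the plot, and filling them with a placeholder would at least make them visible.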

I checked them, but I think the problem occurs when I run it in a .py script. It works well when I run it in a Jupyter Notebook, but I need it in a .py file so that it can be automated.

That's strange. I believe the output is actually HTML, so it should not render differently in a Jupyter Notebook compared to a .py script.
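
To compare the two environments directly, one option is to inspect the saved HTML file itself rather than how it renders. Below is a minimal sketch using only the standard library. The helper name and the demo file are hypothetical, and the plain substring search is an assumption that can give false alarms, since Plotly stores the figure as JSON and labels with special characters may be escaped in the file.

```python
import tempfile
from pathlib import Path


def labels_missing_from_html(html_path, labels):
    """Return the non-blank labels that do not appear verbatim in the file."""
    html = Path(html_path).read_text(encoding="utf-8")
    return [label for label in labels if label.strip() and label not in html]


# Demo with a stand-in file; in practice, point this at the file written by write_html
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo.html"
    demo_path.write_text(
        "<html><body>Electric vehicle sales</body></html>", encoding="utf-8"
    )
    missing = labels_missing_from_html(
        demo_path, ["Electric vehicle sales", "Battery supply chain"]
    )
    print(missing)
```

Running the same check on the file produced by the notebook and the one produced by the script would show whether the labels are already absent from the HTML or only fail to display.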