MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.

Home Page: https://maartengr.github.io/BERTopic/

Guidance on managing BERTopic models

TheAIMagics opened this issue

I'm seeking some advice on managing BERTopic models for efficient topic clustering.

I've used BERTopic to cluster approximately 13,000 documents covering three months of data, which produced around 130 distinct topics. To reduce that number, I applied hierarchical topic modelling, which brought it down to 100 topics.

Upon thorough analysis of both the topics and the associated documents, I've identified the necessity for additional topic merging. As a result, I manually merged several topics to refine the clustering outcome.

Now I need to preserve this final model for future use. Specifically, I want to use it to predict the topics of the upcoming month's data.

I would greatly appreciate any insights or suggestions.

Thanks for sharing your question. Your answer can be found in the best practices section, which also details serialization. Generally, the most stable method of saving and using your model is through safetensors or pytorch. A pickled model can be very difficult to keep working across dependency versions, whereas the other methods are more lenient in that respect. Note that these methods do not use the underlying dimensionality reduction and clustering algorithms; instead, they calculate the assignment directly from the cosine similarity between document and topic embeddings.
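To make the last point concrete, here is a minimal sketch of assignment by cosine similarity. It is an illustration only, not BERTopic's code: the function name and the toy 2-D embeddings are made up, and real embeddings would come from the embedding model.

```python
import numpy as np


def assign_by_cosine_similarity(doc_embeddings, topic_embeddings):
    """Return the index of the most similar topic for each document (hypothetical helper)."""
    docs = np.asarray(doc_embeddings, dtype=float)
    topics = np.asarray(topic_embeddings, dtype=float)
    # Normalise each row so that a dot product equals the cosine similarity
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    topics = topics / np.linalg.norm(topics, axis=1, keepdims=True)
    similarity = docs @ topics.T  # shape: (n_docs, n_topics)
    return similarity.argmax(axis=1), similarity


# Toy 2-D embeddings, made up for illustration
topic_embeddings = [[1.0, 0.0], [0.0, 1.0]]
doc_embeddings = [[0.9, 0.1], [0.2, 0.8]]
assigned, similarity = assign_by_cosine_similarity(doc_embeddings, topic_embeddings)
```

Because no clustering model is involved in this kind of assignment, whatever is stored as the topic embeddings fully determines which topic a new document lands in.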

@MaartenGr Thank you for your response. I have grasped the concept of utilizing various serialization methods to save the final BERTopic model efficiently. However, I encountered a challenge specific to my scenario. Along with merging several topics, I also modified the mappings of documents. How can I effectively handle this scenario while ensuring the integrity of the final model?

How can I effectively handle this scenario while ensuring the integrity of the final model?

I'm not quite sure what you mean. What issue are you currently facing?

Upon analysing this distribution, I identified certain documents that were not assigned to the right clusters. To rectify this, I manually relocated these documents to their correct clusters.

Assuming you did this manual relocation using .update_topics, the resulting topic embeddings should be updated to incorporate those changes. So if you have saved and loaded the model using safetensors or pytorch, these topic embeddings will be used to perform the topic assignments.

How can I use update_topics to facilitate the relocation of documents within clusters while ensuring the accuracy of the model's topic assignments?

.update_topics recreates the topic embeddings based on the newly created topics, so the underlying assignment will also be updated assuming you use safetensors or pytorch.

Hi Maarten,

I have a similar question to TheAIMagics and am still not able to wrap my head around it. Here is a simplified version of what I am looking for:

After training my docs, I get 2 clusters:
ClusterA docs: Damaged products, used products, dirty products, incorrect product
ClusterB docs: Wrong product received, wrong delivery, wrong item sent

Now I review these clusters and want to tell the model that "incorrect product" actually belongs to ClusterB. How do I do this programmatically, so that when I send my next set of data to this model for predictions, anything related to "incorrect product" is correctly added to ClusterB?

I'm not exactly sure how to do that with update_topics, as this method seems to only update the topic representation, which may not include "incorrect product" at all.

@AILearnerMode If you check the source code of .update_topics you will notice that the topic embeddings are also updated and not only the topic representation. Therefore, and assuming you are using the safetensors/pytorch method I described above, it should also update the assignment.
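As a toy illustration of why updated topic embeddings change later assignments, the sketch below assumes each topic embedding is simply the mean of the embeddings of the documents assigned to it. That is a simplifying assumption for illustration; the actual computation inside .update_topics may differ, and both helper functions and all numbers are hypothetical.

```python
import numpy as np


def topic_centroids(doc_embeddings, topics):
    """Hypothetical helper: mean document embedding per topic label."""
    embeddings = np.asarray(doc_embeddings, dtype=float)
    labels = np.asarray(topics)
    return {int(t): embeddings[labels == t].mean(axis=0) for t in np.unique(labels)}


def nearest_topic(embedding, centroids):
    """Return the topic whose centroid has the highest cosine similarity."""
    vector = np.asarray(embedding, dtype=float)

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    return max(centroids, key=lambda t: cosine(vector, centroids[t]))


# Made-up 2-D embeddings: topic 0 stands for ClusterA, topic 1 for ClusterB
doc_embeddings = [[1.0, 0.0], [0.9, 0.2], [0.6, 0.6], [0.0, 1.0], [0.1, 0.9]]
original_topics = [0, 0, 0, 1, 1]
relocated_topics = [0, 0, 1, 1, 1]  # third document moved from topic 0 to topic 1

new_doc = [0.5, 0.6]
before = nearest_topic(new_doc, topic_centroids(doc_embeddings, original_topics))
after = nearest_topic(new_doc, topic_centroids(doc_embeddings, relocated_topics))
```

With these toy numbers, the new document is closest to topic 0 before the relocation and to topic 1 afterwards, because moving one document shifts both centroids.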

@MaartenGr

  1. Using the update_topics method, I modified both the topics and the mappings of documents.
  2. Saved the updated model using safetensors.
  3. Loaded the model for prediction tasks.

Despite following this workflow, the predictions generated by the loaded model remain consistent with the original model, rather than reflecting the modifications made to the topics and document mappings.
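Code:

from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

# Apply the manually corrected document-to-topic mapping
topic_model.update_topics(docs=modified_docs, topics=modified_topics)

# Save with safetensors, storing a pointer to the embedding model
directory_path = "directory_path"
embedding_model = "sentence-transformers/all-MiniLM-L6-v2"
topic_model.save(directory_path, serialization="safetensors", save_ctfidf=True, save_embedding_model=embedding_model)

# Reload the saved model
updated_model = BERTopic.load(directory_path)

# Predict the topic of a new document from its precomputed embedding
new_docs = ["But app shows return failed."]
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(new_docs, show_progress_bar=True)
new_topics, new_probs = updated_model.transform(new_docs, embeddings)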

Any suggestions on how to resolve this issue?

@TheAIMagics Could you share the steps before .update_topics? They might relate to choices being made in .update_topics.

@MaartenGr

  1. After training the BERTopic model, each document's information was saved to a file for reference.
  2. Within the saved file, a manual relocation was performed, moving six documents initially assigned to Topic 1 into Topic 3.
  3. Lists of modified documents (modified_docs) and their corresponding topics (modified_topics) were prepared.
  4. The BERTopic model was updated using the update_topics method.
  5. The updated topic information confirmed that the six documents had been relocated to Topic 3. However, when predicting with the updated model, the six relocated documents were still categorized under Topic 1.

# Imports needed by the code below
import pandas as pd
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP

# Step 1 - Extract embeddings
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(docs, show_progress_bar=True)
# Step 2 - Reduce dimensionality
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine")
# Step 3 - Tokenize topics
vectorizer_model = CountVectorizer(min_df=5, ngram_range=(1, 3), stop_words="english")
# Step 4 - Create topic representation
ctfidf_model = ClassTfidfTransformer(bm25_weighting=True, reduce_frequent_words=True)
topic_model = BERTopic(
    # Pipeline models
    embedding_model=embedding_model,  # Embedding model for sentence embeddings
    umap_model=umap_model,  # UMAP model for dimensionality reduction
    vectorizer_model=vectorizer_model,  # Vectorizer model for transforming text data
    ctfidf_model=ctfidf_model,  # Model for the class-based TF-IDF (c-TF-IDF) representation
    verbose=True,  # Display progress and information during training
    calculate_probabilities=True,
)
topics, probs = topic_model.fit_transform(docs, embeddings=embeddings)
topic_df = topic_model.get_topic_info()

# Save each document's topic and probability for manual review
# (df is the existing DataFrame that holds the original documents)
document_info = topic_model.get_document_info(docs)
df["Topic"] = document_info["Topic"].to_list()
df["Probability"] = document_info["Probability"].to_list()
df.to_csv("original_docs_mapping.csv", index=False)

# Load the manually corrected mapping and update the model with it
modified_docs_path = "manual relocation of topics file path"
modified_docs_df = pd.read_csv(modified_docs_path)
modified_docs = modified_docs_df["Document"].to_list()
modified_topics = modified_docs_df["Topic"].to_list()
topic_model.update_topics(docs=modified_docs, topics=modified_topics)

# Save with safetensors, storing a pointer to the embedding model
directory_path = "directory_path"
embedding_model = "sentence-transformers/all-MiniLM-L6-v2"
topic_model.save(directory_path, serialization="safetensors", save_ctfidf=True, save_embedding_model=embedding_model)

# Reload the saved model
updated_model = BERTopic.load(directory_path)

# Predict the topic of a new document from its precomputed embedding
new_docs = ["But app shows return failed."]
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(new_docs, show_progress_bar=True)
new_topics, new_probs = updated_model.transform(new_docs, embeddings)

[Image: topic_info() output of the original model]

[Image: topic_info() output of the updated model]

Hmmmm, it does seem like the topic embeddings are not properly updated... Could you perhaps run the following before doing the inference:

updated_model = BERTopic.load(directory_path, embedding_model="all-MiniLM-L6-v2")
updated_model._create_topic_vectors()

It should update the topic vectors using the embedding model.

@MaartenGr Thank you for your response

The model performs well when predicting the relocated documents, but it fails on outliers: it assigns them to seemingly random clusters instead of identifying them as outliers (Topic -1).

Steps Taken:

To address this issue, I attempted to utilize the hdbscan_model, which employs an approximation method to predict new points. This approach successfully predicts outliers, but it overlooks the accurate prediction of relocated documents.

topic_model.update_topics(docs=modified_docs, topics=modified_topics)
topic_model._create_topic_vectors()

# Use the same variable for the text and for the embeddings passed to transform
new_docs = ["I love apple fruit juice"]
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(new_docs, show_progress_bar=True)
topic_model.transform(new_docs, embeddings)

Challenge:

This approach does not handle the prediction of relocated documents properly. When using the hdbscan_model, the predictions generated by the loaded model remain consistent with the original model, which is not ideal.

I am open to saving the model using the pickle method for predictions. This might provide a workaround to ensure accurate predictions for both relocated documents and outliers. Any suggestions on how to tackle this issue would be greatly appreciated.

It makes sense that the method you tried does not work since the topic vectors are not taken into account when running the predictions for HDBSCAN. I suggested doing it after loading the model using either pytorch or safetensors since the topic vectors will be used in those cases.

I believe manually changing the assignment can only be done by removing HDBSCAN since that is a clustering model that learns certain clusters. What you suggest is adapting HDBSCAN, which is quite difficult.

Instead, I would still advise using the method I proposed earlier and see if you can identify a threshold for when something does or does not relate to an outlier. My expectation is that if it does not assign it to an outlier, you can still use the probabilities (by setting topic_model.calculate_probabilities_ = True after loading your model) to do a manual assignment. This manual assignment would consist of three sequential steps:

  • If the largest probability belongs to the outlier topic, assign the document to the outlier topic
  • Else, if the largest probability does not exceed a certain threshold, assign the document to the outlier topic
  • Else, assign the document to the non-outlier topic with the largest probability
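The three steps above can be sketched as follows. This is a minimal illustration, assuming a probability matrix with one row per document and one column per topic; the column order given by topic_ids, the 0.3 threshold, and the function name are all assumptions chosen for the example, not part of BERTopic's API.

```python
import numpy as np

OUTLIER_TOPIC = -1


def assign_with_threshold(probabilities, topic_ids, threshold=0.3):
    """Hypothetical helper implementing the three-step manual assignment."""
    probs = np.asarray(probabilities, dtype=float)
    ids = np.asarray(topic_ids)
    best_topic = ids[probs.argmax(axis=1)]  # topic with the largest probability
    best_probability = probs.max(axis=1)
    # Step 1: the outlier topic has the largest probability
    # Step 2: the largest probability does not exceed the threshold
    is_outlier = (best_topic == OUTLIER_TOPIC) | (best_probability <= threshold)
    # Step 3: otherwise keep the non-outlier topic with the largest probability
    return np.where(is_outlier, OUTLIER_TOPIC, best_topic)


# Made-up probabilities; columns correspond to topics -1, 0 and 1 (assumed layout)
topic_ids = [-1, 0, 1]
probabilities = [
    [0.60, 0.30, 0.10],  # outlier topic is the most likely
    [0.20, 0.25, 0.22],  # best topic is too weak to trust
    [0.10, 0.20, 0.70],  # confident non-outlier assignment
]
assigned = assign_with_threshold(probabilities, topic_ids)
```

A suitable threshold would have to be found empirically, for example by checking the probabilities of documents that are known to be outliers.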