Guidance on managing BERTopic models

Question

TheAIMagics opened this issue 3 months ago

I'm seeking some advice on managing BERTopic models for efficient topic clustering.

I've used BERTopic to cluster approximately 13,000 documents, covering three months of data, resulting in around 130 distinct topics. To reduce the number of topics further, I applied hierarchical topic modelling, which brought the number of topics down to 100.

Upon thorough analysis of both the topics and the associated documents, I've identified the necessity for additional topic merging. As a result, I manually merged several topics to refine the clustering outcome.

Now, I'm at a stage where I need to preserve this final model for future use. Specifically, I aim to use it to predict the topics of the upcoming month's data.

I would greatly appreciate any insights or suggestions.

Maarten Grootendorst · Answer 1 · Sun Mar 10 2024 15:27:25 GMT+0800 (China Standard Time)

Thanks for sharing your question. Your answer can be found in the best practices section that also details serialization. Generally, the most stable method of saving and using your model is through safetensors or pytorch. Using pickle can be very difficult to version control and the other methods are a bit more lenient when it comes to version control. Note that these methods do not use the underlying dimensionality reduction and clustering algorithms but instead directly calculate assignment based on the cosine similarity between document and topic embeddings.
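To make the last point concrete, here is a minimal sketch of similarity-based assignment in plain Python. It is not BERTopic's own code: the helper names (cosine_similarity, assign_topic) and the toy 3-dimensional embeddings are made up for illustration, and it only shows the general idea of picking the topic whose embedding is most similar to the document embedding.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def assign_topic(doc_embedding, topic_embeddings):
    """Return (topic_id, similarity) for the most similar topic embedding.

    topic_embeddings is a hypothetical dict mapping topic id -> vector.
    """
    best_topic, best_sim = None, -2.0
    for topic_id, topic_vec in topic_embeddings.items():
        sim = cosine_similarity(doc_embedding, topic_vec)
        if sim > best_sim:
            best_topic, best_sim = topic_id, sim
    return best_topic, best_sim


# Toy embeddings, invented for this example.
topic_embeddings = {
    0: [1.0, 0.0, 0.0],
    1: [0.0, 1.0, 0.0],
}
doc_embedding = [0.9, 0.1, 0.0]
topic, sim = assign_topic(doc_embedding, topic_embeddings)
print(topic, round(sim, 3))
```

Because no clustering model is involved at this stage, whatever is stored as the topic embeddings fully determines the prediction.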

Satya Thakur · Answer 2 · Mon Mar 11 2024 22:55:55 GMT+0800 (China Standard Time)

@MaartenGr Thank you for your response. I have grasped the concept of utilizing various serialization methods to save the final BERTopic model efficiently. However, I encountered a challenge specific to my scenario. Along with merging several topics, I also modified the mappings of documents. How can I effectively handle this scenario while ensuring the integrity of the final model?

Maarten Grootendorst · Answer 3 · Wed Mar 13 2024 21:20:47 GMT+0800 (China Standard Time)

> How can I effectively handle this scenario while ensuring the integrity of the final model?

I'm not quite sure what you mean. What issue are you currently facing?

Satya Thakur · Answer 4 · Thu Mar 14 2024 20:11:30 GMT+0800 (China Standard Time)

Hi Maarten, to clarify my use case: I ran BERTopic clustering on about 75,000 documents (representing three months of data). Since some clusters were similar, I merged them using .merge_topics. After this, I assessed the distribution of topics and associated documents using topic_model.get_document_info(docs).

Upon analysis of this distribution, I identified certain documents that were not correctly assigned to the right clusters. To rectify this, I manually relocated these documents to their correct clusters.

Now, when next month's data comes in, I want to predict the clusters, and I want the clustering to take my manual updates into account (so that I don't have to repeat the same manual exercise every month). In other words, how do I communicate the changes I am making manually to the model, especially the modifications made at the document level?

Thanks again for your help.

Maarten Grootendorst · Answer 5 · Fri Mar 15 2024 15:48:23 GMT+0800 (China Standard Time)

> Upon analysis of this distribution, I identified certain documents that were not correctly assigned to the right clusters. To rectify this, I manually relocated these documents to their correct clusters.

Assuming you did this manual relocation using .update_topics, the resulting topic embeddings should be updated to incorporate those changes. So if you have saved and loaded the model using safetensors or pytorch, these topic embeddings will be used to perform the topic assignments.

Satya Thakur · Answer 6 · Sat Mar 16 2024 15:23:12 GMT+0800 (China Standard Time)

How can I use update_topics to facilitate the relocation of documents within clusters while ensuring the accuracy of the model's topic assignments?

Maarten Grootendorst · Answer 7 · Mon Mar 18 2024 18:55:26 GMT+0800 (China Standard Time)

.update_topics recreates the topic embeddings based on the newly created topics, so the underlying assignment will also be updated assuming you use safetensors or pytorch.

AILearnerMode · Answer 8 · Thu Mar 21 2024 02:47:24 GMT+0800 (China Standard Time)

Hi Maarten,

I have a similar question to TheAIMagics and am still not able to wrap my head around it. Here is a simplified version of what I am looking for:

After training my docs, I get 2 clusters:
ClusterA docs: Damaged products, used products, dirty products, incorrect product
ClusterB docs: Wrong product received, wrong delivery, wrong item sent

Now I review these clusters and want to tell the model that "incorrect product" actually belongs to ClusterB. How do I do this programmatically, so that when I send my next set of data to this model for predictions, anything related to "incorrect products" is correctly added to ClusterB?

I'm not exactly sure how to do that with update_topics, as this method seems to only update the topic representation, which may not include "incorrect product" at all.

Maarten Grootendorst · Answer 9 · Thu Mar 21 2024 20:36:12 GMT+0800 (China Standard Time)

@AILearnerMode If you check the source code of .update_topics you will notice that the topic embeddings are also updated and not only the topic representation. Therefore, and assuming you are using the safetensors/pytorch method I described above, it should also update the assignment.
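As a rough mental model of why updated topic embeddings change the assignment, the sketch below treats each topic embedding as the mean of its documents' embeddings. That is a simplifying assumption made for illustration, not a description of how .update_topics computes its embeddings, and the 2-D vectors and helper names are invented. Moving "incorrect product" from ClusterA to ClusterB shifts both centroids, so a new, similar document lands in ClusterB.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Toy 2-D document embeddings (invented values).
doc_vectors = {
    "damaged product": [1.0, 0.1],
    "dirty product": [0.9, 0.2],
    "incorrect product": [0.5, 0.6],
    "wrong item sent": [0.1, 1.0],
    "wrong delivery": [0.2, 0.9],
}

# Initial cluster assignment, mirroring the example in the question.
assignment = {
    "damaged product": "A",
    "dirty product": "A",
    "incorrect product": "A",
    "wrong item sent": "B",
    "wrong delivery": "B",
}


def centroids_for(assignment):
    """Assumed topic embedding: mean of the member documents' vectors."""
    grouped = {}
    for doc, topic in assignment.items():
        grouped.setdefault(topic, []).append(doc_vectors[doc])
    return {topic: mean_vector(vecs) for topic, vecs in grouped.items()}


def nearest_topic(vector, centroids):
    """Topic whose centroid has the highest cosine similarity to vector."""
    return max(centroids, key=lambda topic: cosine(vector, centroids[topic]))


new_doc = [0.55, 0.6]  # a new document resembling "incorrect product"

before = nearest_topic(new_doc, centroids_for(assignment))

# Manual relocation, followed by recomputing the centroids.
assignment["incorrect product"] = "B"
after = nearest_topic(new_doc, centroids_for(assignment))

print(before, after)
```

The relocation only has an effect because the centroids are recomputed afterwards; if stale topic embeddings are used, the prediction stays the same.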

Satya Thakur · Answer 10 · Thu Mar 21 2024 23:07:01 GMT+0800 (China Standard Time)

@MaartenGr

1. Using the update_topics method, I modified both the topics and the mappings of documents.
2. I saved the updated model using safetensors.
3. I loaded the model for prediction tasks.

Despite following this workflow, the predictions generated by the loaded model remain the same as those of the original model, rather than reflecting the modifications made to the topics and document mappings.

Code:

from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

# Apply the manually corrected document-to-topic mapping
topic_model.update_topics(docs=modified_docs, topics=modified_topics)

# Save with safetensors, storing a pointer to the embedding model
directory_path = "directory_path"
embedding_model = "sentence-transformers/all-MiniLM-L6-v2"
topic_model.save(directory_path, serialization="safetensors", save_ctfidf=True, save_embedding_model=embedding_model)

# Reload the model and predict topics for new documents
updated_model = BERTopic.load(directory_path)

new_docs = ["But app shows return failed."]
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(new_docs, show_progress_bar=True)
new_topics, new_probs = updated_model.transform(new_docs, embeddings)

Any suggestions on how to resolve this issue?

Maarten Grootendorst · Answer 11 · Fri Mar 22 2024 17:56:44 GMT+0800 (China Standard Time)

@TheAIMagics Could you share the steps before .update_topics? The issue might relate to choices made before or in .update_topics.

Satya Thakur · Answer 12 · Fri Mar 22 2024 18:43:49 GMT+0800 (China Standard Time)

@MaartenGr

1. After training the BERTopic model, each document's information was saved to a file for reference.
2. Within the saved file, a manual relocation was performed, moving six documents initially assigned to Topic 1 into Topic 3.
3. Lists of modified documents (modified_docs) and their corresponding topics (modified_topics) were prepared.
4. The BERTopic model was updated using the update_topics method.
5. The updated topic information confirmed that the six documents had been relocated into Topic 3. However, upon prediction with the updated model, the six relocated documents were still categorized under Topic 1.


import pandas as pd
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP

# Step 1 - Extract embeddings
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(docs, show_progress_bar=True)
# Step 2 - Reduce dimensionality
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine")
# Step 3 - Tokenize topics
vectorizer_model = CountVectorizer(min_df=5, ngram_range=(1, 3), stop_words="english")
# Step 4 - Create topic representation
ctfidf_model = ClassTfidfTransformer(bm25_weighting=True, reduce_frequent_words=True)
topic_model = BERTopic(
    # Pipeline models
    embedding_model=embedding_model,  # Embedding model for sentence embeddings
    umap_model=umap_model,  # UMAP model for dimensionality reduction
    vectorizer_model=vectorizer_model,  # Vectorizer model for tokenizing the documents
    ctfidf_model=ctfidf_model,  # Class-based TF-IDF (c-TF-IDF) model for topic representations
    verbose=True,  # Display progress and information during training
    calculate_probabilities=True,
)
topics, probs = topic_model.fit_transform(docs, embeddings=embeddings)
topic_df = topic_model.get_topic_info()

# Fetch the per-document info once instead of twice
doc_info = topic_model.get_document_info(docs)
# df is assumed to be the existing DataFrame of the original documents (one row per document)
df["Topic"] = doc_info["Topic"].to_list()
df["Probability"] = doc_info["Probability"].to_list()
df.to_csv("original_docs_mapping.csv", index=False)

modified_docs_path = "manual relocation of topics file path"
modified_docs_df = pd.read_csv(modified_docs_path)
modified_docs = modified_docs_df['Document'].to_list()
modified_topics = modified_docs_df['Topic'].to_list()
topic_model.update_topics(docs=modified_docs, topics=modified_topics)

directory_path = "directory_path"
embedding_model = "sentence-transformers/all-MiniLM-L6-v2"
topic_model.save(directory_path, serialization="safetensors", save_ctfidf=True, save_embedding_model=embedding_model)

updated_model = BERTopic.load(directory_path)

new_docs = ["But app shows return failed."]
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(new_docs, show_progress_bar=True)
new_topics, new_probs = updated_model.transform(new_docs, embeddings)

[Figure: topic info of the original model]

[Figure: topic info of the updated model]

Maarten Grootendorst · Answer 13 · Sun Mar 24 2024 15:40:01 GMT+0800 (China Standard Time)

Hmmmm, it does seem like the topic embeddings are not properly updated... Could you perhaps run the following before doing the inference:

updated_model = BERTopic.load(directory_path, embedding_model="all-MiniLM-L6-v2")
updated_model._create_topic_vectors()

It should update the topic vectors using the embedding model.

Satya Thakur · Answer 14 · Fri Mar 29 2024 11:26:03 GMT+0800 (China Standard Time)

@MaartenGr Thank you for your response

The model performs well at predicting relocated documents, but it fails when dealing with outliers, assigning them to seemingly random clusters instead of identifying them as outliers (Topic -1).

Steps Taken:

To address this issue, I tried using the hdbscan_model, which uses an approximation method to predict new points. This approach successfully predicts outliers, but it does not reflect the relocated documents.

topic_model.update_topics(docs=modified_docs, topics=modified_topics)
topic_model._create_topic_vectors()

new_docs = ["I love apple fruit juice"]
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(new_docs, show_progress_bar=True)
topic_model.transform(new_docs, embeddings)

Challenge:

This approach does not handle the prediction of relocated documents properly. When using the hdbscan_model, the predictions remain identical to those of the original model, so the manual relocations are ignored.

I am open to saving the model using the pickle method for predictions. This might provide a workaround to ensure accurate predictions for both relocated documents and outliers. Any suggestions on how to tackle this issue would be greatly appreciated.

Maarten Grootendorst · Answer 15 · Fri Mar 29 2024 15:50:35 GMT+0800 (China Standard Time)

It makes sense that the method you tried does not work since the topic vectors are not taken into account when running the predictions for HDBSCAN. I suggested doing it after loading the model using either pytorch or safetensors since the topic vectors will be used in those cases.

I believe manually changing the assignment can only be done by removing HDBSCAN since that is a clustering model that learns certain clusters. What you suggest is adapting HDBSCAN, which is quite difficult.

Instead, I would still advise using the method I proposed earlier and see if you can identify a threshold for when something does or does not relate to an outlier. My expectation is that if it does not assign it to an outlier, you can still use the probabilities (by setting topic_model.calculate_probabilities_ = True after loading your model) to do a manual assignment. This manual assignment would consist of three sequential steps:

1. If the largest probability belongs to the outlier topic, assign the document to the outlier topic.
2. Else, if the largest probability does not exceed a certain threshold, assign it to the outlier topic.
3. Else, assign it to the non-outlier topic with the largest probability.
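The three steps above can be sketched as a small helper in plain Python. This is an illustration under stated assumptions, not part of BERTopic: the function name assign_with_threshold, the topic_ids list that maps each probability to a topic, and the default threshold of 0.5 are all hypothetical, and a suitable threshold would need to be found empirically on your own data.

```python
def assign_with_threshold(probabilities, topic_ids, threshold=0.5, outlier_topic=-1):
    """Pick a topic for one document from its per-topic scores.

    probabilities[i] is assumed to be the score for topic_ids[i].
    The threshold default is a placeholder, not a recommended value.
    """
    best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    best_topic = topic_ids[best_index]
    best_prob = probabilities[best_index]

    # Step 1: the outlier topic itself has the largest score.
    if best_topic == outlier_topic:
        return outlier_topic
    # Step 2: the largest score does not exceed the threshold.
    if best_prob <= threshold:
        return outlier_topic
    # Step 3: assign the non-outlier topic with the largest score.
    return best_topic


# Toy scores for the topics [-1, 0, 1], invented for this example.
topic_ids = [-1, 0, 1]
outlier_wins = assign_with_threshold([0.7, 0.2, 0.1], topic_ids)
too_uncertain = assign_with_threshold([0.2, 0.35, 0.3], topic_ids)
confident = assign_with_threshold([0.1, 0.8, 0.3], topic_ids)
print(outlier_wins, too_uncertain, confident)
```

If the loaded model has no outlier topic at all, step 1 never triggers and the threshold in step 2 is the only route to Topic -1.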