Updating and Pushing a BERTopic Model with New Documents to Hugging Face Hub still shows the old number of training documents
sdave-connexion opened this issue
Have you searched existing issues? 🔎
- I have searched and found no existing issues
Describe the bug
I have been using BERTopic for topic modelling and recently needed to update my existing BERTopic model with new documents. I want to push the updated model to the Hugging Face Hub, ensuring that it reflects the new number of documents and topics.
Here’s what I’ve done so far:
- Loaded my existing BERTopic model:
- Added new documents and their embeddings:
- Updated the model with new documents:
`new_topics, new_probs = topic_model.transform(lemmatized_docs, embeddings)`
- Saved the updated model using safetensors:
- Pushed the updated model to Hugging Face Hub:
Despite following these steps, I still see the old number of training documents in the repository on the Hugging Face Hub. How can I ensure that the updated model reflects the new number of training documents and topics?
Any help or guidance on this would be greatly appreciated!
Reproduction
```python
from bertopic import BERTopic
from huggingface_hub import login

# Load the existing BERTopic model
topic_model = BERTopic.load("shantanudave/BERTopic_ArXiv",
                            embedding_model="sentence-transformers/all-MiniLM-L6-v2")

# Predict topics for the new documents and their precomputed embeddings
new_topics, new_probs = topic_model.transform(lemmatized_docs, embeddings)

new_model_name = "BERTopic_v2"
embedding_model = "sentence-transformers/all-MiniLM-L6-v2"

# Save the updated model locally using safetensors
topic_model.save(new_model_name, serialization="safetensors",
                 save_ctfidf=True, save_embedding_model=embedding_model)

# Authenticate with Hugging Face
login(token="your_hugging_face_token")

# Push the updated model to Hugging Face Hub
topic_model.push_to_hf_hub(
    repo_id=f"shantanudave/{new_model_name}",
    serialization="safetensors",
    save_ctfidf=True,
    save_embedding_model=embedding_model,
)
```
BERTopic Version
pip install -U bertopic
> Updated the model with new documents:
That's the thing, you didn't update the model. When you use `.transform`, you are merely predicting the topics of the documents that you passed to it. `.transform`, as it is used in scikit-learn, is not meant to update the underlying model. Instead, if you want to update the model, I would advise using either online topic modeling or the `.merge_models` technique.
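For context, a minimal sketch of what the `.merge_models` route could look like here (available in BERTopic v0.16+; `new_docs` is a hypothetical list holding only the documents that arrived after the original training run):

```python
from bertopic import BERTopic

embedding_model = "sentence-transformers/all-MiniLM-L6-v2"

# Load the existing model from the Hub
base_model = BERTopic.load("shantanudave/BERTopic_ArXiv",
                           embedding_model=embedding_model)

# Fit a fresh model on only the new documents
new_model = BERTopic(embedding_model=embedding_model).fit(new_docs)

# Merge the two: topics in new_model that closely match existing
# topics are mapped onto them; genuinely new topics are appended
updated_model = BERTopic.merge_models([base_model, new_model])
```

The merged model can then be saved and pushed to the Hub as in the reproduction above, and it should reflect the combined set of topics.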
@MaartenGr
In my case, new data comes in every two days, so I am planning to:
- Load the existing model
- Update the model using Online Topic Modeling.
- Save the model
Is this approach correct? Or is there an easier way?
Thanks in advance
You can only do this if step 1 was also done with online topic modeling. You cannot use `.partial_fit` after `.fit` at the moment. Instead, I would advise using the `.merge_models` technique to iteratively combine new models.
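For completeness, a rough sketch of what an online topic modeling pipeline looks like when built from scratch, following the approach in the BERTopic online topic modeling docs (every sub-model must support `.partial_fit`; `doc_batches` is a hypothetical iterable of incoming document batches):

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import IncrementalPCA
from bertopic import BERTopic
from bertopic.vectorizers import OnlineCountVectorizer

# Each component supports partial_fit, so the whole pipeline
# can be updated incrementally instead of refit from scratch
umap_model = IncrementalPCA(n_components=5)
cluster_model = MiniBatchKMeans(n_clusters=50, random_state=0)
vectorizer_model = OnlineCountVectorizer(stop_words="english", decay=0.01)

topic_model = BERTopic(umap_model=umap_model,
                       hdbscan_model=cluster_model,
                       vectorizer_model=vectorizer_model)

# Feed batches (e.g. each two-day delivery) as they arrive
for batch in doc_batches:
    topic_model.partial_fit(batch)
```

If the existing model was trained with a regular `.fit`, this route is not available, and iteratively combining models with `BERTopic.merge_models` (as sketched above) is the recommended alternative.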