MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.

Home Page: https://maartengr.github.io/BERTopic/


Set random seed in `hierarchical_topics`?

serenalotreck opened this issue

I've set the random seed when I fit my topic model, and I'm getting reproducible results. I'm using the following:

# Imports assumed from context (they appear in full in the reproducible example below)
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP
from sklearn.feature_extraction.text import CountVectorizer

def fit_reduce_model(rep_model, docs):
    """
    Defines all component models internally besides the representation model, which is the only one that changes.
    Pre-calculates embeddings, fits model, and performs outlier reduction.

    parameters:
        rep_model, class instance from bertopic.representation: representation model
        docs, list of str: documents to model

    returns:
        topic_model, BERTopic model: fitted model with outliers reduced
    """
    # Define all component models
    print('Defining component models...')
    sentence_model = SentenceTransformer('allenai/scibert_scivocab_cased')
    umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
    ## Using default HDBSCAN model, no definition needed
    vectorizer_model = CountVectorizer(stop_words="english", ngram_range=(1, 3), min_df=10)
    representation_model = rep_model
    
    # Pre-calculate embeddings
    print('Calculating embeddings...')
    embeddings = sentence_model.encode(docs, show_progress_bar=True)
    # Reduce the embeddings up front to allow us to quickly iterate later on
    # (note: reduced_embeddings is not used again inside this function)
    reduced_embeddings = umap_model.fit_transform(embeddings)
    
    # Fit the model
    print('Fitting model...')
    topic_model = BERTopic(
        embedding_model=sentence_model,
        umap_model=umap_model,
        representation_model=representation_model,
        vectorizer_model=vectorizer_model,
    )
    topics, probs = topic_model.fit_transform(docs, embeddings)
    
    # Reduce outliers
    print('Reducing outliers...')
    new_topics = topic_model.reduce_outliers(docs, topics, strategy='embeddings', threshold=0.1) # This method ends up reducing all outliers even with this threshold
    topic_model.update_topics(docs, topics=new_topics, vectorizer_model=vectorizer_model, representation_model=representation_model)
    
    return topic_model

However, when I run the following, I get varied results:

# Fit the model
mmr_rep_model = MaximalMarginalRelevance(diversity=0.3)
mmr_model = fit_reduce_model(mmr_rep_model, docs)

# Generate hierarchical topics
hierarchical_topics = mmr_model.hierarchical_topics(docs)
fig = mmr_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)
fig.show()

I don't see a way in the docs to set a random seed for hierarchical_topics; let me know if I've overlooked something!
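One blind workaround I could try, sketched below, is seeding the global RNGs right before the call, on the unconfirmed assumption that any internal randomness draws from Python's or NumPy's global generators (the seed value 42 is arbitrary):

import random
import numpy as np

# Hypothetical workaround: seed the global RNGs immediately before the call.
# This only helps if hierarchical_topics actually draws from these generators.
random.seed(42)
np.random.seed(42)
hierarchical_topics = mmr_model.hierarchical_topics(docs)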

Just to be sure, do you get varied results out of .hierarchical_topics or out of .visualize_hierarchy? They are separate code paths, and your code suggests you are getting varied results from .visualize_hierarchy rather than from .hierarchical_topics.

Why would .visualize_hierarchy give different results if hierarchical_topics is the same, given that hierarchical_topics is passed into .visualize_hierarchy? I can go check, but if that is the case, I'd like a way to set a random seed for .visualize_hierarchy!

OK, I checked, and it is .hierarchical_topics that's giving different results. I saved out the results, read them back in, and when I ran .visualize_hierarchy on the saved results, I got the same visualization.

> Why would .visualize_hierarchy give different results if hierarchical_topics is the same, given that hierarchical_topics is passed into .visualize_hierarchy?

They are separate code paths, so randomness can appear in either function. There has also been randomness in the visualization functions before.

> OK, I checked, and it is .hierarchical_topics that's giving different results. I saved out the results, read them back in, and when I ran .visualize_hierarchy on the saved results, I got the same visualization.

That's good to know! Looking through the code of .hierarchical_topics (assuming you are using v0.16 of BERTopic), I don't see anything that would explain this.
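For reference, the default behavior of .hierarchical_topics boils down to something like the following (a simplified sketch based on the v0.16 source, not the exact implementation): it computes pairwise distances between the topics' c-TF-IDF vectors and hands them to scipy's agglomerative linkage.

from scipy.cluster import hierarchy as sch
from sklearn.metrics.pairwise import cosine_similarity

# Pairwise distances between the topics' c-TF-IDF representations
distances = 1 - cosine_similarity(topic_model.c_tf_idf_)
# Agglomerative clustering; deterministic given the same distance matrix
Z = sch.linkage(distances, method='ward', optimal_ordering=True)

Neither step involves a random number generator, so a fixed fitted model should always yield the same hierarchy.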

Does this also happen if you run it with 20NewsGroups? Could you create a self-contained reproducible example? That way, I can more easily find the issue.

Here is the code with 20NewsGroups:

from bertopic import BERTopic
from bertopic.representation import MaximalMarginalRelevance
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

def fit_reduce_model(rep_model, docs):
    """
    Defines all component models internally besides the representation model, which is the only one that changes.
    Pre-calculates embeddings, fits model, and performs outlier reduction.

    parameters:
        rep_model, class instance from bertopic.representation: representation model
        docs, list of str: documents to model

    returns:
        topic_model, BERTopic model: fitted model with outliers reduced
    """
    # Define all component models
    print('Defining component models...')
    sentence_model = SentenceTransformer('allenai/scibert_scivocab_cased')
    umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
    ## Using default HDBSCAN model, no definition needed
    vectorizer_model = CountVectorizer(stop_words="english", ngram_range=(1, 3), min_df=10)
    representation_model = rep_model
    
    # Pre-calculate embeddings
    print('Calculating embeddings...')
    embeddings = sentence_model.encode(docs, show_progress_bar=True)
    # Reduce the embeddings up front to allow us to quickly iterate later on
    # (note: reduced_embeddings is not used again inside this function)
    reduced_embeddings = umap_model.fit_transform(embeddings)
    
    # Fit the model
    print('Fitting model...')
    topic_model = BERTopic(
        embedding_model=sentence_model,
        umap_model=umap_model,
        representation_model=representation_model,
        vectorizer_model=vectorizer_model,
    )
    topics, probs = topic_model.fit_transform(docs, embeddings)
    
    # Reduce outliers
    print('Reducing outliers...')
    new_topics = topic_model.reduce_outliers(docs, topics, strategy='embeddings', threshold=0.1) # This method ends up reducing all outliers even with this threshold
    topic_model.update_topics(docs, topics=new_topics, vectorizer_model=vectorizer_model, representation_model=representation_model)
    
    return topic_model

representation_model = MaximalMarginalRelevance(diversity=0.3)
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))["data"]

model = fit_reduce_model(representation_model, docs)

hierarchical_topics = model.hierarchical_topics(docs)
hierarchical_topics.head()

fig = model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)
fig.show()

It looks like neither of the two functions introduces randomness for 20NewsGroups (see screenshots below), which is super odd to me, because the only difference between this code and what I ran previously is the input docs, and I wouldn't expect that to matter.


Running it the first time:

[screenshot]

Running just the visualization again:

[screenshot]

Running both the hierarchical topic generation and the visualization again:

[screenshot]

I wouldn't expect the input documents to have this kind of influence; I suspect it stems from a difference in either your code or your environment. Did you make sure the environments you used with your own data and with 20NewsGroups are exactly the same? As in, the same Python version and dependency versions (even numpy, numba, pandas, etc.)?
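One quick way to compare the two environments, as a minimal sketch (assuming the standard __version__ attribute on each package):

import sys
import numpy, numba, pandas, sklearn, umap, hdbscan, bertopic

# Print the interpreter and key dependency versions in each environment
print(sys.version)
for module in (numpy, numba, pandas, sklearn, umap, hdbscan, bertopic):
    print(module.__name__, module.__version__)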

Yes, they're the same! I launched a Jupyter notebook instance using the same kernel that I made from a conda environment, and I haven't changed any packages in the conda environment between noticing the issue and trying to reproduce it.

In that case, I'm not entirely sure what is happening here. The data should not influence whether something is reproducible; it should not introduce any stochasticity or randomness unless the data itself is random.
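One diagnostic that might narrow it down (hypothetical, not something tried in this thread): call .hierarchical_topics twice on the same fitted model within one session and compare the resulting DataFrames:

ht1 = model.hierarchical_topics(docs)
ht2 = model.hierarchical_topics(docs)
# False would mean the call itself is non-deterministic;
# True would point at the model fit (or the session) instead
print(ht1.equals(ht2))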

Very weird... I'm going to stick with reading in the saved hierarchical topics when I need to regenerate the figure for now; I'll let you know if I figure anything else out. Thanks for your help!
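For reference, a minimal sketch of that save-and-reload workaround; pickle is used here because the hierarchy DataFrame contains list-valued columns that a CSV round-trip would flatten into strings:

import pandas as pd

# Save the hierarchy once, right after generating it
hierarchical_topics.to_pickle('hierarchical_topics.pkl')

# Later: reload and regenerate the figure without recomputing the hierarchy
hierarchical_topics = pd.read_pickle('hierarchical_topics.pkl')
fig = mmr_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)
fig.show()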