MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.

Home Page: https://maartengr.github.io/BERTopic/

`precomputed` Distance Compatibility for HDBSCAN

jjovalle99 opened this issue · comments

Hi there!

Recently, I've been experimenting with the UMAP + HDBSCAN workflow and noticed an opportunity to enhance its functionality related to distance metrics.

Proposal:

I propose adding support for precomputed distance matrices in HDBSCAN within BERTopic. This would let users supply custom distance metrics, including cosine distance, which HDBSCAN does not support as a built-in metric.

Why This Matters:

  • Flexibility: This addition would provide users with the ability to use a broader range of distance metrics, tailoring the model more closely to their specific needs.
  • Semantic Understanding: Cosine similarity is particularly effective for understanding semantic relationships in text data. By enabling precomputed distances, users can leverage cosine similarity for better topic modeling outcomes.
  • Wider Application: This feature could broaden BERTopic's applicability across different domains where specific distance metrics are crucial for accurate modeling.

Implementation Insight:

I've already implemented a quick version of this feature locally and found that it integrates well with the existing pipeline. I'm confident it could be a valuable addition to BERTopic without compromising performance or usability. The following is a non-exhaustive sketch; it would need more work to be fully incorporated, but it shows the idea:

    def __init__(self,
                 language: str = "english",
                 top_n_words: int = 10,
                 n_gram_range: Tuple[int, int] = (1, 1),
                 min_topic_size: int = 10,
                 nr_topics: Union[int, str] = None,
                 low_memory: bool = False,
                 calculate_probabilities: bool = False,
                 seed_topic_list: List[List[str]] = None,
                 zeroshot_topic_list: List[str] = None,
                 zeroshot_min_similarity: float = .7,
                 embedding_model=None,
                 umap_model: UMAP = None,
                 hdbscan_model: hdbscan.HDBSCAN = None,
                 vectorizer_model: CountVectorizer = None,
                 ctfidf_model: TfidfTransformer = None,
                 representation_model: BaseRepresentation = None,
                 verbose: bool = False,
                 distance_matrix: np.ndarray = None,  # <-- new parameter
                 ):
        ...
        self.hdbscan_model = hdbscan_model or hdbscan.HDBSCAN(min_cluster_size=self.min_topic_size,
                                                              metric='euclidean',
                                                              cluster_selection_method='eom',
                                                              prediction_data=True)
        self.distance_matrix = distance_matrix  # <-- store the precomputed matrix

    def _cluster_embeddings(self,
                            umap_embeddings: np.ndarray,
                            documents: pd.DataFrame,
                            partial_fit: bool = False,
                            y: np.ndarray = None) -> Tuple[pd.DataFrame,
                                                           np.ndarray]:
        ...
        logger.info("Cluster - Start clustering the reduced embeddings")
        if partial_fit:
            self.hdbscan_model = self.hdbscan_model.partial_fit(umap_embeddings)
            labels = self.hdbscan_model.labels_
            documents['Topic'] = labels
            self.topics_ = labels
        elif self.hdbscan_model.get_params()["metric"] == "precomputed":  # <-- new branch
            logger.info("Cluster - Using a precomputed distance matrix (MUST BE OF THE REDUCED EMBEDDINGS)")
            self.hdbscan_model.fit(self.distance_matrix)
            labels = self.hdbscan_model.labels_
            documents['Topic'] = labels
            self._update_topic_size(documents)

I'd love to hear your thoughts on this proposal. Do you see this as a valuable addition to BERTopic? Would there be any concerns or additional considerations we should account for?

I'm excited about the potential to contribute this feature to the community and look forward to your feedback.

Thank you for considering this enhancement!

Thank you for sharing this extensive description of this use case! I agree that it would be nice to have something like this implemented although I am curious as to how many users would end up using this feature.

Having said that, you can already pass the distance matrix to BERTopic as if it were the embeddings and then simply skip over dimensionality reduction (as you already did before) in order to make this work. It would, however, introduce issues with the topic embeddings, but I'm actually curious about what would happen.

Lastly, do you think there is a way to implement this without introducing an HDBSCAN-specific parameter to the initialization of BERTopic? The reason why I ask is that my philosophy with BERTopic is to make it as modular as possible, so introducing this parameter might go against that if it is specific to HDBSCAN. Moreover, I want to keep the parameter space as small as possible in the initialization to keep the usage of BERTopic user-friendly. I have already seen some information-overload happening with the current set of parameters.

What do you think?

Hey @MaartenGr, thank you for answering!

Yes, I think it's possible to implement this. As an initial idea, we could read the metric parameter from HDBSCAN (self.hdbscan_model.get_params()["metric"]) and branch on it. We can leverage scikit-learn's pairwise metrics to compute the distance matrix without adding any extra parameters, while maintaining modularity.

If I get your approval, I can start working on that.

Ah right, then we would calculate the distance matrix ourselves based on the metric set within HDBSCAN. I think it's important here that there are additional checks so that a missing "metric" parameter does not cause errors, or that the metric is calculated automatically as a fallback.
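For example, something along these lines (a hypothetical guard, just to illustrate the check; the stand-in classes mimic user-supplied clustering models):

```python
DEFAULT_METRIC = "euclidean"

def resolve_metric(cluster_model, default=DEFAULT_METRIC):
    """Hypothetical guard: fall back to a default when the clustering model
    does not expose scikit-learn's get_params(), or exposes no 'metric'."""
    get_params = getattr(cluster_model, "get_params", None)
    if get_params is None:
        return default
    return get_params().get("metric", default)

class CustomClusterer:
    """Stand-in for a user-supplied model without a scikit-learn API."""
    def fit(self, X):
        return self

class PrecomputedClusterer:
    """Stand-in for a model configured with a precomputed metric."""
    def get_params(self):
        return {"metric": "precomputed"}
```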

Your work on this would be greatly appreciated!