embeddings-benchmark / mteb

MTEB: Massive Text Embedding Benchmark

Home Page: https://arxiv.org/abs/2210.07316

Multilayer/Hierarchical Clustering

KennethEnevoldsen opened this issue

Currently clustering works by:

  1. Compute embeddings for the documents
  2. Cluster based on the labels and the embeddings
  3. Calculate the v_measure
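
In code this is roughly the following (a minimal sketch assuming a sentence-transformers-style `model.encode`; the helper name and plain k-means are just stand-ins for the actual implementation):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import v_measure_score

def evaluate_clustering(model, texts, labels):
    # 1. Compute embeddings for the documents
    embeddings = model.encode(texts)
    # 2. Cluster, with as many clusters as there are unique gold labels
    predictions = KMeans(n_clusters=len(set(labels))).fit_predict(embeddings)
    # 3. Calculate v_measure against the gold labels
    return v_measure_score(labels, predictions)
```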

Some datasets have multiple layers of annotations (e.g. genres and subgenres), so we could introduce a loop over the label levels, e.g. (sketched in code after the list):

  1. Compute embeddings for the documents
  2. Cluster based on label level 1 and the embeddings
  3. Calculate the v_measure for label level 1
  4. Cluster based on label level 2 and the embeddings
  5. Calculate the v_measure for label level 2
    ...
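
Sketched as code, under the same assumptions as above (`labels_per_level` would be e.g. `[genres, subgenres]`):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import v_measure_score

def evaluate_multilevel(model, texts, labels_per_level):
    # Embed once, then score each annotation level independently
    embeddings = model.encode(texts)
    scores = {}
    for level, labels in enumerate(labels_per_level, start=1):
        predictions = KMeans(n_clusters=len(set(labels))).fit_predict(embeddings)
        scores[f"v_measure_level_{level}"] = v_measure_score(labels, predictions)
    return scores
```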

A good sample dataset for this might be SNLClustering (Norwegian) or WikiClustering (Multi).

I was just thinking about this yesterday, I think I can take it :D

This is, again, the same kind of issue as with classification:

  1. Do we only consider hierarchical clustering?
  2. If so, do we penalise models for having gotten something wrong on an earlier level? The way @KennethEnevoldsen proposed it sort of assumes that the scores on the different levels are independent (correct me if I'm saying something stupid)
  3. Do we actually want to measure model performance with a hierarchical clustering method, or do we stick with KMeans? I have looked into this a bit and there isn't much in the way of literature on evaluating hierarchical clustering; we'd have to get smart about it (one possible option is sketched after this list).
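
To make question 3 a bit more concrete: one option would be to build a single dendrogram over the embeddings and cut it once per level, so the per-level partitions are at least nested. A rough sketch with scipy (purely illustrative, not a concrete proposal):

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.metrics import v_measure_score

def evaluate_with_dendrogram(embeddings, labels_per_level):
    # One agglomerative tree (Ward linkage) over all embeddings
    tree = linkage(embeddings, method="ward")
    scores = []
    for labels in labels_per_level:
        # Cut the tree into as many clusters as there are gold labels at this level
        predictions = fcluster(tree, t=len(set(labels)), criterion="maxclust")
        scores.append(v_measure_score(labels, predictions))
    return scores
```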

Here's an alternative that would also make sense to me (rough code sketch after the list):

  1. Cluster embeddings -> Calculate v-scores for first level
  2. For each first-level gold label:
    1. Select all texts that carry this gold label, even the ones the model mislabelled as something else, and give those mislabelled ones gold label -1.
    2. Cluster these -> Calculate v-scores
  3. Repeat for all levels
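
A rough sketch of that alternative for two levels. Two assumptions I'm making explicit: gold labels are integers, and since k-means clusters carry no names, "mislabelled" is approximated as not falling into the majority predicted cluster for one's gold label (all names here are made up):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import v_measure_score

def evaluate_nested(embeddings, level1_gold, level2_gold):
    embeddings = np.asarray(embeddings)
    level1_gold = np.asarray(level1_gold)
    level2_gold = np.asarray(level2_gold)

    # 1. Cluster everything and score against the level-1 gold labels
    pred = KMeans(n_clusters=len(set(level1_gold))).fit_predict(embeddings)
    scores = {"level_1": v_measure_score(level1_gold, pred)}

    # 2. One sub-evaluation per level-1 gold label
    for gold in np.unique(level1_gold):
        mask = level1_gold == gold
        # Clusters are unnamed, so approximate "correctly labelled" as being
        # in the majority predicted cluster for this gold label
        majority = np.bincount(pred[mask]).argmax()
        # Mislabelled texts stay in the sample but get gold label -1
        sub_gold = np.where(pred[mask] == majority, level2_gold[mask], -1)
        sub_pred = KMeans(n_clusters=len(set(sub_gold))).fit_predict(embeddings[mask])
        scores[f"level_2/{gold}"] = v_measure_score(sub_gold, sub_pred)
    return scores
```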

If anyone has ideas or five cents to add on this, I would love to hear them.