embeddings-benchmark / mteb

MTEB: Massive Text Embedding Benchmark

Home Page: https://arxiv.org/abs/2210.07316

Multilayer/Hierarchical Clustering

KennethEnevoldsen opened this issue

Currently clustering works by:

  1. Compute embeddings for the documents
  2. Cluster based on the labels and the embeddings
  3. Calculate the v_measure
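
In code this is roughly the following (a minimal sketch assuming a sentence-transformers-style `model.encode`; the helper name and plain k-means are just stand-ins for the actual implementation):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import v_measure_score

def evaluate_clustering(model, texts, labels):
    # 1. Compute embeddings for the documents
    embeddings = model.encode(texts)
    # 2. Cluster, with as many clusters as there are unique gold labels
    predictions = KMeans(n_clusters=len(set(labels))).fit_predict(embeddings)
    # 3. Calculate v_measure against the gold labels
    return v_measure_score(labels, predictions)
```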

Some datasets have multiple layers of annotations (e.g. genres and subgenres), so we could introduce a loop over the label levels, e.g. (sketched in code after the list):

  1. Compute embeddings for the documents
  2. Cluster based on label level 1 and the embeddings
  3. Calculate the v_measure for label level 1
  4. Cluster based on label level 2 and the embeddings
  5. Calculate the v_measure for label level 2
    ...
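
Sketched as code, under the same assumptions as above (`labels_per_level` would be e.g. `[genres, subgenres]`):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import v_measure_score

def evaluate_multilevel(model, texts, labels_per_level):
    # Embed once, then score each annotation level independently
    embeddings = model.encode(texts)
    scores = {}
    for level, labels in enumerate(labels_per_level, start=1):
        predictions = KMeans(n_clusters=len(set(labels))).fit_predict(embeddings)
        scores[f"v_measure_level_{level}"] = v_measure_score(labels, predictions)
    return scores
```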

A good sample dataset for this might be SNLClustering (Norwegian) or WikiClustering (Multi).

I was just thinking about this yesterday, I think I can take it :D

This is, again, the same kind of issue as with classification:

  1. Do we only consider hierarchical clustering?
  2. If so, do we penalise models for having gotten something wrong on an earlier level? The way @KennethEnevoldsen proposed it sort of assumes that the scores on the different levels are independent (correct me if I'm saying something stupid)
  3. Do we actually want to measure model performance with a hierarchical clustering method, or do we stick with KMeans? I have looked into this a bit and there isn't much in the way of literature on evaluating hierarchical clustering; we'd have to get smart about it (one possible option is sketched after this list).
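
To make question 3 a bit more concrete: one option would be to build a single dendrogram over the embeddings and cut it once per level, so the per-level partitions are at least nested. A rough sketch with scipy (purely illustrative, not a concrete proposal):

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.metrics import v_measure_score

def evaluate_with_dendrogram(embeddings, labels_per_level):
    # One agglomerative tree (Ward linkage) over all embeddings
    tree = linkage(embeddings, method="ward")
    scores = []
    for labels in labels_per_level:
        # Cut the tree into as many clusters as there are gold labels at this level
        predictions = fcluster(tree, t=len(set(labels)), criterion="maxclust")
        scores.append(v_measure_score(labels, predictions))
    return scores
```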

Here's an alternative that would also make sense to me (rough code sketch after the list):

  1. Cluster embeddings -> Calculate v-scores for first level
  2. For each first-level gold label:
    1. Select all texts that carry this gold label, even the ones the model mislabelled as something else, and give those mislabelled ones gold label -1.
    2. Cluster these -> Calculate v-scores
  3. Repeat for all levels
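
A rough sketch of that alternative for two levels. Two assumptions I'm making explicit: gold labels are integers, and since k-means clusters carry no names, "mislabelled" is approximated as not falling into the majority predicted cluster for one's gold label (all names here are made up):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import v_measure_score

def evaluate_nested(embeddings, level1_gold, level2_gold):
    embeddings = np.asarray(embeddings)
    level1_gold = np.asarray(level1_gold)
    level2_gold = np.asarray(level2_gold)

    # 1. Cluster everything and score against the level-1 gold labels
    pred = KMeans(n_clusters=len(set(level1_gold))).fit_predict(embeddings)
    scores = {"level_1": v_measure_score(level1_gold, pred)}

    # 2. One sub-evaluation per level-1 gold label
    for gold in np.unique(level1_gold):
        mask = level1_gold == gold
        # Clusters are unnamed, so approximate "correctly labelled" as being
        # in the majority predicted cluster for this gold label
        majority = np.bincount(pred[mask]).argmax()
        # Mislabelled texts stay in the sample but get gold label -1
        sub_gold = np.where(pred[mask] == majority, level2_gold[mask], -1)
        sub_pred = KMeans(n_clusters=len(set(sub_gold))).fit_predict(embeddings[mask])
        scores[f"level_2/{gold}"] = v_measure_score(sub_gold, sub_pred)
    return scores
```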

If anyone has ideas or five cents to add on this, I would love to hear them.