[Question] Ideas on clustering vectors (without defining no. clusters)

lassebomh opened this issue 10 months ago · comments

Lasse H. Bomholt commented 10 months ago

Hey there,

I'm using pgvector with Django and it's doing a great job.

I have a large table of text paired with embeddings, and I want to automatically organize the rows into categories, then label each category with GPT by sampling its text. That means I don't know the number of clusters in advance. Ideally, a solution would let me define a maximum distance between any two vectors in a cluster, so I can adjust how general the clusters are. Any ideas on the best way to do this?

Since HNSW and IVFFlat indexes already group vectors in a way that resembles clustering, maybe we could query them somehow? I'm just throwing ideas around here. Perhaps doing the clustering inside the database is simply a bad idea; I honestly don't know.

What do you think?

Thanks in advance.

Andrew Kane · Answer 1 · Sat Sep 16 2023 01:02:28 GMT+0800 (China Standard Time)

Hi @lassebomh, this page from scikit-learn has a few suggestions for clustering with an unknown number of clusters (more clustering docs). IVFFlat uses k-means clustering, which requires a known number of clusters, but I plan to keep it an internal detail of the index for now (as the exact method could change).
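For illustration, here is a minimal sketch of one option from scikit-learn that needs no cluster count: agglomerative clustering with a distance threshold. It assumes the embeddings have already been loaded from the table into a NumPy array. The helper name `cluster_by_max_distance`, the synthetic data, and the threshold value are made up for the example. With complete linkage, two groups are merged only while the largest pairwise distance between their members stays below the threshold, which roughly matches the "maximum distance inside a cluster" requirement.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering


def cluster_by_max_distance(embeddings, max_distance):
    """Return one cluster label per vector; the number of clusters is not fixed.

    Hypothetical helper: complete linkage plus a distance threshold keeps every
    pair of vectors within a cluster closer than max_distance (Euclidean).
    """
    X = np.asarray(embeddings, dtype=float)
    model = AgglomerativeClustering(
        n_clusters=None,                  # let the threshold decide the count
        distance_threshold=max_distance,  # stop merging at or above this distance
        linkage="complete",               # merge cost = largest pairwise distance
    )
    return model.fit_predict(X)


# Synthetic stand-in for embeddings fetched from the database (assumption).
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.05, size=(20, 8))
blob_b = rng.normal(loc=5.0, scale=0.05, size=(20, 8))
vectors = np.vstack([blob_a, blob_b])

labels = cluster_by_max_distance(vectors, max_distance=1.0)
print("clusters found:", len(set(labels)))
```

Raising `max_distance` gives fewer, more general clusters; lowering it gives more, tighter ones. This approach compares all pairs of vectors, so for a large table it may be necessary to cluster a sample first.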

Lasse H. Bomholt · Answer 2 · Sat Sep 16 2023 01:44:11 GMT+0800 (China Standard Time)

scikit-learn is definitely the go-to when the vectors are in memory, but I was wondering whether there are any in-database techniques?

Thanks a lot for helping out.

Andrew Kane · Answer 3 · Sat Sep 16 2023 02:03:02 GMT+0800 (China Standard Time)

pgvector doesn't provide any, but you could create a Postgres extension that implements one of the methods above.