[Question] Ideas on clustering vectors (without defining no. clusters)
lassebomh opened this issue
Hey there,
I'm using pgvector with Django and it's doing a great job.
I have a large table of text paired with embeddings, and I want to automatically organize the rows into categories, then label the categories with GPT by sampling the text. That means I don't know the number of clusters in advance. Ideally, a solution would let me define a maximum distance between any two vectors inside a cluster, so I can adjust how general the clusters are. Any ideas on the best way of doing this?
Since HNSW or IVFFlat indexes are already a sort of cluster, maybe we could query them somehow? Just throwing ideas around here. Perhaps it is simply a bad idea to do the clustering inside the database; I honestly don't know.
What do you think?
Thanks in advance.
Hi @lassebomh, this page from scikit-learn has a few suggestions for clustering with an unknown number of clusters (more clustering docs). IVFFlat uses k-means clustering, which requires a known number of clusters, but I plan to keep it an internal detail of the index for now (as the exact method could change).
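For illustration, here is a minimal sketch of one method from that family: agglomerative clustering with complete linkage and a distance cutoff, which needs no cluster count up front and keeps every pair of vectors inside a cluster within the chosen distance. It is plain Python rather than pgvector or scikit-learn code. The names `cosine_distance`, `cluster_by_max_distance`, and `max_distance` are made up for this example, and the toy 2-D vectors stand in for real embeddings. It recomputes cluster distances on every pass, so it only suits a small in-memory sample. In scikit-learn, `AgglomerativeClustering` with `n_clusters=None`, a `distance_threshold`, and `linkage="complete"` covers the same idea.

```python
import math
from itertools import combinations


def cosine_distance(a, b):
    """Cosine distance between two non-zero vectors (0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def cluster_by_max_distance(vectors, max_distance):
    """Complete-linkage agglomerative clustering with a distance cutoff.

    Returns a list of clusters, each a list of indices into `vectors`.
    No two members of a cluster are farther apart than `max_distance`.
    """
    n = len(vectors)
    # Precompute the pairwise distance matrix once.
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        dist[i][j] = dist[j][i] = cosine_distance(vectors[i], vectors[j])

    # Start with every vector in its own cluster.
    clusters = [[i] for i in range(n)]
    while len(clusters) > 1:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            # Complete linkage: cluster distance is the farthest member pair.
            d = max(dist[i][j] for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        # Stop once even the closest pair of clusters exceeds the cutoff.
        if best[0] > max_distance:
            break
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]  # safe: a < b, so index a is unaffected
    return clusters


# Toy data standing in for embeddings fetched from the table.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [-1.0, 0.0]]
clusters = cluster_by_max_distance(embeddings, max_distance=0.1)
print(clusters)
```

Raising `max_distance` merges more vectors into fewer, more general clusters; lowering it gives tighter, more specific ones.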
scikit-learn is definitely the go-to when the vectors are in memory, but I was wondering whether there are any in-db techniques?
Thanks a lot for helping out.
pgvector doesn't provide any, but you could create a Postgres extension that implements one of the methods above.