pgvector / pgvector-python

pgvector support for Python

[Question] Ideas on clustering vectors (without defining no. clusters)

lassebomh opened this issue

Hey there,

I'm using pgvector with Django and it's doing a great job.

I have a large table of text paired with embeddings, and I want to automatically organize the rows into categories, then label each category with GPT by sampling its text. That means I don't know the number of clusters in advance. Preferably, a solution would let me define a maximum distance between any two vectors within a cluster, so I can adjust how general the clusters are. Any ideas on the best way to do this?

Since HNSW and IVFFlat indexes are already a sort of clustering, maybe we could query them somehow? Just throwing ideas around here. Perhaps it is simply a bad idea to do the clustering inside the database; I honestly don't know.

What do you think?

Thanks in advance.

Hi @lassebomh, this page from scikit-learn has a few suggestions for clustering with an unknown number of clusters (more clustering docs). IVFFlat uses k-means clustering, which requires a known number of clusters, but I plan to keep it an internal detail of the index for now (as the exact method could change).
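For illustration only, here is a rough sketch of one technique that fits the "maximum distance inside a cluster" requirement: agglomerative clustering with complete linkage and a distance cutoff, so the number of clusters falls out of the threshold. It is written in plain Python to stay self-contained and is not something pgvector provides. The function names and the toy vectors are made up for this example, cosine distance is an assumption, and the vectors are assumed to be non-zero and already loaded from the table into memory. For real data, scikit-learn's `AgglomerativeClustering` accepts `n_clusters=None` together with a `distance_threshold`, which would be a far faster route than this naive loop.

```python
import math
from itertools import combinations


def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity); assumes non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def cluster_by_max_distance(vectors, max_distance):
    """Complete-linkage agglomerative clustering with a distance cutoff.

    Hypothetical helper: merges clusters until the closest remaining pair
    would put two vectors farther apart than `max_distance` in one cluster.
    Returns a list of clusters, each a sorted list of indices into `vectors`.
    """
    n = len(vectors)

    # Pairwise distances between the original points.
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        dist[i][j] = dist[j][i] = cosine_distance(vectors[i], vectors[j])

    # Start with every vector in its own cluster.
    clusters = [[i] for i in range(n)]

    def linkage(c1, c2):
        # Complete linkage: the largest distance between any two members.
        return max(dist[i][j] for i in c1 for j in c2)

    while len(clusters) > 1:
        # Find the pair of clusters with the smallest complete-linkage distance.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        if linkage(clusters[a], clusters[b]) > max_distance:
            break  # any further merge would exceed the allowed diameter
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # a < b, so index a is unaffected

    return [sorted(c) for c in clusters]


# Toy embeddings: two near the x-axis, two near the y-axis.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
clusters = cluster_by_max_distance(vectors, max_distance=0.1)
print(clusters)
```

Raising `max_distance` produces fewer, more general clusters, and lowering it produces more, tighter ones. The loop above is roughly cubic in the number of vectors, so it is only suitable for small samples.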

scikit-learn is definitely the go-to when the vectors are in memory, but I was wondering whether there are any in-database techniques?

Thanks a lot for helping out.

pgvector doesn't provide any, but you could create a Postgres extension that implements one of the methods above.