pgvector-remote is a PostgreSQL extension developed by the Georgia Tech Database Labs. It builds upon the functionality provided by pgvector, introducing seamless integration with dedicated vector stores like Pinecone, with plans to support other vendors in the future.
This extension simplifies the process of storing and retrieving vectors in vector stores while leveraging the power and familiarity of PostgreSQL.
Supports:
- Exact and approximate nearest neighbor search
- Metadata filtering with vector similarity search
- L2 distance, inner product, and cosine distance
- Buffering of vectors, which are batch-inserted into the remote store in user-defined batch sizes
- Seamless data integration and synchronization between pgvector and Pinecone
Compile and install the extension (supports Postgres 12+)
cd /tmp
git clone --branch feature/remote_indexes https://github.com/georgia-tech-db/pgvector-remote.git
cd pgvector-remote
make
make install # may need sudo
See the installation notes if you run into issues
You can also install it with Docker
On Windows, ensure that C++ support in Visual Studio is installed, then run:
call "C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Auxiliary\Build\vcvars64.bat"
Note: The exact path will vary depending on your Visual Studio version and edition
Then use nmake to build:
set "PGROOT=C:\Program Files\PostgreSQL\16"
git clone --branch feature/remote_indexes https://github.com/georgia-tech-db/pgvector-remote.git
cd pgvector-remote
nmake /F Makefile.win
nmake /F Makefile.win install
You can also install it with Docker
Enable the extension (do this once in each database where you want to use it)
CREATE EXTENSION vector;
Create a vector column with 3 dimensions
CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));
Insert vectors
INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]');
Get the nearest neighbors by L2 distance
SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;
Also supports inner product (<#>) and cosine distance (<=>)
Note: <#> returns the negative inner product since Postgres only supports ASC order index scans on operators
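To see why returning the negative inner product works with ascending order, here is a minimal Python sketch. It is an illustration only, not the extension's code: the helper name negative_inner_product and the sample data are assumptions that mirror the two rows inserted above.

```python
# Illustration only: mimics what the <#> operator returns, in plain Python.
def negative_inner_product(a, b):
    """Return the inner product of a and b with its sign flipped."""
    return -sum(x * y for x, y in zip(a, b))


# Hypothetical table contents, matching the INSERT example above.
items = {1: [1, 2, 3], 2: [4, 5, 6]}
query = [3, 1, 2]

# Sorting ascending by the negated value puts the largest inner product first,
# which is what ORDER BY embedding <#> '[3,1,2]' relies on.
ranked = sorted(items, key=lambda item_id: negative_inner_product(items[item_id], query))

for item_id in ranked:
    print(item_id, negative_inner_product(items[item_id], query))
# prints:
# 2 -29
# 1 -11
```

Row 2 has the larger inner product (29 versus 11), so it sorts first once the sign is flipped.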
Create a new table with a vector column
CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));
Or add a vector column to an existing table
ALTER TABLE items ADD COLUMN embedding vector(3);
Insert vectors
INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]');
Upsert vectors
INSERT INTO items (id, embedding) VALUES (1, '[1,2,3]'), (2, '[4,5,6]')
ON CONFLICT (id) DO UPDATE SET embedding = EXCLUDED.embedding;
Update vectors
UPDATE items SET embedding = '[1,2,3]' WHERE id = 1;
Delete vectors
DELETE FROM items WHERE id = 1;
pgvector-remote utilizes Pinecone to create a remote index from vectors stored in PostgreSQL for vector similarity search. To enable metadata filtering alongside vector similarity search, additional metadata must be passed when creating the index.
Add your Pinecone API key as a system configuration value
ALTER SYSTEM SET pinecone.api_key = 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx';
Add an index for each distance function you want to use.
L2 distance
CREATE INDEX ON items USING pinecone (embedding vector_l2_ops) WITH (spec = '{"serverless":{"cloud":"aws","region":"us-west-2"}}');
Metadata along with vector embedding
CREATE INDEX ON items USING pinecone (embedding vector_l2_ops, price, quantity) WITH (spec = '{"serverless":{"cloud":"aws","region":"us-west-2"}}');
Here, price and quantity are other columns of the PostgreSQL table that you want to use as filters when performing vector similarity search.
Inner product
CREATE INDEX ON items USING pinecone (embedding vector_ip_ops) WITH (spec = '{"serverless":{"cloud":"aws","region":"us-west-2"}}');
Cosine distance
CREATE INDEX ON items USING pinecone (embedding vector_cosine_ops) WITH (spec = '{"serverless":{"cloud":"aws","region":"us-west-2"}}');
Get the nearest neighbors to a vector
SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;
Get the nearest neighbors to a row
SELECT * FROM items WHERE id != 1 ORDER BY embedding <-> (SELECT embedding FROM items WHERE id = 1) LIMIT 5;
Get rows within a certain distance
SELECT * FROM items WHERE embedding <-> '[3,1,2]' < 5;
Note: Combine with ORDER BY and LIMIT to use an index
Get the distance
SELECT embedding <-> '[3,1,2]' AS distance FROM items;
For inner product, multiply by -1 (since <#> returns the negative inner product)
SELECT (embedding <#> '[3,1,2]') * -1 AS inner_product FROM items;
For cosine similarity, use 1 - cosine distance
SELECT 1 - (embedding <=> '[3,1,2]') AS cosine_similarity FROM items;
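As a sanity check on the distance and similarity conversions above, the following Python sketch recomputes the L2 distance and cosine distance for one row by hand. The function names are assumptions chosen for illustration; they are not part of the extension, and the arithmetic only mirrors the standard definitions of these metrics.

```python
import math


def l2_distance(a, b):
    """Euclidean distance, the quantity the <-> operator returns."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity, the quantity the <=> operator returns."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


row = [1, 2, 3]      # a stored embedding (hypothetical sample)
query = [3, 1, 2]    # the query vector used in the examples above

distance = l2_distance(row, query)
cosine_similarity = 1.0 - cosine_distance(row, query)

print(round(distance, 4))           # sqrt(6)
print(round(cosine_similarity, 4))  # 11 / 14
```

Subtracting the cosine distance from 1 recovers the similarity, which is the same step the SQL query performs.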
Average vectors
SELECT AVG(embedding) FROM items;
Average groups of vectors
SELECT category_id, AVG(embedding) FROM items GROUP BY category_id;
pinecone.top_k: Number of most relevant results (top K) to retrieve from Pinecone.
pinecone.vectors_per_request: Number of vectors sent per request.
pinecone.requests_per_batch: Number of requests sent in one batch.
The buffer size is calculated as pinecone.vectors_per_request * pinecone.requests_per_batch.
pinecone.max_buffer_scan: Pinecone max buffer search
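The buffer-size formula above can be checked with a short Python sketch. The numbers used here are hypothetical and are not the extension's defaults; the function name buffer_size is likewise an assumption for illustration.

```python
def buffer_size(vectors_per_request: int, requests_per_batch: int) -> int:
    """Buffer size = pinecone.vectors_per_request * pinecone.requests_per_batch."""
    if vectors_per_request <= 0 or requests_per_batch <= 0:
        raise ValueError("both settings must be positive integers")
    return vectors_per_request * requests_per_batch


# Hypothetical settings, chosen only to make the arithmetic concrete.
vectors_per_request = 100
requests_per_batch = 40

print(buffer_size(vectors_per_request, requests_per_batch))  # 4000
```

With these example values, 4000 vectors would accumulate in the buffer before a batch is sent.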
Each vector takes 4 * dimensions + 8 bytes of storage. Each element is a single precision floating-point number (like the real type in Postgres), and all elements must be finite (no NaN, Infinity or -Infinity). Vectors can have up to 16,000 dimensions.
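The storage formula and the validity rules above can be sketched in Python as follows. This is a rough estimate of the vector payload only; it ignores row and page overhead, and the helper names are assumptions made for illustration. The lower bound of one dimension is also an assumption of this sketch.

```python
import math

MAX_DIMENSIONS = 16_000  # upper limit stated above


def vector_storage_bytes(dimensions: int) -> int:
    """Storage for one vector: 4 bytes per element plus 8 bytes of overhead."""
    if not 1 <= dimensions <= MAX_DIMENSIONS:
        raise ValueError(f"dimensions must be between 1 and {MAX_DIMENSIONS}")
    return 4 * dimensions + 8


def is_valid_vector(values) -> bool:
    """True if the dimension count is in range and every element is finite."""
    return 1 <= len(values) <= MAX_DIMENSIONS and all(math.isfinite(v) for v in values)


print(vector_storage_bytes(3))                     # 20
print(vector_storage_bytes(1536))                  # 6152
print(is_valid_vector([1.0, 2.0, 3.0]))            # True
print(is_valid_vector([1.0, float("nan"), 3.0]))   # False
```

For example, one million 1536-dimensional vectors would need about 6.15 GB for the vector data alone.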
Operator | Description | Added |
---|---|---|
+ | element-wise addition | |
- | element-wise subtraction | |
* | element-wise multiplication | 0.5.0 |
<-> | Euclidean distance | |
<#> | negative inner product | |
<=> | cosine distance | |
Function | Description | Added |
---|---|---|
cosine_distance(vector, vector) → double precision | cosine distance | |
inner_product(vector, vector) → double precision | inner product | |
l2_distance(vector, vector) → double precision | Euclidean distance | |
l1_distance(vector, vector) → double precision | taxicab distance | 0.5.0 |
vector_dims(vector) → integer | number of dimensions | |
vector_norm(vector) → double precision | Euclidean norm | |
Function | Description | Added |
---|---|---|
avg(vector) → vector | average | |
sum(vector) → vector | sum | 0.5.0 |
If your machine has multiple Postgres installations, specify the path to pg_config with:
export PG_CONFIG=/Library/PostgreSQL/16/bin/pg_config
Then re-run the installation instructions (run make clean before make if needed). If sudo is needed for make install, use:
sudo --preserve-env=PG_CONFIG make install
A few common paths on Mac are:
- EDB installer - /Library/PostgreSQL/16/bin/pg_config
- Homebrew (arm64) - /opt/homebrew/opt/postgresql@16/bin/pg_config
- Homebrew (x86-64) - /usr/local/opt/postgresql@16/bin/pg_config
Note: Replace 16 with your Postgres server version
If compilation fails with fatal error: postgres.h: No such file or directory, make sure Postgres development files are installed on the server.
For Ubuntu and Debian, use:
sudo apt install postgresql-server-dev-16
Note: Replace 16 with your Postgres server version
If compilation fails and the output includes warning: no such sysroot directory on Mac, reinstall Xcode Command Line Tools.
Get the Docker image with:
docker pull kslohith17/pgvector-remote:latest
The image contains PostgreSQL with pgvector-remote already installed and configured.
Thanks to:
- pgvector: Open-source vector similarity search for Postgres
- PASE: PostgreSQL Ultra-High-Dimensional Approximate Nearest Neighbor Search Extension
- Faiss: A Library for Efficient Similarity Search and Clustering of Dense Vectors
- Using the Triangle Inequality to Accelerate k-means
- k-means++: The Advantage of Careful Seeding
- Concept Decompositions for Large Sparse Text Data using Clustering
- Efficient and Robust Approximate Nearest Neighbor Search using Hierarchical Navigable Small World Graphs