pgvector / pgvector-python

pgvector support for Python

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Add support for printing SQL of SQLAlchemy query

sarimak opened this issue · comments

pgvector.sqlalchemy.Vector doesn't provide a process_literal_param method, so any SQLALchemy query containing literal values for such column will fail to be rendered to final SQL including the values when literal_binds are enabled. Please implement the method.

Repro steps:

import sqlalchemy as sa
import sqlalchemy.dialects.postgresql
import pgvector.sqlalchemy as pgv

engine = sa.create_engine("postgresql+psycopg2://postgres:password@localhost:5432/neighbors-2")
conn = engine.connect()
table = sa.Table("foo", sa.MetaData(), sa.Column("col1", pgv.Vector(4)))
query = sa.select(table.c.col1).where((table.c.col1.max_inner_product([0.5, 0.5, 0.5, 0.5]) * -1) >= 0.1)
kwargs = {"literal_binds": True}
sql = str(query.compile(dialect=sqlalchemy.dialects.postgresql.dialect()), compile_kwargs=kwargs)

raises

sqlalchemy.exc.CompileError: No literal value renderer is available for literal value "[0.5, 0.5, 0.5, 0.5]" with datatype VECTOR(4)

(The query itself is OK which can be verified by using kwargs = {} -- the compilation returns 'SELECT foo.col1 \nFROM foo \nWHERE (foo.col1 <#> %(col1_1)s) * %(param_1)s >= %(param_2)s')

Workaround:

import pgvector.utils

class Vector(pgv.Vector, sa.TypeDecorator):
    impl = pgv.Vector
    cache_ok = True

    def process_literal_param(self, value, dialect):
        return repr(pgvector.utils.to_db(value, self.dim))

and use this wrapper class in the table definition instead of the library-provided pgv.Vector. The SQL rendering then works:

str(query.compile(dialect=sqlalchemy.dialects.postgresql.dialect(), compile_kwargs={"literal_binds": True}))
"SELECT foo.col1 \nFROM foo \nWHERE (foo.col1 <#> '[0.5,0.5,0.5,0.5]') * -1 >= 0.1"

Rationale: I wanted to have a test that makes sure the approximate similarity search is indeed using an index. So I needed to run EXPLAIN over the search query, which needs plain SQL text of the SQLAlchemy query including the values for the prepared statement, so I can execute explain_query = sa.text(f"EXPLAIN {query}") and assert for "Index Scan using foo" in "\n".join(row[0] for row in explain_query.fetchall()).

I am aware that literal_binds can't be used with untrusted inputs (SQL injection) -- but for tests with hard-coded/trusted inputs it is perfectly usable.

And given that it seems to be quite easy to stop using the approximate similarity index by seemingly innocent change of the SQLAlchemy query, I find such regression test valuable as a safety net for my colleagues/future self.

Hi @sarimak, thanks for the suggestion. Since it can't be used with untrusted input, I don't think it makes sense to include, but this issue should be helpful for users who want to add it manually.

Well, then for example Alembic's preview/SQL-only mode won't be able to leverage pgvector's syntax. The literal_binds is an opt-in feature and it has legitimate use cases. But as you prefer -- I have this workaround, so my needs are already covered.

Hi @sarimak, just fyi: this was added in #51.