pgvector / pgvector

Open-source vector similarity search for Postgres

pg-vector not using hnsw indexes

lin-goo opened this issue

commented

Help me! I created an HNSW index, but the planner does not use it even for the simplest query.

My system info:

PostgreSQL 14.11 (Ubuntu 14.11-0ubuntu0.22.04.1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, 64-bit

My Table Structure:

CREATE TABLE faces (
    id BIGSERIAL PRIMARY KEY,
    user_id BIGINT NOT NULL DEFAULT 0,
    tsv_content vector(512) UNIQUE NOT NULL,
    created_time TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_time TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    deleted_time BIGINT DEFAULT 0
) ;
CREATE INDEX faces_tsv_content_hnsw_idx ON faces USING hnsw (tsv_content vector_cosine_ops) WITH (m = 16, ef_construction = 64);

My query sql:

EXPLAIN ANALYSE SELECT id FROM faces ORDER BY
    tsv_content <=> '[-0.121626004..., 0.015510366298258305]'
LIMIT 2;

Analyse result:

Limit  (cost=21.45..21.46 rows=2 width=16) (actual time=4.007..4.009 rows=2 loops=1)
  ->  Sort  (cost=21.45..22.73 rows=509 width=16) (actual time=4.005..4.006 rows=2 loops=1)
        Sort Key: ((tsv_content <=> '[-0.1216260...0.015510366]'::vector))
        Sort Method: top-N heapsort  Memory: 25kB
        ->  Seq Scan on faces  (cost=0.00..16.36 rows=509 width=16) (actual time=0.088..3.810 rows=509 loops=1)
Planning Time: 0.656 ms
Execution Time: 4.060 ms

From the EXPLAIN result, I can see that the index is not being used.

Any idea what I am doing wrong?

@ankane

Hi @lin-goo, it looks like you only have ~500 rows, so a table scan will likely be around the same speed. See the docs for how to encourage the planner to use the index.
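For readers following along: one approach described in the pgvector README is to disable sequential scans for a single transaction, which makes the planner prefer the index for that query. A minimal sketch against the `faces` table from this issue follows. Taking the query vector from the row with `id = 1` is an assumption made only to keep the example self-contained; substitute your own embedding literal.

```sql
BEGIN;

-- Discourage sequential scans for this transaction only;
-- SET LOCAL reverts automatically at COMMIT or ROLLBACK.
SET LOCAL enable_seqscan = off;

-- Assumption: a row with id = 1 exists and supplies the query vector.
EXPLAIN ANALYZE
SELECT id
FROM faces
ORDER BY tsv_content <=> (SELECT tsv_content FROM faces WHERE id = 1)
LIMIT 2;

COMMIT;
```

This is mainly useful for checking that the index works on a small development table. With realistic row counts the planner should pick the index without the setting.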

commented

I recreated the table with a smaller vector dimension, and this time the index is used. Details below:

-- create table sql
CREATE TABLE tf (
    id BIGSERIAL PRIMARY KEY,
    user_id BIGINT NOT NULL DEFAULT 0,
    tsv_content vector(3) UNIQUE NOT NULL,
    created_time TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_time TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    deleted_time BIGINT DEFAULT 0
) ;
CREATE INDEX tf_tsv_content_hnsw_idx ON tf USING hnsw (tsv_content vector_cosine_ops) WITH (m = 16, ef_construction = 64);


-- query sql
EXPLAIN ANALYSE SELECT id FROM tf ORDER BY
    tsv_content <=> '[1, 2, 3]'
LIMIT 2;


-- analyse result
Limit  (cost=4.48..4.60 rows=2 width=16) (actual time=0.045..0.047 rows=2 loops=1)
  ->  Index Scan using tf_tsv_content_hnsw_idx on tf  (cost=4.48..54.60 rows=810 width=16) (actual time=0.043..0.044 rows=2 loops=1)
        Order By: (tsv_content <=> '[0,0,0]'::vector)
Planning Time: 0.077 ms
Execution Time: 0.070 ms

commented

Does whether the index is used depend on the vector dimension? @ankane

commented

Hi @lin-goo, it looks like you only have ~500 rows, so a table scan will likely be around the same speed. See the docs for how to encourage the planner to use the index.

There are only 500 rows because this is still in the development phase. The production environment will hold far more data.

The difference likely has to do with TOAST (vectors over 498 dimensions / 2 KB are stored out-of-line by default, and this isn't included in the table scan cost estimate). When there are more rows, it should use the index.
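To see the effect described above on your own tables, you can compare the size of the main heap, which the comment above says is what the table scan estimate reflects, with the size of the associated TOAST relation. This is a sketch using standard PostgreSQL catalog and size functions; the table names `faces` and `tf` are the ones from this issue. For the arithmetic: pgvector documents vector storage as 4 * dimensions + 8 bytes, so a 512-dimension vector takes 2056 bytes and a 3-dimension vector takes 20 bytes.

```sql
-- Heap size versus TOAST size for the two tables in this issue.
-- NULLIF guards against tables that have no TOAST relation (reltoastrelid = 0).
SELECT c.relname,
       c.relpages                                        AS heap_pages_in_stats,
       pg_size_pretty(pg_relation_size(c.oid))           AS heap_size,
       pg_size_pretty(pg_relation_size(NULLIF(c.reltoastrelid, 0)::regclass)) AS toast_size
FROM pg_class AS c
WHERE c.relname IN ('faces', 'tf')
  AND c.relkind = 'r';

-- Stored size in bytes of one vector value from the 512-dimension table.
SELECT pg_column_size(tsv_content) AS vector_bytes
FROM faces
LIMIT 1;
```

If most of the data for `faces` sits in the TOAST relation while the heap stays small, a sequential scan looks cheap to the planner even though each distance computation has to fetch the out-of-line vector.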

commented

Do you mean that even though it doesn't show the use of indexes in the analysis results, it is used in the actual query?

No, it means the planner will use (and show) a different plan when there's more data.

commented

I'll try increasing the amount of data then and see if the index is used, thank you very much for your reply!
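One way to try this in a development database is to load synthetic rows and rerun EXPLAIN. The sketch below is hypothetical and not taken from the thread: it fills `faces` with random 512-dimension vectors, which are fine for observing planner behaviour but meaningless as face embeddings. The reference to `i` inside the subquery is there only to force the subquery to be re-evaluated for each row, so that each row gets a different vector and the UNIQUE constraint is not violated.

```sql
-- Insert 8000 rows of random 512-dimension vectors (synthetic test data only).
INSERT INTO faces (tsv_content)
SELECT (
    SELECT array_agg(random())
    FROM generate_series(1, 512)
    WHERE i IS NOT NULL          -- correlates the subquery with the outer row
)::vector(512)
FROM generate_series(1, 8000) AS i;

-- Refresh planner statistics so the new row count is taken into account.
ANALYZE faces;

-- Re-check the plan; the query vector here is borrowed from an existing row.
EXPLAIN ANALYZE
SELECT id
FROM faces
ORDER BY tsv_content <=> (SELECT tsv_content FROM faces ORDER BY id LIMIT 1)
LIMIT 2;
```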

commented

No, it means the planner will use (and show) a different plan when there's more data.

Hi~ I have now increased the amount of data to 8000 entries and the index is now working properly. Thanks again for your answer!

-- analyse result
Limit  (cost=108.60..108.72 rows=2 width=16) (actual time=4.346..4.375 rows=2 loops=1)
  ->  Index Scan using faces_tsv_content_hnsw_idx on faces  (cost=108.60..628.14 rows=8923 width=16) (actual time=4.344..4.372 rows=2 loops=1)
        Order By: (tsv_content <=> '[-0.121626005... ,0.015510366]'::vector)
Planning Time: 0.357 ms
Execution Time: 4.467 ms