pgvector not using HNSW indexes

Question

lin-goo opened this issue 3 months ago · comments

Help! I created an HNSW index, but the planner does not use it even for the simplest query.

My system info:

PostgreSQL 14.11 (Ubuntu 14.11-0ubuntu0.22.04.1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, 64-bit

My Table Structure:

CREATE TABLE faces (
    id BIGSERIAL PRIMARY KEY,
    user_id BIGINT NOT NULL DEFAULT 0,
    tsv_content vector(512) UNIQUE NOT NULL,
    created_time TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_time TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    deleted_time BIGINT DEFAULT 0
) ;
CREATE INDEX faces_tsv_content_hnsw_idx ON faces USING hnsw (tsv_content vector_cosine_ops) WITH (m = 16, ef_construction = 64);

My query sql:

EXPLAIN ANALYSE SELECT id FROM faces ORDER BY
    tsv_content <=> '[-0.121626004..., 0.015510366298258305]'
LIMIT 2;

Analyse result:

Limit  (cost=21.45..21.46 rows=2 width=16) (actual time=4.007..4.009 rows=2 loops=1)
  ->  Sort  (cost=21.45..22.73 rows=509 width=16) (actual time=4.005..4.006 rows=2 loops=1)
        Sort Key: ((tsv_content <=> '[-0.1216260...0.015510366]'::vector))
        Sort Method: top-N heapsort  Memory: 25kB
        ->  Seq Scan on faces  (cost=0.00..16.36 rows=509 width=16) (actual time=0.088..3.810 rows=509 loops=1)
Planning Time: 0.656 ms
Execution Time: 4.060 ms

From the EXPLAIN output, I can see that the index is not being used.

Any idea what I am doing wrong?

@ankane

Andrew Kane · Answer 1 · Sat Mar 23 2024 15:12:50 GMT+0800 (China Standard Time)

Hi @lin-goo, it looks like you only have ~500 rows, so a table scan will likely be around the same speed. See the docs for how to encourage the planner to use the index.
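For reference, one common way to encourage the planner is to disable sequential scans for a single transaction and re-run the query. The following is a minimal sketch, not necessarily the exact snippet the docs give. The subquery is only a stand-in for the real 512-dimension query vector, which is elided in the original post; it is an assumption for illustration.

```sql
BEGIN;
-- SET LOCAL only applies until the end of this transaction.
SET LOCAL enable_seqscan = off;

-- Hypothetical stand-in query vector: the first stored row's embedding.
EXPLAIN ANALYZE
SELECT id FROM faces
ORDER BY tsv_content <=> (SELECT tsv_content FROM faces ORDER BY id LIMIT 1)
LIMIT 2;
COMMIT;
```

This is mainly a diagnostic: if the plan switches to an Index Scan on faces_tsv_content_hnsw_idx, the index is usable and the planner simply estimated the table scan to be cheaper on a small table.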

lin · Answer 2 · Sat Mar 23 2024 15:15:13 GMT+0800 (China Standard Time)

I recreated the table with a smaller vector dimension, and this time the index is used. Details below:

-- create table sql
CREATE TABLE tf (
    id BIGSERIAL PRIMARY KEY,
    user_id BIGINT NOT NULL DEFAULT 0,
    tsv_content vector(3) UNIQUE NOT NULL,
    created_time TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_time TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    deleted_time BIGINT DEFAULT 0
) ;
CREATE INDEX tf_tsv_content_hnsw_idx ON tf USING hnsw (tsv_content vector_cosine_ops) WITH (m = 16, ef_construction = 64);


-- query sql
EXPLAIN ANALYSE SELECT id FROM tf ORDER BY
    tsv_content <=> '[1, 2, 3]'
LIMIT 2;


-- analyse result
Limit  (cost=4.48..4.60 rows=2 width=16) (actual time=0.045..0.047 rows=2 loops=1)
  ->  Index Scan using tf_tsv_content_hnsw_idx on tf  (cost=4.48..54.60 rows=810 width=16) (actual time=0.043..0.044 rows=2 loops=1)
        Order By: (tsv_content <=> '[0,0,0]'::vector)
Planning Time: 0.077 ms
Execution Time: 0.070 ms

lin · Answer 3 · Sat Mar 23 2024 15:16:03 GMT+0800 (China Standard Time)

Does whether the index is used depend on the vector dimension? @ankane

lin · Answer 4 · Sat Mar 23 2024 15:17:18 GMT+0800 (China Standard Time)

Hi @lin-goo, it looks like you only have ~500 rows, so a table scan will likely be around the same speed. See the docs for how to encourage the planner to use the index.

There are only 500 rows because this is still in the development phase. The production environment will hold far more data.

Andrew Kane · Answer 5 · Sat Mar 23 2024 15:41:17 GMT+0800 (China Standard Time)

The difference likely has to do with TOAST (vectors over 498 dimensions / 2 KB are stored out-of-line by default, and this isn't included in the table scan cost estimate). When there are more rows, it should use the index.
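As a rough check of the 498-dimension figure: pgvector documents the storage size of a vector as 4 * dimensions + 8 bytes, and PostgreSQL starts moving values out of line at roughly 2 kB by default. The exact threshold depends on the build and on the rest of the row, so treat the numbers below as approximate. The second query is a sketch for seeing how much of the faces table lives in its TOAST relation.

```sql
-- Bytes per vector, using the 4 * dimensions + 8 formula.
SELECT 4 * 498 + 8 AS bytes_498_dims,  -- 2000
       4 * 512 + 8 AS bytes_512_dims,  -- 2056
       4 * 3   + 8 AS bytes_3_dims;    -- 20

-- Size of the TOAST relation that holds out-of-line vectors.
SELECT pg_size_pretty(pg_relation_size(reltoastrelid)) AS toast_size
FROM pg_class
WHERE relname = 'faces' AND reltoastrelid <> 0;
```

A 512-dimension vector lands just above the 2 kB mark, while the 3-dimension vectors in the tf table stay inline, which is consistent with the two tables getting different plans.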

lin · Answer 6 · Sat Mar 23 2024 15:48:59 GMT+0800 (China Standard Time)

Do you mean that the index is used in the actual query even though the EXPLAIN output doesn't show it?

Andrew Kane · Answer 7 · Sat Mar 23 2024 15:53:56 GMT+0800 (China Standard Time)

No, it means the planner will use (and show) a different plan when there's more data.
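To see the numbers the planner bases that decision on, one option is to refresh the table statistics and look at the row and page estimates in pg_class. This is a sketch using the table and index names from this thread.

```sql
-- Refresh planner statistics after loading more rows.
ANALYZE faces;

-- reltuples and relpages are the planner's estimates of rows and 8 kB pages.
SELECT relname, reltuples, relpages
FROM pg_class
WHERE relname IN ('faces', 'faces_tsv_content_hnsw_idx');
```

The relpages value for faces covers only the main heap, not the TOAST relation, so a table of large vectors can look much smaller to the planner than it is on disk.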

lin · Answer 8 · Sat Mar 23 2024 16:07:23 GMT+0800 (China Standard Time)

I'll try increasing the amount of data and see whether the index is used. Thank you very much for your reply!

lin · Answer 9 · Sun Mar 24 2024 15:53:31 GMT+0800 (China Standard Time)

No, it means the planner will use (and show) a different plan when there's more data.

Hi, I have now increased the amount of data to 8000 rows, and the index is now being used. Thanks again for your answer!

-- analyse result
Limit  (cost=108.60..108.72 rows=2 width=16) (actual time=4.346..4.375 rows=2 loops=1)
  ->  Index Scan using faces_tsv_content_hnsw_idx on faces  (cost=108.60..628.14 rows=8923 width=16) (actual time=4.344..4.372 rows=2 loops=1)
        Order By: (tsv_content <=> '[-0.121626005... ,0.015510366]'::vector)
Planning Time: 0.357 ms
Execution Time: 4.467 ms