surrealdb / surrealdb

A scalable, distributed, collaborative, document-graph database, for the realtime web

Home Page:https://surrealdb.com

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Bug: Vector NN search count does not respect WHERE clause

orimay opened this issue · comments

Describe the bug

I want to be able to perform query against vector field and some scalar fields at the same time. Say, I have vector V. I want to find closest 10 points, that have some flag set to true.

It doesn't seem to be possible to utilize vector index for such things. If we query

SELECT
  id, flag, v
FROM
  target
WHERE
  flag = true &&
  v <|10|> $point
;

it will take the closest 10 points and then filter those by flag. This way, I end up with less then 10 points.

Same issue comes into play when I try to paginate the results. If I want to specify START, vector query must take it into account, and I need to query <|50|> with START 40 if we want to take 10 records with the offset of 40.

It is impossible to build composite index on multiple fields if one of the fields is intended to be used for vector search.

Steps to reproduce

DELETE FROM pts;

DEFINE INDEX mt_pt1 ON pts FIELDS point MTREE DIMENSION 1;

INSERT INTO pts [
	{ id: pts:1, point: [ 1f ], flag: true },
	{ id: pts:2, point: [ 2f ], flag: false },
	{ id: pts:3, point: [ 3f ], flag: true },
	{ id: pts:4, point: [ 4f ], flag: false },
	{ id: pts:5, point: [ 5f ], flag: true },
	{ id: pts:6, point: [ 6f ], flag: false },
	{ id: pts:7, point: [ 7f ], flag: true }
];

LET $pt = [4.5f];

SELECT
    id,
    i,
    flag,
    vector::similarity::cosine(point, $pt) AS similarity
FROM
    pts
WHERE
    flag = true &&
    point <|2|> $pt
ORDER BY
    similarity DESC PARALLEL;

SELECT
    id,
    i,
    flag,
    vector::similarity::cosine(point, $pt) AS similarity
FROM
    pts
WHERE
    flag = true &&
    point <|2|> $pt
ORDER BY
    similarity DESC
EXPLAIN
;

outputs

[
	{
		flag: true,
		i: NONE,
		id: pts:5,
		similarity: 1
	}
]

-------- Query 6 (200.001µs) --------

[
	{
		detail: {
			plan: {
				index: 'mt_pt1',
				operator: '<2>',
				value: [
					4.5f
				]
			},
			table: 'pts'
		},
		operation: 'Iterate Index'
	},
	{
		detail: {
			type: 'Store'
		},
		operation: 'Collector'
	}
]

Expected behaviour

-------- Query 5 (299.999µs) --------

[
	{
		flag: true,
		i: NONE,
		id: pts:5,
		similarity: 1
	},
	{
		flag: true,
		i: NONE,
		id: pts:3,
		similarity: 1
	}
]

SurrealDB version

https://surrealist.app/

Contact Details

dmitrii.a.baranov@gmail.com

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct

Assigning to @emmanuel-keller who is able to answer this better.

The result you get is consistent with the current query logic, in the sense that each predicate returns a subset of the records, and the result is the intersection of these two subsets:

  • flag = true returns [pts:1, pts:3, pts5, pts7]
  • point <|2|> $pt returns [pts:4, pts5]

The intersection of both returns is [pts:5], which is correct from a technical point of view.

That said, it is true that we don't currently have a way to filter the KNN operation so you can have the N results matching the filter.

As a workaround, you could store two tables. One table would store points where the flag is true, and the other would store the points where the flag is false.

DELETE FROM pts_true;
DELETE FROM pts_false;

DEFINE INDEX mt_pt1 ON pts_true FIELDS point MTREE DIMENSION 1;
DEFINE INDEX mt_pt1 ON pts_false FIELDS point MTREE DIMENSION 1;

INSERT INTO pts_true [
	{ id: pts_true:1, point: [ 1f ], flag: true },
	{ id: pts_true:3, point: [ 3f ], flag: true },
	{ id: pts_true:5, point: [ 5f ], flag: true },
	{ id: pts_true:7, point: [ 7f ], flag: true }
];

INSERT INTO pts_false [
	{ id: pts_false:2, point: [ 2f ], flag: false },
	{ id: pts_false:4, point: [ 4f ], flag: false },
	{ id: pts_false:6, point: [ 6f ], flag: false },
];

LET $pt = [4.5f];

SELECT
    id,
    vector::similarity::cosine(point, $pt) AS similarity
FROM
    pts_true
WHERE
    point <|2|> $pt
ORDER BY
    similarity DESC;

Run it with Surrealist

That would work for this simple example, but I agree that it would not be suitable for something more complex involving more intricate filters.

To meet this requirement we could introduce the following syntax:

SELECT
    id,
    flag,
    vector::similarity::cosine(point, $pt) AS similarity
FROM
    pts
WHERE
    flag = true &&
    point <||> $pt
ORDER BY
    similarity DESC
LIMIT 2

In this case, the KNN operator would not limit the result and would only stop providing results once the limit is reached. That would be compatible with the way SurrealDB executes queries, allowing for any complexity in the filtering as well as pagination.

Would that work for you / Would you be happy with this syntax?

Yes, this would be amazing, thank you! And thank you for mentioning it in the stream :)

Will it be the only syntax instead of the current one? Having it work only by LIMIT may simplify it

We have started working on the implementation. For this to work efficiently, we need to make sure that the query is primarily ordered by the KNN distance. So, we will add a vector::distance::knn() function that will be mandatory to be placed in the ORDER BY clause. It will also return the computed distance, so the distance does not need to be recomputed.

SELECT
    id,
    flag,
    vector::distance::knn() AS distance
FROM
    pts
WHERE
    flag = true &&
    point <||> $pt
ORDER BY
   vector::distance::knn() DESC
LIMIT 2

I think both syntaxes may coexist.

Awesome! Will it be possible to order by distance alias? Will it work for cosine? And will it be possible to build index on multiple fields, e.g. vector + boolean?