cozodb / cozo

A transactional, relational-graph-vector database that uses Datalog for query. The hippocampus for AI!

Home Page: https://cozodb.org


FTS: Field `k` is required for HNSW search

iSuslov opened this issue

In the sandbox at https://www.cozodb.org/wasm-demo/ I created a table and populated it:

:create my_table {id: String, year: Validity => president: String}
?[id, year, president] <- [['US1', [2001, true], 'Bush'],
                        ['US2', [2005, true], 'Bush'],
                        ['US3', [2009, true], 'Obama'],
                        ['US4', [2013, true], 'Obama'],
                        ['US5', [2017, true], 'Trump'],
                        ['US6', [2021, true], 'Biden']]

:put my_table {id, year => president}

Then created an index:

::fts create my_table:my_fts_index {
    extractor: president,
    tokenizer: Simple,
    filters: [Lowercase, Stemmer('english'), Stopwords('en')]
}

Then I tried an FTS query and got a "Field `k` is required for HNSW search" error:

?[id, year, president, score] := ~my_table:my_fts_index {id, year, president | query: $q, bind_score: score }
:order -score

Somehow it thinks I'm trying to use HNSW instead of FTS.

Full error:

parser::hnsw_query_required

  × Field `k` is required for HNSW search
   ╭─[1:1]
 1 │ ?[id, year, president, score] := ~my_table:my_fts_index {id, year, president | query: $q, bind_score: score }
   ·                                  ────────────────────────────────────────────────────────────────────────────
 2 │ :order -score 
   ╰────

The documentation at https://docs.cozodb.org/en/latest/releases/v0.7.html#full-text-search doesn't mention that `k` is needed. When I add `k: 10`, everything works.
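For reference, this is the query that works for me once `k` is added (10 is just an arbitrary result limit):

?[id, year, president, score] := ~my_table:my_fts_index {id, year, president | query: $q, k: 10, bind_score: score }
:order -score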

I can see that this part of the docs, https://docs.cozodb.org/en/latest/vector.html#full-text-search-fts, does mention `k`. Probably only the release announcement page mentioned above needs to be fixed.


Update 1: The documentation for tokenizers says:

Tokenizer is specified in the configuration as a function call such as Ngram(9), or if you omit all arguments, Ngram is also acceptable.

But when I try to use Ngram as the tokenizer value, I get the error `Unknown tokenizer: Ngram`.
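Concretely, a creation along these lines (hypothetical index name, otherwise the same options as above) is what fails with that error:

::fts create my_table:my_ngram_index {
    extractor: president,
    tokenizer: Ngram,
    filters: [Lowercase, Stemmer('english'), Stopwords('en')]
}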

On the same page, the LSH index example uses `tokenizer: Simple` together with an `n_gram: 3` parameter: https://docs.cozodb.org/en/latest/vector.html#minhash-lsh-for-near-duplicate-indexing-of-strings-and-lists
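For comparison, I would expect an LSH index on my table to be created with something like the following (parameter names taken loosely from that page, values purely illustrative):

::lsh create my_table:my_lsh_index {
    extractor: president,
    tokenizer: Simple,
    filters: [],
    n_gram: 3,
    n_perm: 200,
    target_threshold: 0.7
}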

Needs clarification.

Update 2: It seems that import_relations does not populate existing indexes with the imported values.
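A possible workaround (untested, and assuming `::fts drop` works the same way as the drop commands for the other index types) would be to drop and re-create the FTS index after the import, since creating an index should index the rows already in the table:

::fts drop my_table:my_fts_index

::fts create my_table:my_fts_index {
    extractor: president,
    tokenizer: Simple,
    filters: [Lowercase, Stemmer('english'), Stopwords('en')]
}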