yahoojapan / NGT

Nearest Neighbor Search with Neighborhood Graph and Tree for High-dimensional Data

Warnings while creating cosine based index

shriyog opened this issue · comments

While building an NGT index with the cosine distance metric, I see a lot of warnings like the one below.

createIndex: Warning. The specified number of edges could not be acquired, because the pruned parameter [-S] might be set.
  The node id=6651608
  The number of edges for the node=7
  The pruned parameter (edgeSizeForSearch [-S])=40

I created the index with the commands below, without specifying the -S parameter (the default is 40):

ngt create -d 40 -D c cosine-index
ngt append -d 40 cosine-index vectors.ssv

I find this suspicious because the results differ from those of another index built with the L2 (Euclidean) distance metric from the same input vectors.

  1. Index build time: 4 min (cosine) vs. 45 min (L2)
  2. Epsilon vs. precision (tables below)
  3. Index size on disk is the same, though

Euclidean
# Factor (Epsilon)      # of Queries    Precision       Time(msec)      # of computations       # of visted nodes
0       100     0.436   0.293037        0       0
0.01    100     0.55    0.0437106       0       0
0.02    100     0.664   0.0645273       0       0
0.03    100     0.802   100.782         0       0
0.04    100     0.889   728.165         0       0
0.05    100     0.932   2077.52         0       0
0.06    100     0.958   3091.21         0       0
0.07    100     0.973   4509.79         0       0
0.08    100     0.985   5053.05         0       0
0.09    100     0.988   5463.39         0       0
0.1     100     0.993   5964.26         0       0

Cosine
# Factor (Epsilon)      # of Queries    Precision       Time(msec)      # of computations       # of visted nodes
0       100     0.256   0.0588535       0       0
0.01    100     0.273   0.033929        0       0
0.02    100     0.278   0.0337207       0       0
0.03    100     0.286   0.0346833       0       0
0.04    100     0.295   0.0367112       0       0
0.05    100     0.318   0.0401136       0       0
0.06    100     0.355   0.0426844       0       0
0.07    100     0.384   0.0472755       0       0
0.08    100     0.394   0.0479118       0       0
0.09    100     0.415   0.0516687       0       0
0.1     100     0.441   0.057455        0       0

The warning seems to originate from here. Because of it, I suspect the cosine-based index was not built properly, which would explain the lower accuracy. Is this expected, or do you have any thoughts on the cause?

Could you run the command below to get information about your index?

ngt info [your cosine index path]

I tried to reproduce your problem with the datasets I have, but I could not. Since the problem might depend on the dataset, could you provide yours, if possible?

Hey @masajiro, thanks for the command. It prints the index metadata in detail, which is quite helpful.

This is the output for the index that was created with the warnings mentioned above.

> ngt info catalog-mod-0-cosine/
NGT version: 1.13.7
Processed 1000000
Processed 2000000
Processed 3000000
Processed 4000000
Processed 5000000
Processed 6000000
The size of the object repository (not the number of the objects):	6652051
The number of the removed objects:	0/6652051
The number of the nodes:	6652051
The number of the edges:	130936766
The mean of the edge lengths:	-nan
The mean of the number of the edges per node:	19.68366839
The number of the nodes without edges:	0
The maximum of the outdegrees:	139690
The minimum of the outdegrees:	10
The number of the nodes where indegree is 0:	0
The maximum of the indegrees:	139690
The minimum of the indegrees:	10
#-nodes,#-edges,#-no-indegree,avg-edges,avg-dist,max-out,min-out,v-out,max-in,min-in,v-in,med-out,med-in,mode-out,mode-in,c95,c5,o-distance(10),o-skip,i-distance(10),i-skip:6652051:130936766:0:19.68366839:-nan:139690:10:1432.146574:139690:10:1432.146574:10:10:10:10:136.0223814:10:198.2695696:10:-nan:0:-nan:0

The dataset had empty vectors, which may or may not be the reason for the warnings. I created another index from a clean set of 1 million vectors, and it didn't give any warnings this time. Here's the command output for it.

> ngt info catalog-1m-clean-cosine/
NGT version: 1.13.7
Processed 1000000
The size of the object repository (not the number of the objects):      1000000
The number of the removed objects:      0/1000000
The number of the nodes:        1000000
The number of the edges:        19999890
The mean of the edge lengths:   0.2193799515
The mean of the number of the edges per node:   19.99989
The number of the nodes without edges:  0
The maximum of the outdegrees:  3598
The minimum of the outdegrees:  10
The number of the nodes where indegree is 0:    0
The maximum of the indegrees:   3598
The minimum of the indegrees:   10
#-nodes,#-edges,#-no-indegree,avg-edges,avg-dist,max-out,min-out,v-out,max-in,min-in,v-in,med-out,med-in,mode-out,mode-in,c95,c5,o-distance(10),o-skip,i-distance(10),i-skip:1000000:19999890:0:19.99989:0.2193799515:3598:10:29.96648422:3598:10:29.96648422:13:13:10:10:92.58104:10:177.7591:10:0.2021325693:0:0.2021325693:0
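
For anyone who wants to screen out empty vectors before running `ngt append`, here is a minimal Python sketch. It assumes the input is a whitespace-separated file with one vector per line (like `vectors.ssv` above) and treats a row as unusable if it is blank, has the wrong number of fields, contains a non-numeric or non-finite value, or is all zeros. The function names `is_usable`, `filter_vectors`, and `clean_file` are made up for this illustration and are not part of NGT. Whether such rows actually cause the warnings is not confirmed in this thread; the sketch only reproduces the "clean dataset" step.

```python
import math


def is_usable(line, dim):
    """Return True if `line` holds exactly `dim` finite numbers, not all zero."""
    fields = line.split()
    if len(fields) != dim:
        # Blank lines and rows with missing or extra values end up here.
        return False
    try:
        values = [float(field) for field in fields]
    except ValueError:
        return False
    if not all(math.isfinite(value) for value in values):
        return False
    # An all-zero vector has zero norm, so its cosine is undefined (assumption:
    # this is the kind of row that leads to the -nan statistics shown above).
    return any(value != 0.0 for value in values)


def filter_vectors(lines, dim):
    """Split lines into kept rows and the 1-based line numbers of dropped rows."""
    kept, dropped = [], []
    for number, line in enumerate(lines, start=1):
        if is_usable(line, dim):
            kept.append(line.rstrip("\n"))
        else:
            dropped.append(number)
    return kept, dropped


def clean_file(src_path, dst_path, dim):
    """Write the usable rows of `src_path` to `dst_path`; return dropped line numbers."""
    with open(src_path) as src:
        kept, dropped = filter_vectors(src, dim)
    with open(dst_path, "w") as dst:
        for row in kept:
            dst.write(row + "\n")
    return dropped
```

For example, `clean_file("vectors.ssv", "vectors.clean.ssv", 40)` would write the cleaned file and return the line numbers that were skipped. Dropping rows changes the order in which vectors are appended, so any mapping from input line numbers to object IDs may need to be rebuilt from the cleaned file.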

Also, I want to mention that the optimization guide helped me a lot in achieving the desired accuracy and performance with the ONNG index. Thanks a lot for putting it together.

The dataset has 6.6 million vectors. I'll try to reproduce the issue with a minimal dataset and share it with you. Let me get back to you on this by Monday.

Did you solve this issue?