facebookresearch / faiss

A library for efficient similarity search and clustering of dense vectors.

Home Page: https://faiss.ai


Poor recall when using PCA pre-processing with Inner Product Distance metric

leothomas opened this issue

Summary

Hey there!

I'm looking into switching from the L2 distance metric to the inner product (IP) distance metric. Our current index is built with PCA + IVFFlat (PCA128,IVF{K},Flat) and L2 distance, where the input vectors have dimension 512 and the number of IVF centroids K is chosen so that 4*sqrt(N) < K < 16*sqrt(N), with N the number of vectors indexed.
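For concreteness, here's how that sizing rule plays out (N below is a hypothetical database size, purely for illustration):

import math

N = 50_000_000                # hypothetical number of indexed vectors
k_min = 4 * math.sqrt(N)      # ≈ 28284
k_max = 16 * math.sqrt(N)     # ≈ 113137
k = int(4 * math.sqrt(N))     # the choice used in the reproduction below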

Compared to a Flat index, this index reaches a kNN intersection measure of 0.96 @ rank 100.

However, when I build the same index with the inner product distance metric (vectors are normalized prior to training, adding to the index, and searching, for both the L2 and IP metrics), I get a kNN intersection measure of 0.019 @ rank 100. Setting nprobe to the number of centroids (to mimic a Flat search) actually reduces the kNN intersection measure to 0.003 @ rank 100.

Without the PCA pre-processing, the IVF index with inner product distance metric has a kNN intersection measure of 0.97 @ rank 100 - which is ideal, but the index is simply much too big to hold in memory.
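Back-of-the-envelope numbers on why the full-dimension index doesn't fit (N is a made-up order of magnitude; IVFFlat stores the raw float32 codes plus an 8-byte id per vector):

N = 100_000_000                    # hypothetical database size
full = N * (512 * 4 + 8) / 1e9     # ≈ 205.6 GB at d=512
reduced = N * (128 * 4 + 8) / 1e9  # ≈ 52.0 GB after PCA to d=128
print(full, reduced)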

Is there some sort of fundamental incompatibility between PCA pre-processing and the inner product distance metric?

I was able to achieve excellent compression and a kNN intersection measure of ~0.70 @ rank 100 with the OPQ{M}_{D},IVF{K},PQ{M} and OPQ{M}_{D},IVF{K}_HNSW32,PQ{M} indexes under the inner product distance metric; see the sketch below. Are there any other recommendations for pre-processing, coarse or fine quantization, or search-time parameters (efSearch, etc.) that might work better with the inner product metric?
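A sketch of one such build, for reference; M=64 and the list count are illustrative stand-ins for the {M}/{K} placeholders above, and xt/xb/xq are the same normalized arrays as in the reproduction below:

import faiss

d, M, nlist = 512, 64, 16384   # illustrative values, not recommendations
index = faiss.index_factory(
    d, f"OPQ{M}_{4 * M},IVF{nlist}_HNSW32,PQ{M}", faiss.METRIC_INNER_PRODUCT
)
index.train(xt)   # xt L2-normalized, as everywhere else in this issue
index.add(xb)
# search-time knobs: nprobe on the IVF (efSearch on the HNSW coarse
# quantizer can also be tuned)
faiss.extract_index_ivf(index).nprobe = 64
D, I = index.search(xq, 100)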

Thanks again for taking a look at this!

Platform

OS: macOS 13.0.1

Faiss version: 1.7.3

Installed from: pip install 'faiss-cpu==1.7.3'

Faiss compilation options:

Running on:

  • CPU

Interface:

  • Python

Reproduction instructions

# Note: xt, xb, and xq have all been normalized with `faiss.normalize_L2(...)`
import math

import faiss
# knn_intersection_measure ships with faiss's contrib tools
from faiss.contrib.evaluation import knn_intersection_measure

# reference: exact search with the inner product metric
index = faiss.index_factory(xt.shape[1], "Flat", faiss.METRIC_INNER_PRODUCT)
index.train(xt)
index.add(xb)
Dref, Iref = index.search(xq, 100)

k = int(4 * math.sqrt(xb.shape[0]))

# PCA pre-processing with L2 distance --> great results!
index = faiss.index_factory(xt.shape[1], f"PCA128,IVF{k},Flat")
index.train(xt)
index.add(xb)
D, I = index.search(xq, 100, params=faiss.SearchParametersIVF(nprobe=10))
{rank: knn_intersection_measure(I[:, :rank], Iref[:, :rank]) for rank in [1, 10, 100]}
>>> {1: 1.0, 10: 0.967, 100: 0.9617}

# PCA pre-processing with Inner Product distance --> abysmal results
index = faiss.index_factory(xt.shape[1], f"PCA128,IVF{k},Flat", faiss.METRIC_INNER_PRODUCT)
index.train(xt)
index.add(xb)
D, I = index.search(xq, 100, params=faiss.SearchParametersIVF(nprobe=10))
{rank: knn_intersection_measure(I[:, :rank], Iref[:, :rank]) for rank in [1, 10, 100]}
>>> {1: 0.0, 10: 0.002, 100: 0.019}

# Inner Product distance without PCA pre-processing --> great results, huge memory consumption
index = faiss.index_factory(xt.shape[1], f"IVF{k},Flat", faiss.METRIC_INNER_PRODUCT)
index.train(xt)
index.add(xb)
D, I = index.search(xq, 100, params=faiss.SearchParametersIVF(nprobe=10))
{rank: knn_intersection_measure(I[:, :rank], Iref[:, :rank]) for rank in [1, 10, 100]}
>>> {1: 1.0, 10: 0.995, 100: 0.9673}
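One factor I'd single out for checking (my guess, not a confirmed diagnosis): faiss's PCAMatrix subtracts the training mean, i.e. it computes y = A(x - mu). A rotation plus translation leaves L2 distances intact (up to the discarded components), but a translation does not preserve inner products, and the transformed vectors are no longer unit-norm, so IP stops behaving like cosine similarity. A sketch that probes this by re-normalizing after the PCA step:

import numpy as np

# apply the PCA stage by itself and inspect the norms
mat = faiss.PCAMatrix(xt.shape[1], 128)
mat.train(xt)
yt, yb, yq = mat.apply_py(xt), mat.apply_py(xb), mat.apply_py(xq)
print(np.linalg.norm(yb, axis=1)[:5])   # expect norms well away from 1.0

# hypothesis: re-normalizing after PCA should restore cosine-like behavior
for y in (yt, yb, yq):
    faiss.normalize_L2(y)
index = faiss.index_factory(128, f"IVF{k},Flat", faiss.METRIC_INNER_PRODUCT)
index.train(yt)
index.add(yb)
D, I = index.search(yq, 100, params=faiss.SearchParametersIVF(nprobe=10))
{rank: knn_intersection_measure(I[:, :rank], Iref[:, :rank]) for rank in [1, 10, 100]}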

This is surprising. What is the initial data dimensionality? Would it be possible to share the vectors?

Hey there! Thanks for taking a look! The vectors have shape (512,) and dtype numpy.float32.

Here's a file containing a small subset (100) of the vectors: 2020_12_12_subset.fvecs.zip.

Here is a link to a file containing the entire set of vectors added to a single index (one index per month of data).
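In case it helps with reproduction, the .fvecs files use the standard layout (for each vector: an int32 dimension, then that many float32 values), so a reader along these lines works:

import numpy as np

def read_fvecs(path):
    # each record: int32 dimension d, followed by d float32 values
    raw = np.fromfile(path, dtype=np.int32)
    d = raw[0]
    return raw.reshape(-1, d + 1)[:, 1:].copy().view(np.float32)

vecs = read_fvecs("2020_12_12_subset.fvecs")   # (100, 512) for the subset above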

The training set is generated by randomly sampling across all monthly files until we have between 4 * sqrt(N) and 16 * sqrt(N) training vectors, where N is the total number of vectors indexed. In this case I chose 8 * sqrt(N).
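Roughly, the sampling looks like this (the file names are placeholders, and read_fvecs is the helper sketched above):

import faiss
import numpy as np

monthly_files = ["2020_12.fvecs", "2021_01.fvecs"]   # placeholder names
vecs = np.vstack([read_fvecs(f) for f in monthly_files])
N = vecs.shape[0]
n_train = int(8 * np.sqrt(N))                        # between 4*sqrt(N) and 16*sqrt(N)
pick = np.random.choice(N, size=n_train, replace=False)
xt = np.ascontiguousarray(vecs[pick])
faiss.normalize_L2(xt)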

Here is a single vector (before normalizing):
array([0.00000000e+00, 4.08430147e-04, 1.18545437e+00, 0.00000000e+00,
       2.27321482e+00, 1.54295738e-03, 5.26361167e-04, 0.00000000e+00,
       3.38814825e-05, 0.00000000e+00, 0.00000000e+00, 4.24570650e-01,
       0.00000000e+00, 0.00000000e+00, 4.67802361e-02, 9.65269469e-03,
       4.61871037e-04, 0.00000000e+00, 0.00000000e+00, 4.36048460e+00,
       0.00000000e+00, 5.68176460e+00, 2.80647844e-01, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 2.78753352e+00, 0.00000000e+00,
       2.35022712e+00, 0.00000000e+00, 4.84778658e-02, 6.43251717e-01,
       1.64340204e-03, 0.00000000e+00, 0.00000000e+00, 3.81218123e+00,
       0.00000000e+00, 2.03117180e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 1.69215798e+00, 0.00000000e+00, 0.00000000e+00,
       3.85047897e-04, 0.00000000e+00, 1.43826054e-03, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 1.30347600e-02, 0.00000000e+00,
       1.06703475e-01, 1.22844622e-01, 1.18765254e-02, 0.00000000e+00,
       2.33212924e+00, 6.94844592e-03, 6.49189949e-01, 5.87021559e-02,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 2.34959126e+00, 0.00000000e+00, 1.02928802e-01,
       0.00000000e+00, 0.00000000e+00, 2.20299911e+00, 7.55079836e-02,
       0.00000000e+00, 0.00000000e+00, 1.32876918e-01, 3.34572699e-03,
       1.09084713e+00, 0.00000000e+00, 0.00000000e+00, 4.63117599e-01,
       0.00000000e+00, 4.48396873e+00, 0.00000000e+00, 6.00585079e+00,
       0.00000000e+00, 7.71000087e-02, 5.64224899e-01, 5.33351040e+00,
       4.18188865e-04, 0.00000000e+00, 0.00000000e+00, 6.24118373e-04,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       9.83330458e-02, 2.63771683e-01, 0.00000000e+00, 0.00000000e+00,
       1.76251590e-01, 0.00000000e+00, 0.00000000e+00, 9.82281923e-01,
       2.34911728e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       1.33228040e+00, 2.60370746e-02, 0.00000000e+00, 3.57584381e+00,
       4.84157085e-01, 1.95727125e-02, 0.00000000e+00, 5.86162388e-01,
       7.52598513e-03, 3.38371444e+00, 0.00000000e+00, 3.04818940e+00,
       1.65834260e+00, 1.19849615e-01, 2.26252779e-01, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 2.57161236e+00,
       0.00000000e+00, 0.00000000e+00, 6.43720627e-01, 0.00000000e+00,
       0.00000000e+00, 2.47750897e-03, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       5.01900721e+00, 0.00000000e+00, 3.11816968e-02, 5.59445500e-01,
       0.00000000e+00, 8.78229737e-03, 2.33042955e-01, 9.05349180e-02,
       0.00000000e+00, 3.85129833e+00, 0.00000000e+00, 7.32775927e-02,
       0.00000000e+00, 1.84055901e+00, 0.00000000e+00, 2.67358171e-03,
       0.00000000e+00, 1.53432274e+00, 0.00000000e+00, 0.00000000e+00,
       2.49994211e-02, 0.00000000e+00, 0.00000000e+00, 2.90997624e+00,
       0.00000000e+00, 0.00000000e+00, 1.30942130e+00, 0.00000000e+00,
       3.13563161e-02, 0.00000000e+00, 2.67477892e-03, 6.75363988e-02,
       1.66207582e-01, 0.00000000e+00, 0.00000000e+00, 6.99823373e-04,
       1.54913394e-02, 5.68675637e-01, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 1.81745148e+00, 0.00000000e+00, 0.00000000e+00,
       4.84930217e-01, 0.00000000e+00, 0.00000000e+00, 3.26825047e+00,
       1.03376210e+00, 0.00000000e+00, 0.00000000e+00, 1.02973469e-01,
       1.13317680e+00, 2.25058962e-02, 1.29022673e-02, 0.00000000e+00,
       2.50312709e-03, 0.00000000e+00, 1.33579376e-03, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 1.16702509e+00, 3.89024019e-02, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 6.67101026e-01, 0.00000000e+00,
       3.70575380e+00, 7.66108799e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 5.86722946e+00, 0.00000000e+00, 1.59060621e+00,
       0.00000000e+00, 3.66107881e-01, 0.00000000e+00, 6.90836728e-01,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       2.81974220e+00, 4.95951343e-03, 0.00000000e+00, 2.64969729e-02,
       0.00000000e+00, 0.00000000e+00, 4.20392752e+00, 6.06601715e+00,
       2.53612852e+00, 0.00000000e+00, 3.53584671e+00, 0.00000000e+00,
       0.00000000e+00, 2.90730223e-03, 0.00000000e+00, 0.00000000e+00,
       8.32287550e-01, 1.99459391e-04, 0.00000000e+00, 0.00000000e+00,
       9.40854475e-02, 0.00000000e+00, 0.00000000e+00, 2.68238395e-01,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 7.62694836e+00, 5.29604340e+00, 3.89173150e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 8.09190050e-03,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.47578955e+00,
       5.04332595e-04, 3.88359539e-02, 4.95079994e+00, 0.00000000e+00,
       0.00000000e+00, 3.43584269e-02, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 3.24924350e-01, 1.91412258e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       4.30223942e+00, 5.03486604e-04, 4.85977605e-02, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 6.47835508e-02, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 1.83885765e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 2.63520527e+00, 1.76634476e-01,
       1.59513545e+00, 3.08032990e+00, 0.00000000e+00, 1.61443278e-02,
       8.85661542e-02, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       2.03128927e-03, 0.00000000e+00, 2.58248639e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 9.04115796e-01, 3.11318133e-03,
       3.07337474e-03, 0.00000000e+00, 1.15459703e-01, 8.12141776e-01,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 5.44282818e+00,
       7.68841356e-02, 6.15408970e-03, 0.00000000e+00, 1.40098736e-01,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 7.03868922e-03,
       0.00000000e+00, 0.00000000e+00, 4.75200319e+00, 0.00000000e+00,
       0.00000000e+00, 1.50562106e-02, 0.00000000e+00, 0.00000000e+00,
       5.06811619e-01, 3.28483176e-03, 1.50831118e-02, 1.96179342e+00,
       0.00000000e+00, 0.00000000e+00, 1.38410982e-02, 0.00000000e+00,
       5.59307194e+00, 0.00000000e+00, 0.00000000e+00, 1.05718225e-02,
       0.00000000e+00, 0.00000000e+00, 1.09667301e+00, 0.00000000e+00,
       1.64638805e+00, 1.84855092e+00, 2.36622393e-02, 8.21108162e-01,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 1.46917075e-01, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 2.14590859e+00,
       7.95775726e-02, 1.06241751e+00, 0.00000000e+00, 0.00000000e+00,
       1.46609402e+00, 0.00000000e+00, 2.27923127e-04, 0.00000000e+00,
       4.58218215e-04, 0.00000000e+00, 2.41070054e-03, 2.36309147e+00,
       2.60034966e+00, 0.00000000e+00, 2.45808482e+00, 0.00000000e+00,
       3.95438552e-01, 4.81823397e+00, 1.96086252e+00, 0.00000000e+00,
       0.00000000e+00, 5.22180259e-01, 0.00000000e+00, 1.96187815e-04,
       0.00000000e+00, 4.05640854e-03, 0.00000000e+00, 2.15945137e-03,
       0.00000000e+00, 0.00000000e+00, 1.18970662e-01, 0.00000000e+00,
       0.00000000e+00, 1.87555957e+00, 0.00000000e+00, 0.00000000e+00,
       8.57250214e-01, 0.00000000e+00, 0.00000000e+00, 2.48691603e-03,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.28582637e-02,
       3.22910324e-02, 0.00000000e+00, 2.83409632e-03, 0.00000000e+00,
       4.90677953e-02, 7.52867460e-02, 6.06696978e-02, 3.07987928e-02,
       5.28933644e-01, 0.00000000e+00, 4.91607352e-04, 7.35818446e-01,
       2.13144183e+00, 1.43062130e-01, 7.33405948e-02, 2.63899798e-03,
       8.55512524e+00, 0.00000000e+00, 0.00000000e+00, 2.93270397e+00,
       0.00000000e+00, 2.40167290e-01, 7.53109995e-03, 0.00000000e+00,
       3.86320734e+00, 0.00000000e+00, 5.29762655e-02, 1.88932657e+00,
       4.31169510e+00, 0.00000000e+00, 4.59093601e-03, 2.07456279e+00,
       0.00000000e+00, 0.00000000e+00, 6.97626397e-02, 2.25789566e-03,
       0.00000000e+00, 5.77813089e-02, 1.20385885e+00, 0.00000000e+00,
       7.66109991e+00, 6.53804615e-02, 2.14639616e+00, 0.00000000e+00,
       1.99114569e-02, 4.48051654e-02, 1.18760276e+00, 1.51365370e-01,
       2.81427890e-01, 4.08457173e-03, 6.10075188e+00, 1.30849342e+01,
       0.00000000e+00, 3.26321349e-02, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 2.55240393e+00,
       3.31316289e-04, 0.00000000e+00, 0.00000000e+00, 4.12878573e-01,
       0.00000000e+00, 6.22653425e-01, 2.00655842e+00, 6.15296932e-03,
       6.49426952e-02, 5.21282712e-03, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 2.93466949e+00, 1.82790613e+00,
       0.00000000e+00, 0.00000000e+00, 7.13083660e-03, 0.00000000e+00,
       0.00000000e+00, 2.93663107e-02, 4.11311574e-02, 1.87130141e+00,
       4.36512232e-02, 0.00000000e+00, 0.00000000e+00, 3.11322403e+00,
       0.00000000e+00, 0.00000000e+00, 1.43544659e-01, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.23258209e+00],
      dtype=float32)

Hi there @mdouze! Just wondering if you might have had a chance to take another look at this?

Cheers!