A Comprehensive Survey and Experimental Comparison of Graph-based Approximate Nearest Neighbor Search

Introduction

Approcimate Nearest Neighbor Search (ANNS) is a fundamental building block in various application domains. Recently, graph-based algorithms have emerged as a very effective choice to implement ANNS. Our paper (arXiv link, PDF) provides a comprehensive comparative analysis and experimental evaluation of representative graph-based ANNS algorithms on carefully selected datasets of varying sizes and characteristics.

This project contains the code, dataset, optimal parameters, and other detailed information used for the experiments of our paper. It is worth noting that we reimplement all algorithms based on exactly the same design pattern, programming language (except for the hash table in IEH) and tricks, and experimental setup, which makes the comparison more fair.

Algorithms

we evaluate thirteen representative graph-based ANNS algorithms, and their papers and the original codes are given in the following table.

ALGORITHM	PAPER	CODE
KGraph	WWW'2011	C++/Python
FANNG	CVPR'2016	-
NSG	VLDB'2019	C++
NSSG	arXiv'2019	C++/Python
DPG	TKDE'2019	C++
Vamana	NeurIPS'2019	-
EFANNA	arXiv'2016	C++/MATLAB
IEH	IEEE T CYBERNETICS'2014	-
NSW	IS'2014	C++/Python
HNSW	TPAMI'2018	C++/Python
NGT-panng	SISAP'2016	C++/Python
NGT-onng	arXiv'2018	C++/Python
SPTAG-KDT	ACM MM'2012; CVPR'2012; TPAMI'2014	C++
SPTAG-BKT	ACM MM'2012; CVPR'2012; TPAMI'2014	C++
HCNNG	PR'2019	-

Datasets

Our experiment involves eight real-world datasets popularly deployed by existing works. All datasets are pre-split into base data and query data and come with groundtruth data in the form of the top 20 or 100 neighbors. Additional twelve synthetic datasets are used to test the scalability of each algorithm to the performance of different datasets.

Note that, all base data and query data are converted to fvecs format, and groundtruth data is converted to ivecs format. Please refer here for the description of fvecs and ivecs format. All datasets in this format can be downloaded from here.

Parameters

For the optimal parameters of each algorithm on all experimental datasets, see the parameters page.

Usage

Prerequisites

GCC 4.9+ with OpenMP
CMake 2.8+
Boost 1.55+

Compile on Linux

$ git clone https://github.com/Lsyhprum/WEAVESS.git
$ cd WEAVESS/
$ mkdir build && cd build/
$ cmake ..
$ make -j

Index construction evaluation

Before building index, you should set the root directory for the dataset in WEAVESS/test/main.cpp first. Then, you can run the following instructions for build graph index.

cd WEAVESS/build/test/
./main algorithm_name dataset_name build

After the build is completed, the graph index will be written in the current folder in binary format (for index size). The index construction time can be viewed from the output log information. You can run the following command in current directory for getting other index information, such as average out-degree, graph quality, and the number of connected components.

./main algorithm_name dataset_name info

Search performance

With the index built, you can run the following commands to perform the search. Related information about the search such as search time, distance evaluation times, candidate set size, average query path length, memory load can be obtained or calculated according to the output log information.

cd WEAVESS/build/test/
./main algorithm_name dataset_name search

Components evaluation

Note that the default dev branch is the evaluation of the overall performance of all algorithms, and the evaluation of a certain component needs to be carried out under the test branch. For more details, please see our paper.

Reference

Please cite our work in your publications if it helps your research:

@misc{wang2021comprehensive,
    title={A Comprehensive Survey and Experimental Comparison of Graph-Based Approximate Nearest Neighbor Search},
    author={Mengzhao Wang and Xiaoliang Xu and Qiang Yue and Yuxiang Wang},
    year={2021},
    eprint={2101.12631},
    archivePrefix={arXiv},
    primaryClass={cs.IR}
}

Acknowledgements

Thanks to everyone who provided references for this project. Special thanks to Dr. Weijie Zhao, Dr. Mingjie Li, and Dr. Cong Fu for their assistance in the necessary implementation of this project.

op-hunter / WEAVESS