This software implements consistent weighted sampling (CWS), a similarity-preserving hashing technique for weighted Jaccard (or min-max) similarity, and approximate nearest neighbor (ANN) search via CWS. The software applies a simplification of the original CWS method that generates non-negative integer vectors of sample ids; that is, only the sampled ids are kept and the sampled weights are not stored.
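For reference, the weighted Jaccard (min-max) similarity of two non-negative vectors is the sum of their element-wise minima divided by the sum of their element-wise maxima. The following is a minimal Python sketch of that definition; the helper name `weighted_jaccard` and the dict-based sparse representation are illustrative assumptions and are not part of this software.

```python
def weighted_jaccard(x, y):
    """Min-max similarity of two sparse vectors given as {index: weight}.

    Weights are assumed to be non-negative; missing indices count as 0.
    """
    keys = set(x) | set(y)
    numerator = sum(min(x.get(k, 0.0), y.get(k, 0.0)) for k in keys)
    denominator = sum(max(x.get(k, 0.0), y.get(k, 0.0)) for k in keys)
    # Two all-zero vectors have no defined similarity; return 0.0 by convention.
    return numerator / denominator if denominator > 0 else 0.0


a = {1: 0.5, 2: 1.0}
b = {2: 0.5, 3: 1.0}
print(weighted_jaccard(a, b))  # minima sum to 0.5, maxima to 2.5, so this prints 0.2
```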
You can download and compile the software as follows.
$ git clone https://github.com/tonellotto/CWS
$ cd CWS
$ mvn clean package
The software supports the LIBSVM format, in which each feature vector is written as one line of ASCII text, as follows.
<label> <index1>:<value1> <index2>:<value2> <index3>:<value3> ...
<label> <index1>:<value1> <index2>:<value2> <index3>:<value3> ...
.
.
.
<label> <index1>:<value1> <index2>:<value2> <index3>:<value3> ...
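To make the layout above concrete, here is a small Python sketch that parses one such line into a label and a sparse feature map. The function name `parse_libsvm_line` is hypothetical and only illustrates the format; it is not how this software reads its input.

```python
def parse_libsvm_line(line):
    """Parse '<label> <index>:<value> ...' into (label, {index: value})."""
    tokens = line.split()
    label = tokens[0]  # kept as a string; labels are not needed for hashing
    features = {}
    for token in tokens[1:]:
        index, value = token.split(":", 1)
        features[int(index)] = float(value)
    return label, features


label, features = parse_libsvm_line("3 1:0.25 7:1.0 42:0.5")
print(label, features)
```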
Move to the target/bin folder, which contains the executable scripts generated during the build process.
- Create the news20 folder to store the dataset files:
$ mkdir news20
- Download the dataset news20.scale.bz2 to be used as input:
$ wget https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/news20.scale.bz2
$ bzip2 -d news20.scale.bz2
$ mv news20.scale news20/news20.scale_base.txt
- Download the dataset news20.t.scale.bz2 and extract the first 100 feature vectors to be used as a query collection:
$ wget https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/news20.t.scale.bz2
$ bzip2 -d news20.t.scale.bz2
$ head -100 news20.t.scale > news20/news20.scale_query.txt
$ rm -f news20.t.scale
As a result, there should be the input file news20.scale_base.txt and the query collection file news20.scale_query.txt in the folder news20.
$ ./cws-main -i news20/news20.scale_base.txt -o news20/news20.scale_base.cws -d 62062 -D 64 -w -l
1) Generate random matrix data...
Elapsed time: 00:00.606
The random matrix data consumes 90,91 MiB
2) Do consistent weighted sampling...
15935 vectors processed in 00:17.480
Completed!! Processed 0,02 millions of elements in 00:17.482
As a result, there should be the CWS data file news20/news20.scale_base.cws.
$ ./cws-main -i news20/news20.scale_query.txt -o news20/news20.scale_query.cws -d 62062 -D 64 -w -l
1) Generate random matrix data...
Elapsed time: 00:00.656
The random matrix data consumes 90,91 MiB
2) Do consistent weighted sampling...
100 vectors processed in 00:00.264
Completed!! Processed 0,00 millions of elements in 00:00.264
As a result, there should be the CWS data file news20/news20.scale_query.cws.
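The two steps in the log (generate random matrix data, then sample) can be pictured with the Python sketch below. It follows Ioffe's improved CWS and keeps only the sampled index, an idea usually referred to as 0-bit CWS in the literature. This is an illustration under assumptions, not the software's actual code: the function names are hypothetical, feature indices are assumed to be 1-based, and `-d` and `-D` are assumed to be the input dimensionality and the number of samples. Under that reading, the reported 90,91 MiB is consistent with three 8-byte values per (sample, feature) pair: 64 × 62062 × 3 × 8 bytes ≈ 90.91 MiB.

```python
import math
import random


def make_random_matrix(num_samples, dim, seed=0):
    """One (r, c, beta) triple per (sample, feature) pair, shared by all vectors."""
    rng = random.Random(seed)
    return [
        [(rng.gammavariate(2.0, 1.0), rng.gammavariate(2.0, 1.0), rng.random())
         for _ in range(dim)]
        for _ in range(num_samples)
    ]


def cws_sketch(features, matrix):
    """Return one sampled feature index per row of the random matrix.

    features maps a 1-based feature index to a positive weight.
    Only the index is kept; the sampled weight level t_k is discarded.
    """
    sketch = []
    for row in matrix:
        best_index, best_a = None, math.inf
        for index, weight in features.items():
            if weight <= 0.0:
                continue  # zero weights can never be sampled
            r, c, beta = row[index - 1]
            t_k = math.floor(math.log(weight) / r + beta)
            y_k = math.exp(r * (t_k - beta))
            a_k = c / (y_k * math.exp(r))
            if a_k < best_a:
                best_index, best_a = index, a_k
        sketch.append(best_index)
    return sketch


matrix = make_random_matrix(num_samples=64, dim=4, seed=42)
u = {1: 0.5, 2: 1.0}
v = {2: 0.5, 3: 1.0}
su, sv = cws_sketch(u, matrix), cws_sketch(v, matrix)
matches = sum(i == j for i, j in zip(su, sv))
print(f"{matches} of {len(su)} sample ids agree")
```

Because the same random matrix is used for every vector, identical vectors always produce identical sketches, and similar vectors tend to agree in many positions. This is why the base and query files should be generated with the same parameters.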
To evaluate the ANN search, create the ground truth data in (weighted) Jaccard similarity from news20.scale_base.txt and news20.scale_query.txt.
$ ./ground-truth -i news20/news20.scale_base.txt -q news20/news20.scale_query.txt -o news20/news20.scale_groundtruth -w -l
Completed!! Processed 100 queries in 00:02.156
Output in news20/news20.scale_query.txt
As a result, there should be the ground truth file news20/news20.scale_groundtruth.
Search the database news20.scale_base.cws for the approximate nearest neighbors of each query vector in news20.scale_query.cws.
$ ./ann-search -i news20/news20.scale_base.cws -q news20/news20.scale_query.cws -o news20/news20.scale_score -k 100
Output in news20/news20.scale_score
As a result, there should be the result file news20/news20.scale_score.
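Conceptually, two vectors of sample ids can be compared by counting the positions where they agree, and the top-k neighbors are the database vectors with the highest agreement. The brute-force Python sketch below illustrates that idea only; the names `collision_rate` and `ann_search` are hypothetical, and the search strategy actually used by the ann-search script is not described here and may differ.

```python
def collision_rate(sketch_a, sketch_b):
    """Fraction of positions where two equal-length id vectors agree."""
    agree = sum(i == j for i, j in zip(sketch_a, sketch_b))
    return agree / len(sketch_a)


def ann_search(base, query, k):
    """Return the ids of the k base sketches that best match the query."""
    ranked = sorted(range(len(base)),
                    key=lambda i: collision_rate(base[i], query),
                    reverse=True)
    return ranked[:k]


base = [[1, 2, 3, 4], [1, 2, 0, 0], [9, 9, 9, 9]]
query = [1, 2, 3, 4]
print(ann_search(base, query, k=2))  # agreement rates are 1.0, 0.5, 0.0, so this prints [0, 1]
```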
Evaluate the recall of the search results.
$ ../../scripts/evaluate.py news20/news20.scale_score news20/news20.scale_groundtruth
Recall@1: 0.571
Recall@2: 0.704
Recall@5: 0.786
Recall@10: 0.847
Recall@20: 0.878
Recall@50: 0.918
Recall@100: 0.918
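One common definition of Recall@k in ANN evaluation is the fraction of queries whose true nearest neighbor appears among the first k returned results. The Python sketch below computes that quantity on toy data. It assumes this definition and an in-memory representation of the results; the definition and file formats used by evaluate.py are not stated here and may differ.

```python
def recall_at_k(results, ground_truth, k):
    """Fraction of queries whose true nearest neighbor is in the top k.

    results[q] is the ranked list of ids returned for query q;
    ground_truth[q] is the id of the exact nearest neighbor of query q.
    """
    hits = sum(1 for ranked, true_id in zip(results, ground_truth)
               if true_id in ranked[:k])
    return hits / len(results)


results = [[3, 1, 2], [0, 2, 1], [2, 0, 1]]
ground_truth = [1, 0, 1]
for k in (1, 2, 3):
    print(f"Recall@{k}: {recall_at_k(results, ground_truth, k):.3f}")
# prints Recall@1: 0.333, Recall@2: 0.667, Recall@3: 1.000
```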