Benchmark data sets for set similarity search algorithms.
Data set | Note | Number of sets | Number of tokens | File size | Papers |
---|---|---|---|---|---|
BMS-POS (Source) | A set is a purchase in a shop; a token is a product category in that purchase | 515,597 | 1,657 | 3.8 MB | 1 |
Kosarak (Source) | A set is a user; a token is a link clicked by the user | 990,002 | 41,270 | 13 MB | 1 |
Flickr | A set is a photo; a token is a tag or a word from the title | 1,680,490 | 810,660 | 29 MB | 1,4 |
Netflix (Source) | A set is a user; a token is a movie rated by the user | 480,189 | 17,770 | 166 MB | 1 |
Orkut (Source) | A set is a user; a token is a group membership of the user | 1,853,285 | 15,293,693 | 378 MB | 1 |
Canada-US-UK Open Data (Query Benchmark 1k, Query Benchmark 10k, Query Benchmark 100k) | A set is a table column; a token is a data value | 745,414 | 562,320,456 | 2.52 GB | 2 |
WDC Web Table 2015, English Relational-Only (Query Benchmark 100, Query Benchmark 1k, Query Benchmark 10k) | A set is a table column; a token is a data value | 163,510,917 | 184,644,583 | 4.32 GB | 2,3 |
All data sets follow the same format:
- Compressed using gzip.
- The first line of the main file is `<number of sets> <number of tokens>`, optionally followed by a third number, `<sum of all set sizes>`.
- All other lines are `<set size>\t<token 1>,<token 2>,<token 3>,...`, where `\t` is a tab separator and `<token 1>`, `<token 2>`, and so on are the tokens of the set.
- All tokens are integers, transformed from the original strings using a global ascending frequency order.
Papers on set similarity search that use the above data sets (numbers correspond to the Papers column in the table):
1. An Empirical Evaluation of Set Similarity Join Techniques, VLDB 2016
2. LSH Ensemble: Internet Scale Domain Search, VLDB 2016
3. JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes, SIGMOD 2019 (To Appear)
4. Spatio-textual similarity joins, VLDB 2012