Benchmark data sets for set similarity search algorithms.
Data set | Note | Number of sets | Number of tokens | File size | Papers |
---|---|---|---|---|---|
BMS-POS (Source) | A set is a purchase in a shop; a token is a product category in that purchase | 515,597 | 1,657 | 3.8 MB | 1 |
Kosarak (Source) | A set is a user; a token is a link clicked by the user | 990,002 | 41,270 | 13 MB | 1 |
Flickr | A set is a photo; a token is a tag or a word from the title | 1,680,490 | 810,660 | 29 MB | 1,4 |
Netflix (Source) | A set is a user; a token is a movie rated by the user | 480,189 | 17,770 | 166 MB | 1 |
Orkut (Source) | A set is a user; a token is a group membership of the user | 1,853,285 | 15,293,693 | 378 MB | 1 |
Canada-US-UK Open Data (Query Benchmark 1k, Query Benchmark 10k, Query Benchmark 100k) | A set is a table column; a token is a data value | 745,414 | 562,320,456 | 2.52 GB | 2 |
WDC Web Table 2015, English Relational-Only (Query Benchmark 100, Query Benchmark 1k, Query Benchmark 10k) | A set is a table column; a token is a data value | 163,510,917 | 184,644,583 | 4.32 GB | 2,3 |
All data sets follow the same format:
- Compressed using gzip.
- The first line of the main file is `<number of sets> <number of tokens>`, optionally followed by a third number, `<sum of all set sizes>`.
- All other lines are `<set size>\t<token 1>,<token 2>,<token 3>,...`, where `\t` is a tab separator and `<token 1>`, `<token 2>`, and so on are the tokens of the set.
- All tokens are integers, transformed from the original strings using a global ascending frequency order.
Papers on set similarity search that use the above data sets (numbers correspond to the Papers column in the table):
1. An Empirical Evaluation of Set Similarity Join Techniques, VLDB 2016
2. LSH Ensemble: Internet Scale Domain Search, VLDB 2016
3. JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes, SIGMOD 2019 (To Appear)
4. Spatio-textual similarity joins, VLDB 2012