ekzhu / set-similarity-search-benchmarks

Benchmark Datasets for Set Similarity Search

Set Similarity Search Benchmarks

Benchmark data sets for set similarity search algorithms.

| Data set | Note | Number of sets | Number of tokens | File size | Papers |
|---|---|---|---|---|---|
| BMS-POS (Source) | A set is a purchase in a shop; a token is a product category in that purchase | 515,597 | 1,657 | 3.8 MB | 1 |
| Kosarak (Source) | A set is a user; a token is a link clicked by the user | 990,002 | 41,270 | 13 MB | 1 |
| Flickr | A set is a photo; a token is a tag or a word from the title | 1,680,490 | 810,660 | 29 MB | 1, 4 |
| Netflix (Source) | A set is a user; a token is a movie rated by the user | 480,189 | 17,770 | 166 MB | 1 |
| Orkut (Source) | A set is a user; a token is a group membership of the user | 1,853,285 | 15,293,693 | 378 MB | 1 |
| Canada-US-UK Open Data (Query Benchmarks: 1k, 10k, 100k) | A set is a table column; a token is a data value | 745,414 | 562,320,456 | 2.52 GB | 2 |
| WDC Web Table 2015, English Relational-Only (Query Benchmarks: 100, 1k, 10k) | A set is a table column; a token is a data value | 163,510,917 | 184,644,583 | 4.32 GB | 2, 3 |

All data sets follow the same format:

  • Compressed using gzip.
  • First line of the main file is <number of sets> <number of tokens>, optionally followed by a third number, <sum of all set sizes>.
  • All other lines are <set size>\t<1>,<2>,<3>,..., where \t is a tab separator and <1>, <2>, and so on are the tokens of the set.
  • All tokens are integers, transformed from the original strings using a global ascending frequency order.
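
The format described above can be read with a few lines of Python. This is a minimal sketch, not code from this repository: the function names (parse_header, parse_sets, read_benchmark) are hypothetical, and it assumes the header fields are separated by whitespace and that the file contains only ASCII digits, commas, tabs, and newlines.

```python
import gzip
from typing import Iterable, Iterator, List, Optional, Tuple

# (number of sets, number of tokens, optional sum of all set sizes)
Header = Tuple[int, int, Optional[int]]


def parse_header(line: str) -> Header:
    """Parse '<number of sets> <number of tokens> [<sum of all set sizes>]'."""
    fields = [int(x) for x in line.split()]  # assumes whitespace-separated fields
    if len(fields) not in (2, 3):
        raise ValueError(f"unexpected header: {line!r}")
    total = fields[2] if len(fields) == 3 else None
    return fields[0], fields[1], total


def parse_sets(lines: Iterable[str]) -> Iterator[List[int]]:
    """Yield one list of integer tokens per '<set size>\\t<1>,<2>,...' line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        size_field, _, token_field = line.partition("\t")
        tokens = [int(t) for t in token_field.split(",") if t]
        if int(size_field) != len(tokens):
            raise ValueError("declared set size does not match token count")
        yield tokens


def read_benchmark(path: str) -> Tuple[Header, List[List[int]]]:
    """Read a gzip-compressed data set file fully into memory."""
    with gzip.open(path, "rt", encoding="ascii") as f:
        header = parse_header(next(f))
        sets = list(parse_sets(f))
    return header, sets
```

read_benchmark loads everything into a list, which is only practical for the smaller files. For the multi-gigabyte data sets, it is probably better to open the file with gzip.open and iterate over parse_sets lazily.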

Papers in set similarity search using the above data sets:

  1. An Empirical Evaluation of Set Similarity Join Techniques, VLDB 2016
  2. LSH Ensemble: Internet Scale Domain Search, VLDB 2016
  3. JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes, SIGMOD 2019 (To Appear)
  4. Spatio-textual similarity joins, VLDB 2012

License: Apache License 2.0