Public Datasets For Recommender Systems

This repository collects high-quality, topic-centric public data sources for Recommender Systems (RS). They were gathered and tidied from Stack Overflow, articles, recommender sites, and academic experiments. Most of the datasets presented here are free and carry open source licenses; some are not, and for those you need to ask permission to use them or cite the authors' work.

In addition, this repository contains some datasets that have been pre-processed for academic experiments.

Links and dataset descriptions

Book

Book Crossing:: The BookCrossing (BX) dataset was collected by Cai-Nicolas Ziegler in a 4-week crawl (August / September 2004) from the Book-Crossing community.

Dating

Dating Agency:: This dataset contains 17,359,346 anonymous ratings of 168,791 profiles made by 135,359 LibimSeTi users as dumped on April 4, 2006.

E-commerce

Amazon:: This dataset contains product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996 - July 2014.
Retailrocket recommender system dataset:: The dataset consists of three files: a file with behaviour data (events.csv), a file with item properties (item_properties.csv), and a file that describes the category tree (category_tree.csv). The data was collected from a real-world e-commerce website.

Music

Amazon Music:: This digital music dataset contains reviews and metadata from Amazon.
Yahoo Music:: This dataset represents a snapshot of the Yahoo! Music community's preferences for various musical artists.
LastFM (Implicit):: This dataset contains social networking, tagging, and music artist listening information from a set of 2K users of the Last.fm online music system.
Million Song Dataset:: The Million Song Dataset is a freely-available collection of audio features and metadata for a million contemporary popular music tracks.

Movies

MovieLens:: GroupLens Research has collected and made available rating datasets from the MovieLens web site.
Yahoo Movies:: This dataset contains a small sample of the Yahoo! Movies community's preferences for various movies.
CiaoDVD:: CiaoDVD is a dataset crawled from the entire DVD category of the dvd.ciao.co.uk website in December 2013.
FilmTrust:: FilmTrust is a small dataset crawled from the entire FilmTrust website in June 2011.
Netflix:: This is the official data set used in the Netflix Prize competition.

Games

Steam Video Games:: This dataset is a list of user behaviors, with columns: user-id, game-title, behavior-name, value. The behaviors included are 'purchase' and 'play'. The value indicates the degree to which the behavior was performed: for 'purchase' the value is always 1, and for 'play' the value is the number of hours the user has played the game.
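
As a rough illustration of the column layout described above, the following sketch splits the rows into purchases and play hours using only the Python standard library. The sample rows, the function name load_steam_behaviors, and the assumption that the file is comma-separated with no header row are all hypothetical; check the actual file before relying on them.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows in the documented column order
# (user-id, game-title, behavior-name, value). Not real data.
SAMPLE = """\
101,Game A,purchase,1.0
101,Game A,play,12.5
101,Game B,purchase,1.0
202,Game A,purchase,1.0
202,Game A,play,0.4
"""


def load_steam_behaviors(lines):
    """Split behavior rows into purchases and play hours.

    Returns (purchases, hours), where purchases maps a user id to the set
    of purchased game titles and hours maps (user id, game title) to the
    total number of hours played.
    """
    purchases = defaultdict(set)
    hours = defaultdict(float)
    for row in csv.reader(lines):
        if len(row) < 4:
            continue  # skip blank or malformed rows
        user, game, behavior, value = row[0], row[1], row[2], float(row[3])
        if behavior == "purchase":
            # 'purchase' rows always carry the value 1, so only the pair matters
            purchases[user].add(game)
        elif behavior == "play":
            # 'play' rows carry the number of hours played
            hours[(user, game)] += value
    return purchases, hours


purchases, hours = load_steam_behaviors(io.StringIO(SAMPLE))
```

Keeping the two behaviors in separate structures makes it easy to treat purchases as binary implicit feedback and play hours as a confidence weight, which is one common way to use this kind of data.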

Jokes

Jester:: This joke dataset contains 4.1 million continuous ratings (-10.00 to +10.00) of 100 jokes from 73,496 users.
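
Because the Jester ratings are continuous on a -10.00 to +10.00 scale, they often need rescaling before being mixed with datasets that use a different range. The sketch below shows a plain linear rescaling; the function name rescale_rating and the 1 to 5 target range are assumptions chosen for illustration, not part of the dataset.

```python
def rescale_rating(rating, old_min=-10.0, old_max=10.0, new_min=1.0, new_max=5.0):
    """Linearly map a rating from [old_min, old_max] to [new_min, new_max].

    The defaults match the -10.00 to +10.00 range described above; the
    1 to 5 target range is only an illustrative assumption.
    """
    if not old_min <= rating <= old_max:
        raise ValueError(f"rating {rating} is outside [{old_min}, {old_max}]")
    # Position of the rating within the old range, between 0 and 1
    fraction = (rating - old_min) / (old_max - old_min)
    return new_min + fraction * (new_max - new_min)


lowest = rescale_rating(-10.0)
neutral = rescale_rating(0.0)
highest = rescale_rating(10.0)
```

A linear map preserves the relative spacing between ratings, so rankings computed on the rescaled values are the same as on the original ones.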

Food

Chicago Entree:: This dataset contains a record of user interactions with the Entree Chicago restaurant recommendation system.

Anime

Anime Recommendations Database:: This dataset contains user preference data from 73,516 users on 12,294 anime. Each user can add anime to their completed list and give it a rating, and this dataset is a compilation of those ratings.

Other datasets

You can find more datasets at:

GroupLens Datasets link
LibRec Datasets link
Yahoo Research link
Datasets for Machine Learning link
Stanford Large Network Dataset Collection link

Usage and License

Before using these data sets, please review their README files or sites for the usage licenses, acknowledgments and other details.

Note: If you have difficulty downloading any of these datasets, please contact me. I have a backup of all datasets.

Recommender Tools

Case Recommender:: Python.
MyMediaLite:: C#.

Contributors

Arthur Fortes da Costa {fortes [dot] arthur [at] gmail [dot] com} [Editor]
