jamesliao2016 / RecSys2018

1'st Place Approach by Layer6 AI to the 2018 ACM RecSys Challenge

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

2018 ACM RecSys Challenge 1'st Place Team vl6

Team members: Maksims Volkovs, Himanshu Rai, Zhaoyue Cheng, Yichao Lu (University of Toronto), Ga Wu (University of Toronto, Vector Institute), Scott Sanner (University of Toronto, Vector Institute) (paper)

Contact: maks@layer6.ai

This repository contains the Java implementation of our entries for both main and creative tracks. Our approach consists of a two-stage model where in the first stage a blend of collaborative filtering methods is used to quickly retrieve a set of candidate songs for each playlist with high recall. Then in the second stage a pairwise playlist-song gradient boosting model is used to re-rank the retrieved candidates and maximize precision at the top of the recommended list.

The model is implemented in Java and tested on the following environment:

  • Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
  • 256GB RAM
  • Nvidia Titan V
  • Java Oracle 1.8.0_171
  • Python, Numpy 1.14.3, Sklearn 0.19.1, Scipy 1.1.0
  • Apache Maven 3.3.9
  • CUDA 8.0 and CUDNN 8.0
  • Intel MKL 2018.1.038
  • XGBoost and XGBoost4j 0.7

All models are executed from src/main/java/main/Executor.java, the main function has examples on how to do main and creative track model training, evaluation and submission. To run the model:

  • Set all paths:
//OAuth token for spotify creative api, if doing creative track submission
String authToken = "";

// path to song audio feature file, if doing creative track submission
String creativeTrackFile = "/home/recsys2018/data/song_audio_features.txt";

// path to MPD directory with the JSON files
String trainPath = "/home/recsys2018/data/train/";

// path to challenge set JSON file
String testFile = "/home/recsys2018/data/test/challenge_set.json";

// path to python SVD script included in the repo, default location: script/svd_py.py
String pythonScriptPath = "/home/recsys2018/script/svd_py.py";

//path to cache folder for temp storage, at least 20GB should be available in this folder
String cachePath = "/home/recsys2018/cache/";
  • Compile and execute with maven:
export MAVEN_OPTS="-Xms150g -Xmx150g"
mvn clean compile
mvn exec:java -Dexec.mainClass="main.Executor" 

Note that by default the code is executing model for the main track, to run the creative track model set xgbParams.doCreative = true. For the creative track we extracted extra song features from the Spotify Audio API. We were able to match most songs from the challenge Million Playlist Dataset, and used the following fields for further feature extraction: [acousticness, danceability, energy, instrumentalness, key, liveness, loudness, mode, speechiness, tempo, time_signature, valence]. In order to download the data for this track, you need to get the OAuth Token from Spotify API page and assign it to the authToken variable in the Executor.main function.

We prioritized speed over memory for this project so you'll need at least 100GB of RAM to run model training and inference. The full end-to-end runtime takes approximately 1.5 days.

About

1'st Place Approach by Layer6 AI to the 2018 ACM RecSys Challenge

License:Apache License 2.0


Languages

Language:Java 99.1%Language:Python 0.9%