Details of the competition can be found here.
To keep the repo lightweight, the dataset does not ship with the code. The .csv
data can be downloaded from Kaggle (requires an account) and untarred in the top-level directory.
Some benchmarks require the scikit-learn package.
Semi-supervised learning review:
The competition appears to be an instance of bipartite ranking:
- An Efficient Boosting Algorithm for Combining Preferences
- A boosting algorithm for learning bipartite ranking functions with partially labeled data
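For reference, the quantity a bipartite ranker tries to minimize is the fraction of (positive, negative) pairs that the scores order incorrectly, which equals 1 minus AUC. The sketch below is plain Python and is not taken from either paper; the function name `pairwise_ranking_loss` and the 0/1 label convention are assumptions made for illustration.

```python
def pairwise_ranking_loss(scores, labels):
    """Fraction of (positive, negative) pairs ranked in the wrong order.

    `labels` are assumed to be 1 for positives and 0 for negatives.
    Ties count as half an error, so the result equals 1 - AUC.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # A pair is misranked when the negative scores above the positive.
    errors = sum((p < n) + 0.5 * (p == n) for p in pos for n in neg)
    return errors / (len(pos) * len(neg))


if __name__ == "__main__":
    loss = pairwise_ranking_loss([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
    print(loss)  # 0.25: one of the four positive/negative pairs is misordered
```

This enumerates all pairs, so it is quadratic in the number of examples; it is meant as a sanity check for a benchmark, not as a training objective at scale.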
Personalized PageRank with Monte Carlo looks promising:
- Build features with link analysis on the author/paper graph, possibly with the NetworkX library (it doesn't seem to scale, so we probably need our own implementation)
- How to use titles, keywords, affiliation and other raw text features?
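A minimal sketch of what a home-grown Monte Carlo personalized PageRank could look like, using only the standard library. It assumes the graph is given as (author_id, paper_id) pairs and uses the endpoint estimator: run many random walks from a source node, stop each walk with a fixed probability at every step, and count where the walks end. The names `build_graph` and `monte_carlo_ppr`, the default restart probability of 0.15, and the input format are all assumptions for illustration, not part of this repo.

```python
import random
from collections import Counter, defaultdict


def build_graph(author_paper_pairs):
    """Undirected bipartite adjacency lists from (author_id, paper_id) pairs."""
    adj = defaultdict(list)
    for author, paper in author_paper_pairs:
        a, p = ("author", author), ("paper", paper)
        adj[a].append(p)
        adj[p].append(a)
    return adj


def monte_carlo_ppr(adj, source, num_walks=10000, restart_prob=0.15, seed=0):
    """Estimate personalized PageRank scores with respect to `source`.

    Each walk starts at `source`; at every step it stops with probability
    `restart_prob`, otherwise it moves to a uniformly random neighbor.
    The fraction of walks ending at a node estimates that node's score.
    """
    rng = random.Random(seed)
    ends = Counter()
    for _ in range(num_walks):
        node = source
        while rng.random() >= restart_prob:
            neighbors = adj.get(node)
            if not neighbors:
                break  # isolated node: the walk cannot move, so end it here
            node = rng.choice(neighbors)
        ends[node] += 1
    return {node: count / num_walks for node, count in ends.items()}


if __name__ == "__main__":
    pairs = [(1, 10), (1, 11), (2, 11), (3, 12)]  # toy data, not from the dataset
    scores = monte_carlo_ppr(build_graph(pairs), ("author", 1), num_walks=2000)
    for node, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(node, round(score, 3))
```

The score of a ("paper", id) node relative to an ("author", id) source could then serve as one feature per author/paper pair. Walks are independent, so the estimate can be parallelized or truncated to trade accuracy for speed.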