In-class Machine Learning data challenge: https://www.kaggle.com/c/inf554-2018/.
The goal is to predict missing links in a citation network, using textual and topological features computed for each pair of papers (source and target).
The extracted features are:
Textual features
- (1) Cosine similarity of abstracts
- (2) Number of overlapping words in abstracts
- (3) Cosine similarity of titles
- (4) Number of overlapping words in titles
- (5) Number of overlapping words between target’s title and source’s abstract
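A minimal sketch of how the textual features listed above could be computed. It assumes lowercase whitespace tokenization and raw term counts; the project itself may use TF-IDF weighting, stemming, or stop-word removal instead. The function names `tokenize`, `cosine_similarity`, and `overlap_count` are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on whitespace only.
    return text.lower().split()


def cosine_similarity(text_a, text_b):
    # Cosine of the angle between the two term-count vectors.
    va, vb = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # at least one text is empty
    return dot / (norm_a * norm_b)


def overlap_count(text_a, text_b):
    # Number of distinct words shared by the two texts.
    return len(set(tokenize(text_a)) & set(tokenize(text_b)))
```

The same two functions cover features (1) to (5): pass two abstracts, two titles, or the target's title together with the source's abstract.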
Graphical features
Citation graph
- (6) Number of common neighbours
- (7) Link-based Jaccard coefficient
- (8) Adamic-Adar index
- (9) Preferential attachment
- (10) Difference in betweenness centrality
- (11) Difference in the number of in-links
- (12) Number of times the target is cited
- (13) PageRank of source
- (14) PageRank of target
- (15) Minimal distance*
- (16) Is the same cluster?
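As an illustration of the neighbourhood-based measures (6) to (9), here is a minimal sketch in plain Python. It assumes the citation graph is treated as undirected and stored as a dictionary mapping each paper to the set of its neighbours; the function name `link_features` and the toy graph are hypothetical.

```python
import math


def link_features(adj, u, v):
    # adj: dict mapping a node to the set of its neighbours (undirected view).
    nu, nv = adj.get(u, set()), adj.get(v, set())
    common = nu & nv
    union = nu | nv
    return {
        "common_neighbours": len(common),
        "jaccard": len(common) / len(union) if union else 0.0,
        # Adamic-Adar: shared neighbours with few links count more.
        "adamic_adar": sum(
            1.0 / math.log(len(adj[w])) for w in common if len(adj[w]) > 1
        ),
        # Preferential attachment: product of the two degrees.
        "preferential_attachment": len(nu) * len(nv),
    }


# Toy graph, for illustration only.
edges = [("a", "c"), ("b", "c"), ("a", "d"), ("b", "d"), ("d", "e")]
adj = {}
for x, y in edges:
    adj.setdefault(x, set()).add(y)
    adj.setdefault(y, set()).add(x)

features = link_features(adj, "a", "b")
```

Libraries such as networkx provide equivalent link-prediction functions (for example `jaccard_coefficient`, `adamic_adar_index`, and `preferential_attachment`), which may be more convenient on the full graph.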
Author collaboration graph
- (17) Number of common neighbours
- (18) Link-based Jaccard coefficient
- (19) Preferential attachment
- (20) Adamic-Adar index
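A possible way to build the author collaboration graph, sketched under the assumption that two authors are linked whenever they appear together on at least one paper. How the project aggregates author-level scores into a feature for a pair of papers is not stated here, so that step is left out. The name `build_coauthor_graph` and the sample data are hypothetical.

```python
from itertools import combinations


def build_coauthor_graph(author_lists):
    # author_lists: one list of author names per paper.
    adj = {}
    for authors in author_lists:
        unique = sorted(set(authors))
        for a in unique:
            adj.setdefault(a, set())  # keep single-author papers as isolated nodes
        for a, b in combinations(unique, 2):
            adj[a].add(b)
            adj[b].add(a)
    return adj


# Toy input, for illustration only.
papers = [["Ann", "Bob", "Cy"], ["Bob", "Dee"], ["Eve"]]
coauthors = build_coauthor_graph(papers)
```

The resulting adjacency dictionary can then be passed to any neighbourhood-based measure, such as the number of common neighbours or the Jaccard coefficient.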
Other features
- (21) Difference in publication year
- (22) Journal popularity of target
- (23) Is the same journal?
- (24) The number of common authors
- (25) Is a self-citation?
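A minimal sketch of some of the metadata features above. It assumes each paper is a dictionary with hypothetical keys `year`, `journal`, and `authors`, that the year difference is taken as source minus target, and that a self-citation means the two papers share at least one author. Journal popularity is omitted because its definition is not given here.

```python
def metadata_features(source, target):
    # source, target: dicts with hypothetical keys "year", "journal", "authors".
    common_authors = set(source["authors"]) & set(target["authors"])
    same_journal = bool(source["journal"]) and source["journal"] == target["journal"]
    return {
        "year_diff": source["year"] - target["year"],
        "same_journal": int(same_journal),
        "common_authors": len(common_authors),
        # Assumption: a self-citation is a pair sharing at least one author.
        "self_citation": int(len(common_authors) > 0),
    }


# Toy records, for illustration only.
src = {"year": 2003, "journal": "Phys.Lett.", "authors": ["Ann", "Bob"]}
tgt = {"year": 1999, "journal": "Phys.Lett.", "authors": ["Bob", "Cy"]}
meta = metadata_features(src, tgt)
```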
We tried several classifiers: ExtraTree, AdaBoost, LogisticRegression, LinearSVM, RandomForest, and NeuralNetwork.
Our best Kaggle submission used the neural network and reached an F1-score of 0.97715.
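For reference, the F1-score is the harmonic mean of precision and recall on the positive class (here, "a link exists"). The sketch below computes it from binary labels in plain Python; the function name and the sample labels are illustrative only and are not taken from the project code.

```python
def f1_score(y_true, y_pred):
    # y_true, y_pred: sequences of 0/1 labels, 1 meaning "link exists".
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels, for illustration only.
score = f1_score([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```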
For more information about the project, please see our report in the file report_ML1_2018.pdf.
The data files must be placed in the folder data.