Missing Citations Prediction

In-class Machine Learning data challenge: https://www.kaggle.com/c/inf554-2018/.

Predict missing links in a citation network using textual features as well as topological features between two papers.

1. Features

The extracted features are:
Textual features

(1) Cosine similarity of abstracts
(2) Number of overlapping words in the abstracts
(3) Cosine similarity of titles
(4) Number of overlapping words in the titles
(5) Number of overlapping words between the target’s title and the source’s abstract
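
As a rough illustration of features (1) to (5), the sketch below computes a cosine similarity and an overlap count on two texts. It assumes raw term counts and a simplistic regex tokenizer; the project itself may use a different weighting (for example TF-IDF), stop-word removal, or stemming. The function names `tokenize`, `cosine_similarity`, and `overlap_count` are hypothetical and do not come from the repository.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the raw term-count vectors of two texts."""
    a, b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    # An empty text has a zero vector; define the similarity as 0 in that case.
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def overlap_count(text_a, text_b):
    """Number of distinct words that appear in both texts."""
    return len(set(tokenize(text_a)) & set(tokenize(text_b)))
```

The same two functions can be applied to a pair of abstracts, a pair of titles, or the target's title against the source's abstract.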

Graph-based features
Citation graph

(6) Number of common neighbours
(7) Link-based Jaccard coefficient
(8) Adamic-Adar index
(9) Preferential attachment
(10) Difference in betweenness centrality
(11) Difference in the number of in-links
(12) Number of times target cited
(13) Pagerank of source
(14) Pagerank of target
(15) Minimal distance*
(16) Whether the two papers are in the same cluster
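
The neighbourhood-based scores (6) to (9) can be sketched as follows. This is a minimal version that assumes the graph is treated as undirected and stored as a dictionary of neighbour sets; the helper names `build_adjacency` and `link_features` are hypothetical, and the repository may well rely on a graph library instead. The same scores apply to the author collaboration graph.

```python
import math


def build_adjacency(edges):
    """Build an undirected adjacency map {node: set of neighbours} from edge pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def link_features(adj, u, v):
    """Neighbourhood-based link-prediction scores for the candidate pair (u, v)."""
    nu, nv = adj.get(u, set()), adj.get(v, set())
    common = nu & nv
    union = nu | nv
    return {
        "common_neighbours": len(common),
        # Jaccard: shared neighbours over all neighbours of either node.
        "jaccard": len(common) / len(union) if union else 0.0,
        # Adamic-Adar: shared neighbours weighted by 1 / log(degree).
        "adamic_adar": sum(
            1.0 / math.log(len(adj[w])) for w in common if len(adj[w]) > 1
        ),
        # Preferential attachment: product of the two degrees.
        "preferential_attachment": len(nu) * len(nv),
    }
```

For example, `link_features(build_adjacency(edges), source_id, target_id)` returns one dictionary of scores per candidate pair, where `edges`, `source_id`, and `target_id` are placeholders for the training edges and paper identifiers.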

Author collaboration graph

(17) Number of common neighbours
(18) Link-based Jaccard coefficient
(19) Preferential attachment
(20) Adamic-Adar index

Other features

(21) Difference in publication year
(22) Journal popularity of target
(23) Whether the two papers are in the same journal
(24) The number of common authors
(25) Whether the citation is a self-citation
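
Features (21) and (23) to (25) can be derived directly from paper metadata, as in the sketch below. The record layout (dictionaries with `year`, `journal`, and `authors` keys) and the function name `metadata_features` are hypothetical, and treating a self-citation as "the two papers share at least one author" is an assumption about what feature (25) means.

```python
def metadata_features(source, target):
    """Metadata-based features for a (source, target) pair of paper records."""
    common = set(source["authors"]) & set(target["authors"])
    return {
        # Positive when the source was published after the target.
        "year_difference": source["year"] - target["year"],
        # A missing (empty) journal never counts as a match.
        "same_journal": int(bool(source["journal"])
                            and source["journal"] == target["journal"]),
        "common_authors": len(common),
        # Assumption: a self-citation means at least one shared author.
        "self_citation": int(len(common) > 0),
    }
```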

2. Results

We tried several classifiers: ExtraTree, Adaboost, LogisticRegression, LinearSVM, RandomForest, and NeuralNetwork.
Our best submission on Kaggle used the NeuralNetwork classifier and reached an F1-score of 0.97715.
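
For reference, the F1-score is the harmonic mean of precision and recall on the positive class (here, "a citation exists"). A minimal sketch for binary 0/1 labels follows; the function name `f1_score` is chosen for illustration, and Kaggle computes the leaderboard score on its own held-out labels.

```python
def f1_score(y_true, y_pred):
    """F1-score of the positive class for binary labels encoded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```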

3. Others

For more information about the project, see our report in the file report_ML1_2018.pdf.
The data must be placed in the folder data.
