tuanh208 / MissingCitationsPrediction

Predict missing links in a citation network.

Missing Citations Prediction

In-class Machine Learning data challenge: https://www.kaggle.com/c/inf554-2018/.

Predict missing links in a citation network using textual features as well as topological features between two papers.

1. Features

The extracted features are:
Textual features

  • (1) Cosine similarity of abstracts
  • (2) Number of overlapping words in abstracts
  • (3) Cosine similarity of titles
  • (4) Number of overlapping words in titles
  • (5) Number of overlapping words between the target’s title and the source’s abstract
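
As an illustration of features (1) to (5), here is a minimal sketch of a cosine similarity and a word-overlap count between two texts. It assumes plain term-count vectors and a simple regex tokenizer; the README does not say how the texts were actually vectorized (TF-IDF weighting or stop-word removal, for example, may have been used), and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the raw term-count vectors of two texts."""
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction, so report no similarity
    return dot / (norm_a * norm_b)


def overlap_count(text_a, text_b):
    """Number of distinct words that appear in both texts."""
    return len(set(tokenize(text_a)) & set(tokenize(text_b)))
```

The same two functions can be applied to abstract/abstract, title/title, or title/abstract pairs to obtain the five textual features.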

Graph-based features
Citation graph

  • (6) Number of common neighbours
  • (7) Link-based Jaccard coefficient
  • (8) Adamic-Adar index
  • (9) Preferential attachment
  • (10) Difference in betweenness centrality
  • (11) Difference in the number of in-links
  • (12) Number of times the target is cited
  • (13) Pagerank of source
  • (14) Pagerank of target
  • (15) Minimal distance*
  • (16) Is the same cluster?
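
The neighbourhood-based scores (6) to (9) can be sketched as follows. This is a minimal illustration that treats the citation graph as undirected and stores it as adjacency sets; whether the project used a directed or undirected graph, and which library it relied on, is not stated here, so the helper names are hypothetical.

```python
import math


def build_adjacency(edges):
    """Build undirected adjacency sets from (source, target) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def link_features(adj, u, v):
    """Neighbourhood-based link prediction scores for the node pair (u, v)."""
    nu = adj.get(u, set())
    nv = adj.get(v, set())
    common = nu & nv
    union = nu | nv
    return {
        "common_neighbours": len(common),
        # Jaccard: shared neighbours divided by all neighbours of either node
        "jaccard": len(common) / len(union) if union else 0.0,
        # Adamic-Adar: shared neighbours weighted by 1 / log(degree);
        # degree-1 nodes are skipped to avoid dividing by log(1) = 0
        "adamic_adar": sum(
            1.0 / math.log(len(adj[w])) for w in common if len(adj[w]) > 1
        ),
        # Preferential attachment: product of the two degrees
        "preferential_attachment": len(nu) * len(nv),
    }
```

For example, with edges a-c, b-c, a-d and b-e, the pair (a, b) shares the single neighbour c, so the Jaccard coefficient is 1/3 and the preferential attachment score is 2 × 2 = 4.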

Author collaboration graph

  • (17) Number of common neighbours
  • (18) Link-based Jaccard coefficient
  • (19) Preferential attachment
  • (20) Adamic-Adar index
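
A collaboration graph is usually built by linking two authors whenever they appear on the same paper. The sketch below assumes that definition and takes one author list per paper as input. How author-level scores are combined into a single feature for a pair of papers is not described in this README, so the example only scores a pair of authors; all names are hypothetical.

```python
from itertools import combinations


def build_coauthor_graph(papers):
    """Link two authors whenever they appear together on a paper.

    `papers` is an iterable of author lists, one list per paper.
    """
    adj = {}
    for authors in papers:
        unique = sorted(set(authors))
        for a in unique:
            adj.setdefault(a, set())  # keep single-author papers as isolated nodes
        for a, b in combinations(unique, 2):
            adj[a].add(b)
            adj[b].add(a)
    return adj


def common_collaborators(adj, author_a, author_b):
    """Number of co-authors that the two authors have in common."""
    return len(adj.get(author_a, set()) & adj.get(author_b, set()))
```

Once the graph is built, the Jaccard, preferential attachment, and Adamic-Adar scores follow from the same neighbour sets.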

Other features

  • (21) Difference in publication year
  • (22) Journal popularity of target
  • (23) Is the same journal?
  • (24) The number of common authors
  • (25) Is self-cited?
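
Several of these metadata features reduce to simple comparisons between two paper records. The sketch below assumes each paper is a dictionary with hypothetical keys `year`, `journal`, and `authors`, and it assumes that "self-cited" means the two papers share at least one author; neither the field names nor that definition is confirmed by this README.

```python
def pair_metadata_features(source, target):
    """Metadata features for a (source, target) pair of paper records."""
    common_authors = set(source["authors"]) & set(target["authors"])
    # Treat a missing (empty) journal as "not the same journal"
    same_journal = bool(source["journal"]) and source["journal"] == target["journal"]
    return {
        "year_diff": source["year"] - target["year"],
        "same_journal": int(same_journal),
        "common_authors": len(common_authors),
        # Assumed definition: a self-citation shares at least one author
        "self_cited": int(len(common_authors) > 0),
    }
```

Journal popularity of the target (22) is left out because its definition is not given here.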

2. Results

We tried several classifiers: ExtraTree, Adaboost, LogisticRegression, LinearSVM, RandomForest, and NeuralNetwork.
Our best submission on Kaggle used the neural network and reached an F1-score of 0.97715.
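
For reference, the F1-score is the harmonic mean of precision and recall. A minimal sketch for binary labels, where 1 means "citation exists" (the function name is hypothetical and this is not the project's evaluation code):

```python
def f1_binary(y_true, y_pred):
    """F1-score for binary labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

With one true positive, one false positive, and one false negative, precision and recall are both 0.5, so the F1-score is 0.5.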

3. Others

For more information about the project, see our report in the file report_ML1_2018.pdf.
The data must be placed in the folder data.

