Dataset

This project uses the KDD Cup 2003 dataset, which contains roughly 30,000 research papers from arXiv. I use two parts of it: the abstracts and the citation graph. The abstract files contain metadata for each research paper; the fields used here are title, author names, and date.
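To make the pairing of the two parts concrete, here is a minimal sketch of how a citation edge list and per-paper author metadata could be combined into "authors cited by each paper". It assumes the citation graph is a whitespace-separated list of `citing cited` pairs. The paper IDs, author names, and the `cited_authors` helper are made up for illustration and are not taken from the notebooks.

```python
from collections import defaultdict

# Toy stand-ins for the real files; the IDs and names here are made up.
CITATIONS = """\
p3 p1
p3 p2
p4 p1
"""
AUTHORS = {
    "p1": ["A. Alpha"],
    "p2": ["B. Beta", "C. Gamma"],
    "p3": ["D. Delta"],
    "p4": ["A. Alpha", "D. Delta"],
}

def cited_authors(citation_text, authors_by_paper):
    """Map each citing paper to the sorted set of authors it cites."""
    cited = defaultdict(set)
    for line in citation_text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        citing, target = parts
        # Papers without author metadata contribute no names.
        cited[citing].update(authors_by_paper.get(target, []))
    return {paper: sorted(names) for paper, names in cited.items()}

features = cited_authors(CITATIONS, AUTHORS)
print(features)
```

In this toy setup, the cited-author list would serve as the input for a paper and one of that paper's own authors as the label.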

from data import Data
from model import Model
import random

X_train, X_test, y_train, y_test, AUTHOR_MAP = Data().getXY(useSaved=True)

model = Model().getFitModel(X_train, y_train)

accuracy, true_idx = Model().testModel(X_test, y_test, model, subset_acc=False)
print(f'Accuracy of the model: {accuracy}')

# Pick a random test sample; randrange excludes len(X_test), so the index is always in range
index = random.randrange(len(X_test))

# Invert AUTHOR_MAP so the entries of X_test can be mapped back to author names
AUTHOR_MAP_DASH = {y: x for x, y in AUTHOR_MAP.items()}
# Entries missing from the map are shown as "UNK"
s = ', '.join(AUTHOR_MAP_DASH.get(i, "UNK") for i in X_test[index])

print(f'Cited research authors: {s}; predicted author of the paper: {Model().getAuthor([X_test[index]], AUTHOR_MAP, model)}')

NOTE: The preprocessing and RadiusNeighborsClassifier code is in IIITD.ipynb; node2vec.ipynb contains the node2vec implementation.
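For readers unfamiliar with radius-based neighbor classification, the following is a from-scratch sketch of the idea, not the code in IIITD.ipynb. It assumes each paper is represented as a set of cited-author IDs and uses Jaccard distance; both choices are illustrative assumptions, as are the names `jaccard_distance` and `predict_author` and the toy data. The `radius` and `outlier` arguments only loosely mirror the `radius` and `outlier_label` parameters of scikit-learn's RadiusNeighborsClassifier.

```python
from collections import Counter

def jaccard_distance(a, b):
    """1 - |intersection| / |union|; two empty sets count as identical."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def predict_author(query, X_train, y_train, radius=0.7, outlier="UNK"):
    """Majority vote over the training papers within `radius` of `query`."""
    votes = Counter(
        label
        for features, label in zip(X_train, y_train)
        if jaccard_distance(query, features) <= radius
    )
    if not votes:
        return outlier  # no training paper is close enough
    return votes.most_common(1)[0][0]

# Toy data: sets of cited-author IDs and one author label per paper (made up).
X_train = [{1, 2, 3}, {2, 3, 4}, {7, 8}, {8, 9}]
y_train = ["alice", "alice", "bob", "bob"]

print(predict_author({2, 3}, X_train, y_train))
print(predict_author({8}, X_train, y_train))
print(predict_author({42}, X_train, y_train))
```

Unlike k-nearest neighbors, the number of voters varies per query, so a paper whose citations overlap with no training paper falls back to the outlier label instead of being forced onto a distant neighbor.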

About

Given the citations of a paper, with how much certainty can you predict one of the authors of that paper?
