Here we extract 1.7 million documents from the arXiv dataset available at https://www.kaggle.com/Cornell-University/arxiv. We use 100,000
keywords to track word frequency and co-occurrence within each document.
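The counting step described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names `keywords_in` and `count_cooccurrence`, the whitespace-based phrase matching, and the toy documents are all assumptions. Each keyword is counted at most once per document, so the counts are document frequencies.

```python
from collections import Counter
from itertools import combinations


def keywords_in(text, keywords):
    """Return the sorted keywords (single words or phrases) found in text."""
    # Normalize whitespace and pad with spaces so only whole words match.
    padded = " " + " ".join(text.lower().split()) + " "
    return sorted(k for k in keywords if f" {k} " in padded)


def count_cooccurrence(documents, keywords):
    """Count, per keyword and per keyword pair, the documents containing them."""
    word_counts = Counter()
    pair_counts = Counter()
    for text in documents:
        present = keywords_in(text, keywords)
        word_counts.update(present)
        # Pairs are stored in sorted order, so (a, b) and (b, a) never both occur.
        pair_counts.update(combinations(present, 2))
    return word_counts, pair_counts


# Toy documents, made up for illustration only.
docs = ["graph neural network", "neural network training", "graph theory"]
word_counts, pair_counts = count_cooccurrence(docs, ["graph", "neural network"])
```

Scanning every keyword against every document is far too slow for 100,000 keywords and 1.7 million documents; a real pipeline would more likely use a trie or an n-gram lookup, but the resulting counts would have the same shape.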
Our goal is to build a knowledge graph for finding related words, using pointwise mutual information (PMI) as the metric. Additionally, we project the points of the co-occurrence matrix onto a 2D plot to visualize the similarity and dissimilarity among words and clusters.
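As a rough illustration of the PMI metric, the sketch below scores a keyword pair as log2 of p(x, y) / (p(x) p(y)), with probabilities estimated as document fractions. The function names, the choice of base-2 logarithm, and the toy counts are assumptions made for this example, not values from the dataset.

```python
import math


def pmi(pair_count, count_x, count_y, n_docs):
    """PMI = log2( p(x, y) / (p(x) * p(y)) ), estimated from document counts."""
    if pair_count == 0:
        return float("-inf")  # the two words never co-occur
    return math.log2((pair_count * n_docs) / (count_x * count_y))


def related_words(query, word_counts, pair_counts, n_docs, top_k=3):
    """Rank the words that co-occur with query by PMI, highest first."""
    scores = {}
    for (a, b), count in pair_counts.items():
        if query == a:
            other = b
        elif query == b:
            other = a
        else:
            continue
        scores[other] = pmi(count, word_counts[query], word_counts[other], n_docs)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Toy counts, made up for illustration only.
n_docs = 4
word_counts = {"graph": 2, "tree": 1, "network": 2}
pair_counts = {("graph", "tree"): 1, ("graph", "network"): 1}
ranked = related_words("graph", word_counts, pair_counts, n_docs)
```

PMI tends to favor rare words, since a single shared document can produce a high score. Applying a minimum count threshold before ranking is a common mitigation, though whether this project does so is not stated.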
Query => related words
requirement document => data entity, database design, business process
second language acquisition => second language writing, contrastive analysis, english as second language
bernouli distribution => multiple outcome, dynamic decision making, multivariate gaussian model
business concept => customer segment, strategic change, research knowledge
algorithm => attribute oriented induction, coding algorithm, resource allocation algorithm
We observe that similar words cluster together.
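The 2D projection mentioned earlier could be produced in several ways, and the text does not say which method is used. The sketch below assumes a PCA-style projection computed with NumPy's SVD; the function name `project_2d` and the toy matrix are hypothetical.

```python
import numpy as np


def project_2d(matrix):
    """Project each row of a co-occurrence matrix onto its top two principal axes."""
    X = np.asarray(matrix, dtype=float)
    X = X - X.mean(axis=0)  # center each column
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :2] * S[:2]  # one (x, y) coordinate pair per word


# Toy matrix, made up for illustration: words 0-1 co-occur, as do words 2-3.
cooc = [
    [5, 4, 0, 0],
    [4, 5, 0, 0],
    [0, 0, 5, 4],
    [0, 0, 4, 5],
]
coords = project_2d(cooc)
```

Words with similar co-occurrence rows land close together in `coords`, which is the clustering effect described above. Nonlinear methods such as t-SNE or UMAP are common alternatives when the clusters are not linearly separable.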