ycsun1972 / ScientificEvolution

ScientificEvolution

This is the research code for our paper: Unraveling Scientific Evolutionary Paths: An Embedding-Based Topic Analysis

We first investigated mainstream and emerging embedding methods, including word2vec, doc2vec, BERT, and SciBERT, and compared their ability to capture semantics and extract topics on several benchmarks, with TF-IDF and LDA as baselines. The best-performing word-level and document-level embedding methods were then selected for topic extraction and scientific evolutionary path identification.

Released Items

  • Labeled WoS and MedLine datasets
  • Code for document clustering tasks
  • Object detection dataset
  • Word lists for data pre-processing
  • Code for scientific evolutionary path identification

Preliminary Preparations

Install toolkits

gensim
scikit-learn (imported as sklearn)
transformers

Download Pretrained Models

Embedding Methods Benchmark

Labeled WoS and MedLine datasets

The datasets used for benchmarking the embedding methods are stored in datasets. They were extracted from the Web of Science and MedLine databases, respectively. Each contains 50,000 documents divided into ten categories. The corpora are stored in 50000_WoS.txt and 50000_MedLine.txt, and their labels are stored in 50000_WoS_label.txt and 50000_MedLine_Label.txt.
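A minimal sketch of loading one corpus together with its labels is given below. It assumes each file holds one document (or one label) per line, in matching order; this layout is not stated above, so check it against the actual files. The helper names load_labeled_corpus and label_distribution are hypothetical and are not part of the repository.

```python
from collections import Counter


def _read_lines(path, encoding="utf-8"):
    """Return the lines of a text file without their trailing newline."""
    with open(path, encoding=encoding) as handle:
        return [line.rstrip("\n") for line in handle]


def load_labeled_corpus(corpus_path, label_path, encoding="utf-8"):
    """Load documents and labels.

    ASSUMPTION: one document per line in the corpus file and one label
    per line in the label file, aligned by line number.
    """
    docs = _read_lines(corpus_path, encoding)
    labels = [label.strip() for label in _read_lines(label_path, encoding)]
    if len(docs) != len(labels):
        raise ValueError(
            f"{len(docs)} documents but {len(labels)} labels; "
            "the one-item-per-line assumption may not hold"
        )
    return docs, labels


def label_distribution(labels):
    """Count how many documents fall into each category."""
    return Counter(labels)
```

For the WoS corpus, a call might look like `load_labeled_corpus("datasets/50000_WoS.txt", "datasets/50000_WoS_label.txt")`, after which `label_distribution(labels)` should report ten categories if the assumed layout is correct.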

Code for document clustering tasks

The code used to evaluate the candidate methods is provided in embedding_and_baseline. It contains six files that vectorize and cluster documents using doc2vec, word2vec, BERT, SciBERT, BioBERT, TF-IDF, and LDA.
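To illustrate the vectorize-then-cluster workflow, here is a rough sketch using the TF-IDF baseline with scikit-learn. The choice of k-means as the clustering algorithm and normalized mutual information (NMI) as the score are assumptions made for this example; the notebooks in embedding_and_baseline may use different algorithms, metrics, and settings. The toy documents are invented stand-ins for the real corpora.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import normalized_mutual_info_score


def cluster_with_tfidf(docs, n_clusters, seed=0):
    """Vectorize documents with TF-IDF, then cluster them with k-means.

    ASSUMPTION: k-means is used here for illustration only.
    """
    vectors = TfidfVectorizer().fit_transform(docs)  # sparse (n_docs, n_terms)
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return kmeans.fit_predict(vectors)


# Toy corpus with two clearly separated topics (invented for this example).
toy_docs = [
    "cat dog pet animal",
    "dog cat animal pet",
    "pet animal cat",
    "stock market finance bank",
    "bank finance stock",
    "market stock bank finance",
]
toy_labels = [0, 0, 0, 1, 1, 1]

predicted = cluster_with_tfidf(toy_docs, n_clusters=2)

# NMI compares predicted clusters with the gold labels and ignores
# how the cluster ids happen to be numbered.
nmi = normalized_mutual_info_score(toy_labels, predicted)
print(f"NMI on toy corpus: {nmi:.3f}")
```

For the released datasets, the same function would be called with the loaded documents and n_clusters=10, since each corpus is divided into ten categories. Swapping TfidfVectorizer for an embedding model changes only how `vectors` is produced.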

Scientific Evolution Analysis

Object detection dataset

This dataset is used to demonstrate the feasibility and effectiveness of the methodology in unraveling scientific evolutionary paths. It was collected from the Web of Science database and contains 56,529 peer-reviewed documents in the field of object detection published between 2011 and 2020.
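Because the documents span ten publication years, one generic way to think about evolutionary paths is to extract topics per time slice and connect each topic to its most similar predecessor in the previous slice. The sketch below shows only that general idea with cosine similarity between topic centroid vectors. It is not the algorithm from the paper or from this repository; the function names, the threshold value, and the toy vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def link_topics(earlier, later, threshold=0.5):
    """Link each topic in `later` to its closest topic in `earlier`.

    `earlier` and `later` map topic names to centroid vectors.
    Returns (predecessor, successor, similarity) tuples; topics whose best
    similarity falls below `threshold` (an ASSUMED cut-off) get no link
    and can be read as newly emerging.
    """
    links = []
    if not earlier:
        return links
    for name, vector in later.items():
        best = max(earlier, key=lambda prev: cosine(earlier[prev], vector))
        similarity = cosine(earlier[best], vector)
        if similarity >= threshold:
            links.append((best, name, similarity))
    return links


# Toy centroids for two consecutive time slices (invented values).
slice_1 = {"t1_a": [1.0, 0.0], "t1_b": [0.0, 1.0]}
slice_2 = {"t2_a": [0.9, 0.1], "t2_new": [-1.0, 0.0]}

for predecessor, successor, similarity in link_topics(slice_1, slice_2):
    print(f"{predecessor} -> {successor} (cosine {similarity:.3f})")
```

In this toy case, t2_a is linked to t1_a, while t2_new has no sufficiently similar predecessor and stays unlinked.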

Code for scientific evolutionary path identification

There are two files in evolution_analysis:

Citation

@ARTICLE{10273147,
  author={Jin, Qianqian and Chen, Hongshu and Zhang, Yi and Wang, Xuefeng and Zhu, Donghua},
  journal={IEEE Transactions on Engineering Management}, 
  title={Unraveling Scientific Evolutionary Paths: An Embedding-Based Topic Analysis}, 
  year={2023},
  pages={1-15},
  keywords={Semantics;Knowledge engineering;Analytical models;Task analysis;Patents;Technological innovation;Syntactics;Doc2vec;embedding;evolution analysis;evolutionary paths;topic extraction;word2vec},
  doi={10.1109/TEM.2023.3312923}}
