This repository contains the research code for the paper "Unraveling Scientific Evolutionary Paths: An Embedding-Based Topic Analysis".
We first investigated mainstream and emerging embedding methods, including word2vec, doc2vec, BERT, and SciBERT, and compared how well they capture semantics and extract topics on several benchmarks, with TF-IDF and LDA as baselines. The best-performing word-level and document-level embedding methods were then selected for topic extraction and scientific evolutionary path identification.
- Labeled WoS and MedLine datasets
- Code for document clustering tasks
- Object detection dataset
- Word lists for data pre-processing
- Code for scientific evolutionary path identification
gensim
scikit-learn
transformers
The datasets used for benchmarking the embedding methods are stored in datasets. The two datasets were extracted from the Web of Science and MedLine databases, respectively. Each contains 50,000 documents divided into ten categories. The corpora are stored in 50000_WoS.txt and 50000_MedLine.txt, and their labels are stored in 50000_WoS_label.txt and 50000_MedLine_Label.txt.
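A minimal sketch of loading one corpus together with its labels is given below. The on-disk layout is an assumption, not something this README specifies: the sketch assumes one document per line in the corpus file and one label per line, in the same order, in the label file. The function name `load_labeled_corpus` is hypothetical.

```python
from pathlib import Path


def load_labeled_corpus(corpus_path, label_path, encoding="utf-8"):
    """Return (documents, labels).

    Assumption: one document per line and one label per line,
    with line i of the label file describing line i of the corpus.
    """
    documents = Path(corpus_path).read_text(encoding=encoding).splitlines()
    labels = [
        line.strip()
        for line in Path(label_path).read_text(encoding=encoding).splitlines()
    ]
    if len(documents) != len(labels):
        raise ValueError(
            f"{len(documents)} documents but {len(labels)} labels; "
            "the one-line-per-document assumption may not hold"
        )
    return documents, labels
```

If the assumption holds, a call such as `load_labeled_corpus("datasets/50000_WoS.txt", "datasets/50000_WoS_label.txt")` would return two lists of equal length. The length check is there to fail early if the files use a different layout.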
Code for evaluating the performance of the candidate methods is provided in embedding_and_baseline. The folder contains six files with code that uses doc2vec, word2vec, BERT, SciBERT, BioBERT, TF-IDF, and LDA to vectorize and cluster the documents.
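The vectorize-then-cluster workflow can be sketched with the TF-IDF baseline as follows. This is an illustration rather than the code in embedding_and_baseline: the choice of k-means as the clustering algorithm, normalized mutual information as the score, the toy documents, and the function name `cluster_with_tfidf` are all assumptions made for the example.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import normalized_mutual_info_score


def cluster_with_tfidf(documents, n_clusters, seed=0):
    """Vectorize documents with TF-IDF and cluster them with k-means."""
    vectors = TfidfVectorizer(stop_words="english").fit_transform(documents)
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return model.fit_predict(vectors)


# Toy data standing in for a labeled corpus (not taken from the datasets).
toy_documents = [
    "object detection with convolutional neural networks",
    "neural networks for object detection in images",
    "gene expression in cancer cells",
    "cancer cells and gene mutation",
]
toy_labels = [0, 0, 1, 1]

predicted = cluster_with_tfidf(toy_documents, n_clusters=2)
# NMI compares the predicted clusters with the labels, ignoring cluster ids.
score = normalized_mutual_info_score(toy_labels, predicted)
print(f"NMI: {score:.3f}")
```

Swapping the TF-IDF step for document vectors from another method (for example doc2vec) would leave the clustering and scoring steps unchanged, which is what makes the methods comparable on the same labels.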
The object detection dataset is used to demonstrate the feasibility and effectiveness of the methodology in unraveling scientific evolutionary paths. It was collected from the Web of Science database and contains 56,529 peer-reviewed documents in the field of object detection, published between 2011 and 2020.
There are two files in [evolution_analysis](evolution analysis):
- topic_similarity_calculation_over_period, which is used for calculating the semantic similarity between topics across periods.
- term_extraction_for_topic_designation, which is used for constructing terms’ semantic and co-occurrence networks, calculating the degree centrality of each term, and selecting the most representative terms for topic designation.
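The two steps above can be sketched in plain Python. The sketch makes several assumptions that are not confirmed by this README: each topic is represented by a single vector (for example, the mean of its document embeddings), similarity is measured with cosine similarity, and only the co-occurrence network is built (the semantic network is omitted). All function names are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def topic_similarity(prev_topics, next_topics):
    """Similarity of every topic in one period to every topic in the next.

    Both arguments map a topic name to its vector (an assumed representation).
    """
    return {
        (p, n): cosine(p_vec, n_vec)
        for p, p_vec in prev_topics.items()
        for n, n_vec in next_topics.items()
    }


def degree_centrality(documents_terms):
    """Degree centrality of terms in a co-occurrence network.

    Two terms are linked if they appear in the same document. Terms that
    never co-occur with another term do not enter the network.
    """
    neighbours = defaultdict(set)
    for terms in documents_terms:
        for a, b in combinations(sorted(set(terms)), 2):
            neighbours[a].add(b)
            neighbours[b].add(a)
    n = len(neighbours)
    if n < 2:
        return {}
    # Standard normalization: number of neighbours divided by (n - 1).
    return {term: len(linked) / (n - 1) for term, linked in neighbours.items()}


def top_terms(centrality, k=3):
    """The k terms with the highest centrality, as candidate topic labels."""
    return sorted(centrality, key=lambda t: (-centrality[t], t))[:k]
```

For example, `degree_centrality([["a", "b"], ["a", "c"]])` links `a` to both other terms, so `a` receives centrality 1.0 and `b` and `c` receive 0.5 each, and `top_terms` would rank `a` first.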
@ARTICLE{10273147,
author={Jin, Qianqian and Chen, Hongshu and Zhang, Yi and Wang, Xuefeng and Zhu, Donghua},
journal={IEEE Transactions on Engineering Management},
title={Unraveling Scientific Evolutionary Paths: An Embedding-Based Topic Analysis},
year={2023},
pages={1-15},
keywords={Semantics;Knowledge engineering;Analytical models;Task analysis;Patents;Technological innovation;Syntactics;Doc2vec;embedding;evolution analysis;evolutionary paths;topic extraction;word2vec},
doi={10.1109/TEM.2023.3312923}}