THUDM / scholar-profiling

OAG scholar profiling

Prerequisites

  • Linux
  • Python 3.7
  • PyTorch 1.10.0+cu111
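
The prerequisites above can be checked before installing anything else. The following is a minimal sketch, not part of the repository: the function name `check_environment` is hypothetical, and it assumes the listed versions are exact requirements, which the list itself does not state. If PyTorch is not installed, it reports that instead of failing.

```python
import sys

REQUIRED_PYTHON = (3, 7)          # from the prerequisites list
REQUIRED_TORCH = "1.10.0+cu111"   # from the prerequisites list


def check_environment():
    """Return a dict describing whether the interpreter and PyTorch match."""
    report = {
        "platform": sys.platform,
        "python": "{}.{}".format(*sys.version_info[:2]),
        # Assumption: an exact major.minor match is required.
        "python_ok": sys.version_info[:2] == REQUIRED_PYTHON,
    }
    try:
        import torch  # reported as missing rather than crashing
    except ImportError:
        report["torch"] = None
        report["torch_ok"] = False
    else:
        report["torch"] = str(torch.__version__)
        report["torch_ok"] = report["torch"] == REQUIRED_TORCH
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```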

Getting Started

Installation

Clone this repo.

git clone https://github.com/THUDM/scholar-profiling.git
cd scholar-profiling

Install the dependencies:

pip install -r requirements.txt

Dataset

Download the dataset from BaiduPan (password: 7lro). It has three parts:

  • data_ex.zip: unzip the file and put the data directory into the project directory.
  • pretrain_models.zip: unzip the file and put the pretrain_models directory into the project directory.
  • googleSearch: use 7z to extract data.zip in this folder, then put the googleSearch directory into the data directory.
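
Once the three parts are unpacked, the resulting layout can be checked with a short script. This is a sketch under assumptions: the helper name `missing_dataset_dirs` is hypothetical, and it only tests that the three directories named above exist, not that their contents are complete.

```python
from pathlib import Path

# Directories the dataset instructions say should exist in the project.
EXPECTED_DIRS = ("data", "pretrain_models", "data/googleSearch")


def missing_dataset_dirs(project_root="."):
    """Return the expected directories that are absent under project_root."""
    root = Path(project_root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    missing = missing_dataset_dirs()
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("Dataset layout looks complete.")
```

Run it from the project directory; an empty result means all three directories are in place.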

How to run

cd $project_path
export CUDA_VISIBLE_DEVICES='?'  # replace ? with the GPU ID(s) to use, e.g. '0' or '0,1'
export PYTHONPATH="`pwd`:$PYTHONPATH"

# Statistical machine learning (SML) methods:
# gender
python sml_baseline/GenderPredict/main.py
# homepage
python sml_baseline/HomepagePrediction/homepage_train.py
# position
python sml_baseline/TitlePrediction/title_main.py
# evaluation
python sml_baseline/merge_results.py
python evaluate.py --hp output/sml/sml_predict_xgboost.json --rf data/raw/ground_truth.json

# BERT
# First, uncomment the three functions create_gender_classification_data(), create_homepage_classification_data(), and create_title_classification_data() in bert_baseline/tools.py to generate the training data
python bert_baseline/tools.py
# gender
python bert_baseline/gender_classification_bert.py
# homepage
python bert_baseline/homepage_classification_bert.py
# position
python bert_baseline/title_classification_bert.py
# For evaluation, uncomment the merge_result() function in bert_baseline/tools.py
python bert_baseline/tools.py
python evaluate.py --hp data/luoyang-result_new.json --rf data/raw/ground_truth.json

# Bi-LSTM-CRF for position tagging
python data_process.py
python bert_bilstm_crf/run.py

# BERT with prompt tuning
# First, uncomment the four functions get_gender_data(r'data/raw/new_dev.xlsx'), get_title_data(r'data/raw/new_dev.xlsx'), get_gender_test(), and get_train_data() in data_process.py to generate the training data
python data_process.py
# gender
python prompt/gender_prompt.py
# position
python prompt/title_prompt.py
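
The statistical machine learning (SML) steps listed above can also be driven from Python, which stops the run at the first failing script. This is a sketch, not part of the repository: `build_env` and `run_sml_pipeline` are hypothetical names, the default GPU ID `"0"` is an assumption, and the script paths and `evaluate.py` arguments are copied from the commands above.

```python
import os
import subprocess
import sys

# SML baseline steps, in the order given in the commands above.
SML_STEPS = [
    ["sml_baseline/GenderPredict/main.py"],
    ["sml_baseline/HomepagePrediction/homepage_train.py"],
    ["sml_baseline/TitlePrediction/title_main.py"],
    ["sml_baseline/merge_results.py"],
    ["evaluate.py", "--hp", "output/sml/sml_predict_xgboost.json",
     "--rf", "data/raw/ground_truth.json"],
]


def build_env(project_path, gpus, base_env=None):
    """Copy the environment, set CUDA_VISIBLE_DEVICES, and prepend PYTHONPATH."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = gpus
    old = env.get("PYTHONPATH", "")
    env["PYTHONPATH"] = project_path + (os.pathsep + old if old else "")
    return env


def run_sml_pipeline(project_path, gpus="0"):
    """Run each step from the project root; check=True raises on failure."""
    env = build_env(project_path, gpus)
    for step in SML_STEPS:
        subprocess.run([sys.executable, *step],
                       cwd=project_path, env=env, check=True)
```

Calling `run_sml_pipeline("/path/to/scholar-profiling")` would mirror the `cd`, `export`, and `python` lines for the SML baseline; the other baselines still need the manual uncommenting steps described in their comments.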

To extract more attributes from the long text of scholars' profiles, see README.md in bio_models.
