zach-blumenfeld / TAPE

Repository for TAPE

Home Page:https://arxiv.org/abs/2305.19523

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Explanations as Features:
LLM-Based Features for Text-Attributed Graphs

arXiv PWC

Citation

@misc{he2023explanations,
      title={Explanations as Features: LLM-Based Features for Text-Attributed Graphs}, 
      author={Xiaoxin He and Xavier Bresson and Thomas Laurent and Bryan Hooi},
      year={2023},
      eprint={2305.19523},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

0. Python environment setup with Conda

conda create --name TAPE python=3.8
conda activate TAPE

conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch
conda install -c pyg pytorch-sparse
conda install -c pyg pytorch-scatter
conda install -c pyg pytorch-cluster
conda install -c pyg pyg
pip install ogb
conda install -c dglteam/label/cu113 dgl
pip install yacs
pip install transformers

1. Download TAG datasets

A. Original text attributes

Dataset Description
ogbn-arxiv The OGB provides the mapping from MAG paper IDs into the raw texts of titles and abstracts.
Download the dataset here, unzip and move it to dataset/ogbn_arxiv_orig. The dataset size is 200M.
Cora Download the dataset here and move it to dataset/cora_orig. The dataset size is 2.6G.
PubMed Download the dataset here and move it to dataset/PubMed_orig. The dataset size is 115M.

B. LLM responses

Dataset Description
ogbn-arxiv Download the dataset here and move it to gpt_responses/ogbn_arxiv. The dataset size is 662M.
Cora Download the dataset here and move it to gpt_responses/cora. The dataset size is 11M.
PubMed Download the dataset here and move it to gpt_responses/PubMed. The dataset size is 77M.

2. Fine-tuning the LMs

To use the orginal text attributes

WANDB_DISABLED=True TOKENIZERS_PARALLELISM=False CUDA_VISIBLE_DEVICES=0,1,2,3 python -m core.trainLM dataset ogbn-arxiv

To use the GPT responses

WANDB_DISABLED=True TOKENIZERS_PARALLELISM=False CUDA_VISIBLE_DEVICES=0,1,2,3 python -m core.trainLM dataset ogbn-arxiv use_gpt True

3. Training the GNNs

To use different GNN models

python -m core.trainEnsemble gnn.model.name GCN
python -m core.trainEnsemble gnn.model.name SAGE
python -m core.trainEnsemble gnn.model.name RevGAT gnn.train.use_dgl True gnn.train.lr 0.002 gnn.train.dropout 0.75

To use different types of features

# Our enriched features
python -m core.trainEnsemble gnn.train.feature_type TA_P_E

# Our individual features
python -m core.trainGNN gnn.train.feature_type TA
python -m core.trainGNN gnn.train.feature_type E
python -m core.trainGNN gnn.train.feature_type P

# OGB features
python -m core.trainGNN gnn.train.feature_type ogb

4. Reproducibility

Use run.sh to run the codes and reproduce the published results.

About

Repository for TAPE

https://arxiv.org/abs/2305.19523

License:MIT License


Languages

Language:Python 99.1%Language:Shell 0.9%