TAGLAS

An atlas of text-attributed graph datasets in the era of large graph and language models.

This repository collects multiple Text-Attributed Graph (TAG) datasets from various sources and provides a unified approach to preprocessing and loading them. We also offer a standardized task generation pipeline for evaluating the performance of GNNs/LLMs on these datasets. The technical report of TAGLAS is available on arXiv. The project is still under construction, so please expect more datasets and features in the future. Stay tuned!

🔥News

  • 2024.06: First version release.

Statistics

Here are the currently included datasets:

| Dataset (key) | Avg. #N | Avg. #E | #G | Task level | Task | Split (train/val/test) | Domain | Description | Source |
|---|---|---|---|---|---|---|---|---|---|
| Cora_node (cora_node) | 2708 | 10556 | 1 | Node | 7-way classification | 140/500/2068 | Co-citation | Predict the category of papers. | Graph-LLM, OFA |
| Cora_link (cora_link) | 2708 | 10556 | 1 | Link | Binary classification | 17944/1056/2112 | Co-citation | Predict whether two papers are co-cited by other papers. | Graph-LLM, OFA |
| Pubmed_node (pubmed_node) | 19717 | 88648 | 1 | Node | 3-way classification | 60/500/19157 | Co-citation | Predict the category of papers. | Graph-LLM, OFA |
| Pubmed_link (pubmed_link) | 19717 | 88648 | 1 | Link | Binary classification | 150700/8866/17730 | Co-citation | Predict whether two papers are co-cited by other papers. | Graph-LLM, OFA |
| Arxiv (arxiv) | 169343 | 1166243 | 1 | Node | 40-way classification | 90941/29799/48603 | Citation | Predict the category of papers. | OGB, OFA |
| WikiCS (wikics) | 11701 | 216123 | 1 | Node | 10-way classification | 580/1769/5847 | Wiki page | Predict the category of wiki pages. | PyG, OFA |
| Product-subset (products) | 54025 | 144638 | 1 | Node | 47-way classification | 14695/1567/36982 | Co-purchase | Predict the category of products. | TAPE |
| FB15K237 (fb15k237) | 14541 | 310116 | 1 | Link | 237-way classification | 272115/17535/20466 | Knowledge graph | Predict the relationship between two entities. | OFA |
| WN18RR (wn18rr) | 40943 | 93003 | 1 | Link | 11-way classification | 86835/3034/3134 | Knowledge graph | Predict the relationship between two entities. | OFA |
| MovieLens-1m (ml1m) | 9923 | 2000418 | 1 | Link | Regression/5-way classification | 850177/50011/100021 | Movie rating | Predict the rating between users and movies. | PyG |
| Chembl_pretrain (chemblpre) | 25.87 | 55.92 | 365065 | Graph | 1048-way binary classification | 341952/0/0 | Molecular | Predict the effectiveness of the molecule in multiple assays. | GIMLET, OFA |
| PCBA (pcba) | 25.97 | 56.20 | 437929 | Graph | 128-way binary classification | 349854/43650/43588 | Molecular | Predict the effectiveness of the molecule in multiple assays. | GIMLET, OFA |
| HIV (hiv) | 25.51 | 54.94 | 41127 | Graph | Binary classification | 32901/4113/4113 | Molecular | Predict the effectiveness of the molecule against HIV. | GIMLET, OFA |
| BBBP (bbbp) | 24.06 | 51.91 | 2039 | Graph | Binary classification | 1631/204/204 | Molecular | Predict blood-brain barrier penetration of the molecule. | GIMLET, OFA |
| BACE (bace) | 34.09 | 73.72 | 1513 | Graph | Binary classification | 1210/151/152 | Molecular | Predict the effectiveness of the molecule against BACE1 protease. | GIMLET, OFA |
| ToxCast (toxcast) | 18.76 | 38.50 | 8575 | Graph | 588-way binary classification | 6859/858/858 | Molecular | Predict the effectiveness of the molecule in multiple assays. | GIMLET, OFA |
| ESOL (esol) | 13.29 | 27.35 | 1128 | Graph | Regression | 902/113/113 | Molecular | Predict the solubility of the molecule. | GIMLET, OFA |
| FreeSolv (freesolv) | 8.72 | 16.76 | 642 | Graph | Regression | 513/64/65 | Molecular | Predict the free energy of hydration of the molecule. | GIMLET, OFA |
| Lipo (lipo) | 27.04 | 59.00 | 4200 | Graph | Regression | 3360/420/420 | Molecular | Predict the lipophilicity of the molecule. | GIMLET, OFA |
| CYP450 (cyp450) | 24.52 | 53.02 | 16896 | Graph | 5-way binary classification | 13516/1690/1690 | Molecular | Predict the effectiveness of the molecule against the CYP450 enzyme family. | GIMLET, OFA |
| Tox21 (tox21) | 18.57 | 38.59 | 7831 | Graph | 12-way binary classification | 6264/783/784 | Molecular | Predict the effectiveness of the molecule in multiple assays. | GIMLET, OFA |
| MUV (muv) | 24.23 | 52.56 | 93087 | Graph | 17-way binary classification | 74469/9309/9309 | Molecular | Predict the effectiveness of the molecule in multiple assays. | GIMLET, OFA |
| ExplaGraphs (expla_graph) | 5.17 | 4.25 | 2766 | Graph | Question answering | 1659/553/554 | Commonsense | Commonsense reasoning. | G-Retriever |
| SceneGraphs (scene_graph) | 19.13 | 68.44 | 100000 | Graph | Question answering | 59978/19997/20025 | Scene graph | Scene graph question answering. | G-Retriever |
| MAG240m-subset (mag240m) | 5875010 | 26434726 | 1 | Node | 153-way classification | 900722/63337/63338/132585 | Citation | Predict the category of papers. | OGB |
| Ultrachat200k (ultrachat200k) | 3.72 | 2.72 | 449929 | Graph | Question answering | 400000/20000/29929 | Conversation | Answer the question given the previous conversation. | UltraChat200k |

Requirements

PyG>=2.3
datasets
torch>=2.0.1
transformers>=4.36.2
huggingface_hub
rdkit

Installation

You can directly clone the repository into your working project by using the following command:

git clone https://github.com/JiaruiFeng/TAGLAS.git

We will provide a more user-friendly installation method in the future.

Usage

Datasets

Load datasets

The basic way to load a dataset is by using its key. The dataset key can be found in the table above. For example, to load the Arxiv dataset:

from TAGLAS import get_dataset
dataset = get_dataset("arxiv")

You can also load multiple datasets at the same time:

from TAGLAS import get_datasets
dataset_list = get_datasets(["arxiv", "pcba"])

By default, all data files are saved in the ./TAGDataset directory under the repository root. If you want to change the data path, set the root parameter when loading the dataset:

from TAGLAS import get_datasets
dataset_list = get_datasets(["arxiv", "pcba"], root="your_path")

The above functions load datasets in the default way, which is suitable for most cases. However, some datasets accept additional arguments. For finer control over the loading process, you can pass additional arguments directly:

from TAGLAS import get_dataset
dataset = get_dataset("fb15k237", to_undirected=False)

Finally, importing and instantiating the dataset class directly is also supported:

from TAGLAS.datasets import Arxiv
dataset = Arxiv()

Data key description and basic usage

All data samples are stored as instances of the TAGData class, which inherits from the Data class in the torch_geometric package. Different pieces of information are stored under different keys. Most datasets contain the following keys:

  • x: Text feature for all nodes. Usually a list or np.ndarray.
  • node_map: A mapping from node index to node text feature. Usually a torch.LongTensor.
  • edge_attr: Text feature for all edges. Usually a list or np.ndarray.
  • edge_map: A mapping from edge index to edge text feature. Usually a torch.LongTensor.
  • label: Text feature for all labels. Usually a list or np.ndarray.
  • label_map: A mapping from label index to label text feature. Usually a torch.LongTensor.
  • edge_index: The graph structure. Usually a torch.LongTensor.

Some datasets may also contain:

  • x_original: The vector feature of all nodes in the original data source. Usually a torch.Tensor.
  • edge_attr_original: The vector feature of all edges in the original data source. Usually a torch.Tensor.
  • question: Text feature of question for the QA tasks.
  • question_map: A mapping from question index to question text feature.
  • answer: Text feature of answer for the QA tasks.
  • answer_map: A mapping from answer index to answer text feature.

Here is a simple demonstration:

from TAGLAS import get_dataset
dataset = get_dataset("arxiv")
# Get node text feature for the whole dataset.
x = dataset.x
# Get the first graph sample in the dataset.
data = dataset[0]
# Get edge text feature for the sample.
edge_attr = data.edge_attr

Feature mapping

For graph-level datasets, all _map keys (like node_map or edge_map) store mappings into the global features shared across all data samples. The global features can be accessed by:

from TAGLAS import get_dataset
dataset = get_dataset("hiv")
# Get the global node text features.
dataset.x 
# Get the global edge text features.
dataset.edge_attr

The feature for a specific sample can be obtained by:

from TAGLAS import get_dataset
dataset = get_dataset("hiv")
# Global node text features
x = dataset.x
data = dataset[0]
# Get node text feature for sample 0 by the global node_map key of the sample 0.
sample_x = [x[i] for i in data.node_map]
# We also provide direct access to the text feature of each sample by:
sample_x = dataset[0].x

For node/edge-level datasets, since they contain only one graph, the local map is also the global map, and the logic remains the same. The reason we store the features this way is to avoid repeated text features, especially for large datasets with only a few unique text features (like molecule datasets).
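
As a minimal illustration of this deduplication (a sketch; the exact counts depend on the dataset version):

from TAGLAS import get_dataset
dataset = get_dataset("hiv")
data = dataset[0]
# dataset.x stores each unique node text once for the whole dataset,
# while node_map indexes into it per sample, so the number of unique
# texts is typically much smaller than the total number of nodes.
print(len(dataset.x))
print(data.node_map.shape[0])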

Tasks

Supported tasks

In this repository, we provide a unified way to generate tasks based on datasets. Currently, we support the following five task types:

  • default: The default task implements the most common methods used in the graph community for node, edge, and graph-level tasks. Specifically, it returns the entire original graph for node and edge-level tasks and an original graph sample for graph-level tasks. Additionally, it uses the node and edge features from the original source if available; otherwise, it generates identical features. This type is mainly used for debugging and baseline evaluation.
  • default_text: The logic of default_text tasks is the same as default, except that all features are replaced with text features. Additionally, we support converting all text features to sentence embeddings.
  • subgraph: The subgraph task converts node and edge-level tasks into subgraph-based tasks. Specifically, for the target node or edge, it samples a subgraph around the target. Like the default task, it uses the original node and edge features.
  • subgraph_text: The logic of subgraph_text tasks is the same as subgraph, except that all features are replaced with text features.
  • QA: The QA task converts all tasks into a question-answering format. Each sample includes a question and an answer key. By default, QA tasks sample subgraphs for node and edge-level tasks (see the sketch after this list).
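
Below is a minimal sketch of inspecting a QA sample. It assumes each sample exposes the question and answer keys described earlier; exact shapes may vary by dataset:

from TAGLAS import get_task
# Load the QA version of the cora node-level task.
task = get_task("cora_node", "QA")
sample = task[0]
# Each QA sample carries question and answer keys, with
# question_map/answer_map indexing into them.
print(sample.question)
print(sample.answer)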

Load tasks

To load a specific task, simply call:

from TAGLAS import get_task
# Load default node-level task on cora
task = get_task("cora_node", "default")
# Load subgraph_text edge-level task on pubmed and val split
task = get_task("pubmed_link", "subgraph_text", split="val")

Similarly, you can load multiple tasks at the same time:

from TAGLAS import get_tasks
# Load QA tasks on all datasets.
tasks = get_tasks(["cora_node", "arxiv", "wn18rr", "scene_graph"], "QA")
# Specify task type for each dataset.
tasks = get_tasks(["cora_node", "arxiv"], ["QA", "subgraph_text"])

By default, generated tasks are not saved. For fast loading and repeated experiments, you can save and load generated tasks by:

from TAGLAS import get_task
# save_data saves the generated task into the corresponding folder. load_saved tries to load a saved task first before generating a new one.
arxiv_task = get_task("arxiv", "subgraph_text", split="test", save_data=True, load_saved=True)
# By default, the saved task file is named after the important arguments used (like split, hop, ...). You can also specify a name yourself:
arxiv_task = get_task("arxiv", "subgraph_text", split="test", save_data=True, load_saved=True, save_name="your_name")

Finally, directly constructing a task from a dataset class is also supported:

from TAGLAS.datasets import Arxiv
from TAGLAS.tasks import SubgraphTextNPTask
dataset = Arxiv()
# Load subgraph_text node-level task on Arxiv dataset.
task = SubgraphTextNPTask(dataset)

Convert text feature to sentence embedding

For the default_text, subgraph_text, and QA task types, we also provide a function to convert raw text features into sentence embeddings:

from TAGLAS import get_task
from TAGLAS.tasks.text_encoder import SentenceEncoder
encoder_name = "ST"
encoder = SentenceEncoder(encoder_name)
arxiv_task = get_task("arxiv", "subgraph_text", split="test")
arxiv_task.convert_text_to_embedding(encoder_name, encoder)

In TAGLAS, we implement several commonly used LLMs for sentence embedding, including ST (Sentence Transformer), BERT (vanilla BERT), e5 (E5), llama2_7b (Llama2-7b), and llama2_13b (Llama2-13b). You can load different models by inputting the respective model_key into SentenceEncoder. Additionally, you can implement your own sentence embedding model as long as it has a __call__ function to convert input text lists into embeddings.
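
For example, a custom encoder only needs to be callable on a list of texts. The sketch below returns random embeddings purely to illustrate the expected interface (the class name and embedding dimension are hypothetical):

import torch

class RandomEncoder:
    # Hypothetical stand-in for a real sentence embedding model.
    def __init__(self, dim: int = 768):
        self.dim = dim

    def __call__(self, texts: list) -> torch.Tensor:
        # A real implementation would run a language model here; random
        # vectors are returned only to demonstrate the interface.
        return torch.randn(len(texts), self.dim)

An instance of such a class can then be passed to convert_text_to_embedding in place of SentenceEncoder.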

Collate

For all tasks in TAGLAS, we provide a unified collate function. Specifically, call the collate function by:

from TAGLAS import get_task
arxiv_task = get_task("arxiv", "subgraph_text", split="test")
# Call collate function to get a batch of data
batch = arxiv_task.collate([arxiv_task[i] for i in range(16)])

The collate function is implemented based on torch_geometric.loader.dataloader.Collater. However, there is a major difference. For all text feature keys like x and edge_attr, it only stores the unique text features in the batch. Additionally, all _map keys store the map from the corresponding unique text features in the batch to all elements.

from TAGLAS import get_task
arxiv_task = get_task("arxiv", "subgraph_text", split="test")
batch = arxiv_task.collate([arxiv_task[i] for i in range(16)])
# to get node text features for all nodes in the batch
x = batch.x[batch.node_map]
# to get edge text features for all edges in the batch
edge_attr = batch.edge_attr[batch.edge_map]

In this way, the batch data is more memory and computation efficient, as each unique text only needs to be encoded once.
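
The collate function can also be plugged into a standard PyTorch DataLoader. This is a sketch assuming the task object supports __len__ in addition to the indexing shown above:

from torch.utils.data import DataLoader
from TAGLAS import get_task
arxiv_task = get_task("arxiv", "subgraph_text", split="test")
# Reuse the task's own collate function so that text features in each
# batch stay deduplicated.
loader = DataLoader(arxiv_task, batch_size=16, collate_fn=arxiv_task.collate)
batch = next(iter(loader))
x = batch.x[batch.node_map]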

Evaluation

"For each dataset and task, we provide a default evaluation tool for performance evaluation based on torchmetric. Specifically, for each dataset, we support two types of evaluation based on its supported task types."

  • default: Used for all task types except QA. It supports evaluation based on tensor output, which is commonly used.
  • QA: It supports evaluation based on text output.

To get an evaluator for a certain task, simply call:

from TAGLAS import get_evaluator, get_evaluators
# Get the default evaluator for the cora_node task. metric_name is a string indicating the name of the metric.
metric_name, evaluator = get_evaluator("cora_node", "subgraph_text")
# Get QA evaluator for arxiv
metric_name, evaluator = get_evaluator("arxiv", "QA")
# Get evaluator for multiple input tasks.
metric_name_list, evaluator_list = get_evaluators(["cora_node", "arxiv"], "QA")
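
Since the evaluators are built on torchmetrics, a reasonable sketch of using one looks like the following. The update/compute calls assume the standard torchmetrics Metric interface, and the tensors are dummy values for illustration:

import torch
from TAGLAS import get_evaluator
metric_name, evaluator = get_evaluator("cora_node", "subgraph_text")
# Dummy logits and labels: 8 samples, 7 classes (cora has 7 classes).
preds = torch.randn(8, 7)
targets = torch.randint(0, 7, (8,))
evaluator.update(preds, targets)
print(metric_name, evaluator.compute())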

Issues and Bugs

The project is still in development. If you encounter any issues or bugs while using it, please feel free to open an issue in the GitHub repository.

Cite

If you find TAGLAS helpful in your project, please consider citing it. Thank you!

@misc{taglas,
      title={TAGLAS: An atlas of text-attributed graph datasets in the era of large graph and language models}, 
      author={Jiarui Feng and Hao Liu and Lecheng Kong and Yixin Chen and Muhan Zhang},
      year={2024},
      eprint={2406.14683},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2406.14683}, 
}


License

MIT License

