Nolanogenn / re_with_gcn

testing repository for relation extraction with GCN


RE with GCN

Installation

We recommend Linux with Python 3.8.13, CPU only. You can install the dependencies with: pip install -r requirements.txt. If you want to use CUDA, we recommend the cu116 PyTorch build; you can install it for CUDA 11.6 by running: pip install -r requirements_cu116.txt.

If you want to recreate our datasets, a few manual installation steps are required:

  1. Install spacy-sentence-bert: follow the installation instructions on the official GitHub page and download the en_stsb_bert_base model by running pip install https://github.com/MartinoMensio/spacy-sentence-bert/releases/download/v0.1.2/en_stsb_bert_base-0.1.2.tar.gz#en_stsb_bert_base-0.1.2
  2. Download the spaCy model: python -m spacy download en_core_web_sm
  3. Download the WordNet data: python -m wn download oewn:2021
  4. If you use KG features in the dataset creation, install jRDF2Vec according to its GitHub page.
  5. The TACRED dataset is not linked to a KG; therefore, we use the spacy-entity-linker package to link the entities. Install it with pip install spacy-entity-linker and download the required model with python -m spacy_entity_linker "download_knowledge_base" (see the sketch below).
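For illustration, a minimal sketch of how entity linking with spacy-entity-linker could look (the example sentence is hypothetical, and the pipeline used in our scripts may differ):

import spacy

# Load the small English model and append the entity linker to the pipeline.
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("entityLinker", last=True)

doc = nlp("Douglas Adams wrote The Hitchhiker's Guide to the Galaxy.")

# Each linked entity exposes its Wikidata identifier and label.
for ent in doc._.linkedEntities:
    print(ent.get_id(), ent.get_label())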

Datasets

Download the data from Dropbox and place it under re_with_gcn/data. This folder contains some example models for benchmarking, as well as all raw datasets.

Reproduction

If you want to reproduce the evaluation of our models, you can use the Jupyter notebook reproduction.ipynb to load our models and run the evaluation. To do so, download our models and datasets from the Dropbox link above. Unfortunately, we cannot provide all datasets for download during the anonymous review phase due to the storage limitations of public cloud storage providers.

Technical details

RDF2Vec embeddings for FewRel

Train RDF2Vec (Lite) with the jRDF2Vec Framework for all relevant entities.

WordNet

For WordNet, we can simply compute RDF2Vec embeddings for all entities. To do so, train the embeddings with: nohup java -Xmx200g -jar jrdf2vec-1.3-SNAPSHOT.jar -graph data/wordnet.nt -numberOfWalks 100 -depth 4 -walkGenerationMode "MID_WALKS" -walkDirectory ./re_with_gcn_wordnet > ~/rdf2vec_re_with_gcn_wordnet.txt &

Wikidata

For Wikidata, the triples that will be predicted have to be removed first. Moreover, the relevant entities have to be identified so that we do not compute embeddings for all Wikidata entities but can use RDF2Vec Lite instead; a sketch of how such an entity list could be generated is shown below.
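For illustration, a minimal sketch of how the relevant-entities file for jRDF2Vec's -light option could be built (the input file name and its field names are assumptions; jRDF2Vec expects one full entity URI per line):

import json

# Hypothetical input: reformatted FewRel samples with Wikidata ids for head and tail.
entities = set()
with open("data/fewrel_reformatted.json") as f:
    for sample in json.load(f):
        entities.add(sample["head_id"])
        entities.add(sample["tail_id"])

# One full Wikidata URI per line, as expected by the -light option.
with open("re_with_gcn_relevant_entities_fewrel.txt", "w") as f:
    for qid in sorted(entities):
        f.write(f"http://www.wikidata.org/entity/{qid}\n")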

  1. Remove the extracted triples from Wikidata so that the relations to be predicted are not implicitly included in the embeddings: remove_triples_from_wikidata.py
  2. Train RDF2Vec embeddings for FewRel: nohup java -Xmx200g -jar jrdf2vec-1.2-SNAPSHOT.jar -graph ~/knowledgegraphs/wikidata-20160328_reduced_by_re_with_gcn_fewrel.nt -light ~/knowledgegraphs/re_with_gcn_relevant_entities_fewrel.txt -numberOfWalks 100 -depth 4 -walkGenerationMode "MID_WALKS" -walkDirectory ./re_with_gcn_wikidata_fewrel -dimension 100 > ~/rdf2vec_re_with_gcn_wikidata_100.txt &
  3. Train RDF2Vec embeddings for T-REx (dimension 768): nohup java -Xmx200g -jar jrdf2vec-1.2-SNAPSHOT.jar -graph ~/knowledgegraphs/wikidata-20160328_reduced_by_re_with_gcn_trex.nt -light ~/knowledgegraphs/re_with_gcn_relevant_entities_trex.txt -numberOfWalks 100 -depth 4 -walkGenerationMode "MID_WALKS" -walkDirectory ./re_with_gcn_wikidata_trex_768 -dimension 768 > ~/rdf2vec_re_with_gcn_wikidata_768.txt &
  4. Train RDF2Vec embeddings for T-REx (dimension 100): nohup java -Xmx200g -jar jrdf2vec-1.2-SNAPSHOT.jar -graph ~/knowledgegraphs/wikidata-20160328_reduced_by_re_with_gcn_trex.nt -light ~/knowledgegraphs/re_with_gcn_relevant_entities_trex.txt -numberOfWalks 100 -depth 4 -walkGenerationMode "MID_WALKS" -walkDirectory ./re_with_gcn_wikidata_trex_100 -dimension 100 > ~/rdf2vec_re_with_gcn_wikidata_100.txt &

If the walks have already been generated, the embedding training alone can be (re-)run on an existing walk directory: java -Xmx200g -jar jrdf2vec-1.3-SNAPSHOT.jar -onlyTraining -walkDirectory ./walks -dimension 100
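The trained vectors can afterwards be loaded for downstream use, for example with gensim; a minimal sketch, assuming jRDF2Vec wrote its keyed-vectors file model.kv into the walk directory:

from gensim.models import KeyedVectors

# Assumption: the walk directory contains the gensim keyed vectors produced by jRDF2Vec.
vectors = KeyedVectors.load("./re_with_gcn_wordnet/model.kv")

# Basic sanity check: number of embedded entities and embedding dimension.
print(len(vectors.index_to_key), vectors.vector_size)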

Shortest Path in Wikidata

The file wikidata_head_tail_2_path.json contains a dictionary that maps a string of two entities 'entity1 entity2' to the computed shortest path in Wikidata, as well as a list of all Wikidata relations appearing in these paths. entity1 is always smaller than entity2 according to Python string comparison. The file is generated with the Jupyter notebook wikidata_shortest_path_generation.
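A minimal lookup sketch (assuming the mapping is the top-level JSON object and keys follow the ordering described above; the entity identifiers are hypothetical):

import json

with open("wikidata_head_tail_2_path.json") as f:
    head_tail_2_path = json.load(f)

def shortest_path(entity1: str, entity2: str):
    # Keys always start with the lexicographically smaller entity.
    if entity1 > entity2:
        entity1, entity2 = entity2, entity1
    return head_tail_2_path.get(f"{entity1} {entity2}")

print(shortest_path("Q76", "Q30"))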

T-REx Dataset reformatting

The notebook trex_reformatting.ipynb contains the code to sample and reformat the T-REx dataset for our graph creation.

SemEval Dataset reformatting

The notebook semeval108_reformatting.ipynb contains the code to sample and reformat the SemEval dataset for our graph creation.

TACRED Dataset reformatting

The script tacred_reformatting.py contains the code to sample and reformat the TACRED dataset for our graph creation.

Controlled Datasets

The controlled datasets have been created by modifying two features:

  1. substitution strategy:
    1. the new entity is chosen randomly among all the possible entities encountered in the training data
    2. the new entity is chosen from those of the same type as the former entity
    3. the new entity is chosen from those that appear in the same relation in the same role
  2. entity substituted:
    1. only the subject is substituted
    2. only the object is substituted
    3. both the subject and the object are substituted

The first and last number in the filename indicate, respectively, the first and the second feature described above (see the sketch below).
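For illustration, a minimal sketch of how these codes could be decoded from a filename (the filename pattern is hypothetical, and whether the codes are 1-based as in the lists above is an assumption):

import re

# Assumed mappings, following the numbering of the two feature lists above.
SUBSTITUTION_STRATEGY = {
    1: "random entity from the training data",
    2: "entity of the same type as the former entity",
    3: "entity appearing in the same relation in the same role",
}
SUBSTITUTED_ENTITY = {1: "subject only", 2: "object only", 3: "subject and object"}

def decode(filename: str):
    # The first and last digit in the filename encode the two features.
    digits = re.findall(r"\d", filename)
    return SUBSTITUTION_STRATEGY[int(digits[0])], SUBSTITUTED_ENTITY[int(digits[-1])]

# Hypothetical filename for illustration.
print(decode("controlled_1_3.json"))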

Dataset generation and training

All commands to generate the datasets and to start the training are listed in run.sh (to be run after the commands above have been executed). In case you want to run all experiments at once (which we do not recommend), you can simply execute the shell script. The best model configurations are listed in data/results.txt so that you do not have to rerun the hyperparameter evaluation in case you want to train one of our models from scratch.

In case of Errors during training

In case of errors, the final evaluation is not executed. To rerun the Ray trials that exited due to errors, you can pass the Ray training directory, e.g., train_2022-12-29_17-26-13, to the Ray training call together with the argument ERRORED_ONLY to resume the training. The corresponding line is already present in the train.py script and only needs to be un-commented.
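A minimal sketch of what such a resume call could look like with Ray Tune (the trainable, the local_dir, and the run name are placeholders; only the ERRORED_ONLY resume mode is taken from above):

from ray import tune

# Placeholder trainable; the actual training function lives in train.py.
def trainable(config):
    ...

# Resume only the trials of an existing run that exited with an error.
tune.run(
    trainable,
    name="train_2022-12-29_17-26-13",  # existing Ray training directory
    local_dir="./ray_results",         # assumed location of the Ray results
    resume="ERRORED_ONLY",
)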
