Schlampig / ORE

Open Relation Extraction pipeline for Chinese text using BERT as backbone in PyTorch

License: 996.icu

Introduction:

Inspired by the impressive results of Magi, we take a simpler, more brute-force approach to the open relation extraction problem, using only a pipeline of NER and RE.


Definition:

Open Relation Extraction, also known as Open Triple Extraction or Open Fact Extraction, is the task of extracting arbitrary triples (head entity, tail entity, relation) or tuples (head entity, tail entity) from a given text (a sentence or a document; we focus on sentences here). ORE is short for Open Relation Extraction; our method uses the pre-trained language model BERT as its backbone and is implemented in PyTorch.


Pipeline

The pipeline consists of three main steps: DS -> OpenNER -> OpenRE

  • DS: Distant Supervision: split each long text into sentences and locate triples (sometimes called facts) from the knowledge graph in each sentence. That is, each sample consists of one sentence, several triples (if the relation is not None), tuples (the entity pair has a relation, but the relation span does not appear in the sentence), and some isolated entities. Entity linking, coreference resolution, and modifier-core structure expansion might be necessary.

  • OpenNER: Open Named Entity Recognition: train a NER model to find all entities in the current sentence, regardless of entity type. The B-I-O annotation strategy is used here.

  • OpenRE: Open Relation Extraction: couple all entities recognized by the NER model pairwise, and feed each entity pair together with the sentence into an RE model, which predicts whether the two entities are related. If a relation exists between the two entities, the RE model further tries to find the relation span (see the sketch after this list).

  • Note: we do not provide a detailed approach for Distant Supervision.
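
To make the OpenNER -> OpenRE hand-off concrete, here is a minimal, hypothetical sketch of how B-I-O tags could be turned into entity mentions and then coupled pairwise for the RE model, assuming character-level tokenization as is common for Chinese BERT models. The helper bio_to_spans, the variable re_model, and the example sentence are assumptions for illustration, not code or data from this repository.

from itertools import combinations

def bio_to_spans(tokens, tags):
    """Collect entity mentions from character-level B-I-O tags."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B":                      # a new entity begins
            if start is not None:
                spans.append((start, i))
            start = i
        elif tag == "O":                    # outside any entity
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:                   # entity runs to the end of the sentence
        spans.append((start, len(tags)))
    return ["".join(tokens[s:e]) for s, e in spans]

sentence = "姚明出生于上海。"
tokens = list(sentence)
# One B-I-O tag per character, as the OpenNER model might predict them.
tags = ["B", "I", "O", "O", "O", "B", "I", "O"]

entities = bio_to_spans(tokens, tags)       # ["姚明", "上海"]

# OpenRE: every entity pair is fed to the RE model together with the sentence;
# the model decides whether a relation exists and, if so, searches for the
# relation span (here "出生于").
for head, tail in combinations(entities, 2):
    pass  # relation_span = re_model.predict(sentence, head, tail)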


File Dependency:

your_raw_data -> your_train_data.json
              |-> your_test_data.json
check_points -> bert_chinese_ner -> log.txt
                                 |-> model.pth
                                 |-> setting.txt
             -> bert_chinese_re -> log.txt
                                |-> model.pth
                                |-> setting.txt
pretrain_data -> ner_train.pkl
              |-> ner_dev.pkl
              |-> re_train.pkl
              |-> re_dev.pkl
pretrain_models -> bert_chinese -> bert_config.json
                                |-> pytorch_model.pth
                                |-> vocab.txt
bert_codes -> Python scripts for tokenization, modeling, optimization, and utils
learn_ner.py
learn_re.py

Dataset

  • raw corpus: The raw corpus used in this work is from the Magi Practical Web Article Corpus. In fact, any unstructured text is suitable.

  • knowledge graphs: The main knowledge graph we used for distant supervision is CN-DBpedia. Fusing several graphs may be useful.

  • training data: Examples can be found here. head_entity, tail_entity, and relation are all strings, and relation may sometimes be None. head/tail_entity is the string of the corresponding entity mention in the text, and entity_index is the index of that mention; if a head/tail_entity occurs more than once in one text, entity_index = [idx_1, idx_2, …]. Note, however, that entity_index is not actually used in this method. A concrete, made-up instance of both the training and test formats is sketched at the end of this section.

sample = {"_id": string, 
          "EL_res": [{"text": string, 
                      "triples": [[head_entity, tail_entity, relation], 
                                   [head_entity, tail_entity, relation],
                                   ...], 
                      "entity_idx": {head/tail_entity: entity_index, ...}
                      }, 
                     {"text": string, 
                      "triples": list, 
                      "entity_idx": dictionary},
                      ...
                      ]}
  • test data: Examples can be found here. head_entity, tail_entity, and relation are all strings, and relation may sometimes be None.
test_data = [
            {"unique_id": int, 
             "text": string, 
             "triples": [[head_entity, tail_entity, relation],
                         [head_entity, tail_entity, relation],
                         ...]
            },
            {"unique_id": int, 
             "text": string, 
             "triples": list},
             ...
]
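
Below is a minimal, hypothetical sketch of one training sample and one test sample following the formats above; the sentence, triples, and indices are made up for illustration and are not taken from the actual corpus or knowledge graph.

train_sample = {
    "_id": "doc_0001",
    "EL_res": [
        {"text": "姚明出生于上海。",
         # triples located by distant supervision against the knowledge graph
         "triples": [["姚明", "上海", "出生于"]],
         # index of each entity mention in the text (assumed here to be a
         # character offset; a list if the mention occurs more than once);
         # not actually used by this method
         "entity_idx": {"姚明": 0, "上海": 5}}
    ]}

test_sample = {
    "unique_id": 1,
    "text": "姚明出生于上海。",
    "triples": [["姚明", "上海", "出生于"]]}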

Command Line:

  • learn NER model: you can prepare the data, train the model, and make a prediction for a given example all at once by running the following command (or you can go into the learn_ner.py script and run these steps in turn):
python learn_ner.py
  • learn RE model: you can prepare the data, train the model, make a prediction for a given example, and make predictions on batched test data all at once by running the following command (or you can go into the learn_re.py script and run these steps in turn):
python learn_re.py

Requirements

  • Python = 3.6.9
  • pytorch = 1.3.1
  • scikit-learn = 0.21.3
  • tqdm = 4.39.0
  • requests = 2.22.0 (optional)
  • Flask = 1.1.1 (optional)
  • ipdb = 0.12.2 (optional)
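
A matching requirements sketch, assuming pip and a Python 3.6 environment (these are old versions, so wheel availability depends on your platform; the optional packages can be left out):

torch==1.3.1
scikit-learn==0.21.3
tqdm==4.39.0
requests==2.22.0   # optional
Flask==1.1.1       # optional
ipdb==0.12.2       # optional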

References

  • code: the original BERT-related code is from the bert_cn_finetune project by ewrfcas and the transformers project by Hugging Face.
  • literature:
    • Soares, L. B., FitzGerald, N., Ling, J., Kwiatkowski, T. (2019). Matching the Blanks: Distributional Similarity for Relation Learning. ACL 2019. paper/code.
    • Zhang, N., Deng, S., Sun, Z., Wang, G., Chen, X., Zhang, W., Chen, H. (2019). Long-tail Relation Extraction via Knowledge Graph Embeddings and Graph Convolution Networks. NAACL 2019. paper.
    • Su, J. A Hierarchical Relation Extraction Model with Pointer-Tagging Hybrid Structure. GitHub. blog/code.
    • Stanovsky, G., Michael, J., Zettlemoyer, L., Dagan, I. (2018). Supervised Open Information Extraction. NAACL 2018. paper.
  • blogs: Technical Summary: Tag Mining in Business Scenarios and a Review of Open-Source Concept-Tag Knowledge Bases | Liu Huanyong (刘焕勇), 老刘说NLP, 2022-02-17.
