acidAnn / fewrelde

A German dataset for few-shot relation extraction :bow_and_arrow:

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

FewRelDe 🏹

A German dataset for few-shot relation extraction.

Data Source

The English FewRel dataset is a benchmark for few-shot relation extraction. FewRelDe is the German version of a subset from the training and validation set of FewRel. The bootstrapping process via machine translation was first proposed by Markus Eberts in the master thesis "Relation Extraction with Attention-Based Transfer Learning". FewRelDe was created for the master thesis "Few-Shot Relation Extraction for German" by Anna Sauer.

Translation Process

The English instances from FewRel have been automatically translated into German with the help of DeepL Translator. Each translation has been reviewed by hand.

Relation set

FewRelDe contains a subset of 19 relations from FewRel. Each relation is part of the Wikidata ontology. You can access more information about each relation by clicking on the Wikidata id in the following table:

Wikidata id German name English name
P6 Leiter:in der Regierung head of government
P22 Vater father
P25 Mutter mother
P26 Ehepartner:in spouse
P27 Land der Staatsangehörigkeit country of citizenship
P84 Architekt:in architect
P102 Parteizugehörigkeit member of political party
P118 Liga league
P177 Querung crosses
P206 liegt am or im Gewässer located in or next to body of water
P466 Bewohner:in, Nutzer:in occupant
P551 Wohnsitz residence
P740 Gründungsort location of formation
P931 vom Verkehrsknotenpunkt bedient place served by transport hub
P937 Wirkungsort work location
P974 Nebenfluss tributary
P1408 Sendelizenz in licensed to broadcast to
P3373 Geschwister sibling
P4552 Gebirgszug mountain range

FewRelDe also contains an OTHER relation for negative instances that do not belong to any of the other 19 relations. These negative instances were randomly selected from the remaining FewRel training and validation relations.

Each of the 20 relations in FewRelDe has 50 instances. The total of labeled relation instances thus amounts to 1,000.

Dataset Format

The format of the JSON file is modeled after the data format of the FewRel benchmark for few-shot relation extraction. Each file contains a dictionary whose keys are the names of the relations in the dataset. For each relation key, the corresponding value is a list of the labeled instances of that relation. This list contains a dictionary for each individual instance with

  • "tokens": a list with the token string sequence in the sentence
  • "h": information on the head entity in a list with
    • a string with the entity mention in lower case
    • a string with the Wikidata id of the entity (cf. wikidata.org). In WissenschaftsSTANDARD, this string is left empty because no entity linking between the head and tail entity and their Wikidata equivalent is performed.
    • a list with a nested list that contains the indices of the entity mention tokens in the sentence
  • "t": information on the tail entity in a list with the same structure as "h"
  • "ner": a list with information obtained from the task of named entity recognition (NER). The list contains the BIOES entity type tag for each token in the sentence. The BIOES tagging has been created using the German NER model in Stanza with the CoNLL 2003 tag set. This tag set contains the entity types PER (person), ORG (organisation) and LOC (location) (cf. https://stanfordnlp.github.io/stanza/available_models.html#available-ner-models).

Consider the following made-up example:

{
  "tokens": ["Robin", "Hood", "lebt", "in", "Sherwood", "Forest", "."],
  "h": ["sherwood forest", "", [[4, 5]]],
  "t": ["robin hood", "", [[0, 1]]],
  "ner": ["B-PER", "E-PER", "O", "O", "B-PER", "E-PER", "O"]
}

License

FewRel has been released under a Creative Commons BY-SA 4.0 license (cf. https://thunlp.github.io/1/fewrel1.html). Therefore, FewRelDe is also released under a Creative Commons BY-SA 4.0 license.

About

A German dataset for few-shot relation extraction :bow_and_arrow:

License:Other