avramandrei / Information-Extraction-Romanian

This repository contains an information extraction system for the Romanian language.

Information-Extraction-Romanian

The system extracts information from a Romanian text file and saves it as a Resource Description Framework (RDF) data graph that can be further queried using the query.py script.

Each node in the RDF graph is saved as a triple (Subject + Predicate + Object) and has the following structure:

<relation:relation rdf:nodeID="RDF_ID">
    <relation:object>
      <entity:TYPE_OF_ENTITY rdf:nodeID="RDF_ID">
        <entity:words>list of words</entity:words>
      </entity:TYPE_OF_ENTITY>
    </relation:object>
    <relation:predicate>list of words</relation:predicate>
    <relation:subject>
      <entity:TYPE_OF_ENTITY rdf:nodeID="RDF_ID">
        <entity:words>list of words</entity:words>
      </entity:TYPE_OF_ENTITY>
    </relation:subject>
</relation:relation>

Where:

  • RDF_ID is the ID of the RDF node
  • TYPE_OF_ENTITY is one of the 16 entity types described in RONEC
  • list of words is the list of words that make up the subject, predicate, or object of the node
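As an illustration of the structure above, here is a minimal sketch that reads such nodes back into Python tuples with the standard library. The namespace URIs, the enclosing rdf:RDF element, and the sample sentence are assumptions made for this example, since only the prefixes are shown above; check the header of your own output.xml for the real URIs.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs: only the prefixes appear in the structure above.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "relation": "http://example.org/relation/",
    "entity": "http://example.org/entity/",
}

# Made-up sample graph with a single relation node.
SAMPLE = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:relation="http://example.org/relation/"
    xmlns:entity="http://example.org/entity/">
  <relation:relation rdf:nodeID="r1">
    <relation:object>
      <entity:GPE rdf:nodeID="e2">
        <entity:words>România</entity:words>
      </entity:GPE>
    </relation:object>
    <relation:predicate>a vizitat</relation:predicate>
    <relation:subject>
      <entity:PERSON rdf:nodeID="e1">
        <entity:words>Ion Popescu</entity:words>
      </entity:PERSON>
    </relation:subject>
  </relation:relation>
</rdf:RDF>"""


def read_entity(relation, role):
    """Return (entity_type, words) for the subject or object of a relation."""
    holder = relation.find(f"relation:{role}", NS)
    entity = holder[0]  # the single entity:TYPE_OF_ENTITY child
    entity_type = entity.tag.split("}", 1)[1]  # drop the "{namespace}" part
    words = entity.find("entity:words", NS).text
    return entity_type, words


def read_triples(xml_text):
    """Return a list of (subject, predicate, object) tuples."""
    root = ET.fromstring(xml_text)
    triples = []
    for relation in root.findall("relation:relation", NS):
        subject = read_entity(relation, "subject")
        predicate = relation.find("relation:predicate", NS).text
        obj = read_entity(relation, "object")
        triples.append((subject, predicate, obj))
    return triples


triples = read_triples(SAMPLE)
print(triples)
```

Each subject and object is returned as an (entity type, words) pair, so the entity type stays available for later filtering.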

Installation

Install with:

git clone https://github.com/avramandrei/Information-Extraction-Romanian.git
cd Information-Extraction-Romanian
pip install -r requirements.txt

Usage

To extract information from a file, run the extract_information.py script as follows:

python3 extract_information.py [ro_text_file_path] [output_dir]

The script automatically creates two files in [output_dir]: output.conllup, which holds the output of the Named Entity Recognizer in CoNLL-U Plus format, and output.xml, which holds the output RDF graph.
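To inspect the Named Entity Recognizer output, a CoNLL-U Plus file can be read with a few lines of Python. The sketch below relies only on the general CoNLL-U Plus conventions (a "# global.columns" header, tab-separated token lines, blank lines between sentences). The column names and sample rows are made up for illustration and may differ from what output.conllup actually contains.

```python
import io


def read_conllup(lines):
    """Yield sentences as lists of {column_name: value} dicts."""
    columns = None
    sentence = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("# global.columns ="):
            # The header names the columns used by every token line.
            columns = line.split("=", 1)[1].split()
        elif line.startswith("#"):
            continue  # other comment lines are skipped
        elif not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
        else:
            if columns is None:
                raise ValueError("missing '# global.columns' header")
            sentence.append(dict(zip(columns, line.split("\t"))))
    if sentence:
        yield sentence


# Made-up sample; the column names here are assumptions.
SAMPLE = (
    "# global.columns = ID FORM RONEC:CLASS\n"
    "1\tIon\tB-PERSON\n"
    "2\tdoarme\tO\n"
    "\n"
)

sentences = list(read_conllup(io.StringIO(SAMPLE)))
print(sentences[0][0]["FORM"], sentences[0][0]["RONEC:CLASS"])
```

For a real file, pass an open file handle instead of the io.StringIO object, for example open("output.conllup", encoding="utf-8").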

Query

The repository contains an RDF graph in resources\rdf_graph.xml that has been obtained by crawling news sites. The query.py script allows you to select specific Subjects and Predicates from the RDF graph. It must be used as follows:

python3 query.py [rdf_graph_path] [sql_out] [--subj] [--pred]

The command will create a file that contains the output of the query.
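To make the idea of selecting by Subject and Predicate concrete, here is a small sketch that filters in-memory triples. It is not the logic of query.py: the select function, the case-insensitive substring matching, and the sample triples are all assumptions made for this example.

```python
def select(triples, subj=None, pred=None):
    """Keep triples whose subject and predicate contain the given text.

    A filter left as None matches everything. Matching is a
    case-insensitive substring test, which is an assumption of this
    sketch rather than documented behaviour of query.py.
    """
    def matches(needle, text):
        return needle is None or needle.casefold() in text.casefold()

    return [t for t in triples if matches(subj, t[0]) and matches(pred, t[1])]


# Made-up (subject, predicate, object) triples.
TRIPLES = [
    ("Ion Popescu", "a vizitat", "România"),
    ("Maria Ionescu", "a vizitat", "Cluj"),
    ("Ion Popescu", "a cumpărat", "o carte"),
]

result = select(TRIPLES, subj="ion popescu", pred="vizitat")
print(result)
```

Leaving either filter unset widens the selection: select(TRIPLES, pred="vizitat") keeps both visit triples regardless of subject.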

Notes

  • Feel free to ask any questions regarding the system by opening an issue or by directly sending me an email at avram.andreimarius@gmail.com.
  • We are looking for a team to develop a relationship corpus for the Romanian language to further improve the system. Contact me at avram.andreimarius@gmail.com for more details.

Authors

Cite

Please consider citing the following paper as a thank you to the authors:

Dumitrescu, Stefan Daniel, and Andrei-Marius Avram. "Introducing RONEC--the Romanian Named Entity Corpus." arXiv preprint arXiv:1909.01247 (2019).

or in BibTeX format:

@article{dumitrescu2019introducing,
  title={Introducing RONEC--the Romanian Named Entity Corpus},
  author={Dumitrescu, Stefan Daniel and Avram, Andrei-Marius},
  journal={arXiv preprint arXiv:1909.01247},
  year={2019}
}

License: MIT License

