ACE2005-toolkit

ACE 2005 data preprocessing

Focusing on ACE 2005 data preprocessing, we provide doc-level, sentence-level, and BIO-style golden data preprocessing; the only thing you need is the raw ACE05 data. Hope you enjoy! 😎

File structure

ACE2005-toolkit
├── ace_2005 (the ACE2005 raw data)
│   ├── data
│   │   └── ...
│   ├── docs
│   │   └── ...
│   ├── dtd
│   │   └── ...
│   └── index.html
├── cache_data (empty before run)
│   ├── Arabic/
│   ├── Chinese/
│   └── English/
├── filelist (train/dev/test doc files)
│   ├── ace.ar.dev
│   ├── ace.ar.test
│   ├── ace.ar.train
│   ├── ace.en.dev
│   ├── ace.en.test
│   ├── ace.en.train
│   ├── ace.zh.dev
│   ├── ace.zh.test
│   └── ace.zh.train
├── output (final output, empty before run)
│   ├── BIO (BIO output)
│   │   ├── train/
│   │   ├── test/
│   │   └── dev/
│   └── ...
├── udpipe (udpipe files)
│   ├── arabic-padt-ud-2.5-191206
│   ├── chinese-gsd-ud-2.5-191206
│   └── english-ewt-ud-2.5-191206
├── ace_parser.py
├── extract.py
├── format.py
├── transform.py
├── udpipe.py
├── requirements.txt
└── run.sh

Preprocessing steps

  1. Download the ACE2005 raw data and rename the directory to ace_2005;
  2. Install all the requirements with pip install -r requirements.txt;
  3. Start preprocessing with bash run.sh en (en can be replaced by zh or ar);
  4. Enter n to split the data according to the doc lists in filelist/, or enter y plus a train/dev/test ratio (e.g. 0.8 0.1 0.1) to split the data by sentences;
  5. Enter y to transform the data into BIO format; the transformed data will be placed in output/BIO/, and each train (test or dev) split is converted into 4 BIO-style JSON files (token, entity_BIO, event_trigger_BIO and event_argument_BIO);
  6. The final output will be in the output/ directory; a condensed quick-start is sketched below.
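
Putting the steps together, a typical English run looks like this (the prompts from steps 4 and 5 are answered interactively when run.sh asks for them):

pip install -r requirements.txt
bash run.sh en    # replace en with zh or ar for Chinese or Arabic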

Output format

The output files are saved separately in output/, and each one can be loaded with json.loads(). After loading, the data is a Python list, and each element is a Python dict like the following:

{
    "sentence": "Orders went out today to deploy 17,000 U.S. Army soldiers in the Persian Gulf region.",
    "tokens": [
        "Orders",
        "went",
        "out",
        "today",
        "to",
        "deploy",
        "17,000",
        "U.S.",
        "Army",
        "soldiers",
        "in",
        "the",
        "Persian",
        "Gulf",
        "region",
        "."
    ],
    "golden-entity-mentions": [
        {
            "entity-id": "CNN_CF_20030303.1900.02-E4-186",
            "entity-type": "GPE:Nation",
            "text": "U.S",
            "sent_id": "4",
            "position": [
                7,
                7
            ],
            "head": {
                "text": "U.S",
                "position": [
                    7,
                    7
                ]
            }
        },
        ...
    ],
    "golden-event-mentions": 
        {
            "event-id": "CNN_CF_20030303.1900.02-EV1-1",
            "event_type": "Movement:Transport",
            "arguments": [
                {
                    "text": "17,000 U.S. Army soldiers",
                    "sent_id": "4",
                    "position": [
                        6,
                        9
                    ],
                    "role": "Artifact",
                    "entity-id": "CNN_CF_20030303.1900.02-E25-1"
                },
                {
                    "text": "the Persian Gulf region",
                    "sent_id": "4",
                    "position": [
                        11,
                        15
                    ],
                    "role": "Destination",
                    "entity-id": "CNN_CF_20030303.1900.02-E76-191"
                }
            ],
            "text": "Orders went out today to deploy 17,000 U.S. Army soldiers\nin the Persian Gulf region",
            "sent_id": "4",
            "position": [
                0,
                15
            ],
            "trigger": {
                "text": "deploy",
                "position": [
                    5,
                    5
                ]
            }
        },
        ...
    ],
    "golden-relation-mentions": [
        {
            "relation-id": "CNN_CF_20030303.1900.02-R1-1",
            "relation-type": "ORG-AFF:Employment",
            "text": "17,000 U.S. Army soldiers",
            "sent_id": "4",
            "position": [
                6,
                9
            ],
            "arguments": [
                {
                    "text": "17,000 U.S. Army soldiers",
                    "sent_id": "4",
                    "position": [
                        6,
                        9
                    ],
                    "role": "Arg-1",
                    "entity-id": "CNN_CF_20030303.1900.02-E25-1"
                },
                {
                    "text": "U.S. Army",
                    "sent_id": "4",
                    "position": [
                        7,
                        8
                    ],
                    "role": "Arg-2",
                    "entity-id": "CNN_CF_20030303.1900.02-E66-157"
                }
            ]
        }, 
        ...
    ]
}

You will get all the golden data of entities, events, and relations in the output files.
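
A minimal loading sketch is shown below; the file name output/train.json is an assumption for illustration, so substitute whatever files run.sh actually writes to output/:

import json

# A minimal sketch: the file name output/train.json is an assumption;
# use the actual file names that run.sh writes to output/.
with open("output/train.json", encoding="utf-8") as f:
    data = json.loads(f.read())  # each output file holds one JSON list

for example in data:
    tokens = example["tokens"]
    # Entity mentions: "position" holds inclusive token indices.
    for entity in example["golden-entity-mentions"]:
        start, end = entity["position"]
        print(entity["entity-type"], "->", " ".join(tokens[start:end + 1]))
    # Event mentions: the trigger span uses the same inclusive convention.
    for event in example["golden-event-mentions"]:
        t_start, t_end = event["trigger"]["position"]
        print(event["event_type"], "trigger:", " ".join(tokens[t_start:t_end + 1]))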

Adjustment

You can change the file names in filelist/, which directly controls which documents belong to train/dev/test; by default we use a (529/30/40) document division. A quick way to check the split sizes is sketched below.
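
Assuming each list file stores one document name per line (an assumption, but consistent with the file structure above), the split sizes can be verified with a line count:

wc -l filelist/ace.en.train filelist/ace.en.dev filelist/ace.en.test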

Related work

Email us

For any questions, contact us at haochenli@pku.edu.cn.

License: MIT License

