AlonEirew / extract-wec

Extract links from Wikipedia pages to create a cross-document coreference dataset (multilingual support)

Extract WEC Dataset

This project accompanies our research paper: "WEC: Deriving a Large-scale Cross-document Event Coreference dataset from Wikipedia".
A video tutorial of our research is also available here.

This repository provides both the WEC-Eng cross-document dataset, derived from English Wikipedia, and the method for creating WEC for other languages.

Note: In our original WEC paper, we used several methods, all of which are aggregated into one project here. To that end, we replaced some of the original Python implementations with corresponding Java ones (for example, the SpaCy implementation was replaced with StanfordNLP).

WEC-Eng Coreference Dataset

WEC-Eng is hosted on the Hugging Face Hub and is available at: https://huggingface.co/datasets/biu-nlp/WEC-Eng

See the dataset card for instructions on how to read and use WEC-Eng.

Generating a new WEC Dataset

Below are instructions for generating a new version of WEC, whether from a more recent English Wikipedia dump or from one of the other supported languages (e.g., French, Spanish, German, Chinese).

Requisites

  • A Wikipedia ElasticSearch index created by the wikipedia-to-elastic project (the index must contain at least the Infobox "relationTypes").
  • Java 11

Processes

This code repo contains two main processes:

  1. Code to generate the initial crude version of WEC-Lang
  2. Code to generate the final JSON of WEC-Lang

WEC to DB Configuration:

Configuration file - resources/application.properties

spring.datasource.url=jdbc:h2:file:/demo => H2 database file URL
poolSize=8 => Number of threads to run
elasticHost=localhost => Elastic engine host
elasticPort=9200 => Elastic engine port
elasticWikiIndex=enwiki_v3 => Elastic index to read from (as generated by wikipedia-to-elastic)
infoboxConfiguration=/infobox_config/en_infobox_config.json => Explained below
multiRequestInterval=100 (recommended value) => Controls the number of search pages to retrieve from Elastic
elasticSearchInterval=100 (recommended value) => Controls the number of pages read by the Elastic scroller
totalAmountToExtract=-1 => If < 0, read all Wikipedia pages; otherwise read up to the specified number of pages
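
As a rough illustration of how these settings can be interpreted, the sketch below parses a few of the properties with java.util.Properties and applies the totalAmountToExtract rule described above. The class and method names (WecConfigSketch, parse, shouldContinue) are hypothetical and not part of the project, whose own configuration loading may well differ.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class WecConfigSketch {

    // Parse text in the same key=value form used by application.properties.
    public static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Hypothetical helper: a negative limit means "read all pages";
    // otherwise reading stops once pagesRead reaches the limit.
    public static boolean shouldContinue(int totalAmountToExtract, int pagesRead) {
        return totalAmountToExtract < 0 || pagesRead < totalAmountToExtract;
    }

    public static void main(String[] args) {
        Properties props = parse(String.join("\n",
                "poolSize=8",
                "elasticHost=localhost",
                "elasticPort=9200",
                "elasticWikiIndex=enwiki_v3",
                "totalAmountToExtract=-1"));
        int limit = Integer.parseInt(props.getProperty("totalAmountToExtract"));
        System.out.println("Elastic endpoint: " + props.getProperty("elasticHost")
                + ":" + props.getProperty("elasticPort"));
        System.out.println("Continue after 1000 pages? " + shouldContinue(limit, 1000));
    }
}
```

With totalAmountToExtract=-1 the helper never stops on its own, which mirrors the "read all Wikipedia pages" behaviour; a small positive value is convenient for a quick trial run.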

WEC to Json Configuration:

main.lexicalThresh=4 => Lexical diversity threshold
main.outputDir=output => Output folder where the WEC JSON file is created and saved
main.outputFile=GenWEC.json => WEC JSON file name; this file will contain the final version of the generated dataset
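
How the lexical diversity threshold is applied is not detailed here. As an illustration only, the sketch below assumes one plausible reading: within a single coreference cluster, at most lexicalThresh mentions with the same (case-insensitive) surface string are kept. The class and method names are hypothetical, and the project's actual filtering logic may differ.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class LexicalCapSketch {

    // Keep at most lexicalThresh mentions per distinct surface string
    // (assumed interpretation of the threshold, not the project's code).
    public static List<String> capRepeats(List<String> clusterMentions, int lexicalThresh) {
        Map<String, Integer> seen = new HashMap<>();
        List<String> kept = new ArrayList<>();
        for (String mention : clusterMentions) {
            String key = mention.toLowerCase(Locale.ROOT);
            int count = seen.merge(key, 1, Integer::sum); // occurrences so far, including this one
            if (count <= lexicalThresh) {
                kept.add(mention);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> cluster = List.of(
                "the crash", "the crash", "the crash", "the crash", "the crash",
                "Flight 191 accident");
        // With a threshold of 4, the fifth "the crash" is dropped.
        System.out.println(capRepeats(cluster, 4).size());
    }
}
```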

Language Adaptation

We have extracted the relevant infobox configuration for the English Wikipedia.
To create a newer version of WEC-Eng, use or update the default infobox_config/en_infobox_config.json configuration.

To generate WEC in one of the supported languages (other than English), follow these steps:

  • Export Wikipedia in the required language using the wikipedia-to-elastic project.
  • Explore the available infobox categories. The report below can help by listing candidate categories together with the number of pages related to each infobox category.
  • Run the infobox categories report:
    ./gradlew bootRun --args=infobox
  • Now create a new infobox configuration file for the new language in src/main/resources/infobox_config/<lang>_infobox_config.json
    The file should contain all required language-specific infobox configurations (based on the generated infobox categories report).
  • Finally, set it as the infoboxConfiguration file in application.properties

English infobox example (from en_infobox_config.json):

{
  "infoboxLangText" : "Infobox", // wikipedia markdown element name in the language (e.g., <Infobox sport>)
  "infoboxConfigs": [
    {
      "corefType": "ACCIDENT_EVENT", // Type you would like to give the infobox category
      "include": true, // Should be included when extracting WEC
      "infoboxs": [ // list of infobox categories that should be included in this type (lowercased and concat)
        "airlinerincident", 
        "airlineraccident",
        "aircraftcrash",
        "aircraftaccident",
        "aircraftincident",
        "aircraftoccurrence",
        "railaccident",
        "busaccident",
        "publictransitaccident"
      ]
    }
  ]
}
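
To make the "lowercased and concat" convention concrete, the sketch below normalizes an infobox name and looks up its corefType. It assumes normalization means lowercasing and removing whitespace; the class, the method names, and the in-memory map standing in for the parsed configuration are all hypothetical.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

public class InfoboxMatchSketch {

    // Assumed normalization: lowercase and drop whitespace, so that
    // "Airliner accident" becomes "airlineraccident".
    public static String normalize(String infoboxName) {
        return infoboxName.toLowerCase(Locale.ROOT).replaceAll("\\s+", "");
    }

    // Find the corefType whose infobox list contains the normalized name.
    public static Optional<String> corefTypeFor(String infoboxName,
                                                Map<String, List<String>> config) {
        String key = normalize(infoboxName);
        return config.entrySet().stream()
                .filter(entry -> entry.getValue().contains(key))
                .map(Map.Entry::getKey)
                .findFirst();
    }

    public static void main(String[] args) {
        // Stand-in for a parsed configuration: corefType -> infobox categories.
        Map<String, List<String>> config = Map.of(
                "ACCIDENT_EVENT", List.of("airlineraccident", "railaccident", "busaccident"));
        System.out.println(corefTypeFor("Airliner accident", config)); // Optional[ACCIDENT_EVENT]
        System.out.println(corefTypeFor("Sport", config));             // Optional.empty
    }
}
```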

Extracting WEC-Lang

Make sure the Wikipedia Elastic engine is running.
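
One way to check this before starting an extraction is to ask Elastic whether the configured index exists. The sketch below uses the Java 11 HTTP client and assumes a plain-HTTP, unauthenticated engine reachable at elasticHost:elasticPort; the class and method names are hypothetical and not part of the project.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class ElasticPingSketch {

    // Build the index URL from the configured host, port, and index name.
    public static URI indexUri(String host, int port, String index) {
        return URI.create("http://" + host + ":" + port + "/" + index);
    }

    // Send a HEAD request for the index; Elastic answers 200 when it exists.
    public static boolean indexExists(URI uri) {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(5))
                .build();
        HttpRequest request = HttpRequest.newBuilder(uri)
                .timeout(Duration.ofSeconds(5))
                .method("HEAD", HttpRequest.BodyPublishers.noBody())
                .build();
        try {
            HttpResponse<Void> response =
                    client.send(request, HttpResponse.BodyHandlers.discarding());
            return response.statusCode() == 200;
        } catch (IOException e) {
            return false; // engine not reachable
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        URI uri = indexUri("localhost", 9200, "enwiki_v3");
        System.out.println(uri + " available: " + indexExists(uri));
    }
}
```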

  • Run WikiToWECMain to generate the H2 database:
    ./gradlew bootRun --args=wecdb
    Program output: an H2 database containing the crude extraction of coreference relations from Wikipedia (this resource can be used for experiments before generating the final version of WEC-Lang).
  • Generate the WEC-Lang JSON file:
    ./gradlew bootRun --args=wecjson
    Program output: a JSON resource of the WEC-Lang dataset.

Visualization And Stats

To produce more statistics or create a visualized output of the generated dataset, refer to these scripts for more information.

License: Apache License 2.0


Languages

Language:Java 100.0%