TAXI: a Taxonomy Induction Method based on Lexico-Syntactic Patterns, Substrings and Focused Crawling

TAXI

More information about the approach can be found at the TAXI web site.

System Requirements

The system was tested on Debian/Ubuntu Linux and Mac OS X. Loading all resources into memory requires about 64 GB of RAM.
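Before starting a full run, it can be useful to check whether the machine comes close to the memory figure above. The following is a minimal sketch and not part of the TAXI code base: the helper names `total_ram_gb` and `enough_ram` are hypothetical, and the check relies on `os.sysconf`, which is available on Linux and Mac OS X but not on every platform.

```python
import os

# Figure stated in the requirements above; treated here as a rough threshold.
REQUIRED_GB = 64


def total_ram_gb():
    """Return the physical RAM in GB, or None if it cannot be determined."""
    try:
        pages = os.sysconf("SC_PHYS_PAGES")
        page_size = os.sysconf("SC_PAGE_SIZE")
    except (AttributeError, ValueError, OSError):
        # os.sysconf or one of the names is unavailable on this platform.
        return None
    return pages * page_size / 1e9


def enough_ram(total_gb, required_gb=REQUIRED_GB):
    """Rough check: True only if the RAM size is known and large enough."""
    return total_gb is not None and total_gb >= required_gb
```

If `enough_ram(total_ram_gb())` is false, the quick `--test` run described below is probably the safer starting point.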

Installation

  1. Clone repository:
git clone https://github.com/tudarmstadt-lt/taxi.git
  2. Download resources into the repository (4.4 GB, compressed by gzip):
cd taxi && wget http://panchenko.me/data/joint/taxi/res/resources.tgz && tar xzf resources.tgz
  3. Install dependencies:
pip install -r requirements.txt

Induction of SemEval Taxonomies

Run semeval.py to reproduce the experimental results.

For a test run (few resources loaded, quick):

python semeval.py vocabularies/science_en.csv en simple --test

For a normal run (all resources loaded, requires 64 GB of RAM):

python semeval.py vocabularies/science_en.csv en simple

The vocabularies directory contains input terms for different domains and languages. The script reproduces the results of the SemEval 2016 Task 13 (Taxonomy Extraction Evaluation) described in our paper. It loads hypernyms from the downloaded resources and constructs a taxonomy for every input vocabulary of the SemEval datasets, e.g. the English Food domain.

Generally, the TAXI approach takes a vocabulary as input and outputs a taxonomy for a linked subset of the terms in that vocabulary. Currently, the main purpose of this repository is to ensure reproducibility of the SemEval results. The resulting taxonomies are generated next to the corresponding input vocabulary file. If you need to adapt the script to your needs and require help, do not hesitate to contact us.
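To run several vocabularies in a row, the command line shown above can be assembled programmatically. This is a hedged sketch, not part of the repository: `build_command` and `language_of` are hypothetical helpers, the parameter name `mode` is only a label for the third positional argument (`simple` in the examples above), and the `<domain>_<lang>.csv` naming is an assumption inferred from `science_en.csv`.

```python
import sys
from pathlib import Path


def build_command(vocabulary, language="en", mode="simple", test=False):
    """Build the argument list for one semeval.py run.

    Mirrors the documented call:
        python semeval.py vocabularies/science_en.csv en simple [--test]
    """
    cmd = [sys.executable, "semeval.py", str(vocabulary), language, mode]
    if test:
        cmd.append("--test")  # quick run with few resources loaded
    return cmd


def language_of(vocabulary):
    """Guess the language code from the file name.

    Assumption: files are named "<domain>_<lang>.csv", as in science_en.csv.
    """
    return Path(vocabulary).stem.rsplit("_", 1)[-1]
```

A possible usage, run from the repository root, is to loop over `sorted(Path("vocabularies").glob("*.csv"))` and pass `build_command(path, language_of(path), test=True)` to `subprocess.run(..., check=True)`. Note that every full run loads the resources again, so batching this way saves typing but not time.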

License: Apache License 2.0
