meshiguge / SetExpan

The source code for SetExpan framework, published in ECML-PKDD 2017

SetExpan: Corpus-based Set Expansion Framework

Introduction

This is the source code for the SetExpan framework, developed for corpus-based set expansion (i.e., finding the "complete" set of entities belonging to the same semantic class, given a corpus and a tiny set of seeds).
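To make the task definition concrete, here is a minimal toy sketch of set expansion. It is not the SetExpan algorithm: it simply ranks candidate entities by how many context patterns they share with the seeds. The function name `expand_set`, the context patterns, and the data are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def expand_set(seeds, entity_contexts, top_k=3):
    """Rank non-seed entities by the context features they share with the seeds.

    Toy illustration only; this is NOT the SetExpan algorithm.
    """
    # Count how many seeds each context feature co-occurs with.
    seed_features = Counter()
    for seed in seeds:
        seed_features.update(entity_contexts.get(seed, set()))

    scores = {}
    for entity, contexts in entity_contexts.items():
        if entity in seeds:
            continue
        # Score = total seed support over the features this entity has.
        scores[entity] = sum(seed_features[c] for c in contexts)

    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(scores, key=lambda e: (-scores[e], e))
    return [e for e in ranked if scores[e] > 0][:top_k]


# Hypothetical toy data: entity -> context patterns observed in a corpus.
entity_contexts = {
    "Illinois": {"state of __", "governor of __", "lives in __"},
    "Texas": {"state of __", "governor of __"},
    "California": {"state of __", "governor of __", "lives in __"},
    "Ohio": {"state of __", "lives in __"},
    "Chicago": {"city of __", "lives in __"},
    "Python": {"written in __"},
}
seeds = {"Illinois", "Texas"}
expanded = expand_set(seeds, entity_contexts)
print(expanded)  # ['California', 'Ohio', 'Chicago']
```

Note that "Chicago" still appears at the bottom of the ranking because it shares one generic context with a seed, which is the kind of noise a real set expansion method has to handle.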

Usage

We provide the data preprocessing code and the Python implementation of SetExpan. To use our data preprocessing code, download the following two packages and put them in the "/src/tools/" folder:

  • AutoPhrase: used to extract quality phrases from raw input data.
  • Stanford CoreNLP 3.8.0: used for POS tagging and for selecting quality noun phrases from the phrase list generated by AutoPhrase. Each quality noun phrase is treated as an "entity".

Otherwise, you can directly download our preprocessed data from Google Drive; unzip it and put the dataset under the "./data/" folder.

Files in the folder

  • /data/, the input folder of SetExpan;
  • /result/, the output folder of SetExpan;
  • /src/corpusProcessing/, the first step of data preprocessing, which converts raw text to sentences.json;
  • /src/dataProcessing/, the second step of data preprocessing, which generates all SetExpan input files from sentences.json;
  • /src/tools/, tools used in the data processing
  • /src/SetExpan/, the python implementation of SetExpan algorithms
    • /src/SetExpan/set_expan_main.py: the main entry point of SetExpan, which loads data, forms queries, and runs the algorithm.
    • /src/SetExpan/set_expan.py: the main implementation of SetExpan. You can change model hyper-parameters in this file.
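
Before running anything, it can help to confirm that the layout listed above is in place. The following is a small sketch, assuming the paths are relative to the repository root; the helper `missing_paths` is hypothetical and not part of the SetExpan code.

```python
from pathlib import Path

# Folders and files named in the list above, relative to the repository root.
EXPECTED_DIRS = [
    "data",
    "result",
    "src/corpusProcessing",
    "src/dataProcessing",
    "src/tools",
    "src/SetExpan",
]
EXPECTED_FILES = [
    "src/SetExpan/set_expan_main.py",
    "src/SetExpan/set_expan.py",
]


def missing_paths(repo_root):
    """Return the expected folders and files that do not exist under repo_root."""
    root = Path(repo_root)
    missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing


# Example: report anything missing relative to the current directory.
for path in missing_paths("."):
    print("missing:", path)
```

An empty result means every folder and file from the list was found; it says nothing about whether the preprocessed dataset inside "data" is complete.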

To Run

cd src/SetExpan/ 
python3 ./set_expan_main.py

Results are saved in the same folder, in a file named "setexpan_result.txt".

Publications

Please cite the following paper if you are using this code. Thanks!

About

License: Apache License 2.0


Languages

Python 36.4%, C 31.2%, C++ 25.0%, Shell 6.5%, Makefile 0.8%