Toponym Disambiguation

In this subtask, all names of locations are known by the resolver but not their precise coordinates. The resolver has to select the GeoNames ID corresponding to the expected place among all possible candidates.

it contains:

Preprocessing for GEO dictionary
feature engineering (distance and population)
Logistic Regression Modeling
Evaluation

Getting Started

Prerequisites

python>=3.72
numpy>=1.15
scikit-learn>=0.20

Installing

Clone this repository

git clone https://github.com/AllenTiTaiWang/SemEval2019_Task12.git

Going to the right Directory

First of all, go to EnrollDec file.

cd SemEval2019_Task12

Make sure you have a file named Final_dataset under file Feature. Final_dataset should be provided by the task , which contains train, dev, and test data.

cp -r /path/to/Final_dataset /path/to/SemEval2019_Task12/Feature/

Also, create a file named data which contains three empty files: dev, train, and test, and make sure you have a dictionary called geo_dict_with_population_lonalt.txt in this same directory.

mkdir /path/to/SemEval2019_Task12/data
mkdir /path/to/SemEval2019_Task12/data/train
mkdir /path/to/SemEval2019_Task12/data/test
mkdir /path/to/SemEval2019_Task12/data/dev

cp /path/to/geo_dict_with_population_lonalt.txt /path/to/SemEval2019_Task12/data/

This task can't run without data and dictionary.

Pipeline

The command below assumes that Feature engineering and modeling runnung under HPC environment.

Preprocessing

downsize the dictionary and store it in json file.

cd Feature
python3 DicGEO.py

This code generates geoname2id.json and geoid2la.json.

###Feature Engineering

and then upload files on HPC, and run the following scripts under HPC. (highly recommend)

Make sure you change the path in .sh file before running them.
Make sure create output and error file in the directory you want.

qsub < CalDistance.sh
qsub < CalDistance_train.sh
qsub < CalDistance_test.sh

These three jobs will create the features for each involved geonameid. it could take a while, since minimum and average distance will be calculated. The result will be in data/dev, data/train, and data/test.

Modeling

In this step, the data in data/train, data/dev, and data/test will be organized into a huge feature matrix, and logistic regression model will then be trained.

qsub < LogReg.sh

Evaluation

Evaluation doesn't need to be conducted under HPC.

Dev Performance

Generate the annotation file from your prediction results

python3 Evaluation/generateann.py

Run their evaluator

python3 Evaluation/SemEvalTask12Evaluator/Evaluator/ToponymEvaluator.py disambiguation/ results/ Disambiguation

Check the dev score

less results/scores.txt

Generate the test annotation predictions

python3 testfinal.py

then the annotation files should be in testann file.

titaiwangms / SemEval2019_Task12