cltl / LabelingPeople

Labeling other people

This repository contains code and data for the following paper:

@InProceedings{W18-6550,
  author = 	"van Miltenburg, Emiel
		and Elliott, Desmond
		and Vossen, Piek",
  title = 	"Talking about other people: an endless range of possibilities",
  booktitle = 	"Proceedings of the 11th International Conference on Natural Language Generation",
  year = 	"2018",
  publisher = 	"Association for Computational Linguistics",
  pages = 	"415--420",
  location = 	"Tilburg University, The Netherlands",
  url = 	"http://aclweb.org/anthology/W18-6550"
}

The exact code and data used for this paper (this commit) are captured as a release.

Folder structure

There are three folders:

  1. Flickr30K contains all code and data for the categorization of person labels in Flickr30k-Entities.
  2. VisualGenome contains code and data for the categorization of attributes from Visual Genome.
  3. Other contains some additional functions to compute relevant statistics.

General requirements

The code has been tested with the following software. Results should not differ with other versions of Python or NLTK, but this is untested.

  • Python 3.6.3
  • nltk 3.2.2

How to use the code

We'll take the Flickr30K data as an example. The general logic is as follows:

  • The resources folder contains all files with categories, stopwords, etc.
  • Generate the grammar by running python update_grammar.py. This script compiles the resources into a grammar that matches labels against the categories.
  • Check the labels by running python check_labels.py. This script checks which labels are covered by the grammar. Covered labels are written to grammatical.txt; labels that are not covered are written to ungrammatical.txt. Reading the latter file shows which (parts of) labels still need to be categorized.
  • After adding (parts of) labels to the category files in the resources folder, run python update_grammar.py again, then re-run python check_labels.py to confirm the new coverage.
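
The check-and-update loop above can be illustrated with a small, self-contained sketch. This is not the repository's implementation: it swaps the compiled grammar for a plain token lookup, and the category names, word lists, and function names (CATEGORIES, STOPWORDS, categorize, split_labels) are all hypothetical. It only shows the idea of separating covered from uncovered labels.

```python
# Hypothetical category data; the real files live in the resources folder.
CATEGORIES = {
    "GENDER": {"man", "woman", "boy", "girl"},
    "AGE": {"young", "old", "elderly"},
    "OCCUPATION": {"doctor", "chef"},
}
STOPWORDS = {"a", "an", "the"}


def categorize(label):
    """Return (token, category) pairs, or None if any token is uncovered."""
    pairs = []
    for token in label.lower().split():
        if token in STOPWORDS:
            continue
        # Simplification: take the first category that lists this token.
        category = next(
            (name for name, words in CATEGORIES.items() if token in words),
            None,
        )
        if category is None:
            return None
        pairs.append((token, category))
    return pairs


def split_labels(labels):
    """Split labels into covered and uncovered lists, keeping input order."""
    grammatical, ungrammatical = [], []
    for label in labels:
        if categorize(label) is not None:
            grammatical.append(label)
        else:
            ungrammatical.append(label)
    return grammatical, ungrammatical


labels = ["a young man", "the elderly doctor", "a tall woman"]
covered, uncovered = split_labels(labels)
```

In this sketch, "a tall woman" ends up in the uncovered list because "tall" is not in any category. Adding "tall" to a (hypothetical) category and running the split again would move the label to the covered list, which mirrors the step of extending the category files and regenerating the grammar.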

There are also two non-essential scripts.

  • To parse individual labels, import the analyze_label function from label_parser.py.
  • Run flickr_stats.py to get some statistics about the original data. Specifically, it reports the total number of unique labels classified as PEOPLE, and the size of the subset of those labels that end in boy, girl, male, female, woman, or man.
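
The two statistics mentioned for flickr_stats.py can be sketched as follows. This is an illustration under stated assumptions, not the script itself: the function name label_stats and the example labels are made up, labels are lowercased before deduplication, and "ends in" is interpreted as "the last word is one of the six nouns" (the actual script may match differently).

```python
# The six head nouns named in the README.
GENDERED_HEADS = ("boy", "girl", "male", "female", "woman", "man")


def label_stats(labels):
    """Return (number of unique labels, number whose last word is gendered)."""
    # Assumption: normalize case and whitespace before counting unique labels.
    unique = {label.strip().lower() for label in labels if label.strip()}
    # Assumption: compare the final word, so "fireman" does not count as "man".
    gendered = {label for label in unique if label.split()[-1] in GENDERED_HEADS}
    return len(unique), len(gendered)


# Made-up labels standing in for the PEOPLE labels in the original data.
people_labels = ["a man", "A man", "young woman", "fireman", "the chef", "little girl"]
total, gendered = label_stats(people_labels)
```

Comparing whole words rather than string suffixes is a design choice made for this sketch; a suffix match would also count labels such as "fireman" and would treat every label ending in "female" as ending in "male".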

About

License: Apache License 2.0

