Matt Post <post@cs.jhu.edu>
January 30, 2012

This document describes how to repeat the experiments described in my 2011
paper:

  @inproceedings{post2011judging,
    Address = {Portland, Oregon, USA},
    Author = {Post, Matt},
    Booktitle = ACL2011,
    Month = {June},
    Title = {Judging Grammaticality with Tree Substitution Grammar Derivations},
    Year = {2011},
    url = {www.aclweb.org/anthology/P/P11/P11-2038.pdf}
  }

It includes the data and code used to extract TSG derivations and the
Charniak & Johnson (2005) feature set, plus the environment used to evaluate
arbitrary feature sets in a simple, extensible way. Due to LDC licensing
restrictions, it does not include the data splits we used for our
experiments. If you wish to have those splits and have the appropriate LDC
license, please email me and I'll send them to you.

1. Download my code for building TSGs, which can be found on GitHub. Note
   that you do not need to build your own TSG, since this repository
   includes the TSG I used in my experiments, but the code contains a number
   of support scripts that you will need here.

     git clone git@github.com:mjpost/dptsg.git

   Then set the environment variable DPTSG to point to that directory. In
   bash:

     export DPTSG=$(pwd)/dptsg

   Next, download my modifications to Mark Johnson's code for CKY parsing:

     git clone git@github.com:mjpost/cky.git

   This code includes modifications I added to enable parsing flattened
   versions of TSGs, to work with our black-box parallelizer, and to
   incorporate some convenient command-line options.

2. Edit the file builddir.sh. At the top, there are two environment
   variables you need to define: (1) DPTSG, as above, and (2) "basedir",
   which should point to the directory containing this README file:

     export basedir=$(pwd)

3. Compile Mark Johnson's CKY code. My version of this code contains some
   modifications that enable it to parse TSG grammars.

     make -C cky/

4. To compute TSG features over a corpus, you need to parse the corpus with
   the TSG grammar and then extract the TSG features from the resulting
   derivations. This requires a number of pre- and post-processing steps,
   which convert unknown words in the corpus, flatten the TSG, parse with
   it, and restore the fragments afterward. All of this functionality is
   contained in the "builddir.sh" script. To run it, simply point it at a
   directory containing a single file named "words", which holds the
   sentences of the corpus, one per line:

     bash builddir.sh DIR

   Alternatively, you can pass the directory as an environment variable
   (which makes it amenable to qsub), e.g.,

     qsub -v dir=DIR builddir.sh

   As mentioned, builddir.sh expects the directory DIR to contain a file
   named "words" with the sentences to parse and process, one per line. It
   will then

   - preprocess the file to mark and convert OOVs
   - parse with the grammar
   - restore the TSG fragments from the flattened versions the Johnson
     parser produces

   A toy end-to-end invocation is sketched at the end of this step.

   Note that the script I've provided parses sentences of at most 100 words
   sequentially. Mark Johnson's CKY parser is exhaustive, which makes it
   somewhat slow. If you want to parallelize the parsing, you can use the
   included black-box parallelizer (written by Adam Lopez). Enable it by
   uncommenting the appropriate line in builddir.sh and commenting out the
   sequential version. You will also have to edit environment/LocalConfig.pm
   to add your environment, which describes how to call qsub. If you want to
   use the parallelizer, compile it by typing

     make -C parallelize/
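   For concreteness, here is a minimal sketch of running builddir.sh on a
   toy corpus. The directory name and sentences are just placeholders;
   everything else comes from the repositories cloned above.

     # create a directory with a "words" file, one tokenized sentence per line
     mkdir -p example
     printf 'The cat sat on the mat .\nColorless green ideas sleep furiously .\n' > example/words

     # preprocess, parse, and extract features for that directory
     bash builddir.sh example

     # the directory should now contain the feature files described in
     # step 5, e.g. example/rules
     ls example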
5. When builddir.sh is done, the directory you passed it will contain a
   number of files containing different feature sets. These files are all
   parallel to "words", so that, for example, line 17 of each file
   corresponds to the features extracted for sentence 17. With respect to
   TSGs, the feature file you care about is "rules", which contains counts
   of the TSG fragments used in the Viterbi derivation of each sentence. The
   format of this file is

     fragment:count fragment:count ...

   where "fragment" is a TSG fragment (collapsed to remove colons and
   spaces) and "count" is the number of times it was seen. This facilitates
   conversion for toolkits such as SVM-light.

6. My classification environment relies on six data sets: positive and
   negative training, development, and test data. As described in the paper,
   training proceeds on the training data, dev is used to tune the
   regularization parameter, and the best model is then used to score the
   test set. The training and evaluation script is eval.sh. It assumes the
   existence of the following six directories, which correspond to the six
   data sets just described:

     train/good train/bad dev/good dev/bad test/good test/bad

   The script is called with

     ./eval.sh FEATURE1 FEATURE2 FEATURE3 ...

   It then searches for a file named FEATURE1 in *each* of the six
   directories. Each line of these files represents a single training or
   testing example and contains any number of feature:value pairs,
   corresponding to the features extracted for that sentence. For example,
   the invocation

     ./eval.sh sentlens rules

   would look for the files {train,dev,test}/{good,bad}/{sentlens,rules}.
   The sentlens files would contain something like

     sentlen:34
     sentlen:25
     ...

   and the "rules" files are as described above. The eval.sh script
   constructs files usable by liblinear in a directory named
   "run.FEATURE1+FEATURE2+...". The main thing to note when adding your own
   features is that each file must contain feature:value pairs and that the
   feature names should be globally unique; a toy example of adding a new
   feature set is sketched at the end of this README.

7. The builddir.sh script described above can be used to produce feature
   sets easily. Just create the six directories and, within each, a file
   "words" that contains the sentences. Then call:

     ./builddir.sh train/good
     ./builddir.sh train/bad
     ./builddir.sh dev/good
     ./builddir.sh dev/bad
     ./builddir.sh test/good
     ./builddir.sh test/bad

8. Download liblinear from http://www.csie.ntu.edu.tw/~cjlin/liblinear/ .
   Then edit the variables "train" and "predict" at the top of eval.sh to
   point to the liblinear binaries.
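As a concrete illustration of adding your own feature set (see step 6), here
is a minimal sketch that generates a hypothetical sentence-length feature,
mirroring the "sentlens" example above. It assumes the six directories from
step 6 already exist, each with its "words" file, and that the feature name
"sentlen" is not used by any other feature set.

  # "words" has one tokenized sentence per line, so awk's NF (the token
  # count) serves as the sentence length; each output file is parallel to
  # "words", with one feature:value pair per line
  for d in train/good train/bad dev/good dev/bad test/good test/bad; do
      awk '{ print "sentlen:" NF }' $d/words > $d/sentlens
  done

  # train and evaluate on sentence length plus the TSG fragment counts
  ./eval.sh sentlens rules

-- If you have any questions, please feel free to email me and ask.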