Tagger for explicit cause-and-effect relationships in text

Causeway causal language tagger

Causeway is a system for detecting explicit causal relations in text. It tags text using the BECAUSE 1.0 annotation scheme, described in Dunietz et al., 2015. The system itself is described in Dunietz et al., 2017.

Note that the repository includes some code for reading in data in an updated version of the annotation scheme (BECAUSE 2.x). This newer scheme is backwards-compatible with the original.

The steps to reproduce the results from the 2017 Causeway paper are given below. If you have any difficulty doing so or have additional questions, please contact Jesse Dunietz, who will be happy to assist.

NOTE: You may also be interested in DeepCx, a neural network tagger that supersedes Causeway. DeepCx achieves substantially better performance on all versions of the BECAUSE dataset.

Running the tagger

To reproduce the results from the Causeway paper:

  1. You'll want to do this in Ubuntu, the only platform Causeway has been tested on. It may work on other *nix platforms, but you'll be on your own for getting it to do so.

    You'll need some standard Ubuntu packages, which you can install using apt if you don't have them:

    sudo apt install git python2 python-pip sed task-spooler default-jdk # or any JDK
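
    If you are unsure whether everything installed correctly, a quick PATH check can save a failed run later. The sketch below is not part of Causeway: `missing_tools` is a hypothetical helper, and the command names passed to it are assumptions about what the packages above provide (for example, Ubuntu's task-spooler package is assumed to install its binary as `tsp`).

    ```shell
    # Print the name of every command in the argument list that is not on PATH.
    # Hypothetical helper for illustration; not part of the Causeway repository.
    missing_tools() {
        for tool in "$@"; do
            command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
        done
    }

    # Command names are assumptions about what the apt packages provide.
    missing_tools git python2 pip2 sed tsp java javac
    ```

    No output means every listed command was found.
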
  2. Install the external Python packages that Causeway depends on:

    sudo pip2 install bidict colorama nltk cython python-gflags numpy scipy scikit-learn python-crfsuite

    Also make sure that NLTK has access to WordNet:

    python2 -c "import nltk; nltk.download('wordnet')"

  3. Clone the Causeway repository, including the NLPypline framework for NLP pipelines (included as a Git submodule):

    git clone --recursive https://github.com/duncanka/Causeway.git

    We'll refer to the resulting Causeway directory as $CAUSEWAY_DIR.

  4. Compile the one Cython file in the project:

    (cd $CAUSEWAY_DIR/NLPypline/src/nlpypline/util && cythonize -i streams.pyx)
    
  5. Reconstitute the BECAUSE 1.0 corpus. (Of course, you can also use the latest version of BECAUSE if you are not trying to reproduce the Causeway paper results.)

    1. Clone the repository from whatever directory you'd like the data to live in.

      git clone https://github.com/duncanka/BECAUSE.git
      (cd BECAUSE && git checkout 1.0) # skip if using latest BECAUSE version

      We'll refer to the resulting directory named BECAUSE as $BECAUSE_DIR.

    2. Extract the raw WSJ text corresponding to the PTB subset used in BECAUSE. Assuming you have the PTB2 files unpacked in $PTB_DIR (with the same directory structure as the official CD), run the following:

      for ANN_FILE in "$BECAUSE_DIR"/PTB/*.ann; do
          BASE_FILE=$(basename "$ANN_FILE")
          DIGITS=$(echo "$BASE_FILE" | cut -d'_' -f2)
          # Skip the first two lines of the raw PTB file and keep the rest.
          tail -n +3 "$PTB_DIR/raw/${DIGITS:0:2}/${BASE_FILE%.*}" > "$BECAUSE_DIR/PTB/${BASE_FILE%.*}.txt"
      done

      You should end up with a bunch of .txt files alongside the .ann files in the PTB subdirectory.
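
      To make the path arithmetic in the loop concrete, the sketch below repeats the same derivation for a single file name without touching any data. `ptb_raw_path` is a hypothetical helper, `/data/ptb` and `/data/because` are placeholder paths, and `wsj_1457` is used purely as an example file name.

      ```shell
      # Map an annotation file to the raw PTB text file the loop reads from.
      # Hypothetical helper for illustration; not part of Causeway or BECAUSE.
      ptb_raw_path() {
          ptb_dir=$1
          base_file=$(basename "$2")                              # wsj_1457.ann
          digits=$(printf '%s\n' "$base_file" | cut -d'_' -f2)    # 1457.ann
          section=$(printf '%s\n' "$digits" | cut -c1-2)          # 14, same as ${DIGITS:0:2}
          printf '%s/raw/%s/%s\n' "$ptb_dir" "$section" "${base_file%.*}"
      }

      ptb_raw_path /data/ptb /data/because/PTB/wsj_1457.ann
      # prints /data/ptb/raw/14/wsj_1457
      ```

      The loop then uses `tail -n +3` to copy that file from its third line onward, so the first two lines of each raw file are dropped.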

    3. Run the NYT text extraction script on your LDC-licensed copy of the NYT corpus (assumed here to be stored in $NYT_DIR):

      python2 $BECAUSE_DIR/scripts/extract_nyt_txt.py $BECAUSE_DIR/NYT $(for FNAME in $BECAUSE_DIR/NYT/*.ann; do find $NYT_DIR -name $(basename "${FNAME%.ann}.xml"); done)

      Again, you should end up with a bunch of .txt files alongside the .ann files in the NYT subdirectory.
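
      Before moving on, it may be worth confirming that every annotation file actually received its text file, since a missing one would otherwise surface only later. The following is a minimal sketch; `ann_without_txt` is a hypothetical helper and not part of Causeway or BECAUSE.

      ```shell
      # List every .ann file in a directory that has no matching .txt file next to it.
      # Hypothetical helper for illustration; not part of Causeway or BECAUSE.
      ann_without_txt() {
          for ann_file in "$1"/*.ann; do
              [ -e "$ann_file" ] || continue            # directory has no .ann files
              [ -f "${ann_file%.ann}.txt" ] || printf '%s\n' "$ann_file"
          done
      }
      ```

      For example, `ann_without_txt "$BECAUSE_DIR/PTB"` and `ann_without_txt "$BECAUSE_DIR/NYT"` should both print nothing once the two extraction steps have succeeded.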

  6. Set up version 3.5.2 of the Stanford parser.

    1. Download the full Stanford CoreNLP package. Unzip it somewhere, resulting in a folder called stanford-corenlp-full-2015-04-20 (henceforth, $STANFORD_DIR).

    2. Unzip the pretrained PCFG and NER models:

      unzip $STANFORD_DIR/stanford-corenlp-3.5.2-models.jar edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz -d $STANFORD_DIR
      unzip -j $STANFORD_DIR/stanford-corenlp-3.5.2-models.jar edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz -d $STANFORD_DIR/classifiers

    3. Apply the Causeway-specific patches to the Stanford parser. The following hacky script should do the trick:

      mkdir /tmp/stanford-sources
      unzip $STANFORD_DIR/stanford-corenlp-3.5.2-sources.jar -d /tmp/stanford-sources
      cp $CAUSEWAY_DIR/stanford-patches/*.patch /tmp/stanford-sources
      (cd /tmp/stanford-sources && {
          for PATCH in *.patch; do
              patch -p 2 < $PATCH
          done
      })
      TO_RECOMPILE=$(grep '+++' /tmp/stanford-sources/*.patch | sed -e 's/.*\(edu.*\.java\).*/\1/' | sort | uniq)
      for SRC_FILE in $TO_RECOMPILE; do
          javac -cp /tmp/stanford-sources "/tmp/stanford-sources/$SRC_FILE"
          for CLASS_FILE in /tmp/stanford-sources/${SRC_FILE%.java}*.class; do
              jar uf $STANFORD_DIR/stanford-corenlp-3.5.2.jar -C /tmp/stanford-sources/ "${CLASS_FILE#/*/*/}"
          done
      done
      
      rm -R /tmp/stanford-sources

      You might see a bit of error output from the Java compiler. Don't worry about it.
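
      Two terse pieces of that script are easier to trust once seen on a single input: the sed expression that pulls source paths out of the patch headers, and the parameter expansion that turns an absolute class file path into one relative to the sources directory. The sketch below runs both on a made-up patch header; the file name `Example.java` and the `b/src/` prefix are assumptions for illustration, not taken from the actual patches.

      ```shell
      # A hypothetical line as grep would print it for a '+++' patch header.
      # The file name Example.java is made up for illustration.
      patch_line='/tmp/stanford-sources/x.patch:+++ b/src/edu/stanford/nlp/trees/Example.java'

      # Same sed expression as in the script: keep only the edu/...java path.
      src_file=$(printf '%s\n' "$patch_line" | sed -e 's/.*\(edu.*\.java\).*/\1/')
      printf '%s\n' "$src_file"      # prints edu/stanford/nlp/trees/Example.java

      # Same expansion as in the script: drop the shortest prefix matching /*/*/,
      # which here is /tmp/stanford-sources/, leaving the path relative to it.
      class_file="/tmp/stanford-sources/${src_file%.java}.class"
      printf '%s\n' "${class_file#/*/*/}"   # prints edu/stanford/nlp/trees/Example.class
      ```

      Note that the prefix-stripping expansion relies on the sources directory sitting exactly two levels deep, as /tmp/stanford-sources does. If you unpack the sources somewhere else, the pattern needs adjusting.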

    4. Create the TRegex/TSurgeon run scripts (adapted from the standalone TRegex download).

      printf '#!/bin/bash\nexport CLASSPATH=$(dirname $0)/stanford-corenlp-3.5.2.jar:$CLASSPATH\njava -mx100m edu.stanford.nlp.trees.tregex.TregexPattern "$@"\n' > $STANFORD_DIR/tregex.sh
      printf '#!/bin/bash\nexport CLASSPATH=$(dirname $0)/stanford-corenlp-3.5.2.jar:$CLASSPATH\njava -mx100m edu.stanford.nlp.trees.tregex.tsurgeon.Tsurgeon "$@"\n' > $STANFORD_DIR/tsurgeon.sh
      chmod ugo+x $STANFORD_DIR/tregex.sh $STANFORD_DIR/tsurgeon.sh

  7. Run the Stanford parser on the data:

    for DATA_DIR in $BECAUSE_DIR/PTB $BECAUSE_DIR/NYT $BECAUSE_DIR/CongressionalHearings; do
        $CAUSEWAY_DIR/scripts/preprocess.sh $DATA_DIR $STANFORD_DIR
    done

  8. For the PTB files, extract the gold-standard parse trees (to enable gold-standard parse experiments).

    1. Correct a silly PTB tokenization error in one of the .mrg files that breaks the system:

      (cd $PTB_DIR/combined/14/ && patch -p1 < $CAUSEWAY_DIR/wsj_1457.mrg.patch)

      (If you don't want to modify your main PTB copy, you can copy the PTB data over to a new directory and point $PTB_DIR to it.)

    2. Run the command to extract the trees:

      $CAUSEWAY_DIR/scripts/convert-mrg.sh $BECAUSE_DIR/PTB $PTB_DIR/combined $STANFORD_DIR

  9. Run the system.

    1. Edit the BECAUSE_DIR and STANFORD_DIR variables in run_all_pipelines.sh to match your setup.
    2. Run the script from the root Causeway directory.
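
The steps above lean on several directory variables ($CAUSEWAY_DIR, $BECAUSE_DIR, $STANFORD_DIR), and a typo in any of them tends to fail late. As a rough pre-flight check, assuming you have set those variables in your current shell, something like the following can be run first. `check_dirs` is a hypothetical helper and not part of the Causeway repository.

```shell
# Report which of the given variables are unset or do not name a directory.
# Hypothetical helper for illustration; not part of the Causeway repository.
check_dirs() {
    status=0
    for name in "$@"; do
        # Look up the variable whose name is stored in $name (empty if unset).
        eval "value=\${$name:-}"
        if [ ! -d "$value" ]; then
            printf '%s is unset or not a directory: %s\n' "$name" "$value" >&2
            status=1
        fi
    done
    return "$status"
}

check_dirs CAUSEWAY_DIR BECAUSE_DIR STANFORD_DIR ||
    echo "fix the paths above before running run_all_pipelines.sh"
```

The function takes variable names rather than values so that the message can say which variable is wrong.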

Citations

Dunietz, Jesse, Lori Levin, and Jaime Carbonell. Automatically Tagging Constructions of Causation and Their Slot-Fillers. Transactions of the Association for Computational Linguistics (2017).

Dunietz, Jesse, Lori Levin, and Jaime Carbonell. Annotating Causal Language Using Corpus Lexicography of Constructions. Proceedings of LAW IX – The 9th Linguistic Annotation Workshop (2015): 188-196.

License: MIT License