cainesap / syn-agreement

Compute alpha-agreement for syntactic data

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

This package requires one more library, in addition to the files shipped with
the program: Tim Henderson's Zhang-Shasha library, available at
https://github.com/timtadh/zhang-shasha. It is known to work with revision
138c991, and should work with versions after commit 7c910cc. This includes the
most recent version on PyPi at the time of writing, version 1.1.

SYNOPSIS:
    syn-agreement.py [--tree|--conll] [--acc] [--metric=plain|diff|norm|all] fileA fileB
    syn-agreement.py [--tree|--conll] [--acc] [--metric=plain|diff|norm|all] --dirs dir...

DESCRIPTION:
    --tree
        Read phrase structure trees instead of dependency trees. The phrase
        structure format is slightly idiosyncratic. See the section "Phrase
        structure format" below for details.

    --conll
        Read CoNLL formatted dependency trees. This is the default.

    --acc
        Compute uncorrected accuracies in addition to alpha score. For
        dependency trees, UAS, LAS and label accuracy is computed, and for
        phrase structure trees Jaccard similarity is computed.

    --metric=plain|diff|norm|all
        Select the metric to use, or compute all metrics at the same time. The
        default metric is the plain metric.

        NOTE: For any use beyond the reproduction of the results presented in
        Skjærholt (2014) we discourage the use of any other metric that
        α_plain.

    --dirs
        Enable multi-annotator mode. In cases where there are more than two
        annotators, it is common that not all annotators have annotated all of
        the texts. Therefore, we use a mode of operation where each
        annotator's output is in a separate directory. Sentences from files
        with matching names will be grouped together to account for missing
        annotations.

        The file and directory structure must follow the following convention:
        We assume the basename of the directory path to be the "name" of the
        annotator, and the files within to be named thusly:
        $prefix-$name.conll (or $prefix-$name.tree for constituency trees),
        where files with the same prefix in different directories are assumed
        to contain *exactly* the same sentences.

        If the --acc option is also passed, pairwise accuracies are computed.

PHRASE STRUCTURE FORMAT:
    Assume we have the following tree for the sentence "I saw the dog":
                       S
                       ^
                      / \
                     /   VP
                    |     ^
                    |    / \
                    |   /   NP
                   NP  |     ^
                    |  |    / \
                    P  V   D   N
                    |  |   |   |
                    I saw the dog

    The program then expects the tree to be stored *delexicalised* as follows:
        (S (NP P) (VP V (NP D N)))

BUGS
    Probably. If you find any, please create an issue in the GitHub repository
    at <https://github.com/arnsholt/syn-agreement/issues> or contact the
    author by email.

AUTHOR
    Arne Skjærholt <arnsholt@gmail.com>

    Also, many thanks to Andreas Peldszus for invaluable help with finding and
    debugging issues before the initial realease of the code.

LICENCES:
    The files syn-agreement.py and conll.py are (c) 2014 Arne Skjærholt and
    released under the GNU GPL version 2 or later:
    <http://gnu.org/licenses/gpl.html>

    The code in alpha.py is (c) 2011-2014 Thomas Grill and released under the
    Creative Commons Attribution-ShareAlike licence:
    <http://creativecommons.org/licenses/by-sa/3.0/>

    The data from the Norwegian Dependency Treebank in data/ndt/ is free for
    all uses, as long as they are not published as running, human readable
    text.

    The data from the Copenhagen Dependency Treebanks in data/cdt/ is licenced
    under the GNU GPL version 2: <http://gnu.org/licenses/gpl.html>

    The SSD dataset in data/ssd/ is released under the MIT licence.

About

Compute alpha-agreement for syntactic data