WladimirSidorenko / PotTS

The Potsdam Twitter Sentiment Corpus

Description

The MIT License

This directory contains the data of the Potsdam Twitter Sentiment Corpus (ISLRN 714-621-985-491-3). To open the files of this corpus, you need to download and launch MMAX2—a freely distributed annotation tool—and then select one of the *.mmax projects from the directories corpus/annotator-1/ or corpus/annotator-2/.
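If you are unsure which projects are available before launching MMAX2, a small script can list them. The following is only a minimal sketch: the helper name `find_mmax_projects` is hypothetical and not part of this repository, and the directory passed to it is assumed to follow the layout described above.

```python
from pathlib import Path


def find_mmax_projects(annotator_dir):
    """Return the sorted *.mmax project files located directly in annotator_dir.

    NOTE: hypothetical helper for illustration; it is not shipped with the corpus.
    """
    return sorted(Path(annotator_dir).glob("*.mmax"))


if __name__ == "__main__":
    # Assumed path, taken from the description above; adjust it to your checkout.
    for project in find_mmax_projects("corpus/annotator-2"):
        print(project)
```

Any of the printed paths can then be opened as a project in MMAX2.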

Folder Structure

The folders of this project are structured as follows:

  • corpus/ – directory containing corpus files;

    • annotator-1/ – directory containing MMAX projects for the first annotator;
      • markables/ – directory containing annotation files for the first annotator;
    • annotator-2/ – directory containing MMAX projects for the second annotator;
      • markables/ – directory containing annotation files for the second annotator;
    • basedata/ and source/ – original corpus tokenization;
    • custom/, scheme/, and style/ – auxiliary MMAX2 data;
  • docs/ – directory containing annotation guidelines and other accompanying documents;

  • scripts/ – directory containing scripts that were used to process corpus data;

    • examples/ – directory containing examples of input files for the scripts;
    • align.py – auxiliary module used for annotation alignment;
    • alt_fio.py – auxiliary module for AWK-like input/output operations;
    • conll.py – auxiliary module for handling CoNLL sentences;
    • measure_corpus_agreement.py – script for measuring inter-annotator agreement on the corpus;
    • merge_conll_mmax.py – script for aligning annotations from the corpus with the automatically processed CoNLL data;

Example invocations are given in the script files themselves; you can also run any script with the --help option to see its usage.

Note

I strongly recommend using the annotation of annotator-2 on the branch eexpression-revision (run git checkout eexpression-revision after cloning this project).
