SowaLabs/Tweebo

========================================================================================
TweeboParser -- A Dependency Parser for Tweets.
Version 1.0

Written and maintained by Lingpeng Kong (lingpenk [at] cs.cmu.edu).
========================================================================================

This file is part of TweeboParser, a project started at the computational
linguistics research group, ARK (http://www.ark.cs.cmu.edu/), in Carnegie Mellon 
University.

This package contains the implementation of the dependency parser for English Tweets
described in:

[1] A Dependency Parser for Tweets. Lingpeng Kong, Nathan Schneider, Swabha Swayamdipta,
Archna Bhatia, Chris Dyer, and Noah A. Smith. In Proceedings of the Conference on Empirical
Methods in Natural Language Processing (EMNLP 2014), Doha, Qatar, October 2014.

TweeboParser is based on TurboParser (http://www.ark.cs.cmu.edu/TurboParser/).
The preprocessing steps depend on Twitter Part-of-Speech Tagger
(http://www.ark.cs.cmu.edu/TweetNLP/). This package is self-contained, it includes
all the required software.

========================================================================================

TweeboParser is free software; you can redistribute it and/or modify it under
the terms of the GNU Lesser General Public License as published by the Free
Software Foundation; either version 2 of the License, or (at your
option) any later version.

TweeboParser is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public License
for more details.

You should have received a copy of the GNU Lesser General Public License
along with this program; if not, write to the Free Software Foundation, Inc., 
59 Temple Place, Suite 330, Boston, MA 02111-1307 USA.

========================================================================================
Contents
========================================================================================

1. TweeboParser

	1.1 Compiling

	1.2 Example of usage

	1.3 Directory Structure

2. Tweebank

========================================================================================
1.	TweeboParser
========================================================================================
1.1	Compiling
========================================================================================
To compile the code, first unzip/tar the downloaded file:

> tar -zxvf TweeboParser-1.0.tar.gz
> cd TweeboParser-1.0

Next, run the following command

> ./install.sh

This will install TweeboParser and all its dependencies.

========================================================================================
1.2	Example of usage
========================================================================================
To run the TweeboParser on raw text input with one sentence per line (e.g. on the
sample_input.txt):

> ./run.sh sample_input.txt

The run.sh file contains the steps we run the whole TweeboParser pipeline, including
twokenization, POS tagging, appending brown clustering features and PTB features etc.

The output file will be "sample_input.txt.predict" in the same directory as
"sample_input.txt".

which contains the CoNLL format (http://ilk.uvt.nl/conll/#dataformat) output of the
parse tree. (HEAD < 0 means the word is not included in the tree)

========================================================================================
1.3	Directory Structure
========================================================================================

ark-tweet-nlp		----	The Twitter POS Tagger (http://www.ark.cs.cmu.edu/TweetNLP/)
pretrained_models	----	Tagging, token selection, brown clusters obtained from
				Owoputi et al. (2012) and pre-trained parsing models for PTB
							and Tweets.
scripts			----	Supporting scripts.
TBParser		----	TweeboParser, which is based on TurboParser version 2.1.0.
							The source code of TweeboParser can be found at TBParser/src
token_selection		----	The token selection tool implemented in Python.
Tweebank		----	Tweebank data release.
working_dir		----	The working space for the parser. The temp files generated by
				TweeboParser when parsing a sentence are putted here. (So don't
							remove or rename it.)
run.sh			----	The bash script which runs the parser on raw inputs (see sec. 1.2).
install.sh		----	The bash script which installs everything (see sec. 1.1).

========================================================================================
2.	Tweebank
========================================================================================

Most of TWEEBANK was built in a day by two dozen annotators, most of whom had only
cursory training in the annotation scheme.

Train_Test_Splited	----	The CoNLL format train and test split we use in the paper.
							"Train" and "Test-New" in the paper.

Raw_Data		----	The json format file which contains all the annotation we have.
				(Note that some of them are under-specific in annotation. But We
				only use the ones with the full annotation.)

				The CoNLL format file contains all the full annotated data we
				have, the MWE is pre-processed by first-order TurboParser 
				trained on the Penn Treebank. (See the paper for more details.)

========================================================================================
SowaLabs / Tweebo

About

Languages