paulomalvar / Galician-Dependency-Treebank

This Galician Dependency Treebank has been developed by transliterating and adapting lexically the Portuguese part (Bosque 7.3 by the Floresta sintá(c)tica project) of the CONLL-X 2006.

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

This is the readme_first file for the Galician Dependency Treebank v0.5 (UTF-8).

San Diego, March 17th 2008:

Developer: Paulo Malvar Fernández
email: paulomal@gmail.com


1- Introduction:

This Galician Dependency Treebank has been developed by transliterating and adapting lexically the Portuguese part (Bosque 7.3 by the Floresta sintá(c)tica project) of the CONLL-X 2006, New York City June 8-9 2006.

The original Portuguese Dependency Treebank ws downloaded from:

http://nextens.uvt.nl/~conll/free_data.html

More information about that original Portuguese data can be found at:

http://acdc.linguateca.pt/treebank/info_floresta_English.html

The readme (README.txt) file that comes with the original Portuguese data is also included in this package. Therefore, further information regarding the Portuguese part of the CONLL-X 2006 can be found in it.

2- License

Consistently with the availability of the original Portuguese data, Galician Treebank is released, under the GPLv2 license, as a freely available contribution to the research community.


3- Development preprocessing steps:

The processing steps for development of the Galician version of the Portuguese Dependency Treebank were two:

- Transliteration: The original Portuguese data were transliterated into Galician using port2gal (written in Perl). port2gal original implementation was made by Berto Garcia. However for the development of this Galician Treebank, Pablo Gamallo's version was used. port2gal is freely available at:

http://gramatica.usc.es/~gamallo/

- Lexical adaptation and orthographical correction: Because the transliteration process ouput is not quite Galician, an additional process of lexical adaptation and orthographical correction was necessary.


4- Technical specs:

Version 0.5 of the Galician Dependency Treebank keeps the same syntactic structure of the sentences as the original. Although this syntax is not in some case exactly the same of Galician it is really close and by most speakers could be considered a correct (although somehow weird) variation of their syntactic rules.
	Of course a completely revised version of a Galician Dependency Treenbank would need to go through the syntactic adaptation of those cases in which the syntactic structure of Portuguese and Galician are not exactly the same. For this reason, any contribution, as well as any comment or suggestion, to the development of a more advanced version of this Galician Dependency Treebank will be welcome.
	Differently from the original Portuguese data, the Galician Dependency Treenbank is not in CONLL-X format, but rather it is in MaltTAB data format. This means is that each entry instead of having 8 different information fields, it has only 4.
	The 4 info fiels found in each entry are the following (as described in README.txt):

Field 1: FORM

Word form or punctuation mark.

Field 2: POSTAG         

Fine-grained part-of-speech tag. This is the part-of-speech as annotated in the Bosque Treebank. See http://visl.sdu.dk/visl/pt/symbolset-floresta.html
2.3. Word classes (terminals)

Fields 3: HEAD         

Head of current token, which is either a value of ID or zero ('0'). A value of zero means the token attaches to the virtual root node.
The dependency structure resulting from the HEAD information can be non-projective. This is certainly true for the converted Bosque treebank, as the original treebank contains discontinuous constituents and the head rules (unlike Collins-style head rules) also introduce non-projectivity.

Field 8: DEPREL         

Dependency relation to the HEAD. 
See http://visl.sdu.dk/visl/pt/symbolset-floresta.html
1. Function labels (F:...)

The following DEPREL values occur in the CoNLL-X data
(comments only for those that are not explained (as of 08/01/2006) on the English web page):

	  >A          
          >N
          >P          premodifier of preposition (?)
          >S
          ?           unclear function (?)
          A<
          A<PRED      ??? (rare)
          ACC
          ACC-PASS    ??? (rare)
          ACC>-PASS   ??? (rare)
          ADVL
          ADVO
          ADVS
          APP
          AS<
          AUX
          AUX<        constituent containing coordinated main verb
          CJT
          CJT&ADVL
          CJT&PRED
          CMD         contrary to what the English web page
                      currently says, "CMD" is used for "command" and "COM" 		      for something else 
          CO
          COM         contrary to what the English web page currently says, 		      this is not "command" but probably a comparison 			      conjunction or adverb (e.g. "como" and "que")
          DAT
          EXC
          FOC
          H
          KOMP<
          MV
          N<
          N<PRED
          NUM<
          OC
          P
          P<
          PASS
          PCJT        "PCJT are a cross of conjuncts and
                       predication, used for pp-chaining, where the second
                       depends on the first, rather than be merely 			       conjuncted"
          PIV
          PMV         main verb within coordination
          PRD         used for "como" in certain (?) constructions ...
          PRED
          PRT-AUX     variant (?) of PRT-AUX<
          PRT-AUX<    particle belonging to auxiliary, e.g. "a", "de", 		              "por", "que"
          PRT-AUX>    typo (?) for PRT-AUX<
          PUNC
          QUE
          S<
          SC
          STA
          SUB
          SUBJ
          TOP
          UTT
          VOC          the 7.3 treebank version is inconsistent in
          VOK          the spelling of VOC/VOK 


For information in the textual material used in the original Portuguese Dependency Treebank see section "2.2 Text" in CONLL-X_README_v1.1.txt.

About

This Galician Dependency Treebank has been developed by transliterating and adapting lexically the Portuguese part (Bosque 7.3 by the Floresta sintá(c)tica project) of the CONLL-X 2006.