NAMD / ptwp_tagger

Tagging Portuguese Wikipedia with PyPLN and Palavras

Tagging Portuguese Wikipedia

We want to part-of-speech tag Portuguese Wikipedia using PyPLN and Palavras (for which we have a license). The goals of this project are:

  • Release a part-of-speech tagged Portuguese Wikipedia Corpus under a Creative Commons license.
  • Train a part-of-speech tagger with NLTK and release it under a free/libre software license.

Assumptions

  • We're going to use all Portuguese Wikipedia articles (pages).
  • We will probably use the Palavras tagset, which can later be translated to NLTK's tagset.
  • We won't use an incremental tagger (the entire corpus will be loaded into memory to train an NLTK tagger).
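
The tagset translation mentioned above could be a plain lookup table applied to every (word, tag) pair. The sketch below is only an illustration: the mapping `PALAVRAS_TO_UNIVERSAL`, the helper names, and the choice of the coarse universal tag names as the target are assumptions, not part of this project, and the Palavras tags listed have not been checked against the parser's actual output.

```python
# Hypothetical mapping from Palavras-style word-class tags to coarse
# universal-style tags. Illustrative only; the real table would have to be
# built from the tags Palavras actually emits.
PALAVRAS_TO_UNIVERSAL = {
    "N": "NOUN",
    "PROP": "NOUN",
    "V": "VERB",
    "ADJ": "ADJ",
    "ADV": "ADV",
    "DET": "DET",
    "PERS": "PRON",
    "PRP": "ADP",
    "NUM": "NUM",
    "KC": "CONJ",
    "KS": "CONJ",
}


def translate_sentence(tagged_sentence, mapping=PALAVRAS_TO_UNIVERSAL, unknown="X"):
    """Return a new list of (word, tag) pairs with every tag translated.

    Tags missing from the mapping are replaced by `unknown`.
    """
    return [(word, mapping.get(tag, unknown)) for word, tag in tagged_sentence]


def translate_corpus(tagged_sentences, mapping=PALAVRAS_TO_UNIVERSAL, unknown="X"):
    """Translate a whole in-memory corpus (a list of tagged sentences)."""
    return [translate_sentence(s, mapping, unknown) for s in tagged_sentences]


if __name__ == "__main__":
    # "IN" is deliberately left out of the mapping to show the fallback.
    sentence = [("O", "DET"), ("gato", "N"), ("dorme", "V"), ("ah", "IN")]
    print(translate_sentence(sentence))
```

The result keeps the list-of-sentences shape, each sentence being a list of (word, tag) tuples, which is the form NLTK's trainable taggers accept as training data. Holding the whole list in memory matches the non-incremental assumption above.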

Next Goals

  • Split the corpus (and the tagger) by Wikipedia Portal, so that we have a tagged corpus per subject.
  • Compare the taggers (Palavras versus NLTK with our trained tagger).
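
Both next goals can be sketched with plain Python data structures. The example below assumes that each article arrives as a (portal name, tagged sentences) pair and that the two taggers ran over identically tokenized text. The function names `group_by_portal` and `agreement` are hypothetical, and how articles are actually assigned to portals is not specified here.

```python
from collections import defaultdict


def group_by_portal(articles):
    """Group tagged sentences by portal name.

    `articles` is an iterable of (portal, tagged_sentences) pairs; this
    input shape is an assumption made for the sketch.
    """
    by_portal = defaultdict(list)
    for portal, sentences in articles:
        by_portal[portal].extend(sentences)
    return dict(by_portal)


def agreement(reference, candidate):
    """Fraction of tokens that two taggings of the same text tag identically."""
    if len(reference) != len(candidate):
        raise ValueError("both taggings must cover the same sentences")
    total = 0
    matches = 0
    for ref_sent, cand_sent in zip(reference, candidate):
        if len(ref_sent) != len(cand_sent):
            raise ValueError("sentences must be tokenized identically")
        for (_, ref_tag), (_, cand_tag) in zip(ref_sent, cand_sent):
            total += 1
            matches += ref_tag == cand_tag
    return matches / total if total else 0.0


if __name__ == "__main__":
    palavras = [[("O", "DET"), ("gato", "N"), ("dorme", "V")]]
    trained = [[("O", "DET"), ("gato", "N"), ("dorme", "N")]]
    print(agreement(palavras, trained))
```

Token-level agreement is only one possible comparison; since the trained tagger learns from Palavras output, a high score shows that it reproduces Palavras, not that either tagger is correct.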

Related Links

Languages

Python 99.7%, Shell 0.3%