portuguese-nlp
NLP work on Brazilian Portuguese newswire text
You can browse the dataset online and view the annotations on Drive.
We have x newswire articles collected between 1994 and 2016. Since the articles are in HTML format, during preprocessing we first strip the tags and rename each file, for example:
folca/data/2005/01/01/19.html --> folca/parsed-data/2005_01_01_19.html
and collect them all in one folder.
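The cleaning and renaming step described above could be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not necessarily the script used in this repository. The names `strip_tags`, `flat_name`, and `preprocess` are hypothetical, and the UTF-8 encoding is an assumption (older articles may need a different one).

```python
from html.parser import HTMLParser
from pathlib import Path


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join text nodes with a space, then collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def strip_tags(html):
    """Return the visible text of an HTML string (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def flat_name(path, data_root):
    """Map <data_root>/2005/01/01/19.html to 2005_01_01_19.html."""
    relative = Path(path).relative_to(data_root)
    return "_".join(relative.parts)


def preprocess(data_root, out_dir):
    """Clean every .html file under data_root and write it into one folder."""
    data_root, out_dir = Path(data_root), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for src in sorted(data_root.rglob("*.html")):
        # Assumption: files are UTF-8; undecodable bytes are replaced.
        raw = src.read_text(encoding="utf-8", errors="replace")
        (out_dir / flat_name(src, data_root)).write_text(
            strip_tags(raw), encoding="utf-8"
        )
```

With this sketch, calling `preprocess("folca/data", "folca/parsed-data")` would write one cleaned file per article into `folca/parsed-data`. Joining text nodes with a space keeps words from adjacent block elements apart, at the cost of an occasional extra space before punctuation that follows an inline tag.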
- 1. Preprocessing the Dataset
- 2. Crawling and Organizing the Training Set
- 3. Classification with GraphLab
- 4. Reports