xiaominghero / anlp19

Course repo for Applied Natural Language Processing (Spring 2019)

Course materials for Applied Natural Language Processing (Spring 2019). Syllabus: http://people.ischool.berkeley.edu/~dbamman/info256.html

| Date | Activity | Summary |
|---|---|---|
| 1/22 | Follow setup instructions in 0.setup/ | Install Anaconda and set up the class environment with specific Python libraries. |
| 1/24 | Complete 1.words/ExploreTokenization_TODO.ipynb before class | This notebook outlines several methods for tokenizing text into words (and sentences), including whitespace, nltk (Penn Treebank tokenizer), nltk (Twitter-aware), spaCy, and custom regular expressions, highlighting differences between them. |
| 1/24 | Execute 1.words/EvaluateTokenizationForSentiment.ipynb | This notebook evaluates different methods for tokenization and stemming/lemmatization and assesses their impact on binary sentiment classification, using a train/dev sample of 1000 reviews from the Large Movie Review Dataset. Each tokenization method is evaluated with the same learning algorithm (L2-regularized logistic regression); the only difference is the tokenization process. For more, see: http://sentiment.christopherpotts.net/tokenizing.html |
| 1/24 | Complete 1.words/TokenizePrintedBooks_TODO.ipynb | Design a better tokenizer for printed texts that have been OCR'd (where words are often hyphenated at line breaks). |
| 1/29 | Complete 2.distinctive_terms/CompareCorpora_TODO.ipynb | This notebook explores two methods for comparing textual datasets to identify the terms that are distinct to each one: difference of proportions (described in Monroe et al. 2009, Fighting Words, section 3.2.2) and the Mann-Whitney rank-sums test (described in Kilgarriff 2001, Comparing Corpora, section 2.3). |
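
To give a feel for the kind of differences 1.words/ExploreTokenization_TODO.ipynb highlights, here is a minimal sketch comparing whitespace splitting with a custom regular expression. It uses only the standard library; the sample sentence and the regex are illustrative assumptions, not taken from the notebook.

```python
import re

# Illustrative sentence (an assumption, not from the course materials).
text = "Dr. Smith doesn't like it... #NLP is fun!"

# Whitespace tokenization: punctuation stays attached to words.
whitespace_tokens = text.split()

# Custom regex tokenization: keep hashtags and contractions whole,
# and split every other punctuation mark into its own token.
regex_tokens = re.findall(r"#\w+|\w+(?:'\w+)?|[^\w\s]", text)

print(whitespace_tokens)
print(regex_tokens)
```

The whitespace version keeps "fun!" as a single token, while the regex version separates "fun" from "!". Which behavior is preferable depends on the downstream task.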
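
For the OCR exercise in 1.words/TokenizePrintedBooks_TODO.ipynb, one possible starting point is to rejoin words split by a hyphen at a line break before tokenizing. The sketch below is an assumption about how this could be approached, not the notebook's solution; `dehyphenate` and `tokenize` are hypothetical helper names.

```python
import re
from typing import List

# Matches a word fragment, a hyphen, a line break, and the continuing fragment.
_LINE_BREAK_HYPHEN = re.compile(r"(\w+)-\s*\n\s*(\w+)")

def dehyphenate(text: str) -> str:
    """Rejoin words hyphenated across line breaks (naive: always joins)."""
    return _LINE_BREAK_HYPHEN.sub(r"\1\2", text)

def tokenize(text: str) -> List[str]:
    """Dehyphenate, then split into word and punctuation tokens.

    Hyphens and apostrophes inside a word are kept, so "non-trivial"
    remains a single token.
    """
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", dehyphenate(text))

sample = "The tokeni-\nzation of printed books is non-trivial."
tokens = tokenize(sample)
print(tokens)
```

This naive rule also merges genuine compounds that happen to break at a line end (for example "well-\nknown" becomes "wellknown"). Checking the joined form against a vocabulary would be one way to refine it.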
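
As a rough illustration of the difference-of-proportions idea used in 2.distinctive_terms/CompareCorpora_TODO.ipynb, the sketch below scores each word by its relative frequency in one corpus minus its relative frequency in the other. The function name and toy corpora are assumptions for illustration; the notebook's actual implementation may differ.

```python
from collections import Counter
from typing import Dict, List

def difference_of_proportions(tokens_a: List[str], tokens_b: List[str]) -> Dict[str, float]:
    """Score each word by (relative frequency in A) - (relative frequency in B).

    Assumes both token lists are non-empty.
    """
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    n_a, n_b = len(tokens_a), len(tokens_b)
    vocab = set(counts_a) | set(counts_b)
    # Counter returns 0 for words missing from a corpus.
    return {w: counts_a[w] / n_a - counts_b[w] / n_b for w in vocab}

# Toy corpora (illustrative only).
corpus_a = "the cat sat on the mat".split()
corpus_b = "the dog sat on the log".split()

scores = difference_of_proportions(corpus_a, corpus_b)

# Most positive scores are distinctive of corpus A, most negative of corpus B.
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```

Words shared equally by both corpora (such as "the" here) score zero, while words unique to one corpus land at the extremes of the ranking.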
