ianmcaulay / anlp19

Course repo for Applied Natural Language Processing (Spring 2019)

Course materials for Applied Natural Language Processing (Spring 2019). Syllabus: http://people.ischool.berkeley.edu/~dbamman/info256.html

| Date | Activity | Summary |
| --- | --- | --- |
| 1/22 | Follow setup instructions in 0.setup/ | Install Anaconda and set up the environment for the class with the required Python libraries. |
| 1/24 | Complete 1.words/ExploreTokenization_TODO.ipynb before class | This notebook outlines several methods for tokenizing text into words (and sentences), including whitespace, nltk (Penn Treebank tokenizer), nltk (Twitter-aware), spaCy, and custom regular expressions, highlighting the differences between them (see the tokenizer comparison sketch below). |
| 1/24 | Execute 1.words/EvaluateTokenizationForSentiment.ipynb | This notebook evaluates different methods for tokenization and stemming/lemmatization and assesses their impact on binary sentiment classification, using a train/dev dataset of a sample of 1,000 reviews from the Large Movie Review Dataset. Each tokenization method is evaluated with the same learning algorithm (L2-regularized logistic regression); the only difference is the tokenization process (see the evaluation sketch below). For more, see: http://sentiment.christopherpotts.net/tokenizing.html |
| 1/24 | Complete 1.words/TokenizePrintedBooks_TODO.ipynb | Design a better tokenizer for printed texts that have been OCR'd, where words are often hyphenated at line breaks (see the dehyphenation sketch below). |
| 1/29 | Complete 2.distinctive_terms/CompareCorpora_TODO.ipynb | This notebook explores two methods for comparing textual datasets to identify the terms that are distinctive to each one: the difference of proportions (described in Monroe et al. 2009, "Fighting Words," section 3.2.2) and the Mann-Whitney rank-sums test (described in Kilgarriff 2001, "Comparing Corpora," section 2.3); see the corpus comparison sketch below. |
| 1/29 | Complete 2.distinctive_terms/ChiSquare.ipynb | This notebook illustrates the chi-square test for finding distinctive terms between @realdonaldtrump and @AOC (see the chi-square sketch below). |
| 1/31 | Complete 3.dictionaries/DictionaryTimeSeries_TODO.ipynb | This notebook introduces the use of dictionaries for counting the frequency of a category of words in text, using sentiment (from the AFINN sentiment lexicon) in time series data of tweets as an example (see the dictionary scoring sketch below). |
| 2/5 | Complete 4.classification/CheckData_TODO.ipynb | Collect data for classification and verify that it's in the proper format. |
| 2/5 | Complete 4.classification/Hyperparameters_TODO.ipynb | This notebook explores text classification, introducing a majority-class baseline and analyzing the effect of hyperparameter choices on accuracy (see the baseline and hyperparameter sketch below). |
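
Tokenizer comparison sketch (for 1.words/ExploreTokenization_TODO.ipynb): a minimal illustration of the tokenization methods named above, not the notebook's own code. It assumes nltk and spaCy (with the en_core_web_sm model) are installed per 0.setup/; the example sentence is invented.

```python
import re

from nltk.tokenize import TreebankWordTokenizer, TweetTokenizer
import spacy

text = "Isn't this great?! Check out http://example.com :-)"

# Whitespace: split on spaces only; punctuation stays attached to words.
print(text.split())

# Penn Treebank conventions: separates clitics like "n't" and punctuation.
print(TreebankWordTokenizer().tokenize(text))

# Twitter-aware: keeps URLs and emoticons as single tokens.
print(TweetTokenizer().tokenize(text))

# spaCy's rule-based tokenizer (components not needed for tokenization disabled).
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
print([tok.text for tok in nlp(text)])

# A custom regular expression: runs of word characters, or single punctuation marks.
print(re.findall(r"\w+|[^\w\s]", text))
```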
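
Evaluation sketch (for 1.words/EvaluateTokenizationForSentiment.ipynb): a hedged outline of the setup described above, with the same L2-regularized logistic regression and only the tokenizer varying. The evaluate function, variable names, and commented usage are illustrative assumptions, not the notebook's actual data loading.

```python
from nltk.tokenize import TweetTokenizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

def evaluate(tokenize, train_texts, y_train, dev_texts, y_dev):
    # CountVectorizer accepts any callable as the tokenizer, so only the
    # tokenization step differs between runs.
    vectorizer = CountVectorizer(tokenizer=tokenize, token_pattern=None)
    X_train = vectorizer.fit_transform(train_texts)
    X_dev = vectorizer.transform(dev_texts)

    # L2-regularized logistic regression (the default penalty in scikit-learn).
    clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
    clf.fit(X_train, y_train)
    return clf.score(X_dev, y_dev)  # dev-set accuracy

# Usage (with train_texts/y_train/dev_texts/y_dev loaded from the review data):
# acc_whitespace = evaluate(str.split, train_texts, y_train, dev_texts, y_dev)
# acc_tweet      = evaluate(TweetTokenizer().tokenize, train_texts, y_train, dev_texts, y_dev)
```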
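
Dehyphenation sketch (for 1.words/TokenizePrintedBooks_TODO.ipynb): one possible starting point, not a full solution: rejoin words broken across line breaks by a trailing hyphen before tokenizing. A better tokenizer would also decide when the hyphen belongs to a genuine compound. Requires nltk's punkt data for word_tokenize.

```python
import re
from nltk.tokenize import word_tokenize

def tokenize_printed(text):
    # Rejoin a word split across a line boundary, e.g. "exam-\nple" -> "example".
    dehyphenated = re.sub(r"(\w+)-\s*\n\s*(\w+)", r"\1\2", text)
    return word_tokenize(dehyphenated)

print(tokenize_printed("This exam-\nple sentence was OCR'd from a printed book."))
```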
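
Corpus comparison sketch (for 2.distinctive_terms/CompareCorpora_TODO.ipynb): minimal versions of the two statistics named above, applied to invented toy corpora rather than the notebook's data. Requires scipy.

```python
from collections import Counter
from scipy.stats import mannwhitneyu

# Toy corpora (lists of tokenized documents); placeholders for the real data.
corpus_a = [["the", "senate", "passed", "the", "bill"], ["the", "bill", "failed"]]
corpus_b = [["the", "game", "went", "to", "overtime"], ["great", "game", "tonight"]]

# 1. Difference of proportions (Monroe et al. 2009, section 3.2.2):
#    P(word | corpus A) - P(word | corpus B), computed over pooled counts.
counts_a = Counter(w for doc in corpus_a for w in doc)
counts_b = Counter(w for doc in corpus_b for w in doc)
total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
vocab = set(counts_a) | set(counts_b)
diff = {w: counts_a[w] / total_a - counts_b[w] / total_b for w in vocab}
print(sorted(diff.items(), key=lambda kv: kv[1], reverse=True)[:3])

# 2. Mann-Whitney rank-sums test (Kilgarriff 2001, section 2.3): compare a
#    term's per-document relative frequencies across the two corpora.
def doc_freqs(word, corpus):
    return [doc.count(word) / len(doc) for doc in corpus]

stat, p = mannwhitneyu(doc_freqs("bill", corpus_a), doc_freqs("bill", corpus_b),
                       alternative="two-sided")
print("bill:", stat, p)
```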
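
Chi-square sketch (for 2.distinctive_terms/ChiSquare.ipynb): a chi-square test on a 2x2 contingency table of term occurrence vs. corpus; the counts below are invented for illustration, not drawn from the Twitter data.

```python
from scipy.stats import chi2_contingency

# Rows: @realdonaldtrump tweets, @AOC tweets.
# Columns: occurrences of the candidate term, occurrences of all other tokens.
table = [[120, 48000],
         [15, 52000]]

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p:.4f}")
```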
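
Dictionary scoring sketch (for 3.dictionaries/DictionaryTimeSeries_TODO.ipynb): dictionary-based sentiment aggregated by date. The four-word lexicon and the tweets below are placeholders; the notebook uses the full AFINN lexicon and a larger time-stamped dataset. Requires pandas.

```python
import pandas as pd

# Tiny stand-in for the AFINN lexicon (word -> integer valence score).
afinn = {"great": 3, "terrible": -3, "bad": -3, "good": 3}

tweets = pd.DataFrame({
    "date": pd.to_datetime(["2019-01-01", "2019-01-01", "2019-01-02"]),
    "text": ["what a great day", "this is terrible", "not bad at all"],
})

def score(text):
    # Sum the dictionary values of every token that appears in the lexicon.
    return sum(afinn.get(tok, 0) for tok in text.lower().split())

tweets["sentiment"] = tweets["text"].apply(score)

# Average sentiment per day is the dictionary-based time series.
print(tweets.groupby("date")["sentiment"].mean())
```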
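
Baseline and hyperparameter sketch (for 4.classification/Hyperparameters_TODO.ipynb): a majority-class baseline plus a sweep over the logistic regression regularization strength C. The toy reviews below stand in for the student-collected dataset.

```python
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Placeholder data; replace with the collected classification dataset.
train_texts = ["loved it", "great movie", "awful film", "hated it", "really great"]
y_train = [1, 1, 0, 0, 1]
dev_texts = ["great film", "awful movie"]
y_dev = [1, 0]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_dev = vectorizer.transform(dev_texts)

# Majority-class baseline: always predict the most frequent training label.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print("majority baseline:", baseline.score(X_dev, y_dev))

# Effect of one hyperparameter (L2 regularization strength) on dev accuracy.
for C in [0.001, 0.01, 0.1, 1, 10, 100]:
    clf = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    print(f"C={C}: dev accuracy {clf.score(X_dev, y_dev):.3f}")
```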
