achyutb6 / bigram-probabilities-and-POS-tagging

Computing bigram probabilities and POS tagging

Bigram Probabilities

Program to compute the bigram model (counts and probabilities) on the given corpus (HW2_F17_NLP6320-NLPCorpusTreebank2Parts-CorpusA.txt) under the following three scenarios (a sketch of all three estimators follows the input sentence below):

  1. No Smoothing
  2. Add-one Smoothing
  3. Good-Turing Discounting based Smoothing
  • Note:
  1. The “ . ” string sequence in the corpus is used to break it into sentences.
  2. Each sentence is tokenized into words, and bigrams are computed ONLY within a sentence.
  3. Whitespace (i.e. space, tab, and newline) is used to tokenize a sentence into the words/tokens required for the bigram model.
  4. No word/token normalization (e.g. stemming, lemmatization, lowercasing) is performed.
  5. Bigram creation and matching are exact and case-sensitive.
  • Input Sentence:

The Fed chairman warned that the board 's decision is bad
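
The following is a minimal, self-contained Java sketch of all three estimators, not the repository's actual code. It assumes the corpus file sits in the working directory, and the particular Good-Turing conventions used here (the conditional form c*/C(w1), the unseen-bigram mass N1/N, and falling back to the raw count when N_{c+1} = 0) are common textbook choices rather than details confirmed by this repo. Requires Java 11+ for `Files.readString`.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class BigramModel {

    private final Map<String, Integer> unigramCount = new HashMap<>();
    private final Map<String, Integer> bigramCount = new HashMap<>();
    private final Map<Integer, Long> freqOfFreq = new HashMap<>(); // N_c: bigram types seen exactly c times
    private long totalBigrams = 0;                                 // N: total bigram tokens

    /** Break the corpus on the " . " sequence, then tokenize each sentence on whitespace. */
    void train(String corpus) {
        for (String sentence : corpus.split(" \\. ")) {
            if (sentence.isBlank()) continue;
            String[] tokens = sentence.trim().split("\\s+");
            for (int i = 0; i < tokens.length; i++) {
                unigramCount.merge(tokens[i], 1, Integer::sum);
                if (i > 0) { // bigrams stay within a sentence
                    bigramCount.merge(tokens[i - 1] + " " + tokens[i], 1, Integer::sum);
                    totalBigrams++;
                }
            }
        }
        for (int c : bigramCount.values()) freqOfFreq.merge(c, 1L, Long::sum);
    }

    /** No smoothing: P(w2|w1) = C(w1 w2) / C(w1). */
    double probNoSmoothing(String w1, String w2) {
        int cw1 = unigramCount.getOrDefault(w1, 0);
        if (cw1 == 0) return 0.0;
        return (double) bigramCount.getOrDefault(w1 + " " + w2, 0) / cw1;
    }

    /** Add-one: P(w2|w1) = (C(w1 w2) + 1) / (C(w1) + V), with V the vocabulary size. */
    double probAddOne(String w1, String w2) {
        int v = unigramCount.size();
        return (double) (bigramCount.getOrDefault(w1 + " " + w2, 0) + 1)
                / (unigramCount.getOrDefault(w1, 0) + v);
    }

    /** Good-Turing: c* = (c+1) * N_{c+1} / N_c; unseen bigrams get mass N_1 / N. */
    double probGoodTuring(String w1, String w2) {
        int c = bigramCount.getOrDefault(w1 + " " + w2, 0);
        if (c == 0) return (double) freqOfFreq.getOrDefault(1, 0L) / totalBigrams;
        long nc = freqOfFreq.get(c);
        long ncPlus1 = freqOfFreq.getOrDefault(c + 1, 0L);
        // Fall back to the raw count when N_{c+1} is zero (a common convention, assumed here).
        double cStar = ncPlus1 == 0 ? c : (c + 1) * (double) ncPlus1 / nc;
        return cStar / unigramCount.get(w1); // w1 is guaranteed seen since c > 0
    }

    public static void main(String[] args) throws IOException {
        BigramModel model = new BigramModel();
        model.train(Files.readString(Paths.get("HW2_F17_NLP6320-NLPCorpusTreebank2Parts-CorpusA.txt")));
        String[] s = "The Fed chairman warned that the board 's decision is bad".split("\\s+");
        double pNone = 1.0, pAdd1 = 1.0, pGt = 1.0;
        for (int i = 1; i < s.length; i++) {
            pNone *= model.probNoSmoothing(s[i - 1], s[i]);
            pAdd1 *= model.probAddOne(s[i - 1], s[i]);
            pGt   *= model.probGoodTuring(s[i - 1], s[i]);
        }
        System.out.printf("No smoothing: %g%nAdd-one:      %g%nGood-Turing:  %g%n", pNone, pAdd1, pGt);
    }
}
```

Running this prints the probability of the input sentence under each scenario. Note that the unsmoothed probability is zero whenever any bigram of the sentence is unseen in the corpus, which is exactly the problem add-one and Good-Turing smoothing address.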

POS Tagging

Transformation-based POS Tagging:

Implemented Brill’s transformation-based POS tagging algorithm, using ONLY the previous word’s tag as context, to extract the best transformation rule to (see the sketch after this list):

  1. Transform “NN” to “JJ”
  2. Transform “NN” to “VB”
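
Below is a minimal sketch of the rule-extraction step, assuming tag sequences are held as token-aligned String arrays of current tags and gold-standard tags. The names (`Rule`, `bestRule`) and the toy arrays in `main` are hypothetical, and the scoring convention (corrections minus newly introduced errors) follows the standard Brill formulation rather than this repository's code. Requires Java 16+ for records.

```java
import java.util.HashMap;
import java.util.Map;

public class BrillRuleLearner {

    /** A transformation rule: change fromTag to toTag when the previous word's tag is prevTag. */
    record Rule(String fromTag, String toTag, String prevTag, int score) {}

    /**
     * Score every candidate rule "NN -> toTag when the previous tag is Z" in one pass:
     * score(Z) = (NN tags the rule would correct) - (correct NN tags it would break).
     * Returns null if the transformation never fixes anything.
     */
    static Rule bestRule(String[] current, String[] gold, String toTag) {
        Map<String, Integer> fixes = new HashMap<>();     // prev tag -> errors corrected
        Map<String, Integer> breakages = new HashMap<>(); // prev tag -> correct tags broken
        for (int i = 1; i < current.length; i++) {
            if (!current[i].equals("NN")) continue;
            String prev = current[i - 1];
            if (gold[i].equals(toTag)) fixes.merge(prev, 1, Integer::sum);
            else if (gold[i].equals("NN")) breakages.merge(prev, 1, Integer::sum);
        }
        Rule best = null; // only contexts with at least one fix can yield the best rule
        for (String prev : fixes.keySet()) {
            int score = fixes.get(prev) - breakages.getOrDefault(prev, 0);
            if (best == null || score > best.score()) best = new Rule("NN", toTag, prev, score);
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy aligned tag sequences (illustrative only): current tags vs. gold-standard tags.
        String[] current = {"DT", "NN", "NN", "TO", "NN", "DT", "NN"};
        String[] gold    = {"DT", "JJ", "NN", "TO", "VB", "DT", "JJ"};
        System.out.println(bestRule(current, gold, "JJ")); // NN -> JJ when prev tag is DT, score 2
        System.out.println(bestRule(current, gold, "VB")); // NN -> VB when prev tag is TO, score 1
    }
}
```

In full Brill learning this scoring-and-selection step is repeated, applying the best rule to the corpus and searching again until no rule has a positive score; here only the single best rule per target tag is extracted, matching the two transformations listed above.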

