The file `finnish_train.tsv` (in this repository) contains data on Finnish declension (i.e., nominal inflection) extracted from Wiktionary. The data consists of a 150-line TSV (tab-separated values) file in the following format:
LEMMA WORDFORM MORPHOLOGY

where

- LEMMA: the citation form (in orthography)
- WORDFORM: the inflected word (in orthography)
- MORPHOLOGY: the wordform's morphology
All the data consist of singular nouns in the adessive ("near", "adjacent to") case, indicated by "AT+ESS" in MORPHOLOGY, or the inessive ("in", "inside") case, indicated by "IN+ESS" in MORPHOLOGY. These cases are marked by suffixes which show several allomorphs (i.e., contextual variants) due to a phenomenon known as vowel harmony. According to Ringen and Heinämäki (1999), which is included in this repository for your reference, the adessive suffix is usually realized as -llä [lːæː], except when the preceding stem contains one of u, o, and a with no intervening y, ö, or ä; in this case, it is -lla [lːɑː]. For example, käde 'hand' has the adessive kädellä, whereas vero 'tax' forms the adessive verolla because the nearest stem vowel is o. Similar rules hold for the inessive suffix, which has -ssä [sːæː] and -ssa [sːɑː] allomorphs. However, there are several other phonological factors at play in the provided data.
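The harmony generalization above can be sanity-checked outside of Pynini. The following plain-Python sketch (the helper name `adessive` is ours, not part of the assignment) picks the allomorph by scanning the stem right-to-left for the nearest harmonic vowel:

```python
def adessive(stem: str) -> str:
    """Attach -lla or -llä per the harmony rule described above."""
    for vowel in reversed(stem):
        if vowel in "aou":      # nearest harmonic vowel is back: back suffix
            return stem + "lla"
        if vowel in "äöy":      # nearest harmonic vowel is front: front suffix
            return stem + "llä"
    return stem + "llä"         # only neutral vowels (i, e): default to front
```

For instance, `adessive("vero")` yields `verolla` and `adessive("käde")` yields `kädellä`, matching the examples above. Note that this captures only the basic alternation; the other phonological factors in the data are ignored.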
Your assignment is to compile two sets of finite-state rules, one for the adessive and one for the inessive, which generate the wordforms from the corresponding lemmas. Your solution needs to be sufficiently general that it is capable of generating the appropriate adessive and inessive forms for held-out examples. Your script should compile the rules and write them into a FAR file, as follows:

```python
with pynini.Far("finnish.far", "w") as sink:
    sink["ADESSIVE"] = your_adessive_rule
    sink["INESSIVE"] = your_inessive_rule
- A script that generates `finnish.far` as described above, and
- a brief (roughly one-page) write-up describing the rules you posited and providing:
    - the percentage of adessive examples in `finnish_train.tsv` covered by your rules,
    - the percentage of inessive examples in `finnish_train.tsv` covered by your rules, and
    - some examples not covered by your rules and why they are not covered.
You will be graded on:

- your rules' coverage of the data in `finnish_train.tsv`,
- your rules' coverage of a set of 50 held-out examples, and
- the quality and detail of your write-up.
- Suffixation is just concatenation (`concat` or the `+` operator).
- Vowel harmony (and other changes) are best accomplished using `cdrewrite`.
Use `finnish_train.tsv` to evaluate your solution as described above.
T9 ("Text on 9 keys"; Grover et al. 1998) is a patented predictive text entry method in which each of the 10 digit keys of a traditional 3x4 touch-tone phone grid maps onto a larger alphabet including alphabetic and punctuation characters. This transduction is massively ambiguous: for instance, `4604663` can be read as `go home` or `go hood`, among many other possibilities. Therefore, T9 makes use of a hybrid lexicon/language model to disambiguate.
Your assignment is to build a simple version of a T9 decoder which maps input digit strings onto the alphabet consisting of digits, lowercase ASCII characters, the space character, and ASCII punctuation characters, using a language model to disambiguate.
First, construct an unweighted FST which supports the following mappings (simplified slightly from the original T9 spec):
0: 0, <SPACE>
1: 1, ASCII punctuation characters (`string.punctuation`)
2: 2, a, b, c
3: 3, d, e, f
4: 4, g, h, i
5: 5, j, k, l
6: 6, m, n, o
7: 7, p, q, r, s
8: 8, t, u, v
9: 9, w, x, y, z
Then, construct a trigram (or higher-order) language model over the output alphabet using the OpenGrm-NGram tools. Then, create a decoder function which applies the input digit string to the mapping FST, scores the resulting lattice with the language model, and returns the shortest path string.
- A working decoder script or function and any assets I will need to test it locally, and
- a brief (roughly one page) write-up describing any challenges that arose.
You will be graded on:
- Your decoder's performance on a held-out data set,
- the quality of your decoder code, and
- the quality and detail of your write-up.
Test your decoder by passing it sensible English strings in T9; it may get some wrong, but in general it should be able to recover your intended text.
- Use `string_map` or `string_file` for constructing the mapper; don't forget to compute the `closure` so that your mapper can handle strings of arbitrary length. You will have to escape the square brackets, however.
- You can use the data from MP0 for your language model; however, you shouldn't tokenize the data for this task.
- When building the language model, take special care when handling the space character.
(These are entirely optional and should only be attempted after completing the rest of the assignment.)
- Create a test corpus and code to evaluate the label error rate of your decoder.
- Then, try different language model orders and compute the performance. Or, build a bigram or trigram language model over word tokens, and then map these words tokens onto character sequences.
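For the first extra, label error rate is conventionally total edit distance over total reference labels; a minimal plain-Python sketch (the function names are ours):

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance over characters (labels)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def label_error_rate(refs: list[str], hyps: list[str]) -> float:
    """Total edits over total reference labels, across the corpus."""
    edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return edits / sum(len(r) for r in refs)
```

For example, `edit_distance("go home", "go hood")` is 2 (two substitutions), so a one-sentence corpus with that reference and hypothesis has a label error rate of 2/7.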