joannechau / CSE538_FinalProject

CSE538_FinalProject

Sources

The data-generation scripts were adapted from https://github.com/kkostyszyn/SBFST_2019, with the Pynini calls updated for version 2.1.3. The function rand_gen_no_duplicate() was replaced with more efficient alternatives, the function create_adversarial_examples() was added, bugs throughout the code were fixed, and check.py was updated. The model, training, and evaluation code are new contributions.

Dependencies

  • Python >= 3.6
  • TensorFlow >= 2.0
  • Pynini >= 2.1.3

Use

Data Generation

To generate datasets for the languages named in a text file, first create .att files for the languages and store them in src/data_gen/lib. These can be written by hand. Then compile the .att files into .fst files, stored in src/data_gen/lib/lib_fst, by running att2fst.sh.
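As an illustration of writing a .att file by hand, the short sketch below emits a tiny acceptor in the AT&T text format. The automaton and the file name `example.att` are hypothetical, not one of the languages used in the report.

```python
# Hypothetical sketch: write a small acceptor in AT&T text format by hand.
# Each arc line is: src-state  dst-state  input-label  output-label;
# a line holding a single state number marks that state as final.
from pathlib import Path

att_lines = [
    "0\t0\ta\ta",  # state 0 loops on 'a'
    "0\t1\tb\tb",  # 'b' moves to state 1
    "1\t0\ta\ta",  # 'a' returns to state 0, so "bb" is never accepted
    "0",           # state 0 is final
    "1",           # state 1 is final
]
Path("example.att").write_text("\n".join(att_lines) + "\n")
```

A file in this format can then be compiled to a binary .fst with att2fst.sh as described above.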

After the .fst files are compiled, run data-gen_alternate_3langs.py. It generates Training, Dev, Test 1, Test 2, and Test 3 sets for the languages listed in tags_3langs.txt and stores them in data_gen/data_3langs. Check whether the data was generated successfully using check_3langs.py: if any file is absent or missing strings, a "missing" or "incomplete" message is printed to the terminal.
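The kind of completeness check check_3langs.py performs can be sketched as follows; the function name and the expected-line-count parameter here are assumptions, not the script's actual interface.

```python
# Sketch of a data-file completeness check (name and threshold are assumed).
from pathlib import Path

def check_file(path, expected_lines):
    p = Path(path)
    if not p.exists():
        return "missing"             # the file was never generated
    n = sum(1 for _ in p.open())     # count the generated strings
    return "incomplete" if n < expected_lines else "ok"
```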

Four different sets of data are currently stored in src/data_gen: data, data_3langs, data_n, and data_r. Our report used the data in data_3langs, generated by the script data-gen_alternate_3langs.py. The other Python scripts in this folder are the data-gen and check scripts corresponding to the other data folders; all of those had problems generating the desired number of strings for some languages and are preserved here for future debugging and use.

data-gen_alternate_3langs.py generates the data in data_3langs. There are three subsets: 1k, 10k, and 100k. Each one contains 45 files: a _Training.txt, _Dev.txt, _Test1.txt, _Test2.txt, and _Test3.txt file for each of the 9 languages we chose. data-gen_alternate_3langs.py creates these files by reading the list of languages in CSE538_FinalProject/tags_3langs.txt and using Pynini (Gorman 2016) to generate strings as described in the report.
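The 45-files-per-subset count follows from the naming scheme: 9 languages times 5 splits. A quick sketch, with placeholder tags since the real language tags live in tags_3langs.txt:

```python
# Placeholder tags stand in for the 9 languages listed in tags_3langs.txt.
suffixes = ["_Training.txt", "_Dev.txt", "_Test1.txt", "_Test2.txt", "_Test3.txt"]
langs = [f"lang{i}" for i in range(1, 10)]
files = [lang + suffix for lang in langs for suffix in suffixes]  # 9 * 5 = 45 files
```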

Neural Models

To train a single model, run src/neural_net/tensorflow/main.py. Most of its arguments are self-explanatory, but note that the --bidi flag makes the model's RNN bidirectional. Valid values for --rnn-type are "gru" and "lstm".
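A minimal argparse sketch of how the two documented flags might be declared; the actual main.py defines additional arguments that are omitted here.

```python
import argparse

# Only --bidi and --rnn-type are documented above; the rest of the real
# main.py interface is not reproduced in this sketch.
parser = argparse.ArgumentParser()
parser.add_argument("--bidi", action="store_true",
                    help="make the model's RNN bidirectional")
parser.add_argument("--rnn-type", choices=["gru", "lstm"], default="gru",
                    help="recurrent cell type")
args = parser.parse_args(["--rnn-type", "lstm", "--bidi"])
```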

To produce a set of predictions from a data file and a model, run src/neural_net/tensorflow/predict.py. It writes the model's predictions to a file in the model directory named after the test data file with the suffix _pred.txt.
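The output naming convention can be sketched as a small helper; the function and example paths are hypothetical, for illustration only.

```python
from pathlib import PurePosixPath

def pred_path(model_dir, test_file):
    # <test-file name minus extension> + "_pred.txt", inside the model directory
    stem = PurePosixPath(test_file).stem
    return str(PurePosixPath(model_dir) / (stem + "_pred.txt"))
```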

To evaluate a model's predictions, run src/neural_net/tensorflow/eval.py. It takes a prediction file as produced by predict.py and writes an equivalently named _eval.txt file, also in the model directory, reporting a number of statistics about the model's predictions.
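The exact statistics eval.py reports are not listed here; as a minimal sketch, accuracy over (gold, predicted) label pairs could be computed like this:

```python
def simple_eval(pairs):
    """pairs: iterable of (gold_label, predicted_label) tuples."""
    pairs = list(pairs)
    correct = sum(gold == pred for gold, pred in pairs)
    return {"n": len(pairs), "accuracy": correct / len(pairs) if pairs else 0.0}
```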

Batch Training Scripts

The scripts train_all.sh and train_all_lstm.sh take no arguments and produce all of the models examined in our report. After they have been run, collect_evals.sh and collect_evals_lstm.sh (or evals_csv.py) can be run, also without arguments, to collect all of the evaluation metrics we considered into a single CSV file.
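The collection step amounts to flattening per-model metrics into one CSV. A stdlib-only sketch, where the metric names and the dict shape are assumptions rather than evals_csv.py's actual interface:

```python
import csv
import io

def evals_to_csv(evals):
    # evals: model name -> {metric name: value}; shape assumed for illustration
    buf = io.StringIO()
    fieldnames = ["model"] + sorted({k for metrics in evals.values() for k in metrics})
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    for model, metrics in sorted(evals.items()):
        writer.writerow({"model": model, **metrics})
    return buf.getvalue()
```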
