Jeevesh8 / chat_command_detect


Analysing and Learning From a Chat Command Dataset

The Data

The dataset is present in the data/ folder. Close analysis reveals a few points:

  1. As the data is generated from speech, a lot of it is repeated: the same commands (the exact same transcript) occur many times. Only 248 unique samples exist.

  2. Moreover, the train and validation data differ only in the order of the samples and are composed of exactly the same utterances.

  3. Hence, we are in a low-data scenario: only 248 chat commands are available to us, with around 6 action, 14 object, and 4 location labels to learn.

See notebooks/eda.ipynb for some basic EDA and samples from the data and labels; a rough sketch of the kind of checks it performs is given below.
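A minimal sketch of such checks, assuming the CSVs in data/ expose a transcription column and one column per label type (the file and column names here are assumptions, not taken from the repo):

```python
# Sketch of the basic EDA; file and column names (transcription, action,
# object, location) are assumptions, not confirmed by the repo.
import pandas as pd

train = pd.read_csv("data/train_data.csv")
valid = pd.read_csv("data/valid_data.csv")

# How much of the data is repeated transcripts?
print("train rows:", len(train))
print("unique transcripts:", train["transcription"].nunique())

# Do train and validation contain exactly the same utterances?
print("same utterance sets:",
      set(train["transcription"]) == set(valid["transcription"]))

# Label cardinalities (expected: ~6 actions, ~14 objects, ~4 locations)
for col in ["action", "object", "location"]:
    print(col, "labels:", train[col].nunique())
```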

Basic Models

We begin our analysis with some basic models, like Bag-of-Words based Naive-Bayes classifiers and a simple baseline that maximises cosine similarity against a known set of word embeddings, in notebooks/baseline_models.ipynb.

We observe that Naive-Bayes gives an almost perfect fit to the data. This is a strong indication that all the different kinds of labels are independent given the text.
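For reference, a bag-of-words Naive-Bayes baseline along these lines can be set up with scikit-learn. Fitting one independent classifier per label type, and the file and column names, are assumptions about the notebook, not an exact reproduction of it:

```python
# Sketch of a bag-of-words Naive-Bayes baseline: one independent classifier
# per label type. File and column names are assumptions, as in the EDA sketch.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train = pd.read_csv("data/train_data.csv")
texts = train["transcription"]

for label in ["action", "object", "location"]:
    clf = make_pipeline(CountVectorizer(), MultinomialNB())
    clf.fit(texts, train[label])
    print(label, "train accuracy:", clf.score(texts, train[label]))
```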

A Description of our Task

The task at hand is quite similar to intent detection and slot filling, which have been studied widely. The intent detection problem is that of classifying the intent of an utterance into one of a small, closed set of options.

While intent detection is clearly a sentence classification problem, it is less clear what the best formulation for slot filling is. It can sometimes be cast as a sequence tagging problem and sometimes as a classification problem.

In the case of a closed set of slots, with a small, finite number of options for each slot, it is probably better to cast it as a classification problem, since language models generally perform better on classification. We follow this route for now and explore others later.
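As a hypothetical illustration (the utterance and labels below are made up, not taken from the dataset), the two formulations look like this:

```python
# Hypothetical example (not from the dataset): the same command framed as
# sequence tagging vs. as three independent classification targets.
utterance = "turn on the kitchen lights"

# Sequence tagging view: one BIO-style slot tag per token.
tags = ["B-action", "I-action", "O", "B-location", "B-object"]

# Classification view (the route taken here): one label per slot type.
labels = {"action": "activate", "object": "lights", "location": "kitchen"}
```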

Recurrent Models

We continue our analysis with recurrent models. To prepare the environment and run them, see notebooks/run_models/run_rnn.ipynb, or see the logs on WandB.

We try different architectures and observe several phenomena:

  1. Models with a single LSTM layer fall into local minima and stay stuck there. Compare the lstm_action and lstm_action_single_layer runs on WandB.

  2. Multi-task LSTMs, when trained with the same learning rate as single-task LSTMs, also fall into local minima. Hence, we have to train with a lower learning rate, which leads to a much more gradual decrease in loss (training takes considerably longer). Compare the lstm_all_three, lstm_all_three_longer and lstm_all_three_orig_params runs on WandB.

  3. Single-task LSTMs can quickly learn their respective tasks.

Note that, to learn semantics well in our models, we train them on top of fixed embeddings from FastText.
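A minimal sketch of a multi-task LSTM over frozen FastText embeddings, assuming a shared bidirectional encoder and one linear head per label type (the hidden size, pooling, and head structure are assumptions, not the repo's exact architecture):

```python
# Sketch of a multi-task LSTM classifier over frozen FastText embeddings.
# Architecture details (hidden size, pooling, head sizes) are assumptions.
import torch
import torch.nn as nn

class MultiTaskLSTM(nn.Module):
    def __init__(self, fasttext_weights, hidden_size=128,
                 n_actions=6, n_objects=14, n_locations=4):
        super().__init__()
        # Frozen pre-trained embeddings (freeze=True keeps the semantics fixed).
        self.embed = nn.Embedding.from_pretrained(fasttext_weights, freeze=True)
        self.lstm = nn.LSTM(fasttext_weights.size(1), hidden_size,
                            batch_first=True, bidirectional=True)
        feat = 2 * hidden_size
        # One classification head per label type (multi-task setup).
        self.action_head = nn.Linear(feat, n_actions)
        self.object_head = nn.Linear(feat, n_objects)
        self.location_head = nn.Linear(feat, n_locations)

    def forward(self, token_ids):
        x = self.embed(token_ids)                 # (batch, seq, emb)
        _, (h, _) = self.lstm(x)                  # h: (2, batch, hidden)
        feats = torch.cat([h[0], h[1]], dim=-1)   # final states of both directions
        return (self.action_head(feats),
                self.object_head(feats),
                self.location_head(feats))

# The multi-task loss would then be the sum of per-task cross-entropies:
# loss = ce(action_logits, a) + ce(object_logits, o) + ce(location_logits, l)
```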

Transformer Models

Next, we try finetuning pre-trained transformer models. We try several of them: bert-base-uncased, roberta, and albert. All of them seem to learn in very few epochs compared to their recurrent counterparts.

The finetuned checkpoints are available for bert-base-uncased, as well as for the recurrent models.
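For reference, a finetuning run for a single label type (here the action labels) might look roughly like the following with Hugging Face Transformers. The dataset objects, hyperparameters, and the choice to finetune one head at a time are assumptions, not the repo's actual training setup:

```python
# Sketch of finetuning bert-base-uncased for the action labels only.
# train_ds / valid_ds are assumed to be datasets.Dataset objects with a
# "transcription" text column and an integer "label" column.
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=6)          # ~6 action labels

def tokenize(batch):
    return tokenizer(batch["transcription"], truncation=True)

args = TrainingArguments(output_dir="bert_action",
                         num_train_epochs=3,
                         per_device_train_batch_size=16)

trainer = Trainer(model=model, args=args, tokenizer=tokenizer,
                  train_dataset=train_ds.map(tokenize, batched=True),
                  eval_dataset=valid_ds.map(tokenize, batched=True))
trainer.train()
```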

Evaluation

We generate a sample test set by performing simple replacements like Turn->Blow and Chinese->Mandarin. It is present in test_data.csv. The results on this dataset for the various trained models can be found in notebooks/run_models/test_results.md.

To evaluate, we can use the same script and the same command, but set inference->run_infer to true in config.yaml and provide, in inference->run_name, the name of the WandB run to load weights from. For example, jeevesh8/chat_cmds/k95jqc9b is the name of this run.
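The relevant part of config.yaml would then look roughly as follows; only the two keys above are taken from this README, and the rest of the file's structure is not shown here:

```yaml
# Sketch of the inference section of config.yaml; only run_infer and
# run_name are mentioned in this README.
inference:
  run_infer: true
  run_name: jeevesh8/chat_cmds/k95jqc9b   # WandB run to load weights from
```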

Future Work

  1. Try the Stack-Propagation framework instead of our multi-task one.

  2. Try using conversationally pre-trained models like ConveRT or DialoGPT; some papers report that they perform better on intent detection tasks.

  3. Universal Sentence Embeddings have also been trained on conversations and may be useful.

  4. Try solving the same task in a harder setting, such as that of compositional generalization tasks.

  5. Try the zero-shot setting using prompting methods.

About

License: MIT License


Languages

Jupyter Notebook 82.5%, Python 17.5%