Expert Gate: Lifelong Learning with a Network of Experts

Code for the Paper

Expert Gate: Lifelong Learning with a Network of Experts
Rahaf Aljundi, Punarjay Chakravarty, Tinne Tuytelaars
[CVPR 2017]

If you find this code useful, please consider citing the original work by the authors:

@InProceedings{Aljundi_2017_CVPR,
author = {Aljundi, Rahaf and Chakravarty, Punarjay and Tuytelaars, Tinne},
title = {Expert Gate: Lifelong Learning With a Network of Experts},
booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {July},
year = {2017}
}

Introduction

Lifelong Machine Learning, or LML, considers systems that can learn many tasks over a lifetime from one or more domains. Such systems retain the knowledge they have learned and use it to learn new tasks more efficiently and effectively. (This is a case of positive inductive bias, where past knowledge helps the model perform better on a newer task.)

The problem of Catastrophic Interference, or Catastrophic Forgetting, is one of the major hurdles facing this domain: the performance of the model declines on the older tasks once newer tasks are introduced into the learning pipeline.

This paper advocates the use of a separate "expert" for each task, such that each expert is called into action when the system faces a sample that is pertinent to the task on which it is the "expert". The authors theorize that a shared model would be unable to account for the nuances of each task, leading to performance degradation on the older tasks as the number of tasks grows.

To help distinguish between these tasks, the paper proposes to train a single-layer autoencoder on each task, where each autoencoder is taught to reconstruct its input. In theory, when a task arrives, the autoencoder with the lowest reconstruction error on that task activates its corresponding "expert". Thus, for each task, the authors train a single-layer autoencoder and an "expert" which derives its architecture from AlexNet.
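The gating idea above can be sketched in plain Python. This is a minimal illustration, not the code in this repository: it assumes the reconstruction error of an input under each task's autoencoder has already been computed, and the softmax-with-temperature form (with a default temperature of 2) reflects my reading of the paper and should be checked against it. The names gate_probabilities and select_expert are hypothetical.

```python
import math


def gate_probabilities(reconstruction_errors, temperature=2.0):
    """Turn per-task reconstruction errors into gate probabilities.

    Lower error means higher probability: p_i = softmax(-err_i / temperature).
    """
    scaled = [-err / temperature for err in reconstruction_errors]
    # Subtract the maximum before exponentiating for numerical stability.
    shift = max(scaled)
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def select_expert(reconstruction_errors):
    """Return the index of the autoencoder with the lowest reconstruction error."""
    return min(range(len(reconstruction_errors)),
               key=lambda i: reconstruction_errors[i])


# Hypothetical errors of one input under three task autoencoders.
errors = [0.9, 0.2, 0.5]
probs = gate_probabilities(errors)
expert = select_expert(errors)
```

Picking the minimum error and picking the maximum gate probability select the same expert; the probabilities are only needed if more than one expert should be consulted.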

Requisites

PyTorch: Use the instructions that are outlined on PyTorch Homepage for installing PyTorch
gdown: A simple pip install should suffice. Follow instructions here

Ensure that you have both of these libraries installed.

Datasets and Designing the experiments

The original paper uses Caltech-UCSD Birds, MIT Scenes and Oxford Flowers in addition to ImageNet (for training the autoencoder). Computational and hardware limitations necessitated designing the experiments around smaller versions of these standard datasets. However, this was complicated for two major reasons:

The smaller versions of most of the standard datasets were not publicly available
The ones that could be found (Oxford 17 categories dataset, Birds 200 categories) were getting corrupted by the system, such that the dataloaders in PyTorch were reading in files whose names were prepended with a _ sign.

The Tiny-Imagenet dataset was used instead, and its 200 classes were split into 4 tasks, with 50 classes assigned to each task at random. The division is arbitrary; no special consideration was given to the decision to split the dataset evenly. Each of these tasks has a "train" and a "test" folder to validate the performance on these wide-ranging tasks.
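A random, even split of this kind can be sketched as follows. This is only an illustration of the idea and not the logic in data_prep_tin.py; the function name split_classes, the seed, and the placeholder class names are assumptions.

```python
import random


def split_classes(class_names, num_tasks, seed=0):
    """Randomly partition class names into num_tasks equally sized groups."""
    if len(class_names) % num_tasks != 0:
        raise ValueError("number of classes must be divisible by num_tasks")
    shuffled = sorted(class_names)  # fixed starting order for reproducibility
    random.Random(seed).shuffle(shuffled)
    per_task = len(shuffled) // num_tasks
    return [shuffled[i * per_task:(i + 1) * per_task] for i in range(num_tasks)]


# Placeholder names standing in for the 200 Tiny-Imagenet class folders.
classes = [f"class_{i:03d}" for i in range(200)]
tasks = split_classes(classes, num_tasks=4)
```

Each entry of tasks is then a list of 50 class names that could be copied into that task's "train" and "test" folders.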

The MNIST dataset was also used, in order to introduce some tasks that are significantly different from those in the Tiny-Imagenet dataset. This is an attempt to recreate the setting of the original paper on a smaller scale.

Training

Training a model on a given task takes place using the generate_models.py file. Execute the following line (along with the necessary arguments) to generate the expert models for the 4 tasks. Make sure you have downloaded the datasets used in these experiments; the steps are detailed here. If you are testing this on your own datasets, make the necessary changes to the number of tasks used in generate_models.py and test_models.py (here and here). You would also need to change the transforms used in these files.

python3 generate_models.py

The file takes the following arguments

init_lr: Initial learning rate for the model. The learning rate is decayed every 5 epochs. Default: 0.1
num_epochs_encoder: Number of epochs you want to train the encoder model for. Default: 5
num_epochs_model: Number of epochs you want to train the model for. Default: 15
batch_size: Batch Size. Default: 16
use_gpu: Set the GPU flag to True to use the GPU. Default: False
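As a rough sketch, arguments like these could be declared with argparse as shown below. The argument names and defaults are the ones listed above, but the "--name value" flag syntax, the str2bool helper, and the build_parser function are assumptions; check generate_models.py for how the arguments are actually parsed.

```python
import argparse


def str2bool(value):
    """Parse 'True'/'False'-style strings, since bool('False') is True."""
    if value.lower() in ("true", "1", "yes"):
        return True
    if value.lower() in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser():
    """Declare the documented training arguments with their documented defaults."""
    parser = argparse.ArgumentParser(description="Train autoencoders and experts")
    parser.add_argument("--init_lr", type=float, default=0.1)
    parser.add_argument("--num_epochs_encoder", type=int, default=5)
    parser.add_argument("--num_epochs_model", type=int, default=15)
    parser.add_argument("--batch_size", type=int, default=16)
    parser.add_argument("--use_gpu", type=str2bool, default=False)
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is testable.
args = build_parser().parse_args(["--num_epochs_model", "15", "--use_gpu", "True"])
```

The explicit boolean parser is a deliberate choice: passing type=bool to argparse would treat the string "False" as True.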

Once you invoke the generate_models.py file with the appropriate arguments, the following things shall happen

The Autoencoder model is trained on the features of the last convolutional layer of an Alexnet model (the preprocessing steps are as detailed in section 3.1 of the paper) and the model is stored in ./models/autoencoders with the appropriate task number. The facility to restart training from a given checkpoint is provided so as to protect against abrupt failures whilst training
After an autoencoder model is trained, a search is carried out over the already existing autoencoders to determine the "expert" which is most closely related to the new task using the task-relatedness metric described in section 3.3 of the paper
After the appropriate model is decided upon, there are two ways to train this model depending on the task-relatedness value as described in section 3.3 of the paper:
- LwF approach: Use the method outlined in the Learning Without Forgetting paper when the value for the task-relatedness is greater than 0.85
- Finetuning: If the value is less than 0.85, proceed with the finetuning approach as described in Learning Without Forgetting
The implementation comes with a slight caveat: PyTorch does not allow the trainable flag (requires_grad) to be set for only some of the weights within a layer, so some weights of a layer cannot be frozen while others are trained. To implement this idea, the gradients for these weights are manually set to zero so that these weights don't train
The final model is stored in /models/trained_models with a text file classes.txt which describes the number of classes that the model was exposed to in this particular task
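The choice between the two training methods can be sketched as below. The relatedness formula, Rel(Ta, Tb) = 1 - (Er_b - Er_a) / Er_a, is my reading of section 3.3 of the paper and should be verified there; the function names and the example error values are hypothetical and not taken from model_train.py.

```python
def task_relatedness(err_own, err_other):
    """Rel(Ta, Tb) = 1 - (err_other - err_own) / err_own.

    err_own:   reconstruction error of the new task's data on its own autoencoder.
    err_other: reconstruction error of the same data on an earlier task's autoencoder.
    """
    if err_own <= 0:
        raise ValueError("err_own must be positive")
    return 1.0 - (err_other - err_own) / err_own


def choose_training_strategy(err_own, errs_previous, threshold=0.85):
    """Pick the most related earlier expert and the method used to train from it.

    Returns (index of the most related expert, its relatedness, strategy name).
    """
    scores = [task_relatedness(err_own, err) for err in errs_previous]
    best = max(range(len(scores)), key=lambda i: scores[i])
    strategy = "lwf" if scores[best] > threshold else "finetune"
    return best, scores[best], strategy


# Hypothetical errors: the new task's own autoencoder, then three earlier ones.
best, rel, strategy = choose_training_strategy(0.20, [0.50, 0.22, 0.31])
```

With these example numbers the second earlier task is the most related one (relatedness of about 0.9), so the LwF branch would be taken.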

Refer to the docstrings and the inline comments that are made in encoder_train.py and model_train.py for a more detailed view

MAKE SURE THAT YOU TRAIN THE MODEL FOR AT LEAST 10 EPOCHS BUT ALSO KEEP IT BELOW 25 EPOCHS

The training procedure is really volatile, and these were the boundaries that I could find. I did not carry out an extensive search over the optimum number of epochs and these boundaries were obtained from initial tests. For this range, the loss function at least returned a numerical value, however even in this case, if the model gets stuck in a bad optimum, the loss function starts giving out NaN values and this snowballs into the model not learning at all.

Evaluating the model

To recreate the experiments performed, first, execute the following lines of code

cd data_utils
python3 data_prep_tin.py
python3 data_prep_mninst.py
cd ../

This will download the Tiny-Imagenet dataset (TIN) and the MNIST dataset to the Data folder and split them into 4 + 5 tasks, with each TIN task consisting of 50 classes and each MNIST task consisting of 2 classes. The directory structure of the downloaded datasets would be:

Data
├── test
│   ├── n01443537
│   └── n01629819
└── train
    ├── n01443537
    └── n01629819

Train the model as detailed in the procedure outlined in the Training section.

Next, to assess how well the model adapts to a particular task at hand, execute the following line (along with the arguments) to generate the final scores

python3 test_models.py

use_gpu: Set the GPU flag to True to use the GPU. Default: False
batch_size: Batch Size. Default: 16

A file called results.txt will be created and the results (task accuracy) shall be recorded in this file

Results

My system could not handle the full sequence of tasks (9 in all) and frequently froze before completion. The test_models module is O(number_of_tasks × number_of_tasks × sizeof(task)). This is necessary because, for each task, we need to search over all the autoencoders created for the best-performing one and activate the corresponding trained_model, over which the final epoch_accuracy is calculated. Because of this, I manually cut the number of classes in each of the TIN tasks to 25 and used only one of the tasks from the MNIST dataset. It is quite clear from the architecture proposed in this paper that this search procedure is not optimal.
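The nested search that drives this cost can be sketched with plain callables standing in for the autoencoders and experts. This is a toy illustration under assumed data structures (dictionaries of callables and scalar inputs), not the code in test_models.py.

```python
def evaluate_tasks(task_data, autoencoders, experts):
    """Route every task to an expert via its autoencoders and score it.

    task_data:    {task_id: list of (input, label) pairs}
    autoencoders: {model_id: callable(input) -> reconstruction error}
    experts:      {model_id: callable(input) -> predicted label}
    Returns {task_id: (activated_model_id, accuracy)}.
    """
    results = {}
    for task_id, samples in task_data.items():            # loop over tasks
        total_error = {}
        for model_id, recon_error in autoencoders.items():  # loop over autoencoders
            # Loop over every sample of the task for this autoencoder.
            total_error[model_id] = sum(recon_error(x) for x, _ in samples)
        activated = min(total_error, key=total_error.get)
        correct = sum(1 for x, y in samples if experts[activated](x) == y)
        results[task_id] = (activated, correct / len(samples))
    return results


# Toy stand-ins: each "autoencoder" reconstructs values near its own centre well.
task_data = {1: [(0.1, 0), (0.2, 1)], 2: [(5.1, 0), (4.9, 1)]}
autoencoders = {1: lambda x: abs(x - 0.0), 2: lambda x: abs(x - 5.0)}
experts = {1: lambda x: int(x > 0.15), 2: lambda x: int(x < 5.0)}
results = evaluate_tasks(task_data, autoencoders, experts)
```

The three nested loops (tasks, autoencoders, samples) are where the number_of_tasks × number_of_tasks × sizeof(task) cost comes from.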

Another key caveat is that in all these trained models that are derived from the Alexnet architecture, only the last two convolutional layers and the classification layers are being trained. The rest of the layers are frozen and hence are not trained and the results are reported for this setting

There are two evaluation criteria:

The first takes a max over all the output units in the final layer of the network (results have been reported for this setting). This is the more challenging criterion
The second takes a max over only the relevant output units in the final layer of the network (more similar to the setting in the paper)

Uncomment either of these lines to use the corresponding evaluation criterion
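The difference between the two criteria can be illustrated on a list of output scores. The function names, the example scores, and the assumption that the current task owns units 3 to 5 are hypothetical; the actual code in test_models.py operates on the network's output tensors.

```python
def predict_all_units(logits):
    """Criterion 1: argmax over every output unit of the final layer."""
    return max(range(len(logits)), key=lambda i: logits[i])


def predict_relevant_units(logits, relevant_units):
    """Criterion 2: argmax restricted to the units that belong to the current task."""
    return max(relevant_units, key=lambda i: logits[i])


# Hypothetical 6-unit output where units 3..5 belong to the current task.
logits = [2.5, 0.1, 0.3, 1.9, 0.4, 1.2]
pred_all = predict_all_units(logits)
pred_relevant = predict_relevant_units(logits, range(3, 6))
```

In this example a unit belonging to an older task has the highest score, so the first criterion yields a wrong prediction while the second still picks a unit of the current task. This is why the first criterion is the harder one.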

The present test_models.py is written assuming that your system can handle all the tasks in the full sequence. Please make the necessary changes to make the testing procedure compatible with your computational resources.

The results reported are for this particular setting [Number of epochs used for training: 15]:

Input_Task_Number: The task that was fed to the model
Model_activated: The model that was identified for this task. The correct model was identified in these cases
Accuracy: Has been rounded to the nearest two decimals [number of right labels identified]

Input_Task_Number	Model_activated	Accuracy (in %)
3	3	63
1	1	64
5	5	59
2	2	54
4	4	69

Final Takeaways

The approach proposed in this paper loads only the required model into memory at inference. However, searching over all the autoencoders to identify the correct model is a really expensive procedure, and this will only get worse with an increasing number of tasks. Clearly, this would not scale to much longer sequences. It is also not clear how the authors stabilized the training procedure for the "Learning without Forgetting" approach.

To-Do's for this Project

-[ ] Figure out ways to stabilize the training procedure. I have isolated the problem to the distillation loss calculation. The problem also arises if the first "expert" has been trained for a small number of epochs (<10 in my experiments).

References

Rahaf Aljundi, Punarjay Chakravarty, Tinne Tuytelaars. Expert Gate: Lifelong Learning with a Network of Experts. CVPR 2017. [arxiv]
Zhizhong Li, Derek Hoiem. Learning without Forgetting. ECCV 2016. [arxiv]
Geoffrey Hinton, Oriol Vinyals, Jeff Dean. Distilling the Knowledge in a Neural Network. NeurIPS 2014 (formerly known as NIPS). [arxiv]
PyTorch Docs. [http://pytorch.org/docs/master]

This repository owes a huge credit to the authors of the original implementation (code in Matlab). This code repository could only be built thanks to the help offered by countless people on Stack Overflow and the PyTorch Discuss forums.

License

BSD

wannabeOG / ExpertNet-Pytorch