Source code to the paper https://arxiv.org/abs/1606.01467
My current thoughts on hyperparameter optimization can be found in my blog post.
We are managing this project with ZenHub following http://devblog.edsurge.com/scrum-kanban-trello-github-issues/
Let's take a look at the video of the training of (lower-layer) deep newtorks' weights: http://cs.nyu.edu/~yann/research/sparse/psd-anim.gif
Actually the video is about sparse coding, but they are similar to the training at lower-layer deep networks.
Imagine that we are playing a weird Atari game with the above screen. The screen seems simpler than real Atari games'. This reminds us of the difference between biomedical image processing and natural image processing. In biomedical images, the objects (e.g. red blood cells) are much simpler, thus needing simpler and smaller models. This means, we can use a small model (a DQN) to contrl the training of a larger model, e.g. a ultra deep convolutional neural networks.
However, there is one problem with this approach: when using stochastic training methods, the order of weights at very episode changes spontaneously. In analogy to applying DQNs to Atari games, it does not make sense for the Atari games to change their graphics APIs at every episode. Thus, we add a meta-momentum
term to elementary optimization objective function. We suspect that this is the main reason other methods need hundreds of thousands of episodes, whereas we only need 20 episodes on MNIST dataset.
Then we realize that the meta-momentum
can actually itself accelerate the overall hyperparameter tuning process significantly. Obviously, if we only gradually change some hyperparameters, the training trajectories of the DNN being tuned by DQNs should not differ significantly across episodes. Oh, this is also like the catapult used in Angry Birds...
Experimental log can be found here: https://github.com/bigaidream-experiments/qan-exp
We are using Lua/Torch. The DQN component is mostly modified from DeepMind Atari DQN.
You might need to run install_dependencies.sh
first.
cd mnist_lr/;
cd mnist;
th train-on-mnist.lua; #get regression filter, save in ../save/
./run_gpu; #Start tune learning rate using dqn
#To get the test curve, run following command
cd mnist_lr/dqn/logs;
python paint_lr_episode.py;
python paint_lr_vs.py;
cd mnist_minibatch;
cd mnist;
th train-on-mnist.lua; #get regression filter, save in ../save/
./run_gpu; #Start select mini-batch using dqn
#To get the test curve, run following command
cd mnist_minibatch/dqn/logs;
python paint_mini_episode.py;
python paint_mini_vs.py;
luarocks install torchnet
luarocks install optnet
luarocks install iterm
#Install cuDNN-5.1
wget http://developer.download.nvidia.com/compute/redist/cudnn/v5.1/cudnn-7.5-linux-x64-v5.1.tgz
tar xvf cudnn-7.5-linux-x64-v5.1.tgz
sudo cp cuda/lib64/* /usr/local/cuda/lib64/
sudo cp cuda/include/cudnn.h /usr/local/cuda/include/
mkdir cifar_lr/dqn/datasets;
#Download CIFAR_10 whitened dataset from [here](https://yadi.sk/d/em4b0FMgrnqxy) and save in cifar_lr/dqn/datasets
cd cifar_lr/dqn/datasets;
th split.lua; #get trainvalidata.t7
cd ../..;
#get baseline log and save baseline weight
max_episode=1 take_action=0 savebaselineweight=1 output_file='logs/torchnet_test_baseline.log' ./run_gpu
./run_gpu; #Start tune learning rate using dqn
#To get the test curve, run following command
cd cifar_lr/dqn/logs;
python paint.py;
#and check the acc.pdf
- GPU device can be set in
run_gpu
wheregpu=0
- Learning rate can be set in
/ataricifar/dqn/cnnGameEnv.lua
, in thestep
function. - When to stop doing regression is in
/ataricifar/dqn/cnnGameEnv/lua
, in line 250
-
Q:
Do the actions change over time? Do they converge to something like the alternated classes with uniform sampling heuristics that is always used in CNNs?A:
From what we observed, the actions do change over time. But the pattern is much more complex than uniform sampling heuristics. The visualization plot can be found here -
Q:
Is the meta-momentum really necessary?A:
We think the added meta-momentum is the most important factor making our QAN converge so fast. Actually this is also extremely useful for all hyperparameter optimization tasks. Thus we already integrated this trick into another hyperparameter tuning tool DrMAD. The comparison can be found here. This is somehow similar to using knowledge distillation techniques. We are still investigating this effect in more detail.
@article{dqn-accelerate-dnn,
title={Deep Q-Networks for Accelerating the Training of Deep Neural Networks},
author={Fu, Jie and Lin, Zichuan and Chen, Danlu and Liu, Miao and Leonard, Nicholas and Feng, Jiashi and Chua, Tat-Seng},
journal={arXiv preprint arXiv:1606.01467},
year={2016}
}
If you have any problems or suggestions, please contact: jie.fu A~_~T u.nus.education