Lasagne / Lasagne

Lightweight library to build and train neural networks in Theano

Home Page: http://lasagne.readthedocs.org/

Batch normalization - why not use Theano?

botev opened this issue

So, is there any reason why Lasagne still hasn't started using T.nnet.batch_normalization_train/test for the BatchNormLayer? I did a small test, and with fast_compile the Lasagne implementation is very numerically unstable, while the Theano version works fine even with optimizer=None.

IIRC, the Lasagne layer existed before the batch norm PR was merged into Theano.
Can you explain a bit more about the numerical instability? I think fast_compile might not run all the optimizers, so the graph might not be fully optimized. But yeah, I think we can have the BatchNormLayer use the Theano Op; @f0k has to approve.

Yeah, that is exactly the point, but the Theano op remains stable under fast_compile. Sometimes for debugging (catching errors) I need fast_compile or none in order to find the correct location of an error, but because the Lasagne BatchNorm now gives NaNs, it's very hard to track what goes wrong in the computation. The Theano ops _train and _test pretty much cover everything that is currently done and are easy to add. I can make a PR as long as you guys are on board with that.
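For reference, a minimal sketch of the two interfaces in question, with made-up shapes and shared-variable setup ('spatial' normalization of a 4D input); the argument order follows the usage later in this thread:

import numpy as np
import theano
import theano.tensor as T

x = T.tensor4('x')
# Per-channel ("spatial") normalization: parameters must be broadcastable
# over the batch and spatial axes.
param_shape = (1, 3, 1, 1)
pattern = (True, False, True, True)
gamma = theano.shared(np.ones(param_shape, 'float32'), broadcastable=pattern)
beta = theano.shared(np.zeros(param_shape, 'float32'), broadcastable=pattern)
running_mean = theano.shared(np.zeros(param_shape, 'float32'),
                             broadcastable=pattern)
running_var = theano.shared(np.ones(param_shape, 'float32'),
                            broadcastable=pattern)

# Training: normalize with mini-batch statistics; the op also returns new
# running averages, which we wire up as explicit updates ourselves.
(out, mean, invstd, new_mean,
 new_var) = T.nnet.bn.batch_normalization_train(
        x, gamma, beta, 'spatial', 1e-4, 0.1, running_mean, running_var)
train_fn = theano.function([x], out, updates=[(running_mean, new_mean),
                                              (running_var, new_var)])

# Inference: normalize with the stored running averages instead.
out_det = T.nnet.bn.batch_normalization_test(
        x, gamma, beta, running_mean, running_var, 'spatial', 1e-4)
test_fn = theano.function([x], out_det)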

> I need fast_compile or none in order to find the correct location of an error, but because the Lasagne BatchNorm now gives NaNs

Oh! Is it possible to share a code snippet with which we can reproduce this? I haven't tried running Lasagne code with the fast_compile flag. Also, just to confirm: is this FAST_COMPILE (the mode) or fast_compile (the optimizer)? The latter is unstable and doesn't pass the tests in Theano master, so if it gives bizarre results, I'm not sure that has to be considered abnormal. cc: @nouiz can say more about this.
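A small sketch of the difference between the two, assuming a toy expression that the default optimizer normally stabilizes (the exact set of rewrites applied may vary by Theano version):

import numpy as np
import theano
import theano.tensor as T

x = T.vector('x')
# Mathematically sigmoid(x); written naively, exp() overflows for large x.
# Theano's stabilizing rewrites (part of the default fast_run optimizer)
# normally replace this pattern with the stable sigmoid op.
y = T.exp(x) / (1 + T.exp(x))

# FAST_COMPILE is a whole compilation *mode*: Python implementations and
# very few graph optimizations.
f_mode = theano.function([x], y, mode='FAST_COMPILE')

# fast_compile is only a reduced graph *optimizer*: compiled code, but the
# stabilizing rewrites may be skipped, which is how NaNs can sneak in.
f_opt = theano.function([x], y, mode=theano.compile.Mode(
        linker='cvm', optimizer='fast_compile'))

xs = np.array([0., 1000.], dtype=theano.config.floatX)
print(f_opt(xs))  # can print [0.5, nan] when the stable rewrite is skipped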

> The Theano ops _train and _test pretty much cover everything that is currently done and are easy to add. I can make a PR as long as you guys are on board with that.

I think the general preference is to use Theano's Op/interface wherever possible in Lasagne. The normalization layers aren't in the current stable version and we're planning a new release soon, hence a PR on this is most welcome, in my opinion. But Jan has to confirm this :)

I'll need to dig up the code that initially produced that error, if I can. And yes, it's optimizer=fast_compile or none. And yes, as long as @f0k is on board with it, I can make a PR pretty fast.

> So, is there any reason why Lasagne still hasn't started using T.nnet.batch_normalization_train/test for the BatchNormLayer?

It's just because I wanted to avoid relying on Theano 0.9 before releasing Lasagne 0.2. (I'm aware that clinging to this is getting sillier and sillier with every month that passes.)

> I think the general preference is to use Theano's Op/interface wherever possible in Lasagne.

Yes. The BatchNormLayer should be implemented similarly to the BatchNormDNNLayer, and then we could ditch the latter. We can also think about providing a way to use std instead of inv_std so we can leave all updates to cuDNN if the arguments allow.

> I wanted to avoid relying on Theano 0.9 before releasing Lasagne 0.2.

If I am not wrong, Theano is getting ready for the 0.10.0 release. Isn't it safe to rely on 0.9 itself, given that there is a significant speed-up from 0.8 to 0.9?

According to the docs and source, batch_norm is just a wrapper function around BatchNormLayer that applies batch normalization to the input layer and adds a NonlinearityLayer with the nonlinearity taken from the input layer.

So, as I understand it, batch_norm(Conv2DLayer(inp, 1, 1)) and NonlinearityLayer(BatchNormLayer(Conv2DLayer(inp, 1, 1, nonlinearity=None))) should be the same.

But when I test the models, these two seem to have different numbers of parameters. What leads to this?

> But when I test the models, these two seem to have different numbers of parameters. What leads to this?

According to the docs and source, in addition to moving the nonlinearity to the end of the chain, the batch_norm convenience function also removes the bias from the layer if there is any (as it would be useless). So when you apply this manually, in addition to nonlinearity=None, you should also pass b=None to the layer before.
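A minimal sketch of the equivalence (made-up layer sizes; both networks should then report the same parameter count):

import lasagne
from lasagne.layers import (InputLayer, Conv2DLayer, BatchNormLayer,
                            NonlinearityLayer, batch_norm)
from lasagne.nonlinearities import rectify

inp = InputLayer((None, 1, 28, 28))

# Convenience function: moves the nonlinearity behind the BatchNormLayer
# and removes the convolution's bias automatically.
net_a = batch_norm(Conv2DLayer(inp, 16, 3, nonlinearity=rectify))

# Manual equivalent: disable both the nonlinearity *and* the bias (b=None),
# otherwise the convolution carries an extra parameter.
net_b = NonlinearityLayer(
        BatchNormLayer(Conv2DLayer(inp, 16, 3, nonlinearity=None, b=None)),
        rectify)

print(lasagne.layers.count_params(net_a) ==
      lasagne.layers.count_params(net_b))  # True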

Yeah, it was my fault; I forgot the bias term.

> The BatchNormLayer should be implemented similarly to the BatchNormDNNLayer, and then we could ditch the latter. We can also think about providing a way to use std instead of inv_std so we can leave all updates to cuDNN if the arguments allow.

I just tried the latter (it's actually var, not std), but curiously enough, for a simple example (a WGAN on MNIST), training gets about 10% slower when leaving the updates to cuDNN instead of doing them ourselves. That's a pity, but the good news is we won't need to break the interface to store the running variance instead of the running inverse standard deviation. Unless I did something silly in my code, of course. My implementation is attached below.

An updated `BatchNormLayer` implementation that uses `T.nnet.bn`.
import theano
import theano.tensor as T

import lasagne
from lasagne import init


class BatchNormLayer(lasagne.layers.BatchNormLayer):
    """
    Implementation of BatchNormLayer for recent Theano versions
    """
    def __init__(self, incoming, axes='auto', epsilon=1e-4, alpha=0.1,
                 beta=init.Constant(0), gamma=init.Constant(1),
                 mean=init.Constant(0), var=init.Constant(1), **kwargs):
        super(BatchNormLayer, self).__init__(
                incoming, axes, epsilon, alpha, beta, gamma, mean, var,
                **kwargs)
        self.var = self.inv_std
        del self.inv_std

    def get_output_for(self, input, deterministic=False,
                       batch_norm_use_averages=None,
                       batch_norm_update_averages=None, **kwargs):
        # Decide whether to use the stored averages or mini-batch statistics
        if batch_norm_use_averages is None:
            batch_norm_use_averages = deterministic
        use_averages = batch_norm_use_averages

        # Decide whether to update the stored averages
        if batch_norm_update_averages is None:
            batch_norm_update_averages = not deterministic
        update_averages = batch_norm_update_averages

        # Theano requires beta/gamma tensors; create dummies if needed
        shape = tuple(s for (d, s) in enumerate(input.shape)
                      if d not in self.axes)
        gamma = self.gamma if self.gamma is not None else T.ones(shape)
        beta = self.beta if self.beta is not None else T.zeros(shape)

        # obtain batch-normalized outputs and statistics, if needed
        if update_averages:
            (normalized, input_mean, input_inv_std, new_running_mean,
             new_running_var) = T.nnet.bn.batch_normalization_train(
                     input, gamma, beta, self.axes, self.epsilon,
                     self.alpha, self.mean, self.var)
        elif not use_averages:
            (normalized, input_mean,
             input_inv_std) = T.nnet.bn.batch_normalization_train(
                     input, gamma, beta, self.axes, self.epsilon)

        # normalize with stored averages, if needed
        if use_averages:
            normalized = T.nnet.bn.batch_normalization_test(
                    input, gamma, beta, self.mean, self.var, self.axes)

        # update stored averages, if needed
        if update_averages:
            # Trick: To update the stored statistics, we create memory-aliased
            # clones of the stored statistics:
            running_mean = theano.clone(self.mean, share_inputs=False)
            running_var = theano.clone(self.var, share_inputs=False)
            # set a default update for them:
            running_mean.default_update = new_running_mean
            running_var.default_update = new_running_var
            # and make sure they end up in the graph without participating in
            # the computation (this way their default_update will be collected
            # and applied, but the computation will be optimized away):
            dummy = 0 * (running_mean + running_var).sum()
            normalized = normalized + dummy

        return normalized

OK, it seems my implementation precluded an in-place optimization for the running mean/var updates. With the following variant, performance is equivalent to the current BatchNormDNNLayer. So there's nothing to gain from letting cuDNN perform the updates of the running mean and variance, and we don't need to change the interface.

Note that we should still use T.nnet.bn, but in the same way it's currently done in BatchNormDNNLayer.

Improved `BatchNormLayer` implementation compared to the previous post.
import theano
import theano.tensor as T

import lasagne
from lasagne import init


class BatchNormLayer(lasagne.layers.BatchNormLayer):
    """
    Implementation of BatchNormLayer for recent Theano versions
    """
    def __init__(self, incoming, axes='auto', epsilon=1e-4, alpha=0.1,
                 beta=init.Constant(0), gamma=init.Constant(1),
                 mean=init.Constant(0), var=init.Constant(1), **kwargs):
        super(BatchNormLayer, self).__init__(
                incoming, axes, epsilon, alpha, beta, gamma, mean, var,
                **kwargs)
        self.var = self.inv_std
        del self.inv_std

    def get_output_for(self, input, deterministic=False,
                       batch_norm_use_averages=None,
                       batch_norm_update_averages=None, **kwargs):
        # Decide whether to use the stored averages or mini-batch statistics
        if batch_norm_use_averages is None:
            batch_norm_use_averages = deterministic
        use_averages = batch_norm_use_averages

        # Decide whether to update the stored averages
        if batch_norm_update_averages is None:
            batch_norm_update_averages = not deterministic
        update_averages = batch_norm_update_averages

        # Theano requires beta/gamma tensors; create dummies if needed
        shape = tuple(s for (d, s) in enumerate(input.shape)
                      if d not in self.axes)
        gamma = self.gamma if self.gamma is not None else T.ones(shape)
        beta = self.beta if self.beta is not None else T.zeros(shape)

        # obtain batch-normalized outputs and statistics, if needed
        if update_averages:
            # Trick: To update the stored statistics, we create memory-aliased
            # clones of the stored statistics, and give them a default update
            running_mean = theano.clone(self.mean, share_inputs=False)
            running_var = theano.clone(self.var, share_inputs=False)
            (normalized, input_mean, input_inv_std, new_running_mean,
             new_running_var) = T.nnet.bn.batch_normalization_train(
                     input, gamma, beta, self.axes, self.epsilon,
                     self.alpha, running_mean, running_var)
            running_mean.default_update = new_running_mean
            running_var.default_update = new_running_var
        elif not use_averages:
            (normalized, input_mean,
             input_inv_std) = T.nnet.bn.batch_normalization_train(
                     input, gamma, beta, self.axes, self.epsilon)

        # normalize with stored averages, if needed
        if use_averages:
            normalized = T.nnet.bn.batch_normalization_test(
                    input, gamma, beta, self.mean, self.var, self.axes)

        return normalized
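
For completeness, a hypothetical usage sketch; the subclass above is meant as a drop-in replacement for lasagne.layers.BatchNormLayer (but it stores var instead of inv_std, so saved parameter values are not interchangeable with the old layer):

from lasagne.layers import InputLayer, Conv2DLayer, NonlinearityLayer
from lasagne.nonlinearities import rectify

l = InputLayer((None, 3, 32, 32))
l = Conv2DLayer(l, 16, 3, nonlinearity=None, b=None)
l = BatchNormLayer(l)  # the T.nnet.bn-based subclass defined above
l = NonlinearityLayer(l, rectify)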