Lasagne / Lasagne

Lightweight library to build and train neural networks in Theano

Home Page: http://lasagne.readthedocs.org/


Batch normalization: when "batch_norm_use_averages" is enabled, the statistics of the current batch are not used?

Sunnydreamrain opened this issue · comments

As in the following line, it seems that when "batch_norm_use_averages" is True, mean and inv_std are not updated with the statistics of the current batch:
https://github.com/Lasagne/Lasagne/blob/master/lasagne/layers/normalization.py#L289

Is this intentional?
If they should be updated with the statistics of the current batch, is the following right?

    if update_averages:
        if use_averages:
            mean = self.mean * (1 - self.alpha) + self.alpha * input_mean
            inv_std = (self.inv_std * (1 - self.alpha) +
                       self.alpha * input_inv_std)
        # Trick: To update the stored statistics, we create memory-aliased
        # clones of the stored statistics:
        running_mean = theano.clone(self.mean, share_inputs=False)
        running_inv_std = theano.clone(self.inv_std, share_inputs=False)
        # set a default update for them:
        running_mean.default_update = ((1 - self.alpha) * running_mean +
                                       self.alpha * input_mean)
        running_inv_std.default_update = ((1 - self.alpha) *
                                          running_inv_std +
                                          self.alpha * input_inv_std)
        # and make sure they end up in the graph without participating in
        # the computation (this way their default_update will be collected
        # and applied, but the computation will be optimized away):
        mean += 0 * running_mean
        inv_std += 0 * running_inv_std

As in the following line, it seems that when "batch_norm_use_averages" is True, mean and inv_std are not updated with the statistics of the current batch?

batch_norm_use_averages is independent of batch_norm_update_averages; enabling it does not prevent the stored statistics from being updated. This is also ensured in the tests: https://github.com/Lasagne/Lasagne/blob/39bc1bc/lasagne/tests/layers/test_normalization.py#L182-L235

Do you find otherwise?
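To make the independence of the two flags concrete, here is a toy model in plain Python (not Lasagne's actual Theano graph code; the function name and scalar statistics are illustrative only). "Use" controls which statistic normalizes the current batch; "update" controls whether the stored statistic moves afterwards:

```python
def batch_norm_step(input_mean, stored_mean, alpha,
                    use_averages, update_averages):
    """Toy sketch of how the two flags act independently:
    one picks the statistic used for normalization, the other
    decides whether the stored running statistic is updated."""
    # Which statistic is used to normalize this batch:
    mean = stored_mean if use_averages else input_mean
    # Whether the stored statistic is updated afterwards:
    if update_averages:
        new_stored = (1 - alpha) * stored_mean + alpha * input_mean
    else:
        new_stored = stored_mean
    return mean, new_stored

# Both flags on: the stored mean is used for normalization,
# and it is still updated with the current batch afterwards.
m, s = batch_norm_step(2.0, 0.5, 0.1,
                       use_averages=True, update_averages=True)
# m is the old stored mean (0.5); s is the updated running mean.
```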

When both "batch_norm_use_averages" and "batch_norm_update_averages" are enabled, the mean and inv_std are obtained from past data only. For example, when training on the second batch, mean and inv_std come from the first batch alone. Instead, I think they should be based on both the past data and the current data, i.e., the mean and inv_std after being updated with the current batch. This does not affect testing, only training.
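The difference between the two orderings can be sketched numerically in plain Python (a toy scalar model, not Lasagne code; alpha and the batch means are made up):

```python
alpha = 0.1
running_mean = 0.0
batch_means = [1.0, 2.0, 3.0]

used_current, used_proposed = [], []
for input_mean in batch_means:
    # Current behaviour: normalize with the stored running mean
    # as-is, then update it with the current batch.
    used_current.append(running_mean)
    # Proposed behaviour: update the running mean first, then
    # normalize with the result (includes the current batch).
    updated = (1 - alpha) * running_mean + alpha * input_mean
    used_proposed.append(updated)
    running_mean = updated

# used_current lags one batch behind; used_proposed already
# reflects the statistics of the batch being normalized.
```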

Line 277 should come after line 289. (https://github.com/Lasagne/Lasagne/blob/master/lasagne/layers/normalization.py#L277)

Instead, I think it should be obtained based on the past data and the current data, which is the mean and inv_std updated with the current batch.

I see. This would make batch_norm_use_averages behave differently depending on whether batch_norm_update_averages is enabled or disabled. I thought it would be more sensible if it always did the same thing (i.e., using the stored statistics, not introducing any dependency on the current batch). In any case, we cannot change the behaviour without breaking backwards compatibility.

We could add a third option batch_norm_use_updated_averages for what you propose. What's the use case for this? Is this general enough to warrant inclusion in Lasagne, or is it easier if you use a custom layer in your code? The code excerpt you had in the first post looks correct at first glance.

First, I think it is the right way to perform batch normalization when "batch_norm_use_averages" is enabled.
Second, when I use BN for RNNs, it makes a big difference because the statistics can change quite quickly between batches. It is nice to smooth them with the past information and the current batch information.

Thanks for the suggestion. I have used the code I posted earlier and it is working okay.

It is nice to smooth them with the past information and the current batch information.

So your networks train better than either without batch_norm_use_averages, or with the current implementation of batch_norm_use_averages? I haven't seen this in a paper, but if it works (and ideally if it becomes a bit widespread, or is published), we could add a batch_norm_use_updated_averages option. I just wouldn't want to make the code (and documentation) more complex if it's a niche feature.

I'll leave this issue open so we can review and decide on this when the issue is rediscovered sometime in the future. :)

Okay. Thanks. I will let you know when I have the final results.