tracel-ai / burn

Burn is a new comprehensive dynamic Deep Learning Framework built using Rust with extreme flexibility, compute efficiency and portability as its primary goals.

Home Page: https://burn.dev


NaN error in LayerNorm when using ndarray backend

jnamika opened this issue · comments

Describe the bug
When using the ndarray backend, if the input to LayerNorm is a zero vector, the backward pass produces NaN gradients, and the model parameters become NaN after the optimizer step.

To Reproduce
The reproduction code is as follows:

use burn::{
    backend::{ndarray::NdArrayDevice, Autodiff},
    config::Config,
    module::Module,
    nn::loss::{MseLoss, Reduction},
    nn::{LayerNorm, LayerNormConfig, Linear, LinearConfig, Relu},
    optim::AdamConfig,
    tensor::{
        backend::{AutodiffBackend, Backend},
        Shape, Tensor,
    },
    train::{RegressionOutput, TrainOutput, TrainStep},
};

#[derive(Module, Debug)]
pub struct Model<B: Backend> {
    linear: Linear<B>,
    norm: LayerNorm<B>,
    activation: Relu,
}

#[derive(Config, Debug)]
pub struct ModelConfig {
    pub d_input: usize,
    pub d_output: usize,
}

impl ModelConfig {
    pub fn init<B: Backend>(&self, device: &B::Device) -> Model<B> {
        let linear = LinearConfig::new(self.d_input, self.d_output).init(device);
        let norm = LayerNormConfig::new(self.d_output).init(device);
        let activation = Relu::new();

        Model {
            linear,
            norm,
            activation,
        }
    }
}

type Batch<B> = (Tensor<B, 2>, Tensor<B, 2>);

impl<B: Backend> Model<B> {
    pub fn forward(&self, input: Tensor<B, 2>) -> Tensor<B, 2> {
        let x: Tensor<B, 2> = self.linear.forward(input) - 5.0;
        self.norm.forward(self.activation.forward(x))
    }
}

impl<B: AutodiffBackend> TrainStep<Batch<B>, RegressionOutput<B>> for Model<B> {
    fn step(&self, batch: Batch<B>) -> TrainOutput<RegressionOutput<B>> {
        let input = batch.0;
        let targets = batch.1;
        let output = self.forward(input);
        let loss = MseLoss::new().forward(output.clone(), targets.clone(), Reduction::Mean);
        TrainOutput::new(
            self,
            loss.backward(),
            RegressionOutput::new(loss, output, targets),
        )
    }
}

fn main() {
    type B = Autodiff<burn::backend::NdArray<f32>>;
    let device = NdArrayDevice::default();
    let input = Tensor::<B, 2>::zeros(Shape::new([2, 4]), &device);
    let targets = Tensor::<B, 2>::from_data([[0., 0.5], [0.5, -1.0]], &device);
    let batch = (input.clone(), targets);
    let mut model = ModelConfig::new(4, 2).init::<B>(&device);
    let mut optim = AdamConfig::new().init();
    let output = model.step(batch);
    model = model.optimize(&mut optim, 0.001, output.grads);
    let output = model.forward(input);
    println!("output = {:?}", output.to_data());
}

Expected behavior
Even if a zero vector is input to LayerNorm, the learned model parameters do not become NaN.

Additional context
I suspect the cause is that epsilon in LayerNorm is not applied during the backward-pass calculation, so when the variance is zero the denominator becomes zero and the gradients blow up to NaN.
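For what it's worth, here is a backend-free sketch of the failure mode I suspect, assuming the standard LayerNorm formula y = (x - mean) / sqrt(var + eps). The variable names are mine for illustration, not Burn's internals:

```rust
fn main() {
    // A zero vector, like the first LayerNorm input in the repro above.
    let x = [0.0_f32; 4];
    let n = x.len() as f32;
    let mean = x.iter().sum::<f32>() / n; // 0.0
    let var = x.iter().map(|v| (v - mean).powi(2)).sum::<f32>() / n; // 0.0
    let eps = 1e-5_f32;

    // Forward pass: epsilon keeps the denominator non-zero, output is finite.
    let y: Vec<f32> = x.iter().map(|v| (v - mean) / (var + eps).sqrt()).collect();
    println!("forward output: {:?}", y); // all zeros, no NaN

    // If the backward pass scales gradients by 1/sqrt(var) WITHOUT epsilon,
    // a zero variance makes the factor infinite, and inf * 0.0 = NaN downstream.
    let scale_without_eps = 1.0 / var.sqrt();
    println!("grad scale without eps: {}", scale_without_eps); // inf

    // With epsilon restored, the same factor stays finite.
    let scale_with_eps = 1.0 / (var + eps).sqrt();
    println!("grad scale with eps: {}", scale_with_eps);
}
```

So the forward pass alone cannot produce the NaN; only a backward formula that drops epsilon matches the behavior I am seeing.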