tracel-ai / burn

Burn is a new comprehensive dynamic Deep Learning Framework built using Rust with extreme flexibility, compute efficiency and portability as its primary goals.

Home Page: https://burn.dev


Strange model initialization errors

antimora opened this issue · comments

There are a couple of strange errors that I see when I load Face ONNX and want to just instantiate a model.

These errors can be reproduced in this repo: https://github.com/antimora/face-onnx

By running:

cargo run --bin with_weights

Which will result in this error:

Uniform::new called with `low` non-finite.

and

cargo run --bin without_weights

Which will result in this error:

cast_slice>PodCastError(TargetAlignmentGreaterAndInputNotAligned)

Thanks for the reproducible example with code and artifacts! Will take a look at this today.

Both use cases are caused by an error during conv bias initialization. When calling parameter initialization, parameters are randomly initialized from a uniform distribution (based on kaiming init) and one of the initializers samples from (-inf, inf), which is invalid.

In the first use case (loading weights), it seems that only biases are randomly initialized. But they should be directly initialized from the loaded weights so there might be another bug there. Probably due to Option<Param<..>>.

/edit: the (-inf, inf) range is caused by an invalid fan_in value during initialization. The number of channels for the second conv2d is computed as 0 for a convolution with channels [1, 16] and groups=16:

let shape = [
    self.channels[1],
    self.channels[0] / self.groups, // here: 1 / 16 = 0
    self.kernel_size[0],
    self.kernel_size[1],
];
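To illustrate why a zero channel count produces a non-finite range: with a Kaiming-style uniform initializer, the bound is proportional to 1/sqrt(fan_in), and the fan_in derived from this weight shape is zero. A minimal sketch (the exact bound formula and gain here are assumptions for illustration, not Burn's actual code):

```rust
// Kaiming-style uniform bound (sketch; the gain and formula are assumed).
fn kaiming_bound(fan_in: usize) -> f64 {
    let gain = (2.0f64).sqrt(); // common gain choice for ReLU
    gain * (3.0 / fan_in as f64).sqrt()
}

fn main() {
    // fan_in = (channels_in / groups) * k0 * k1; here 1 / 16 == 0 (integer division)
    let (channels_in, groups, k) = (1usize, 16usize, 3usize);
    let fan_in = (channels_in / groups) * k * k;
    assert_eq!(fan_in, 0);

    let bound = kaiming_bound(fan_in);
    // 3.0 / 0.0 == f64::INFINITY, so the sampling range becomes (-inf, inf),
    // which is what triggers "Uniform::new called with `low` non-finite."
    assert!(bound.is_infinite());
}
```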

We recently changed Tensors to load lazily.
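For context on the second error: cast_slice refuses to reinterpret a byte buffer as, say, f32 when the buffer's address is not a multiple of the target type's alignment, which is exactly what TargetAlignmentGreaterAndInputNotAligned reports. Bytes loaded lazily from a larger buffer can easily start at an arbitrary offset. A minimal illustration of the alignment rule itself (this is the general condition, not Burn's loading code):

```rust
// The rule behind PodCastError::TargetAlignmentGreaterAndInputNotAligned:
// a &[u8] can only be reinterpreted as &[f32] if its starting address is a
// multiple of align_of::<f32>() (4 bytes).
fn is_aligned_for_f32(bytes: &[u8]) -> bool {
    (bytes.as_ptr() as usize) % std::mem::align_of::<f32>() == 0
}

fn main() {
    let buf = [0u8; 16];
    // Two sub-slices one byte apart can never both be 4-byte aligned,
    // so slicing tensor bytes at an arbitrary offset can hit this error.
    let a = is_aligned_for_f32(&buf[0..8]);
    let b = is_aligned_for_f32(&buf[1..9]);
    assert!(!(a && b));
}
```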

Ah, should it be self.channels[1] / self.groups? It seems to match the model configs:

impl<B: Backend> Model<B> {
    #[allow(unused_variables)]
    pub fn new(device: &B::Device) -> Self {
        let conv2d1 = Conv2dConfig::new([3, 16], [3, 3])
            .with_stride([2, 2])
            .with_padding(PaddingConfig2d::Explicit(1, 1))
            .with_dilation([1, 1])
            .with_groups(1)
            .with_bias(true)
            .init(device);
        let conv2d2 = Conv2dConfig::new([1, 16], [3, 3])
            .with_stride([1, 1])
            .with_padding(PaddingConfig2d::Explicit(1, 1))
            .with_dilation([1, 1])
            .with_groups(16)
            .with_bias(true)
            .init(device);
        let conv2d3 = Conv2dConfig::new([16, 32], [1, 1])
            .with_stride([1, 1])
            .with_padding(PaddingConfig2d::Valid)
            .with_dilation([1, 1])
            .with_groups(1)
            .with_bias(true)
            .init(device);
        let conv2d4 = Conv2dConfig::new([1, 32], [3, 3])
            .with_stride([2, 2])
            .with_padding(PaddingConfig2d::Explicit(1, 1))
            .with_dilation([1, 1])
            .with_groups(32)
            .with_bias(true)
            .init(device);
        let conv2d5 = Conv2dConfig::new([32, 32], [1, 1])
            .with_stride([1, 1])
            .with_padding(PaddingConfig2d::Valid)
            .with_dilation([1, 1])
            .with_groups(1)
            .with_bias(true)
            .init(device);
        let conv2d6 = Conv2dConfig::new([1, 32], [3, 3])
            .with_stride([1, 1])
            .with_padding(PaddingConfig2d::Explicit(1, 1))
            .with_dilation([1, 1])
            .with_groups(32)
            .with_bias(true)
            .init(device);
        let conv2d7 = Conv2dConfig::new([32, 32], [1, 1])
            .with_stride([1, 1])
            .with_padding(PaddingConfig2d::Valid)
            .with_dilation([1, 1])
            .with_groups(1)
            .with_bias(true)
            .init(device);
        let conv2d8 = Conv2dConfig::new([1, 32], [3, 3])
            .with_stride([2, 2])
            .with_padding(PaddingConfig2d::Explicit(1, 1))
            .with_dilation([1, 1])
            .with_groups(32)
            .with_bias(true)
            .init(device);
        let conv2d9 = Conv2dConfig::new([32, 64], [1, 1])
            .with_stride([1, 1])
            .with_padding(PaddingConfig2d::Valid)
            .with_dilation([1, 1])
            .with_groups(1)
            .with_bias(true)
            .init(device);
        let conv2d10 = Conv2dConfig::new([1, 64], [3, 3])
            .with_stride([1, 1])
            .with_padding(PaddingConfig2d::Explicit(1, 1))
            .with_dilation([1, 1])
            .with_groups(64)
            .with_bias(true)
            .init(device);
        let conv2d11 = Conv2dConfig::new([64, 64], [1, 1])
            .with_stride([1, 1])
            .with_padding(PaddingConfig2d::Valid)
            .with_dilation([1, 1])
            .with_groups(1)
            .with_bias(true)
            .init(device);
...

No, the channels order is correct: the weights are [channels_out, channels_in / groups, kernel_size_1, kernel_size_2], which is why the order of channels is reversed.

I actually think the ONNX model might be ill defined 🤔 the number of channels must be divisible by the number of groups. Also, it doesn't make sense that the second convolution has 1 input channel while the first convolution has 16 output channels.

/edit: ha, there is a small bug in our check. The message states that both in and out channels should be divisible by groups, but the condition is inverted:

pub(crate) fn checks_channels_div_groups(channels_in: usize, channels_out: usize, groups: usize) {
    let channels_in_div_by_group = channels_in % groups == 0;
    let channels_out_div_by_group = channels_out % groups == 0;

    if !channels_in_div_by_group && !channels_out_div_by_group {
        panic!("Both channels must be divisible by the number of groups. Got channels_in={channels_in}, channels_out={channels_out}, groups={groups}");
    }
}

When the check actually requires that both are divisible (i.e. !channels_in_div_by_group || !channels_out_div_by_group), it panics here before reaching the incorrect initialization.
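Flipping the condition from && to || makes the check panic as soon as either channel count is not divisible by the number of groups, matching what the error message already claims. A sketch of the corrected check:

```rust
// Corrected check (sketch): panic when EITHER channel count is not
// divisible by the number of groups.
pub fn checks_channels_div_groups(channels_in: usize, channels_out: usize, groups: usize) {
    let channels_in_div_by_group = channels_in % groups == 0;
    let channels_out_div_by_group = channels_out % groups == 0;

    if !channels_in_div_by_group || !channels_out_div_by_group {
        panic!("Both channels must be divisible by the number of groups. Got channels_in={channels_in}, channels_out={channels_out}, groups={groups}");
    }
}

fn main() {
    // Valid: both channel counts divisible by groups.
    checks_channels_div_groups(16, 32, 16);
    // The [1, 16] / groups=16 conv from the issue now fails fast here
    // instead of reaching the invalid (-inf, inf) initialization.
    let caught = std::panic::catch_unwind(|| checks_channels_div_groups(1, 16, 16));
    assert!(caught.is_err());
}
```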

Never mind the part about the model being ill defined: it turns out the ONNX import did not take the number of groups into account, so that is where the discrepancy lies.