ibab / tensorflow-wavenet

A TensorFlow implementation of DeepMind's WaveNet paper

Understanding convolution kernels in dilation layers

redwrasse opened this issue

Hi @ibab,
I'm a bit late to the WaveNet paper implementation party, but I'm reading the paper and your code and trying to understand where the dilated convolution kernels appear. Your ASCII diagram shows


               |-> [gate]   -|        |-> 1x1 conv -> skip output
               |             |-> (*) -|
        input -|-> [filter] -|        |-> 1x1 conv -|
               |                                    |-> (+) -> dense output
               |------------------------------------|

        Where `[gate]` and `[filter]` are causal convolutions with a
        non-linear activation at the output. Biases and global conditioning
        are omitted due to the limits of ASCII art.

The Wavenet paper diagram shows a single 'Dilated Conv' fed into both tanh and sigmoid functions.
From your ASCII diagram and code (which agree), it seems there is in fact not one dilated convolution but two: one for the tanh (defining the 'filter') and one for the sigmoid (defining the 'gate'). Is this correct, and is this what was actually intended in the WaveNet paper?
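
To make the distinction concrete, here is a minimal sketch (my own illustration in TF2/Keras, not the repo's actual code) of two separate causal dilated convolutions feeding the gated activation:

    import tensorflow as tf

    # Illustrative values; dilation_channels matches the docstring below.
    dilation = 4
    dilation_channels = 32

    # One causal dilated conv for the filter, a second one for the gate.
    filter_conv = tf.keras.layers.Conv1D(
        dilation_channels, kernel_size=2, dilation_rate=dilation,
        padding='causal', name='filter_conv')
    gate_conv = tf.keras.layers.Conv1D(
        dilation_channels, kernel_size=2, dilation_rate=dilation,
        padding='causal', name='gate_conv')

    x = tf.random.normal([1, 16000, 32])  # (batch, time, residual_channels)
    out = tf.tanh(filter_conv(x)) * tf.sigmoid(gate_conv(x))  # gated activation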

Additionally, could you give a justification for the parameter choices not specified in the paper?

  '''Implements the WaveNet network for generative audio.

    Usage (with the architecture as in the DeepMind paper):
        dilations = [2**i for i in range(N)] * M
        filter_width = 2  # Convolutions just use 2 samples.
        residual_channels = 16  # Not specified in the paper.
        dilation_channels = 32  # Not specified in the paper.
        skip_channels = 16      # Not specified in the paper.
        net = WaveNetModel(batch_size, dilations, filter_width,
                           residual_channels, dilation_channels,
                           skip_channels)
        loss = net.loss(input_batch)
    '''
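
For what it's worth, the receptive field implied by a given `dilations` list can be computed directly, which helps when weighing these parameter choices. This is the standard formula for stacked dilated convolutions, not necessarily the repo's exact computation (which may also account for the initial causal layer):

    # Each dilated layer extends the receptive field by (filter_width - 1) * dilation.
    N, M = 10, 5  # illustrative values
    dilations = [2**i for i in range(N)] * M
    filter_width = 2

    receptive_field = (filter_width - 1) * sum(dilations) + 1
    print(receptive_field)  # 5116 samples for N=10, M=5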

Thanks in advance.

Answering this for myself after looking through the literature: yes, it looks like there are in fact two distinct dilated convolutions feeding the 'gated activation unit'. The original WaveNet paper diagrams appear misleading on this point.

@redwrasse, I agree that the original paper misses some details here and there. Take a look at (Gated) PixelCNN by WaveNet's main author (https://arxiv.org/pdf/1606.05328.pdf) and you will find that the gated activation is "copied" from there. Also, it seems they stacked the filter and gate along the output channel dimension to spare a conv1d.

For the latter, have a look here:
https://github.com/cheind/autoregressive/blob/e1f9b72b0f9764f9b4d6b6f65f028cd50db6940e/autoregressive/wave.py#L63
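
If I understand that trick correctly, it amounts to something like the following sketch (TF2/Keras, with names of my own choosing, not from either repo): one dilated conv with twice the channels, split into filter and gate halves.

    import tensorflow as tf

    dilation_channels = 32

    # A single causal dilated conv producing both halves at once.
    both = tf.keras.layers.Conv1D(
        2 * dilation_channels, kernel_size=2, dilation_rate=4, padding='causal')

    x = tf.random.normal([1, 16000, 32])
    f, g = tf.split(both(x), num_or_size_splits=2, axis=-1)
    out = tf.tanh(f) * tf.sigmoid(g)  # same gated activation, one conv call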

Thanks @cheind, I'll take a look. A side project I'd like to get back into.

@redwrasse, same for me :) I just figured that it works nicely on 2D images as well (without the special architecture of PixelCNN, just plain WaveNet with unrolled images). In addition, once you have the joint distribution the model estimates, you can start to query all kinds of things from the model (for example, given a WaveNet conditioned on the speaker id, what is the probability that this speech was spoken by speaker X?).
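
As a sketch of that kind of query, assuming a hypothetical `log_likelihood(audio, speaker_id)` helper (not a real API in either repo) that evaluates the model's log p(audio | speaker), Bayes' rule with a uniform speaker prior gives:

    import numpy as np

    def speaker_posterior(audio, num_speakers, log_likelihood):
        # log p(speaker | audio) is proportional to log p(audio | speaker)
        # + log p(speaker); the prior over speakers is assumed uniform here.
        log_liks = np.array([log_likelihood(audio, s) for s in range(num_speakers)])
        log_post = log_liks - np.logaddexp.reduce(log_liks)  # normalize in log space
        return np.exp(log_post)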

In case you are interested, I have a fairly elaborate presentation plus code here:
https://github.com/cheind/autoregressive/tree/image-support

The branch will be merged into main and closed soon, so here is a permalink:
https://github.com/cheind/autoregressive/tree/23701bd503843a1de82c6a32ba5bd6e8ad6965a3