SeanNaren / deepspeech.torch

Speech Recognition using DeepSpeech2 network and the CTC loss function.

Question about a difference between the code and the paper: filter dimensions and stride

Soonhwan-Kwon opened this issue 7 years ago

Soonhwan-Kwon commented 7 years ago

Table 4 on page 9 of the paper (https://arxiv.org/pdf/1512.02595.pdf) describes the filter dimensions and strides as below
(the first dimension is frequency and the second dimension is time):

(Architecture) (Channels) (Filter dimension) (Stride) ...
(2-layer 2D ) (32, 32 ) (41x11,21x11) (2x2, 2x1) ...

But in the code, deepspeech.torch/DeepSpeechModel.lua, lines 25 to 28 read:
conv:add(nn.SpatialConvolution(1, 32, 11, 41, 2, 2))
conv:add(nn.SpatialBatchNormalization(32))
conv:add(nn.Clamp(0, 20))
conv:add(nn.SpatialConvolution(32, 32, 11, 21, 2, 1))
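This seems to use a different stride scheme: nn.SpatialConvolution takes its arguments in the order
(nInputPlane, nOutputPlane, kW, kH, dW, dH), so, taking width as time and height as frequency,
the last line translated into the paper's notation would be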

(Architecture) (Channels) (Filter dimension) (Stride) ...
(2-layer 2D ) (32, 32 ) (41x11,21x11) (2x2, 1x2) ...
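
The translation above can be checked with a small sketch. It is written in plain Python rather than Lua so it runs without Torch, and it assumes, as this post does, that the convolution's width axis is time and its height axis is frequency. The helper names `to_paper_notation` and `conv_output_size` and the 100-frame input length are hypothetical, chosen only for illustration.

```python
# Sketch only: translate nn.SpatialConvolution arguments into the paper's
# "frequency x time" notation. ASSUMPTION: width = time, height = frequency.

def to_paper_notation(n_in, n_out, kW, kH, dW=1, dH=1):
    """Return (filter, stride) strings in frequency x time order."""
    # Torch's argument order is (nInputPlane, nOutputPlane, kW, kH, dW, dH):
    # width (time) comes before height (frequency), so swap each pair.
    return f"{kH}x{kW}", f"{dH}x{dW}"


def conv_output_size(size, kernel, stride, pad=0):
    """Output length along one axis: floor((size + 2*pad - kernel) / stride) + 1."""
    return (size + 2 * pad - kernel) // stride + 1


# The two convolution lines quoted from DeepSpeechModel.lua.
layer1 = to_paper_notation(1, 32, 11, 41, 2, 2)
layer2 = to_paper_notation(32, 32, 11, 21, 2, 1)
print("layer 1:", layer1)
print("layer 2:", layer2)

# Effect on the time axis for a hypothetical 100-frame input (no padding).
frames_after_layer1 = conv_output_size(100, 11, 2)
frames_code = conv_output_size(frames_after_layer1, 11, 2)   # time stride 2 (code)
frames_paper = conv_output_size(frames_after_layer1, 11, 1)  # time stride 1 (paper)
print("time steps after layer 2:", frames_code, "(code) vs", frames_paper, "(paper)")
```

Under these assumptions, the second layer in the code halves the time resolution again, whereas the stride listed in the paper would leave the time axis at stride 1 and downsample frequency instead.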

I'm wondering whether this is a misunderstanding on my part or a deliberately different scheme to get better performance.
Thank you in advance for your answer.