
DurIAN

Implementation of DurIAN: Duration Informed Attention Network For Multimodal Synthesis

Blog

Paper notes: Tencent AI Lab's multimodal speech synthesis model DurIAN

Structure

  1. The encoder has the same structure as the Tacotron 2 encoder, but with fewer parameters.
  2. The attention module in the decoder is removed. Since each decoder step predicts r = 3 frames at once, the memory for the corresponding three time steps is averaged, implemented as average pooling in the code (see the sketch after this list). In my experiments this costs only a small amount of audio quality compared to a decoder with attention, but speeds up training considerably.
  3. Position encoding and the skip encoder are dropped; this has little effect on synthesis quality.
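
The following is a minimal sketch of the average pooling in item 2, assuming a frame-aligned encoder memory of shape (batch, frames, channels); the function name and shapes are illustrative, not taken from this repository's code.

```python
import torch

def pool_memory(memory: torch.Tensor, r: int = 3) -> torch.Tensor:
    """Average the frame-aligned encoder memory over groups of r frames.

    memory: (batch, T, channels), already expanded to frame resolution by the
            duration model; T is assumed to be padded to a multiple of r.
    Returns (batch, T // r, channels): one pooled vector per decoder step,
    so that each decoder step predicts r mel frames at once.
    """
    b, t, c = memory.shape
    return memory.view(b, t // r, r, c).mean(dim=2)

# Example: batch of 32, 96 frames of 256-dim memory, reduction factor r = 3
pooled = pool_memory(torch.randn(32, 96, 256), r=3)
print(pooled.shape)  # torch.Size([32, 32, 256])
```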

Sample & Pretrained model

Samples are available here (WaveGlow is used as the vocoder), and a pretrained model is available here (batch size 32, trained for 180k steps).

Usage

training:

  1. pip install -r requirements.txt
  2. download and extract the LJSpeech dataset
  3. put the LJSpeech dataset in the data directory (a quick layout check is sketched after this list)
  4. unzip alignments.zip
  5. python3 preprocess.py
  6. CUDA_VISIBLE_DEVICES=0 python3 train.py
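
Before running preprocess.py it can help to confirm that the dataset and alignments ended up where the scripts expect them. The paths below are assumptions (the LJSpeech-1.1 folder contents are standard, but the alignments path is a guess), so adjust them to match the repository.

```python
import os

# Minimal layout sanity check before step 5; paths are assumptions, not taken
# from preprocess.py.
expected = [
    'data/LJSpeech-1.1/metadata.csv',  # standard LJSpeech transcript file
    'data/LJSpeech-1.1/wavs',          # standard LJSpeech audio directory
    'alignments',                      # assumed result of unzipping alignments.zip
]
for path in expected:
    print(f'{path}: {"ok" if os.path.exists(path) else "missing"}')
```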

testing:

  1. put the NVIDIA pretrained WaveGlow model in waveglow/pretrained_model (see the inference sketch after this list)
  2. CUDA_VISIBLE_DEVICES=0 python3 test.py --step [step-of-checkpoint]
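
For reference, here is a minimal sketch of how an NVIDIA WaveGlow checkpoint is typically used to turn a mel-spectrogram into a waveform; the checkpoint filename and the random mel tensor are placeholders, and test.py may differ in detail.

```python
import torch

# Load the published NVIDIA WaveGlow checkpoint (filename is an example);
# unpickling it requires the WaveGlow model code (e.g. waveglow/glow.py) to be importable.
waveglow = torch.load('waveglow/pretrained_model/waveglow_256channels.pt')['model']
waveglow = waveglow.remove_weightnorm(waveglow)
waveglow.cuda().eval()

# In practice the mel-spectrogram comes from the DurIAN model; a random
# (1, 80, T) tensor is used here only to show the expected shape.
mel = torch.randn(1, 80, 200).cuda()

with torch.no_grad():
    audio = waveglow.infer(mel, sigma=0.666)  # (1, samples) waveform
```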

testing using pretrained model:

  1. put the pretrained model in model_new
  2. CUDA_VISIBLE_DEVICES=0 python3 test.py --step 180000

Notes

Although DurIAN does not generate speech as fast as FastSpeech, its samples have better audio quality than FastSpeech's, and its computational cost is also lower. In actual deployment, DurIAN's generation speed already fully meets the real-time factor (RTF) requirement.
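
For context, the real-time factor mentioned above is simply wall-clock synthesis time divided by the duration of the generated audio (RTF < 1 means faster than real time). The helper below is only an illustration and not part of this repository.

```python
import time

def real_time_factor(synthesize, text: str, sample_rate: int = 22050) -> float:
    """RTF = synthesis wall-clock time / duration of the generated waveform."""
    start = time.time()
    audio = synthesize(text)              # any function returning a 1-D waveform
    elapsed = time.time() - start
    return elapsed / (len(audio) / sample_rate)
```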
