Feeding the audio waveform into every layer
SatyamKumarr opened this issue
As per issue #83, it was discussed that the input is provided as a raw waveform in the 1st layer, using a single-channel floating-point tensor.
- Why can't we extend this to multiple layers? Might that improve accuracy?
- What benefit does feeding the raw audio waveform into the 1st layer provide?
- Can this idea be extended to other waveforms (music, noisy speech data) instead of focusing on text-to-speech synthesis?
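For reference, here is a minimal sketch of what I understand the first-layer input to look like: the raw waveform kept as a single-channel floating-point tensor of shape `(1, num_samples)`. This is my own illustration, not code from this repo; the helper name `waveform_to_input` and the normalization step are assumptions on my part.

```python
import numpy as np

def waveform_to_input(samples):
    """Hypothetical helper: turn a 1-D raw waveform into a (1, N)
    float32 tensor in [-1, 1], i.e. the single-channel floating-point
    input described in issue #83."""
    x = np.asarray(samples, dtype=np.float32)
    peak = np.max(np.abs(x))
    if peak > 0:                 # normalize only if the signal is non-silent
        x = x / peak
    return x[np.newaxis, :]      # add the single channel dimension

# Example: a 1-second 440 Hz sine at 16 kHz as a stand-in for real audio.
sr = 16000
t = np.arange(sr) / sr
wave = np.sin(2 * np.pi * 440 * t)
inp = waveform_to_input(wave)
print(inp.shape, inp.dtype)  # (1, 16000) float32
```

Feeding this same tensor into later layers (e.g. concatenated with each layer's hidden activations) is presumably what the first question above is asking about.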