Our fork of ESPnet for CSE 5539 experiments.
We explored applying ideas from image processing to the spectrogram:
- Using axial attention blocks
- Using Swin Transformer blocks
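
As a rough illustration of the Swin idea on a spectrogram, the sketch below splits a time x frequency grid into non-overlapping windows and applies the half-window cyclic shift used between consecutive Swin blocks. It is a minimal NumPy sketch, not the code in this fork: the helper names `window_partition` and `shift_grid`, the grid shape, and the window size are all assumptions made for the example, and the attention mask that Swin applies to shifted windows is left out.

```python
import numpy as np

def window_partition(x, win):
    """Split a (T, F, D) grid into non-overlapping (win, win) windows.

    Returns an array of shape (num_windows, win * win, D); attention would
    then run independently inside each window.
    """
    T, F, D = x.shape
    assert T % win == 0 and F % win == 0, "grid must be divisible by the window size"
    x = x.reshape(T // win, win, F // win, win, D)
    x = x.transpose(0, 2, 1, 3, 4)  # (T//win, F//win, win, win, D)
    return x.reshape(-1, win * win, D)

def shift_grid(x, win):
    """Cyclically shift the grid by half a window along time and frequency."""
    return np.roll(x, shift=(-(win // 2), -(win // 2)), axis=(0, 1))

# Hypothetical sizes: 8 frames, 8 frequency bins, 2 features per bin.
T, F, D, win = 8, 8, 2, 4
grid = np.arange(T * F * D, dtype=float).reshape(T, F, D)

windows = window_partition(grid, win)                          # regular windows
shifted_windows = window_partition(shift_grid(grid, win), win)  # shifted windows
```

Alternating regular and shifted windows lets information cross window borders while keeping the attention cost local to each window.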
espnet_model with transformer encoder / decoder
Processing 2D frames of the spectrogram without prior convolutions, with axial attention applied on the frames.
Result: less effective
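
For reference, axial attention over a spectrogram can be sketched as two 1D self-attention passes, one along time and one along frequency, instead of one pass over all time x frequency positions. The NumPy sketch below is an assumption-laden illustration, not the model in this fork: it uses a single head, random projection weights, a made-up per-bin embedding, and hypothetical names (`self_attention`, `axial_attention`).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv):
    """Single-head attention over the second-to-last axis of x, shape (..., L, D)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def axial_attention(x, params_time, params_freq):
    """x: (T, F, D) grid of embedded spectrogram bins."""
    # Time axis: each frequency bin is treated as a batch element.
    xt = self_attention(np.swapaxes(x, 0, 1), *params_time)  # (F, T, D)
    x = x + np.swapaxes(xt, 0, 1)                            # residual, back to (T, F, D)
    # Frequency axis: each time frame is treated as a batch element.
    return x + self_attention(x, *params_freq)

rng = np.random.default_rng(0)
T, F, D = 6, 8, 4                          # hypothetical sizes
spec = rng.standard_normal((T, F))         # stand-in for a real spectrogram
x = spec[..., None] * rng.standard_normal(D)  # assumed embedding: scalar bin -> D dims

def make_params():
    # Random query/key/value projections; a trained model would learn these.
    return tuple(rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))

out = axial_attention(x, make_params(), make_params())
```

Each pass attends over sequences of length T or F, so the cost grows as T·F·(T + F) rather than (T·F)² for full 2D attention.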