KorInto
5-class sentence-final intonation classifier for a syllable-timed and head-final language (Korean)
Requirements
librosa, Keras (TensorFlow backend), NumPy
Data annotation
Manual tagging of recordings of Korean drama scripts (7,000 instances)
Classification into five categories
System Description
from dist import pred_into
- Given a filename as input, the model infers the sentence-final intonation label as output.
- High rise: 0, Low rise: 1, Fall-rise: 2, Level: 3, Fall: 4
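The index-to-class mapping above can be kept as a small lookup; the usage note below is a sketch that assumes `pred_into` takes an audio file path and returns one of these integer labels.

```python
# Mapping of output indices to intonation classes, as documented above.
INTONATION_LABELS = {
    0: "High rise",
    1: "Low rise",
    2: "Fall-rise",
    3: "Level",
    4: "Fall",
}

def label_name(index):
    """Return the human-readable class name for a predicted index."""
    return INTONATION_LABELS[index]
```

For example, `label_name(pred_into('sample.wav'))` would yield the class name, assuming `pred_into` returns an integer index as described.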
Features (last 300 frames)
- mel spectrogram (128 dim) + RMSE (1 dim), concatenated frame-wise >> (300 x 129)
- FFT window: 2,048
- Hop length: 512
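A minimal sketch of how the (300 x 129) feature matrix could be assembled, assuming the mel spectrogram and RMSE are computed per frame (e.g. with librosa at n_fft=2048, hop_length=512) and that clips shorter than 300 frames are zero-padded at the front; the padding scheme is an assumption, not something stated above.

```python
import numpy as np

def stack_features(mel, rmse, n_frames=300):
    """Concatenate a mel spectrogram (T, 128) with frame-wise RMSE (T, 1)
    and keep only the last n_frames frames, zero-padding shorter inputs.

    Returns an array of shape (n_frames, 129).
    """
    feats = np.concatenate([mel, rmse], axis=1)        # (T, 129)
    if feats.shape[0] >= n_frames:
        return feats[-n_frames:]                       # keep the final frames
    pad = np.zeros((n_frames - feats.shape[0], feats.shape[1]))
    return np.concatenate([pad, feats], axis=0)        # front-pad short clips
```

Keeping the *last* frames matches the sentence-final focus of the task: the intonation cue lives at the end of the utterance.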
Architecture: concatenation of the CNN and BiLSTM-self-attention outputs >> MLP
CNN
Conv (5 by 5, 32 filters, ReLU) - BN - MaxPool (2 by 2) - Dropout (0.3) >>
Conv (5 by 5, 64 filters, ReLU) - BN - MaxPool (2 by 2) - Dropout (0.3) >>
Conv (3 by 3, 128 filters, ReLU) - BN - MaxPool (2 by 2) - Dropout (0.3) >>
Conv (3 by 3, 32 filters, ReLU) - BN - MaxPool (2 by 1) >>
Conv (3 by 3, 32 filters, ReLU) - BN - MaxPool (2 by 1) >> Flatten (2016) >> Dense(64, ReLU)
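The Flatten size of 2016 is consistent with valid (unpadded) convolutions followed by the pooling stages above, ending at a 7 x 9 map over 32 filters (7 * 9 * 32 = 2016); the valid-padding assumption is inferred from that figure rather than stated. A quick shape walk-through:

```python
def conv_valid(size, kernel):
    """Output length of a valid (unpadded) convolution with stride 1."""
    return size - kernel + 1

def pool(size, window):
    """Output length of non-overlapping max pooling."""
    return size // window

t, f = 300, 129                      # input: time frames x feature dims
stages = [(5, (2, 2)), (5, (2, 2)), (3, (2, 2)), (3, (2, 1)), (3, (2, 1))]
for kernel, (pt, pf) in stages:
    t = pool(conv_valid(t, kernel), pt)
    f = pool(conv_valid(f, kernel), pf)

flat = t * f * 32                    # final layer has 32 filters
print(t, f, flat)                    # 7 9 2016
```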
BiLSTM-Self attention
BiLSTM hidden-state sequence: (300, 64x2=128) >> (300, 64) (via a dense layer)
Attention source: a 64-dim zero vector (np.zeros(64))
Attention source >> Dense(64, ReLU) >> context vector (64)
Context vector x hidden-state sequence (dot product per frame) >> attention vector (300)
Attention vector x hidden-state sequence (frame-wise weighting) >> weighted hidden states (300, 64)
Weighted hidden states >> summation over frames (64) >> concatenation with the CNN output (128)
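The attention steps above can be sketched in NumPy. Random arrays stand in for the trained dense layers and the two branch outputs; this illustrates the shapes, not the repository's actual weights.

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.standard_normal((300, 64))   # BiLSTM hidden sequence after the dense projection
c = rng.standard_normal(64)          # context vector (stand-in for Dense(64, ReLU) output)

scores = H @ c                       # one score per frame -> (300,)
weights = np.exp(scores - scores.max())
weights /= weights.sum()             # softmax -> attention vector (300,)

weighted = weights[:, None] * H      # frame-wise weighting -> (300, 64)
summary = weighted.sum(axis=0)       # summation over frames -> (64,)

cnn_out = rng.standard_normal(64)    # stand-in for the CNN branch output (Dense(64))
concat = np.concatenate([summary, cnn_out])  # (128,) fed to the MLP head
```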
MLP
(CNN + BiLSTM Self-attention) >> Dense(64, ReLU) - Dropout (0.3) >> Dense(64, ReLU) - Dropout (0.3) >> Softmax(5)
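A shape-level sketch of the classification head: the 128-dim concatenated vector passes through two Dense(64, ReLU) layers into a 5-way softmax. Weights are random stand-ins, and dropout is omitted since it only applies during training.

```python
import numpy as np

rng = np.random.default_rng(1)

def dense(x, w, b, relu=True):
    """Fully connected layer; ReLU activation unless it is the output layer."""
    y = x @ w + b
    return np.maximum(y, 0) if relu else y

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

x = rng.standard_normal(128)                        # CNN + attention concatenation
h1 = dense(x, rng.standard_normal((128, 64)), np.zeros(64))
h2 = dense(h1, rng.standard_normal((64, 64)), np.zeros(64))
probs = softmax(dense(h2, rng.standard_normal((64, 5)), np.zeros(5), relu=False))
# probs: a distribution over the five intonation classes
```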