KorInto
5-class sentence-final intonation classifier for a syllable-timed and head-final language (Korean)
Requirements
librosa, Keras (TensorFlow backend), NumPy
Data annotation
Manual tagging of recordings of Korean drama scripts (7,000 instances)
Classification into five categories
System Description
from dist import pred_into
- Given a filename as input, the model infers the sentence-final intonation label as output.
- High rise: 0, Low rise: 1, Fall-rise: 2, Level: 3, Fall: 4
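The index-to-class mapping above can be kept as a small lookup; the usage note below is a sketch that assumes `pred_into` takes an audio file path and returns one of these integer labels.

```python
# Mapping of output indices to intonation classes, as documented above.
INTONATION_LABELS = {
    0: "High rise",
    1: "Low rise",
    2: "Fall-rise",
    3: "Level",
    4: "Fall",
}

def label_name(index):
    """Return the human-readable class name for a predicted index."""
    return INTONATION_LABELS[index]
```

For example, `label_name(pred_into('sample.wav'))` would yield the class name, assuming `pred_into` returns an integer index as described.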
Features (last 300 frames)
- mel spectrogram (128 dim) + RMSE (1 dim), concatenated frame-wise >> (300 x 129)
- FFT window: 2,048
- Hop length: 512
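A minimal sketch of how the (300 x 129) feature matrix could be assembled, assuming the mel spectrogram and RMSE are computed per frame (e.g. with librosa at n_fft=2048, hop_length=512) and that clips shorter than 300 frames are zero-padded at the front; the padding scheme is an assumption, not something stated above.

```python
import numpy as np

def stack_features(mel, rmse, n_frames=300):
    """Concatenate a mel spectrogram (T, 128) with frame-wise RMSE (T, 1)
    and keep only the last n_frames frames, zero-padding shorter inputs.

    Returns an array of shape (n_frames, 129).
    """
    feats = np.concatenate([mel, rmse], axis=1)        # (T, 129)
    if feats.shape[0] >= n_frames:
        return feats[-n_frames:]                       # keep the final frames
    pad = np.zeros((n_frames - feats.shape[0], feats.shape[1]))
    return np.concatenate([pad, feats], axis=0)        # front-pad short clips
```

Keeping the *last* frames matches the sentence-final focus of the task: the intonation cue lives at the end of the utterance.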
Architecture: concatenation of the CNN and BiLSTM-self-attention outputs >> MLP
CNN
Conv (5 by 5, 32 filters, ReLU) - BN - MaxPool (2 by 2) - Dropout (0.3) >>
Conv (5 by 5, 64 filters, ReLU) - BN - MaxPool (2 by 2) - Dropout (0.3) >>
Conv (3 by 3, 128 filters, ReLU) - BN - MaxPool (2 by 2) - Dropout (0.3) >>
Conv (3 by 3, 32 filters, ReLU) - BN - MaxPool (2 by 1) >>
Conv (3 by 3, 32 filters, ReLU) - BN - MaxPool (2 by 1) >> Flatten (2016) >> Dense(64, ReLU)
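The Flatten size of 2016 is consistent with valid (unpadded) convolutions followed by the pooling stages above, ending at a 7 x 9 map over 32 filters (7 * 9 * 32 = 2016); the valid-padding assumption is inferred from that figure rather than stated. A quick shape walk-through:

```python
def conv_valid(size, kernel):
    """Output length of a valid (unpadded) convolution with stride 1."""
    return size - kernel + 1

def pool(size, window):
    """Output length of non-overlapping max pooling."""
    return size // window

t, f = 300, 129                      # input: time frames x feature dims
stages = [(5, (2, 2)), (5, (2, 2)), (3, (2, 2)), (3, (2, 1)), (3, (2, 1))]
for kernel, (pt, pf) in stages:
    t = pool(conv_valid(t, kernel), pt)
    f = pool(conv_valid(f, kernel), pf)

flat = t * f * 32                    # final layer has 32 filters
print(t, f, flat)                    # 7 9 2016
```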
BiLSTM-Self attention
BiLSTM hidden-state sequence: (300, 64x2=128) >> (300, 64) (via a dense layer)
Attention source: a 64-dim zero vector (np.zeros(64))
Attention source >> Dense(64, ReLU) >> context vector (64)
Context vector x hidden-state sequence (dot product per frame) >> attention vector (300)
Attention vector x hidden-state sequence (frame-wise weighting) >> weighted hidden states (300, 64)
Weighted hidden states >> summation over frames (64) >> concatenation with the CNN output (128)
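The attention steps above can be sketched in NumPy. Random arrays stand in for the trained dense layers and the two branch outputs; this illustrates the shapes, not the repository's actual weights.

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.standard_normal((300, 64))   # BiLSTM hidden sequence after the dense projection
c = rng.standard_normal(64)          # context vector (stand-in for Dense(64, ReLU) output)

scores = H @ c                       # one score per frame -> (300,)
weights = np.exp(scores - scores.max())
weights /= weights.sum()             # softmax -> attention vector (300,)

weighted = weights[:, None] * H      # frame-wise weighting -> (300, 64)
summary = weighted.sum(axis=0)       # summation over frames -> (64,)

cnn_out = rng.standard_normal(64)    # stand-in for the CNN branch output (Dense(64))
concat = np.concatenate([summary, cnn_out])  # (128,) fed to the MLP head
```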
MLP
(CNN + BiLSTM Self-attention) >> Dense(64, ReLU) - Dropout (0.3) >> Dense(64, ReLU) - Dropout (0.3) >> Softmax(5)
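A shape-level sketch of the classification head: the 128-dim concatenated vector passes through two Dense(64, ReLU) layers into a 5-way softmax. Weights are random stand-ins, and dropout is omitted since it only applies during training.

```python
import numpy as np

rng = np.random.default_rng(1)

def dense(x, w, b, relu=True):
    """Fully connected layer; ReLU activation unless it is the output layer."""
    y = x @ w + b
    return np.maximum(y, 0) if relu else y

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

x = rng.standard_normal(128)                        # CNN + attention concatenation
h1 = dense(x, rng.standard_normal((128, 64)), np.zeros(64))
h2 = dense(h1, rng.standard_normal((64, 64)), np.zeros(64))
probs = softmax(dense(h2, rng.standard_normal((64, 5)), np.zeros(5), relu=False))
# probs: a distribution over the five intonation classes
```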