
Speech-Recognition-for-Emoition-Detection

Objective

The task is to extract a set of prosodic correlates (i.e. suprasegmental speech parameters) and cepstral features from speech recordings. An emotion recognition system is then constructed to distinguish happy from sad emotional speech (a relatively easy two-class problem) using a simple supervised classifier training and testing setup.

The original speech data is a set of simulated (i.e. acted) emotional speech from ten speakers speaking five different pre-segmented sentences of roughly 2-3 seconds in two different emotional states (happy and sad), totaling 100 samples. Basic prosodic features (i.e. distribution parameters derived from the prosodic correlates) are extracted using a simple voiced/unvoiced analysis of speech, a pitch tracker, and an energy analysis. Another set of Mel-Frequency Cepstral Coefficient (MFCC) features is also calculated for comparison.

Support Vector Machine (SVM) classifiers are trained. A random subset of half of the available speech data (i.e. half of the speakers) is used to train the emotion recognition system, first using a set of simple prosodic parameter features and then a classical set of MFCC-derived features. The rest of the data (the other half of the speakers) is then used to evaluate the performance of the trained recognition systems.

Task 0. Preparation


Downsample the ‘speech_sample’ from the original Fs of 48 kHz to 11.025 kHz using the scipy.signal.resample() function. A code sketch of these steps follows the list below.

  1. Load the data 'speech_sample' from file lab2_data.mat. Make sure the sample is a 1-D time series by reshaping it.
  2. Declare the sampling frequency of the original signal, and the new sampling frequency.
  3. Resample the signal using scipy.signal.resample().
  4. Visualize the resampled signal in the time domain. Use an appropriate time vector as the x-axis.
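A minimal sketch of the preparation steps, assuming the .mat file stores the signal under the key 'speech_sample' (the variable names are illustrative):

```python
import numpy as np
import scipy.io
import scipy.signal
import matplotlib.pyplot as plt

# Load the .mat file and flatten the sample to a 1-D time series
data = scipy.io.loadmat('lab2_data.mat')
speech_sample = data['speech_sample'].reshape(-1).astype(float)

fs_orig = 48000  # original sampling frequency (Hz)
fs_new = 11025   # target sampling frequency (Hz)

# Resample so that the signal duration is preserved
n_new = int(len(speech_sample) * fs_new / fs_orig)
resampled = scipy.signal.resample(speech_sample, n_new)

# Time vector in seconds for the x-axis
t = np.arange(len(resampled)) / fs_new
plt.plot(t, resampled)
plt.xlabel('Time (s)')
plt.ylabel('Amplitude')
plt.title('Resampled speech signal')
plt.show()
```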

Task 1. Feature Extraction


Task 1.1 MFCC calculations using the provided sample speech signal.

  1. Pre-emphasize the resampled signal by applying a high-pass filter with the scipy.signal.lfilter() function: apply the pre-emphasis filter $H(z) = 1 - \alpha z^{-1}$ with $\alpha = 0.98$ to emphasize the higher frequencies of the downsampled speech signal. Hint: lfilter() takes two coefficient vectors, b for the numerator and a for the denominator, so the filter is defined as $$H(z) = \frac{b[0] z^0 + b[1] z^{-1} + \dots + b[i] z^{-i} + \dots}{a[0] z^0 + a[1] z^{-1} + \dots + a[i] z^{-i} + \dots}$$
  2. Extract the 12 MFCC coefficients using the python_speech_features.mfcc() function.
    1. The python_speech_features.mfcc() function has internal pre-emphasis functionality. However, since we calculate the pre-emphasis by hand in order to understand it better, its preemph argument should be set to 0.
    2. Visualize the 12 MFCC coefficient contours.
    3. Calculate the mean of each contour using numpy.mean() with the appropriate axis argument (a sketch of the whole task follows the list).
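A sketch of Task 1.1, continuing from the resampled signal and fs_new of Task 0 (the windowing arguments of mfcc() are left at their library defaults):

```python
import numpy as np
import scipy.signal
import matplotlib.pyplot as plt
from python_speech_features import mfcc

# Pre-emphasis filter H(z) = 1 - 0.98 z^-1: b is the numerator, a the denominator
alpha = 0.98
emphasized = scipy.signal.lfilter([1.0, -alpha], [1.0], resampled)

# 12 cepstral coefficients per frame; preemph=0 because we pre-emphasized by hand
mfcc_feat = mfcc(emphasized, samplerate=fs_new, numcep=12, preemph=0)

# Each column of mfcc_feat is one coefficient contour over the frames
plt.plot(mfcc_feat)
plt.xlabel('Frame')
plt.ylabel('MFCC value')
plt.show()

# Mean of each contour -> 12 features
mfcc_means = np.mean(mfcc_feat, axis=0)
```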

Task 1.2 Extract the Intensity/Energy parameter

Firstly, calculate the short time energy (STE) of the downsampled ‘speech_sample’ by convolving the squared signal $x(t)^2$ with a 0.01 s Hamming window (note the extra length the convolution adds: clip half a window length from the beginning and the end). Then calculate the 5 distribution parameter features specified below from the utterance (the signal). A code sketch follows the list.

  1. Define a Hamming window using the scipy.signal.hamming() function. The window length is the number of samples in 0.01 s.
  2. Convolve the squared signal with the Hamming window using the scipy.signal.convolve() function. The convolution result is the short time energy (STE) contour.
  3. Clip half a window length of samples from the beginning and end of the STE contour.
  4. Visualize the resulting STE contour.
  5. Calculate the following 5 distribution parameter features from the STE contour:
    1. Mean, using the [numpy.mean()](https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.mean.html) function.
    2. Standard Deviation (SD), using the numpy.std() function.
    3. 10th percentile, using the numpy.percentile() function.
    4. 90th percentile, using the numpy.percentile() function.
    5. Kurtosis, using the scipy.stats.kurtosis() function.
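One way these steps might look, reusing resampled and fs_new from Task 0 (newer SciPy versions expose the window as scipy.signal.windows.hamming):

```python
import numpy as np
import scipy.signal
import scipy.stats

# Hamming window spanning 0.01 s of the downsampled signal
win_len = int(0.01 * fs_new)
window = scipy.signal.windows.hamming(win_len)  # scipy.signal.hamming in older SciPy

# STE contour: squared signal smoothed by convolution with the window
ste = scipy.signal.convolve(resampled ** 2, window)

# Full convolution adds roughly one window length; clip half a window at each end
half = win_len // 2
ste = ste[half:-half]

# The 5 distribution parameter features
ste_features = [np.mean(ste), np.std(ste),
                np.percentile(ste, 10), np.percentile(ste, 90),
                scipy.stats.kurtosis(ste)]
```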

Task 1.3. Extract the Pitch/F0 feature

  1. Extract the Pitch/F0 contour of the resampled speech signal using the get_f0() function in 0.01s frames. The function is provided in the f0_lib.py file.
  2. Visualize the F0 contour.
  3. Extract the 5 distribution parameter features of the extracted F0 contour (a sketch follows the list).
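A sketch of the F0 task. The exact interface of get_f0() is defined in the provided f0_lib.py, so the call below (signal and sampling rate in, one F0 value per 0.01 s frame out) is an assumption for illustration:

```python
import numpy as np
import scipy.stats
import matplotlib.pyplot as plt
from f0_lib import get_f0  # provided with the lab material

# ASSUMPTION: get_f0() takes the signal and sampling rate and returns the
# F0 contour with one value per 0.01 s frame; check f0_lib.py for the real API
f0 = np.asarray(get_f0(resampled, fs_new)).ravel()

plt.plot(f0)
plt.xlabel('Frame (0.01 s)')
plt.ylabel('F0 (Hz)')
plt.show()

# The same 5 distribution parameters as for the STE contour
f0_features = [np.mean(f0), np.std(f0),
               np.percentile(f0, 10), np.percentile(f0, 90),
               scipy.stats.kurtosis(f0)]
```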

Task 1.4. Extract the Rhythm/Durations parameter

  1. Perform a voiced/unvoiced segmentation of the speech signal. Tip: unvoiced frames are marked with 0 F0 values, so you can find the voiced frames (i.e. F0 > 0) using numpy.where().
  2. From the segmentation, calculate the means and SDs of both Voiced and Unvoiced segment lengths (i.e. voiced segment mean length, SD of voiced segment lengths, unvoiced segment mean length, SD of unvoiced segment lengths).
  3. Also calculate the voicing ratio, i.e. the ratio of voiced frames to the total number of frames (Tip: you can do this simply by counting frames). A sketch of the segmentation follows the list.
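A sketch built on the F0 contour from Task 1.3, assuming f0 is a 1-D array with one value per frame:

```python
import numpy as np

# Voiced frames carry a positive F0; unvoiced frames are marked with 0
voiced = f0 > 0

# Split the frame sequence into runs of constant voicing
boundaries = np.where(np.diff(voiced.astype(int)) != 0)[0] + 1
runs = np.split(voiced, boundaries)
voiced_lens = np.array([len(r) for r in runs if r[0]])
unvoiced_lens = np.array([len(r) for r in runs if not r[0]])

# Means and SDs of voiced and unvoiced segment lengths (in frames)
duration_features = [np.mean(voiced_lens), np.std(voiced_lens),
                     np.mean(unvoiced_lens), np.std(unvoiced_lens)]

# Voicing ratio: fraction of frames that are voiced
voicing_ratio = np.count_nonzero(voiced) / len(voiced)
```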

Task 2. Speech Emotion Classification


In this part, the sklearn.svm library is used to classify the speech signals. The ‘training_data_proso’ and ‘training_data_mfcc’ matrices contain the calculated prosodic features (9 features per row, each row representing a speech sample) and the MFCC-derived features (12 features per row) for the training set, respectively. The ‘training_class’ vector contains the class of each sample, 1 = happy and 2 = sad, corresponding to the rows of the training data matrices.

Task 2.1. Train the SVM classifiers

  1. Load the training data.
  2. Train an SVM on the prosody data using the ‘training_data_proso’ features and a 3rd order polynomial kernel.
  3. Train an SVM on the MFCC data using the ‘training_data_mfcc’ features and a 3rd order polynomial kernel (see the sketch after this list).
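A sketch using sklearn.svm.SVC, assuming the training matrices are stored in lab2_data.mat under the variable names quoted above:

```python
import scipy.io
from sklearn.svm import SVC

# ASSUMPTION: the feature matrices live in the same .mat file as the speech sample
data = scipy.io.loadmat('lab2_data.mat')
X_proso = data['training_data_proso']
X_mfcc = data['training_data_mfcc']
y_train = data['training_class'].ravel()  # 1 = happy, 2 = sad

# One SVM per feature set, both with a 3rd order polynomial kernel
svm_proso = SVC(kernel='poly', degree=3).fit(X_proso, y_train)
svm_mfcc = SVC(kernel='poly', degree=3).fit(X_mfcc, y_train)
```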

Task 2.2. Test the classifiers

Classify the ‘training_data_*’ and ‘testing_data_*’ data matrices, then calculate the average classification performance for both the training and testing data. The correct class labels corresponding to the rows of the training and testing data matrices are in the variables ‘training_class’ and ‘testing_class’, respectively.

  1. Load the testing data.
  2. Calculate the average classification accuracy for the training data (‘training_data_proso’ and ‘training_data_mfcc’) using the corresponding prosody and MFCC trained SVMs.
  3. Calculate the average classification accuracy for the testing data (‘testing_data_proso’ and ‘testing_data_mfcc’) using the corresponding prosody and MFCC trained SVMs.
  4. Print the four accuracies you have calculated (a sketch follows the list).
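A sketch continuing from the trained classifiers above; SVC.score() returns the mean accuracy directly:

```python
# ASSUMPTION: the testing matrices are stored alongside the training data
X_proso_test = data['testing_data_proso']
X_mfcc_test = data['testing_data_mfcc']
y_test = data['testing_class'].ravel()

# Average classification accuracy = fraction of correctly predicted labels
acc_proso_train = svm_proso.score(X_proso, y_train)
acc_proso_test = svm_proso.score(X_proso_test, y_test)
acc_mfcc_train = svm_mfcc.score(X_mfcc, y_train)
acc_mfcc_test = svm_mfcc.score(X_mfcc_test, y_test)

print(f'Prosody SVM  train: {acc_proso_train:.2f}  test: {acc_proso_test:.2f}')
print(f'MFCC SVM     train: {acc_mfcc_train:.2f}  test: {acc_mfcc_test:.2f}')
```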

Task 2.3. Plot confusion matrices for the training and testing data for both classifiers.

Print the following confusion matrices (Tip: use the sklearn.metrics.confusion_matrix() function; a sketch follows the list):

  • The confusion matrix of the prosody trained SVM using the ‘training_data_proso’.
  • The confusion matrix of the prosody trained SVM using the ‘testing_data_proso’.
  • The confusion matrix of the MFCC trained SVM using the ‘training_data_mfcc’.
  • The confusion matrix of the MFCC trained SVM using the ‘testing_data_mfcc’.
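A sketch covering all four matrices with the classifiers and data from Task 2:

```python
from sklearn.metrics import confusion_matrix

pairs = [('prosody SVM / training', svm_proso, X_proso, y_train),
         ('prosody SVM / testing', svm_proso, X_proso_test, y_test),
         ('MFCC SVM / training', svm_mfcc, X_mfcc, y_train),
         ('MFCC SVM / testing', svm_mfcc, X_mfcc_test, y_test)]

for name, clf, X, y in pairs:
    print(name)
    # Rows are the true classes (1 = happy, 2 = sad), columns the predictions
    print(confusion_matrix(y, clf.predict(X)))
```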
