Animate live audience reactions from a speech recording by, behind the scenes, predicting the speaker's emotion every half second.
The classifier assigns speech audio to 8 emotion classes - neutral, calm, happy, sad, angry, fearful, disgust, and surprised.
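A minimal sketch of the half-second windowing idea, assuming librosa for audio loading, mean-MFCC features, and a scikit-learn-style classifier; the actual feature set and model interface live in the notebooks and may differ:

```python
import librosa
import numpy as np

LABELS = ["neutral", "calm", "happy", "sad", "angry", "fearful", "disgust", "surprised"]

def predict_emotions(audio_path, model, window_s=0.5):
    """Predict one emotion label per half-second window of speech.

    `model` is assumed to expose a scikit-learn-style predict() that maps a
    feature vector to an integer class index (an assumption, not the repo's API).
    """
    y, sr = librosa.load(audio_path, sr=None)
    hop = int(window_s * sr)
    predictions = []
    for start in range(0, len(y) - hop + 1, hop):
        chunk = y[start:start + hop]
        # Mean MFCCs give a fixed-length feature vector for the chunk
        features = np.mean(librosa.feature.mfcc(y=chunk, sr=sr, n_mfcc=40), axis=1)
        predictions.append(LABELS[int(model.predict(features.reshape(1, -1))[0])])
    return predictions
```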
$ git clone https://github.com/HimanshuMittal01/speech_to_emotion_classifier.git
$ cd speech_to_emotion_classifier/
$ python3 -m venv venv
$ source venv/bin/activate
$ pip install -r requirements.txt
To train the models, you need to download the RAVDESS dataset (speech audio only).
Trained models will be available for download soon.
Go through the following notebooks in order.
- split_process_data.ipynb
- train.ipynb
- inference.ipynb
- Set paths and configuration in config.json (a hypothetical sketch of reading it follows below).

$ sh run.sh
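The exact keys expected in config.json are defined in the repository; the snippet below only illustrates the pattern, with hypothetical key names:

```python
import json

# Hypothetical key names for illustration only; check config.json in the
# repository for the actual keys the notebooks and run.sh expect.
with open("config.json") as f:
    config = json.load(f)

dataset_dir = config["dataset_dir"]  # e.g. path to the extracted RAVDESS audio
model_dir = config["model_dir"]      # e.g. where trained models are saved
```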
Three models were trained and evaluated on the RAVDESS dataset (excluding the song audio).
The dataset has 8 emotion classes - ["neutral","calm","happy","sad","angry","fearful","disgust","surprised"]
The following accuracy is observed on 16 classes (the 8 emotion classes further split by gender) on the train/validation/test sets.
- XGBoost: ~95% / ~41% / ~25%
- LSTM (3 layers - 64-64-64): ~76% / ~46% / ~26%
- CNN (3 layers - 32-64-128): ~93% / ~50% / ~30%
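The 16-class split is possible because RAVDESS filenames encode both emotion and actor: the fields are hyphen-separated, the third field is the emotion code (01-08), and the last is the actor number, with odd-numbered actors male and even-numbered actors female. A small sketch of deriving the gendered label:

```python
# RAVDESS emotion codes (third hyphen-separated field of the filename)
EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def gendered_label(filename):
    """Map a RAVDESS filename such as '03-01-06-01-02-01-12.wav' to 'female_fearful'."""
    parts = filename.rsplit(".", 1)[0].split("-")
    emotion = EMOTIONS[parts[2]]
    gender = "male" if int(parts[6]) % 2 == 1 else "female"  # odd actor IDs are male
    return f"{gender}_{emotion}"
```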
Predicting gender from voice works reasonably well given the size of the dataset. The observed accuracy is as follows:
- XGBoost: ~85%
- LSTM (3 layers - 64-64-64): ~90%
- CNN (3 layers - 32-64-128): ~87%
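As a rough reference for the numbers above, here is a self-contained sketch of an XGBoost baseline on mean-MFCC features; the data path, feature set, split, and hyperparameters are illustrative and not necessarily what train.ipynb uses:

```python
import glob
import os

import librosa
import numpy as np
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

def extract_features(path, n_mfcc=40):
    # Mean MFCCs over time give one fixed-length vector per clip;
    # the notebooks may use a richer feature set.
    y, sr = librosa.load(path, sr=None)
    return np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc), axis=1)

# "data/" is a placeholder path to the extracted RAVDESS speech audio
files = glob.glob("data/**/*.wav", recursive=True)
X = np.array([extract_features(f) for f in files])
# The third hyphen-separated field of a RAVDESS filename is the emotion code (01-08)
y = np.array([int(os.path.basename(f).split("-")[2]) - 1 for f in files])

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
clf = XGBClassifier(n_estimators=200, max_depth=6)  # illustrative hyperparameters
clf.fit(X_train, y_train)
print("validation accuracy:", clf.score(X_val, y_val))
```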
- Improving accuracy of the speech models
- By using extra training data, possibly including the song set and other datasets
- Cross-validation
- Real-time, smooth animation of different faces
- Asynchronous / trigger calls to the Blender engine
- Optimized predictions
I made this project to improve public speaking (some people will frown or make faces even when you are right, so you just have to be confident ;)
More use cases, issues or pull requests are most welcome.