intoxication-detection convolutional-neural-networks filter-banks mfcc-features audio-classification

AlcoAudio Research

Detection of Alcohol induced intoxication through voice using Neural Networks

Dataset

Alcohol Language Corpus is a curation of audio samples from 162 speakers. Audio samples are first recorded when speaker is sober. Then the speakers are given a chosen amount of alcohol to reach a particular intoxication state, and audio samples are recorded again.

Audio samples are split into 8 seconds each. Below is the plot of a raw signal

These raw audio signals are converted into Mel filters using librosa. Below is how it looks:

Architectures

Below are the architectures tried. All the files are under networks folder.

Networks	Log Loss	UAR(Unweighted Average Recall)
Convolutional Neural Networks(convnet)	0.89	66.28
LSTM(lstm)	1.59	58.12
Conv LSTMs(crnn)	1.17	62.27
One class Neural Networks(ocnn)	1.81	55
Conv Auto Encoders(cae)	0.92	65.53

Setup

Download and run the requirements.txt to install all the dependencies.
```
pip install -r requirements.txt
```
Create a config file of your own
Install OpenSmile and set environment variable OPENSMILE_CONFIG_DIR to point to the config directory of OpenSmile installation.

Usage

Data generation

Run data_processor.py to generate data required for training the model. It reads the raw audio samples, splits into n seconds and generates Mel filters, also called as Filter Banks (fbank paramater in config file. Other available audio features are mfcc & gaf)

python3 data_processor.py --config_file <config_filepath>

Training the network

Using main.py one can train all the architectures mentioned in the above section.

python3 main.py --config_file <config_filepath> --network convnet

Inference

One can use our model for inference. The best model is being saved under best_model folder

python3 main.py --config_file --test_net True <config_filepath> --network convnet --datapath <data filepath>

Remember to generate mel filters from raw audio data and use the generated .npy file for datapath parameter

Future work: TODO

Improve on Data Representations

Work on frequency variance in voice
Recurrence plots
Extract features using Praat and Opensmile
Normalise audio sample based on average amplitude

Try new architectures

Conditional Variational AutoEncoder
Convolutional One class Neural Network

Known Issues

As training progresses, test and valid log losses increase. The confidence with which the network miss predicts increase. The below graph depicts this behaviour
Mel filters or MFCC are not the best representation for this use case as these representations fail to capture variance in the amplitudes rather just try to mimic human voice.

Acknowledgements

Our team would like to thank Professor Emmanuel O. Agu for guiding the team throughout. I would like to thank team members Pratik, Arjun and Mitesh for all their contributions.

About

Detect alcohol induced intoxication level from a voice sample