Gautam-J / Sound-Classification-for-deaf-people


Sound Classification for the deaf

Jahnavi Darbhamulla


UI

Dataset

The ESC-50 Dataset for Environmental Sound Classification is a labeled collection of 2000 environmental audio recordings suitable for benchmarking methods of environmental sound classification.

It contains 50 semantic classes with 40 examples each and 5 major categories:

  • Animals
  • Natural soundscapes & water sounds
  • Human, non-speech sounds
  • Interior/domestic sounds
  • Exterior/urban noises

This dataset can be downloaded as a .zip file: ESC-50 dataset

Methodology

Feature Extraction - MFCC

To perform audio classification, we first preprocess the data to extract the relevant features of the audio signal using MFCCs, and then pass those features through a deep neural network for classification. Mel-Frequency Cepstral Coefficients (MFCCs) are short-term spectral features of a signal that concisely describe the overall shape of its spectral envelope. A few MFCCs extracted from the ESC-50 dataset:

Airplane:

Dog:
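To make the feature-extraction step concrete, here is a minimal NumPy-only sketch of the MFCC pipeline (power spectrum → mel filterbank → log → DCT). This is a conceptual illustration, not librosa's implementation; in practice a single call to `librosa.feature.mfcc` does all of this with windowing, framing, and better numerics. The 440 Hz tone stands in for a real ESC-50 recording.

```python
import numpy as np

def mfcc_sketch(signal, sr=22050, n_fft=1024, n_mels=26, n_mfcc=13):
    """Minimal MFCC pipeline for a single frame of audio."""
    # Power spectrum of one frame (a real pipeline windows and frames the signal)
    spectrum = np.abs(np.fft.rfft(signal[:n_fft], n_fft)) ** 2

    # Triangular mel filterbank between 0 Hz and the Nyquist frequency sr/2
    mel_max = 2595.0 * np.log10(1.0 + (sr / 2) / 700.0)
    mel_pts = np.linspace(0.0, mel_max, n_mels + 2)
    hz_pts = 700.0 * (10.0 ** (mel_pts / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, center, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, center):
            fbank[m - 1, k] = (k - lo) / max(center - lo, 1)
        for k in range(center, hi):
            fbank[m - 1, k] = (hi - k) / max(hi - center, 1)

    # Log of the mel-weighted energies (small constant avoids log(0))
    log_mel = np.log(fbank @ spectrum + 1e-10)

    # DCT-II decorrelates the log-mel energies; keep the first n_mfcc coefficients
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_mfcc), 2 * n + 1) / (2 * n_mels))
    return dct @ log_mel

# One second of a 440 Hz tone as a stand-in for a real recording
sr = 22050
t = np.arange(sr) / sr
coeffs = mfcc_sketch(np.sin(2 * np.pi * 440 * t), sr=sr)
print(coeffs.shape)  # (13,)
```

The first few coefficients capture the coarse shape of the spectral envelope, which is why they summarize environmental sounds so compactly.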

Convolutional Neural Networks

CNNs, or convolutional neural networks, are a type of deep learning algorithm that performs very well on image data. To use them for audio classification, we extract image-like features and reshape them so they can be fed into a CNN. We use the librosa package for the feature extraction.
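A minimal sketch of such a CNN in Keras is shown below. The input shape (40 MFCCs × 173 frames, one channel) and the layer sizes are illustrative assumptions, not the repository's exact architecture; only the number of output classes (50) comes from ESC-50.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical feature shape: 40 MFCCs x 173 frames, treated as a 1-channel image.
# In practice the features would come from librosa.feature.mfcc on each clip.
n_mfcc, n_frames, n_classes = 40, 173, 50

model = keras.Sequential([
    layers.Input(shape=(n_mfcc, n_frames, 1)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(n_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])

# A single random "MFCC image" to confirm the shapes line up
dummy = np.random.rand(1, n_mfcc, n_frames, 1).astype("float32")
pred = model.predict(dummy, verbose=0)
print(pred.shape)  # (1, 50)
```

Treating the MFCC matrix as a single-channel image is what lets standard 2-D convolutions learn local time-frequency patterns in the audio.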

Output

Recurrent Neural Networks

Recurrent neural networks are a type of deep learning algorithm that can remember sequences. Audio data tends to follow temporal patterns that RNNs can exploit for classification. In contrast to the CNN model, we use a stateful LSTM, which allows us to simplify the overall network structure: all we need is an LSTM layer followed by a Dense layer.
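The LSTM-plus-Dense structure described above can be sketched in Keras as follows. The batch size, frame count, and feature count are assumptions for illustration; a stateful LSTM requires a fixed batch shape so that its internal state can carry over between successive batches.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical shapes: 16 clips per batch, 173 frames of 40 MFCCs each.
batch_size, n_frames, n_mfcc, n_classes = 16, 173, 40, 50

model = keras.Sequential([
    # stateful=True keeps the LSTM's hidden state across batches,
    # which is why the full batch shape must be fixed up front.
    layers.Input(batch_shape=(batch_size, n_frames, n_mfcc)),
    layers.LSTM(128, stateful=True),
    layers.Dense(n_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])

# A zero-filled batch to confirm the output shape
dummy = np.zeros((batch_size, n_frames, n_mfcc), dtype="float32")
pred = model.predict(dummy, verbose=0)
print(pred.shape)  # (16, 50)
```

With statefulness handling the sequence memory, no stacked recurrent layers or extra dense layers are needed, which matches the simplified structure described above.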

Output

Contributors

License


Made with ☕ and ❤️

About

License: MIT License

