karolpiczak / paper-2017-DCASE

The details that matter: Frequency resolution of spectrograms in acoustic scene classification - paper replication data

Download paper in PDF format · MIT licensed

Paper replication data for:

The details that matter: Frequency resolution of spectrograms in acoustic scene classification

Overview

This work describes a submission to the acoustic scene classification task of the DCASE 2017 challenge. The study is based on the following premises:

  • Convolutional neural network models employed in audio classification tasks tend to use spectrogram representations limited to 40-60 mel frequency bands. Higher values are rarely chosen, despite the additional granularity they introduce:

    Comparison of spectrograms with different frequency resolutions

    Therefore, the main goal of this paper is to check how spectrograms with different frequency resolutions affect accuracy in this task (see the short extraction sketch after this list).

  • Most acoustic scenes can be described as a combination of a recurring background (ambient noise) mixed with distinct foreground elements (sound events). Still, Mafra et al. (2016) have shown that even a single averaged frame can have good predictive capacity in acoustic scene classification tasks, so it is likely that a good model does not have to be overly complicated. Could both these assumptions be introduced a priori into the architecture of the employed network?
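
As a quick illustration of the first point, the snippet below extracts mel spectrograms of one recording at several resolutions. It is a minimal sketch based on librosa; the file name, sample rate, and STFT parameters are illustrative assumptions rather than the repository's exact preprocessing.

    import librosa
    import numpy as np

    # Illustrative parameters, not the exact settings used in the paper
    y, sr = librosa.load('recording.wav', sr=44100)

    for n_mels in [40, 60, 100, 200]:
        # More mel bands -> finer frequency granularity of the input features
        S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                           hop_length=1024, n_mels=n_mels)
        print(n_mels, librosa.power_to_db(S).shape)

    # The STFT variant keeps the full linear frequency resolution
    stft = np.abs(librosa.stft(y, n_fft=2048, hop_length=1024))
    print('stft', stft.shape)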

Model structure

To this end, the proposed system has a simple design that comes in two flavors: ambience-only, or extended with sound event detectors that signal whether a template match has occurred anywhere in the whole recording.

The ambience processing model is evaluated with different frequency resolutions (amb* and STFT) and compared with models extended with 15 detector modules trained from scratch (detectors) or with a single detector pre-trained on hand-annotated fragments of cafe/restaurant recordings with sounds involving cups, plates, etc. (dishes).
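
For intuition, a hypothetical Keras-style sketch of this two-branch idea is given below; the layer sizes, kernel shapes, and merging strategy are illustrative assumptions, not the exact architecture used in the paper (see /Code/*.py for the actual models).

    from keras.layers import (Input, Conv2D, Dense, GlobalAveragePooling2D,
                              GlobalMaxPooling2D, concatenate)
    from keras.models import Model

    # Spectrogram input: (mel bands, time frames, channels), channels-last layout
    spec = Input(shape=(200, 500, 1))

    # Ambience branch: summarizes the recurring background of the whole recording
    amb = Conv2D(32, (200, 50), activation='relu')(spec)
    amb = GlobalAveragePooling2D()(amb)

    # Detector branch: fires if a template matches anywhere in the recording
    det = Conv2D(15, (200, 10), activation='sigmoid')(spec)
    det = GlobalMaxPooling2D()(det)

    # 15 acoustic scene classes in the DCASE 2017 task
    output = Dense(15, activation='softmax')(concatenate([amb, det]))

    model = Model(inputs=spec, outputs=output)
    model.compile(optimizer='adam', loss='categorical_crossentropy')

Dropping the detector branch leaves the ambience-only flavor.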

Results
| System    | Fold 1     | Fold 2     | Fold 3     | Fold 4     | Folds 1–4 | Final (test) |
|-----------|------------|------------|------------|------------|-----------|--------------|
| amb40     | 79.4 (0.5) | 77.7 (0.8) | 76.7 (1.0) | 81.4 (1.0) | 78.8      |              |
| amb60     | 81.3 (0.6) | 76.3 (1.0) | 75.8 (0.9) | 81.5 (1.0) | 78.7      | 62.0         |
| amb100    | 81.1 (0.6) | 77.5 (0.9) | 80.6 (0.7) | 83.4 (1.3) | 80.7      | 67.7         |
| amb200    | 80.9 (0.8) | 80.2 (0.8) | 83.0 (0.9) | 85.6 (1.3) | 82.4      | 70.6         |
| STFT      | 81.1 (0.9) | 83.6 (0.8) | 81.4 (0.9) | 83.4 (1.3) | 82.4      |              |
| detectors | 78.7 (0.9) | 78.1 (1.1) | 78.6 (1.3) | 80.8 (1.4) | 79.1      |              |
| dishes    | 80.3 (0.9) | 81.4 (0.7) | 82.6 (0.6) | 86.6 (1.0) | 82.7      | 69.6         |
Mean (standard deviation) of validation accuracies across 50 final epochs of training on the development set and official evaluation results for submitted models. Values in percentages.

Results of the proposed systems

Results obtained in the experiments indicate that increasing the number of mel frequency bands improves the accuracy of the ambience model. The detectors variant unfortunately shows signs of significant overfitting, combined with longer training times. This effect is mitigated in the dishes model, where the detector array is limited to fine-tuning a single pre-trained module.

Confusion matrix of the thresholded amb200 model.

Visualization of the filters in the first layer shows that the ambience network essentially learns to discriminate between frequency patterns. This offers a plausible explanation of why a higher frequency resolution of the input data could be beneficial for classification.

Filters learned by the first layer

Further examination would be needed to extrapolate this claim to other datasets and architectures, but preliminary results show that spectrogram resolution could be an important factor influencing the behavior of models in audio classification tasks. Another interesting extension would be to validate the concept of individual detector pre-training with more abundant annotation data and see if incorporating a priori domain knowledge about the problem could enhance the capabilities of the model.

Repository content

  • LaTeX source code for the paper.

  • Actual submission package as delivered for DCASE 2017.

    • /Code/*.py

      Source code for experiments.

      Run run_amb.py to train the ambience model:

      $ python run_amb.py -h
      
      usage: run_amb.py [-h] [-D DEVICE] [-b BANDS] [--all]
      
      optional arguments:
        -h, --help            show this help message and exit
        -D DEVICE, --device DEVICE
                              Theano device used for computations
        -b BANDS, --bands BANDS
                              Number of mel bands or `stft`
        --all                 Train on all folds (for final model testing)
      

      This will generate run_{BANDS}_{FOLD}.* files in the results/ directory:

      • run_{BANDS}_{FOLD}.h5 - saved weights (highest holdout score)
      • run_{BANDS}_{FOLD}.last.h5 - saved weights (last epoch)
      • run_{BANDS}_{FOLD}.npz - training history (see the loading example below)
      • run_{BANDS}_{FOLD}.txt - generated predictions
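
      For example, the saved training history can be inspected with NumPy. The file name below is only an example, and the names of the stored arrays depend on the training script:

      import numpy as np

      # Open a training history archive and list the arrays it contains
      history = np.load('results/run_200_1.npz')
      print(history.files)
      for key in history.files:
          print(key, history[key].shape)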

      Settings used for training models in the paper:

      BANDS in [40, 60, 100, 200, 'stft']

      The --all switch is used for training the final model (all training folds included).
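
      One possible way to sweep all these settings in sequence (an illustrative helper, not part of the repository):

      import subprocess

      # Train the ambience model for every band setting used in the paper;
      # run_amb.py is assumed to iterate over the cross-validation folds itself.
      for bands in ['40', '60', '100', '200', 'stft']:
          subprocess.run(['python', 'run_amb.py', '--bands', bands], check=True)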

      Issuing:

      $ python run_detectors.py

      will train a hybrid model (ambience module + 15 binary event detectors). run_dishes.py is a streamlined version with only 1 binary detector pre-trained on typical cafe/restaurant sounds (kitchenware, cutlery, crockery etc.). Detector pre-training is done with train_clues.py based on clues.txt annotations.
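
      A typical sequence for the dishes variant would thus be to pre-train the detector first and then train the streamlined model (invocation without additional arguments is an assumption here):

      $ python train_clues.py
      $ python run_dishes.py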

    • Helper/backend code for generating submissions.

      Adjust the /Code/arf/core.py file to reflect proper paths to the datasets:

      DATA_PATH = '/volatile/dcase17_1/'        # DCASE 2017 development dataset
      TEST_PATH = '/volatile/dcase17_1_eval/'   # DCASE 2017 evaluation dataset
    • Raw figures and the code used for visualization are available as a Jupyter notebook (Figures.ipynb).

    • Training outputs. The eval.py script creates cross-validation accuracy metrics, a list of misclassifications, and a confusion matrix for a given system:

      $ python eval.py run_200
      

      The _th_0.5 suffixes denote models with prediction thresholding, so:

      $ python eval.py run_200_th_0.5
      

      will generate the corresponding confusion matrix in Code/results/run_200_th_0.5.pdf.

    • A very unpolished modification of CrowdCurio's audio-annotator JavaScript interface used for creating the dishes (clues.txt) annotation list.

Citing

If you find this paper useful in some way, you can cite it with the following BibTeX entry:

@inproceedings{piczak2017dcase,
    title={The details that matter: Frequency resolution of spectrograms in acoustic scene classification},
    author={Piczak, Karol J.},
    booktitle={Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop},
    year={2017},
    location={Munich, Germany}
}

K. J. Piczak. The details that matter: Frequency resolution of spectrograms in acoustic scene classification. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop, Munich, Germany, 2017.

Author version of this paper: The details that matter: Frequency resolution of spectrograms in acoustic scene classification.

Poster

DCASE 2017 poster
