bakhtos / PANNs


PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition (forked by ML TUNI)

This repository is a fork of https://github.com/qiuqiangkong/audioset_tagging_cnn, the code release for the paper "PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition" [1].

Dataset and metadata

We do not provide scripts to download the dataset; we assume that the files have already been prepared by the user.

Preparing the dataset and metadata

The code here assumes that all files are already the audio segments to be studied, not full recordings from which segments are to be extracted; that the segments are 10 seconds long (although the code should support any length of audio common to all used clips); and that they are named according to the pattern Y*.wav, i.e. they begin with 'Y', followed by arbitrary symbols (the video's YouTube ID), and have the extension '.wav'. Data should have a sampling rate of 32000 Hz, although, again, any sample rate common to all files is supported.
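
For illustration, a quick sanity check of a prepared split could look like the following (a minimal sketch assuming the soundfile package is installed; the directory name and expected sample rate are placeholders):

import os
import soundfile as sf

AUDIOS_DIR = 'train'    # placeholder: directory of one split
EXPECTED_SR = 32000     # placeholder: whatever rate is common to all clips

for name in os.listdir(AUDIOS_DIR):
    # Clips are expected to be named Y<YouTubeID>.wav
    if not (name.startswith('Y') and name.endswith('.wav')):
        print(f'Unexpected file name: {name}')
        continue
    info = sf.info(os.path.join(AUDIOS_DIR, name))
    if info.samplerate != EXPECTED_SR:
        print(f'{name}: sample rate {info.samplerate} != {EXPECTED_SR}')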

The files should be stored in two separate directories, train/ and eval/, for the training and evaluation splits, respectively.

Metadata files storing class labels for the audio clips should be provided as tab-separated files, as in Google's AudioSet: Reformatted dataset.

NOTE: filename fields should not contain the prefix "Y" or the extension ".wav"; these are added by the scripts.

These metadata files should only mention audio files that were actually downloaded, and only the classes that were selected for the model.
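
Purely as an illustration of the layout (the authoritative format is the one used in the AudioSet: Reformatted dataset and shown in the dataset folder of this repo), a strong-label row consists of tab-separated fields along the lines of:

<YouTubeID>    <onset in seconds>    <offset in seconds>    <event label>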

You can check the dataset folder to verify the format of all files.

Create the following environment variables to keep track of all files and locations:

AUDIOS_DIR_TRAIN # Location of the training split of files
AUDIOS_DIR_EVAL # Location of the evaluation split of files
DATASET_PATH_TRAIN # Path to the tsv file containing strong labels for the train split
DATASET_PATH_EVAL # Path to the tsv file containing strong labels for the eval split
LOGS_DIR # Location to store logs

Create target arrays

Use the panns.data.target module from the command line to save the target arrays for the train and eval splits. The weak target array has the shape (files, classes); the strong target array has the shape (files, frames, classes). The weak target can be computed once for a dataset, while the strong target depends on the sample rate, the hop length (in samples) and the clip length (in ms).
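
For illustration, the strong target's frame dimension relates to these parameters roughly as follows (a back-of-the-envelope sketch, not code taken from the module; the exact count may differ by one depending on padding/centering):

clip_length = 10000                                    # ms
sample_rate = 32000                                    # Hz
hop_length = 320                                       # samples
samples_per_clip = clip_length * sample_rate // 1000   # 320000 samples
frames = samples_per_clip // hop_length                # 1000 frames per clip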

Use the following environment variables:

TARGET_WEAK_PATH_TRAIN # Path to save the weak target array for train files
TARGET_WEAK_PATH_EVAL # Path to save the weak target array for eval files
TARGET_STRONG_PATH_TRAIN # Path to save the strong target array for train files
TARGET_STRONG_PATH_EVAL # Path to save the strong target array for eval files

And then call the module as follows:

# Weak target
# Train split
python -m panns.data.target --dataset_path=$DATASET_PATH_TRAIN\
                            --target_type=weak\
                            --target_path=$TARGET_WEAK_PATH_TRAIN
# Eval split
python -m panns.data.target --dataset_path=$DATASET_PATH_EVAL\
                            --target_type=weak\
                            --target_path=$TARGET_WEAK_PATH_EVAL
                              
# Strong target
# Train split
python -m panns.data.target --dataset_path=$DATASET_PATH_TRAIN\
                            --target_type=strong\
                            --target_path=$TARGET_STRONG_PATH_TRAIN\
                            --sample_rate=32000\
                            --hop_length=320\
                            --clip_length=10000
# Eval split
python -m panns.data.target --dataset_path=$DATASET_PATH_EVAL\
                            --target_type=strong\
                            --target_path=$TARGET_STRONG_PATH_EVAL\
                            --sample_rate=32000\
                            --hop_length=320\
                            --clip_length=10000

Pack waveforms into hdf5 files

For training and evaluation, the actual audio files need to be packed into an HDF5 file using the panns.data.hdf5 module, which relies on the h5py package.

The audio arrays will be made to match the length clip_length * sample_rate / 1000 (in samples), either by truncating or by zero-padding.
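
A minimal sketch of this pad-or-truncate behaviour (illustrative only; the actual implementation lives in panns.data.hdf5):

import numpy as np

def pad_or_truncate(waveform, clip_length=10000, sample_rate=32000):
    # Force a 1-D waveform to exactly clip_length * sample_rate / 1000 samples
    target_len = clip_length * sample_rate // 1000
    if len(waveform) >= target_len:
        return waveform[:target_len]                           # truncate
    padding = np.zeros(target_len - len(waveform), dtype=waveform.dtype)
    return np.concatenate([waveform, padding])                 # zero-pad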

Create the following environment variables:

HDF5_FILES_PATH_TRAIN # Location for the hdf5 compression of train split
HDF5_FILES_PATH_EVAL # Location for the hdf5 compression of eval split

And make the calls:

# Train split
python -m panns.data.hdf5  --audios_dir=$AUDIOS_DIR_TRAIN\
                           --dataset_path=$DATASET_PATH_TRAIN\
                           --hdf5_path=$HDF5_FILES_PATH_TRAIN\
                           --logs_dir=$LOGS_DIR\
                           --sample_rate=32000\
                           --clip_length=10000
                           
# Eval split
python -m panns.data.hdf5  --audios_dir=$AUDIOS_DIR_EVAL\
                           --dataset_path=$DATASET_PATH_EVAL\
                           --hdf5_path=$HDF5_FILES_PATH_EVAL\
                           --logs_dir=$LOGS_DIR\
                           --sample_rate=32000\
                           --clip_length=10000

NOTE: Optionally, the --mini_data parameter can be specified, which packs only the given number of files.
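
To sanity-check a packed file, something along these lines can be used (a sketch only; the 'waveform' key is an assumption carried over from the original implementation, so inspect the printed keys to see the actual layout produced by panns.data.hdf5):

import h5py

with h5py.File('eval_waveforms.h5', 'r') as hf:   # placeholder for $HDF5_FILES_PATH_EVAL
    print(list(hf.keys()))                        # inspect the actual dataset names
    waveforms = hf['waveform']                    # assumed key with shape (files, samples)
    print(waveforms.shape, waveforms.dtype)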

Models

The models are defined in panns/models/models.py; some auxiliary classes are defined in panns/models/blocks.py.

Models have been significantly reworked compared to the original implementation.

In particular, the custom-written torchlibrosa has been replaced with native torchaudio. This applies to spectrogram extraction for models that require it, as well as to spectrogram augmentation.
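
For reference, the torchaudio feature extraction can be sketched roughly as follows (an illustration of the torchaudio API using the parameter values from the training example below, not a copy of the repository code):

import torch
import torchaudio

# Log-mel spectrogram extraction, roughly equivalent to the former torchlibrosa pipeline
mel_extractor = torchaudio.transforms.MelSpectrogram(sample_rate=32000,
                                                     n_fft=1024,
                                                     win_length=1024,
                                                     hop_length=320,
                                                     f_min=50,
                                                     f_max=14000,
                                                     n_mels=64,
                                                     window_fn=torch.hann_window)
to_db = torchaudio.transforms.AmplitudeToDB(stype='power')

waveform = torch.randn(1, 320000)        # (batch, samples), i.e. 10 s at 32 kHz
logmel = to_db(mel_extractor(waveform))  # (batch, n_mels, frames)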

Furthermore, many variants of the Cnn14 model in the original implementation differed only by a handful of hardcoded parameters. These variants have now been refactored into the main Cnn14 model, with the possibility to customize these parameters to resemble the original models. Cnn6 and Cnn10 received the same treatment.

In general, the following parameters are used to customize the models; some models support only a subset of them, so check the source. A usage sketch is given after the list:

  • classes_num: Number of classes used
  • wavegram: Whether to use the Wavegram features (see [1])
  • spectrogram: Whether to use the log-mel-Spectrogram features
    • sample_rate: Sample rate of the original audio
    • win_length: Window length to use for MelSpectrogram extraction
    • hop_length: Hop length for the window of MelSpectrogram extraction
    • n_mels: Number of mel filterbanks to use for MelSpectrogram
    • f_min: Minimum frequency
    • f_max: Maximum frequency
    • spec_aug: Whether to use SpectrogramAugmentation during training
  • mixup_time: Whether to perform mixup in time-domain (before feature extraction)
  • mixup_freq: Whether to perform mixup in frequency domain (after feature extraction)
  • dropout: Whether to perform dropout during training
  • decision_level: Whether to output strong labels (framewise_output) and which function to use to generate them
  • Additional:
    • window_fn, center, pad_mode: Passed to MelSpectrogram
    • top_db: Passed to AmplitudeToDB
    • num_features: Passed to BatchNorm2D (must be correct with respect to the input)
    • embedding_size: Number of nodes connecting the last two layers of the model
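
Putting these together, instantiating a model could look like the following (a sketch based on the parameter names above and on the conversion table below; check panns/models/models.py for the exact constructor signature):

from panns.models.models import Cnn14

# Cnn14 with log-mel spectrogram input, SpecAugment, frequency-domain mixup,
# dropout and max-pooled framewise (strong) output; values mirror the training example
model = Cnn14(classes_num=110,
              spectrogram=True,
              wavegram=False,
              spec_aug=True,
              mixup_time=False,
              mixup_freq=True,
              dropout=True,
              decision_level='max',
              sample_rate=32000,
              win_length=1024,
              hop_length=320,
              n_mels=64,
              f_min=50,
              f_max=14000)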

Below is the 'conversion table' between models in the original and current implementations (note that in all cases parameters for Spectrogram are renamed):

| Original | Current |
| --- | --- |
| Cnn6 | Cnn6(spec_aug=True, mixup_time=False, mixup_freq=True, dropout=True, wavegram=False, spectrogram=True, decision_level=None) |
| Cnn10 | Cnn10(spec_aug=True, mixup_time=False, mixup_freq=True, dropout=True, wavegram=False, spectrogram=True, decision_level=None) |
| Cnn14 | Cnn14(spec_aug=True, mixup_time=False, mixup_freq=True, dropout=True, wavegram=False, spectrogram=True, decision_level=None) |
| Cnn14_8k | Cnn14(spec_aug=True, mixup_time=False, mixup_freq=True, dropout=True, wavegram=False, spectrogram=True, decision_level=None, sample_rate=8000, win_length=256, hop_length=80, n_mels=64, f_min=50, f_max=4000) |
| Cnn14_16k | Cnn14(spec_aug=True, mixup_time=False, mixup_freq=True, dropout=True, wavegram=False, spectrogram=True, decision_level=None, sample_rate=16000, win_length=512, hop_length=160, n_mels=64, f_min=50, f_max=8000) |
| Cnn14_no_specaug | Cnn14(spec_aug=False, mixup_time=False, mixup_freq=True, dropout=True, wavegram=False, spectrogram=True, decision_level=None) |
| Cnn14_no_dropout | Cnn14(spec_aug=True, mixup_time=False, mixup_freq=True, dropout=False, wavegram=False, spectrogram=True, decision_level=None) |
| Cnn14_mixup_time_domain | Cnn14(spec_aug=True, mixup_time=True, mixup_freq=False, dropout=True, wavegram=False, spectrogram=True, decision_level=None) |
| Cnn14_emb32 | Cnn14(spec_aug=True, mixup_time=False, mixup_freq=True, dropout=True, wavegram=False, spectrogram=True, decision_level=None, embedding_size=32) |
| Cnn14_emb128 | Cnn14(spec_aug=True, mixup_time=False, mixup_freq=True, dropout=True, wavegram=False, spectrogram=True, decision_level=None, embedding_size=128) |
| Cnn14_emb512 | Cnn14(spec_aug=True, mixup_time=False, mixup_freq=True, dropout=True, wavegram=False, spectrogram=True, decision_level=None, embedding_size=512) |
| Cnn14_mel32 | Cnn14(spec_aug=True, mixup_time=False, mixup_freq=True, dropout=True, wavegram=False, spectrogram=True, decision_level=None, num_features=32) |
| Cnn14_mel128 | Cnn14(spec_aug=True, mixup_time=False, mixup_freq=True, dropout=True, wavegram=False, spectrogram=True, decision_level=None, num_features=128) |
| Cnn14_DecisionLevelMax | Cnn14(spec_aug=True, mixup_time=False, mixup_freq=True, dropout=True, wavegram=False, spectrogram=True, decision_level='max') |
| Cnn14_DecisionLevelAvg | Cnn14(spec_aug=True, mixup_time=False, mixup_freq=True, dropout=True, wavegram=False, spectrogram=True, decision_level='avg') |
| Cnn14_DecisionLevelAtt | Cnn14(spec_aug=True, mixup_time=False, mixup_freq=True, dropout=True, wavegram=False, spectrogram=True, decision_level='att') |
| Wavegram_Cnn14 | Cnn14(spec_aug=True, mixup_time=False, mixup_freq=True, dropout=True, wavegram=True, spectrogram=False, decision_level=None) |
| Wavegram_Logmel_Cnn14 | Cnn14(spec_aug=True, mixup_time=False, mixup_freq=True, dropout=True, wavegram=True, spectrogram=True, decision_level=None) |
| Wavegram_Logmel128_Cnn14 | Not implemented |
| ResNet22 | ResNet22 (same) |
| ResNet38 | ResNet38 (same) |
| ResNet54 | ResNet54 (same) |
| Res1dNet31 | Res1dNet31(classes_num) (other parameters are not used) |
| Res1dNet51 | Res1dNet51(classes_num) (other parameters are not used) |
| MobileNetV1 | MobileNetV1 (same) |
| MobileNetV2 | MobileNetV2 (same) |
| LeeNet11 | LeeNet11(classes_num) (other parameters are not used) |
| LeeNet24 | LeeNet24(classes_num, dropout=True) (other parameters are not used; dropout can be set to False) |
| DaiNet19 | DaiNet19(classes_num) (other parameters are not used) |

Training

Training is performed using panns/train.py and is controlled by the following parameters:

  • Model configuration:
    • model_type: One of the classes in panns/models/models.py (the model used)
    • Parameters for the model (see Models):
      • classes_num, sample_rate, win_length, hop_length, f_min, f_max, n_mels, decision_level, pad_mode, top_db, num_features, embedding_size: Passed directly to model constructor, see Models
      • spec_aug/no_spec_aug, mixup_time/no_mixup_time, mixup_freq/no_mixup_freq, dropout/no_dropout, wavegram/no_wavegram, spectrogram/no_spectrogram, center/no_center: Set the corresponding model parameter to True/False respectively
  • Files locations:
    • hdf5_files_path_train, hdf5_files_path_eval: Location of the HDF5 files for the train/eval split (see "Pack waveforms into hdf5 files" above)
    • target_path_train, target_path_eval: Location of the target arrays for train/eval split, either weak or strong
    • logs_dir: Folder to store logs (default: logs in the current working directory)
    • checkpoints_dir: Folder to store checkpoints every 100000 iterations (default: checkpoints in the current working directory)
    • statistics_dir: Folder to save evaluation results every 2000 iterations (default: statistics in the current working directory)
    • resume_checkpoint_path: Location to load a trained model checkpoint from
  • Control training loop:
    • label_type: Whether to use the weak or strong label output of the model to compute the BCE loss; must match the given dataset target, and strong can only be used with compatible models (i.e. ones that have the decision_level parameter)
    • batch_size: Number of files used in one training iteration
    • learning_rate: Learning rate for the optimizer
    • iter_max: Number of training iterations to perform (an 'iteration' is the processing of one batch; we do not use epochs in this pipeline)
    • num_workers: Number of workers to pass to the DataLoader
    • cuda: Whether to use GPU (flag)

Example of initiating training:

python -m panns.train --hdf5_files_path_train=$HDF5_FILES_PATH_TRAIN\
                      --hdf5_files_path_eval=$HDF5_FILES_PATH_EVAL\
                      --target_path_train=$TARGET_WEAK_PATH_TRAIN\
                      --target_path_eval=$TARGET_WEAK_PATH_EVAL\
                      --label_type='weak'\
                      --model_type='Cnn14'\
                      --classes_num=110\
                      --decision_level='max'\
                      --spectrogram\
                      --win_length=1024\
                      --hop_length=320\
                      --sample_rate=32000\
                      --f_min=50\
                      --f_max=14000\
                      --n_mels=64\
                      --spec_aug\
                      --no_mixup_time\
                      --no_mixup_freq\
                      --dropout\
                      --no_wavegram\
                      --batch_size=32\
                      --learning_rate=1e-3\
                      --iter_max=600000\
                      --num_workers=8\
                      --cuda

Inference

Using panns.inference, it is possible to produce a file with strong labels for a given dataset, inferred by the trained model, in the same format as the dataset metadata files. This requires a checkpoint of the trained model as well as an HDF5 compression of the evaluation set.

The script accepts the following parameters:

  • File parameters
    • hdf5_files_path: location of the hdf5 compression of the dataset
    • dataset_path: location of the dataset tsv file
    • output_path: filename to save the detected events
    • checkpoint_path: location of the checkpoint of the model to use
    • logs_dir: directory to write logs into (optional)
  • Model parameters: same as during Training phase
  • batch_size, cuda, num_workers: Control passing data to the model similarly to training phase
  • Inference parameters:
    • threshold: Threshold applied to the output of the model; only values greater than it are considered as 'event detected'
    • minimum_event_gap: In seconds, the minimum gap between two consecutive events for them to be considered separate; events closer than this are merged together by filling the small gap
    • minimum_event_length: In seconds, events shorter than this are ignored (gaps are closed first, then short events are removed; see the sketch below)
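
The post-processing described above can be sketched as follows (an illustration of the threshold/merge/filter logic only, not the repository's exact implementation; frame_probs and frame_sec are hypothetical inputs for a single class):

import numpy as np

def frames_to_events(frame_probs, frame_sec, threshold=0.5,
                     minimum_event_gap=0.1, minimum_event_length=0.1):
    # Turn per-frame probabilities of one class into (onset, offset) events in seconds
    active = np.asarray(frame_probs) > threshold       # 'event detected' frames
    events = []
    onset = None
    for i, is_active in enumerate(active):
        if is_active and onset is None:
            onset = i * frame_sec
        elif not is_active and onset is not None:
            events.append([onset, i * frame_sec])
            onset = None
    if onset is not None:
        events.append([onset, len(active) * frame_sec])
    # First close gaps shorter than minimum_event_gap...
    merged = []
    for event in events:
        if merged and event[0] - merged[-1][1] < minimum_event_gap:
            merged[-1][1] = event[1]
        else:
            merged.append(event)
    # ...then drop events shorter than minimum_event_length
    return [event for event in merged if event[1] - event[0] >= minimum_event_length]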

Example of inference:

python -m panns.inference --hdf5_files_path=$HDF5_FILES_PATH_EVAL\
                          --dataset_path=$DATASET_PATH_EVAL\
                          --checkpoint_path=\  # Path to checkpoint in $CHECKPOINTS_DIR
                          --output_path='inference.tsv'\
                          --logs_dir=$LOGS_DIR\
                          --batch_size=32\
                          --num_workers=8\
                          --cuda\
                          --threshold=0.5\
                          --minimum_event_gap=0.1\
                          --minimum_event_length=0.1\
                          --model_type='Cnn14'\
                          --classes_num=110\
                          --decision_level='max'\
                          --spectrogram\
                          --win_length=1024\
                          --hop_length=320\
                          --sample_rate=32000\
                          --f_min=50\
                          --f_max=14000\
                          --n_mels=64\
                          --spec_aug\
                          --no_mixup_time\
                          --no_mixup_freq\
                          --dropout\
                          --no_wavegram

Cite

[1] Qiuqiang Kong, Yin Cao, Turab Iqbal, Yuxuan Wang, Wenwu Wang, and Mark D. Plumbley. "PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition." IEEE/ACM Transactions on Audio, Speech, and Language Processing 28 (2020): 2880-2894.

Reference

[2] Gemmeke, J.F., Ellis, D.P., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M. and Ritter, M., 2017, March. Audio set: An ontology and human-labeled dataset for audio events. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 776-780, 2017

[3] Hershey, S., Chaudhuri, S., Ellis, D.P., Gemmeke, J.F., Jansen, A., Moore, R.C., Plakal, M., Platt, D., Saurous, R.A., Seybold, B. and Slaney, M., 2017, March. CNN architectures for large-scale audio classification. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 131-135, 2017

About

License: MIT License

