wonyangcho / scl_icbhi2017

This is the official PyTorch implementation of our work Learning Audio Features with Metadata and Contrastive Learning

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Learning Audio Features with Metadata and Contrastive Learning


This is the official pytorch implementation of our work Learning Audio Features with Metadata and Contrastive Learning

NB : this repository will be updated soon to add M-SCL

Dependencies:

Launch : pip install -r requirements.txt

Dataset:

Put the data files in the data folder

Pretrained models:

Put the pretrained pth files in the panns folder

Metadata:

Optional : metadata.py creates a metadata file in the data folder.
I got feedback of some people having trouble creating the metadata file, I have added it directly in the data folder.

Training:

main.py launches the training

Arguments:

--method METHOD: the training method METHOD, sl for cross entropy, scl for supervised contrastive, and hybrid for a combination of both.
--backbone BACKBONE: the backbone to be used, cnn6, cnn10 or cnn14.
--scratch: to train from scratch (when this argument is not encountered, the models are intiailized using AudioSet weights).
--lr LR: the learning rate for training.
--bs BS: the batch size.
Check args.py for more arguments.

Reproducibility:

To reproduce our results, keep all arguments by default value except the learning rate (and the generic arguments such as the number of workers to be used and the desired device to train on).
To train from scratch, launch the following script : python main.py --scratch --lr 1e-3 --backbone BACKBONE --method METHOD using the desired training method and backbone.
To train using AudioSet intialization, launch the following script : python main.py --lr 1e-4 --backbone BACKBONE --method METHOD using the desired training method and backbone.

Quantitative Results

We optimized hyperparameters for CNN6, and we simply report CNN10 from scratch and pretrained CNN14 scores on ICBHI without any hyperparameter tuning.
We report results over 10 identical runs:

Backbone Method Sp Se Sc # of Params Ext. Dataset
Cnn6 CE 76.72(3.97) 31.12(3.72) 53.92(0.71) 4.3 -
SCL 76.17(3.84) 27.97(3.92) 52.08(1.06)
Hybrid 75.35(5.47) 33.84(5.67) 54.74(0.5)
Cnn10 CE 73.45(6.7) 36.8(6.61) 55.13(1.56) 4.8 -
SCL 74.78(5.89) 30.38(5.5) 52.59(1.35)
Hybrid 78.12(6.14) 33.07(5.43) 55.6(1.13)
Cnn6 CE 70.09(3.08) 40.39(2.97) 55.24(0.43) 4.3 AudioSet
SCL 75.95(2.31) 39.15(1.89) 57.55(0.81)
Hybrid 70.47(2.07) 43.29(1.83) 56.89(0.55)
Cnn14 CE 75.63(4.13) 38.13(5.07) 57.32(0.54) 75.4 AudioSet
SCL 80.67(4.2) 32.93(4.37) 56.92(0.9)
Hybrid 80.73(3.86) 34.96(3.59) 57.85(0.48)

To cite this work:

@Article{scl_icbhi2017,
      title={Learning Audio Features with Metadata and Contrastive Learning}, 
      author={Ilyass Moummad and Nicolas Farrugia},
      year={2022},
      eprint={2210.16192},
      archivePrefix={arXiv},
      primaryClass={cs.SD}}

About

This is the official PyTorch implementation of our work Learning Audio Features with Metadata and Contrastive Learning


Languages

Language:Python 100.0%