Masked Spectrogram Modeling using Masked Autoencoders for Learning General-purpose Audio Representation

This is a demo implementation of Masked Spectrogram Modeling using Masked Autoencoders (MSM-MAE), a self-supervised learning method for general-purpose audio representation, includes:

Training code that can pre-train models with arbitrary audio files.
Evaluation code to test models under two benchmarks, HEAR 2021 and EVAR.
Visualization examples and a notebook.
Pre-trained weights.

If you find MSM-MAE useful in your research, please use the following BibTeX entry for citation.

@article{niizumi2022masked,
    title   = {Masked Spectrogram Modeling using Masked Autoencoders for Learning General-purpose Audio Representation}, 
    author  = {Daisuke Niizumi and Daiki Takeuchi and Yasunori Ohishi and Noboru Harada and Kunio Kashino},
    journal = {arXiv:2204.12260},
    year    = {2022},
    eprint  = {2204.12260},
    url     = {https://arxiv.org/abs/2204.12260},
    archivePrefix = {arXiv},
    primaryClass = {eess.AS}
}

1. Getting Started

The repository relies on the codes from facebookresearch/mae, and we patch our changes on these files.

Download external source files from facebookresearch/mae, and apply a patches.

curl -o util/lars.py https://raw.githubusercontent.com/facebookresearch/mae/6a2ba402291005b003a70e99f7c87d1a2c376b0d/util/lars.py
curl -o util/lr_decay.py https://raw.githubusercontent.com/facebookresearch/mae/6a2ba402291005b003a70e99f7c87d1a2c376b0d/util/lr_decay.py
curl -o util/lr_sched.py https://raw.githubusercontent.com/facebookresearch/mae/6a2ba402291005b003a70e99f7c87d1a2c376b0d/util/lr_sched.py
curl -o util/misc.py https://raw.githubusercontent.com/facebookresearch/mae/6a2ba402291005b003a70e99f7c87d1a2c376b0d/util/misc.py
curl -o util/pos_embed.py https://raw.githubusercontent.com/facebookresearch/mae/6a2ba402291005b003a70e99f7c87d1a2c376b0d/util/pos_embed.py
curl -o main_pretrain.py https://raw.githubusercontent.com/facebookresearch/mae/6a2ba402291005b003a70e99f7c87d1a2c376b0d/main_pretrain.py
curl -o msm_mae/engine_pretrain.py https://raw.githubusercontent.com/facebookresearch/mae/6a2ba402291005b003a70e99f7c87d1a2c376b0d/engine_pretrain.py
curl -o msm_mae/models_mae.py https://raw.githubusercontent.com/facebookresearch/mae/6a2ba402291005b003a70e99f7c87d1a2c376b0d/models_mae.py
cd util
patch -p1 < patch_util.diff
cd ../msm_mae
patch -p1 < patch_msm_mae.diff
cd ..
patch -p1 < patch_main.diff

If you need a clean environment, the following anaconda example creates a new environment named ar:

conda create -n ar python==3.8
conda activate ar

Install external modules listed on requirements.txt.

2. Evaluating MSM-MAE

2-1. Evaluating on HEAR 2021 NeurIPS Challenge Tasks

We evaluate our models on our paper using hear-eval-kit from on HEAR 2021 NeurIPS Challenge as follows.

NOTE: The folder hear has all the files we need to evaluate models on hear-eval-kit.

Install hear-eval-kit as follows:
```
pip install heareval
```
Download your copy of downstream dataset for 16 kHz. See HEAR NeurIPS 2021 Datasets@zenodo for the detail.

To evaluate our models, we need a local package which loads and runs our models. The followings shows an example for a model named 80x208p16x16_mymodel.

cd hear/hear_msm
cp sample.py 80x208p16x16_mymodel.py
** Edit the 80x208p16x16_mymodel.py here, so that the value of `model_path` points to your model with an absolute path. **
cd ..
pip install -e .

We are on the folder hear. Run the hear-eval-kit with your pre-trained model.

python -m heareval.embeddings.runner hear_msm.80x208p16x16_mymodel --tasks-dir /your_copy/hear/tasks
CUBLAS_WORKSPACE_CONFIG=:4096:8 python -m heareval.predictions.runner embeddings/hear_msm.80x208p16x16_mymodel/*

2-2. Evaluating on EVAR

EVAR is an evaluation package for audio representations used by our research papers such as BYOL-A and Composing General Audio Representation by Fusing Multilayer Features of a Pre-trained Model. It supports the following downstream tasks: ESC-50, US8K, FSD50K, SPCV1/V2, VoxForge, VoxCeleb1, CREMA-D, GTZAN, NSynth instrument family, Pitch Audio Dataset (Surge synthesizer).

The following steps setup MSM-MAE on the EVAR.

Clone EVAR repository and prepare basic items.

git clone https://github.com/nttcslab/eval-audio-repr.git evar
cd evar
curl https://raw.githubusercontent.com/daisukelab/general-learning/master/MLP/torch_mlp_clf2.py -o evar/utils/torch_mlp_clf2.py
curl https://raw.githubusercontent.com/daisukelab/sound-clf-pytorch/master/for_evar/sampler.py -o evar/sampler.py
curl https://raw.githubusercontent.com/daisukelab/sound-clf-pytorch/master/for_evar/cnn14_decoupled.py -o evar/cnn14_decoupled.py
cd ..

Install MSM-MAE files on the cloned evar folder.

ln -s ../../to_evar/ar_msm_mae.py evar/evar
ln -s ../../to_evar/msm_mae.yaml evar/config
cd evar
sed 's/import evar\.ar_byola/import evar\.ar_byola, evar\.ar_msm_mae/' -i lineareval.py
cd ..

Setup downstream task datasets according to Preparing-datasets.md. The following is an example for setting up CREMA-D dataset.

cd evar
python evar/utils/download_cremad.py downloads/cremad
python prepare_wav.py downloads/cremad work/16k/cremad 16000
cd ..

Once setup is complete, you can evaluate your models as follows.

For evaluating a model with an absolute path /your/path/to/model.pth.

cd evar
python lineareval.py config/msm_mae.yaml cremad weight_file=/your/path/to/model.pth

If you want to save GPU memory, set a fewer batch size as follows. This example sets it as 16.

cd evar
python lineareval.py config/msm_mae.yaml cremad batch_size=16,weight_file=/your/path/to/model.pth

3. Training From Scratch

First, you need to prepare pre-training data samples; then, you can pre-train your model. The following is an example to pre-train a model on FSD50K dataset.

Preprocess .wav files into log-mel spectrogram .npy files. The following converts from a source folder /your/local/fsd50k/FSD50K.dev_audio to a new folder lms_fsd50kdev.
```
python wav_to_lms.py /your/local/fsd50k/FSD50K.dev_audio lms_fsd50kdev
```
Create a CSV file which is used as a list of pre-training samples containing a single column file_name. The following example creates trainingfiles.csv.
```
echo file_name > trainingfiles.csv
(cd lms_fsd50kdev && find . -name "*.npy") >> trainingfiles.csv
```
Make some copies of samples for visualization. The following example makes 3 symbolic links. All the samples under the folder vis_samples will be used to visualize reconstruction results of a checkpoint during training.
```
mkdir -p lms_fsd50kdev/vis_samples
ln -s ../1240.npy lms_fsd50kdev/vis_samples/1240.npy
ln -s ../106237.npy lms_fsd50kdev/vis_samples/106237.npy
ln -s ../119949.npy lms_fsd50kdev/vis_samples/119949.npy
```

Once your data is ready, start pre-training as follows.

python main_pretrain.py 80x208p16x16_your_test --input_size 80x208 --data_path lms_fsd50kdev --dataset trainingfiles --model=mae_vit_base_patch16x16

3-1. About run folders

The training loop creates a folder called the run folder to store artifacts of your training run, such as a log, visualizations, and checkpoints. In the example above, the 80x208p16x16_your_test is the run folder name which has a format below:

<input_size>p<patch_size>_<your_free_string>
 - input_size: Two integers concatenated with `x`.
 - patch_size: Two integers concatenated with `x`.
 - your_free_string: Optional.

NOTE: When a model is evaluated by EVAR or hear-eval-kit, the input and patch size parameters written in the run name are used.

3-2. Evalidation during training

The training loop takes two actions for evaluating checkpoints during training: visualization and calling an external command quick_eval.sh (EVAR by default).

While training, main_pretrain.py will call a script named quick_eval.sh for every 20 epochs by default. You can edit quick_eval.sh for your purpose.
The training loop also visualizes reconstruction results using checkpoints. You can find them under the sub-folder of the run folder, such as 80x208p16x16_tttt/19. The folder contains raw results in .npy and image files in .png as the following examples.

4. Visualization Examples

👉 An example notebook is available. 👈

You can try visualizations of reconstruction results as well as attention maps.

Download AudioSetWav16k_examples.zip that contains example wave samples from the releases and unzip the zip file in the misc folder beforehand.

A reconstruction examples:

An attention map examples:

5. Pre-trainede Weights and Network Structure Details

Three pre-trained weights are published on the releases, and the followings are their EVAR task results:

weight	esc50	us8k	spcv2	vc1	voxforge	cremad	gtzan	nsynth	surge
80x512p16x16_paper	89.1%	84.3%	93.1%	56.1%	97.5%	70.1%	79.3%	76.9%	37.5%
80x512p16x16_0425	88.9%	84.0%	92.4%	56.5%	96.8%	67.6%	81.4%	76.1%	37.4%
80x208p16x16_0422	87.1%	82.2%	91.1%	49.7%	95.6%	66.9%	75.9%	76.5%	37.8%

80x512p16x16_paper: This weight is the pre-trained weight used to calculate the numbers in the paper, trained on our old unpolished code. Cannot be used for visualizations due to a minor difference in the decoder structure.
80x512p16x16_0425: This weight is a pre-trained weight that is trained with the current code to check reproducibility.
80x208p16x16_0422: This weight is a pre-trained weight that is trained with the current code to check reproducibility.

FYI: You can check a notebook that summarizes EVAR task results.

5-1. Network Structure Details

Our ViT network implementation has three minor differences from the original MAE implementation, due to historical reason. (We have already been testing based on pengzhiliang/MAE-pytorch in late 2021, before the MAE implementation became available.)

Our implementation follows BEiT, where the k bias is always zero in the Attention calculation. Though it is explained to be equivalent in terms of calculation results, we keep this implementation in our network for better reproducibility.
[CLS] token is removable by the option --no_cls_token.
The decoder uses 1d positional embedding. You can switch to the 2d version by the option --dec_pos_2d.

6. License

See LICENSE for details.

Acknowledgements

Our code is based on the MAE PyTorch/GPU re-implementation of the paper Masked Autoencoders Are Scalable Vision Learners.
We also refer to another implementation of MAE, pengzhiliang/MAE-pytorch.
We use nnAudio (KinWaiCheuk/nnAudio) for converting raw audio into log-mel spectrogram.

We appreciate these publicly available implementations and all the modules our experiments heavily depend on!

References

About

Masked Spectrogram Modeling using Masked Autoencoders for Learning General-purpose Audio Representations

Other

Languages

Language:Jupyter Notebook 99.2%Language:Python 0.8%Language:Shell 0.0%