Audio Diffusion Research

Introduction

This is the final project for NYCU_DLP course.
After reading the paper"Palette:ASimple,General Framework for Image-to-Image Translation," we found it interesting to investigate whether the ablation results are consistent with those in the audio domain.
We will compare the L1 and L2 loss and also evaluatethe significance of self-attention and normalization in the audio diffusion architecture.

Using this github to implement our project：https://github.com/teticio/audio-diffusion

The audio diffusion backbone utilizes a U-Net architecture.
The class_embedding of the UNet2DModel is utilized to incorporate both the time embedding and class_embedding, treating them as part of the conditional input of the model.

ESC-50 consists of 5-second-long recordings organized into 50 semantical classes, with 40 examples per class.
This dataset consists of the following five main categories: Animals, Natural, Human non-speech sounds, Interior sounds, and Exterior noises.

The .wav data is preprocessed into Mel Spectrograms. This can be done by audio_diffusion/scripts/audio_to_images.py.
The Mel spectrograms will be normalized according to the experiment setting.

Use audio_diffusion/scripts/test_cond_model.py to generate sample. This program generate 40 .wav files for 50 classes in ESC-50.
There are several things need to modified before you run this code：
1. Modify the parameter of parser
2. Replace the path in line 171 by your pretrained unet weight.(ex. /unet/diffusion_pytorch_model.bin)
3. Modify the model_index.json file in your saved model path：

    "mel": [
        "audio_diffusion", # change null to "audio_diffusion"
        "Mel"
    ],

We utilize our model to generate 50 classes of audio, producing 40 audio samples for each class as evaluation data.
The FAD score is a metric employed to measure the similarity between evaluation data and original data. A lower FAD score indicates a closer match between the distributions of the generated and real audio.
The CA score, utilizing pretrained Contrastive Language-Audio Pretraining (CLAP), is used to assess whether our model can successfully generate the correct voice.

To compute FAD and CA, the path should contain 50 folder, which is named from 0 to 49 by its label. Each folder should contain 40 .wav generate from the same class.
Take a look at audio_evaluate/Predict/L2 as an example.

dir_1 = path of ground truth .wav.
dir_2 = path of generate .wav.

Download this github：https://github.com/LAION-AI/CLAP
Use audio_evaluate/result/CA/zero-shot-classification/CLASS.PY to compute CA.
Go to https://huggingface.co/lukewys/laion_clap/blob/main/630k-audioset-best.pt download the pretrained weight.
Put your generate file path to esc50_test_dir

cat.mp4

crickets.mp4

frog.mp4

Audio Diffusion

Language:Python 99.7%Language:Rich Text Format 0.3%