yangdongchao / Text-to-sound-Synthesis

The source code of our paper "Diffsound: discrete diffusion model for text-to-sound generation"

Home Page: http://dongchaoyang.top/text-to-sound-synthesis-demo/

How to use the codebook with the size of 512?

jojonki opened this issue · comments

Hi, thank you for sharing your great project!

I have a question about your released models.

On pan.baidu.com, you shared your trained codebook model with a size of 512.
But diffsound_audiocaps.pth (6.32GB) contains a codebook with a size of 256, not 512.

I confirmed this with the following code.

import torch

cb_model_path = '../download/baidu/2022-04-22T19-35-05_audioset_codebook512/checkpoints/last.ckpt'
cb_model = torch.load(cb_model_path)
print(cb_model['state_dict']['quantize.embedding.weight'].shape) # (512, 256), ok

ds_model_path = "../download/baidu/diffsound_audiocaps.pth"
ds_model = torch.load(ds_model_path, map_location="cpu")
print(ds_model['model']['content_codec.quantize.embedding.weight'].shape) # (256, 256), should be (512, 256)

This causes a dimension mismatch.
In your generate_samples_batch.py, the codebook is first loaded from model.params.content_codec_config.params.ckpt_path in the yaml, and then diffsound_audiocaps.pth is loaded on top of it. But the latter checkpoint contains a codebook of size 256, as mentioned above.

Do I need to drop the codebook weights from diffsound_audiocaps.pth? Or do you have an appropriate diffsound model?
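To make the first option concrete, this is roughly what I mean by dropping the codebook weights. It is only a sketch: the content_codec. key prefix is taken from the prints above, and the remaining weights would still need to be loaded with strict=False into a model built with the 512 codebook config.

import torch

ds_model_path = "../download/baidu/diffsound_audiocaps.pth"
ckpt = torch.load(ds_model_path, map_location="cpu")

# Keep every weight except the codec ones ('content_codec.' prefix), so the 512 codebook
# loaded via the yaml config would not be overwritten by the 256 one in this checkpoint.
filtered = {k: v for k, v in ckpt['model'].items() if not k.startswith('content_codec.')}
dropped = [k for k in ckpt['model'] if k.startswith('content_codec.')]
print(f"dropped {len(dropped)} codec keys, kept {len(filtered)} keys")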

Thank you,

diffsound_audiocaps.pth contains exactly the codebook from 2022-04-24T23-17-27_audioset_codebook256/checkpoints/last.ckpt:

import torch

cb256_model_path = '../download/gdrive/2022-04-24T23-17-27_audioset_codebook256/checkpoints/last.ckpt'
cb256_model = torch.load(cb256_model_path)
print(cb256_model_path)
print(cb256_model['state_dict']['quantize.embedding.weight'].shape) # (256, 256)
print(cb256_model['state_dict']['quantize.embedding.weight'][0][:10])

cb512_model_path = '../download/baidu/2022-04-22T19-35-05_audioset_codebook512/checkpoints/last.ckpt'
cb512_model = torch.load(cb512_model_path)
print(cb512_model_path)
print(cb512_model['state_dict']['quantize.embedding.weight'].shape) # (512, 256)
print(cb512_model['state_dict']['quantize.embedding.weight'][0][:10])

ds_model_path = "../download/baidu/diffsound_audiocaps.pth"
print(ds_model_path)
ds_model = torch.load(ds_model_path, map_location="cpu")
print(ds_model['model']['content_codec.quantize.embedding.weight'].shape) # (256, 256)
print(ds_model['model']['content_codec.quantize.embedding.weight'][0][:10])

# diffsound_audiocaps.pth contains the codebook from 2022-04-24T23-17-27_audioset_codebook256/checkpoints/last.ckpt.
print(torch.all(torch.eq(cb256_model['state_dict']['quantize.embedding.weight'].cpu(),
                         ds_model['model']['content_codec.quantize.embedding.weight'].cpu())))  # tensor(True)

Hi, good question.
Actually, if you want to use the codebook of size 512, you should use a Diffsound model trained with that codebook. I released the 512 codebook so that people can train the Diffsound model with a codebook size of 512 themselves (please check the yaml config file for the details). I will upload a Diffsound model trained with the 512 codebook to GitHub, so that you can use the 512 codebook for inference.
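For example, before training you can quickly check that the config matches the 512 codebook. This is only a sketch: the config filename and the n_embed field name are assumptions (please check the released yaml for the exact keys); the model.params.content_codec_config.params path is the one mentioned above.

import yaml

# Sketch: confirm the codec settings in the training yaml match the 512 codebook checkpoint.
with open('configs/caps_512.yaml') as f:  # hypothetical config filename
    cfg = yaml.safe_load(f)

codec_params = cfg['model']['params']['content_codec_config']['params']
print(codec_params.get('ckpt_path'))  # should point at .../audioset_codebook512/checkpoints/last.ckpt
print(codec_params.get('n_embed'))    # should be 512 to match the codebook size (field name assumed)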

@yangdongchao

Thank you for your answer!
The model trained with the 256 codebook already works in my environment, so I would like to also try 512 when you upload the model.

Anyway, I would like to close this issue. Thank you!