yangdongchao / Text-to-sound-Synthesis

The source code of our paper "Diffsound: discrete diffusion model for text-to-sound generation"

Home Page:http://dongchaoyang.top/text-to-sound-synthesis-demo/

How to use the codebook with the size of 512?

jojonki opened this issue

Hi, thank you for sharing your great project!

I have a question about your released models.

On pan.baidu.com, you shared a trained codebook model with a size of 512.
But diffsound_audiocaps.pth (6.32GB) contains its own codebook, and that one has a size of 256, not 512.

I confirmed this with the following code.

import torch

cb_model_path = '../download/baidu/2022-04-22T19-35-05_audioset_codebook512/checkpoints/last.ckpt'
cb_model = torch.load(cb_model_path)
print(cb_model['state_dict']['quantize.embedding.weight'].shape) # (512, 256), ok

ds_model_path = "../download/baidu/diffsound_audiocaps.pth"
ds_model = torch.load(ds_model_path, map_location="cpu")
print(ds_model['model']['content_codec.quantize.embedding.weight'].shape) # (256, 256), should be (512, 256)

This causes a dimension mismatch.
generate_samples_batch.py first loads the codebook from model.params.content_codec_config.params.ckpt_path in the yaml, then loads diffsound_audiocaps.pth. But the latter checkpoint contains a codebook of size 256, as mentioned above.

Do I need to drop the codebook weights from diffsound_audiocaps.pth? Or do you have an appropriate diffsound model?

Thank you,

Follow-up: the codebook inside diffsound_audiocaps.pth is exactly the one from 2022-04-24T23-17-27_audioset_codebook256/checkpoints/last.ckpt.

cb256_model_path= '../download/gdrive/2022-04-24T23-17-27_audioset_codebook256/checkpoints/last.ckpt'
cb256_model = torch.load(cb256_model_path)
print(cb256_model['state_dict']['quantize.embedding.weight'].shape) # (256, 256)

cb512_model_path = '../download/baidu/2022-04-22T19-35-05_audioset_codebook512/checkpoints/last.ckpt'
cb512_model = torch.load(cb512_model_path)
print(cb512_model['state_dict']['quantize.embedding.weight'].shape) # (512, 256)

ds_model_path = "../download/baidu/diffsound_audiocaps.pth"
ds_model = torch.load(ds_model_path, map_location="cpu")
print(ds_model['model']['content_codec.quantize.embedding.weight'].shape) # (256, 256)

# The codebook inside diffsound_audiocaps.pth is identical to the one in
# 2022-04-24T23-17-27_audioset_codebook256/checkpoints/last.ckpt.
print(torch.equal(cb256_model['state_dict']['quantize.embedding.weight'].cpu(), ds_model['model']['content_codec.quantize.embedding.weight'].cpu()))  # True

Hi, good question.
Actually, if you want to use a codebook of size 512, you should use a diffsound model trained with the 512 codebook. I shared the 512 codebook so that people can train the diffsound model with a codebook size of 512 (please check the yaml config file for the details). I will upload the diffsound model trained with the 512 codebook to GitHub, so you can use the 512 codebook size for inference.
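
For readers who hit the same mismatch, the pairing rule above can be checked before running inference. The following is only a minimal sketch, not part of the repository: the helper names codebook_size and check_codebook_match are hypothetical, and the key names are the ones printed in the snippets earlier in this thread. The stand-in objects only mimic the .shape attribute of a tensor so that the example runs without downloading any checkpoint.

```python
from types import SimpleNamespace

# Key name as seen in the checkpoints inspected earlier in this thread.
CODEBOOK_KEY = "quantize.embedding.weight"


def codebook_size(state_dict, prefix=""):
    """Return the number of codebook entries (rows of the embedding matrix)."""
    weight = state_dict[prefix + CODEBOOK_KEY]
    return tuple(weight.shape)[0]


def check_codebook_match(codebook_state, diffsound_state, prefix="content_codec."):
    """Hypothetical helper: raise ValueError if the two checkpoints disagree."""
    cb = codebook_size(codebook_state)
    ds = codebook_size(diffsound_state, prefix)
    if cb != ds:
        raise ValueError(
            f"codebook size mismatch: codebook checkpoint has {cb} entries, "
            f"diffsound checkpoint has {ds}"
        )
    return cb


# Stand-ins that only carry a .shape, in place of real tensors (assumption
# made for illustration; the shapes are the ones reported above).
fake_cb256 = {CODEBOOK_KEY: SimpleNamespace(shape=(256, 256))}
fake_cb512 = {CODEBOOK_KEY: SimpleNamespace(shape=(512, 256))}
fake_ds = {"content_codec." + CODEBOOK_KEY: SimpleNamespace(shape=(256, 256))}

print(check_codebook_match(fake_cb256, fake_ds))  # 256
```

With the real files, one would presumably pass cb_model['state_dict'] and ds_model['model'] from the snippets above. Pairing the 512 codebook with diffsound_audiocaps.pth would raise the ValueError, which matches the mismatch reported in this issue.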


Thank you for your answer!
The model trained with 256 already worked in my environment, so I would like to also try 512 when you upload the model.

Anyway, I would like to close this issue. Thank you!