yangdongchao / Text-to-sound-Synthesis

The source code of our paper "Diffsound: discrete diffusion model for text-to-sound generation"

Home Page: http://dongchaoyang.top/text-to-sound-synthesis-demo/


Issue when sampling with the newest pretrained model

yizhidamiaomiao opened this issue · comments

Dear authors,

I tried to use your pretrained model listed in README.md:
"2022/08/09 We upload trained diffsound model on audiocaps dataset, and the baseline AR model, and the codebook trained on audioset with the size of 512. (https://disk.pku.edu.cn/link/DA2EAC5BBBF43C9CAB37E0872E50A0E4)"

When I run the command "python evaluation/generate_samples_batch.py" to sample some audio, the code raises an error:
"RuntimeError: Error(s) in loading state_dict for VQModel:
size mismatch for quantize.embedding.weight: copying a param with shape torch.Size([512, 256]) from checkpoint, the shape in current model is torch.Size([256, 256])
"

I have already tried many revised versions of your 'caps_text.yaml' (changing several 256 values to 512), but none of them works. Could you please share a way for me to sample from your newest trained model? Thanks a lot.
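The error means the checkpoint's codebook has 512 entries of dimension 256, while the model built from the config expects only 256 entries. A minimal sketch of how one could diff parameter shapes between a checkpoint and a freshly built model to locate such config mismatches (the key name comes from the error above; the helper and the example shape dicts are hypothetical, and in practice the shapes would come from `torch.load(...)` and `model.state_dict()`):

```python
# Hypothetical helper: compare parameter shapes between a checkpoint
# state_dict and a freshly built model to spot config mismatches.
def shape_mismatches(ckpt_shapes, model_shapes):
    """Return {param_name: (checkpoint_shape, model_shape)} for mismatches."""
    return {
        name: (ckpt_shapes[name], model_shapes[name])
        for name in ckpt_shapes
        if name in model_shapes and ckpt_shapes[name] != model_shapes[name]
    }

# Shapes taken from the error message above (illustrative values only).
ckpt = {"quantize.embedding.weight": (512, 256)}   # codebook trained with size 512
model = {"quantize.embedding.weight": (256, 256)}  # model configured for size 256

print(shape_mismatches(ckpt, model))
# → {'quantize.embedding.weight': ((512, 256), (256, 256))}
```

The first dimension of `quantize.embedding.weight` is the codebook size, so the config's codebook-size field would need to match the checkpoint (512 here) for loading to succeed.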

Hi, if you use the codebook trained with size 512, you should also use the Diffsound model trained with size 512. However, the 512 Diffsound model has not been released yet, so for now you can only use the codebook with size 256. I will upload the Diffsound model trained with 512 as soon as possible.