yangdongchao / Text-to-sound-Synthesis

Source code for our paper "Diffsound: Discrete Diffusion Model for Text-to-Sound Generation"

Home Page: http://dongchaoyang.top/text-to-sound-synthesis-demo/

Missing required files for audiocaption evaluation

jzq2000 opened this issue · comments

The required files referenced in the AudiocaptionLoss config are missing:

path:
  vocabulary: 'data/pickles/words_list.p'
  encoder: 'pretrained_models/audioset_deit.pth'  # 'pretrained_models/deit.pth'
  word2vec: 'pretrained_models/word2vec/w2v_512.model'
  eval_model: 'pretrained_models/ACTm.pth'
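
A minimal sketch for checking which of these files are present before running evaluation (this assumes the config is plain YAML loaded with PyYAML, with a top-level path section as above; 'caps_config.yaml' is just a placeholder for the actual config file name):

# Check that every path in the AudiocaptionLoss config exists.
import os
import yaml

with open('caps_config.yaml') as f:   # placeholder name for the config file
    cfg = yaml.safe_load(f)

for key, p in cfg['path'].items():
    print(f"{key:12s} {'OK' if os.path.exists(p) else 'MISSING':8s} {p}")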

Hi, please refer to https://disk.pku.edu.cn/link/4908743A441B02235C8652742FE44949 . Also, you can refer to https://github.com/XinhaoMei/ACT

Thanks a lot~ BTW, could you please also provide '/apdcephfs/share_1316500/donchaoyang/code3/ACT/outputs/exp_4/model/best_model.pth', which is referenced in settings2.yaml?

Otherwise, errors occur when loading ACTm.pth:

RuntimeError: Error(s) in loading state_dict for AudioTransformer_80:
        size mismatch for pos_embedding: copying a param with shape torch.Size([1, 126, 768]) from checkpoint, the shape in current model is torch.Size([1, 216, 768]).
        size mismatch for bn0.weight: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([80]).
        size mismatch for bn0.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([80]).
        size mismatch for bn0.running_mean: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([80]).
        size mismatch for bn0.running_var: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([80]).
        size mismatch for patch_embed.proj.weight: copying a param with shape torch.Size([768, 256]) from checkpoint, the shape in current model is torch.Size([768, 320]).
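
For what it's worth, the mismatch pattern (64 vs. 80 channels in bn0, 256 vs. 320 in patch_embed.proj) looks like a checkpoint trained on 64 mel bins being loaded into the 80-mel-bin AudioTransformer_80, which would explain why the best_model.pth from settings2.yaml is needed. A small sketch to confirm what ACTm.pth actually stores (whether the weights are nested under a 'model' key is a guess, so it is only unwrapped if that key is present):

import torch

# Inspect the shapes stored in ACTm.pth for the keys from the error above.
ckpt = torch.load('pretrained_models/ACTm.pth', map_location='cpu')
state = ckpt['model'] if isinstance(ckpt, dict) and 'model' in ckpt else ckpt

# Expected by the current AudioTransformer_80 (taken from the traceback):
# pos_embedding (1, 216, 768), bn0.* (80,), patch_embed.proj.weight (768, 320)
for name in ('pos_embedding', 'bn0.weight', 'patch_embed.proj.weight'):
    if name in state:
        print(name, tuple(state[name].shape))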

Have you solved this issue? I ran into the same problem: best_model.pth is missing when computing the ACT loss metrics.