yangdongchao / Text-to-sound-Synthesis

The source code of our paper "Diffsound: discrete diffusion model for text-to-sound generation"

Home Page: http://dongchaoyang.top/text-to-sound-synthesis-demo/

How do we use BERT or CLIP features?

yizhidamiaomiao opened this issue

The Codebook readme.md says:
"For the text features, we provide two types of features: (1) use BERT, (2) use CLIP.
For BERT features, please run
python generete_text_fea/predict_one.py
For CLIP features, please run
python generete_text_fea/generate_fea_clip.py"

However, when I tried running 'python3 ./Diffsound/train_spec.py --name caps_train --config_file ./Diffsound/configs/caps_512.yaml --tensorboard --load_path None', I found that it only calls "./Diffsound/sound_synthesis/modeling/embeddings/clip_text_embedding.py" and loads the ViT-B-32 model. No part of it reads the files generated by BERT or CLIP.

Should we do any pre-processing on the text and save the features as files, to avoid running the ViT-B-32 model each time? And can we change the text encoder to another model, such as BERT or a different CLIP variant?

Thanks a lot.

Hi, in our released Diffsound code we have already inserted the CLIP model, so you only need to input the text and it will extract the features with the pre-trained CLIP model. If you want to use a BERT model to extract features instead, you can insert the BERT model into our Diffsound model. Our previous experiments showed that using the CLIP model works better than BERT.

As you mention, the Codebook readme.md points out that we can generate two types of features. That is because we did not insert the BERT or CLIP model into the baseline AR methods, so for those you have to pre-extract the text features. In the Diffsound model, however, the CLIP model is already inserted, so you do not need any pre-processing to extract text features.
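For reference, here is a minimal sketch of how a BERT-based text embedding module could be plugged in alongside clip_text_embedding.py. It assumes HuggingFace transformers with a frozen bert-base-uncased encoder; the class name, the 77-token length that mirrors CLIP, and the choice to return token-level features are only illustrative, not part of the released code:

```python
import torch
from transformers import BertModel, BertTokenizer

class BERTTextEmbedding(torch.nn.Module):
    """Hypothetical drop-in text embedder: wraps a frozen pre-trained BERT
    encoder so the conditioning branch can consume its token features
    instead of CLIP's."""

    def __init__(self, model_name='bert-base-uncased', max_length=77):
        super().__init__()
        self.tokenizer = BertTokenizer.from_pretrained(model_name)
        self.encoder = BertModel.from_pretrained(model_name)
        self.encoder.eval()
        for p in self.encoder.parameters():
            p.requires_grad = False  # keep the text encoder frozen, as with CLIP
        self.max_length = max_length

    @torch.no_grad()
    def forward(self, texts):
        # texts: list of caption strings
        batch = self.tokenizer(
            texts, padding='max_length', truncation=True,
            max_length=self.max_length, return_tensors='pt',
        )
        out = self.encoder(**batch)
        # (batch, max_length, 768) token-level features for the conditioner
        return out.last_hidden_state
```

You would still need to adapt the conditioning dimension in the Diffsound config (BERT-base outputs 768-d features, while ViT-B-32 text features are 512-d), so treat this only as a starting point.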