Make-An-Audio 3: Transforming Text into Audio via Flow-based Large Diffusion Transformers

PyTorch implementation based on Lumina-t2x and Lumina-Next.

We will open-source our implementation and pre-trained models in this repository soon.

Install dependencies

Note: You may want to adjust the CUDA version according to your driver version.

```bash
conda create -n Make_An_Audio_3 -y
conda activate Make_An_Audio_3
conda install python=3.11 pytorch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 pytorch-cuda=12.1 -c pytorch -c nvidia -y
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
```
Install [nvidia apex](https://github.com/nvidia/apex) (optional)
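
After installation, a quick sanity check (not part of the original instructions; the script below is illustrative) can confirm that PyTorch sees the GPU and that flash-attn imports cleanly:

```python
# sanity_check.py -- illustrative environment check, not part of the repository
import torch

print(f"PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"Device: {torch.cuda.get_device_name(0)}")

try:
    import flash_attn  # installed above with --no-build-isolation
    print(f"flash-attn {flash_attn.__version__} imported successfully")
except ImportError:
    print("flash-attn not importable; re-run the pip install step above")
```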

Quick Start

Pretrained Models

Simply download the desired weights from Hugging Face:

| Model | Task | Pretraining Data | Path |
| --- | --- | --- | --- |
| M (160M) | Text-to-Audio | AudioCaption | Here |
| L (520M) | Text-to-Audio | AudioCaption | [TBD] |
| XL (750M) | Text-to-Audio | AudioCaption | [TBD] |
| 3B | Text-to-Audio | AudioCaption | [TBD] |
| M (160M) | Text-to-Music | Music | Here |
| L (520M) | Text-to-Music | Music | [TBD] |
| XL (750M) | Text-to-Music | Music | [TBD] |
| 3B | Text-to-Music | Music | [TBD] |
| M (160M) | Video-to-Audio | VGGSound | Here |
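
For the checkpoints that are already released, one option is to fetch them programmatically with `huggingface_hub`. This is only a sketch: the repo id and local directory below are placeholders, not the actual links from the table.

```python
# download_weights.py -- illustrative sketch, not part of the repository
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="<org>/<model-repo>",         # placeholder: use the link from the "Path" column
    local_dir="useful_ckpts/maa3_t2a_m",  # hypothetical local directory
)
print(f"Weights downloaded to {local_dir}")
```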

Generate audio/music from text

```bash
python3 scripts/txt2audio_for_2cap_flow.py --prompt {TEXT} \
  --outdir output_dir -r checkpoints_last.ckpt \
  -b configs/txt2audio-cfm1-cfg-LargeDiT3.yaml --scale 3.0 \
  --vocoder-ckpt useful_ckpts/bigvnat
```
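
To generate several clips in one run, a thin wrapper can invoke the script once per prompt. The script path and flags are copied from the command above; the prompt list and per-sample output directories are invented for illustration:

```python
# batch_txt2audio.py -- illustrative helper, not part of the repository
import subprocess

PROMPTS = [  # hypothetical example prompts
    "a dog barking in the distance",
    "rain falling on a tin roof",
]

for i, prompt in enumerate(PROMPTS):
    subprocess.run(
        [
            "python3", "scripts/txt2audio_for_2cap_flow.py",
            "--prompt", prompt,
            "--outdir", f"output_dir/sample_{i}",
            "-r", "checkpoints_last.ckpt",
            "-b", "configs/txt2audio-cfm1-cfg-LargeDiT3.yaml",
            "--scale", "3.0",
            "--vocoder-ckpt", "useful_ckpts/bigvnat",
        ],
        check=True,  # abort the batch on the first failed generation
    )
```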

Add `--test-dataset structure` for text-to-audio generation.

Generate audio/music from audiocaps or musiccaps test dataset

- Remember to alter `config["test_dataset"]` in the config file (see the sketch after this command).
```bash
python3 scripts/txt2audio_for_2cap_flow.py \
  --outdir output_dir -r checkpoints_last.ckpt \
  -b configs/txt2audio-cfm1-cfg-LargeDiT3.yaml --scale 3.0 \
  --vocoder-ckpt useful_ckpts/bigvnat --test-dataset testset
```
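
One way to alter `config["test_dataset"]` without hand-editing YAML, assuming the configs are OmegaConf-compatible as in similar latent-diffusion codebases (the key name comes from the note above; the dataset value and output filename are placeholders):

```python
# switch_test_dataset.py -- illustrative sketch; assumes OmegaConf-style configs
from omegaconf import OmegaConf

cfg = OmegaConf.load("configs/txt2audio-cfm1-cfg-LargeDiT3.yaml")
cfg.test_dataset = "audiocaps"  # placeholder: set to the test set you want
OmegaConf.save(cfg, "configs/txt2audio-cfm1-cfg-LargeDiT3-audiocaps.yaml")
```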

Generate audio from video

```bash
python3 scripts/video2audio_flow.py \
  --outdir output_dir -r checkpoints_last.ckpt \
  -b configs/video2audio-cfm1-cfg-LargeDiT1-moe.yaml --scale 3.0 \
  --vocoder-ckpt useful_ckpts/bigvnat --test-dataset vggsound
```
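
To audition results against the source video, you can remux the generated audio onto the original clip. Below is a minimal sketch shelling out to ffmpeg; the file names are placeholders and ffmpeg must be on your PATH:

```python
# mux_audio.py -- illustrative post-processing step, not part of the repository
import subprocess

video_in = "input.mp4"                 # placeholder: source video
audio_in = "output_dir/generated.wav"  # placeholder: audio produced above
muxed_out = "preview.mp4"

subprocess.run(
    [
        "ffmpeg", "-y",
        "-i", video_in,
        "-i", audio_in,
        "-map", "0:v:0",   # keep the video stream from the first input
        "-map", "1:a:0",   # take the audio stream from the generated file
        "-c:v", "copy",    # copy video without re-encoding
        "-shortest",       # trim to the shorter of the two streams
        muxed_out,
    ],
    check=True,
)
```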

Train flow-matching DiT

After training the VAE, replace `model.params.first_stage_config.params.ckpt_path` in the config file with the path to your trained VAE checkpoint (a sketch of this edit is shown below), then run the training command that follows.
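
A minimal sketch of that config edit, assuming the YAML configs load with OmegaConf; the key path is quoted from the paragraph above, and the checkpoint path is a placeholder:

```python
# set_vae_ckpt.py -- illustrative sketch; assumes OmegaConf-style configs
from omegaconf import OmegaConf

cfg = OmegaConf.load("configs/txt2audio-cfm1-cfg-LargeDiT3.yaml")
# Key path from the instructions above; the value is a placeholder path.
cfg.model.params.first_stage_config.params.ckpt_path = "logs/vae/checkpoints/last.ckpt"
OmegaConf.save(cfg, "configs/txt2audio-cfm1-cfg-LargeDiT3.yaml")
```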

```bash
python main.py --base configs/txt2audio-cfm1-cfg-LargeDiT3.yaml -t --gpus 0,1,2,3,4,5,6,7
```

Others

For data preparation, variational autoencoder training, and evaluation, please refer to Make-An-Audio.

Acknowledgements

This implementation uses parts of the code from the following GitHub repos: Make-An-Audio, AudioLCM, CLAP, as described in our code.

Citations

If you find this code useful in your research, please consider citing:

Disclaimer

Any organization or individual is prohibited from using any technology mentioned in this paper to generate anyone's speech without their consent, including but not limited to government leaders, political figures, and celebrities. Failure to comply with this provision may constitute a violation of copyright law.
