FoleyCrafter

Sound effects are the unsung heroes of cinema and gaming, enhancing realism, impact, and emotional depth for an immersive audiovisual experience. FoleyCrafter is a video-to-audio generation framework that produces realistic sound effects that are semantically relevant to, and synchronized with, the input video.

Your star is our fuel! We're revving up the engines with it!

FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds

Yiming Zhang, Yicheng Gu, Yanhong Zeng†, Zhening Xing, Yuancheng Wang, Zhizheng Wu, Kai Chen†

(†Corresponding Author)

What's New

A more powerful one 😝 .
Release training code.
2024/07/01 Release the model and code of FoleyCrafter.

Setup

Prepare Environment

Use the following commands to install the dependencies:

# install conda environment
conda env create -f requirements/environment.yaml
conda activate foleycrafter

# install Git LFS to download the checkpoints
conda install git-lfs
git lfs install

Download Checkpoints

The checkpoints are downloaded automatically when you run inference.py.

You can also download them manually with the following commands.

Download FoleyCrafter first, since git clone needs an empty target directory and the next step creates checkpoints/auffusion.

git clone https://huggingface.co/ymzhang319/FoleyCrafter checkpoints/

Download the text-to-audio base model. We use Auffusion.

git clone https://huggingface.co/auffusion/auffusion-full-no-adapter checkpoints/auffusion
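As an alternative to Git LFS, the same two repositories could be fetched from Python. The sketch below assumes the `huggingface_hub` package is installed (this README does not say whether the conda environment includes it) and uses its `snapshot_download` function. The `REPOS` mapping and the `download_checkpoints` helper are hypothetical names introduced here for illustration; the repository IDs and target folders are copied from the git clone commands above.

```python
# Repository IDs and target folders mirror the git clone commands above.
REPOS = {
    "ymzhang319/FoleyCrafter": "checkpoints",
    "auffusion/auffusion-full-no-adapter": "checkpoints/auffusion",
}


def download_checkpoints(repos=REPOS):
    """Fetch each Hugging Face repository into its target folder."""
    # Imported lazily so the mapping can be inspected without the package.
    # Assumption: huggingface_hub is installed separately if needed.
    from huggingface_hub import snapshot_download

    for repo_id, local_dir in repos.items():
        snapshot_download(repo_id=repo_id, local_dir=local_dir)
```

Calling `download_checkpoints()` from the repository root starts the download. Unlike git clone, this route does not need the target folder to be empty.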

Put checkpoints as follows:

└── checkpoints
    ├── semantic
    │   └── semantic_adapter.bin
    ├── vocoder
    │   ├── vocoder.pt
    │   └── config.json
    ├── temporal_adapter.ckpt
    └── timestamp_detector.pth.tar
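A quick way to confirm that a manual download matches the layout above is to check for each file before running inference. The following is a minimal sketch, not part of the FoleyCrafter code base: `EXPECTED_FILES` and `missing_checkpoints` are hypothetical names, and the list covers only the files shown in the tree (it does not inspect the Auffusion folder).

```python
from pathlib import Path

# Relative paths taken from the checkpoint tree above.
EXPECTED_FILES = [
    "semantic/semantic_adapter.bin",
    "vocoder/vocoder.pt",
    "vocoder/config.json",
    "temporal_adapter.ckpt",
    "timestamp_detector.pth.tar",
]


def missing_checkpoints(root="checkpoints"):
    """Return the expected files that are not present under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing checkpoint files:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All expected checkpoint files are present.")
```

Run it from the repository root, or pass a different folder to `missing_checkpoints` if the checkpoints live elsewhere.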

Gradio demo

You can launch the Gradio interface for FoleyCrafter by running the following command:

python app.py --share

Inference

Video To Audio Generation

python inference.py --save_dir=output/sora/

Results:

Input Video	Generated Audio
0.mp4	0.mp4
1.mp4	1.mp4
2.mp4	2.mp4
3.mp4	3.mp4

Temporal Alignment with Visual Cues

python inference.py \
--temporal_align \
--input=input/avsync \
--save_dir=output/avsync/

Results:

Ground Truth	Generated Audio
0.mp4	0.mp4
1.mp4	1.mp4
2.mp4	2.mp4

Text-based Video to Audio Generation

Using Prompt

# case1
python inference.py \
--input=input/PromptControl/case1/ \
--seed=10201304011203481429 \
--save_dir=output/PromptControl/case1/

python inference.py \
--input=input/PromptControl/case1/ \
--seed=10201304011203481429 \
--prompt='noisy, people talking' \
--save_dir=output/PromptControl/case1_prompt/

# case2
python inference.py \
--input=input/PromptControl/case2/ \
--seed=10021049243103289113 \
--save_dir=output/PromptControl/case2/

python inference.py \
--input=input/PromptControl/case2/ \
--seed=10021049243103289113 \
--prompt='seagulls' \
--save_dir=output/PromptControl/case2_prompt/

Results:

Generated Audio	Generated Audio
Without Prompt	Prompt: noisy, people talking
0.mp4	0.mp4
Without Prompt	Prompt: seagulls
0.mp4	0.mp4

Using Negative Prompt

# case3
python inference.py \
--input=input/PromptControl/case3/ \
--seed=10041042941301238011 \
--save_dir=output/PromptControl/case3/

python inference.py \
--input=input/PromptControl/case3/ \
--seed=10041042941301238011 \
--nprompt='river flows' \
--save_dir=output/PromptControl/case3_nprompt/

# case4
python inference.py \
--input=input/PromptControl/case4/ \
--seed=10014024412012338096 \
--save_dir=output/PromptControl/case4/

python inference.py \
--input=input/PromptControl/case4/ \
--seed=10014024412012338096 \
--nprompt='noisy, wind noise' \
--save_dir=output/PromptControl/case4_nprompt/

Results:

Generated Audio	Generated Audio
Without Prompt	Negative Prompt: river flows
0.mp4	0.mp4
Without Prompt	Negative Prompt: noisy, wind noise
0.mp4	0.mp4
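Each comparison above runs inference.py twice on the same input with the same seed, once without conditioning and once with a prompt or negative prompt. That pattern can be scripted; the sketch below only assembles and runs command lines from flags that appear in this README (`--input`, `--save_dir`, `--seed`, `--prompt`, `--nprompt`). The helper names `build_command`, `comparison_pair`, and `run_pair` are hypothetical, and the folder naming simply copies the examples above.

```python
import subprocess


def build_command(input_dir, save_dir, seed=None, prompt=None, nprompt=None):
    """Build an inference.py command using only flags shown in this README."""
    cmd = ["python", "inference.py", f"--input={input_dir}", f"--save_dir={save_dir}"]
    if seed is not None:
        cmd.append(f"--seed={seed}")
    if prompt is not None:
        cmd.append(f"--prompt={prompt}")
    if nprompt is not None:
        cmd.append(f"--nprompt={nprompt}")
    return cmd


def comparison_pair(case, seed, prompt=None, nprompt=None):
    """Return (baseline, conditioned) commands that share input and seed."""
    if prompt is None and nprompt is None:
        raise ValueError("pass a prompt or a negative prompt")
    input_dir = f"input/PromptControl/{case}/"
    suffix = "prompt" if prompt is not None else "nprompt"
    baseline = build_command(input_dir, f"output/PromptControl/{case}/", seed)
    conditioned = build_command(
        input_dir, f"output/PromptControl/{case}_{suffix}/", seed, prompt, nprompt
    )
    return baseline, conditioned


def run_pair(case, seed, prompt=None, nprompt=None):
    """Run both commands in order; raises CalledProcessError on failure."""
    for cmd in comparison_pair(case, seed, prompt, nprompt):
        subprocess.run(cmd, check=True)
```

For example, `run_pair("case2", 10021049243103289113, prompt="seagulls")` would reproduce the two case2 commands. Passing the arguments as a list avoids shell quoting problems with prompts that contain spaces or commas.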

Commandline Usage Parameters

options:
  -h, --help            show this help message and exit
  --prompt PROMPT       prompt for audio generation
  --nprompt NPROMPT     negative prompt for audio generation
  --seed SEED           random seed
  --temporal_align TEMPORAL_ALIGN
                        use temporal adapter or not
  --temporal_scale TEMPORAL_SCALE
                        temporal align scale
  --semantic_scale SEMANTIC_SCALE
                        visual content scale
  --input INPUT         input video folder path
  --ckpt CKPT           checkpoints folder path
  --save_dir SAVE_DIR   generation result save path
  --pretrain PRETRAIN   generator checkpoint path
  --device DEVICE
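The `--temporal_scale` and `--semantic_scale` options listed above suggest comparing outputs across several settings. The sketch below generates one command per combination and prints it without running anything. The function name `sweep_commands`, the output folder naming, and the example scale values are all assumptions; this README does not state defaults or valid ranges. `--temporal_align` is passed as a bare flag, following the temporal alignment example earlier in this document.

```python
from itertools import product


def sweep_commands(input_dir, out_root, temporal_scales, semantic_scales, seed=None):
    """Yield one inference.py command per (temporal, semantic) scale pair."""
    for t_scale, s_scale in product(temporal_scales, semantic_scales):
        cmd = [
            "python", "inference.py",
            "--temporal_align",
            f"--input={input_dir}",
            f"--temporal_scale={t_scale}",
            f"--semantic_scale={s_scale}",
            # Hypothetical naming scheme: one output folder per combination.
            f"--save_dir={out_root}/t{t_scale}_s{s_scale}/",
        ]
        if seed is not None:
            cmd.append(f"--seed={seed}")
        yield cmd


# Example values only; they are not recommendations from the authors.
for cmd in sweep_commands("input/avsync", "output/sweep", [0.2, 1.0], [0.5, 1.0]):
    print(" ".join(cmd))
```

Fixing the seed across the sweep keeps the scale values as the only documented setting that changes between runs.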

BibTeX

@misc{zhang2024pia,
  title={FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds},
  author={Yiming Zhang and Yicheng Gu and Yanhong Zeng and Zhening Xing and Yuancheng Wang and Zhizheng Wu and Kai Chen},
  year={2024},
  eprint={2407.01494},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}

Contact Us

Yiming Zhang: zhangyiming@pjlab.org.cn

Yicheng Gu: yichenggu@link.cuhk.edu.cn

Yanhong Zeng: zengyanhong@pjlab.org.cn

LICENSE

Please check the Apache-2.0 license for details.

Acknowledgements

The code is built upon Auffusion, CondFoleyGen and SpecVQGAN.

We also recommend Amphion, a toolkit for audio, music, and speech generation 💝.
