Generate images from anything with ImageBind's unified latent space and stable-diffusion-2-1-unclip.
- No training is needed.
- Integration with 🤗 Diffusers.
- Online demos with Hugging Face Gradio and Google Colab.
Note that the Gradio and Colab online demos may require a Pro account to obtain enough GPU memory to run them.
Supported Tasks
- Audio to Image
- Audio+Text to Image
- Audio+Image to Image
- Image to Image
- Text to Image
- Thermal to Image
- Depth to Image: Coming soon.
Update
[2023/5/19]
- Anything2Image has been integrated into InternGPT.
- [v1.1.4]: Support fusing audio and text in the ImageBind latent space; UI improvements.
[2023/5/18]
- [v1.1.3]: Support thermal to image.
- [v1.1.0]: Gradio GUI: add options for controlling image size and the noise scheduler.
- [v1.0.8]: Gradio GUI: add options for controlling noise level, audio-image embedding arithmetic strength, and the number of inference steps.
Demo video: anything2image.mp4
Requirements
Ensure you have PyTorch installed.
- Python >= 3.8
- PyTorch >= 1.13
Then install anything2image:
# from pypi
pip install anything2image
# or install locally via git clone
git clone git@github.com:Zeqiang-Lai/Anything2Image.git
cd Anything2Image
pip install .
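After installation, a quick sanity check confirms the package and its ImageBind entry point are importable (this snippet uses only imports that appear in the examples below):

import torch
import anything2image.imagebind as ib

print(torch.__version__)              # should be >= 1.13
print(torch.cuda.is_available())      # a CUDA GPU is strongly recommended
print(hasattr(ib, "imagebind_huge"))  # True if the package imported correctly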
Usage
# launch the gradio demo
python -m anything2image.app
# command-line demo; see also the task examples below.
python -m anything2image.cli --audio assets/wav/cat.wav
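Every task below follows the same recipe: embed the input with ImageBind, then pass the embedding to the unCLIP pipeline as image_embeds. A minimal sketch of that shared pattern (the any2img helper is illustrative, not part of the repo's API):

import torch
import anything2image.imagebind as ib
from diffusers import StableUnCLIPImg2ImgPipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
pipe = StableUnCLIPImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-unclip", torch_dtype=torch.float16
).to(device)
model = ib.imagebind_huge(pretrained=True).eval().to(device)

def any2img(modality, inputs, prompt=None):
    # illustrative helper: any ImageBind modality -> unCLIP conditioning
    with torch.no_grad():
        emb = model.forward({modality: inputs})[modality]
        return pipe(prompt=prompt, image_embeds=emb.half()).images[0]

any2img(ib.ModalityType.AUDIO,
        ib.load_and_transform_audio_data(["assets/wav/cat.wav"], device)).save("out.png")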
Audio to Image

Example results: generated images for bird_audio.wav, dog_audio.wav, cattle.wav, cat.wav, fire_engine.wav, train.wav, motorcycle.wav, and plane.wav (images omitted here).
python -m anything2image.cli --audio assets/wav/cat.wav
See also audio2img.py.
import anything2image.imagebind as ib
import torch
from diffusers import StableUnCLIPImg2ImgPipeline
# construct models
device = "cuda:0" if torch.cuda.is_available() else "cpu"
pipe = StableUnCLIPImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-unclip", torch_dtype=torch.float16
).to(device)
model = ib.imagebind_huge(pretrained=True).eval().to(device)
# generate image
with torch.no_grad():
    audio_paths = ["assets/wav/bird_audio.wav"]
    embeddings = model.forward({
        ib.ModalityType.AUDIO: ib.load_and_transform_audio_data(audio_paths, device),
    })
    embeddings = embeddings[ib.ModalityType.AUDIO]
    images = pipe(image_embeds=embeddings.half()).images
    images[0].save("audio2img.png")
Audio+Text to Image

| cat.wav | cat.wav | bird_audio.wav | bird_audio.wav |
|---|---|---|---|
| A painting | A photo | A painting | A photo |

(Generated images omitted.)
python -m anything2image.cli --audio assets/wav/cat.wav --prompt "a painting"
See also audiotext2img.py.
with torch.no_grad():
    audio_paths = ["assets/wav/bird_audio.wav"]
    embeddings = model.forward({
        ib.ModalityType.AUDIO: ib.load_and_transform_audio_data(audio_paths, device),
    })
    embeddings = embeddings[ib.ModalityType.AUDIO]
    images = pipe(prompt='a painting', image_embeds=embeddings.half()).images
    images[0].save("audiotext2img.png")
Audio & Image | Output | Audio & Image | Output |
---|---|---|---|
wave.wav | wave.wav |
python -m anything2image.cli --audio assets/wav/wave.wav --image "assets/image/bird.png"
with torch.no_grad():
    # image embedding
    embeddings = model.forward({
        ib.ModalityType.VISION: ib.load_and_transform_vision_data(["assets/image/bird.png"], device),
    })
    img_embeddings = embeddings[ib.ModalityType.VISION]
    # audio embedding
    embeddings = model.forward({
        ib.ModalityType.AUDIO: ib.load_and_transform_audio_data(["assets/wav/wave.wav"], device),
    }, normalize=False)
    audio_embeddings = embeddings[ib.ModalityType.AUDIO]
    # average the two embeddings and condition the pipeline on the result
    embeddings = (img_embeddings + audio_embeddings) / 2
    images = pipe(image_embeds=embeddings.half()).images
    images[0].save("audioimg2img.png")
Image to Image

Top: input images. Bottom: generated images. (Images omitted here.)
python -m anything2image.cli --image "assets/image/bird.png"
See also img2img.py.
with torch.no_grad():
    paths = ["assets/image/room.png"]
    embeddings = model.forward({
        ib.ModalityType.VISION: ib.load_and_transform_vision_data(paths, device),
    }, normalize=False)
    embeddings = embeddings[ib.ModalityType.VISION]
    images = pipe(image_embeds=embeddings.half()).images
    images[0].save("img2img.png")
Text to Image

Example prompts: "A photo of a car.", "A sunset over the ocean.", "A bird's-eye view of a cityscape.", "A close-up of a flower." (Generated images omitted.)
Using ImageBind is not necessary for text-to-image generation. Nevertheless, it demonstrates the alignment between ImageBind's text latent space and its image latent space.
python -m anything2image.cli --text "A sunset over the ocean."
See also text2img.py.
with torch.no_grad():
    embeddings = model.forward({
        ib.ModalityType.TEXT: ib.load_and_transform_text(['A photo of a car.'], device),
    }, normalize=False)
    embeddings = embeddings[ib.ModalityType.TEXT]
    images = pipe(image_embeds=embeddings.half()).images
    images[0].save("text2img.png")
Thermal to Image

Input thermal images and their generated outputs (images omitted here).
python -m anything2image.cli --thermal "assets/thermal/030419.jpg"
See also thermal2img.py.
with torch.no_grad():
    thermal_paths = ['assets/thermal/030419.jpg']
    embeddings = model.forward({
        ib.ModalityType.THERMAL: ib.load_and_transform_thermal_data(thermal_paths, device),
    }, normalize=True)
    embeddings = embeddings[ib.ModalityType.THERMAL]
    images = pipe(image_embeds=embeddings.half()).images
    images[0].save("thermal2img.png")
Citation

Latent Diffusion
@InProceedings{Rombach_2022_CVPR,
author = {Rombach, Robin and Blattmann, Andreas and Lorenz, Dominik and Esser, Patrick and Ommer, Bj\"orn},
title = {High-Resolution Image Synthesis With Latent Diffusion Models},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2022},
pages = {10684-10695}
}
ImageBind
@inproceedings{girdhar2023imagebind,
title={ImageBind: One Embedding Space To Bind Them All},
author={Girdhar, Rohit and El-Nouby, Alaaeldin and Liu, Zhuang
and Singh, Mannat and Alwala, Kalyan Vasudev and Joulin, Armand and Misra, Ishan},
booktitle={CVPR},
year={2023}
}