Generate images from anything with ImageBind's unified latent space and stable-diffusion-2-1-unclip.
- No training is needed.
- Integration with 🤗 Diffusers.
- Online demos with Hugging Face Gradio and Google Colab.
Note that the Gradio and Colab online demos may require a Pro account to obtain enough GPU memory to run them.
Supported Tasks
- Audio to Image
- Audio+Text to Image
- Audio+Image to Image
- Image to Image
- Text to Image
- Thermal to Image
- Depth to Image: Coming soon.
Update
[2023/5/19]
- Anything2Image has been integrated into InternGPT.
- [v1.1.4]: Support fusing audio and text in the ImageBind latent space; UI improvements.
[2023/5/18]
- [v1.1.3]: Support thermal to image.
- [v1.1.0]: Gradio GUI: add options for controlling image size and the noise scheduler.
- [v1.0.8]: Gradio GUI: add options for controlling noise level, audio-image embedding arithmetic strength, and the number of inference steps.
Demo video: anything2image.mp4
Requirements
Ensure you have PyTorch installed.
- Python >= 3.8
- PyTorch >= 1.13
Then install anything2image:
# from pypi
pip install anything2image
# or install locally via git clone
git clone git@github.com:Zeqiang-Lai/Anything2Image.git
cd Anything2Image
pip install .
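After installation, a quick sanity check confirms the package and its ImageBind entry point are importable (this snippet uses only imports that appear in the examples below):

import torch
import anything2image.imagebind as ib

print(torch.__version__)              # should be >= 1.13
print(torch.cuda.is_available())      # a CUDA GPU is strongly recommended
print(hasattr(ib, "imagebind_huge"))  # True if the package imported correctly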
Usage
# launch the gradio demo
python -m anything2image.app
# command-line demo; see also the task examples below.
python -m anything2image.cli --audio assets/wav/cat.wav
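Every task below follows the same recipe: embed the input with ImageBind, then pass the embedding to the unCLIP pipeline as image_embeds. A minimal sketch of that shared pattern (the any2img helper is illustrative, not part of the repo's API):

import torch
import anything2image.imagebind as ib
from diffusers import StableUnCLIPImg2ImgPipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
pipe = StableUnCLIPImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-unclip", torch_dtype=torch.float16
).to(device)
model = ib.imagebind_huge(pretrained=True).eval().to(device)

def any2img(modality, inputs, prompt=None):
    # illustrative helper: any ImageBind modality -> unCLIP conditioning
    with torch.no_grad():
        emb = model.forward({modality: inputs})[modality]
        return pipe(prompt=prompt, image_embeds=emb.half()).images[0]

any2img(ib.ModalityType.AUDIO,
        ib.load_and_transform_audio_data(["assets/wav/cat.wav"], device)).save("out.png")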
Audio to Image

Example results: generated images for bird_audio.wav, dog_audio.wav, cattle.wav, cat.wav, fire_engine.wav, train.wav, motorcycle.wav, and plane.wav (images omitted here).
python -m anything2image.cli --audio assets/wav/cat.wav
See also audio2img.py.
import anything2image.imagebind as ib
import torch
from diffusers import StableUnCLIPImg2ImgPipeline
# construct models
device = "cuda:0" if torch.cuda.is_available() else "cpu"
pipe = StableUnCLIPImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-unclip", torch_dtype=torch.float16
).to(device)
model = ib.imagebind_huge(pretrained=True).eval().to(device)
# generate image
with torch.no_grad():
    audio_paths = ["assets/wav/bird_audio.wav"]
    embeddings = model.forward({
        ib.ModalityType.AUDIO: ib.load_and_transform_audio_data(audio_paths, device),
    })
    embeddings = embeddings[ib.ModalityType.AUDIO]
    images = pipe(image_embeds=embeddings.half()).images
    images[0].save("audio2img.png")
Audio+Text to Image

| cat.wav | cat.wav | bird_audio.wav | bird_audio.wav |
|---|---|---|---|
| A painting | A photo | A painting | A photo |

(Generated images omitted.)
python -m anything2image.cli --audio assets/wav/cat.wav --prompt "a painting"
See also audiotext2img.py.
with torch.no_grad():
    audio_paths = ["assets/wav/bird_audio.wav"]
    embeddings = model.forward({
        ib.ModalityType.AUDIO: ib.load_and_transform_audio_data(audio_paths, device),
    })
    embeddings = embeddings[ib.ModalityType.AUDIO]
    images = pipe(prompt='a painting', image_embeds=embeddings.half()).images
    images[0].save("audiotext2img.png")
Audio & Image | Output | Audio & Image | Output |
---|---|---|---|
wave.wav | wave.wav |
python -m anything2image.cli --audio assets/wav/wave.wav --image "assets/image/bird.png"
with torch.no_grad():
    # image embedding
    embeddings = model.forward({
        ib.ModalityType.VISION: ib.load_and_transform_vision_data(["assets/image/bird.png"], device),
    })
    img_embeddings = embeddings[ib.ModalityType.VISION]
    # audio embedding
    embeddings = model.forward({
        ib.ModalityType.AUDIO: ib.load_and_transform_audio_data(["assets/wav/wave.wav"], device),
    }, normalize=False)
    audio_embeddings = embeddings[ib.ModalityType.AUDIO]
    # average the two embeddings and condition the pipeline on the result
    embeddings = (img_embeddings + audio_embeddings) / 2
    images = pipe(image_embeds=embeddings.half()).images
    images[0].save("audioimg2img.png")
Image to Image

Top: input images. Bottom: generated images. (Images omitted here.)
python -m anything2image.cli --image "assets/image/bird.png"
See also img2img.py.
with torch.no_grad():
    paths = ["assets/image/room.png"]
    embeddings = model.forward({
        ib.ModalityType.VISION: ib.load_and_transform_vision_data(paths, device),
    }, normalize=False)
    embeddings = embeddings[ib.ModalityType.VISION]
    images = pipe(image_embeds=embeddings.half()).images
    images[0].save("img2img.png")
Text to Image

Example prompts: "A photo of a car.", "A sunset over the ocean.", "A bird's-eye view of a cityscape.", "A close-up of a flower." (Generated images omitted.)
Using ImageBind is not necessary for text-to-image generation. Nevertheless, it demonstrates the alignment between ImageBind's text latent space and its image latent space.
python -m anything2image.cli --text "A sunset over the ocean."
See also text2img.py.
with torch.no_grad():
    embeddings = model.forward({
        ib.ModalityType.TEXT: ib.load_and_transform_text(['A photo of a car.'], device),
    }, normalize=False)
    embeddings = embeddings[ib.ModalityType.TEXT]
    images = pipe(image_embeds=embeddings.half()).images
    images[0].save("text2img.png")
Thermal to Image

Input thermal images and their generated outputs (images omitted here).
python -m anything2image.cli --thermal "assets/thermal/030419.jpg"
See also thermal2img.py.
with torch.no_grad():
    thermal_paths = ['assets/thermal/030419.jpg']
    embeddings = model.forward({
        ib.ModalityType.THERMAL: ib.load_and_transform_thermal_data(thermal_paths, device),
    }, normalize=True)
    embeddings = embeddings[ib.ModalityType.THERMAL]
    images = pipe(image_embeds=embeddings.half()).images
    images[0].save("thermal2img.png")
Citation

Latent Diffusion
@InProceedings{Rombach_2022_CVPR,
author = {Rombach, Robin and Blattmann, Andreas and Lorenz, Dominik and Esser, Patrick and Ommer, Bj\"orn},
title = {High-Resolution Image Synthesis With Latent Diffusion Models},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2022},
pages = {10684-10695}
}
ImageBind
@inproceedings{girdhar2023imagebind,
title={ImageBind: One Embedding Space To Bind Them All},
author={Girdhar, Rohit and El-Nouby, Alaaeldin and Liu, Zhuang
and Singh, Mannat and Alwala, Kalyan Vasudev and Joulin, Armand and Misra, Ishan},
booktitle={CVPR},
year={2023}
}