speech2image

Create realistic AI-generated images from human voice

Leveraging OpenAI Whisper and Stable Diffusion in a cloud-native application powered by Jina

Under the hood, the Whisper and Stable Diffusion models are each wrapped in an Executor, which turns them into self-contained microservices. The two microservices are chained into a Flow, which exposes a gRPC endpoint that accepts a DocumentArray as input.

This is an example of a multi-modal application that can be built with Jina.
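
The data flow through the two chained stages can be sketched without any dependencies. This is only an illustration in plain Python, not Jina's API: `Doc`, `transcribe`, `generate_images` and `run_flow` are hypothetical stand-ins for a docarray Document, the Whisper Executor, the Stable Diffusion Executor and the Flow, and no model is actually called.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Doc:
    """Hypothetical stand-in for a docarray Document (not the real class)."""
    uri: str = ""
    text: str = ""
    matches: List["Doc"] = field(default_factory=list)


def transcribe(docs: List[Doc]) -> List[Doc]:
    """Stand-in for the Whisper stage: sets `text` for the audio at `uri`."""
    for doc in docs:
        # A real speech model would decode the audio file here.
        doc.text = f"transcript of {doc.uri}"
    return docs


def generate_images(docs: List[Doc], n_images: int = 2) -> List[Doc]:
    """Stand-in for the Stable Diffusion stage: attaches images as matches."""
    for doc in docs:
        # A real image model would render `doc.text` as a prompt here.
        doc.matches = [Doc(uri=f"image_{i}.png", text=doc.text) for i in range(n_images)]
    return docs


def run_flow(docs: List[Doc], stages: List[Callable[[List[Doc]], List[Doc]]]) -> List[Doc]:
    """Pass the documents through each stage in order, as a chain would."""
    for stage in stages:
        docs = stage(docs)
    return docs


result = run_flow([Doc(uri="audio.wav")], [transcribe, generate_images])
print(result[0].text)
print([m.uri for m in result[0].matches])
```

The point of the sketch is the shape of the result: each input document keeps its transcript in `text`, and the generated images come back attached as `matches`, which is how the client example later in this README reads them.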

How to use it?

Install requirements:

pip install -r requirements.txt
pip install -r executors/stablediffusion/requirements.txt
pip install -r executors/whisper/requirements.txt

Start the Jina Flow. To download the model weights you need a Hugging Face (HF) token and must accept the Stable Diffusion terms; otherwise, provide the weights to the Executor yourself.

JINA_MP_START_METHOD=spawn HF_TOKEN=YOUR_HF_TOKEN python flow.py
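
Because the token is passed through the `HF_TOKEN` environment variable, a missing or placeholder value is easy to overlook until the weight download fails. A minimal sketch of an early check follows; `require_hf_token` is a hypothetical helper, not part of this repository, and where it would be called in flow.py is an assumption.

```python
import os
from typing import Mapping


def require_hf_token(env: Mapping[str, str] = os.environ) -> str:
    """Return the HF token from `env`, or fail early with a clear message.

    Hypothetical helper: the variable name HF_TOKEN and the placeholder
    YOUR_HF_TOKEN are taken from the command shown above.
    """
    token = env.get("HF_TOKEN", "").strip()
    if not token or token == "YOUR_HF_TOKEN":
        raise RuntimeError(
            "HF_TOKEN is not set. Export a Hugging Face token, or provide "
            "the model weights to the Executor yourself."
        )
    return token
```

Calling `require_hf_token()` before the Flow starts would surface the problem immediately instead of partway through model loading.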

Alternatively, you can deploy the Flow on JCloud. To do so, edit flow.yml and put your HF token in it.

pip install jcloud
jc login
jc deploy flow.yml

Start the Gradio UI:

python ui.py

Or, if you deployed the Flow on JCloud:

python ui.py --host grpcs://FLOW_ID.wolf.jina.ai

Or talk directly to the backend with the Jina Client:

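from docarray import Document
from jina import Client

# Connect to the Flow's gRPC endpoint (port as used in this example)
client = Client(host='localhost:54322')

# Send one audio file; the generated images come back as matches
docs = client.post('/', inputs=[Document(uri='audio.wav')])

# Load each generated image from its URI into a tensor
for img in docs[0].matches:
    img.load_uri_to_image_tensor()

# Display all generated images in a single sprite sheet
docs[0].matches.plot_image_sprites()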

