Given that:
- Visual art is a foundational form of human self-expression
- Speech is the foundational form of human communication
- Not everyone is literate
- Not everyone is sufficiently skilled or confident to generate visual art through traditional or digital media
And in particular:
- Not everyone is a native English speaker
- The most powerful AI text-to-image generation models are based exclusively on English-language prompts
Therefore:
- This project intends to provide a means for anyone to generate visual art directly through their speech, without presumption or prejudice with regard to their native language or level of literacy.
- Speech input to notebook: `ipywebrtc`
- Spoken language detection: OpenAI's `whisper`
- Speech-to-text (speech-to-English): OpenAI's `whisper`
- (English) text-to-image: Stability AI's `stable-diffusion`
  - Locally via 🤗 Diffusers, or through DeepAI's API
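The pipeline above can be sketched end to end in a few lines. This is a minimal sketch, assuming `whisper` and `diffusers` are installed; the model size (`"base"`), the default weights id, and the function name are illustrative choices, not fixed by the project:

```python
def speech_to_image(wav_path, weights="runwayml/stable-diffusion-v1-5"):
    """Translate speech in any language to an English prompt, then render an image.

    Illustrative sketch: the "base" model size and the default weights id
    are assumptions, not project requirements.
    """
    # Heavy imports are kept local so this file imports even where the
    # packages are not installed.
    import whisper
    from diffusers import StableDiffusionPipeline

    model = whisper.load_model("base")
    # task="translate" makes whisper emit English text regardless of the
    # spoken language; the detected language is reported alongside it.
    result = model.transcribe(wav_path, task="translate")
    prompt = result["text"].strip()
    print(f"detected language: {result['language']!r}, prompt: {prompt!r}")

    pipe = StableDiffusionPipeline.from_pretrained(weights)
    image = pipe(prompt).images[0]  # a PIL.Image
    return prompt, image
```

Note that a single `transcribe` call handles both the language-detection and speech-to-English steps listed above.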
- Python v3.10.6
- An account on 🤗 (Hugging Face)
  - Must accept the T&C before downloading the `stable-diffusion` weights
- `ffmpeg`
  - Can install via `brew`, `apt`, `conda` or other package manager
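As a convenience, the prerequisites above can be checked from Python before going further; `check_prerequisites` is a hypothetical helper, not part of the project:

```python
import shutil
import sys

def check_prerequisites():
    """Return a list of human-readable problems; an empty list means ready to go."""
    problems = []
    # the project targets Python v3.10.6; check at least the major.minor pair
    if sys.version_info[:2] != (3, 10):
        problems.append(f"expected Python 3.10.x, found {sys.version.split()[0]}")
    # whisper shells out to ffmpeg to decode audio, so it must be on PATH
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg not found on PATH")
    return problems
```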
- Create and activate a fresh Python v3.10.6 `venv`
- `git clone` this repository
- Install the dependencies with `pip install -r requirements.txt`
- Download the `stable-diffusion` weights:

  ```shell
  git lfs install
  git clone https://huggingface.co/runwayml/stable-diffusion-v1-5
  ```
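Once set up, speech can be captured in a notebook cell with ipywebrtc's widgets. A minimal sketch, assuming `ipywebrtc` is installed; the helper names and the output filename are illustrative:

```python
def make_recorder():
    """Return an AudioRecorder widget; display it in a cell and click record."""
    # local import so this module loads even without ipywebrtc installed
    from ipywebrtc import AudioRecorder, CameraStream

    # audio-only stream: no camera access is requested from the browser
    stream = CameraStream(constraints={"audio": True, "video": False})
    return AudioRecorder(stream=stream)

def save_recording(recorder, path="recording.webm"):
    """Write the captured audio to disk so whisper can read it."""
    with open(path, "wb") as f:
        f.write(recorder.audio.value)
    return path
```

Displaying the widget returned by `make_recorder()` in a cell shows the record button; after recording, `save_recording(recorder)` produces a file that can be fed straight to `whisper`.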