Stability-AI / stable-audio-demo

arXiv: Stable Audio's paper

stable-audio-tools: code to reproduce Stable Audio

stable-audio-metrics: code to evaluate Stable Audio

Our model can generate variable-length and long-form stereo music at 44.1kHz:

| Generated Stereo Music | Prompt |
| (audio) | Berlin techno, rave, drum machine, kick, ARP synthesizer, dark, moody, hypnotic, evolving, 135 BPM. Loop. |
| (audio) | Uplifting acoustic loop. 120 BPM. |
| (audio) | Disco, Driving Drum Machine, Synthesizer, Bass, Piano, Guitars, Instrumental, Clubby, Euphoric, Chicago, New York, 115 BPM. |
| (audio) | Calm meditation music to play in a spa lobby. |
| (audio) | Drum solo. |

Unlike previous state-of-the-art models, ours can generate stereo sound effects at 44.1kHz:

| Generated Stereo Sounds | Prompt |
| (audio) | Door slam. High-quality, stereo. |
| (audio) | Sports car passing by. High-quality, stereo. |
| (audio) | Motorbike passing by. High-quality, stereo. |
| (audio) | Fireworks. High-quality, stereo. |
| (audio) | Reverberant footsteps inside a large rocky cave. High-quality, stereo. |

Note that all the examples on this website were generated with a single model, which can generate both variable-length music and sound effects in 44.1kHz stereo. We append "high-quality, stereo" to our sound-effect prompts because it generally helps.
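The prompt convention described above can be sketched as a small helper. This is only an illustration: the function name `with_quality_suffix` and the constant `QUALITY_SUFFIX` are hypothetical and not part of stable-audio-tools, and the exact punctuation is an assumption based on the sound-effect prompts listed above.

```python
# Hypothetical helper: not part of stable-audio-tools.
# Suffix text copied from the sound-effect prompts shown on this page.
QUALITY_SUFFIX = "High-quality, stereo."


def with_quality_suffix(prompt: str, suffix: str = QUALITY_SUFFIX) -> str:
    """Append the quality suffix to a sound-effect prompt.

    The prompt is returned unchanged (apart from trimmed whitespace)
    if it already ends with the suffix, compared case-insensitively.
    """
    text = prompt.strip()
    if text.lower().endswith(suffix.lower()):
        return text
    # Close the description with a period so the suffix reads as its own sentence.
    if text and text[-1] not in ".!?":
        text += "."
    return f"{text} {suffix}" if text else suffix


if __name__ == "__main__":
    for raw in ["Door slam", "Fireworks. High-quality, stereo."]:
        print(with_quality_suffix(raw))
```

Music prompts on this page are used without the suffix, so a helper like this would be applied only to sound-effect prompts.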

Long-form stereo music: comparison with state-of-the-art with MusicCaps prompts

Prompt: This song contains someone strumming a melody on a mandolin while more people are whistling along. Then a mandolin, an e-bass and an acoustic guitar are playing a short melody in a lower key before breaking into the next part along with flutes and percussions. This song may be played outside by musicians performing.

| Our Model | MusicGen-large | MusicGen-stereo | AudioLDM2 |
| (stereo, 44.1kHz) | (mono, 32kHz) | (stereo, 32kHz) | (mono, 48kHz) |
| (audio) | (audio) | (audio) | (audio) |

Prompt: The commercial music features a groovy piano melody played over snare rolls in the first half of the loop. Right after, there is a drop that consists of a punchy "4 on the floor" kick pattern, shimmering hi hats, claps, groovy piano and wide synth lead melody. It sounds happy, fun, euphoric and exciting.

| Our Model | MusicGen-large | MusicGen-stereo | AudioLDM2 |
| (stereo, 44.1kHz) | (mono, 32kHz) | (stereo, 32kHz) | (mono, 48kHz) |
| (audio) | (audio) | (audio) | (audio) |

These prompts/audios were used for the qualitative study we report in our paper.

Sound effects: comparison with state-of-the-art with AudioCaps prompts

Prompt: Clicking and sputtering then eventual revving of an idling engine.

| Our Model | Audiogen-medium | AudioLDM2 |
| (stereo, 44.1kHz) | (mono, 32kHz) | (mono, 48kHz) |
| (audio) | (audio) | (audio) |

Prompt: Birds chirping loudly.

| Our Model | Audiogen-medium | AudioLDM2 |
| (stereo, 44.1kHz) | (mono, 32kHz) | (mono, 48kHz) |
| (audio) | (audio) | (audio) |

These prompts/audios were used for the qualitative study we report in our paper. Note that the (randomly) selected prompts from AudioCaps did not require substantial stereo movement, resulting in renders that are relatively non-spatial.

Autoencoder: reconstructions

This comparison is useful for evaluating the audio fidelity of the autoencoder. On the left, we have the ground truth recording. On the right, we take the ground truth recording and pass it through the autoencoder. Note that the autoencoder reconstruction is fairly transparent: it is very close to the ground truth.

| Ground truth | Autoencoder reconstruction |
| (audio) | (audio) |
| (audio) | (audio) |
| (audio) | (audio) |
| (audio) | (audio) |
| (audio) | (audio) |
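The comparison above is made by listening. As a rough numeric complement, one could compute a signal-to-noise ratio between a ground truth clip and its reconstruction. The sketch below is an assumption for illustration only: SNR is a generic waveform measure, not the metric used in the paper or in stable-audio-metrics, and it does not capture perceptual transparency. The function name `snr_db` is hypothetical, and the inputs are assumed to be time-aligned sample sequences of equal length.

```python
import math
from typing import Sequence


def snr_db(reference: Sequence[float], reconstruction: Sequence[float]) -> float:
    """Signal-to-noise ratio in dB of a reconstruction against its reference.

    Hypothetical illustration: both inputs are assumed to be time-aligned
    mono sample sequences of equal length (e.g. one channel at 44.1kHz).
    """
    if len(reference) != len(reconstruction):
        raise ValueError("reference and reconstruction must have the same length")
    signal_energy = sum(x * x for x in reference)
    noise_energy = sum((x - y) ** 2 for x, y in zip(reference, reconstruction))
    if noise_energy == 0.0:
        return math.inf  # identical signals: no reconstruction error
    if signal_energy == 0.0:
        return -math.inf  # silent reference with a non-silent reconstruction
    return 10.0 * math.log10(signal_energy / noise_energy)


if __name__ == "__main__":
    # Toy data: a short sine wave and a slightly attenuated copy of it.
    truth = [math.sin(2 * math.pi * 440 * n / 44100) for n in range(441)]
    recon = [0.99 * s for s in truth]
    print(f"SNR: {snr_db(truth, recon):.1f} dB")
```

For stereo clips, the same function could be applied per channel. A high SNR suggests a close waveform match, but a low SNR does not necessarily mean the reconstruction sounds different, since small phase or timing shifts lower the score without being audible.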
