JonathanFly / generative-disco

🎼 text-to-video system for music visualization


🪩 generative disco

June'23 Update: Hugging Face Spaces demo available here: vivlavida/generative-disco 🌷

(GenDisco teaser image)

Visuals are a core part of our experience of music. However, creating music visualizations is a complex, time-consuming, and resource-intensive process. We introduce Generative Disco, a generative AI system that helps generate music visualizations with large language models and text-to-image models. Users select intervals of music to visualize and then parameterize that visualization by defining start and end prompts. The system warps between these prompts and generates frames to the beat of the music, producing audioreactive video. We introduce design patterns for improving generated videos: "transitions", which express shifts in color, time, subject, or style, and "holds", which encourage visual emphasis and consistency.
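The beat-driven warping can be illustrated with a small sketch: given beat timestamps, each frame gets an interpolation weight between the start and end prompt, with the weight changing faster near beats so the video "pulses" on the beat. This is an illustrative assumption, not the system's exact easing.

```python
# Illustrative sketch (not the system's exact code): map each video frame to an
# interpolation weight t in [0, 1] between the start and end prompt, spending
# extra weight change near beat timestamps so the video moves on the beat.
def beat_weights(duration, fps, beats, beat_boost=3.0):
    """Return one weight per frame; weights rise faster in frames near a beat."""
    n_frames = max(1, int(duration * fps))
    frame_times = [i / fps for i in range(n_frames)]
    # Per-frame "speed": 1 normally, beat_boost within 0.1 s of a beat.
    speeds = [
        beat_boost if any(abs(t - b) < 0.1 for b in beats) else 1.0
        for t in frame_times
    ]
    # The normalized cumulative speed gives a monotonic weight ending at 1.0.
    total, acc, weights = sum(speeds), 0.0, []
    for s in speeds:
        acc += s
        weights.append(acc / total)
    return weights
```

For a 2-second interval at 12 fps with beats at 0.5 s and 1.5 s, `beat_weights(2.0, 12, [0.5, 1.5])` returns 24 strictly increasing weights that climb more steeply around the two beats.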

How to Use

Demo of System

(demo GIF)

The only way the demo differs from the video is that it includes fields for an OpenAI Key and a Soundcloud URL to Music. These fields are intended for users who want to use Disco in a more dedicated way. Hugging Face Spaces allows many people to edit the same space at once, so duplicate/clone the space if you would like to persist your work.

The OpenAI Key field is only necessary if you want to use the Brainstorming "Describe Interval" function, which retrieves inspiration for prompt subjects using GPT-3. The key is passed as an argument to the API call.
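A hedged sketch of how such a call might look, with the key passed in as an argument rather than read from the environment. The function names, prompt wording, and model choice below are illustrative assumptions, not the repo's actual code.

```python
# Illustrative sketch, not the repo's exact brainstorming code.
def brainstorm_messages(interval_description):
    """Build the chat messages asking GPT for visual subject ideas."""
    return [
        {"role": "system",
         "content": "Suggest visual subjects for music-video prompts."},
        {"role": "user",
         "content": f"Describe imagery for this interval of music: {interval_description}"},
    ]

def describe_interval(api_key, interval_description):
    # Deferred import so the sketch reads without the openai package installed.
    from openai import OpenAI
    client = OpenAI(api_key=api_key)  # key passed per call, as in the demo field
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # illustrative; the deployed system used GPT-3
        messages=brainstorm_messages(interval_description),
    )
    return resp.choices[0].message.content
```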

The Soundcloud URL to Music field lets you change the music file.

Notes

  • All fields in the form for an interval (Interval #) must be filled out before generation can begin.

While video intervals are generating, you can check progress in the Logs tab. Generating even a few seconds of content on the community GPU can take a while, so we recommend generating very short intervals (e.g., 0.5 seconds at a time).
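A back-of-envelope sketch of why short intervals help: every output frame is a full diffusion run, so the wait grows linearly with interval length. The fps and per-frame cost below are assumptions for illustration, not measured values.

```python
# Rough arithmetic sketch; fps and secs_per_frame are assumed, not measured.
def frames_needed(interval_seconds, fps=12):
    """Each frame is a separate 50-step diffusion generation."""
    return int(interval_seconds * fps)

def rough_wait_seconds(interval_seconds, fps=12, secs_per_frame=10.0):
    """Estimated wall-clock time on a shared GPU (secs_per_frame is a guess)."""
    return frames_needed(interval_seconds, fps) * secs_per_frame
```

Under these assumptions a 0.5-second interval needs 6 frames, while a 5-second interval needs 60, i.e. ten times the wait.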

Links

Full YouTube video

arXiv Preprint

Examples

DiscoEx_1.mp4
Disco_Ex2.mov

System Design

(figure: system design overview)

Generative Disco's system design. Users begin by interacting with the waveform to create intervals within the music (#1). To find prompts that will define the start and end of an interval, users can brainstorm with prompt suggestions from GPT-4 or videography domain knowledge (#4-6) and explore text-to-image generations (#7, #8). Results users like can be dragged and dropped into the start and end areas (#10, #11), after which the interval can be generated. Generated intervals appear in the tracks area (#15) and can be stitched into a video that renders in the Video Area (#9).
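The interval/track flow above can be sketched minimally; the class and function names here are illustrative assumptions, not the repo's actual data model.

```python
# Illustrative data model for the flow above: each interval pairs a slice of
# the waveform (#1) with a start and end prompt (#10, #11); generated
# intervals are stitched in time order into the final video (#15, #9).
from dataclasses import dataclass

@dataclass
class Interval:
    start: float        # seconds into the song
    end: float
    start_prompt: str   # dropped into the start area (#10)
    end_prompt: str     # dropped into the end area (#11)

def stitch_order(intervals):
    """Return intervals sorted by start time, as the tracks area stitches them."""
    return sorted(intervals, key=lambda iv: iv.start)
```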

Setup Option 1

Pull the prebuilt image from Docker Hub:

```
docker pull hellovivian/art-ai:disco_local
```

Setup Option 2

Note: requires a GPU with a decent amount of VRAM.

Install the package and create the environment:

```
pip install -U stable_diffusion_videos
conda env create -f disco_environment.yml
```

Authenticate with Hugging Face, then run the app:

```
huggingface-cli login
conda activate video
python flask_app.py
```

Repo Structure

The Stable Diffusion checkpoint used was v1.4. The web application was written in Python and JavaScript, using Flask. Images were generated with 50 iterations on an NVIDIA V100. The music, image assets, stylesheets, and JavaScript are collected in the static folder. Logic and routing are handled by flask_app.py.

Credits

The system was built on top of two open-source repositories: a) stable-diffusion-videos from Hugging Face, written by Nate Raw, and b) wavesurfer.js. stable-diffusion-videos in turn builds on a script shared by @karpathy; that script was modified into this gist, which was then updated/modified into this repo.

Citation

BibTex

@misc{liu2023generative,
      title={Generative Disco: Text-to-Video Generation for Music Visualization}, 
      author={Vivian Liu and Tao Long and Nathan Raw and Lydia Chilton},
      year={2023},
      eprint={2304.08551},
      archivePrefix={arXiv},
      primaryClass={cs.HC}
}

Contributing

Feel free to file issues and feature requests :)


License: Apache License 2.0

