AI beats

I have written a blog post that describes this project in more detail. Check out "How to generate music clips with AI" to learn more!

With this project, you can use AI to generate music tracks and video clips. Describe the music and videos you would like, and the code will do the rest.

Music generation workflow

First, we use a generative model to create music samples. The default model can only generate up to 30 seconds of music, so a second step extends each track. Once the audio is done, we generate the video: a Stable Diffusion model creates the images, and another generative model adds a bit of motion and animation to them. To compose the final video clip, we join each generated music track with as many animated images as necessary to match the length of the music.

Each of these steps generates intermediate files that you can inspect. Manually removing the ones you don't like improves the results.
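To make the last composition step concrete, here is a minimal sketch of the arithmetic behind "as many animated images as necessary". It is not the project's code: the function name `clips_needed` is hypothetical, and the figure of 14 frames per animated clip is an assumption about the default video model, not something stated in this README.

```python
import math


def clips_needed(music_seconds: float, frames_per_clip: int, fps: int) -> int:
    """Number of animated clips required to cover a music track.

    Hypothetical helper: each clip lasts frames_per_clip / fps seconds,
    so we round the ratio up to make sure the whole track is covered.
    """
    return math.ceil(music_seconds * fps / frames_per_clip)


# Assumption: 14 frames per clip; fps and duration taken from the sample config.
print(clips_needed(music_seconds=60, frames_per_clip=14, fps=6))
```

With the sample config values (60 seconds of music at 6 fps), this suggests that a single track needs a few dozen short animations, which is why the number of generated images and animations matters.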

Examples

AI Beats Vol. 1

AI Beats Vol. 2

Usage

The recommended way to use this repository is with Docker, but you can also use a custom venv; just make sure to install all dependencies.

Note: make sure to update the device param to maximize performance, but keep in mind that some models might not work with every device option (cpu, cuda, mps).
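If you are unsure which value to put in the device param, the sketch below shows one way to check what your machine supports. It is an illustration, not part of this repository: `pick_device` is a hypothetical helper, and it assumes PyTorch is the backend (falling back to cpu when PyTorch is not installed).

```python
def pick_device(preferred: str = "cuda") -> str:
    """Return `preferred` if it is usable on this machine, otherwise "cpu".

    Hypothetical helper; it only checks availability, not whether a
    given model actually supports the device.
    """
    try:
        import torch
    except ImportError:
        # Without PyTorch we cannot probe accelerators, so stay on cpu.
        return "cpu"

    if preferred == "cuda" and torch.cuda.is_available():
        return "cuda"
    if preferred == "mps" and torch.backends.mps.is_available():
        return "mps"
    return "cpu"


print(pick_device("mps"))
```

Even when a device is reported as available, a specific model may still fail on it, so falling back to cpu for that step remains a reasonable option.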

Application workflow

Music generation: Generate the initial music tracks
Music continuation: Extend the initial music tracks to a longer duration
Image generation: Create the images that will be used to fill the video clip
Video generation: Generate animations from the images to compose video clips
Video clip creation: Join multiple video clips together to accompany the music tracks

Configs

project_dir: beats
project_name: lofi
seed: 42
music:
  prompt: "lo-fi music with a relaxing slow melody"
  model_id: facebook/musicgen-small
  device: cpu
  n_music: 5
  music_duration: 60
  initial_music_tokens: 1050
  max_continuation_duration: 20
  prompt_music_duration: 10
image:
  prompt: "Mystical Landscape"
  prompt_modifiers: 
    - "concept art, HQ, 4k"
    - "epic scene, cinematic, sci fi cinematic look, intense dramatic scene"
    - "digital art, hyperrealistic, fantasy, dark art"
    - "digital art, hyperrealistic, sense of cosmic wonder"
    - "mystical and ethereal atmosphere, photo taken with a wide-angle lens"
  model_id: stabilityai/sdxl-turbo
  device: mps
  n_images: 5
  inference_steps: 3
  height: 576
  width: 1024
video:
  model_id: stabilityai/stable-video-diffusion-img2vid
  device: cpu
  n_continuations: 2
  loop_video: true
  video_fps: 6
  decode_chunk_size: 8
  motion_bucket_id: 127
  noise_aug_strength: 0.1
audio_clip:
  n_music_loops: 1

project_dir: Folder that will host all your projects
project_name: Project name and main folder
seed: Seed used to control the randomness of the models
music
- prompt: Text prompt used to generate the music
- model_id: Model used to generate and extend the music tracks
- device: Device used by the model, usually one of (cpu, cuda, mps)
- n_music: Number of music tracks that will be created
- music_duration: Duration of the final music track
- initial_music_tokens: Duration of the initial music track (in tokens)
- max_continuation_duration: Maximum length of each extended music segment
- prompt_music_duration: Length of base music used to create the extension
image
- prompt: Text prompt used to generate the images
- prompt_modifiers: Prompt modifiers used to change the image style
- model_id: Model used to create the images
- device: Device used by the model, usually one of (cpu, cuda, mps)
- n_images: Number of images that will be created
- inference_steps: Number of inference steps for the diffusion model
- height: Height of the generated image
- width: Width of the generated image
video
- model_id: Model used to animate the images
- device: Device used by the model, usually one of (cpu, cuda, mps)
- n_continuations: Number of animation segments that will be created
- loop_video: Whether each music video will be looped
- video_fps: Frames per second of each video clip
- decode_chunk_size: Video diffusion's decode chunk size parameter
- motion_bucket_id: Video diffusion's motion bucket id parameter
- noise_aug_strength: Video diffusion's noise aug strength parameter
audio_clip
- n_music_loops: Number of times to loop each music track
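Since initial_music_tokens is expressed in tokens rather than seconds, a small conversion helper can make the config easier to reason about. The sketch below assumes the MusicGen audio tokenizer produces 50 tokens per second of audio; treat that rate as an assumption to verify against the model you use, and note that both function names are hypothetical.

```python
# Assumption: MusicGen emits 50 audio tokens per second of generated music.
ASSUMED_TOKENS_PER_SECOND = 50


def tokens_to_seconds(n_tokens: int, rate: int = ASSUMED_TOKENS_PER_SECOND) -> float:
    """Approximate audio duration for a given number of generated tokens."""
    return n_tokens / rate


def seconds_to_tokens(seconds: float, rate: int = ASSUMED_TOKENS_PER_SECOND) -> int:
    """Approximate number of tokens needed for a target duration."""
    return round(seconds * rate)


# Value from the sample config above.
print(tokens_to_seconds(1050))
print(seconds_to_tokens(30))
```

Under that assumption, the sample value of 1050 tokens corresponds to roughly 21 seconds of initial music, comfortably below the 30 second limit of the default model.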

Commands

Build the Docker image

make build

Apply lint and formatting to the code (only needed for development)

make lint

Run the whole pipeline to create the music video

make ai_beats

Run the music generation step

make music

Run the music continuation step

make music_continuation

Run the image generation step

make image

Run the video generation step

make video

Run the audio clip creation step

make audio_clip
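If you prefer to run the steps one at a time, for example to inspect the intermediate files between them, the commands above can be chained from a short script. This is only a sketch: `run_steps` is a hypothetical helper, and it assumes the make targets listed above are run from the repository root in workflow order.

```python
import subprocess

# Make targets from the commands above, in workflow order.
STEPS = ("music", "music_continuation", "image", "video", "audio_clip")


def run_steps(steps=STEPS, dry_run=True):
    """Run each make target in order and return the commands used.

    With dry_run=True the commands are only printed, which is handy
    for checking the order before launching long generation jobs.
    """
    commands = [["make", step] for step in steps]
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            # check=True stops the pipeline as soon as one step fails.
            subprocess.run(command, check=True)
    return commands


run_steps(dry_run=True)
```

Passing a subset, such as `run_steps(steps=("image", "video"), dry_run=False)`, would rerun only those steps after you have cleaned up the intermediate files you did not like.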

Development

For development, make sure to install requirements-dev.txt and run make lint to maintain the coding style.

Requirements

I developed and tested most of this project on my MacBook Pro M2. The only step that I was not able to run there was the video creation step; for that I used Google Colab (with a V100 or A100 GPU). Some of the models could not run on MPS, but they still ran in a reasonable time.

Disclaimers

The models used by default here have specific licenses that might not suit every use case. If you want to use the same models, make sure to check their licenses: MusicGen (CC-BY-NC 4.0 license) for music generation, SDXL-Turbo (LICENSE-SDXL1.0 license) for image generation, and Stable Video Diffusion (STABLE VIDEO DIFFUSION NC COMMUNITY LICENSE) for video generation.

dimitreOliveira / ai_beats