Spectrogrand

Spectrogrand: Generating interesting audiovisuals for text prompts.

About the project

Spectrograms are visual representations of audio samples, often used in engineering applications as features for downstream tasks. Spectrogrand unlocks their artistic value and puts them to use in both scientific and artistic domains: it is a pipeline that generates interesting melspectrogram-driven audiovisuals from text topic prompts. We also build lightweight, domain-driven computational creativity assessment into several steps of the generation process.

The pipeline has the following steps:

We use audioldm2-music to generate multiple candidate house music songs for the topic text prompt. We then estimate each candidate's novelty relative to human-generated house music songs (collected from the HouseX dataset) and its value through its danceability score, calculated using Essentia. We select the song with the highest equiweighted score for our pipeline.
Then, we generate melspectrograms for the song as a whole, and for periodic chunks of the sample. These numerous images convey local intensity and temporal diversity scattered throughout different zones of the song.
We use the parent spectrogram to deduce the genre of the song. Our ResNet-101-based model with augmented train-time transforms is the current SOTA on the HouseX-full-image task 🥳
We then use stable-diffusion-xl-base-1.0 to generate candidate album covers for this song. The selected genre defines and augments the prompts by selecting base colours and descriptor words. We then estimate each candidate's value based on its aestheticness, and its surprisingness based on how likely it is to fool a strong custom classifier (trained on human-generated and AI-generated album covers) into believing that the candidate is human-generated. We select the image with the highest equiweighted score for our pipeline.
We then use magenta to perform arbitrary image style transfer on the selected album cover image and each of the song chunks' melspectrograms.
At the end of the pipeline, one can generate a static video and two spectrogram-driven audiovisual videos. As an additional feature ✨, we also support stable-video-diffusion-img2vid-xt to automatically generate a music video of arbitrary length conditioned on the chosen album cover image.
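The "periodic chunks" used in the melspectrogram step above can be pictured as fixed-length windows over the song's samples. The following is a minimal sketch of that idea, not the project's actual code: the function name `chunk_bounds` and its parameters are hypothetical, and it assumes non-overlapping chunks of equal duration, with a shorter final chunk if the song length is not an exact multiple.

```python
def chunk_bounds(n_samples, sample_rate, chunk_seconds):
    """Return (start, end) sample indices of consecutive fixed-length chunks.

    Hypothetical helper: assumes non-overlapping chunks; the last chunk
    may be shorter than the others.
    """
    if n_samples < 0 or sample_rate <= 0 or chunk_seconds <= 0:
        raise ValueError("n_samples must be >= 0; rate and duration must be > 0")
    step = int(round(sample_rate * chunk_seconds))  # samples per chunk
    if step == 0:
        raise ValueError("chunk_seconds is too short for this sample rate")
    return [(start, min(start + step, n_samples))
            for start in range(0, n_samples, step)]


# Example: a 10-sample signal at 2 Hz, cut into 2-second chunks.
bounds = chunk_bounds(n_samples=10, sample_rate=2, chunk_seconds=2)
```

Each `(start, end)` pair could then be used to slice the waveform before computing a melspectrogram for that chunk, for instance with `librosa.feature.melspectrogram(y=..., sr=...)`; whether Spectrogrand uses librosa or another library is an assumption here.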

Key contributions

📄 New corpus with EDMReviewer reviews for selected artists from the HouseX dataset with LLM-generated descriptor words.
📄 New dataset of AI-generated and human-generated album covers for Dance music, as extracted from the MuMu dataset, and an accompanying strong, lightweight MobileNet-v3-based classifier model.
📌 Novel computational creativity estimation pipeline for audiovisual generation involving Novelty, Creativity, and Value distributed across different modalities viz. {audio, image}.
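The equiweighted selection used for both song and album cover candidates can be sketched as below. This is an illustrative sketch under stated assumptions rather than the project's implementation: the names `min_max` and `select_best` are hypothetical, and the min-max normalisation applied before averaging is an assumption (the description above only says the criteria are weighted equally).

```python
def min_max(values):
    """Rescale values to [0, 1]; a constant column maps to 0.5 (assumption)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def select_best(candidates, criteria):
    """Return (index of best candidate, per-candidate scores).

    candidates: list of dicts mapping criterion name -> raw score.
    criteria:   names of the criteria to average with equal weight.
    """
    if not candidates or not criteria:
        raise ValueError("need at least one candidate and one criterion")
    # Normalise each criterion separately so that no scale dominates.
    columns = [min_max([c[name] for c in candidates]) for name in criteria]
    scores = [sum(col[i] for col in columns) / len(criteria)
              for i in range(len(candidates))]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return best, scores


# Example with made-up numbers for three candidate songs.
songs = [
    {"novelty": 0.2, "value": 0.9},
    {"novelty": 0.8, "value": 0.7},
    {"novelty": 0.5, "value": 0.1},
]
best, scores = select_best(songs, ("novelty", "value"))
```

The same function would apply to album covers by swapping the criteria names, for example `("value", "surprisingness")`.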

Getting Started

To run the pipeline on Kaggle, please review the instructions listed in the Kaggle data release and check out the notebook (also linked in the dataset).

To run the pipeline locally, please follow the steps detailed in this notebook.