Copyright (c) 2023 Mayukhmali Das
My code defines a function `piper_compress` that takes a base audio file path, a segment length, a Vosk model path, and high/low bitrate values as inputs.
```python
def piper_compress(base_audio_path, segment_length, model_path, high_bitrate, low_bitrate):
```
The function splits the audio file into segments of the given length, transcribes each segment to text using the Vosk model, performs sentiment analysis on the text using the Hugging Face Transformers library, and compresses each segment into either a high- or low-bitrate MP3 file based on the sentiment analysis result. Finally, all the compressed segments are concatenated into a single output audio file named "Pied_Pipered_Output.mp3".
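The segmentation step above can be sketched as a small helper that computes the millisecond ranges to cut (a sketch only; `segment_bounds` is a hypothetical name, not part of the module):

```python
def segment_bounds(total_ms, segment_ms):
    """Return (start, end) millisecond ranges that cover the whole file.

    The last segment may be shorter than segment_ms.
    """
    return [(start, min(start + segment_ms, total_ms))
            for start in range(0, total_ms, segment_ms)]
```

With pydub, each `(start, end)` pair would then slice the loaded audio as `audio[start:end]` before export.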
I got the inspiration for this module from the paper titled "The Effects of MP3 Compression on Perceived Emotional Characteristics in Musical Instruments":
https://www.aes.org/tmpFiles/elib/20230406/18523.pdf
This paper examines the impact of MP3 compression on perceived emotional characteristics by conducting listening tests. The experiment compared compressed and uncompressed samples at different bit rates across ten emotional categories. The results revealed that MP3 compression strengthened negative emotions such as Mysterious, Shy, Scary, and Sad, while weakening positive emotions such as Happy, Heroic, Romantic, Comic, and Calm.
So I decided to build a Python module that changes the compression of an audio file based on its emotional content.
Below you can see the Vosk-generated subtitles of the song. I split the audio file into segments of 31 seconds. Note that because the training dataset for the Vosk model is not large, the transcription is not very accurate.
['when', 'i', 'was', 'six', 'years', 'old', 'and', 'broke', 'my', 'leg', 'i', 'was', 'running', 'from', 'my', 'brother', 'and', 'his', 'friends', 'tasty', 'and', 'sweet', 'perfume', 'mathematics', 'mounting', 'wrote', 'down', 'what', 'was', 'younger', 'then']
['take', 'me', 'back', 'to', 'when', 'haha', 'can', 'you', 'make', 'friends', 'and', 'last', 'two', 'years', 'and', "i've", 'not', 'seen', 'in', 'figs', 'and', 'so', 'on', 'way', 'to', 'go', 'the']
['the', 'the', 'hi', 'fifteen', 'years', 'old', 'smoking', 'a', 'cigarette']
['running', 'from', 'the', 'law', 'to', 'the', 'back', 'fields', 'and', 'getting', 'drunk', 'with', 'my', 'friends', 'my', 'first', 'kiss', 'on', 'friday', 'night', 'reckon', 'that', 'day', 'i', 'was', 'younger', 'then', 'take', 'me', 'back', 'to', 'when', 'we', 'fell', 'can', 'joe', 'we', 'got', 'a', 'cheap', 'spirits', 'drink', 'industry', 'friend']
['so', 'how', 'we', 'way', 'to', 'go', 'the', 'the']
['hi', 'one', 'so', 'close']
['works', 'down', 'by', 'the', 'coast', 'one', 'or', 'two', 'kids', 'lives', 'alone', "one's", 'brother', 'with', 'dogs', 'already', 'on', 'second', 'line', "he's", 'just', 'barely', 'getting', 'by', 'these', 'people', 'raise', 'me', 'go', 'as', 'to', 'the']
['country', 'names', 'when', 'we', "didn't", 'know', 'hi']
['the']
Link for downloading the Vosk Model https://alphacephei.com/vosk/models
Download a model from the Vosk website and save it in the model folder. The model file is too large to upload to GitHub.
The Vosk API is a speech recognition toolkit developed by Alpha Cephei Inc. and based on the Kaldi toolkit. Kaldi is a free and open-source speech recognition toolkit developed by the Johns Hopkins University Speech and Language Processing Group; it uses modern machine learning techniques, including deep neural networks, to achieve high accuracy and efficiency.
The offline speech recognition is done with the help of Dmytro Nikolaiev's work, referenced here: https://gitlab.com/Winston-90/foreign_speech_recognition/-/tree/main/timestamps
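Vosk's recognizer returns each result as a JSON string whose `text` field holds the transcript; turning that into word lists like the ones shown above is a small parsing step. A minimal sketch (`words_from_result` is an illustrative helper name, not part of the Vosk API):

```python
import json

def words_from_result(result_json):
    # Vosk's KaldiRecognizer.Result() returns a JSON string;
    # the "text" field holds the recognized transcript.
    return json.loads(result_json).get("text", "").split()
```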
I have used the "distilbert-base-uncased-finetuned-sst-2-english" model. Below you can find the sentiment of each transcribed segment of the song.
when i was six years old and broke my leg i was running from my brother and his friends tasty and sweet perfume mathematics mounting wrote down what was younger then [{'label': 'NEGATIVE', 'score': 0.9836982488632202}] NEGATIVE
take me back to when haha can you make friends and last two years and i've not seen in figs and so on way to go the [{'label': 'POSITIVE', 'score': 0.9919325709342957}] POSITIVE
the the hi fifteen years old smoking a cigarette [{'label': 'NEGATIVE', 'score': 0.979468822479248}] NEGATIVE
running from the law to the back fields and getting drunk with my friends my first kiss on friday night reckon that day i was younger then take me back to when we fell can joe we got a cheap spirits drink industry friend [{'label': 'NEGATIVE', 'score': 0.9779858589172363}] NEGATIVE
so how we way to go the the [{'label': 'POSITIVE', 'score': 0.9858322143554688}] POSITIVE
hi one so close [{'label': 'POSITIVE', 'score': 0.9994691014289856}] POSITIVE
works down by the coast one or two kids lives alone one's brother with dogs already on second line he's just barely getting by these people raise me go as to the [{'label': 'NEGATIVE', 'score': 0.9983982443809509}] NEGATIVE
country names when we didn't know hi [{'label': 'NEGATIVE', 'score': 0.9887117147445679}] NEGATIVE
the [{'label': 'POSITIVE', 'score': 0.9635980725288391}] POSITIVE
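Each classifier call returns a list of label/score dicts like those shown above; extracting the winning label is a one-liner (a sketch; `top_label` is a hypothetical helper name, not part of the module):

```python
def top_label(results):
    # The transformers sentiment pipeline returns output of the form
    # [{'label': 'NEGATIVE', 'score': 0.98...}]; take the highest-scoring label.
    return max(results, key=lambda r: r["score"])["label"]
```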
MP3 compression reduces positive emotions and amplifies negative ones, so the positive WAV segments are compressed to MP3 files at a higher bitrate than the negative segments. The compressed segments are then concatenated. The process is shown below:
```python
if sentiment == 'POSITIVE':
    wave_file = AudioSegment.from_wav(os.path.join(output_dir, f"segment_{count}.wav"))
    # Set the bitrate and export as mp3
    wave_file.export(os.path.join(output_dir, f"segment_{count}_high.mp3"),
                     format="mp3", bitrate=high_bitrate)
    final_audio += AudioSegment.from_mp3(os.path.join(output_dir, f"segment_{count}_high.mp3"))
else:
    wave_file = AudioSegment.from_wav(os.path.join(output_dir, f"segment_{count}.wav"))
    # Set the bitrate and export as mp3
    wave_file.export(os.path.join(output_dir, f"segment_{count}_low.mp3"),
                     format="mp3", bitrate=low_bitrate)
    final_audio += AudioSegment.from_mp3(os.path.join(output_dir, f"segment_{count}_low.mp3"))
```
Using a low_bitrate of 8k and a high_bitrate of 128k gives noticeable transitions, but for a better listening experience use 64k/128k or 32k/128k combinations.
Run test.py to run the module.
```python
from Pied_Piper_AI_Emo_Compression import main

main.piper_compress("base_audio.wav", segment_length=31,
                    model_path="C:\\Users\\Mayukhmali Das\\Desktop\\Pied_Piper_AI_Emo_Compression\\Pied_Piper_AI_Emo_Compression\\model",
                    high_bitrate="128k", low_bitrate="64k")
```
You have to install Vosk, pydub, PyTorch, and transformers before running the code.
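Assuming a standard pip setup, the dependencies can be installed with the commands below (these are the usual PyPI package names; `torch` is the PyPI name for PyTorch):

```shell
pip install vosk pydub torch transformers
```

pydub additionally requires ffmpeg to be installed on the system for MP3 export.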
The name of the module is inspired by Pied Piper from Silicon Valley. "Pied Piper" is a fictional company from the TV series "Silicon Valley", a comedy-drama that satirizes the culture and practices of the tech industry in Silicon Valley, California. The company is founded by Richard Hendricks, a young software engineer, and his friends, and it develops a revolutionary data compression algorithm that could change the industry. The name "Pied Piper" is a reference to the fairy tale of the same name, in which a man uses his musical talents to lead rats out of a town. In the show, the name suggests that the company can lead the tech industry out of its problems and toward a better future.