Bad Performance of Voice Cloning
souvikqb opened this issue
I am using the https://github.com/serp-ai/bark-with-voice-clone/blob/main/clone_voice.ipynb Notebook to generate audio clips similar to one provided by me.
The code ran without errors, but the resulting audio was not very good. I am using speakers with common American and British accents.
Are there any tips for tuning the model, or any parameters I can adjust, to get better results?
import sys
sys.path.append('./bark-voice-cloning-HuBERT-quantizer')
import os
from pydub import AudioSegment
from scipy.io.wavfile import write as write_wav
import numpy as np
import torch
import torchaudio
from bark.api import generate_audio
from bark.generation import SAMPLE_RATE, preload_models, load_codec_model
from encodec.utils import convert_audio
from bark_hubert_quantizer.customtokenizer import CustomTokenizer
from bark_hubert_quantizer.hubert_manager import HuBERTManager
from bark_hubert_quantizer.pre_kmeans_hubert import CustomHubert
preload_models(
    text_use_gpu=True,
    text_use_small=False,
    coarse_use_gpu=True,
    coarse_use_small=False,
    fine_use_gpu=True,
    fine_use_small=False,
    codec_use_gpu=True,
    force_reload=False
)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = load_codec_model(use_gpu=True if device == 'cuda' else False)
hubert_manager = HuBERTManager()
hubert_manager.make_sure_hubert_installed()
hubert_manager.make_sure_tokenizer_installed()
# Load the HuBERT model
hubert_model = CustomHubert(checkpoint_path='data/models/hubert/hubert.pt').to(device)
# Load the CustomTokenizer model
tokenizer = CustomTokenizer.load_from_checkpoint('data/models/hubert/tokenizer.pth', map_location=device).to(device)
"""# Inference"""
text_prompt = 'Hello! How are you?, I am Monster from Monster. I make AI Models for all of you here at Blocks and I am really excited about it. I make Generative AI accessible to all' #@param {type:"string"}
audio_filepath = r'/home/qblocks/Cloning/CA_AG_Kamala_Harris_2013_CADEM_Convention.webm' #@param {type:"string"}
def trim_and_convert_audio(input_path, output_path, target_duration_ms=30000):
    # Load the audio file
    print("Loading Audio File:", input_path)
    audio = AudioSegment.from_file(input_path)
    # Get the duration of the audio in milliseconds
    audio_duration = len(audio)
    # Trim the audio to the target duration
    if audio_duration > target_duration_ms:
        trimmed_audio = audio[:target_duration_ms]
    else:
        trimmed_audio = audio
    # Save the trimmed audio as a WAV file
    trimmed_audio.export(output_path, format="wav")
    print("Trimmed audio saved as:", output_path)
output_audio_path = "converted_audio.wav"
# Check that the input file exists before trying to convert it
if not os.path.isfile(audio_filepath):
    raise ValueError(f"Audio file does not exist ({audio_filepath})")
trim_and_convert_audio(audio_filepath, output_audio_path)
wav, sr = torchaudio.load(output_audio_path)
wav = convert_audio(wav, sr, model.sample_rate, model.channels)
wav = wav.to(device)
semantic_vectors = hubert_model.forward(wav, input_sample_hz=model.sample_rate)
semantic_tokens = tokenizer.get_token(semantic_vectors)
# Extract discrete codes from EnCodec
with torch.no_grad():
    encoded_frames = model.encode(wav.unsqueeze(0))
codes = torch.cat([encoded[0] for encoded in encoded_frames], dim=-1).squeeze()
# move codes to cpu
codes = codes.cpu().numpy()
# move semantic tokens to cpu
semantic_tokens = semantic_tokens.cpu().numpy()
voice_filename = 'output3.npz'
current_path = os.getcwd()
voice_name = os.path.join(current_path, voice_filename)
np.savez(voice_name, fine_prompt=codes, coarse_prompt=codes[:2, :], semantic_prompt=semantic_tokens)
# simple generation
audio_array = generate_audio(text_prompt, history_prompt=voice_name, text_temp=0.8, waveform_temp=0.8)
# save audio
filepath = "out5.wav" # change this to your desired output path
write_wav(filepath,SAMPLE_RATE,audio_array)
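One thing that can be ruled out before tuning anything is a malformed voice prompt file. The sketch below is not from the notebook; it only checks what the script above implies about the saved `.npz`: a 1-D `semantic_prompt`, a `coarse_prompt` with two rows (from `codes[:2, :]`), and a `fine_prompt` with the same number of frames as the coarse one. The helper name `check_voice_prompt` is hypothetical, and passing these checks says nothing about how good the clone will sound.

```python
import numpy as np


def check_voice_prompt(npz_path):
    """Return a list of problems found in a saved voice prompt (empty if none)."""
    problems = []
    # Load every stored array into memory, then close the file.
    with np.load(npz_path) as data:
        arrays = {key: data[key] for key in data.files}

    # The script above saves exactly these three arrays.
    for key in ("semantic_prompt", "coarse_prompt", "fine_prompt"):
        if key not in arrays:
            problems.append(f"missing array: {key}")
    if problems:
        return problems

    semantic = arrays["semantic_prompt"]
    coarse = arrays["coarse_prompt"]
    fine = arrays["fine_prompt"]

    if semantic.ndim != 1 or semantic.size == 0:
        problems.append(f"semantic_prompt should be non-empty 1-D, got shape {semantic.shape}")
    if coarse.ndim != 2 or coarse.shape[0] != 2:
        problems.append(f"coarse_prompt should have 2 rows, got shape {coarse.shape}")
    if fine.ndim != 2:
        problems.append(f"fine_prompt should be 2-D, got shape {fine.shape}")
    elif coarse.ndim == 2 and coarse.shape[1] != fine.shape[1]:
        problems.append("coarse_prompt and fine_prompt have different frame counts")
    return problems
```

For the script above, this would be called as `check_voice_prompt(voice_name)` right after `np.savez(...)`; an empty list means the file has the expected layout.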
Is your audio input 24-bit?
I passed in webm and mp3 files, how do I check this?
Using this file - https://upload.wikimedia.org/wikipedia/commons/c/c5/CA_AG_Kamala_Harris_2013_CADEM_Convention.webm
Can you elaborate more?
- I wrote some code to extract the audio from your video as WAV:
from pydub import AudioSegment
def convert_webm_to_wav(input_file, output_file):
    audio = AudioSegment.from_file(input_file, format="webm")
    audio.export(output_file, format="wav")
def crop_audio(input_file, output_file, seconds):
    audio = AudioSegment.from_wav(input_file)
    processed_audio = audio[:seconds * 1000]
    processed_audio.export(output_file, format="wav")
usage:
input_wav = 'CA_AG_Kamala_Harris_2013_CADEM_Convention.webm'
converted_wav = 'converted.wav'
cropped_wav = 'cropped.wav'
seconds_to_crop = 60
convert_webm_to_wav(input_wav, converted_wav)
crop_audio(converted_wav, cropped_wav, seconds_to_crop)
- We could determine the bit depth of the audio recording in code, but I used a website instead (I was too lazy to write code ;)): https://www.advalify.io/audio-validator
- Your audio is 32-bit.
- Use this to convert to 24 bit: https://onlineaudioconverter.com/
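The bit depth check mentioned above can also be done in code. This is a minimal sketch using only Python's standard `wave` module, assuming the file is an uncompressed integer PCM WAV (the module raises `wave.Error` for other encodings such as 32-bit float). The helper name `wav_bit_depth` is hypothetical, and the demo writes a short silent file only so the example runs on its own.

```python
import os
import tempfile
import wave


def wav_bit_depth(path):
    """Return the bits per sample of an uncompressed PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        # getsampwidth() returns bytes per sample, so multiply by 8.
        return wav_file.getsampwidth() * 8


# Demo: write 100 frames of 24-bit mono silence, then read the bit depth back.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_24bit.wav")
with wave.open(demo_path, "wb") as demo_file:
    demo_file.setnchannels(1)
    demo_file.setsampwidth(3)  # 3 bytes per sample = 24 bits
    demo_file.setframerate(24000)
    demo_file.writeframes(b"\x00" * 3 * 100)

print(wav_bit_depth(demo_path))  # prints 24
```

On the files from this thread it would be called as `wav_bit_depth('cropped.wav')`, that is, on the WAV produced by the conversion step rather than on the original `.webm`.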
I see,
Thanks for taking the effort.
But how should I use this to improve the video cloning performance?
If I understand correctly, do you want to make a deepfake for a video with a voice change?
If yes, here is the code to convert to 24 bits (https://stackoverflow.com/questions/44812553/how-to-convert-a-24-bit-wav-file-to-16-or-32-bit-files-in-python3):
import soundfile
input_wav = 'input.wav' # Maybe 32 bit?
output_wav = 'output.wav'
data, samplerate = soundfile.read(input_wav)
soundfile.write(output_wav, data, samplerate, subtype='PCM_24')
Yes, that's a possibility.
But for now I would just like a voice-cloned audio file.
Say, a reading of a normal speech, but in a celebrity's voice or a user-defined speaker's voice.
Does converting it to 24 bits help in the video cloning process?
@souvikqb In fact, I have run into difficulties myself when cloning a voice. Unfortunately, nobody has answered my question, but the 24-bit conversion option gives me a little hope of success. I will try it on my own data...
Thanks 👍
Do let me know if you get anything
Also can we tag the owner of this repository?
@souvikqb I think we can tag Francis @francislabountyjr.
I'm also stuck on my issue ;( #49
@souvikqb How can I contact you? I found another solution (from another project). I won't post it here because it does not apply to this project.
Please email me on -> autocar2060 @ gmail . com
@BrasD99 Is it simply the bit rate difference causing the issue? I'd love to hear if there are other factors one could employ to improve the clone.
I'm having difficulty getting good results even with my own voice.
Literally one time out of a handful did I hear my voice, and it was a single "umm" at the start before it switched back to some person who does not sound like me, haha.
@Shyk92 did you ever make progress on this? I'm in the same boat.
Facing same issue