Bad Performance of Voice Cloning

Question

souvikqb opened this issue a year ago · comments

I am using the https://github.com/serp-ai/bark-with-voice-clone/blob/main/clone_voice.ipynb notebook to generate audio clips in the voice of a sample I provide.

The code ran without errors, but the resulting audio was not very good. I am using speakers with common American and British accents.

Are there any tips for tuning the model to get better results, or any parameters to play with?


import sys
sys.path.append('./bark-voice-cloning-HuBERT-quantizer')
import os
from pydub import AudioSegment
from scipy.io.wavfile import write as write_wav
import numpy as np
import torch
import torchaudio
from bark.api import generate_audio
from bark.generation import SAMPLE_RATE, preload_models, load_codec_model
from encodec.utils import convert_audio
from bark_hubert_quantizer.customtokenizer import CustomTokenizer
from bark_hubert_quantizer.hubert_manager import HuBERTManager
from bark_hubert_quantizer.pre_kmeans_hubert import CustomHubert

preload_models(
    text_use_gpu=True,
    text_use_small=False,
    coarse_use_gpu=True,
    coarse_use_small=False,
    fine_use_gpu=True,
    fine_use_small=False,
    codec_use_gpu=True,
    force_reload=False
)

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = load_codec_model(use_gpu=(device == 'cuda'))

hubert_manager = HuBERTManager()
hubert_manager.make_sure_hubert_installed()
hubert_manager.make_sure_tokenizer_installed()

# Load the HuBERT model
hubert_model = CustomHubert(checkpoint_path='data/models/hubert/hubert.pt').to(device)

# Load the CustomTokenizer model
tokenizer = CustomTokenizer.load_from_checkpoint('data/models/hubert/tokenizer.pth', map_location=device).to(device)

"""# Inference"""

text_prompt = 'Hello! How are you?, I am Monster from Monster. I make AI Models for all of you here at Blocks and I am really excited about it. I make Generative AI accessible to all' #@param {type:"string"}
audio_filepath = r'/home/qblocks/Cloning/CA_AG_Kamala_Harris_2013_CADEM_Convention.webm' #@param {type:"string"}

def trim_and_convert_audio(input_path, output_path, target_duration_ms=30000):
    # Load the audio file
    print("Loading Audio File:", input_path)
    audio = AudioSegment.from_file(input_path)
    # Get the duration of the audio in milliseconds
    audio_duration = len(audio)
    # Trim the audio to the target duration
    if audio_duration > target_duration_ms:
        trimmed_audio = audio[:target_duration_ms]
    else:
        trimmed_audio = audio
    # Save the trimmed audio as a WAV file
    trimmed_audio.export(output_path, format="wav")
    print("Trimmed audio saved as:", output_path)

output_audio_path = "converted_audio.wav"

# Fail early if the input file is missing, before attempting the conversion
if not os.path.isfile(audio_filepath):
    raise ValueError(f"Audio file does not exist ({audio_filepath})")

trim_and_convert_audio(audio_filepath, output_audio_path)

wav, sr = torchaudio.load(output_audio_path)
wav = convert_audio(wav, sr, model.sample_rate, model.channels)
wav = wav.to(device)

semantic_vectors = hubert_model.forward(wav, input_sample_hz=model.sample_rate)
semantic_tokens = tokenizer.get_token(semantic_vectors)

# Extract discrete codes from EnCodec
with torch.no_grad():
    encoded_frames = model.encode(wav.unsqueeze(0))
codes = torch.cat([encoded[0] for encoded in encoded_frames], dim=-1).squeeze()

# move codes to cpu
codes = codes.cpu().numpy()
# move semantic tokens to cpu
semantic_tokens = semantic_tokens.cpu().numpy()

voice_filename = 'output3.npz'
current_path = os.getcwd()
voice_name = os.path.join(current_path, voice_filename)

np.savez(voice_name, fine_prompt=codes, coarse_prompt=codes[:2, :], semantic_prompt=semantic_tokens)

# simple generation
audio_array = generate_audio(text_prompt, history_prompt=voice_name, text_temp=0.8, waveform_temp=0.8)

# save audio
filepath = "out5.wav" # change this to your desired output path
write_wav(filepath,SAMPLE_RATE,audio_array)
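
The call above fixes both temperatures at 0.8. Since the question asks which parameters to play with, one option is to sweep `text_temp` and `waveform_temp` and compare the outputs by ear. The following is only a sketch: `sweep_temperatures` is a hypothetical helper, not part of Bark, and nothing in this thread confirms that any particular setting improves the clone.

```python
from itertools import product

def sweep_temperatures(generate, text, voice, text_temps, waveform_temps):
    """Call `generate` once per (text_temp, waveform_temp) pair.

    `generate` is expected to accept the same keyword arguments as the
    generate_audio call used above (history_prompt, text_temp, waveform_temp).
    Returns a dict mapping each temperature pair to the generated audio.
    """
    results = {}
    for t_temp, w_temp in product(text_temps, waveform_temps):
        results[(t_temp, w_temp)] = generate(
            text,
            history_prompt=voice,
            text_temp=t_temp,
            waveform_temp=w_temp,
        )
    return results

# Hypothetical usage with the names from the script above:
#   clips = sweep_temperatures(generate_audio, text_prompt, voice_name,
#                              [0.6, 0.7, 0.8], [0.6, 0.7, 0.8])
#   for (t, w), audio in clips.items():
#       write_wav(f"out_t{t}_w{w}.wav", SAMPLE_RATE, audio)
```

Passing the generation function in as an argument keeps the helper independent of Bark, so the loop can be tried out with a stub before spending GPU time.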

Duverne Mathieu · Answer 2 · Fri Aug 11 2023 00:24:30 GMT+0800 (China Standard Time)

Is your audio input 24-bit?

souvikqb · Answer 3 · Fri Aug 11 2023 00:29:33 GMT+0800 (China Standard Time)

I passed in WebM and MP3 files. How do I check this?

souvikqb · Answer 4 · Fri Aug 11 2023 00:39:33 GMT+0800 (China Standard Time)

> Is your audio input 24-bit?

Using this file - https://upload.wikimedia.org/wikipedia/commons/c/c5/CA_AG_Kamala_Harris_2013_CADEM_Convention.webm

Can you elaborate more?

Denis Braslavskiy · Answer 5 · Sat Aug 12 2023 00:41:36 GMT+0800 (China Standard Time)

> Can you elaborate more?

I wrote some code to extract the audio from your video as a WAV file:

from pydub import AudioSegment

def convert_webm_to_wav(input_file, output_file):
    # Decode the WebM file (pydub relies on ffmpeg for this) and export the audio as WAV
    audio = AudioSegment.from_file(input_file, format="webm")
    audio.export(output_file, format="wav")

def crop_audio(input_file, output_file, seconds):
    # Keep only the first `seconds` seconds; pydub slices in milliseconds
    audio = AudioSegment.from_wav(input_file)
    processed_audio = audio[:seconds * 1000]
    processed_audio.export(output_file, format="wav")

Usage:

input_webm = 'CA_AG_Kamala_Harris_2013_CADEM_Convention.webm'
converted_wav = 'converted.wav'
cropped_wav = 'cropped.wav'
seconds_to_crop = 60

convert_webm_to_wav(input_webm, converted_wav)
crop_audio(converted_wav, cropped_wav, seconds_to_crop)

The bit depth of the recording can be determined in code, but I used a website instead (I was too lazy to write the code ;)): https://www.advalify.io/audio-validator

Your audio is 32-bit.

Use this to convert to 24 bit: https://onlineaudioconverter.com/
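
For anyone who would rather check the bit depth in code than through a website, here is a minimal sketch using Python's standard `wave` module. `describe_wav` is a hypothetical helper name. The `wave` module only reads uncompressed PCM WAV files, so this is assumed to be run on the converted WAV rather than the original WebM, and a floating-point WAV may be rejected with `wave.Error`.

```python
import wave

def describe_wav(path):
    """Return (bits_per_sample, sample_rate_hz, channels) for a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        # getsampwidth() reports bytes per sample, so multiply by 8 for bits
        bits_per_sample = wav_file.getsampwidth() * 8
        return bits_per_sample, wav_file.getframerate(), wav_file.getnchannels()

# Hypothetical usage on the file produced by the conversion code above:
#   bits, rate, channels = describe_wav("cropped.wav")
#   print(f"{bits}-bit, {rate} Hz, {channels} channel(s)")
```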

souvikqb · Answer 6 · Sat Aug 12 2023 00:47:37 GMT+0800 (China Standard Time)

I see,

Thanks for taking the effort.

But how should I use this to improve the video cloning performance?

Denis Braslavskiy · Answer 7 · Sat Aug 12 2023 00:53:39 GMT+0800 (China Standard Time)

> But how should I use this to improve the video cloning performance?

If I understand correctly, you want to make a deepfake video with a changed voice?

If yes, here is the code to convert to 24 bits (https://stackoverflow.com/questions/44812553/how-to-convert-a-24-bit-wav-file-to-16-or-32-bit-files-in-python3):

import soundfile

input_wav = 'input.wav' # Maybe 32 bit?
output_wav = 'output.wav'

data, samplerate = soundfile.read(input_wav)
soundfile.write(output_wav, data, samplerate, subtype='PCM_24')
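
Whether the 24-bit conversion helps is left open in this thread. As a rough sanity check, and not a claim about Bark's behaviour: `torchaudio.load`, which the script in the question uses, returns float32 samples by default, so integer PCM of any width is rescaled to roughly [-1, 1] before it reaches the model. The pure-Python sketch below (with a synthetic sine wave standing in for speech, which is an assumption) shows how small the rounding error from 16-bit and 24-bit quantization is after that rescaling.

```python
import math

def quantize_roundtrip(sample, bits):
    """Quantize a float in [-1, 1) to a signed integer of `bits` bits, then scale back."""
    scale = 2 ** (bits - 1)
    as_int = max(-scale, min(scale - 1, round(sample * scale)))
    return as_int / scale

# 10 ms of a 440 Hz sine at 24 kHz, used only as a stand-in signal
samples = [0.5 * math.sin(2 * math.pi * 440 * n / 24000) for n in range(240)]

max_err_16 = max(abs(s - quantize_roundtrip(s, 16)) for s in samples)
max_err_24 = max(abs(s - quantize_roundtrip(s, 24)) for s in samples)

print(f"max error at 16-bit: {max_err_16:.2e}")
print(f"max error at 24-bit: {max_err_24:.2e}")
```

Both errors are bounded by half a quantization step (2^-16 and 2^-24 respectively), which suggests that bit depth on its own changes the samples very little; it does not rule out other differences between the files.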

souvikqb · Answer 8 · Sat Aug 12 2023 00:58:27 GMT+0800 (China Standard Time)

Yes, that's a possibility.

But for now I would just like a voice-cloned audio file.

For example, reading a normal speech in a celebrity's voice or a user-defined speaker's voice.

Does converting it to 24 bits help in the video cloning process?

Denis Braslavskiy · Answer 9 · Sat Aug 12 2023 01:02:40 GMT+0800 (China Standard Time)

@souvikqb In fact, I have run into difficulties with voice cloning myself. Unfortunately, nobody has answered my question, but the 24-bit conversion option gives me a little hope of success. I will try it on my own data...

souvikqb · Answer 10 · Sat Aug 12 2023 01:04:13 GMT+0800 (China Standard Time)

Thanks 👍

Do let me know if you find anything.

Also, can we tag the owner of this repository?

Denis Braslavskiy · Answer 11 · Sat Aug 12 2023 01:12:58 GMT+0800 (China Standard Time)

@souvikqb I think we can tag Francis @francislabountyjr.

I'm also stuck on my issue ;( #49

Denis Braslavskiy · Answer 12 · Sun Aug 13 2023 22:50:31 GMT+0800 (China Standard Time)

@souvikqb How can I contact you? I found another solution (from another project). I won't post it here because it does not apply to this project.

souvikqb · Answer 13 · Sun Aug 13 2023 22:52:34 GMT+0800 (China Standard Time)

> @souvikqb How can I contact you? I found another solution (from another project). I won't post it here because it does not apply to this project.

Please email me at autocar2060 @ gmail . com

Rajashekhar Reddy · Answer 14 · Tue Aug 15 2023 16:36:48 GMT+0800 (China Standard Time)

@souvikqb @BrasD99, if you succeed in generating better voice clones, could you please post your outputs here?

Nick Shykula · Answer 15 · Sun Oct 22 2023 12:14:29 GMT+0800 (China Standard Time)

@BrasD99 Is it simply the bit depth difference causing the issue? I'd love to hear if there are other factors one could use to improve the clone.

I'm having difficulty getting good results even with my own voice.

Literally one time out of a handful did I hear my voice, and it was a single "umm" at the start before switching back to some person who does not sound like me, haha.

PlatformKit · Answer 16 · Mon Dec 04 2023 14:55:08 GMT+0800 (China Standard Time)

@Shyk92 did you ever make progress on this? I'm in the same boat.

Aryan Siddiqui · Answer 17 · Tue Jun 11 2024 18:52:02 GMT+0800 (China Standard Time)

Facing the same issue.