A mirror of BigVGAN and HiFi-GAN for access via PyTorch Hub. There are no dependencies other than PyTorch. I cleaned up the original code from NVIDIA's BigVGAN and from HiFi-GAN. The weights here are intended for inference only. If you want to train BigVGAN/HiFi-GAN, you also need the discriminators and have to add weight_norm to the generator.
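As a rough sketch, re-attaching weight_norm before training could look like the snippet below. The helper name is made up, and it assumes the generator uses standard `Conv1d`/`ConvTranspose1d` layers as in the original implementations; the actual training setup from the upstream repositories additionally requires the discriminators.

```python
import torch
import torch.nn as nn

def add_weight_norm(generator: nn.Module) -> nn.Module:
    """Hypothetical helper: wrap every convolution of an inference-only
    generator with weight_norm again, as required before training."""
    for module in generator.modules():
        if isinstance(module, (nn.Conv1d, nn.ConvTranspose1d)):
            torch.nn.utils.weight_norm(module)
    return generator
```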
Below is an example demonstrating how to generate a mel spectrogram from an audio file and use BigVGAN to synthesize audio from it.
```python
import torch
import librosa
import numpy as np
from scipy.io import wavfile


def mel_spectrogram(y, n_fft, num_mels, sampling_rate, hop_size, win_size, fmin, fmax):
    # Create mel filterbank
    mel_basis = librosa.filters.mel(
        sr=sampling_rate, n_fft=n_fft, n_mels=num_mels, fmin=fmin, fmax=fmax
    )

    # Pad the signal
    pad_length = int((n_fft - hop_size) / 2)
    y = np.pad(y, (pad_length, pad_length), mode="reflect")

    # Compute STFT
    D = librosa.stft(
        y,
        n_fft=n_fft,
        hop_length=hop_size,
        win_length=win_size,
        window="hann",
        center=False,
        pad_mode="reflect",
    )

    # Convert to magnitude spectrogram and add small epsilon
    S = np.sqrt(np.abs(D) ** 2 + 1e-9)

    # Apply mel filterbank
    S = np.dot(mel_basis, S)

    # Convert to log scale
    S = np.log(np.maximum(S, 1e-5))

    return S


# Load the pretrained vocoder from PyTorch Hub
model = torch.hub.load(
    "lars76/bigvgan-mirror", "bigvgan_base_22khz_80band", trust_repo=True, pretrained=True
)

# Load the audio at the sampling rate the model expects
wav, sr = librosa.load("/path/to/your/audio.wav", sr=model.sampling_rate, mono=True)

# Compute the log-mel spectrogram and add a batch dimension: (1, num_mels, frames)
mel = torch.FloatTensor(
    mel_spectrogram(
        wav,
        model.n_fft,
        model.num_mels,
        model.sampling_rate,
        model.hop_size,
        model.win_size,
        model.fmin,
        model.fmax,
    )
).unsqueeze(0)

with torch.inference_mode():
    predicted_wav = model(mel)  # 1 x T tensor (32-bit float)

# Convert to 16-bit PCM and write to disk
predicted_wav = np.int16(predicted_wav.squeeze(0).cpu().numpy() * 32767)
wavfile.write("output.wav", model.sampling_rate, predicted_wav)
```
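If a GPU is available, moving the model and the mel spectrogram to it works as with any standard PyTorch module. This is a minimal sketch rather than something the repository documents explicitly:

```python
# Run inference on GPU when available (plain PyTorch, no repo-specific API assumed)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()
mel = mel.to(device)

with torch.inference_mode():
    predicted_wav = model(mel).squeeze(0).cpu().numpy()
```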
Model Name | Mels | n_fft | Hop Size | Win Size | Sampling Rate | fmin | fmax | Params | Dataset |
---|---|---|---|---|---|---|---|---|---|
bigvgan_v2_22khz_80band_fmax8k_256x | 80 | 1024 | 256 | 1024 | 22050 | 0 | 8000 | 112M | Large-scale Compilation |
bigvgan_base_22khz_80band | 80 | 1024 | 256 | 1024 | 22050 | 0 | 8000 | 14M | LibriTTS + VCTK + LJSpeech |
bigvgan_22khz_80band | 80 | 1024 | 256 | 1024 | 22050 | 0 | 8000 | 112M | LibriTTS + VCTK + LJSpeech |
hifigan_universal_v1 | 80 | 1024 | 256 | 1024 | 22050 | 0 | 8000 | 14M | Universal |
hifigan_vctk_v1 | 80 | 1024 | 256 | 1024 | 22050 | 0 | 8000 | 14M | VCTK |
hifigan_vctk_v2 | 80 | 1024 | 256 | 1024 | 22050 | 0 | 8000 | 9.26M | VCTK |
hifigan_vctk_v3 | 80 | 1024 | 256 | 1024 | 22050 | 0 | 8000 | 1.46M | VCTK |
hifigan_lj_v1 | 80 | 1024 | 256 | 1024 | 22050 | 0 | 8000 | 14M | LJSpeech |
hifigan_lj_v2 | 80 | 1024 | 256 | 1024 | 22050 | 0 | 8000 | 9.26M | LJSpeech |
hifigan_lj_v3 | 80 | 1024 | 256 | 1024 | 22050 | 0 | 8000 | 1.46M | LJSpeech |
hifigan_lj_ft_t2_v1 | 80 | 1024 | 256 | 1024 | 22050 | 0 | 8000 | 14M | LJSpeech + Finetuned |
hifigan_lj_ft_t2_v2 | 80 | 1024 | 256 | 1024 | 22050 | 0 | 8000 | 9.26M | LJSpeech + Finetuned |
hifigan_lj_ft_t2_v3 | 80 | 1024 | 256 | 1024 | 22050 | 0 | 8000 | 1.46M | LJSpeech + Finetuned |
Since `bigvgan_v2_22khz_80band_fmax8k_256x` was also trained on non-speech data, I found that `bigvgan_base_22khz_80band` and `bigvgan_22khz_80band` are much better suited for use in text-to-speech systems such as FastSpeech2. In addition, `bigvgan_base_22khz_80band` also seems to be better than `bigvgan_22khz_80band`.
Model Name | Mels | n_fft | Hop Size | Win Size | Sampling Rate | fmin | fmax | Params | Dataset |
---|---|---|---|---|---|---|---|---|---|
bigvgan_v2_44khz_128band_512x | 128 | 2048 | 512 | 2048 | 44100 | 0 | 22050 | 122M | Large-scale Compilation |
bigvgan_v2_44khz_128band_256x | 128 | 1024 | 256 | 1024 | 44100 | 0 | 22050 | 112M | Large-scale Compilation |
bigvgan_v2_24khz_100band_256x | 100 | 1024 | 256 | 1024 | 24000 | 0 | 12000 | 112M | Large-scale Compilation |
bigvgan_v2_22khz_80band_256x | 80 | 1024 | 256 | 1024 | 22050 | 0 | 11025 | 112M | Large-scale Compilation |
bigvgan_base_24khz_100band | 100 | 1024 | 256 | 1024 | 24000 | 0 | 12000 | 14M | LibriTTS |
bigvgan_24khz_100band | 100 | 1024 | 256 | 1024 | 24000 | 0 | 12000 | 112M | LibriTTS |
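Assuming every name in the "Model Name" columns above is a hub entrypoint (the example above only confirms this for `bigvgan_base_22khz_80band`), any of the listed models can be loaded the same way, and the mel-extraction parameters from the tables are exposed as attributes on the loaded model:

```python
# Assumption: each table row corresponds to a torch.hub entrypoint
model = torch.hub.load(
    "lars76/bigvgan-mirror", "bigvgan_v2_24khz_100band_256x", trust_repo=True, pretrained=True
)

# The same attributes used in the example above drive mel_spectrogram()
print(model.sampling_rate, model.num_mels, model.n_fft,
      model.hop_size, model.win_size, model.fmin, model.fmax)
```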
You can run `python benchmark.py` to compare the performance of the original code and this one.
The following table reports PESQ (Perceptual Evaluation of Speech Quality) scores on 20 WAV files from the AISHELL-3 dataset, comparing the cleaned-up models with the original models.
Model Name | PESQ ± StdDev |
---|---|
bigvgan_base_22khz_80band | 3.5709 ± 0.3007 |
bigvgan_22khz_80band | 3.9903 ± 0.2270 |
bigvgan_base_24khz_100band | 3.6300 ± 0.3398 |
bigvgan_24khz_100band | 4.0123 ± 0.2957 |
bigvgan_v2_44khz_128band_512x | 3.7670 ± 0.3044 |
bigvgan_v2_44khz_128band_256x | 3.8614 ± 0.4240 |
bigvgan_v2_24khz_100band_256x | 4.1535 ± 0.3204 |
bigvgan_v2_22khz_80band_256x | 3.9525 ± 0.2709 |
bigvgan_v2_22khz_80band_fmax8k_256x | 4.0165 ± 0.2682 |
The cleaned-up models and the original models produce identical results.
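For reference, a PESQ score like the ones above can be computed with the `pesq` package; this is a sketch of one possible setup, not necessarily what `benchmark.py` does. Wideband PESQ is defined at 16 kHz, so both signals are resampled first:

```python
import librosa
from pesq import pesq  # pip install pesq

def pesq_score(reference_path, degraded_path, sr=16000):
    # Wideband PESQ only supports 16 kHz, so resample both signals
    ref, _ = librosa.load(reference_path, sr=sr, mono=True)
    deg, _ = librosa.load(degraded_path, sr=sr, mono=True)
    # Trim to the shorter signal so the lengths match
    n = min(len(ref), len(deg))
    return pesq(sr, ref[:n], deg[:n], "wb")

print(pesq_score("original.wav", "resynthesized.wav"))
```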