asteroid-team / asteroid

The PyTorch-based audio source separation toolkit for researchers

Home Page: https://asteroid-team.github.io/


ConvTasNet pretrained huggingface model inference setup

Rodolfo-S opened this issue

I'm trying to run inference with this pretrained ConvTasNet single-source enhancement model from Hugging Face, and I'm getting notably poor output.

I tried passing an ~18.5 s, 16 kHz clean speech clip mixed with -40 dB white Gaussian noise, and the output seemed to have about the same SNR as the input, while the scaling ballooned well past +/-1 (max sample value around 1500). Additionally, the speech itself sounds slightly distorted.
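For reference, this is roughly how I generated the noisy clip (the add_white_noise helper and the file name below are just from my own mixing script, not from Asteroid; I'm treating -40 dB as the noise RMS relative to full scale):

import numpy as np
import soundfile as sf

def add_white_noise(clean, noise_db=-40.0):
    # Scale white Gaussian noise so its RMS sits at roughly noise_db dBFS, then add it
    noise = np.random.randn(len(clean))
    noise *= 10 ** (noise_db / 20) / np.sqrt(np.mean(noise ** 2))
    return clean + noise

clean, sr = sf.read('clean_speech_16k.wav')   # hypothetical file name; ~18.5 s at 16 kHz
noisy_audio = add_white_noise(clean, noise_db=-40.0)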

I should note that I also tried passing just the clean speech to the model and got similar results, as far as added distortion goes.

I'm trying to figure out whether I've configured everything correctly to run inference using LambdaOverlapAdd. I mostly used the Process large audio files notebook as a reference. Here's my code:

import torch
from asteroid.dsp.overlap_add import LambdaOverlapAdd

# Encoder filterbank parameters taken from the model config on Hugging Face
kernel_size = 32
stride = 16

model = torch.hub.load('mpariente/asteroid', 'conv_tasnet', 'JorisCos/ConvTasNet_Libri1Mix_enhsingle_16k')

# Chunked inference via overlap-add, using kernel_size/stride as window_size/hop_size
continuous_nnet = LambdaOverlapAdd(
    nnet=model,
    n_src=1,
    window_size=kernel_size,
    hop_size=stride,
    window=None,
    reorder_chunks=False
)

# Add batch and channel dims: (1, 1, time)
in_tensor = torch.from_numpy(noisy_audio[None, None, :])
out_tensor = continuous_nnet.forward(in_tensor)

out_wav = out_tensor.numpy().squeeze()

Where noisy_audio is the 1-D noisy speech signal, and window_size and hop_size were inferred from the config provided on the Hugging Face page for the model.
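For comparison, here's the kind of un-chunked baseline I'd sanity-check against (just a sketch; I'm assuming ConvTasNet.from_pretrained accepts the Hugging Face model ID directly):

import torch
from asteroid.models import ConvTasNet

# Load the same checkpoint straight from Hugging Face (assumption: the model ID works here)
model = ConvTasNet.from_pretrained('JorisCos/ConvTasNet_Libri1Mix_enhsingle_16k')
model.eval()

with torch.no_grad():
    # Feed the whole clip at once as a (batch, time) tensor
    est = model(torch.from_numpy(noisy_audio[None, :]).float())

direct_out_wav = est.squeeze().numpy()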

Is there something I'm missing or doing wrong here?