Query on Resampling and Audio Format Compliance in Competition Rules

Question

huazhi1024 opened this issue 5 months ago · comments

Hello, in the released development set, different test sets have varying sampling rates such as 8 kHz, 16 kHz, 44.1 kHz, and 48 kHz, as well as different audio formats like WAV and FLAC. My model was trained on 16 kHz speech data. During inference, if the input audio is not 16 kHz, it will be automatically resampled to 16 kHz before encoding and reconstruction. Does this comply with the competition rules?

hbwu-ntu · Answer 1 · Thu May 23 2024 14:09:18 GMT+0800 (China Standard Time)

Thank you for bringing up this question.

Yes, resampling to 16kHz for both encoding and reconstruction is allowed.

However, please note that the evaluation pipeline expects the audio to be at the same sampling rate as the original datasets. Therefore, you should resample the audio back to its original sampling rate before evaluation.

We recommend saving the original audio's sampling rate (sr) when loading the audio. After codec reconstruction, just resample the reconstructed audio back to that original sampling rate (sr). This should not add much effort to your resynthesis Python script.
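The recommendation above could look roughly like the following. This is only a minimal sketch, not the organizers' script: it assumes NumPy and SciPy are available, uses `scipy.signal.resample_poly` as one possible resampler, and `codec_roundtrip`, `resample`, `resynthesize`, and `MODEL_SR` are hypothetical names introduced for illustration (the codec here is an identity placeholder).

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

MODEL_SR = 16_000  # rate the codec was trained on (16 kHz, as in the question)


def resample(audio: np.ndarray, src_sr: int, dst_sr: int) -> np.ndarray:
    """Polyphase resampling of a 1-D signal from src_sr to dst_sr."""
    if src_sr == dst_sr:
        return audio
    g = gcd(src_sr, dst_sr)
    # e.g. 44100 -> 16000 reduces to up=160, down=441
    return resample_poly(audio, dst_sr // g, src_sr // g)


def codec_roundtrip(audio: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for encode + decode at MODEL_SR (identity here)."""
    return audio.copy()


def resynthesize(audio: np.ndarray, orig_sr: int) -> np.ndarray:
    """Resample to MODEL_SR, run the codec, and return audio at orig_sr."""
    at_model_sr = resample(audio, orig_sr, MODEL_SR)
    reconstructed = codec_roundtrip(at_model_sr)
    restored = resample(reconstructed, MODEL_SR, orig_sr)
    # The two resampling steps can change the length by a few samples;
    # trim or zero-pad so the output has the same length as the input.
    n = len(audio)
    if len(restored) >= n:
        return restored[:n]
    return np.pad(restored, (0, n - len(restored)))


if __name__ == "__main__":
    sr = 44_100  # in a real script, keep the sr returned by the audio loader
    tone = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
    out = resynthesize(tone, sr)
    print(out.shape, "samples at", sr, "Hz")
```

In a real script the array and `sr` would come from whatever loader is used for the WAV and FLAC files, and the output would be written back with that same `sr`.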

Thank you.

redmist · Answer 2 · Fri Jun 07 2024 20:35:29 GMT+0800 (China Standard Time)

Hi @hbwu-ntu,

I also have a similar problem. If my codec is trained on 16 kHz data, then any data above 16 kHz is actually very easy for me to handle. I just need to downsample it to 16 kHz, perform the computations, and then upsample the generated speech to the required sampling rate. This way, the generated speech, while empty in the frequency range above 8 kHz, at least sounds normal.
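To illustrate the band-limiting described above, here is a small sketch (assuming NumPy and SciPy, with `scipy.signal.resample_poly` standing in for whatever resampler is actually used, and no codec in the loop). It passes a 48 kHz signal containing a 1 kHz and a 12 kHz tone through 16 kHz and back; `tone_magnitude` is a hypothetical helper for this example.

```python
import numpy as np
from scipy.signal import resample_poly

sr = 48_000
t = np.arange(sr) / sr                    # one second of audio
low = np.sin(2 * np.pi * 1_000 * t)       # below the 8 kHz Nyquist limit of 16 kHz audio
high = np.sin(2 * np.pi * 12_000 * t)     # above it
x = low + high

down = resample_poly(x, 1, 3)             # 48 kHz -> 16 kHz (anti-aliasing filter applied)
back = resample_poly(down, 3, 1)          # 16 kHz -> 48 kHz


def tone_magnitude(signal: np.ndarray, freq_hz: float, sample_rate: int) -> float:
    """Normalized FFT magnitude at the bin closest to freq_hz."""
    spectrum = np.abs(np.fft.rfft(signal)) / len(signal)
    return float(spectrum[round(freq_hz * len(signal) / sample_rate)])


for f in (1_000, 12_000):
    before = tone_magnitude(x, f, sr)
    after = tone_magnitude(back, f, sr)
    print(f"{f} Hz: before={before:.4f} after={after:.4f}")
```

The 1 kHz tone should survive the round trip nearly unchanged, while the 12 kHz tone is removed by the downsampling step and cannot be restored by upsampling.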

However, suppose I experiment with a codec at a high sampling rate, such as 48 kHz, and need to encode audio sampled at 16 kHz. If I first upsample it to 48 kHz, perform the computations, and then downsample the generated audio to 16 kHz, the resulting audio is almost inaudible, even though my model can recover 48 kHz input data quite well.

hbwu-ntu · Answer 3 · Sat Jun 08 2024 11:19:20 GMT+0800 (China Standard Time)

@redmist328 Hi, Redmist, thank you for bringing up this point. We will compare codec models with the same sampling rate.