etzinis / sudo_rm_rf

Code for SuDoRm-Rf networks for efficient audio source separation. SuDoRm-Rf stands for SUccessive DOwnsampling and Resampling of Multi-Resolution Features which enables a more efficient way of separating sources from mixtures.

Why is there so much noise?

AIHHU opened this issue · comments

I tried the improved SuDoRm-Rf on WHAM clean speech separation, and the separated output contained a lot of noise even though its SI-SDR was 15.0.

I don't know which version you tried, but I am sure you are doing something wrong (maybe you are not normalizing the waveform?). As specified in the README, you can see how to use the pre-trained models here: https://github.com/etzinis/sudo_rm_rf/blob/master/sudo_rm_rf/notebooks/sudormrf_how_to_use.ipynb

Moreover, an SI-SDR of 15 dB corresponds to almost no audible noise, so please check your code again.
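To make the 15 dB claim concrete, here is a minimal sketch of the standard SI-SDR computation on synthetic data (the function name `si_sdr` and the signals are illustrative, not from the repo). An estimate whose residual error carries roughly 1/32 of the signal energy scores about 15 dB, which is a mild, barely audible distortion:

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-9):
    """Scale-invariant SDR in dB between 1-D estimate and reference signals."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the reference to find the optimal scaling.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    noise = estimate - target
    return 10 * np.log10((target ** 2).sum() / ((noise ** 2).sum() + eps))

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
# Error energy ~10^-1.5 of the signal energy yields roughly 15 dB SI-SDR.
estimate = clean + rng.standard_normal(16000) * (1 / np.sqrt(10 ** 1.5))
print(round(si_sdr(estimate, clean), 1))  # roughly 15 dB
```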

Thank you, sir. I ran the WHAM clean separation experiment as described in your README and did the separation as in your notebook. I also tried normalization.

Maybe you are picking up the wrong file or something; in any case, it's almost impossible to get a really noisy file at 15 dB SI-SDR. Except for the case where the noisy file is your reference signal :P

The only difference is that I use sf.read instead of torchaudio.load, and keepdims instead of keepdim.
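That swap is worth double-checking: beyond the keyword spelling (NumPy's `keepdims` vs. PyTorch's `keepdim`), `np.ndarray.std` defaults to the population standard deviation (ddof=0) while `torch.std` defaults to the unbiased sample estimate (ddof=1), so normalizing with one and rescaling with the other introduces a small gain mismatch. Also note that `sf.read` returns audio as (samples, channels) while `torchaudio.load` returns (channels, samples). A short demonstration of the std discrepancy:

```python
import numpy as np
import torch

x = np.array([1.0, 2.0, 3.0, 4.0])

# NumPy: population std (ddof=0) -> sqrt(1.25) ~= 1.118
np_std = x.std(-1, keepdims=True)
# PyTorch: unbiased sample std (ddof=1) -> sqrt(5/3) ~= 1.291
torch_std = torch.from_numpy(x).std(-1, keepdim=True)

print(float(np_std[0]), float(torch_std[0]))

# To reproduce torch's default from NumPy, pass ddof=1 explicitly.
assert np.allclose(x.std(-1, ddof=1), torch_std.numpy())
```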

It's impossible to debug your code without looking at it. Please use what I have in my notebook; I am sure it will run smoothly.

Thanks, I'll try it soon. I can assure you that I did not change your experiment code and that the dataset is the WHAM dataset. That's true.

That's my inference code, sir. It's the same as yours. Why is there still so much noise? The returned test SI-SDR is 15.0. The model was trained on 16 kHz.
esti_utt, _ = torchaudio.load(os.path.join(mix_file_path, file_id))

# Normalize the mixture to zero mean and unit variance per channel.
input_mix_std = esti_utt.std(-1, keepdim=True)
input_mix_mean = esti_utt.mean(-1, keepdim=True)
input_mix = (esti_utt - input_mix_mean) / (input_mix_std + 1e-9)

rec_sources_wavs = model_separation(input_mix.unsqueeze(1))

# Undo the normalization to restore the mixture's original gain.
#rec_sources_wavs = (rec_sources_wavs * input_mix_std) + input_mix_mean

est_waveform_1 = rec_sources_wavs[0, 0].detach().numpy()
est_waveform_2 = rec_sources_wavs[0, 1].detach().numpy()

sf.write(os.path.join(esti_file_path, '1' + file_id), est_waveform_1, args.fs)
sf.write(os.path.join(esti_file_path, '2' + file_id), est_waveform_2, args.fs)

The # on the line
#rec_sources_wavs = (rec_sources_wavs * input_mix_std) + input_mix_mean
is removed when testing, so the rescaling is applied.

You are still giving me incomplete code snippets - tbh I don't see the reason for me to debug your code when there is a bug-free notebook. Are you pointing to files that are 8 kHz? Rescaling is important for capturing the appropriate gain of the sources.

No, sir. I have a question: if I use the 16 kHz dataset for training and change --fs on the command line, will it train on the 16 kHz data, or will your code resample it to 8 kHz?

Those are 8 kHz models; you have to downsample your files to 8 kHz first and then process them.

Sorry to hear that, lol. A 16 kHz model is what I need.

You mean it is required for your application? I was thinking that maybe it would also be a good idea to run a couple of 16 kHz models.