Generate a singing voice prompt ALWAYS generates the same voice singing the same song

Question

Generate a singing voice prompt ALWAYS generates the same voice singing the same song

devlux76 opened this issue a year ago · comments

I gave it different lyrics, and even tried uploading a different audio file etc. But prompts that involve generating a singing voice always produce the exact same output as:

Generate a piece of singing voice. Text sequence is 小酒窝长睫毛AP是你最美的记号. Note sequence is C#4/Db4 | F#4/Gb4 | G#4/Ab4 | A#4/Bb4 F#4/Gb4 | F#4/Gb4 C#4/Db4 | C#4/Db4 | rest | C#4/Db4 | A#4/Bb4 | G#4/Ab4 | A#4/Bb4 | G#4/Ab4 | F4 | C#4/Db4. Note duration sequence is 0.407140 | 0.376190 | 0.242180 | 0.509550 0.183420 | 0.315400 0.235020 | 0.361660 | 0.223070 | 0.377270 | 0.340550 | 0.299620 | 0.344510 | 0.283770 | 0.323390 | 0.360340.

Either something is hardcoded somewhere or the model has been overfitted.

Steven Morrey · Answer 1 · Tue May 02 2023 03:59:53 GMT+0800 (China Standard Time)

I believe I've narrowed down the problem.

In https://github.com/AIGC-Audio/AudioGPT/blob/main/audio-chatgpt.py you have the following code

class T2S:
    def __init__(self, device= None):
        from inference.svs.ds_e2e import DiffSingerE2EInfer
        if device is None:
            device = 'cuda' if torch.cuda.is_available() else 'cpu'
        print("Initializing DiffSinger to %s" % device)
        self.device = device
        self.exp_name = 'checkpoints/0831_opencpop_ds1000'
        self.config= 'NeuralSeq/egs/egs_bases/svs/midi/e2e/opencpop/ds1000.yaml'
        self.set_model_hparams()
        self.pipe = DiffSingerE2EInfer(self.hp, device)
        self.default_inp = {
            'text': '你 说 你 不 SP 懂 为 何 在 这 时 牵 手 AP',
            'notes': 'D#4/Eb4 | D#4/Eb4 | D#4/Eb4 | D#4/Eb4 | rest | D#4/Eb4 | D4 | D4 | D4 | D#4/Eb4 | F4 | D#4/Eb4 | D4 | rest',
            'notes_duration': '0.113740 | 0.329060 | 0.287950 | 0.133480 | 0.150900 | 0.484730 | 0.242010 | 0.180820 | 0.343570 | 0.152050 | 0.266720 | 0.280310 | 0.633300 | 0.444590'
        }

    def set_model_hparams(self):
        set_hparams(config=self.config, exp_name=self.exp_name, print_hparams=False)
        self.hp = hp

    def inference(self, inputs):
        self.set_model_hparams()
        val = inputs.split(",")
        key = ['text', 'notes', 'notes_duration']
        try:
            inp = {k: v for k, v in zip(key, val)}
            wav = self.pipe.infer_once(inp)
        except:
            print('Error occurs. Generate default audio sample.\n')
            inp = self.default_inp
            wav = self.pipe.infer_once(inp)
        #if inputs == '' or len(val) < len(key):
        #    inp = self.default_inp
        #else:
        #    inp = {k:v for k,v in zip(key,val)}
        #wav = self.pipe.infer_once(inp)
        wav *= 32767
        audio_filename = os.path.join('audio', str(uuid.uuid4())[0:8] + ".wav")
        wavfile.write(audio_filename, self.hp['audio_sample_rate'], wav.astype(np.int16))
        print(f"Processed T2S.run, audio_filename: {audio_filename}")
        return audio_filename

class t2s_VISinger:
    def __init__(self, device=None):
        from espnet2.bin.svs_inference import SingingGenerate
        if device is None:
            device = 'cuda' if torch.cuda.is_available() else 'cpu'
        print("Initializing VISingere to %s" % device)
        tag = 'AQuarterMile/opencpop_visinger1'
        self.model = SingingGenerate.from_pretrained(
            model_tag=str_or_none(tag),
            device=device,
        )
        phn_dur = [[0.        , 0.219     ],
            [0.219     , 0.50599998],
            [0.50599998, 0.71399999],
            [0.71399999, 1.097     ],
            [1.097     , 1.28799999],
            [1.28799999, 1.98300004],
            [1.98300004, 7.10500002],
            [7.10500002, 7.60400009]]
        phn = ['sh', 'i', 'q', 'v', 'n', 'i', 'SP', 'AP']
        score = [[0, 0.50625, 'sh_i', 58, 'sh_i'], [0.50625, 1.09728, 'q_v', 56, 'q_v'], [1.09728, 1.9832100000000001, 'n_i', 53, 'n_i'], [1.9832100000000001, 7.105360000000001, 'SP', 0, 'SP'], [7.105360000000001, 7.604390000000001, 'AP', 0, 'AP']]
        tempo = 70
        tmp = {}
        tmp["label"] = phn_dur, phn
        tmp["score"] = tempo, score
        self.default_inp = tmp

    def inference(self, inputs):
        val = inputs.split(",")
        key = ['text', 'notes', 'notes_duration']
        try: # TODO: input will be update
            inp = {k: v for k, v in zip(key, val)}
            wav = self.model(text=inp)["wav"]
        except:
            print('Error occurs. Generate default audio sample.\n')
            inp = self.default_inp
            wav = self.model(text=inp)["wav"]

        audio_filename = os.path.join('audio', str(uuid.uuid4())[0:8] + ".wav")
        soundfile.write(audio_filename, wav, samplerate=self.model.fs)
        return audio_filename

It looked a bit off to me, but I couldn't quite put my finger on it so I asked ChatGPT 4 about it.

It looks like the code is expecting inputs in a specific format and if the inputs are not in that format, it defaults to generating a default audio sample. The key point is in the try/except blocks in both inference methods of the T2S and t2s_VISinger classes. If any exception is thrown during the execution of the code within the try block, it immediately jumps to the except block, which generates a default audio sample.

Here is what's happening in more detail:

The input is expected to be a string with comma-separated values. This string is split by commas to create a list, val.
It then tries to create a dictionary inp by zipping key (which is a list of keys ['text', 'notes', 'notes_duration']) and val (the list of values obtained from the input string).
It passes this inp dictionary to the infer_once function of the pipe object in the T2S class or to the model object in the t2s_VISinger class.
If anything goes wrong during this process (for example, if the input string does not contain enough comma-separated values to match up with the keys, or if the infer_once function does not work with the provided inputs), an exception is raised.
As soon as an exception is raised, it jumps to the except block and generates a default audio sample using self.default_inp.

So, if you're always hearing the same output regardless of the input parameters, it's likely that an exception is being raised every time you try to pass in different parameters. The exception could be due to invalid input format or values, or due to some issue with the infer_once method.

To fix this, you need to ensure that you're passing in inputs in the correct format and with valid values. If you're not sure what the correct format or values are, you'll need to look at the documentation or source code for the infer_once method.

Also, you might want to modify the except block to print out the exception message, like this:

except Exception as e:
    print('Error occurs:', str(e))
    print('Generate default audio sample.\n')
    inp = self.default_inp
    wav = self.pipe.infer_once(inp)

This will give you more information about what's going wrong when you pass in different parameters.