lucidrains / audiolm-pytorch

When an audio file is longer than max_length and multiple target sample rates are used, SoundDataset draws a separate random start position for each sample rate, so the cropped segments cover different parts of the audio:

audiolm-pytorch/audiolm_pytorch/data.py

Lines 86 to 97 in c65bb97

    
    for data, max_length, seq_len_multiple_of in zip(data_tuple, self.max_length, self.seq_len_multiple_of):
        audio_length = data.size(1)

        # pad or curtail

        if exists(max_length):
            if audio_length > max_length:
                max_start = audio_length - max_length
                start = torch.randint(0, max_start, (1, ))
                data = data[:, start:start + max_length]
            else:
                data = F.pad(data, (0, max_length - audio_length), 'constant')

Affects the training data for CoarseTransformer.
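
To make the mismatch concrete, here is a minimal sketch that mimics the per-rate cropping in the snippet above and reports which time window each crop covers. It uses only the standard library rather than torch, and the function name, the 10 second duration, and the two sample rates are illustrative assumptions, not values from the repository.

```python
import random

def independent_crop_windows(duration_s, sample_rates, max_lengths, rng):
    """Mimic the quoted loop: one independent random start per sample rate.

    Returns the (begin, end) time window in seconds covered by each crop.
    """
    windows = []
    for sr, max_length in zip(sample_rates, max_lengths):
        audio_length = int(duration_s * sr)
        max_start = audio_length - max_length
        # Independent draw per rate, analogous to torch.randint(0, max_start, (1,))
        start = rng.randrange(0, max_start)
        windows.append((start / sr, (start + max_length) / sr))
    return windows

# Hypothetical setup: a 10 s clip at 16 kHz and 24 kHz, cropped to 1 s each.
rates = (16_000, 24_000)
windows = independent_crop_windows(10.0, rates, (16_000, 24_000), random.Random(0))
for sr, (begin, end) in zip(rates, windows):
    print(f"{sr} Hz crop covers {begin:.3f}s to {end:.3f}s")
```

Each crop is one second long, but because the starts are drawn separately, the two windows will generally not line up in time.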

@ilya16 yes indeed that does not seem right 😞

decided to take the strategy of doing all the resampling + curtail / pad on the highest target sample freq first, before resampling to all the rest of the target sample freqs

want to see if that addresses the issue?
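
A rough sketch of the strategy described above, not the actual commit: crop or pad once at the highest sample rate, then derive every lower-rate view from that single segment so all views cover the same time window. It uses plain Python lists, and `decimate` is a toy stand-in for a real resampler; `aligned_views`, `crop_or_pad`, and the chosen rates are hypothetical names and values for illustration only.

```python
import random

def crop_or_pad(samples, max_length, rng):
    """Crop to max_length at a random start, or right-pad with zeros."""
    n = len(samples)
    if n > max_length:
        # +1 so the last valid start position can also be drawn
        start = rng.randrange(0, n - max_length + 1)
        return samples[start:start + max_length]
    return samples + [0.0] * (max_length - n)

def decimate(samples, factor):
    """Toy stand-in for a real resampler: keep every `factor`-th sample."""
    return samples[::factor]

def aligned_views(samples, highest_sr, target_srs, max_seconds, rng):
    """Crop/pad once at the highest rate, then derive each lower-rate view."""
    assert all(highest_sr % sr == 0 for sr in target_srs), \
        "toy decimation only handles integer rate ratios"
    cropped = crop_or_pad(samples, int(max_seconds * highest_sr), rng)
    return {sr: decimate(cropped, highest_sr // sr) for sr in target_srs}

# A ramp makes alignment easy to check: each value is its index at 48 kHz.
audio = [float(i) for i in range(48_000 * 3)]
views = aligned_views(audio, 48_000, (48_000, 24_000, 16_000), 1.0, random.Random(0))
```

Because there is only one random draw, every view starts at the same instant; with the ramp input, `views[24_000][1]` equals `views[48_000][2]`, and so on.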

@lucidrains looks good!

Inconsistent samples for multiple targets in SoundDataset