lucidrains / audiolm-pytorch

Implementation of AudioLM, a SOTA Language Modeling Approach to Audio Generation out of Google Research, in Pytorch

Inconsistent samples for multiple targets in SoundDataset

ilya16 opened this issue · comments

When audio lengths are greater than max_length and multiple target sample rates are used, SoundDataset draws a separate random start position for each target, so the cropped segments do not cover the same part of the audio:

    for data, max_length, seq_len_multiple_of in zip(data_tuple, self.max_length, self.seq_len_multiple_of):
        audio_length = data.size(1)

        # pad or curtail
        if exists(max_length):
            if audio_length > max_length:
                max_start = audio_length - max_length
                start = torch.randint(0, max_start, (1, ))
                data = data[:, start:start + max_length]
            else:
                data = F.pad(data, (0, max_length - audio_length), 'constant')

Affects the training data for CoarseTransformer.

@ilya16 yes indeed that does not seem right 😞

decided to do all the resampling + curtail / pad at the highest target sample freq first, then resample that single cropped segment to the rest of the target sample freqs

want to see if that addresses the issue?
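
For readers following along, the strategy described above could be sketched roughly as follows. This is not the code from the repository, only a minimal illustration: `crop_then_resample`, `naive_resample`, and the `max_length_seconds` parameter are hypothetical names, and the linear-interpolation resampler is a dependency-free stand-in for a proper resampler such as `torchaudio.functional.resample`.

```python
import torch
import torch.nn.functional as F

def naive_resample(wave, orig_freq, new_freq):
    # Stand-in resampler (linear interpolation), used only to keep this
    # sketch self-contained; a real pipeline would use a proper resampler.
    if orig_freq == new_freq:
        return wave
    new_len = int(round(wave.size(-1) * new_freq / orig_freq))
    # F.interpolate expects (batch, channels, length) for mode='linear'
    out = F.interpolate(wave.unsqueeze(0), size=new_len, mode='linear', align_corners=False)
    return out.squeeze(0)

def crop_then_resample(wave, orig_freq, target_freqs, max_length_seconds):
    """wave: (channels, samples). Returns one tensor per target frequency,
    all derived from the same cropped (or padded) segment."""
    # 1. resample once to the highest target sample frequency
    top_freq = max(target_freqs)
    wave = naive_resample(wave, orig_freq, top_freq)

    # 2. pad or curtail once, so a single random start is shared by all targets
    max_length = int(max_length_seconds * top_freq)
    audio_length = wave.size(1)
    if audio_length > max_length:
        max_start = audio_length - max_length
        # upper bound of torch.randint is exclusive, hence the + 1
        start = torch.randint(0, max_start + 1, (1,)).item()
        wave = wave[:, start:start + max_length]
    else:
        wave = F.pad(wave, (0, max_length - audio_length), 'constant')

    # 3. resample the shared segment to every target sample frequency
    return tuple(naive_resample(wave, top_freq, f) for f in target_freqs)
```

Called as `crop_then_resample(wave, 16000, (16000, 24000), 1.0)`, this returns two tensors that cover the same one-second window of the input, which is the property the original per-target loop did not guarantee.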

@lucidrains looks good!