santi-pdp / pase

Problem Agnostic Speech Encoder


Argument #4: Padding size should be less than the corresponding input dimension, but got: padding (0, 10416) at dimension 2 of input [1, 1, 5584]

lianrzh opened this issue · comments

The problem occurs when running the command python make_trainset_statistics.py --net_cfg cfg/workers/workers+.cfg ...
{'regr': [{'num_outputs': 1, 'dropout': 0, 'dropout_time': 0.0, 'hidden_layers': 1, 'name': 'cchunk', 'type': 'decoder', 'hidden_size': 64, 'fmaps': [512, 256, 128], 'strides': [4, 4, 10], 'kwidths': [30, 30, 30], 'loss': <pase.losses.ContextualizedLoss object at 0x7fd687ab1410>}, {'num_outputs': 3075, 'dropout': 0, 'hidden_size': 256, 'hidden_layers': 1, 'name': 'lps', 'context': 1, 'r': 7, 'loss': <pase.losses.ContextualizedLoss object at 0x7fd687ab1490>, 'skip': False}, {'num_outputs': 3075, 'dropout': 0, 'hidden_size': 256, 'hidden_layers': 1, 'name': 'lps_long', 'context': 1, 'r': 7, 'transform': {'win': 512}, 'loss': <pase.losses.ContextualizedLoss object at 0x7fd687ab14d0>, 'skip': False}, {'num_outputs': 120, 'dropout': 0, 'hidden_size': 256, 'hidden_layers': 1, 'name': 'fbank', 'context': 1, 'r': 7, 'loss': <pase.losses.ContextualizedLoss object at 0x7fd687ab1510>, 'skip': False}, {'num_outputs': 120, 'dropout': 0, 'hidden_size': 256, 'hidden_layers': 1, 'name': 'fbank_long', 'context': 1, 'r': 7, 'transform': {'win': 1024, 'n_fft': 1024}, 'loss': <pase.losses.ContextualizedLoss object at 0x7fd687ab1550>, 'skip': False}, {'num_outputs': 120, 'dropout': 0, 'hidden_size': 256, 'hidden_layers': 1, 'name': 'gtn', 'context': 1, 'r': 7, 'loss': <pase.losses.ContextualizedLoss object at 0x7fd687ab1590>, 'skip': False}, {'num_outputs': 120, 'dropout': 0, 'hidden_size': 256, 'hidden_layers': 1, 'name': 'gtn_long', 'context': 1, 'r': 7, 'loss': <pase.losses.ContextualizedLoss object at 0x7fd687ab15d0>, 'transform': {'win': 2048}, 'skip': False}, {'num_outputs': 39, 'dropout': 0, 'hidden_size': 256, 'hidden_layers': 1, 'name': 'mfcc', 'context': 1, 'r': 7, 'loss': <pase.losses.ContextualizedLoss object at 0x7fd687ab1610>, 'skip': False}, {'num_outputs': 60, 'dropout': 0, 'hidden_size': 256, 'hidden_layers': 1, 'name': 'mfcc_long', 'context': 1, 'r': 7, 'transform': {'win': 2048, 'order': 20}, 'loss': <pase.losses.ContextualizedLoss object at 0x7fd687aad550>, 'skip': False}, {'num_outputs': 12, 'dropout': 0, 'hidden_size': 256, 'hidden_layers': 1, 'name': 'prosody', 'context': 1, 'r': 7, 'loss': <pase.losses.ContextualizedLoss object at 0x7fd687aad450>, 'skip': False}], 'cls': [{'num_outputs': 1, 'dropout': 0, 'hidden_size': 256, 'hidden_layers': 1, 'name': 'mi', 'loss': <pase.losses.ContextualizedLoss object at 0x7fd687aad210>, 'skip': False, 'keys': ['chunk', 'chunk_ctxt', 'chunk_rand']}, {'num_outputs': 1, 'dropout': 0, 'hidden_size': 256, 'hidden_layers': 1, 'name': 'cmi', 'augment': True, 'loss': <pase.losses.ContextualizedLoss object at 0x7fd687aad1d0>, 'skip': False, 'keys': ['chunk', 'chunk_ctxt', 'chunk_rand']}]}
Found 2445650 speakers info
Found 1980000 files in train split
Found 1980000 speakers in train split
Traceback (most recent call last):
  File "make_trainset_statistics.py", line 165, in <module>
    extract_stats(opts)
  File "make_trainset_statistics.py", line 86, in extract_stats
    for bidx, batch in enumerate(dloader, start=1):
  File "/data/app/anaconda3/envs/pytorch-1.1/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 582, in __next__
    return self._process_next_batch(batch)
  File "/data/app/anaconda3/envs/pytorch-1.1/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 608, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
RuntimeError: Traceback (most recent call last):
  File "/data/app/anaconda3/envs/pytorch-1.1/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 99, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/data/app/anaconda3/envs/pytorch-1.1/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 99, in <listcomp>
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/data/app/ronlian/pase/pase/dataset.py", line 304, in __getitem__
    pkg = self.transform(pkg)
  File "/data/app/anaconda3/envs/pytorch-1.1/lib/python3.7/site-packages/torchvision/transforms/transforms.py", line 61, in __call__
    img = t(img)
  File "/data/app/ronlian/pase/pase/transforms.py", line 427, in __call__
    pkg['chunk_rand'] = self.select_chunk(raw_rand)
  File "/data/app/ronlian/pase/pase/transforms.py", line 317, in select_chunk
    mode=self.pad_mode).view(-1)
  File "/data/app/anaconda3/envs/pytorch-1.1/lib/python3.7/site-packages/torch/nn/functional.py", line 2805, in pad
    ret = torch._C._nn.reflection_pad1d(input, pad)
RuntimeError: Argument #4: Padding size should be less than the corresponding input dimension, but got: padding (0, 10416) at dimension 2 of input [1, 1, 5584]
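For context, the failure can be reproduced in isolation: torch.nn.functional.pad with mode='reflect' requires each pad amount to be strictly smaller than the size of the padded dimension, and here the requested padding (10416 samples, i.e. 16000 - 5584) is larger than the 5584-sample chunk. A minimal standalone sketch (not PASE code, the numbers are just taken from the traceback above):

import torch
import torch.nn.functional as F

# A 1x1x5584 "waveform" chunk, as in the input shape reported above.
x = torch.zeros(1, 1, 5584)

# Reflection padding needs pad < input length, so asking for 10416
# extra samples on the right raises the same RuntimeError:
try:
    F.pad(x, (0, 10416), mode='reflect')
except RuntimeError as e:
    print(e)

# With a chunk longer than half the target length the same call works,
# e.g. an 8001-sample chunk padded up to 16000 samples:
y = torch.zeros(1, 1, 8001)
print(F.pad(y, (0, 16000 - 8001), mode='reflect').shape)  # torch.Size([1, 1, 16000])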

Hi @lianrzh,

As noted in the RuntimeError, you are padding a piece of waveform with 5,584 samples up to 1 second @ 16 kHz (hence 16,000 samples). Please remove chunks that are too small (e.g. < 12,000 samples) from your training wav folder prior to processing the data, or reduce the chunk_size to a lower value (e.g. 8,000), although we don't know what detrimental effect the latter option may have on performance.
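If it helps, here is a minimal sketch of the first option: scanning the training wav folder and dropping files shorter than the chunk length. The folder path, the 16,000-sample threshold and the use of soundfile are assumptions to adapt to your data and sample rate; anything at or below half the chunk_size is what actually breaks the reflection padding, the full 16,000-sample threshold is just a conservative choice.

import os
import glob
import soundfile as sf  # assumed available; any wav reader with a frame count works

WAV_DIR = '/path/to/train/wavs'   # hypothetical location of the training wavs
MIN_SAMPLES = 16000               # 1 s @ 16 kHz, i.e. the chunk_size used above

for path in glob.glob(os.path.join(WAV_DIR, '**', '*.wav'), recursive=True):
    info = sf.info(path)
    if info.frames < MIN_SAMPLES:
        print('Dropping short file ({} samples): {}'.format(info.frames, path))
        os.remove(path)  # or move it aside instead of deleting

Remember to regenerate the data config / statistics after filtering so the file lists stay consistent with what is left on disk.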