Slow Tacotron training (~1 step/sec) on AWS p3.2xlarge (Tesla V100)
ScottyBauer opened this issue · comments
I'm experimenting with this project and getting throughput that seems too slow, which leads me to believe I may have misconfigured something, or there's another issue.
I'm reusing the pre-trained models with my own custom audio: ~750 clips ranging from 4 to 10 seconds.
I'm using:
PyTorch 1.7.1 with Python3.7 (CUDA 11.0 and Intel MKL)
In order to get the code to run properly I had to apply the fix from this bug (not sure if this is relevant; I just want to give all the details):
#201
and I applied this pull request:
521179e
The only changes I've made to hyperparams are setting peak_norm from False to True:
peak_norm = True  # Normalise to the peak of each wav file
and setting my paths.
I can confirm that it is using the GPU (at least GPU memory), but I've never seen nvidia-smi show utilization above 38%:
Things I've tried:
Increasing the batch size in hyperparams up to 64 (and adjusting the learning rate along with it), which didn't help.
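One thing I haven't ruled out is the input pipeline: if the DataLoader runs single-process (num_workers=0), the GPU sits idle while the CPU prepares each batch, which would match the low utilization I'm seeing. A self-contained sketch of the kind of settings I mean (with a stand-in dataset, not this repo's actual loader):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for the real preprocessed dataset: 256 dummy "mel frames".
dataset = TensorDataset(torch.randn(256, 80))

# num_workers > 0 moves batch preparation into background processes;
# pin_memory=True speeds up host-to-GPU transfers when CUDA is available.
loader = DataLoader(dataset, batch_size=32, num_workers=2, pin_memory=True)

batches = sum(1 for _ in loader)  # 256 / 32 = 8 batches
```

On a p3.2xlarge (8 vCPUs), 2-4 workers would be a reasonable starting point.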
here is nvidia-smi output:
Fri Feb 26 00:40:21 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02 Driver Version: 450.80.02 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:00:1E.0 Off | 0 |
| N/A 35C P0 44W / 300W | 4461MiB / 16160MiB | 10% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 6844 C python 4459MiB |
+-----------------------------------------------------------------------------+
Here's what it's up to:
Trainable Parameters: 11.088M
Restoring from latest checkpoint...
Loading latest weights: /home/ubuntu/WaveRNN/checkpoints/ljspeech_lsa_smooth_attention.tacotron/latest_weights.pyt
/home/ubuntu/WaveRNN/models/tacotron.py:308: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than tensor.new_tensor(sourceTensor).
self.decoder.r = self.decoder.r.new_tensor(value, requires_grad=False)
Loading latest optimizer state: /home/ubuntu/WaveRNN/checkpoints/ljspeech_lsa_smooth_attention.tacotron/latest_optim.pyt
+----------------+------------+---------------+------------------+
| Steps with r=2 | Batch Size | Learning Rate | Outputs/Step (r) |
+----------------+------------+---------------+------------------+
| 170k Steps | 8 | 0.0001 | 2 |
+----------------+------------+---------------+------------------+
| Epoch: 1/1869 (61/91) | Loss: 0.7363 | 0.41 steps/s | Step: 180k |
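As an aside, the UserWarning above looks unrelated to the speed; it's just PyTorch complaining about copy-constructing a tensor from another tensor via new_tensor. A minimal illustration of the pattern the warning recommends instead (`value` is a stand-in name, not the repo's actual variable):

```python
import torch

# `value` stands in for the tensor being assigned to self.decoder.r
value = torch.tensor(2)

# This pattern triggers the UserWarning when `value` is already a tensor:
#   r = value.new_tensor(value, requires_grad=False)

# Warning-free equivalent recommended by PyTorch: copy, then detach
# from the autograd graph explicitly.
r = value.clone().detach().requires_grad_(False)
```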
If I change some of the training schedule parameters:
(pytorch_latest_p37) ubuntu@ip-172-31-46-96:~/WaveRNN$ python train_tacotron.py
Using device: cuda
Initialising Tacotron Model...
Trainable Parameters: 11.088M
Restoring from latest checkpoint...
Loading latest weights: /home/ubuntu/WaveRNN/checkpoints/ljspeech_lsa_smooth_attention.tacotron/latest_weights.pyt
Loading latest optimizer state: /home/ubuntu/WaveRNN/checkpoints/ljspeech_lsa_smooth_attention.tacotron/latest_optim.pyt
+----------------+------------+---------------+------------------+
| Steps with r=7 | Batch Size | Learning Rate | Outputs/Step (r) |
+----------------+------------+---------------+------------------+
| 169k Steps | 64 | 0.0001 | 7 |
+----------------+------------+---------------+------------------+
| Epoch: 1/14154 (12/12) | Loss: 0.7744 | 0.91 steps/s | Step: 180k |
| Epoch: 2/14154 (12/12) | Loss: 0.7742 | 0.94 steps/s | Step: 180k |
| Epoch: 3/14154 (12/12) | Loss: 0.7733 | 0.92 steps/s | Step: 180k |
| Epoch: 4/14154 (12/12) | Loss: 0.7785 | 0.93 steps/s | Step: 180k |
and nvidia-smi:
Every 1.0s: nvidia-smi ip-172-31-46-96: Fri Feb 26 00:50:23 2021
Fri Feb 26 00:50:23 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02 Driver Version: 450.80.02 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:00:1E.0 Off | 0 |
| N/A 40C P0 202W / 300W | 11715MiB / 16160MiB | 33% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 8291 C python 11713MiB |
+-----------------------------------------------------------------------------+
Let me know what other information I can provide to help debug this.
Thank you,
Scott
It may be constrained by disk reads. Move your dataset to faster storage, e.g. copy it to RAM under /dev/shm.
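To confirm whether reads are actually the bottleneck, you could time a full pass over the dataset; if the measured throughput is low relative to what a training step consumes, the GPU is starved. A quick stdlib-only sketch (the directory path and glob pattern are placeholders):

```python
import time
from pathlib import Path

def measure_read_throughput(dataset_dir, pattern="*.wav"):
    """Read every matching file once and return throughput in MB/s.

    A low number here (relative to per-step data consumption) suggests
    the data loader is I/O-bound rather than compute-bound.
    """
    total_bytes = 0
    start = time.perf_counter()
    for path in Path(dataset_dir).glob(pattern):
        total_bytes += len(path.read_bytes())
    elapsed = time.perf_counter() - start
    return total_bytes / max(elapsed, 1e-9) / 1e6  # MB/s
```

Run it once against the on-disk copy and once against the /dev/shm copy to see how much headroom the move buys you.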
The Tesla V100 is an older GPU. With the default model size, you'll top out around 2-3 steps/sec at r=7 and about 1 step/sec at r=2. Training will be faster if you discard your longer utterances.
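If you go the route of dropping long clips, batches are padded to the longest utterance, so a few 10-second outliers dominate per-step compute. A sketch using only the standard library `wave` module (the 8-second cutoff is just an example, not a recommended value):

```python
import wave

def wav_duration_seconds(path):
    """Duration of an uncompressed PCM wav file, in seconds."""
    with wave.open(path, "rb") as f:
        return f.getnframes() / f.getframerate()

def filter_by_duration(paths, max_seconds=8.0):
    """Keep only clips at or under max_seconds; long outliers dominate
    per-batch compute because batches are padded to the longest clip."""
    return [p for p in paths if wav_duration_seconds(p) <= max_seconds]
```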
You may also wish to check out CorentinJ/Real-Time-Voice-Cloning; it uses the same Tacotron and WaveRNN models as this repo. Once you get the hang of Tacotron (synthesizer) training, see CorentinJ/Real-Time-Voice-Cloning#437, which describes what you're trying to do.