Slow Tacotron training (~1 step/sec) on AWS p3.2xlarge (Tesla V100)
ScottyBauer opened this issue · comments
I'm experimenting with this project and getting throughput that seems too slow, which leads me to believe I may have misconfigured something, or there's another issue.
I'm reusing the pre-trained models with my own custom audio: ~750 clips ranging from 4 to 10 seconds.
I'm using:
PyTorch 1.7.1 with Python3.7 (CUDA 11.0 and Intel MKL)
In order to get the code to run properly I had to apply the fix from this bug (not sure if this is relevant; I just want to give all the details):
#201
and I applied this pull request:
521179e
The only changes I've made to hyperparams are setting peak_norm from False to True:
peak_norm = True  # Normalise to the peak of each wav file
and setting my paths.
I can confirm that it is using the GPU (at least GPU memory), but I've never seen nvidia-smi show utilization above 38%:
Things I've tried:
Increasing the batch size in hyperparams up to 64 (and adjusting the learning rate along with it), which didn't help.
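One thing I haven't ruled out is the input pipeline: if the DataLoader runs single-process (num_workers=0), the GPU sits idle while the CPU prepares each batch, which would match the low utilization I'm seeing. A self-contained sketch of the kind of settings I mean (with a stand-in dataset, not this repo's actual loader):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for the real preprocessed dataset: 256 dummy "mel frames".
dataset = TensorDataset(torch.randn(256, 80))

# num_workers > 0 moves batch preparation into background processes;
# pin_memory=True speeds up host-to-GPU transfers when CUDA is available.
loader = DataLoader(dataset, batch_size=32, num_workers=2, pin_memory=True)

batches = sum(1 for _ in loader)  # 256 / 32 = 8 batches
```

On a p3.2xlarge (8 vCPUs), 2-4 workers would be a reasonable starting point.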
here is nvidia-smi output:
Fri Feb 26 00:40:21 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02 Driver Version: 450.80.02 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:00:1E.0 Off | 0 |
| N/A 35C P0 44W / 300W | 4461MiB / 16160MiB | 10% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 6844 C python 4459MiB |
+-----------------------------------------------------------------------------+
Here's what it's up to:
Trainable Parameters: 11.088M
Restoring from latest checkpoint...
Loading latest weights: /home/ubuntu/WaveRNN/checkpoints/ljspeech_lsa_smooth_attention.tacotron/latest_weights.pyt
/home/ubuntu/WaveRNN/models/tacotron.py:308: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than tensor.new_tensor(sourceTensor).
self.decoder.r = self.decoder.r.new_tensor(value, requires_grad=False)
Loading latest optimizer state: /home/ubuntu/WaveRNN/checkpoints/ljspeech_lsa_smooth_attention.tacotron/latest_optim.pyt
+----------------+------------+---------------+------------------+
| Steps with r=2 | Batch Size | Learning Rate | Outputs/Step (r) |
+----------------+------------+---------------+------------------+
| 170k Steps | 8 | 0.0001 | 2 |
+----------------+------------+---------------+------------------+
| Epoch: 1/1869 (61/91) | Loss: 0.7363 | 0.41 steps/s | Step: 180k |
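As an aside, the UserWarning above looks unrelated to the speed; it's just PyTorch complaining about copy-constructing a tensor from another tensor via new_tensor. A minimal illustration of the pattern the warning recommends instead (`value` is a stand-in name, not the repo's actual variable):

```python
import torch

# `value` stands in for the tensor being assigned to self.decoder.r
value = torch.tensor(2)

# This pattern triggers the UserWarning when `value` is already a tensor:
#   r = value.new_tensor(value, requires_grad=False)

# Warning-free equivalent recommended by PyTorch: copy, then detach
# from the autograd graph explicitly.
r = value.clone().detach().requires_grad_(False)
```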
If I change some of the training schedule parameters:
(pytorch_latest_p37) ubuntu@ip-172-31-46-96:~/WaveRNN$ python train_tacotron.py
Using device: cuda
Initialising Tacotron Model...
Trainable Parameters: 11.088M
Restoring from latest checkpoint...
Loading latest weights: /home/ubuntu/WaveRNN/checkpoints/ljspeech_lsa_smooth_attention.tacotron/latest_weights.pyt
Loading latest optimizer state: /home/ubuntu/WaveRNN/checkpoints/ljspeech_lsa_smooth_attention.tacotron/latest_optim.pyt
+----------------+------------+---------------+------------------+
| Steps with r=7 | Batch Size | Learning Rate | Outputs/Step (r) |
+----------------+------------+---------------+------------------+
| 169k Steps | 64 | 0.0001 | 7 |
+----------------+------------+---------------+------------------+
| Epoch: 1/14154 (12/12) | Loss: 0.7744 | 0.91 steps/s | Step: 180k |
| Epoch: 2/14154 (12/12) | Loss: 0.7742 | 0.94 steps/s | Step: 180k |
| Epoch: 3/14154 (12/12) | Loss: 0.7733 | 0.92 steps/s | Step: 180k |
| Epoch: 4/14154 (12/12) | Loss: 0.7785 | 0.93 steps/s | Step: 180k |
and nvidia-smi:
Every 1.0s: nvidia-smi ip-172-31-46-96: Fri Feb 26 00:50:23 2021
Fri Feb 26 00:50:23 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02 Driver Version: 450.80.02 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:00:1E.0 Off | 0 |
| N/A 40C P0 202W / 300W | 11715MiB / 16160MiB | 33% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 8291 C python 11713MiB |
+-----------------------------------------------------------------------------+
Let me know what other information I can provide to help debug this.
Thank you,
Scott
It may be constrained by disk reads. Move your dataset to faster storage, e.g. copy it to RAM under /dev/shm.
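To confirm whether reads are actually the bottleneck, you could time a full pass over the dataset; if the measured throughput is low relative to what a training step consumes, the GPU is starved. A quick stdlib-only sketch (the directory path and glob pattern are placeholders):

```python
import time
from pathlib import Path

def measure_read_throughput(dataset_dir, pattern="*.wav"):
    """Read every matching file once and return throughput in MB/s.

    A low number here (relative to per-step data consumption) suggests
    the data loader is I/O-bound rather than compute-bound.
    """
    total_bytes = 0
    start = time.perf_counter()
    for path in Path(dataset_dir).glob(pattern):
        total_bytes += len(path.read_bytes())
    elapsed = time.perf_counter() - start
    return total_bytes / max(elapsed, 1e-9) / 1e6  # MB/s
```

Run it once against the on-disk copy and once against the /dev/shm copy to see how much headroom the move buys you.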
The Tesla V100 is an older GPU. With the default model size, you'll top out around 2-3 steps/sec at r=7 and about 1 step/sec at r=2. Training will be faster if you discard your longer utterances.
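If you go the route of dropping long clips, batches are padded to the longest utterance, so a few 10-second outliers dominate per-step compute. A sketch using only the standard library `wave` module (the 8-second cutoff is just an example, not a recommended value):

```python
import wave

def wav_duration_seconds(path):
    """Duration of an uncompressed PCM wav file, in seconds."""
    with wave.open(path, "rb") as f:
        return f.getnframes() / f.getframerate()

def filter_by_duration(paths, max_seconds=8.0):
    """Keep only clips at or under max_seconds; long outliers dominate
    per-batch compute because batches are padded to the longest clip."""
    return [p for p in paths if wav_duration_seconds(p) <= max_seconds]
```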
You may also wish to check out CorentinJ/Real-Time-Voice-Cloning; it uses the same Tacotron and WaveRNN models as this repo. Once you get the hang of Tacotron (synthesizer) training, see CorentinJ/Real-Time-Voice-Cloning#437, which describes what you're trying to do.