Training Optimization Question
woolz opened this issue
Thiago Cassimiro commented
Currently I'm training a large model (114M sentences) with 2 GPUs, but nvidia-smi shows a GPU-parallelism problem during training.
The second GPU sits at 98-99% utilization the whole time, which is fine. But the first GPU's utilization fluctuates: sometimes 11%, other times 45%, 95%, etc.:
```
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.41.03              Driver Version: 530.41.03    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3060         Off| 00000000:01:00.0 Off |                  N/A |
| 39%   56C    P2               55W / 170W|  11343MiB / 12288MiB |     11%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3060         Off| 00000000:02:00.0 Off |                  N/A |
| 37%   43C    P2               48W / 170W|  11343MiB / 12288MiB |     99%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
```
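For a continuous view of the imbalance rather than a single snapshot, per-GPU utilization can be logged over time. This is a generic monitoring sketch using standard nvidia-smi query fields, nothing Marian-specific; `gpu-util.csv` is just an example output name:

```sh
# Append one CSV row per GPU per second: timestamp, GPU index,
# compute utilization, and memory in use.
nvidia-smi \
    --query-gpu=timestamp,index,utilization.gpu,memory.used \
    --format=csv \
    --loop=1 | tee gpu-util.csv
```

In a trace like this, the problem shows up as periodic low-utilization rows for index 0 while index 1 stays near 99%.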
Is this normal, or is there a build/configuration problem?
train-marian.txt
```
[2023-05-18 03:17:45] Using synchronous SGD
[2023-05-18 03:17:57] [training] Batches are processed as 1 process(es) x 2 devices/process
```
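For context on those two log lines: with synchronous SGD, each device computes gradients on its own sub-batch and the devices synchronize before every update, so if one device (or host-side batch preparation) lags, the other idles. Below is a hedged sketch of the kind of invocation involved, using flag names from Marian's CLI; the real command is in the attached train-marian.txt, and the values here are placeholders, not a recommendation:

```sh
# Hypothetical sketch only; the actual command is in train-marian.txt.
./marian \
    --devices 0 1 \
    --sync-sgd \
    --mini-batch-fit \
    --workspace 9000 \
    --shuffle-in-ram
# --devices 0 1    : one process driving both GPUs (matches the log above)
# --sync-sgd       : synchronous gradient updates (matches "Using synchronous SGD")
# --mini-batch-fit : pick the largest mini-batch that fits in --workspace
# --workspace 9000 : MiB of GPU memory pre-allocated per device (placeholder)
# --shuffle-in-ram : shuffle the corpus in RAM, reducing disk I/O stalls
```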