TPU debugging
mfatih7 opened this issue
Hello
In the debugging article https://cloud.google.com/blog/topics/developers-practitioners/pytorchxla-performance-debugging-tpu-vm-part-1
the following script is used to get log data from the host:
export PT_XLA_DEBUG=1
export USE_TORCH=ON
python3 mmf_cli/run.py \
config=./projects/unit/configs/all_8_datasets/shared_dec_without_task_embedding.yaml \
dataset=glue_qnli \
model=unit \
training.batch_size=8 \
training.device=xla \
distributed.world_size=1 \
training.log_interval=100 \
training.max_updates=1500
I do not use the logging module in my scripts.
How can I get the same log data for my own training script from the host while using a TPU on Colab?
I am triggering the training procedure with the command below.
!python runTrain_n_to_n_TPU_single.py
How should I modify this command that I use to start my training script?
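For reference, I suspect the variable can be set for the whole notebook session with IPython's %env magic before launching the script (not yet verified with PT_XLA_DEBUG on my side):
%env PT_XLA_DEBUG=1
!python runTrain_n_to_n_TPU_single.py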
Hello
The debugging article that I mentioned above was written for Google Cloud TPU VMs, not for Google Colab.
Can every Google Cloud example be run on Colab as well?
Afterwards I switched to the Colab troubleshooting document.
I added the lines below to my code:
import torch_xla.debug.metrics as met
print(met.metrics_report())
After the execution is completed, I can observe the metrics data.
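In case it is useful, here is a minimal sketch of where the report call can sit in a plain training loop (train_loader and train_step are placeholders, not code from this issue):
import torch_xla.debug.metrics as met

for step, batch in enumerate(train_loader):
    loss = train_step(batch)  # placeholder for the actual training step
    if step % 100 == 0:
        # print the XLA metrics/counters accumulated so far
        print(met.metrics_report())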
But what about running my workload with PT_XLA_DEBUG=1?
How can I activate this?
Hello
How can I activate Auto-Metrics Analysis?
When I run
!python runTrain_n_to_n_TPU_single.py PT_XLA_DEBUG=1
I don't see any output printed on the console like the example messages below.
pt-xla-profiler: CompileTime too frequent: 21 counts during 11 steps
pt-xla-profiler: TransferFromServerTime too frequent: 11 counts during 11 steps
pt-xla-profiler: Op(s) not lowered: aten::_ctc_loss, aten::_ctc_loss_backward, Please open a GitHub issue with the above op lowering requests.
pt-xla-profiler: CompileTime too frequent: 23 counts during 12 steps
pt-xla-profiler: TransferFromServerTime too frequent: 12 counts during 12 steps
Adding the lines below to the code is enough (setting them at the very top of the script, before torch_xla is imported, is the safe option):
import os
os.environ['PT_XLA_DEBUG'] = '1'
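Note that in the earlier attempt the variable came after the script name, so the shell passed PT_XLA_DEBUG=1 to the script as an ordinary command-line argument instead of exporting it into the environment. Prefixing it to the command should work as well:
!PT_XLA_DEBUG=1 python runTrain_n_to_n_TPU_single.py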