TPU debugging
mfatih7 opened this issue
Hello
In the debugging article https://cloud.google.com/blog/topics/developers-practitioners/pytorchxla-performance-debugging-tpu-vm-part-1
the following script is used to get log data from the host:
export PT_XLA_DEBUG=1
export USE_TORCH=ON
python3 mmf_cli/run.py \
config=./projects/unit/configs/all_8_datasets/shared_dec_without_task_embedding.yaml \
dataset=glue_qnli \
model=unit \
training.batch_size=8 \
training.device=xla \
distributed.world_size=1 \
training.log_interval=100 \
training.max_updates=1500
I do not use the logging module in my scripts.
How can I get the same log data for my own training script from the host while using a TPU on Colab?
I am triggering the training procedure with the command below.
!python runTrain_n_to_n_TPU_single.py
How should I modify this command that I use to start my training script?
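For reference, I suspect the variable can be set for the whole notebook session with IPython's %env magic before launching the script (not yet verified with PT_XLA_DEBUG on my side):
%env PT_XLA_DEBUG=1
!python runTrain_n_to_n_TPU_single.py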
Hello
The debugging article that I mentioned above was written for Google Cloud TPU VMs, not for Google Colab.
Can every Google Cloud example be run on Colab as well?
Afterwards I switched to the Colab troubleshooting document.
I added the lines below to my code:
import torch_xla.debug.metrics as met
print(met.metrics_report())
After the execution is completed, I can observe the metrics data.
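In case it is useful, here is a minimal sketch of where the report call can sit in a plain training loop (train_loader and train_step are placeholders, not code from this issue):
import torch_xla.debug.metrics as met

for step, batch in enumerate(train_loader):
    loss = train_step(batch)  # placeholder for the actual training step
    if step % 100 == 0:
        # print the XLA metrics/counters accumulated so far
        print(met.metrics_report())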
But what about running my workload with PT_XLA_DEBUG=1?
How can I activate this?
Hello
How can I activate Auto-Metrics Analysis?
When I run
!python runTrain_n_to_n_TPU_single.py PT_XLA_DEBUG=1
I don't see any output printed on the console like the example messages below.
pt-xla-profiler: CompileTime too frequent: 21 counts during 11 steps
pt-xla-profiler: TransferFromServerTime too frequent: 11 counts during 11 steps
pt-xla-profiler: Op(s) not lowered: aten::_ctc_loss, aten::_ctc_loss_backward, Please open a GitHub issue with the above op lowering requests.
pt-xla-profiler: CompileTime too frequent: 23 counts during 12 steps
pt-xla-profiler: TransferFromServerTime too frequent: 12 counts during 12 steps
Adding the lines below to the code is enough (setting them at the very top of the script, before torch_xla is imported, is the safe option):
import os
os.environ['PT_XLA_DEBUG'] = '1'
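Note that in the earlier attempt the variable came after the script name, so the shell passed PT_XLA_DEBUG=1 to the script as an ordinary command-line argument instead of exporting it into the environment. Prefixing it to the command should work as well:
!PT_XLA_DEBUG=1 python runTrain_n_to_n_TPU_single.py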