pytorch / xla

Enabling PyTorch on Google TPU

Home Page:https://pytorch.org/xla/release/1.13/index.html

Running code on a large TPU VM Pod causes SIGSEGV

tgisaturday opened this issue

πŸ› Bug

To Reproduce

Steps to reproduce the behavior:

  1. Create TPU VM Pod instance (minimum v3-128)
  2. Git clone https://github.com/tgisaturday/dalle-lightning-tpu
  3. Run pip install -r requirements.txt
  4. Run main.py with the fake_data generator
    python3 -m torch_xla.distributed.xla_dist --tpu=POD_NAME --restart-tpuvm-pod-server -- python3 /home/taehoon.kim/taming-transformers-tpu/main.py --use_tpus --refresh_rate 1 --disc_start 1 --fake_data
  5. The process successfully runs on pods under v3-64, but fails on v3-128.

2021-07-08 08:28:12 10.164.0.33 [0] File "/home/taehoon.kim/.local/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/tpu_spawn.py", line 265, in start_training
2021-07-08 08:28:12 10.164.0.33 [0] xmp.spawn(self.new_process, **self.xmp_spawn_kwargs)
2021-07-08 08:28:12 10.164.0.33 [0] File "/usr/local/lib/python3.8/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 388, in spawn
2021-07-08 08:28:12 10.164.0.33 [0] return torch.multiprocessing.start_processes(
2021-07-08 08:28:12 10.164.0.33 [0] File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
2021-07-08 08:28:12 10.164.0.33 [0] while not context.join():
2021-07-08 08:28:12 10.164.0.33 [0] File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 130, in join
2021-07-08 08:28:12 10.164.0.33 [0] raise ProcessExitedException(
2021-07-08 08:28:12 10.164.0.33 [0] torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGSEGV
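
For readers unfamiliar with how a parent process learns that a worker died from a signal, here is a minimal sketch using only the Python standard library. It does not use torch or torch_xla; the child program and the helper names (`describe_exit`, `run_crashing_child`) are made up for illustration. On POSIX, a child killed by a signal is reported with a negative return code, which is how a message like "terminated with signal SIGSEGV" can be produced.

```python
import signal
import subprocess
import sys

# Hypothetical stand-in for a crashing worker: the child sends SIGSEGV to itself.
CHILD = "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"


def describe_exit(returncode: int) -> str:
    """Turn a POSIX return code into a readable description."""
    if returncode < 0:
        # A negative return code means the child was killed by signal -returncode.
        return f"terminated with signal {signal.Signals(-returncode).name}"
    return f"exited with code {returncode}"


def run_crashing_child() -> int:
    """Start the child in a fresh interpreter and return its return code."""
    proc = subprocess.run([sys.executable, "-c", CHILD], stderr=subprocess.DEVNULL)
    return proc.returncode


if __name__ == "__main__":
    print(describe_exit(run_crashing_child()))
```

The exception only says that a worker died. The reason has to come from the worker's own output, which is why the full log is attached under Additional context.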

Expected behavior

Training runs without any problems.

Environment

  • Reproducible on XLA backend [CPU/TPU]: TPU
  • torch_xla version: torch_xla 1.8 with RUNTIME_VERSION v2_alpha on GCP.

Additional context

Here's the full error log.

2021-07-08 08:28:12 10.164.0.33 [0] *** SIGSEGV (@0x7f97c0e8e528), see gl__________25#s15 received by PID 16710 (TID 16710) on cpu 32; stack trace: ***
2021-07-08 08:28:12 10.164.0.33 [0] PC: @ 0x7f98b808cb51 (unknown) (unknown)
2021-07-08 08:28:12 10.164.0.33 [0] @ 0x7f96c2d8a1e0 976 (unknown)
2021-07-08 08:28:12 10.164.0.33 [0] @ 0x7f98bbe71210 (unknown) (unknown)
2021-07-08 08:28:12 10.164.0.33 [0] @ 0x7f98b808405a 144 GOMP_parallel
2021-07-08 08:28:12 10.164.0.33 [0] @ 0x7ffc5c456360 (unknown) (unknown)
2021-07-08 08:28:12 10.164.0.33 [0] https://symbolize.stripped_domain/r/?trace=7f98b808cb51,7f96c2d8a1df,7f98bbe7120f,7f98b8084059,7ffc5c45635f&map=5f4fb88af97be3ecacc71363136bb015b2a07119:7f98b8076000-7f98b809a08c,ca1b7ab241ee28147b3d590cadb5dc1b:7f96b608b000-7f96c30bdb20
2021-07-08 08:28:12 10.164.0.33 [0] E0708 08:28:12.243922 16710 coredump_hook.cc:292] RAW: Remote crash data gathering hook invoked.
2021-07-08 08:28:12 10.164.0.33 [0] E0708 08:28:12.243990 16710 coredump_hook.cc:384] RAW: Skipping coredump since rlimit was 0 at process start.
2021-07-08 08:28:12 10.164.0.33 [0] E0708 08:28:12.244007 16710 client.cc:222] RAW: Coroner client retries enabled (b/136286901), will retry for up to 30 sec.
2021-07-08 08:28:12 10.164.0.33 [0] E0708 08:28:12.244019 16710 coredump_hook.cc:447] RAW: Sending fingerprint to remote end.
2021-07-08 08:28:12 10.164.0.33 [0] E0708 08:28:12.244026 16710 coredump_socket.cc:124] RAW: Stat failed errno=2 on socket /var/google/services/logmanagerd/remote_coredump.socket
2021-07-08 08:28:12 10.164.0.33 [0] E0708 08:28:12.244040 16710 coredump_hook.cc:451] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] Missing crash reporting socket. Is the listener running?
2021-07-08 08:28:12 10.164.0.33 [0] E0708 08:28:12.244046 16710 coredump_hook.cc:525] RAW: Discarding core.
2021-07-08 08:28:12 10.164.0.33 [0] https://symbolize.stripped_domain/r/?trace=7f98bbe19edb,7f98bbe7120f,7f96c2ca7183,7f96c2ca7b0e,7f96c2ca8daa,7f96c2ca7cdc,7f96c2ca8727,7f96c2d8a3b0,7f98bbe7120f,7f98b8084059,7ffc5c45635f&map=5f4fb88af97be3ecacc71363136bb015b2a07119:7f98b8076000-7f98b809a08c,ca1b7ab241ee28147b3d590cadb5dc1b:7f96b608b000-7f96c30bdb20
2021-07-08 08:28:12 10.164.0.33 [0] E0708 08:28:12.252956 16710 process_state.cc:1061] RAW: Signal 11 raised at PC: 0x7f98bbe19edb while already in FailureSignalHandler!
2021-07-08 08:28:12 10.164.0.33 [0] E0708 08:28:12.252966 16710 process_state.cc:1095] RAW: Raising 11 signal with default behavior
2021-07-08 08:28:12 10.164.0.33 [0] https://symbolize.stripped_domain/r/?trace=7f98b808ecab,7f98bbe7120f&map=5f4fb88af97be3ecacc71363136bb015b2a07119:7f98b8076000-7f98b809a08c
2021-07-08 08:28:12 10.164.0.33 [0] *** SIGTERM received by PID 16711 (TID 16711) on cpu 9 from PID 16165; stack trace: ***
2021-07-08 08:28:12 10.164.0.33 [0] PC: @ 0x7f98b808ecab (unknown) (unknown)
2021-07-08 08:28:12 10.164.0.33 [0] @ 0x7f96c2d8a1e0 976 (unknown)
2021-07-08 08:28:12 10.164.0.33 [0] @ 0x7f98bbe71210 (unknown) (unknown)
2021-07-08 08:28:12 10.164.0.33 [0] https://symbolize.stripped_domain/r/?trace=7f98b808ecab,7f96c2d8a1df,7f98bbe7120f&map=5f4fb88af97be3ecacc71363136bb015b2a07119:7f98b8076000-7f98b809a08c,ca1b7ab241ee28147b3d590cadb5dc1b:7f96b608b000-7f96c30bdb20
2021-07-08 08:28:12 10.164.0.33 [0] E0708 08:28:12.411338 16711 coredump_hook.cc:250] RAW: Remote crash gathering disabled for SIGTERM.
2021-07-08 08:28:12 10.164.0.33 [0] E0708 08:28:12.424205 16711 process_state.cc:771] RAW: Raising signal 15 with default behavior
2021-07-08 08:28:12 10.164.0.33 [0] https://symbolize.stripped_domain/r/?trace=7f98b808ecab,7f98bbe7120f&map=5f4fb88af97be3ecacc71363136bb015b2a07119:7f98b8076000-7f98b809a08c
2021-07-08 08:28:12 10.164.0.33 [0] *** SIGTERM received by PID 16716 (TID 16716) on cpu 21 from PID 16165; stack trace: ***
2021-07-08 08:28:12 10.164.0.33 [0] PC: @ 0x7f98b808ecab (unknown) (unknown)
2021-07-08 08:28:12 10.164.0.33 [0] @ 0x7f96c2d8a1e0 976 (unknown)
2021-07-08 08:28:12 10.164.0.33 [0] @ 0x7f98bbe71210 (unknown) (unknown)
2021-07-08 08:28:12 10.164.0.33 [0] https://symbolize.stripped_domain/r/?trace=7f98b808ecab,7f96c2d8a1df,7f98bbe7120f&map=5f4fb88af97be3ecacc71363136bb015b2a07119:7f98b8076000-7f98b809a08c,ca1b7ab241ee28147b3d590cadb5dc1b:7f96b608b000-7f96c30bdb20
2021-07-08 08:28:12 10.164.0.33 [0] E0708 08:28:12.445436 16716 coredump_hook.cc:250] RAW: Remote crash gathering disabled for SIGTERM.
2021-07-08 08:28:12 10.164.0.33 [0] E0708 08:28:12.458199 16716 process_state.cc:771] RAW: Raising signal 15 with default behavior
2021-07-08 08:28:12 10.164.0.33 [0] https://symbolize.stripped_domain/r/?trace=7f98b808ecab,7f98bbe7120f&map=5f4fb88af97be3ecacc71363136bb015b2a07119:7f98b8076000-7f98b809a08c
2021-07-08 08:28:12 10.164.0.33 [0] *** SIGTERM received by PID 16733 (TID 16733) on cpu 5 from PID 16165; stack trace: ***
2021-07-08 08:28:12 10.164.0.33 [0] PC: @ 0x7f98b808ecab (unknown) (unknown)
2021-07-08 08:28:12 10.164.0.33 [0] @ 0x7f96c2d8a1e0 976 (unknown)
2021-07-08 08:28:12 10.164.0.33 [0] @ 0x7f98bbe71210 (unknown) (unknown)
2021-07-08 08:28:12 10.164.0.33 [0] https://symbolize.stripped_domain/r/?trace=7f98b808ecab,7f96c2d8a1df,7f98bbe7120f&map=5f4fb88af97be3ecacc71363136bb015b2a07119:7f98b8076000-7f98b809a08c,ca1b7ab241ee28147b3d590cadb5dc1b:7f96b608b000-7f96c30bdb20
2021-07-08 08:28:12 10.164.0.33 [0] E0708 08:28:12.478726 16733 coredump_hook.cc:250] RAW: Remote crash gathering disabled for SIGTERM.
2021-07-08 08:28:12 10.164.0.33 [0] E0708 08:28:12.491271 16733 process_state.cc:771] RAW: Raising signal 15 with default behavior
2021-07-08 08:28:12 10.164.0.33 [0] https://symbolize.stripped_domain/r/?trace=7f98b808ecab,7f98bbe7120f&map=5f4fb88af97be3ecacc71363136bb015b2a07119:7f98b8076000-7f98b809a08c
2021-07-08 08:28:12 10.164.0.33 [0] *** SIGTERM received by PID 16738 (TID 16738) on cpu 20 from PID 16165; stack trace: ***
2021-07-08 08:28:12 10.164.0.33 [0] PC: @ 0x7f98b808ecab (unknown) (unknown)
2021-07-08 08:28:12 10.164.0.33 [0] @ 0x7f96c2d8a1e0 976 (unknown)
2021-07-08 08:28:12 10.164.0.33 [0] @ 0x7f98bbe71210 (unknown) (unknown)
2021-07-08 08:28:12 10.164.0.33 [0] https://symbolize.stripped_domain/r/?trace=7f98b808ecab,7f96c2d8a1df,7f98bbe7120f&map=5f4fb88af97be3ecacc71363136bb015b2a07119:7f98b8076000-7f98b809a08c,ca1b7ab241ee28147b3d590cadb5dc1b:7f96b608b000-7f96c30bdb20
2021-07-08 08:28:12 10.164.0.33 [0] E0708 08:28:12.512143 16738 coredump_hook.cc:250] RAW: Remote crash gathering disabled for SIGTERM.
2021-07-08 08:28:12 10.164.0.33 [0] E0708 08:28:12.525030 16738 process_state.cc:771] RAW: Raising signal 15 with default behavior
2021-07-08 08:28:12 10.164.0.33 [0] https://symbolize.stripped_domain/r/?trace=7f98b808ecab,7f98bbe7120f&map=5f4fb88af97be3ecacc71363136bb015b2a07119:7f98b8076000-7f98b809a08c
2021-07-08 08:28:12 10.164.0.33 [0] *** SIGTERM received by PID 16742 (TID 16742) on cpu 34 from PID 16165; stack trace: ***
2021-07-08 08:28:12 10.164.0.33 [0] PC: @ 0x7f98b808ecab (unknown) (unknown)
2021-07-08 08:28:12 10.164.0.33 [0] @ 0x7f96c2d8a1e0 976 (unknown)
2021-07-08 08:28:12 10.164.0.33 [0] @ 0x7f98bbe71210 (unknown) (unknown)
2021-07-08 08:28:12 10.164.0.33 [0] https://symbolize.stripped_domain/r/?trace=7f98b808ecab,7f96c2d8a1df,7f98bbe7120f&map=5f4fb88af97be3ecacc71363136bb015b2a07119:7f98b8076000-7f98b809a08c,ca1b7ab241ee28147b3d590cadb5dc1b:7f96b608b000-7f96c30bdb20
2021-07-08 08:28:12 10.164.0.33 [0] E0708 08:28:12.545681 16742 coredump_hook.cc:250] RAW: Remote crash gathering disabled for SIGTERM.
2021-07-08 08:28:12 10.164.0.33 [0] E0708 08:28:12.558362 16742 process_state.cc:771] RAW: Raising signal 15 with default behavior
2021-07-08 08:28:12 10.164.0.33 [0] https://symbolize.stripped_domain/r/?trace=7f98b808ecab,7f98bbe7120f&map=5f4fb88af97be3ecacc71363136bb015b2a07119:7f98b8076000-7f98b809a08c
2021-07-08 08:28:12 10.164.0.33 [0] *** SIGTERM received by PID 16746 (TID 16746) on cpu 44 from PID 16165; stack trace: ***
2021-07-08 08:28:12 10.164.0.33 [0] PC: @ 0x7f98b808ecab (unknown) (unknown)
2021-07-08 08:28:12 10.164.0.33 [0] @ 0x7f96c2d8a1e0 976 (unknown)
2021-07-08 08:28:12 10.164.0.33 [0] @ 0x7f98bbe71210 (unknown) (unknown)
2021-07-08 08:28:12 10.164.0.33 [0] https://symbolize.stripped_domain/r/?trace=7f98b808ecab,7f96c2d8a1df,7f98bbe7120f&map=5f4fb88af97be3ecacc71363136bb015b2a07119:7f98b8076000-7f98b809a08c,ca1b7ab241ee28147b3d590cadb5dc1b:7f96b608b000-7f96c30bdb20
2021-07-08 08:28:12 10.164.0.33 [0] E0708 08:28:12.578955 16746 coredump_hook.cc:250] RAW: Remote crash gathering disabled for SIGTERM.
2021-07-08 08:28:12 10.164.0.33 [0] E0708 08:28:12.591588 16746 process_state.cc:771] RAW: Raising signal 15 with default behavior
2021-07-08 08:28:12 10.164.0.33 [0] https://symbolize.stripped_domain/r/?trace=7f98b808ecab,7f98bbe7120f&map=5f4fb88af97be3ecacc71363136bb015b2a07119:7f98b8076000-7f98b809a08c
2021-07-08 08:28:12 10.164.0.33 [0] *** SIGTERM received by PID 16750 (TID 16750) on cpu 47 from PID 16165; stack trace: ***
2021-07-08 08:28:12 10.164.0.33 [0] PC: @ 0x7f98b808ecab (unknown) (unknown)
2021-07-08 08:28:12 10.164.0.33 [0] @ 0x7f96c2d8a1e0 976 (unknown)
2021-07-08 08:28:12 10.164.0.33 [0] @ 0x7f98bbe71210 (unknown) (unknown)
2021-07-08 08:28:12 10.164.0.33 [0] https://symbolize.stripped_domain/r/?trace=7f98b808ecab,7f96c2d8a1df,7f98bbe7120f&map=5f4fb88af97be3ecacc71363136bb015b2a07119:7f98b8076000-7f98b809a08c,ca1b7ab241ee28147b3d590cadb5dc1b:7f96b608b000-7f96c30bdb20
2021-07-08 08:28:12 10.164.0.33 [0] E0708 08:28:12.612571 16750 coredump_hook.cc:250] RAW: Remote crash gathering disabled for SIGTERM.
2021-07-08 08:28:12 10.164.0.33 [0] E0708 08:28:12.625375 16750 process_state.cc:771] RAW: Raising signal 15 with default behavior
2021-07-08 08:28:12 10.164.0.33 [0] Traceback (most recent call last):
2021-07-08 08:28:12 10.164.0.33 [0] File "/home/taehoon.kim/taming-transformers-tpu/main.py", line 235, in <module>
2021-07-08 08:28:12 10.164.0.33 [0] trainer.fit(model, train_dataloaders=train_loader, val_dataloaders=val_loader)
2021-07-08 08:28:12 10.164.0.33 [0] File "/home/taehoon.kim/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 509, in fit
2021-07-08 08:28:12 10.164.0.33 [0] self._run(model)
2021-07-08 08:28:12 10.164.0.33 [0] File "/home/taehoon.kim/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 870, in _run
2021-07-08 08:28:12 10.164.0.33 [0] self._dispatch()
2021-07-08 08:28:12 10.164.0.33 [0] File "/home/taehoon.kim/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 912, in _dispatch
2021-07-08 08:28:12 10.164.0.33 [0] self.accelerator.start_training(self)
2021-07-08 08:28:12 10.164.0.33 [0] File "/home/taehoon.kim/.local/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 96, in start_training
2021-07-08 08:28:12 10.164.0.33 [0] self.training_type_plugin.start_training(trainer)
2021-07-08 08:28:12 10.164.0.33 [0] File "/home/taehoon.kim/.local/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/tpu_spawn.py", line 265, in start_training
2021-07-08 08:28:12 10.164.0.33 [0] xmp.spawn(self.new_process, **self.xmp_spawn_kwargs)
2021-07-08 08:28:12 10.164.0.33 [0] File "/usr/local/lib/python3.8/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 388, in spawn
2021-07-08 08:28:12 10.164.0.33 [0] return torch.multiprocessing.start_processes(
2021-07-08 08:28:12 10.164.0.33 [0] File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
2021-07-08 08:28:12 10.164.0.33 [0] while not context.join():
2021-07-08 08:28:12 10.164.0.33 [0] File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 130, in join
2021-07-08 08:28:12 10.164.0.33 [0] raise ProcessExitedException(
2021-07-08 08:28:12 10.164.0.33 [0] torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGSEGV
client_loop: send disconnect: Broken pipe
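
The log above contains the line "Skipping coredump since rlimit was 0 at process start." As a hedged sketch, the core-file limit of a Python process can be inspected and raised with the standard `resource` module before any workers are started. The helper names are hypothetical, and whether raising the limit changes the behavior of the crash hook shown in the log, or where a core file would be written on a TPU VM, is an assumption that is not confirmed by this thread.

```python
import resource


def core_limit():
    """Return the (soft, hard) core-file size limits of this process."""
    return resource.getrlimit(resource.RLIMIT_CORE)


def raise_core_soft_limit():
    """Raise the soft core limit to the hard limit and return the new limits.

    Child processes started afterwards inherit the new limit.
    """
    _, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)


if __name__ == "__main__":
    print("before:", core_limit())
    print("after: ", raise_core_soft_limit())
```

If the hard limit is itself 0, this changes nothing and the limit would need to be raised in the shell or service that launches the job.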

I've looked into this a little this week and don't have a fix yet. I was able to confirm that the SIGSEGV is not really specific to your model. It seems that any model in PyTorch Lightning will crash on v3-128 if the images are moderately large (e.g. a 28x28x3 image is OK, but a 256x256x3 image like yours results in a crash).
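
To put the two image sizes in perspective, here is a small arithmetic sketch. It assumes 4 bytes per element (float32), which is an assumption and not something stated in this thread, and it only compares raw per-image sizes; it does not show that memory is the cause of the crash.

```python
from math import prod


def image_bytes(shape, bytes_per_element=4):
    """Raw size of one image tensor; 4 bytes per element assumes float32."""
    return prod(shape) * bytes_per_element


small = (28, 28, 3)    # size reported as OK
large = (256, 256, 3)  # size reported as crashing

if __name__ == "__main__":
    print(prod(small), "elements,", image_bytes(small), "bytes")
    print(prod(large), "elements,", image_bytes(large), "bytes")
    print("ratio:", round(image_bytes(large) / image_bytes(small), 1))
```

Under that assumption each 256x256x3 image is roughly 84 times larger than a 28x28x3 one.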

I had a few clarification questions for the Lightning team in Lightning-AI/lightning#8358

In particular, I am wondering whether PyTorch Lightning handles the init differently for v3-32 vs. v3-128. Or maybe there is a memory management issue in the way Lightning sets up the workers that gets worse as the number of TPU cores increases.

I will keep digging to find a more informative error message from the SIGSEGV, and hopefully I can get some hints from the Lightning team as well.
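
One generic way to get a bit more information out of a SIGSEGV in a Python process is the standard `faulthandler` module, which prints the Python-level traceback when a fatal signal arrives. The sketch below is only an illustration with a hypothetical child program that crashes on purpose; it is not what was done in this thread, and faulthandler shows Python frames only, not native ones.

```python
import subprocess
import sys

# Hypothetical child: enables faulthandler, then sends SIGSEGV to itself.
CHILD = (
    "import faulthandler, os, signal\n"
    "faulthandler.enable()\n"
    "def crash():\n"
    "    os.kill(os.getpid(), signal.SIGSEGV)\n"
    "crash()\n"
)


def run_with_faulthandler():
    """Run the child and return (return code, captured stderr)."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD], capture_output=True, text=True
    )
    return proc.returncode, proc.stderr


if __name__ == "__main__":
    code, err = run_with_faulthandler()
    print("return code:", code)
    print(err)
```

In a real training script the equivalent would be calling `faulthandler.enable()` at the top of the entry point, or setting the `PYTHONFAULTHANDLER=1` environment variable for the workers.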

Found out the cause and fixed it.

I'll leave the link here for anyone having a similar problem with TPU + XLA + Lightning.

Lightning-AI/lightning#8358 (comment)