hpcaitech / EnergonAI

Large-scale model inference.

Detected RRef Leaks during shutdown, empty pipe, tests_engine failed

nostalgicimp opened this issue · comments

commented

When I try to run engine.shutdown(), it fails due to RRef leaks and a resulting empty-pipe error. As a consequence, when I run the tests, particularly in tests/test_engine, all four collected tests start and then hang.

Possible causes:

  • The master and workers are not shut down gracefully, leaving a residual RPC agent behind. The next test then collides with the existing RPC agent and hangs for a long time before failing with Socket Timeout.
  • "Detected RRef Leaks during shutdown." Per the PyTorch warning, this usually occurs when application code still holds references to RRef instances when calling shutdown(). If the program has completed and the process is exiting, the leaks can be ignored; but if the program keeps running afterwards, they can become memory leaks on the RRef owners.
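To illustrate the second cause: the warning fires whenever a live reference to an RRef survives into shutdown(). The mechanism can be sketched in plain Python with a stand-in class (the `RRef` class and `held` list below are illustrative, not part of the torch.distributed.rpc API):

```python
import gc
import weakref

class RRef:
    """Stand-in for torch.distributed.rpc.RRef (illustrative only)."""
    pass

held = []              # application code holding on to RRefs
r = RRef()
held.append(r)
probe = weakref.ref(r) # lets us observe whether the object is still alive
del r

# At this point a real shutdown() would report a leak: the RRef is
# still reachable through `held`.
assert probe() is not None

held.clear()           # drop all application references before shutdown()
gc.collect()           # make sure Python GC has actually deleted them
assert probe() is None # now the owner could reclaim the RRef
```

The fix on the application side is the same shape: clear every container that still holds an RRef and let GC run before calling rpc.shutdown().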

Debugging output:

============================== test session starts ========================================
platform linux -- Python 3.8.12, pytest-7.2.0, pluggy-1.0.0
plugins: xdist-3.0.2, anyio-3.6.2
collected 4 items                                                                                                                                                                              

tests/test_engine/test_hybrid.py
[12/15/22 09:15:23] INFO     colossalai - colossalai - INFO: /usr/local/lib/python3.8/site-packages/colossalai/context/parallel_context.py:521 set_device                                       
[12/15/22 09:15:23] INFO     colossalai - colossalai - INFO: /usr/local/lib/python3.8/site-packages/colossalai/context/parallel_context.py:521 set_device                                       
[12/15/22 09:15:23] INFO     colossalai - colossalai - INFO: /usr/local/lib/python3.8/site-packages/colossalai/context/parallel_context.py:521 set_device                                       
[12/15/22 09:15:23] INFO     colossalai - colossalai - INFO: /usr/local/lib/python3.8/site-packages/colossalai/context/parallel_context.py:521 set_device                                       
                    INFO     colossalai - colossalai - INFO: process rank 1 is bound to device 1                                                                                                
                    INFO     colossalai - colossalai - INFO: process rank 2 is bound to device 2                                                                                                
                    INFO     colossalai - colossalai - INFO: process rank 0 is bound to device 0                                                                                                
                    INFO     colossalai - colossalai - INFO: process rank 3 is bound to device 3                                                                                                
[12/15/22 09:15:26] INFO     colossalai - colossalai - INFO: /usr/local/lib/python3.8/site-packages/colossalai/context/parallel_context.py:557 set_seed                                         
[12/15/22 09:15:26] INFO     colossalai - colossalai - INFO: /usr/local/lib/python3.8/site-packages/colossalai/context/parallel_context.py:557 set_seed                                         
                    INFO     colossalai - colossalai - INFO: initialized seed on rank 1, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1025,the default       
                             parallel seed is ParallelMode.DATA.                                                                                                                                
[12/15/22 09:15:26] INFO     colossalai - colossalai - INFO: /usr/local/lib/python3.8/site-packages/colossalai/context/parallel_context.py:557 set_seed                                         
[12/15/22 09:15:26] INFO     colossalai - colossalai - INFO: /usr/local/lib/python3.8/site-packages/colossalai/context/parallel_context.py:557 set_seed                                         
                    INFO     colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024,the default       
                             parallel seed is ParallelMode.DATA.                                                                                                                                
                    INFO     colossalai - colossalai - INFO: initialized seed on rank 3, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 2049,the default       
                             parallel seed is ParallelMode.DATA.                                                                                                                                
                    INFO     colossalai - colossalai - INFO: initialized seed on rank 2, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 2048,the default       
                             parallel seed is ParallelMode.DATA.                                                                                                                                
                    INFO     colossalai - colossalai - INFO: /usr/local/lib/python3.8/site-packages/colossalai/initialize.py:117 launch                                                         
                    INFO     colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 1, pipeline parallel size: 2, tensor parallel size: 2                  
[12/15/22 09:15:27] INFO     colossalai - energonai - INFO: /root/Projects/energonai/worker.py:61 __init__                                                                     
                    INFO     colossalai - energonai - INFO: worker0 start                                                                                                                       
[12/15/22 09:15:27] INFO     colossalai - energonai - INFO: /root/Projects/energonai/worker.py:61 __init__                                                                     
                    INFO     colossalai - energonai - INFO: worker1 start                                                                                                                       
[12/15/22 09:15:27] INFO     colossalai - energonai - INFO: /root/Projects/energonai/worker.py:61 __init__                                                                     
                    INFO     colossalai - energonai - INFO: worker2 start                                                                                                                       
[12/15/22 09:15:27] INFO     colossalai - energonai - INFO: /root/Projects/energonai/worker.py:61 __init__                                                                     
                    INFO     colossalai - energonai - INFO: worker3 start                                                                                                                       
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/root/Projects/energonai/pipe.py", line 66, in recv_nowait
    return self.local_queue.get_nowait()
  File "/usr/local/lib/python3.8/queue.py", line 198, in get_nowait
    return self.get(block=False)
  File "/usr/local/lib/python3.8/queue.py", line 167, in get
    raise Empty
_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/Projects/energonai/worker.py", line 68, in _start
    task_entry: TaskEntry = self.input_pipe.recv_nowait()
  File "/root/Projects/energonai/pipe.py", line 68, in recv_nowait
    raise RuntimeError('pipe is empty')
RuntimeError: pipe is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/root/Projects/energonai/worker.py", line 62, in __init__
    self._start()
  File "/root/Projects/energonai/worker.py", line 73, in _start
    time.sleep(0.01)
KeyboardInterrupt
[W rref_context.cpp:156] Detected RRef Leaks during shutdown. This usually occurs when the application code still holds references to RRef instances when calling shutdown(). If the program has completed correctly and the process is exiting, it is OK to ignore these leaks. However, if you program will keep running after this, these leaks could result in memory leaks on RRef owners. Please make sure all RRefs are out of scope and Python GC has deleted them before calling shutdown(): 
Leaking RRef GloballyUniqueId(created_on=3, local_id=0) with fork Id GloballyUniqueId(created_on=3, local_id=1)
Leaking RRef GloballyUniqueId(created_on=4, local_id=0) with fork Id GloballyUniqueId(created_on=4, local_id=1)
Leaking RRef GloballyUniqueId(created_on=1, local_id=0) with fork Id GloballyUniqueId(created_on=1, local_id=1)
Leaking RRef GloballyUniqueId(created_on=2, local_id=0) with fork Id GloballyUniqueId(created_on=2, local_id=1)

[W tensorpipe_agent.cpp:726] RPC agent for worker3 encountered error when reading incoming request from master: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:726] RPC agent for worker2 encountered error when reading incoming request from master: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:726] RPC agent for worker1 encountered error when reading incoming request from master: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:726] RPC agent for worker0 encountered error when reading incoming request from master: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:726] RPC agent for master encountered error when reading incoming request from worker0: pipe closed (this error originated at tensorpipe/core/pipe_impl.cc:356)
[W tensorpipe_agent.cpp:726] RPC agent for master encountered error when reading incoming request from worker1: pipe closed (this error originated at tensorpipe/core/pipe_impl.cc:356)
[W tensorpipe_agent.cpp:726] RPC agent for master encountered error when reading incoming request from worker2: pipe closed (this error originated at tensorpipe/core/pipe_impl.cc:356)
[W tensorpipe_agent.cpp:726] RPC agent for master encountered error when reading incoming request from worker3: pipe closed (this error originated at tensorpipe/core/pipe_impl.cc:356)
Process SpawnProcess-4:
Traceback (most recent call last):
  File "/root/Projects/energonai/pipe.py", line 66, in recv_nowait
    return self.local_queue.get_nowait()
  File "/usr/local/lib/python3.8/queue.py", line 198, in get_nowait
    return self.get(block=False)
  File "/usr/local/lib/python3.8/queue.py", line 167, in get
    raise Empty
_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/Projects/energonai/worker.py", line 68, in _start
    task_entry: TaskEntry = self.input_pipe.recv_nowait()
  File "/root/Projects/energonai/pipe.py", line 68, in recv_nowait
    raise RuntimeError('pipe is empty')
RuntimeError: pipe is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/root/Projects/energonai/worker.py", line 62, in __init__
    self._start()
  File "/root/Projects/energonai/worker.py", line 73, in _start
    time.sleep(0.01)
KeyboardInterrupt
Process SpawnProcess-2:
Traceback (most recent call last):
  File "/root/Projects/energonai/pipe.py", line 66, in recv_nowait
    return self.local_queue.get_nowait()
  File "/usr/local/lib/python3.8/queue.py", line 198, in get_nowait
    return self.get(block=False)
  File "/usr/local/lib/python3.8/queue.py", line 167, in get
    raise Empty
_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/Projects/energonai/worker.py", line 68, in _start
    task_entry: TaskEntry = self.input_pipe.recv_nowait()
  File "/root/Projects/energonai/pipe.py", line 68, in recv_nowait
    raise RuntimeError('pipe is empty')
RuntimeError: pipe is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/root/Projects/energonai/worker.py", line 62, in __init__
    self._start()
  File "/root/Projects/energonai/worker.py", line 73, in _start
    time.sleep(0.01)
KeyboardInterrupt
Process SpawnProcess-3:
.Traceback (most recent call last):
  File "/root/Projects/energonai/pipe.py", line 66, in recv_nowait
    return self.local_queue.get_nowait()
  File "/usr/local/lib/python3.8/queue.py", line 198, in get_nowait
    return self.get(block=False)
  File "/usr/local/lib/python3.8/queue.py", line 167, in get
    raise Empty
_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/Projects/energonai/worker.py", line 68, in _start
    task_entry: TaskEntry = self.input_pipe.recv_nowait()
  File "/root/Projects/energonai/pipe.py", line 68, in recv_nowait
    raise RuntimeError('pipe is empty')
RuntimeError: pipe is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/root/Projects/energonai/worker.py", line 62, in __init__
    self._start()
  File "/root/Projects/energonai/worker.py", line 73, in _start
    time.sleep(0.01)
KeyboardInterrupt

tests/test_engine/test_pp.py
[12/15/22 09:15:31] INFO     colossalai - colossalai - INFO: /usr/local/lib/python3.8/site-packages/colossalai/context/parallel_context.py:521 set_device                                       
[12/15/22 09:15:31] INFO     colossalai - colossalai - INFO: /usr/local/lib/python3.8/site-packages/colossalai/context/parallel_context.py:521 set_device                                       
                    INFO     colossalai - colossalai - INFO: process rank 1 is bound to device 1                                                                                                
                    INFO     colossalai - colossalai - INFO: process rank 0 is bound to device 0                                                                                                
[12/15/22 09:15:32] INFO     colossalai - colossalai - INFO: /usr/local/lib/python3.8/site-packages/colossalai/context/parallel_context.py:557 set_seed                                         
[12/15/22 09:15:32] INFO     colossalai - colossalai - INFO: /usr/local/lib/python3.8/site-packages/colossalai/context/parallel_context.py:557 set_seed                                         
                    INFO     colossalai - colossalai - INFO: initialized seed on rank 1, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 2048,the default       
                             parallel seed is ParallelMode.DATA.                                                                                                                                
                    INFO     colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024,the default       
                             parallel seed is ParallelMode.DATA.                                                                                                                                
                    INFO     colossalai - colossalai - INFO: /usr/local/lib/python3.8/site-packages/colossalai/initialize.py:117 launch                                                         
                    INFO     colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 1, pipeline parallel size: 2, tensor parallel size: 1                  
Process SpawnProcess-6:
Process SpawnProcess-5:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/root/Projects/energonai/worker.py", line 35, in __init__
    trpc.init_rpc(self.rpc_name, rank=self.global_rank + 1, world_size=self.world_size + 1,
  File "/usr/local/lib/python3.8/site-packages/torch/distributed/rpc/__init__.py", line 196, in init_rpc
    _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
  File "/usr/local/lib/python3.8/site-packages/torch/distributed/rpc/__init__.py", line 231, in _init_rpc_backend
    rpc_agent = backend_registry.init_backend(
  File "/usr/local/lib/python3.8/site-packages/torch/distributed/rpc/backend_registry.py", line 101, in init_backend
    return backend.value.init_backend_handler(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/torch/distributed/rpc/backend_registry.py", line 332, in _tensorpipe_init_backend_handler
    group = _init_process_group(store, rank, world_size)
  File "/usr/local/lib/python3.8/site-packages/torch/distributed/rpc/backend_registry.py", line 109, in _init_process_group
    group = dist.ProcessGroupGloo(store, rank, world_size, process_group_timeout)
RuntimeError: Socket Timeout
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/root/Projects/energonai/worker.py", line 35, in __init__
    trpc.init_rpc(self.rpc_name, rank=self.global_rank + 1, world_size=self.world_size + 1,
  File "/usr/local/lib/python3.8/site-packages/torch/distributed/rpc/__init__.py", line 196, in init_rpc
    _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
  File "/usr/local/lib/python3.8/site-packages/torch/distributed/rpc/__init__.py", line 231, in _init_rpc_backend
    rpc_agent = backend_registry.init_backend(
  File "/usr/local/lib/python3.8/site-packages/torch/distributed/rpc/backend_registry.py", line 101, in init_backend
    return backend.value.init_backend_handler(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/torch/distributed/rpc/backend_registry.py", line 332, in _tensorpipe_init_backend_handler
    group = _init_process_group(store, rank, world_size)
  File "/usr/local/lib/python3.8/site-packages/torch/distributed/rpc/backend_registry.py", line 109, in _init_process_group
    group = dist.ProcessGroupGloo(store, rank, world_size, process_group_timeout)
RuntimeError: Socket Timeout
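For context, the repeated `pipe is empty` tracebacks above come from a non-blocking receive in pipe.py: `Queue.get_nowait()` raises `queue.Empty`, which is re-raised as a `RuntimeError` that the worker loop catches before sleeping. A minimal sketch of that pattern (class and method names assumed from the tracebacks, not copied from the repo):

```python
import queue

class Pipe:
    """Minimal sketch of the non-blocking recv pattern seen in pipe.py."""

    def __init__(self):
        self.local_queue = queue.Queue()

    def recv_nowait(self):
        try:
            # Non-blocking get: raises queue.Empty immediately if no task
            return self.local_queue.get_nowait()
        except queue.Empty:
            # Re-raised as RuntimeError; the worker's _start() loop catches
            # this and calls time.sleep(0.01) before polling again
            raise RuntimeError('pipe is empty')
```

This polling loop only exits via an outside signal (here, the KeyboardInterrupt delivered during teardown), which is why every worker process dies inside `time.sleep(0.01)` in the logs above.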

System settings:

Python 3.8
CUDA Version: 11.3
PyTorch Version: 1.12.1+cu113
CUDA Version in PyTorch Build: 11.3
PyTorch CUDA Version Match: ✓
CUDA Extension: ✓

colossalai 0.1.11rc1+torch1.12cu11.3
energonai 0.0.1+torch1.12cu11.3
torch 1.12.1+cu113
torchaudio 0.10.2+cu113
torchvision 0.13.1+cu113

commented

Any updates on this issue? @ver217