[Help Request] Failed to reset when using multiprocessing

Question

JiyuanTHU opened this issue 2 months ago

High Level Description

When I use multiprocessing to create multiple envs, the following error occurs intermittently (not on every run). The error log is attached below. I would appreciate any help with this.

Version

smarts 1.4.0
sumo 1.19.0

Operating System

Ubuntu 20.04

Problems

ERROR:SMARTS:Failed to successfully reset after 1 tries.
Process Process-64:
Traceback (most recent call last):
File "/home/yuan/anaconda3/envs/safe-smarts/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/home/yuan/anaconda3/envs/safe-smarts/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/yuan/Ji_ws/Learning-from-Intervention/tianshou/env/worker/subproc.py", line 96, in _worker
obs, info = env.reset(**data)
File "/home/yuan/Ji_ws/Learning-from-Intervention/utils/env_wrapper.py", line 155, in reset
obs, info = self.env.reset(seed=seed, options=options)
File "/home/yuan/anaconda3/envs/safe-smarts/lib/python3.8/site-packages/smarts/env/gymnasium/wrappers/single_agent.py", line 78, in reset
obs, info = self.env.reset(seed=seed, options=options)
File "/home/yuan/anaconda3/envs/safe-smarts/lib/python3.8/site-packages/gymnasium/wrappers/order_enforcing.py", line 61, in reset
return self.env.reset(**kwargs)
File "/home/yuan/anaconda3/envs/safe-smarts/lib/python3.8/site-packages/smarts/env/gymnasium/hiway_env_v1.py", line 348, in reset
observations = self._smarts.reset(
File "/home/yuan/anaconda3/envs/safe-smarts/lib/python3.8/site-packages/smarts/core/smarts.py", line 471, in reset
raise first_exception
File "/home/yuan/anaconda3/envs/safe-smarts/lib/python3.8/site-packages/smarts/core/smarts.py", line 464, in reset
return self._reset(scenario, start_time)
File "/home/yuan/anaconda3/envs/safe-smarts/lib/python3.8/site-packages/smarts/core/smarts.py", line 501, in _reset
self.setup(scenario)
File "/home/yuan/anaconda3/envs/safe-smarts/lib/python3.8/site-packages/smarts/core/smarts.py", line 539, in setup
provider_state = self._setup_providers(self._scenario)
File "/home/yuan/anaconda3/envs/safe-smarts/lib/python3.8/site-packages/smarts/core/smarts.py", line 1230, in _setup_providers
new_provider_state = self._handle_provider(provider, provider_error)
File "/home/yuan/anaconda3/envs/safe-smarts/lib/python3.8/site-packages/smarts/core/smarts.py", line 1265, in _handle_provider
provider_state, recovered = provider.recover(
File "/home/yuan/anaconda3/envs/safe-smarts/lib/python3.8/site-packages/smarts/core/sumo_traffic_simulation.py", line 463, in recover
raise error
File "/home/yuan/anaconda3/envs/safe-smarts/lib/python3.8/site-packages/smarts/core/smarts.py", line 1228, in _setup_providers
new_provider_state = provider.setup(scenario)
File "/home/yuan/anaconda3/envs/safe-smarts/lib/python3.8/site-packages/smarts/core/sumo_traffic_simulation.py", line 324, in setup
self._initialize_traci_conn()
File "/home/yuan/anaconda3/envs/safe-smarts/lib/python3.8/site-packages/smarts/core/sumo_traffic_simulation.py", line 239, in _initialize_traci_conn
self._traci_conn.setOrder(0)
TypeError: 'NoneType' object is not callable
Traceback (most recent call last):
File "main_smarts_tianshou_safe.py", line 409, in <module>
test_discrete_sac()
File "main_smarts_tianshou_safe.py", line 209, in test_discrete_sac
result = offpolicy_trainer(
File "/home/yuan/Ji_ws/Learning-from-Intervention/tianshou/trainer/offpolicy.py", line 133, in offpolicy_trainer
return OffpolicyTrainer(*args, **kwargs).run()
File "/home/yuan/Ji_ws/Learning-from-Intervention/tianshou/trainer/base.py", line 441, in run
deque(self, maxlen=0) # feed the entire iterator into a zero-length deque
File "/home/yuan/Ji_ws/Learning-from-Intervention/tianshou/trainer/base.py", line 315, in __next__
test_stat, self.stop_fn_flag = self.test_step()
File "/home/yuan/Ji_ws/Learning-from-Intervention/tianshou/trainer/base.py", line 344, in test_step
test_result = test_episode(
File "/home/yuan/Ji_ws/Learning-from-Intervention/tianshou/trainer/utils.py", line 27, in test_episode
result = collector.collect(n_episode=n_episode)
File "/home/yuan/Ji_ws/Learning-from-Intervention/tianshou/data/collector.py", line 344, in collect
self._reset_env_with_ids(
File "/home/yuan/Ji_ws/Learning-from-Intervention/tianshou/data/collector.py", line 174, in _reset_env_with_ids
obs_reset, info = self.env.reset(global_ids, **gym_reset_kwargs)
File "/home/yuan/Ji_ws/Learning-from-Intervention/tianshou/env/venvs.py", line 282, in reset
ret_list = [self.workers[i].recv() for i in id]
File "/home/yuan/Ji_ws/Learning-from-Intervention/tianshou/env/venvs.py", line 282, in <listcomp>
ret_list = [self.workers[i].recv() for i in id]
File "/home/yuan/Ji_ws/Learning-from-Intervention/tianshou/env/worker/subproc.py", line 204, in recv
result = self.parent_remote.recv()
File "/home/yuan/anaconda3/envs/safe-smarts/lib/python3.8/multiprocessing/connection.py", line 250, in recv
buf = self._recv_bytes()
File "/home/yuan/anaconda3/envs/safe-smarts/lib/python3.8/multiprocessing/connection.py", line 414, in _recv_bytes
buf = self._recv(4)
File "/home/yuan/anaconda3/envs/safe-smarts/lib/python3.8/multiprocessing/connection.py", line 383, in _recv
raise EOFError
EOFError

Tucker Alban · Answer 1 · Tue Apr 02 2024 02:40:36 GMT+0800 (China Standard Time)

Hello. With multiprocessing you need to use centralized TraCI server management, because the conventional way of acquiring a port from the OS is prone to race conditions when several processes do it at the same time. See https://smarts.readthedocs.io/en/latest/ecosystem/sumo.html#centralized-traci-management

Full context of why this is the case is here: #2139
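
For readers unfamiliar with this kind of race, here is a minimal Python sketch of the general problem. It is not SMARTS or SUMO code, and the helper names (find_free_port, reserve_port, port_is_free) are hypothetical, chosen only for illustration. It assumes the "conventional means" is the common pattern of binding to port 0, reading back the port number, and closing the socket before another program is told to use that port.

```python
import socket
from typing import Tuple

HOST = "127.0.0.1"


def find_free_port() -> int:
    """Probe-and-release: ask the OS for an unused port, then let go of it.

    The port is only known to be free at the moment of the probe. Once the
    socket is closed, any other process may be handed the same port before
    the caller gets around to using it.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.bind((HOST, 0))  # port 0 asks the OS to choose a port
        return probe.getsockname()[1]


def reserve_port() -> Tuple[socket.socket, int]:
    """Keep the socket open so the port stays held until the caller closes it."""
    holder = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    holder.bind((HOST, 0))
    return holder, holder.getsockname()[1]


def port_is_free(port: int) -> bool:
    """Return True if a fresh socket can currently bind to the port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((HOST, port))
        except OSError:
            return False
        return True


if __name__ == "__main__":
    port = find_free_port()
    print(f"probe-and-release returned port {port}; nothing holds it now")

    holder, held_port = reserve_port()
    print(f"port {held_port} free while held? {port_is_free(held_port)}")
    holder.close()
```

With many worker processes each probing for a port independently, two of them can end up with the same number, which would be consistent with the intermittent failure reported above. Having a single process hand out ports removes that window; how the SMARTS centralized server does this internally is described in the linked documentation and is not reproduced here.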

JiyuanTHU · Answer 2 · Fri Apr 19 2024 15:24:54 GMT+0800 (China Standard Time)

Thanks! This helps