huawei-noah / SMARTS

Scalable Multi-Agent RL Training School for Autonomous Driving


[Help Request] traci

knightcalvert opened this issue · comments

High Level Description

I have noticed that I have the same problem as #2127, so I updated to the latest SMARTS version, but the problems still exist.
Problem 1:

Could not connect to TraCI server at localhost:59573 [Errno 111] Connection refused
 Retrying in 0.05 seconds
Could not connect to TraCI server at localhost:59573 [Errno 111] Connection refused
 Retrying in 0.05 seconds
Could not connect to TraCI server at localhost:59573 [Errno 111] Connection refused
 Retrying in 0.05 seconds

In the beginning, TraCI tries to connect to different ports. However, after running for 10 hours, TraCI kept retrying the same port with constant failures, so my code got stuck and I had to rerun it.

Problem 2:
This problem is like #2127, caused by:

  File "./smart_master/SMARTS/smarts/core/sumo_traffic_simulation.py", line 239, in _initialize_traci_conn
    self._traci_conn.setOrder(0)
TypeError: 'NoneType' object is not callable

So my SMARTS instance can't reset successfully. I updated the code, but this problem still occurs occasionally.

Also, can I turn off this TraCI warning? 90% of my console output is TraCI warnings and I can't find the information I actually need. Thank you very much.

Version

The latest version

Operating System

Ubuntu

Problems

No response

Hello @knightcalvert, we are currently using this method to acquire port numbers for SUMO.

I did a bit of digging into SUMO port creation and shutdown. I think this line, which kills SUMO, may be preventing the destructor from being called and the cleanup of SUMO's used port:

self._sumo_proc.kill()

I'll try tomorrow to see if I can apply a cleaner shutdown without blocking on SUMO closing.
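
For context, a cleaner shutdown usually means asking the process to exit and only killing it if it does not comply in time. Below is a minimal sketch of that pattern, assuming `self._sumo_proc` is a `subprocess.Popen` handle; the function name and timeout are illustrative, not SMARTS's actual code:

    import subprocess

    def shut_down_sumo(sumo_proc: subprocess.Popen, timeout: float = 5.0) -> None:
        """Ask SUMO to exit cleanly before falling back to a hard kill."""
        if sumo_proc.poll() is not None:
            return  # process already exited; nothing to clean up
        sumo_proc.terminate()  # SIGTERM gives SUMO a chance to run its own cleanup
        try:
            sumo_proc.wait(timeout=timeout)  # bounded wait so shutdown never hangs
        except subprocess.TimeoutExpired:
            sumo_proc.kill()  # last resort; may leave the TraCI port unreleased
            sumo_proc.wait()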

I will also squash the messages. They are related to one of SUMO's methods that uses print for warnings:

https://github.com/eclipse-sumo/sumo/blob/56aceb87d847397941936c28934f9097e0c03f98/tools/traci/main.py#L87-L106
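
Because those warnings come from plain print() calls rather than the logging module, one way to quiet them is to temporarily redirect stdout around the connect call. A rough sketch of that idea (quiet_connect is a made-up helper, not what the linked commit does):

    import contextlib
    import io

    import traci  # SUMO's Python TraCI client

    def quiet_connect(port: int, num_retries: int = 100):
        """Connect to a TraCI server while discarding its print()-based retry warnings."""
        with contextlib.redirect_stdout(io.StringIO()):
            return traci.connect(port=port, numRetries=num_retries)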

@knightcalvert SUMO logging should now be squashed: 16665d5

An update: kill() was intended to prevent zombie SUMO processes; without it, some SUMO zombie processes can be left behind. I am trying to find a solution that closes SUMO processes gracefully but does not leave behind zombie processes.

#2140 is intended to fix the issue. From current stress testing, kill() does not leave behind zombie processes or lost ports. Processes that exit without closing the SUMO process will cause zombie processes, mainly in the case of an exception or when a process is closed without calling SMARTS.destroy(). I am going to run a second set of stress tests to see if reusing a single process causes a problem.

@knightcalvert One thing to note: even after this change, if you are using SMARTS directly, make sure that each SMARTS instance calls destroy() at the end of its use to guarantee cleanup of resources. If you are using the gym-style environment, close() is necessary.
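
For illustration, the kind of cleanup being described looks like this in user code; make_env, policy and num_episodes are placeholders, and the 4-tuple step signature follows the classic gym API:

    # Illustrative cleanup pattern only; make_env(), policy() and num_episodes
    # stand in for however you build and drive your SMARTS/gym environment.
    env = make_env()
    try:
        for _ in range(num_episodes):
            obs = env.reset()
            done = False
            while not done:
                obs, reward, done, info = env.step(policy(obs))
    finally:
        # Always release the SUMO/TraCI process and its port,
        # even if an exception interrupts training.
        env.close()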

After updating the code, it helped; the useless output is gone.
But maybe because of my parallel runs, I still get:

[] TraCI could not connect in time to 'localhost:52151' [Could not connect in 101 tries]
[] TraCI could not connect in time to 'localhost:52151' [Could not connect in 101 tries]
[] TraCI could not connect in time to 'localhost:52151' [Could not connect in 101 tries]

And my code is using close():

    def close(self):
        self.base_env.close()

I'm not familiar with TraCI or ports, and I'm just curious: if one port cannot connect after many tries, it seems useless to keep trying that port. Can it try to connect to another one instead?

> After updating the code, it helped; the useless output is gone.
> But maybe because of my parallel runs, I still get:
>
> [] TraCI could not connect in time to 'localhost:52151' [Could not connect in 101 tries]
> [] TraCI could not connect in time to 'localhost:52151' [Could not connect in 101 tries]
> [] TraCI could not connect in time to 'localhost:52151' [Could not connect in 101 tries]

I think I understand the issue better now. It does not seem to have to do with the number of ports but with the SUMO server and SMARTS somehow not pairing up.

My only thought is that somehow this is happening:

[diagram: sumo_problem]

Once a connection is established, SMARTS does not check whether its instance of the TraCI server (SUMO) is still alive. So I think this might be the result of an extremely low-probability race condition where a SMARTS instance manages to connect to a different TraCI server by bad luck, locking out the actual owner of that TraCI server.

We get ports by asking the OS for a random free port (out of 64512 standard ports), so the chance of ports colliding is very low but possible.
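
The usual way to get an OS-recommended free port is to bind a socket to port 0 and read back the assigned number. A minimal sketch of that idea (not SMARTS's exact code):

    import socket

    def random_free_port() -> int:
        """Ask the OS for a currently free TCP port (the bind-to-port-0 trick)."""
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.bind(("127.0.0.1", 0))  # port 0 tells the OS to pick any free port
            return sock.getsockname()[1]
    # Note: the port is released as soon as the socket closes, so another process
    # can grab it before SUMO binds it -- that window is where collisions can occur.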

> And my code is using close():
>
>     def close(self):
>         self.base_env.close()
>
> I'm not familiar with TraCI or ports, and I'm just curious: if one port cannot connect after many tries, it seems useless to keep trying that port. Can it try to connect to another one instead?

It was assumed that it would connect or retry with a different port. I will put in a patch that will reattempt with a different port while I think of a way to gracefully handle the root cause.

> Problem 2:

Does this still happen?

> I'm not familiar with TraCI or ports, and I'm just curious: if one port cannot connect after many tries, it seems useless to keep trying that port. Can it try to connect to another one instead?

I have attempted a patch 0930736 for that now in #2140. It retries with a different port and saves the TraCI server on a stolen connection to avoid interrupting a different instance.
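
Conceptually, the retry looks something like the sketch below; start_sumo and random_free_port are placeholders (the latter from the earlier port sketch), not the actual code in the patch:

    import traci
    from traci.exceptions import FatalTraCIError

    def connect_with_fresh_ports(start_sumo, max_port_attempts: int = 3):
        """Illustrative retry loop: if a port never answers, abandon it and try a new one."""
        last_error = None
        for _ in range(max_port_attempts):
            port = random_free_port()    # see the earlier port sketch
            proc = start_sumo(port)      # placeholder: launch sumo with --remote-port <port>
            try:
                return proc, traci.connect(port=port, numRetries=100)
            except FatalTraCIError as err:
                last_error = err
                proc.kill()              # give up on this SUMO instance and its port
        raise last_error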

I will need to do a follow-up fix.

My last run got stuck at 401027 episodes. As far as I know, problem 2 has not happened again. I roughly understand what you are saying; it seems that if I reduce the number of parallel runs, the chance of port conflicts will also be reduced?
Thank you very much for your continued attention.

> My last run got stuck at 401027 episodes. As far as I know, problem 2 has not happened again. I roughly understand what you are saying; it seems that if I reduce the number of parallel runs, the chance of port conflicts will also be reduced? Thank you very much for your continued attention.

Honestly, it would reduce the chances but not completely prevent it.

I am pursuing a different solution that uses a centralised server to prevent port collisions (at least between SUMO instances). As of 94da02f it looks like this:

## console 1 (or in background OR on remote machine)
# Run the centralized sumo port management server.
# Use `export SMARTS_SUMO_CENTRAL_PORT=62232` or `--port 62232`
$ python -m smarts.core.utils.centralized_traci_server
## console 2
## Set environment variable to switch to the server.
$ export SMARTS_SUMO_TRACI_SERVE_MODE=central
## Optional (not required)
# export SMARTS_SUMO_CENTRAL_HOST=localhost
# export SMARTS_SUMO_CENTRAL_PORT=62232
## do run
$ python experiment.py

It works as-is right now, but when I get it working better I will likely integrate the server generation into the main process and set it as the default behaviour. I think I will also eventually use the server as a pool of SUMO processes, which may also speed up training somewhat.
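
For intuition only, here is a toy sketch of why a central allocator removes collisions between experiments; it is not the actual smarts.core.utils.centralized_traci_server implementation:

    import socket

    # Conceptual only -- a single broker process hands out port numbers and
    # remembers which are in use, so no two experiments it serves are ever
    # pointed at the same SUMO port at the same time.
    class PortBroker:
        def __init__(self) -> None:
            self._in_use: set[int] = set()

        def reserve(self) -> int:
            while True:
                with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
                    sock.bind(("127.0.0.1", 0))   # let the OS suggest a free port
                    port = sock.getsockname()[1]
                if port not in self._in_use:      # never hand out the same port twice
                    self._in_use.add(port)
                    return port

        def release(self, port: int) -> None:
            self._in_use.discard(port)

Because every experiment asks the same broker, the case where two of them independently receive the same OS-suggested port disappears, although an unrelated process on the machine could in principle still grab a reserved port.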

The newest change, 25789e9, resulted in no disconnects and no port collisions across 60k instances and 32 parallel experiments.