kquick / Thespian

Python Actor concurrency library

High memory use and actor system timeout with about 2,000 actors

davideps opened this issue · comments

I'm designing a simulation that should be able to scale across a cluster of computers. On my Windows 10 laptop with 6 cores and 32GB of memory, I am encountering both a memory issue and an actor system timeout. I'm using multiprocTCPBase and get a timeout error (trace below) after 1,994 actors have been created. Together these actors use 20GB of memory (not including the 5GB used at machine startup). Two actors store large dictionaries that include the addresses of many other actors; the remaining actors store only a few integer and string values. I was surprised by the memory usage, though my laptop did not actually run out of memory.

Since I'm interested in multi-core, multi-machine performance, I thought the actors should be separate processes, but perhaps there is too much overhead? I'd welcome advice about how to redesign the actors to use less memory, as well as help identifying the reason for the timeout.

        [lots of lines like this one below...]
	Cell {A:Cell @ ActorAddr-(T|:58746)} initialized: oid:41_34 x:41 y:34
Traceback (most recent call last):
  File "D:/Documents/Programming/ActiveFiction/python_code/thespian_test_01.py", line 409, in <module>
    cell = asys.createActor(Cell)
  File "C:\Users\dvyd\.conda\envs\activefiction\lib\site-packages\thespian\actors.py", line 705, in createActor
    sourceHash)
  File "C:\Users\dvyd\.conda\envs\activefiction\lib\site-packages\thespian\system\systemBase.py", line 217, in newPrimaryActor
    str(self.transport.myAddress)))
thespian.actors.ActorSystemRequestTimeout: No response received to PendingActor request to Admin at ActorAddr-(T|:1900) from ActorAddr-(T|:62606)

Process finished with exit code 1

Let me address your two concerns individually: memory utilization and timeouts.

  • For memory utilization, I'm afraid this may be unavoidable. The minimum overhead for a Python application seems to be about 20-30KB (at least under Linux), which seems to be related to the core interpreter, shared libraries, and the various Python objects that are maintained by the GC in the cloned processes. I did some work a couple of years ago to try to minimize this footprint, but I was unable to reduce it below that threshold.

    One of the constraints here is that Thespian implements each Actor as a separate system process. This ensures separation between Actors, but it's also a heavy-weight isolation mechanism. I had considered an alternative base implementation allowing each Actor to be a separate thread instead of a separate process, for a smaller footprint, but I ran into some significant concerns relative to the GIL, where any one Actor would then have the ability to stall all of the other Actors. I think that a thread-based implementation would be possible, with some caveats and restrictions, but it's difficult to achieve Erlang-style scalability for Actors without support for this at the VM level.

    The various approaches I could suggest at this point are:

    1 - use multiple systems, with a convention to spread Actors across the different systems
    2 - change the level of granularity for Actors: combine functionality to have fewer Actors capable of doing more things. I know this is counter to the general advice with Actors, but it's one way to mitigate your scalability issue.
    3 - try multi-threading your Actors. Thespian should support Actors which have multiple threads: receiveMessage() will only be activated on the "primary" thread, although other threads are allowed to call self.send(). Any other synchronization is up to you. Again, I hesitate to suggest this because it adds considerable complication, and the normal advice is to stick to a single concurrency architecture, but it's a practical approach to dealing with your underlying constraints.

  • With regards to the timeouts, this may be somewhat related to the scaling issue, but it's also a bit unexpected. I've used Thespian in systems with > 1000 actors successfully. It would be good to know whether the timeouts only start occurring once you reach a certain number of Actors (or a certain level of activity), to determine whether this is a threshold issue or a gradual one.

    There is some internal logging Thespian generates that may be of some help to diagnose what is happening here. By default, it will write to $TMPDIR/thespian.log, although you can control this with the THESPLOG_FILE environment variable (ensure this is set in the process that starts the Admin to be properly applied). You can also use the THESPLOG_FILE_MAXSIZE environment variable to specify how large this file is allowed to grow (the value is the number of bytes and the default is 51200), and the THESPLOG_THRESHOLD to set the logging threshold (the default is "WARNING" and valid values are "DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"). I'm happy to take a look at output from this log (either here or in a gist) to see if anything looks out of place.
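For reference, a minimal sketch of setting those variables before the system starts (the log path is a placeholder; remember that they must be set in the process that starts the Admin):

    import os

    # Thespian's internal log configuration, read when the Admin is started.
    os.environ['THESPLOG_FILE'] = r'D:\temp\thespian.log'   # default: $TMPDIR/thespian.log
    os.environ['THESPLOG_FILE_MAXSIZE'] = str(1024 * 1024)  # bytes; default is 51200
    os.environ['THESPLOG_THRESHOLD'] = 'DEBUG'              # default is WARNING

    from thespian.actors import ActorSystem
    asys = ActorSystem('multiprocTCPBase')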

That's OK, the last line is the important one: ValueError: too many file descriptors in select(). This is probably an OS limit that is being hit. I don't know if Windows provides a way to adjust this limit or not, but if you are already close to the maximum number of Actors you need, raising that limit could be a viable solution. However, if you want to scale significantly beyond 2000 actors, then you might start to encounter secondary effects.

Another option is to tell Thespian not to keep the sockets open; keeping the socket open is an efficiency measure to avoid TCP connection setup times, but Thespian will reconnect as needed. Unfortunately, there's no environment-variable control for this at the moment, so you'll need to edit your local version of Thespian and change REUSE_SOCKETS to False here: https://github.com/kquick/Thespian/blob/master/thespian/system/transport/TCPTransport.py#L127. If this solves the problem for you, I could add an environment variable or another way to control this setting. Alternatively, I could look into establishing an upper threshold on the number of open sockets that would automatically close sockets as needed to avoid crossing that threshold; I'd like to get more data on your experience, though, before undertaking any adjustments.

With respect to memory utilization, you are correct: there's a magnitude difference between the minimum I'm aware of and what you are seeing. One thought is that my observed minimum is under Linux, and I'm less familiar with the situation on Windows. It might be interesting to see what the difference is between starting 2000 actors with no traffic vs. that number after they start exchanging traffic in your application. If it's the latter, it may be that, since there's memory available in the system, there's less pressure on each Actor process's memory to trigger Python GC. I don't want to leap to any conclusions, though, so if you have a chance to observe the footprint of 2000 idle actors, it would be interesting to compare that to your current situation.

It depends on your usage.

I would recommend against using the UDP base because UDP messages do not guarantee delivery; this base is interesting, especially for experimentation with lossy delivery, but should probably not be used for reliable/production purposes.

The multiprocQueueBase uses Python's Queue facility for communications between Actors. A Queue cannot be arbitrarily attached to processes; it is instead a point-to-point link between a parent and a child process, so Thespian's message delivery in the Queue transport involves passing the message up the parent chain of actors (internally... the Actor code itself is not involved) to a common ancestor of both the sender and the receiver, and then back down the chain of children on the receiver's side to the destination. The benefit to you of this base is that none of the processes will exceed the file descriptor limit for the select() call unless you have an exceptionally flat Actor tree where all 2000 Actors are children of a single parent. The disadvantage is the performance impact of passing each message across multiple Actors to the intended destination. That said, all of this complexity is hidden, so all you would need to do is change the name of the base you are using to experiment and see if the multiprocQueueBase works better for your needs. [One additional note: in my testing I've seen occasional failures that seem to originate in the internals of the Queue implementation, but I haven't had the time to fully track them down.]

The simpleSystemBase has no processes or sockets, but the performance will be even lower than the multiprocQueueBase because all the Actors live in the current process and participate in a cooperative scheduling technique to run serially. Additionally, since all the Actors share the same process, it's easier to accidentally violate the Actor protections by having Actors share structures in memory, so your development requires more discipline to avoid this. Again, however, you can simply try this base to see if it helps or hurts your situation by changing the name of the base used to start the system (see the sketch below), with no other changes to your code.
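For illustration, a minimal sketch of that switch; only the base name passed when the ActorSystem is first created changes:

    from thespian.actors import ActorSystem

    # Pick the transport by name; nothing else in your Actor code changes.
    # asys = ActorSystem('multiprocTCPBase')    # current setup: per-Actor processes + TCP sockets
    # asys = ActorSystem('multiprocQueueBase')  # per-Actor processes linked by parent/child Queues
    asys = ActorSystem('simpleSystemBase')      # all Actors run serially in the current process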

When I set REUSE_SOCKETS = False, I get the following trace:

C:\Users\dvyd\.conda\envs\activefiction\python.exe D:/Documents/Programming/ActiveFiction/python_code/thespian_test_01.py
Starting actor system
Exception ignored in: <function TCPIncoming.__del__ at 0x000001B93F6D25E8>
Traceback (most recent call last):
  File "C:\Users\dvyd\.conda\envs\activefiction\lib\site-packages\thespian\system\transport\TCPTransport.py", line 197, in __del__
    _safeSocketShutdown(s.socket)
NameError: name 's' is not defined
Exception ignored in: <function TCPIncoming.__del__ at 0x0000024955672D38>
Traceback (most recent call last):
  File "C:\Users\dvyd\.conda\envs\activefiction\lib\site-packages\thespian\system\transport\TCPTransport.py", line 197, in __del__
    _safeSocketShutdown(s.socket)
NameError: name 's' is not defined
Exception ignored in: <function TCPIncoming.__del__ at 0x000001B93F6D25E8>
Traceback (most recent call last):
  File "C:\Users\dvyd\.conda\envs\activefiction\lib\site-packages\thespian\system\transport\TCPTransport.py", line 197, in __del__
    _safeSocketShutdown(s.socket)
NameError: name 's' is not defined
Exception ignored in: <function TCPIncoming.__del__ at 0x0000024955672D38>
Traceback (most recent call last):
  File "C:\Users\dvyd\.conda\envs\activefiction\lib\site-packages\thespian\system\transport\TCPTransport.py", line 197, in __del__
    _safeSocketShutdown(s.socket)
NameError: name 's' is not defined
Exception ignored in: <function TCPIncoming.__del__ at 0x000001B93F6D25E8>
Traceback (most recent call last):
  File "C:\Users\dvyd\.conda\envs\activefiction\lib\site-packages\thespian\system\transport\TCPTransport.py", line 197, in __del__
    _safeSocketShutdown(s.socket)
NameError: name 's' is not defined
Traceback (most recent call last):
  File "D:/Documents/Programming/ActiveFiction/python_code/thespian_test_01.py", line 399, in <module>
    agent_scheduler = asys.createActor(AgentScheduler)
  File "C:\Users\dvyd\.conda\envs\activefiction\lib\site-packages\thespian\actors.py", line 705, in createActor
    sourceHash)
  File "C:\Users\dvyd\.conda\envs\activefiction\lib\site-packages\thespian\system\systemBase.py", line 196, in newPrimaryActor
    response.failure_message)
thespian.actors.InvalidActorSpecification: Invalid Actor Specification: <class '__main__.AgentScheduler'> ('TCPTransport' object has no attribute '_openSockets')
Creating agent_scheduler

Process finished with exit code 1

Is the file descriptor limit the same thing as the open files limit discussed/solved here: https://coderwall.com/p/ptq7rw/increase-open-files-limit-and-drop-privileges-in-python ?

Yes, that link is related, although it's POSIX-specific. The Python calls used there also have command-line equivalents on a POSIX system (e.g. Linux). For Windows, it's probably something along the lines of what's indicated here: https://serverfault.com/questions/249477/windows-server-2008-r2-max-open-files-limit
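For reference, the POSIX-side adjustment from that link boils down to something like the following sketch (standard resource module; not available on Windows):

    import resource

    # Raise the soft limit on open file descriptors up to the hard limit (POSIX only).
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))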

Do you have any idea how much time it takes to open and close the connections?

That will vary a lot based on the OS and other factors, such as how many messages you expect to exchange. If your program's execution time is dominated by message delivery between actors, the reconnection cost may be visible to you; if it is dominated by computation within the actors, it may not be a big deal. TCP socket re-use is a common technique for improving the high-load performance of network-based operations, and it has been highlighted in scenarios where people are running benchmarks or working to maximize the number of transactions to a server. It starts to be a significant factor at those higher ends of the scale, but it's not necessarily going to be a noticeable impact when you aren't running configurations at those levels.

Can the multiprocQueueBase be used in a cluster scenario, or only the TCP base?

No, it cannot. The Python Queue object is only usable between processes on the same system. To support your multi-system functionality, you will need either the multiprocTCPBase or the multiprocUDPBase (see the previous note, though: the latter is not recommended). FYI, more information on the different ActorSystem bases is here: https://thespianpy.com/doc/using.html#outline-container-org8296d29
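As a sketch (the address and port below are placeholders), a multi-machine setup with the multiprocTCPBase uses the Convention Address.IPv4 capability so the systems can join the same convention:

    from thespian.actors import ActorSystem

    # On the convention leader machine:
    asys = ActorSystem('multiprocTCPBase',
                       capabilities={'Convention Address.IPv4': ('10.0.0.1', 1900)})

    # On each additional machine, specify the leader's address so the local
    # system registers with that convention:
    asys = ActorSystem('multiprocTCPBase',
                       capabilities={'Convention Address.IPv4': ('10.0.0.1', 1900)})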

When I set REUSE_SOCKETS = False, I get the following trace:

Whoops, sorry about that! It looks like I haven't run in this mode for quite a while and some bitrot occurred. I've committed (e21f484) an updated TCPTransport.py that fixes this. You should be able to get a new version of this file (https://raw.githubusercontent.com/kquick/Thespian/e21f484db5d0207d98274276a2336dcab108c864/thespian/system/transport/TCPTransport.py) and replace your existing file, or you can checkout the latest master version of the repository. (The committed version still has REUSE_SOCKETS = True, so you'll still need to make that change.)

Kevin, the new version of TCPTransport.py solved the problem! I was able to create thousands of actors without running into the select() error. It is a turn-based simulation. Will I run into a problem when the scheduler tells all the workers to make a move and then they all try to open TCP connections to confirm to the scheduler that they are done? Will the ActorSystem handle the demand for new connections and make sure the open total never exceeds 1,900 or so?

The downside of being able to start all these actors is that I can easily run out of memory (32GB) if I try to generate too many. My machine started to use cache! How can I determine why the actors are using so much memory? The __init__ of the workers is pretty simple:

class Agent(Actor):
    def __init__(self, *args, **kw):
        self.oid = None
        self.firstname = None
        self.lastname = None
        self.fullname = None
        self.cell = None
        super().__init__(*args, **kw)

and

class Cell(Actor):
    def __init__(self, *args, **kw):
        self.oid = None
        self.x = None
        self.y = None
        self.neighbors = {}
        super().__init__(*args, **kw)

The two schedulers (not shown) are a bit heavier. Each holds a dictionary of several thousand worker addresses. Still, that is not close to even a single GB of data.

Great news about the new TCPTransport.py!

There is currently no explicit limiter that would prevent it from trying to open too many sockets at once, although there are limits on the number of active listens which may help to regulate this. There is also a lot of error handling and retry logic in the TCPTransport: it's possible that you've been hitting this limit in other calls and the transport has been handling those, but the select() is expected to be able to work. It's also a little tricky because just regulating at the getrlimit() value isn't enough: the code doesn't have any way to account for other open file descriptors. Let's keep an eye on your results as you do more testing, and when I have some more time (possibly this weekend) I can set up some more extensive tests for investigating this.

Regarding the memory size, I did some rough experimentation and (under Linux) it looks like a started Python process with nothing loaded starts at about 122MB (only ~9.6MB resident). Clearly some of this is shareable, since 1000 processes at that base rate would be 122GB. Windows results may be significantly different, depending on the amount of sharing Windows can achieve with the Python DLLs.

One thing you can try is to change the "Process Startup Method" capability (see https://thespianpy.com/doc/using.html#outline-container-org8296d29), which allows you to specify which Python multiprocessing mode is used (modulo system capabilities). Here again, I think Windows is more restricted in this area (try $ python -c "import multiprocessing; print(multiprocessing.get_all_start_methods())" to see what's actually available on your system). Different process startup methods may have different levels of sharing and impact on the resulting process size.
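As a sketch, the capability is just supplied when starting the system; whether 'spawn', 'fork', or 'forkserver' is accepted depends on what your platform reports:

    import multiprocessing
    from thespian.actors import ActorSystem

    # See which start methods this platform actually supports.
    print(multiprocessing.get_all_start_methods())

    # Ask Thespian to use one of them via the "Process Startup Method" capability.
    asys = ActorSystem('multiprocTCPBase',
                       capabilities={'Process Startup Method': 'spawn'})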

You may want to try using mprof or guppy to do some analysis (https://pypi.org/project/memory-profiler/ and https://www.pluralsight.com/blog/tutorials/how-to-profile-memory-usage-in-python may be helpful). I will look into this situation more as well, although it will take me a couple of days to have time to do this properly. However, unless we can find some egregious usage of memory in the Thespian code itself (which would surprise me, but I'd certainly want to fix it if there was), I'm not sure there's a lot that can be done to improve things here. Memory is one of the cheaper resources available these days, so there's not a lot of general focus on minimizing footprints, and not a lot that Thespian can do to control this.
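As a starting point, here is a rough sketch (the ProfiledCell name is hypothetical) using guppy's hpy() from inside an Actor to report what is actually occupying that Actor's process:

    from guppy import hpy            # pip install guppy3
    from thespian.actors import Actor

    class ProfiledCell(Actor):
        def receiveMessage(self, message, sender):
            if message == 'heap?':
                # Summarize the live Python objects in this Actor's process.
                self.send(sender, str(hpy().heap()))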

It isn't possible for the Actor itself to be a thread (that was the development route mentioned above, which I considered but couldn't make work in a general fashion), but you can write an Actor that has multiple threads. Despite my aversion to recommending mixed concurrency mechanisms, this might be a good practical approach for your scalability needs.

Essentially you would have a grid Actor that contained some number of cells, where each cell could be a separate thread. The Actor's receiveMessage() would always be invoked by Thespian in the context of the main grid thread, so it would need some mechanism to pass each message to the correct "cell" thread (using one of the mechanisms in Python's threading/queue facilities). The "cell" threads can call self.send() directly to send their responses, as that call should be thread-safe in Thespian.
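A rough sketch of that shape (GridActor and the message formats are hypothetical; dispatch uses a standard queue.Queue per cell thread):

    import threading, queue
    from thespian.actors import Actor

    class GridActor(Actor):
        """One Actor process hosting many 'cell' worker threads."""

        def __init__(self, *args, **kw):
            super().__init__(*args, **kw)
            self.cells = {}   # cell id -> queue.Queue feeding that cell's thread

        def _ensure_cell(self, cell_id):
            if cell_id not in self.cells:
                work = queue.Queue()
                threading.Thread(target=self._cell_loop, args=(cell_id, work),
                                 daemon=True).start()
                self.cells[cell_id] = work
            return self.cells[cell_id]

        def _cell_loop(self, cell_id, work):
            while True:
                message, sender = work.get()
                # ... perform this cell's work for the turn ...
                # self.send() should be thread-safe, so the cell thread can
                # reply to the scheduler directly.
                self.send(sender, ('done', cell_id))

        def receiveMessage(self, message, sender):
            # Always invoked on the main thread; hand the work to the right cell.
            if isinstance(message, tuple) and message[0] == 'step':
                self._ensure_cell(message[1]).put((message, sender))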

Regarding Windows using spawn, I recall that now. Under Linux, a spawn is potentially slower, although it might have the advantage of carrying across less instantiated memory. Windows will have very different characteristics, however, in how it handles multiprocessing and system resources, but it sounds like this is not a parameter that you can experiment with to try to get some savings.

Hi @davideps, are there still concerns to be addressed regarding this issue?