kquick / Thespian

Python Actor concurrency library

System hangs with 1000 messages

andatt opened this issue

Hi Kevin

This is a continuation of issue 43 (#43). I managed to reproduce what was an intermittent issue using the test code below.

It is basically the same code as we were using in issue 43 (https://gist.github.com/kquick/38a23b58e16f1505720a4a18fece012f) but with more data, the data being dicts instead of ints, and a couple of helper functions to count the output in the logfile. The critical factor causing the failure seems to be simply that there are now 1000 messages to process, as opposed to 100 in the original issue:

from thespian.troupe import troupe
from thespian.actors import ActorTypeDispatcher, Actor
from thespian.actors import ActorSystem
import logging
import time


def logfile_extraction(log_files):
    """
    Gets data from logfile and returns as string
    :param log_files: string, path to logfile
    :return: string
    """
    consolidated_output = ""
    for logfile in log_files:
        with open(logfile, "r+") as file:
            consolidated_output += file.read()

    return consolidated_output


def get_results(actor_system, messages_sent_to_api):

    log_file_try_count = 0
    posts_in_logfile = logfile_extraction(["bug_check_2.log"]).count("Received")
    print("checking api for data...", posts_in_logfile)
    while posts_in_logfile < messages_sent_to_api:
        time.sleep(3)
        print("checking api log for post requests...", posts_in_logfile)
        posts_in_logfile = logfile_extraction(["bug_check_2.log"]).count("Received")
        # from_api = get_cases_from_api(self.user_login)
        log_file_try_count += 1
        if log_file_try_count > 100:
            print("Not expected number of posts after 100 tries, test might fail...")
            break
    print("found {} posts!".format(posts_in_logfile))
    return posts_in_logfile


class ActorLogFilter(logging.Filter):
    def filter(self, logrecord):
        return 'actorAddress' in logrecord.__dict__


class NotActorLogFilter(logging.Filter):
    def filter(self, logrecord):
        return 'actorAddress' not in logrecord.__dict__


def log_config(log_file_path_1, log_file_path_2):
    return {
        'version': 1,
        'formatters': {
            'normal': {'format': '%(levelname)-8s %(message)s'},
            'actor': {'format': '%(levelname)-8s %(actorAddress)s => %(message)s'}},
        'filters': {'isActorLog': {'()': ActorLogFilter},
                    'notActorLog': {'()': NotActorLogFilter}},
        'handlers': {'h1': {'class': 'logging.FileHandler',
                            'filename': log_file_path_1,
                            'formatter': 'normal',
                            'filters': ['notActorLog'],
                            'level': logging.INFO},
                     'h2': {'class': 'logging.FileHandler',
                            'filename': log_file_path_2,
                            'formatter': 'actor',
                            'filters': ['isActorLog'],
                            'level': logging.INFO}, },
        'loggers': {'': {'handlers': ['h1', 'h2'], 'level': logging.DEBUG}}
    }


class PrimaryActor(ActorTypeDispatcher):

    def receiveMsg_str(self, msg, sender):

        test_data = []
        for _ in range(0, 10):
            inner = []
            for x in range(0, 100):
                inner.append({"num": x})
            test_data.append(inner)

        if not hasattr(self, "helper"):
            self.helper = self.createActor(
                SecondaryActor
            )

        for data in test_data:

            self.send(
                self.helper,
                data
            )


@troupe()
class SecondaryActor(ActorTypeDispatcher):
    child_count = 0
    children_finished = 0

    def receiveMsg_list(self, msg, sender):
        self.troupe_work_in_progress = True

        if not hasattr(self, "helper"):
            self.helper = self.createActor(
                TertiaryActor
            )

        for data in msg:
            print("in secondary")
            self.child_count += 1
            self.send(
                self.helper,
                {"from": self.myAddress, "data":data}
            )

    def receiveMsg_str(self, msg, sender):

        self.children_finished += 1
        if self.children_finished == self.child_count:
            self.troupe_work_in_progress = False


@troupe()
class TertiaryActor(ActorTypeDispatcher):

    def receiveMsg_dict(self, msg, sender):
        qa = self.createActor(
            QuaternaryActor,
            globalName="quaternary"
        )
        print("in tertiary")
        self.send(
            qa,
            msg["data"]
        )
        self.send(msg["from"], "done!")


@troupe()
class QuaternaryActor(ActorTypeDispatcher):

    def receiveMsg_dict(self, msg, sender):
        print("in quaternay")
        logging.info("Received message number {0}".format(msg))


thespian_system = ActorSystem(
    "multiprocTCPBase",
    {},
    logDefs=log_config("bug_check_1.log", "bug_check_2.log")
)

try:
    primary_actor = thespian_system.createActor(PrimaryActor)

    quaternary_actor = thespian_system.createActor(
        QuaternaryActor,
        globalName="quaternary"
    )

    print('telling',primary_actor)
    thespian_system.tell(primary_actor, "go")
    get_results(thespian_system, 1000)
    print('leaving')
finally:
    thespian_system.shutdown()

For me it hangs at 332 log messages, and there is nothing in /tmp/thespian.log. These are the actor processes still showing after the hang:

andrew   28737 20.5  0.0  77048 16136 ?        S    14:50   0:02 MultiProcAdmin ActorAddr-(T|:1900)
andrew   28738  1.6  0.0  75224 14052 ?        S    14:50   0:00 logger ActorAddr-(T|:41631)
andrew   28739  0.0  0.0  75620 14000 ?        S    14:50   0:00 PrimaryActor ActorAddr-(T|:37085)
andrew   28740  5.9  0.0  75612 14004 ?        S    14:50   0:00 QuaternaryActor ActorAddr-(T|:46121)
andrew   28742  0.6  0.0  75884 14408 ?        S    14:50   0:00 SecondaryActor ActorAddr-(T|:40021)
andrew   28743  2.1  0.0  75752 14420 ?        S    14:50   0:00 SecondaryActor ActorAddr-(T|:39465)
andrew   28746  1.9  0.0  75876 14300 ?        S    14:50   0:00 TertiaryActor ActorAddr-(T|:45795)
andrew   28757  2.0  0.0  76140 14720 ?        S    14:50   0:00 SecondaryActor ActorAddr-(T|:39741)
andrew   28763  1.6  0.0  76140 14588 ?        S    14:50   0:00 TertiaryActor ActorAddr-(T|:34295)
andrew   28790  1.8  0.0  76140 14720 ?        S    14:50   0:00 TertiaryActor ActorAddr-(T|:36253)
andrew   28802  1.5  0.0  75356 13980 ?        S    14:50   0:00 QuaternaryActor ActorAddr-(T|:44001)
andrew   28818  0.3  0.0  75752 14524 ?        S    14:50   0:00 TertiaryActor ActorAddr-(T|:37087)
andrew   28821  0.4  0.0  76008 14532 ?        S    14:50   0:00 TertiaryActor ActorAddr-(T|:42017)
andrew   28833  1.3  0.0  75356 14156 ?        S    14:50   0:00 QuaternaryActor ActorAddr-(T|:40023)
andrew   28838  0.1  0.0  76140 14544 ?        S    14:50   0:00 TertiaryActor ActorAddr-(T|:38999)
andrew   28861  0.0  0.0  14228   980 pts/23   S+   14:50   0:00 grep --color=auto Actor

Is this a system resource issue? I don't get why it would be, though, as I would have expected the troupes to max out at around 100 actors in total, which should be well within the available system resources.

Any ideas?

Thanks

Andrew

The problem you are encountering is that you have a bottleneck on the QuaternaryActors, but no protection against shutting down TertiaryActors before they have completed the asynchronous send to the QuaternaryActor troupe.

There are generally going to be 10 SecondaryActors, and each of those will create a TertiaryActor troupe of 10, so there will be about 100 TertiaryActors, all trying to send to just 10 QuaternaryActors (only one troupe instance, because it is a global troupe). This means that many of the asynchronous self.send() operations from the TertiaryActors to the QuaternaryActors will be delayed, waiting for an idle QuaternaryActor. However, there is no troupe_work_in_progress manipulation in the TertiaryActors, so the extra troupe members will be shut down as soon as the original message has been received, killing many of them before they have completed the send to the QuaternaryActor.

One simple way to observe this is to introduce an arbitrary delay in the TertiaryActors during which the troupe_work_in_progress is true (and before responding with the done message to prevent the SecondaryActor from killing the entire troupe). On my system, this arbitrary delay is enough to allow all 1000 messages to be sent.

@troupe()
class TertiaryActor(ActorTypeDispatcher):
    def receiveMsg_dict(self, msg, sender):
        self.send(self.createActor(QuaternaryActor, globalName="quaternary"), msg["data"])
        self.lastmsg = msg
        self.troupe_work_in_progress = True
        self.wakeupAfter(3)
    def receiveMsg_WakeupMessage(self, wakemsg, sender):
        self.troupe_work_in_progress = False
        self.send(self.lastmsg["from"], "done!")

Naturally, an arbitrary delay is not a good methodology for production code, but it serves to demonstrate the adjusted behavior in this case.

Thanks yet again Kevin, you are a lifesaver! I had assumed that, once sent, a message was handled by the actor system as a whole if the recipient actor was busy and the message was undelivered, i.e. that it would be queued and would exist independently of the actor which had sent it.

I now see this is not the case and have modified my code to set the WIP flag to True and only unset it once a message is received from the QuaternaryActor confirming its task is complete. Now I get the expected number of messages in the log. The time delay approach you mentioned also worked, although, as you say, it is not great for anything beyond demonstrating the issue.
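
For anyone hitting the same problem, a minimal sketch of that acknowledgement-based approach might look like the following. The "ack" reply and the "reply_to" field are illustrative assumptions rather than the exact modification; the point is only that troupe_work_in_progress stays True until the QuaternaryActor has confirmed it handled the data:

from thespian.troupe import troupe
from thespian.actors import ActorTypeDispatcher
import logging


@troupe()
class TertiaryActor(ActorTypeDispatcher):

    def receiveMsg_dict(self, msg, sender):
        qa = self.createActor(QuaternaryActor, globalName="quaternary")
        self.origin = msg["from"]
        # Keep this troupe member alive until the QuaternaryActor confirms
        # it has handled the data.
        self.troupe_work_in_progress = True
        self.send(qa, {"reply_to": self.myAddress, "data": msg["data"]})

    def receiveMsg_str(self, msg, sender):
        # Acknowledgement from the QuaternaryActor: safe to release this
        # troupe member and tell the SecondaryActor we are done.
        self.troupe_work_in_progress = False
        self.send(self.origin, "done!")


@troupe()
class QuaternaryActor(ActorTypeDispatcher):

    def receiveMsg_dict(self, msg, sender):
        logging.info("Received message number {0}".format(msg["data"]))
        # Confirm completion so the sending TertiaryActor can clear its
        # work-in-progress flag.
        self.send(msg["reply_to"], "ack")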