hhblaze / SharmIPC

IPC (inter-process communication) engine for .NET based on memory-mapped files.



Issues when creating many new instances quickly

seertenedos opened this issue · comments

commented

I have a simple app that spins up 100 processes as a test and sends some messages back and forth. I was building it up to test concurrency before I integrate the wrapper I built around SharmIPC to auto-serialize and deserialize my classes as well as manage all the processes.

What I found is that if I spin up a lot of processes instantly, a large number of the SharmIPC slaves will fail to send the first message I try to send. This message is a heartbeat and a notification that the child process has started, so if it fails my child process shuts down. The interesting thing I found is that the more I delay the startup of each new process, the more child processes work successfully. For example, at a 1 second delay 50/100 will start and send the first message without error, but at a 2 second delay 75/100 will start successfully and send the first message.

Are there any known limits here? I can see the messages in the console saying that the master and slave have started, even though there is no way to check it on the IPC instance itself. All I get back is the false for the failed send, and the message does not arrive on the receiving side.

Debugging has not helped much yet, and I need to find a way to get the actual error.

I can't say for sure it is not antivirus or something on my computer causing this, but I thought I would check first whether I am hitting some known issue.

commented

Some further info: I am using netstandard2.0 and .NET Core 2.0 on Windows. I am also using version 1.16.0 of your library, but interestingly that does not match the source tagged as 1.16: in the source there is a method "public string UsageReport()" which does not exist on the SharmIpc in the library I am using.

Maybe the netstandard library was a bad build?

Is it possible to update the message send responses to be a class that includes 3 things (see the sketch after this list)?

  1. A clearly named boolean for a successful send.
  2. The response bytes, if any.
  3. The error or exception, if any, so failures can be worked out and fixed.
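
Something like this hypothetical shape, for illustration (not part of SharmIPC's current API; the name and members are made up to match the three points above):

    // Hypothetical result type sketching the request above; not part of
    // SharmIPC's current API, names are illustrative only.
    public sealed class SendResult
    {
        public bool Success { get; set; }         // 1. clearly named success flag
        public byte[] ResponseBytes { get; set; } // 2. response payload, if any
        public Exception Error { get; set; }      // 3. failure cause, if any
    }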

On a side note, what level of concurrency is supported when processing concurrent requests in the class that receives the requests and needs to send responses back?

Hi, what can I tell you? We run up to 60-70 parallel slave processes in a real production system, talking to one core, and none of them fails.
Maybe you have a serializer warm-up problem, maybe something else. Create a test project and post it here. Test this project with .NET Framework on Windows to exclude netstandard problems.
You can use the latest sources or the sources that belong to the latest release - they have no vital difference.

Each instance is completely isolated from the others.
Also, don't shut down your processes while testing; investigate the message flow thoroughly.

While you are spinning up your 100 processes with slave instances, do all 100 master instances already exist and are they initialized at that moment?

commented

Yeah, the process is that it creates the master IPC instance, then creates a child process that really does nothing but create an IPC instance to talk to the master and try to send a message, which returns a false boolean. I can't get the IPC code to tell me anything apart from the false to say it failed to send. If I use a 2 second delay, then only after about the 50th process do I start getting the failures. I was planning to open source my library, but I just have not gotten around to adding it to a public Git repository etc., and that has the test app in it. I will try to do that tomorrow.

It really seems to be about how fast I am creating them more than the number I am creating. The faster I create them, the more fail; the slower I create them, the more start correctly. It should not be serialization: even if the bytes it generated were not right, that would not cause the IPC send method you wrote to return false, as far as I can tell. To be honest, looking at the latest code you have checked in, I can't understand how it can return false and not generate an exception that I would log.

If I get a chance I may pull the code directly into the project to debug, but it is hard: trying to auto-attach to the child processes to find the issue is almost impossible, because if you set a breakpoint it slows things down enough that they work, as far as I can tell.

Anyway, I will post a link to the repository here once I put it on a public Git host so you can see what I am doing and hopefully spot a stupid mistake.

Just create 2 simple projects emulating the core of the problem. Better to use .NET Framework for now (let's say the quite stable 4.5).

I've looked a bit.
I think the problem is hard limitations on watchdogs and ping.
Probably the listening thread of the manager is too busy to answer your pings, because it is creating new child processes and calling Thread.Sleep.

Eventually all child processes should answer if you give them a chance.
So, try to remove the initial ping:

    if (!SendKeepAlive(1000000))   // set to a very big value
    {
        throw new Exception("IPC connection failed on startup.");
    }

    _watchdog = new Timer(x => OnWat

and to remove the watchdog's negative reactions on child processes, like:

    cp = new ChildProcess((state, info) =>
    {
        Console.WriteLine($"Child State - {state} - {info}");
        lastState = state;
        switch (state)
        {
            case ChildProcessStateChangedEnum.ParentExited:
                exitEvent.Set();
                break;
            case ChildProcessStateChangedEnum.WatchdogTimeout:
                //exitEvent.Set(); // commented out
                break;
            case ChildProcessStateChangedEnum.PingFailed:
                //exitEvent.Set(); // commented out
                break;
            case ChildProcessStateChangedEnum.ShutdownRequested:
                exitEvent.Set();
                break;
        }
    }, processors);

Add a "communication test" to your code (send a message from the manager to all children one by one and receive a response or a failure from each of them).
Run, wait until all processes are started (for the test, maybe 30 processes is enough, otherwise you need to wait a long time), then run your "communication test".
The children should all answer - that will be proof that communication is established and works.
The problem can be tightened environment settings, and solving them is a point for another discussion.
If you see different behavior, we can go on investigating.
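
For reference, a minimal sketch of such a communication test, assuming the master keeps one SharmIpc instance per child in a dictionary (the dictionary name is made up; RemoteRequest is used as documented in the SharmIPC README):

    // Loop over the master-side instances and ping each child once;
    // masterInstances is a hypothetical Dictionary<string, tiesky.com.SharmIpc>.
    foreach (var kv in masterInstances)
    {
        Tuple<bool, byte[]> res = kv.Value.RemoteRequest(new byte[] { 1 }); // test payload
        Console.WriteLine("{0}: {1}", kv.Key, res.Item1 ? "OK" : "FAILED");
    }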

P.S:
SharmIPC has an internal configurable timeout of 30 seconds for a call. Maybe this is what kicks your call back.
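
If the timeout is the cause, it can be raised per call; a sketch assuming the RemoteRequest signature from the SharmIpc sources, where timeoutMs is an optional parameter (sm and payload stand in for your instance and data; verify against your version):

    // Raise the per-call timeout from the default 30 seconds to 2 minutes.
    Tuple<bool, byte[]> res = sm.RemoteRequest(payload, null, 120000);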

I have also realized that you create only one SharmIPC instance in the master process and try to connect 100 child SharmIPC instances to it, though I may be mistaken?
SharmIPC is designed a bit differently.
If you have 1 master process and 100 child processes, you must create 100 SharmIpc instances in the master process with names "instance1", "instance2".."instance100", and in each child process an instance with the corresponding name: for the first process "instance1", for the second process "instance2", etc.
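
A sketch of that wiring, assuming the channel name is handed to each child as a command-line argument (the handler names, childExePath, and the dictionary are illustrative):

    // Master process: one SharmIpc instance per child, each with a unique name.
    var masters = new Dictionary<string, tiesky.com.SharmIpc>();
    for (int i = 1; i <= 100; i++)
    {
        string name = "instance" + i;
        masters[name] = new tiesky.com.SharmIpc(name, MasterHandler); // Tuple<bool, byte[]> MasterHandler(byte[] data)
        Process.Start(childExePath, name); // pass the channel name to the child
    }

    // Child process: a single instance bound to the name it received.
    var slave = new tiesky.com.SharmIpc(args[0], ChildHandler);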

commented

If possible, I suggest you run the test app to see, but no: one IPC instance is first created in the parent process for each child process, and the details for that instance are then passed to the child. You will be able to see that both show up in the console logs for each one.

The watchdog is there to make sure both the child and parent are up, so that the child shuts down if the parent is not there, and so that the parent knows once communication with the child is up. If you look at the calls they make, they are very simple, so I can't imagine them blocking other things, since the test app does nothing and my CPU and memory usage are not that high. Also, 10 seconds to send the ping request is quite a long time on the client side, because that is all it is doing. I can bump it up to 30 seconds, but if it is going to take 30 seconds for the communication to start working, I am not sure that will be workable in my case.

In the final app a single message failing would not kill the app, but the use case we have is basically a parent app that receives potentially hundreds of calls a second and should forward each request to a particular child and return the result. The code we are running in the child can sometimes cause stack overflows or memory issues, so we need to isolate them so they do not affect the main app and so we can identify the problem request and calls.

Are you able to point me to the commit that matches the code in the version 1.16 netstandard lib you published to NuGet? With the latest code, the false response makes no sense.

I have already started your app with the latest source of SharmIPC from GitHub; what I wrote (before the suggestion about one master SharmIpc instance) stays relevant. So, try to send a communication test signal to the children after all processes are initialized. If you receive a response from all of them, that will be a first step.

SharmIpc sources: take the two folders sharmipc and sharmipcnetstandard20 from there and add them to your solution. Build the netstandard project and make your project reference it.

A SharmIpc instance listens for events and routes them back to your program, and all this happens in the same thread which you use to create the child instances; actually you do everything in one thread. Events are not really parallel, they are executed in a semi-parallel manner, but if Thread.Sleep is called, or a long-running operation (like, probably, Process.Start) is called, events will stop being raised in that thread until they get their execution window... and your child process's SharmIpc call can be timed out (30 sec). All your instances are created in one main thread, all processes are started in the main thread, and all events are listened for in that same thread that is being blocked for a long time; during this time you will not see any CPU consumption, the thread is just blocked.
In real life you also have code to execute that can really consume all CPUs, which means that in any case, on heavy load, you will receive timed-out calls. So building such a straightforward ping system that kills the processes when there is no answer is just a bad idea. That will be step two.
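
In other words, if the test app's startup loop looks roughly like this (a hypothetical reconstruction; StartChildWithIpc stands in for the real startup code), the one thread that should service the IPC events is blocked for most of the startup window:

    // Process creation and the sleep both run on the single main thread,
    // which is also where the IPC events are waiting to be serviced.
    for (int i = 0; i < 100; i++)
    {
        StartChildWithIpc(i); // wraps Process.Start etc.; long-running
        Thread.Sleep(1000);   // blocks event delivery on this thread too
    }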

commented

I put some local timing on the calls where responses are required and noticed that as the number of child processes increases, the time to perform the same round-trip call, which starts at between 0.498 ms and 41 ms, increases rapidly. By the time the 14th child is up and running, it is up to 22 seconds instead of 41 ms.

Is there anything that would interfere between the different IPC channels on the same box when lots of quick messages are happening on different IPC channels at the same time? Some sort of global locking or something?

Based on your last reply, are you somehow using the main thread to handle the incoming requests? I thought you would be using new threads for that, or at least the thread pool. Looking at the code, that is also what it looks like it does: it uses Task.Run, which uses the thread pool, not the thread that created an instance of your class.

Yes, increasing the timeout helps, and setting it to 5 minutes means that all pings are fine, but the response times I see mean the library is not much use for my use case, which needs concurrent request/response with high performance and many different concurrent parent-to-child communication channels.

commented

I have built a performance test and am starting to track down the issues when using many concurrent master/slave pairs at once.

So far it is pointing at SharedMemory.cs line 88 - if (mt.WaitOne(500)).
That line can take a maximum of 50 seconds and an average of 12 seconds when spinning up 100 parents/children at once.

I am also profiling the slowdown in the messages.

Did you test the data exchange speed between the different master-slave instances after all 100 instances are up and running?

commented

Not once they are all running, but I seem to have found the main issue, which is actually the minimum number of threads in the thread pool. By default on my PC it is 6.
In the parent app, if I use the following line then things seem to work:
ThreadPool.SetMinThreads(100, 100);
Well, for 50 instances started at once. For 100 instances started at once I lose about 30 of them. I need to investigate exactly where they fail now. From a quick look I am still getting some requests that take 2-3 seconds, but I need to check whether something else is slowing things down now or using the threads when it should not, like the code for reading the console output from the client apps.
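
For anyone following along, a small sketch of inspecting the defaults before raising the floor (the value 100 is just the number used above):

    using System;
    using System.Threading;

    // Check the current thread-pool floor, then raise it before creating
    // the SharmIpc instances, so ping bursts don't wait for pool ramp-up.
    ThreadPool.GetMinThreads(out int worker, out int io);
    Console.WriteLine($"min worker threads = {worker}, min IO threads = {io}");
    ThreadPool.SetMinThreads(Math.Max(worker, 100), Math.Max(io, 100));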

Our servers are 64-bit VM machines, have the default settings ThreadPool.SetMinThreads(250, 250) and up to 70 child processes. This quantity of threads also helps the system to warm up and start quickly. Core processes can have 300-500 OS threads after a while, and they don't drop down if the load is stable. Some servers, together with their child processes, reach 12 GB of RAM. One VM with 8 cores can handle several such servers.
The system is about 10 years old, has run on SharmIPC for about 2 years, and the data exchange speed is quite acceptable.

commented

Thanks for that. 300 threads makes the 100 instances work. Most requests now take 1 second under load for something that does next to nothing, so that is much better than timing out at 30 seconds.

So, the more CPUs the system has, the faster the channel data exchange will be, and the more child processes can be served at a good speed under high load.
E.g. we have a machine that is CPU-loaded on average at 15-20%, with 3 servers (3 master processes) that bear in total around 150 child processes; the data round-trip via SharmIPC is about 2 ms.

But I think it would be interesting to revisit the code and to think about when it is good to use Task.Run for the incoming request, or to just call the external function in an async manner without the await operator, so the client could decide for itself whether to put the processing function into the ThreadPool or not; though otherwise a thread manager would be needed to utilize all CPUs (which is, actually, the thread pool)...
In those days I was experimenting a lot and came to this implementation, but... who knows.
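
Roughly the two dispatch styles in question (illustrative fragments only; remoteCallHandler and data stand in for the real delegate and payload):

    // 1) Thread-pool dispatch, approximately what the Task.Run routing does:
    Task.Run(() => remoteCallHandler(data));

    // 2) Direct invocation: the receiver decides whether to offload, but a
    //    slow handler then blocks the listening thread.
    remoteCallHandler(data);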

For now I am closing the issue; write if you need something else.

commented

Thanks, yeah, I will do some more investigation.
2 questions:

  1. Is the version 2 protocol complete and working? I noticed it in the source, but that version was never released.
  2. Do you mind if I put a copy of the source of your lib as a project in mine, leaving the files intact? I want to change the implementation a little to remove some of the Task.Run calls and to allow passing the master vs slave role in on startup if you already know what it will be, like I do in my case, as well as any other things I can find to improve performance for my use case. Since it is public, you can pull anything you want back into your code.

Normally I would just do a PR on the original lib, but I need to change a few things just for my use case, I think, and you may not want them in the main code base.

1 - The second version is not working yet; I just wanted to make a slightly smaller communication protocol, but then I was interrupted and never went back to it, though the first version is also good enough.
2 - No problem, do it and then share your knowledge; maybe we can make this lib better.