jhuckaby / Cronicle

A simple, distributed task scheduler and runner with a web-based UI.

Home Page: http://cronicle.net

Cronicle scheduler stops working after 2^31 milliseconds (~25 days) of continuous running

jhuckaby opened this issue

I am seeing a very disturbing situation brewing. It seems that after 25 or so days of continuous running, Cronicle will randomly "stop" running scheduled jobs. This actually affects more than the scheduler: it seems that all the Node.js timers and intervals magically stop, while the daemon process itself remains up. No errors, no uncaught exceptions, just a silent timer failure. Here are the facts at a glance:

  • There is no known fix or workaround, other than to restart Cronicle.
  • All reports are for servers that have been running Cronicle continuously for 25+ days.
  • All reports are for Node v10 (v10.5.0 and v10.9.0).
  • All reports are for CentOS (v6.9 and v7.2).
  • Both AWS (S3) and local filesystem storage backends are affected.
  • This affects both the Cronicle master server and slave servers.
    • Slave servers can "survive" this issue for the most part, because they don't rely on timers, except when detached jobs are involved. Detached jobs never "complete" on affected slave servers.
  • CPU and memory are both normal on an affected server.
    • Meaning, CPU / load average are very low, almost zero, and memory usage is normal, around 50 - 100 MB. This does not appear to be a memory leak or event loop starvation issue.

More details about the bug

I am seeing strong evidence that all the Node.js timers are failing. Here is what I've observed so far:

  1. The detached job queue system in Cronicle, which polls the /opt/cronicle/queue directory every minute, looking for detached job update JSON files, was clearly not working on several servers. This system works off the master pixl-server 1s interval tick timer.
  2. Cronicle's daily log rotation system, which runs on every server and also uses the pixl-server 1s tick timer (specifically the "day" event emitted at midnight), was not working on my affected servers. Both of them had old logs from August 21 (meaning the "day" tick did not fire at midnight on Aug 22). This is also more evidence that points at (at least) the master pixl-server 1s tick timer.
  3. When I logged into Cronicle on one of my affected servers, I noticed that the clock in the browser UI was not ticking up. The animated clock logo was not moving, and the time in the top-right was not changing. These things are all triggered by server-sent events that work off the master pixl-server 1s tick timer (see the tick-timer sketch after this list).
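
For context, the tick system described above boils down to a single 1-second setInterval that emits events other components subscribe to. Here is a minimal sketch of that pattern in Node.js (an illustration only, not the actual pixl-server code; the class and event wiring are simplified):

// Minimal sketch of a 1-second "tick" timer that other components hook into.
// Illustration of the pattern only -- not the actual pixl-server code.
const EventEmitter = require('events');

class TickServer extends EventEmitter {
	start() {
		this.lastDay = new Date().getDate();
		this.timer = setInterval(() => {
			const now = new Date();
			this.emit('tick', now); // fires every second (unless the timer subsystem dies)
			if (now.getDate() !== this.lastDay) {
				this.lastDay = now.getDate();
				this.emit('day', now); // fires once at midnight, e.g. for log rotation
			}
		}, 1000);
	}
}

const server = new TickServer();
server.on('day', () => console.log('rotate logs'));
server.start();

If the underlying setInterval silently stops firing, every feature hanging off "tick" and "day" (job scheduling, log rotation, the UI clock updates) stops with it.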

I also see evidence that this isn't limited to the pixl-server 1s interval tick timer. On one of my servers, I see this logged:

[1534858789.948][2018-08-21 06:39:49][devrelease.xxx][Cronicle][debug][5][Marking slave as disabled: mgr01.xxx][]
[1534858789.955][2018-08-21 06:39:49][devrelease.xxx][Cronicle][debug][5][Marking slave as disabled: mgr02.xxx][]

This is right when the master server lost connectivity with mgr01 and mgr02. Now, the Cronicle code automatically tries to reconnect to lost slaves right after this happens, and it retries indefinitely. But that never happened. It never even tried once.

Check out the code: https://github.com/jhuckaby/Cronicle/blob/master/lib/comm.js#L231-L237

That bit of code never ran on the affected server (i.e. it never logged "Reconnecting to slave:"), and this runs under its own separate setTimeout() call (i.e. not the pixl-server 1s interval timer). Meaning, this is evidence that the entire Node.js timer system got borked.
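
For illustration, that reconnect logic boils down to the usual retry-on-setTimeout shape. Below is a simplified, self-contained sketch of the pattern; the function names and the 5-second delay are placeholders, not the real Cronicle code:

// Simplified sketch of the "retry reconnect on its own setTimeout" pattern.
// All names here are placeholders, not real Cronicle APIs.
function connectToSlave(slave, callback) {
	// stand-in for the real socket.io connection attempt
	setTimeout(() => callback(new Error('connect failed')), 100);
}

function scheduleReconnect(slave) {
	// runs on its own setTimeout, independent of the 1s tick timer;
	// if the timer subsystem is broken, this callback simply never fires
	setTimeout(() => {
		console.log('Reconnecting to slave: ' + slave.hostname);
		connectToSlave(slave, (err) => {
			if (err) scheduleReconnect(slave); // retry indefinitely
		});
	}, 5000);
}

scheduleReconnect({ hostname: 'mgr01.xxx' });

On a healthy server this would keep logging "Reconnecting to slave:" over and over; on the affected server, that first callback simply never ran.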

Furthermore, the other question is WHY it lost connectivity with mgr01/mgr02, and why at the same time as everything else. Well, I believe the answer is again timers:

[1534858789.948][2018-08-21 06:39:49][devrelease.xxx][Error][error][server][Slave disconnected unexpectedly: mgr01.xxx][]
[1534858789.954][2018-08-21 06:39:49][devrelease.xxx][Error][error][server][Slave disconnected unexpectedly: mgr02.xxx][]

As you can see here, devrelease.xxx (the Cronicle master) says that mgr01 and mgr02 disconnected unexpectedly. But look at what mgr01 and mgr02 have to say about this:

[1534858789.946][2018-08-21 06:39:49][mgr02.xxx][Cronicle][debug][5][Socket.io client disconnected: _IXAzsPZvXRS6aEpAAAA (IP: ::ffff:100.xx)][]
[1534858789.946][2018-08-21 06:39:49][mgr01.xxx][Cronicle][debug][5][Socket.io client disconnected: oyMlpxiFAbZp4uSzAAAA (IP: ::ffff:100.xx)][]

They say that the MASTER disconnected from THEM. So each side is pointing the finger at the other.

Here is what I believe actually happened. Socket.io fired a "disconnect" event for both servers on the master, but it never actually lost connectivity. It disconnected because of a loss of heartbeat. Socket.io has its own heartbeat "echo packet" system that keeps TCP connections alive, and it runs on its own setInterval. This also failed at the same time as everything else, resulting in the sockets hitting idle timeout and dying.
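
To make the mechanism concrete: a heartbeat loop is just a setInterval that periodically writes a keep-alive packet, so if timers die, the pings stop, the peer's idle timeout eventually closes the socket, and each side then blames the other. A generic sketch of that loop (not Socket.io's actual internals):

// Generic sketch of a heartbeat keep-alive loop (not Socket.io's internal code).
// If this setInterval silently stops firing, no more pings go out and the peer's
// idle timeout eventually closes the connection.
const net = require('net');

function startHeartbeat(socket, intervalMs) {
	const timer = setInterval(() => {
		socket.write('ping\n'); // keep-alive packet
	}, intervalMs);
	socket.on('close', () => clearInterval(timer));
}

// usage sketch against a hypothetical local service:
const socket = net.connect(3012, '127.0.0.1', () => startHeartbeat(socket, 25000));
socket.on('error', (err) => console.error('socket error:', err.message));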

I still don't know WHY yet, but it looks like the entire Node.js timer subsystem failed. One reason could be that the event loop was totally CPU-bound, but I see no other evidence of that. The CPUs were asleep on all affected servers. If the main process were CPU-pinned (which would indeed cause timers to not fire), we would have seen this in our monitoring tools. But no, the load averages were very low all morning long, and memory usage was normal.

Investigation ongoing, but I don't have much else to go on at the moment.

Cronicle v0.8.26 has a new HTTP JSON REST API called /api/app/status which returns some internal status information, including the current "tick age" (the number of seconds since the last tick). This age should NEVER be longer than 1.0 (1 second). If it is over 60 seconds, Cronicle will log a level 1 debug message as part of this API. You can fetch this API by hitting a URL locally such as:

curl -s http://127.0.0.1:3012/api/app/status

Example response (with ?pretty=1):

{
	"code": 0,
	"version": "0.8.26",
	"node": "v10.5.0",
	"hostname": "dev01.xxx.net",
	"ip": "100.xxx.1",
	"pid": 2740,
	"now": 1535488477.914,
	"uptime": 176,
	"last_tick": 1535488477.361,
	"tick_age": 0.5529999732971191,
	"cpu": {
		"user": 1684823,
		"system": 239750
	},
	"mem": {
		"rss": 49987584,
		"heapTotal": 27570176,
		"heapUsed": 23715944,
		"external": 1119951
	}
}

I am planning on setting up a 1-minute crontab to monitor this API via curl and pipe the response to a log file. If the tick age ever exceeds 60 seconds, I can take action, such as restarting the service. This is more of a fact-finding mission than a solution, as it is sometimes hard to tell whether a server is in a bad state (especially a slave server).

Example crontab:

* * * * * root /usr/bin/curl -s -m 5 "http://127.0.0.1:3012/api/app/status" >>/opt/cronicle/logs/monitor.log 2>&1
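
Taking that one step further, the same status API could drive an automatic restart instead of just a log entry. Here is a hedged sketch of such a watchdog in Node.js; the 60-second threshold and the control.sh restart command come straight from this thread, while the rest is just one possible way to wire it up (it could run from the same 1-minute crontab):

// check-tick-age.js -- sketch of a watchdog that polls /api/app/status and
// restarts Cronicle if the tick age exceeds 60 seconds. One possible approach,
// not an official Cronicle tool.
const http = require('http');
const { execFile } = require('child_process');

http.get('http://127.0.0.1:3012/api/app/status', (res) => {
	let body = '';
	res.on('data', (chunk) => { body += chunk; });
	res.on('end', () => {
		const status = JSON.parse(body);
		console.log(new Date().toISOString(), 'tick_age =', status.tick_age);
		if (status.tick_age > 60) {
			console.log('Tick age too high, restarting Cronicle...');
			execFile('/opt/cronicle/bin/control.sh', ['restart'], (err) => {
				if (err) console.error('Restart failed:', err.message);
			});
		}
	});
}).on('error', (err) => console.error('Status request failed:', err.message));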

After examining all my server logs, I can conclude that on all 5 servers where this issue was found, the total run time (daemon lifetime / uptime) from initial startup to going berserk was exactly 24 days, 20.5 hours.

These servers were all started on different days and at different times of day, too, so this really is looking like a preset lifespan before self-destruction.

Okay, this can't be a coincidence. I think I may actually be onto something here.

My Cronicle servers didn't all glitch out at the same time, but if you consider that they were all started on different days & times, the actual total runtime (total duration of the daemon lifespan) is damn near identical. It's always 24 days + 20.5 hours.

Using one of my server's logs, I was able to get this estimate closer to exact (+0-1 minutes):

24 days, 20 hours, 32 minutes
or... 2,147,520 seconds

Now, that total number of seconds suddenly caught my eye. Why does that number look so familiar??? 2,147,520.... 2,147,520.....

Then it hit me. That looks like a reduced version of 2 ^ 31 (i.e. 2 raised to the 31st power) -- in other words, the 32-bit signed integer maximum value:

2,147,483,647

However, that's 2.1 BILLION, not million. But what if we're talking about milliseconds here, and not seconds?

The answer is, this is very close, and keep in mind I don't know EXACTLY when my servers bugged out and stopped firing timers. I am estimating based on when things started going wonky in the logs, presumably some number of seconds after the actual bug-out time:

My server's Cronicle lifetime in milliseconds: 
2,147,520,000

32-bit signed integer maximum:
2,147,483,647

That's within 36,353 milliseconds, or only about 36 seconds. And my estimates are based on log events that would occur about a minute after the fact. OMG, I think it adds up.
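
Working the numbers through makes the match plain. This is just arithmetic in Node.js, nothing Cronicle-specific:

// 2^31 - 1 milliseconds (the 32-bit signed integer max) expressed as uptime,
// next to the observed daemon lifetime estimated from the logs.
const maxInt32 = 2147483647;      // ms
const observed = 2147520000;      // ms, i.e. 24 days, 20 hours, 32 minutes

console.log(maxInt32 / 86400000); // ~24.855 days, i.e. 24 days + ~20.5 hours
console.log(observed - maxInt32); // 36353 ms, about 36 seconds of slack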

Um..... okay, this CANNOT be a coincidence.

I think Cronicle is dying after 2 ^ 31 (2,147,483,647) milliseconds of total runtime. What could possibly cause such a weird thing?

@jhuckaby 👏 🙏 for all the fact-finding. I was considering Cronicle for running some critical scheduling tasks; I will hold off on it for now. Any fixes for this yet? Though I am not a JS guy, I will take a shot at finding and fixing it.

Hi @jhuckaby,
I found nodejs/node#22149, and this seems to be a Node.js issue.

Probably only Node.js 10.x.x is affected, the same version that you are using :)

@pioardi Oh wow, look at that. Thank you for tracking that down! I searched the Node.js bug tracker several times, but never found that bug. I am 99.9% certain that is the same issue as this.

@jhuckaby So..good news :)

Looks like the bug was introduced in this PR, and released with Node v10.0. nodejs/node#18486

Thankfully the bug is fixed, landed, and will be included in the next release (presumably Node v10.10). PR for fix: nodejs/node#22214

I'll have to put in a special check for Node v10.0 - v10.9, and emit a warning to the console on startup or something.

Okay, so the latest version of Node still has the issue, right? 😐

@tckb Yup, the bug is confirmed in Node v10.9, alas. It looks like all released versions of Node v10 are affected, from v10.0.0 all the way to v10.9.0. The bug was introduced in nodejs/node#18486 which was released as part of Node v10.0.0.

What you can do in the interim, while we wait for the next Node.js release, is schedule a weekly restart of the service. Pick a day/time when you know you don't have any jobs running, for example Sunday at 3:00 AM:

echo "0 3 * * 0 /opt/cronicle/bin/control.sh restart" > /etc/cron.d/cronicle-weekly-restart.cron

I suppose you could do this inside Cronicle as a scheduled event using the Shell Plugin, but it would have to be a detached job, so the restart doesn't abort the job itself. Might be simpler to just use a crontab ;)

@tckb I apologize, but it looks like I was given some misinformation. The bug is indeed fixed in Node.js v10.9.0. I verified this myself using libfaketime and simulating 25 days of runtime in a few minutes. I was able to reproduce the bug in 10.5, but not on 10.9. Whew!

I have released Cronicle v0.8.27 which emits a warning message on startup if you are running on Node.js v10.0.0 to v10.8.0.
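
For anyone curious what such a startup check might look like, it only needs to parse process.version and compare against the affected range. A sketch along those lines (an assumption on my part, not the exact code shipped in Cronicle v0.8.27):

// Sketch of a startup warning for the affected Node.js versions (v10.0.0 - v10.8.x).
// Illustration only -- not the exact check shipped in Cronicle v0.8.27.
const [major, minor] = process.versions.node.split('.').map(Number);

if (major === 10 && minor <= 8) {
	console.warn(
		'WARNING: Node.js ' + process.version + ' stops firing all timers after ' +
		'~24.8 days of uptime (see nodejs/node#22149). Please upgrade to v10.9.0 or later, ' +
		'or schedule a periodic restart of Cronicle.'
	);
}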

Ah! That's a relief, thanks @jhuckaby