Pegs CPU at 100%
JLSchuler99 opened this issue
Love this container and run it in Kubernetes.
The only issue is that every few hours or days the container starts using 100% of CPU across all cores on the node running it, requiring a restart.
If this issue was fixed it would be great software!
I also run it for extended periods and have noticed similar behavior. I'm actually watching it with Go's profiler to see if I can catch the cause of this.
...I built a new image with the latest version of Go, so we'll see if that helps too.
The tagged image is itzg/mc-router:1.4.4-1
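For anyone who wants to watch a Go service the same way, here is a minimal sketch of exposing Go's built-in pprof endpoint on a side port. The port and wiring here are illustrative assumptions, not necessarily how mc-router sets it up:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// Serve the profiler on a side port, away from the real traffic port.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```

With that in place, `go tool pprof http://localhost:6060/debug/pprof/profile` samples CPU usage and produces output like the listing below.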
Just making a note of recent profiling I captured after mc-router had been running for over a week:
```
(pprof) top10
Showing nodes accounting for 419.95hrs, 89.91% of 467.09hrs total
Dropped 578 nodes (cum <= 2.34hrs)
Showing top 10 nodes out of 29
      flat  flat%   sum%        cum   cum%
 352.35hrs 75.43% 75.43%  386.32hrs 82.71%  syscall.Syscall
  10.75hrs  2.30% 77.74%   11.86hrs  2.54%  runtime.ifaceeq
   9.56hrs  2.05% 79.78%   15.30hrs  3.28%  runtime.reentersyscall
   8.19hrs  1.75% 81.54%  466.49hrs 99.87%  github.com/itzg/mc-router/mcproto.ReadFrame
   8.15hrs  1.75% 83.28%    8.15hrs  1.75%  runtime.casgstatus
   7.32hrs  1.57% 84.85%   10.69hrs  2.29%  runtime.deferreturn
   6.66hrs  1.43% 86.28%  449.01hrs 96.13%  net.(*conn).Read
   5.88hrs  1.26% 87.53%   17.71hrs  3.79%  runtime.exitsyscall
   5.87hrs  1.26% 88.79%    6.04hrs  1.29%  runtime.newdefer
   5.21hrs  1.12% 89.91%    8.10hrs  1.73%  runtime.exitsyscallfast
```
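With syscall.Syscall dominating under net.(*conn).Read and mcproto.ReadFrame at 99.87% cumulative, the shape of the profile suggests a read loop that keeps calling Read without ever making progress. A minimal illustration of that anti-pattern, as an assumption about the failure mode rather than mc-router's actual code:

```go
package readloop

import "net"

// pump shows the busy-loop anti-pattern: treating a read error as
// retryable means that once the connection fails (or delivers a frame
// the parser cannot recognize), Read returns instantly on every
// iteration and the goroutine spins at 100% CPU inside syscall.Syscall.
func pump(conn net.Conn, process func([]byte)) {
	buf := make([]byte, 4096)
	for {
		n, err := conn.Read(buf)
		if err != nil {
			continue // BUG: should return and close conn instead of retrying
		}
		process(buf[:n])
	}
}
```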
Looks like an strace of the pegged process might reveal why most of the time is spent in syscall.Syscall.
Release 1.6.0 now includes a --debug command line argument to help diagnose the initial frame/packet reading, which seems to be where the high CPU time is spent.
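To give a sense of what that kind of diagnostic looks like, here is a rough sketch of hex-dumping the first bytes of each new connection so malformed or unexpected initial frames become visible. This is illustrative only; the actual --debug output format is mc-router's own:

```go
package debugread

import (
	"log"
	"net"
)

// logFirstBytes dumps the opening bytes of a connection in hex, which
// makes it possible to spot an unexpected leading byte before the
// normal frame parsing even starts.
func logFirstBytes(conn net.Conn) {
	buf := make([]byte, 16)
	n, err := conn.Read(buf)
	log.Printf("first %d bytes from %s: % x (err=%v)",
		n, conn.RemoteAddr(), buf[:n], err)
}
```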
@JLSchuler99 the latest release https://github.com/itzg/mc-router/releases/tag/1.7.0 includes several fixes that add connection rate limiting and time out slow/stalled handshakes. I had finally discovered I could recreate the issue by rapidly refreshing the server list in the Minecraft client, so I was able to greatly shorten the fix-and-test cycle.
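For reference, those two mitigations can be sketched roughly as a token-bucket limiter on accepts plus a read deadline on the handshake. The limits and timeout below are assumed values, and this is not mc-router's actual code:

```go
package main

import (
	"log"
	"net"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	ln, err := net.Listen("tcp", ":25565")
	if err != nil {
		log.Fatal(err)
	}
	// Assumed limits: 50 new connections/sec with a burst of 100.
	limiter := rate.NewLimiter(rate.Limit(50), 100)
	for {
		conn, err := ln.Accept()
		if err != nil {
			continue
		}
		if !limiter.Allow() {
			conn.Close() // over the rate limit: drop immediately
			continue
		}
		go handle(conn)
	}
}

func handle(conn net.Conn) {
	defer conn.Close()
	// Give the client a bounded window to complete the handshake, so a
	// stalled or malicious client cannot hold (or spin) the router.
	conn.SetReadDeadline(time.Now().Add(5 * time.Second))
	buf := make([]byte, 512)
	n, err := conn.Read(buf)
	if err != nil {
		return // timeout or disconnect: stop instead of looping
	}
	_ = buf[:n] // handshake parsing would happen here
}
```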
Fantastic. I just deployed the new image; I'll let you know how it works out. Thanks for taking the time to look into this issue.
@JLSchuler99, I finally found the mystery packet that was getting the router into a tight loop. There is a legacy message type that even a modern client seems to send sporadically. Release 1.8.0 includes handling for that.
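For context, pre-Netty clients probe servers with a legacy server list ping whose first byte is 0xFE rather than a length-prefixed frame. A sketch of recognizing it up front, as an illustration of the idea rather than the actual 1.8.0 change:

```go
package mcframe

import (
	"bufio"
	"errors"
)

const legacyServerListPing = 0xFE // first byte of the pre-Netty ping

var ErrLegacyPing = errors.New("legacy server list ping received")

// readFrame peeks at the first byte so the legacy ping is consumed and
// rejected up front, instead of repeatedly failing (and retrying) the
// normal varint length-prefixed frame parse.
func readFrame(r *bufio.Reader) ([]byte, error) {
	b, err := r.Peek(1)
	if err != nil {
		return nil, err
	}
	if b[0] == legacyServerListPing {
		r.Discard(1) // consume it so the stream cannot stall here
		return nil, ErrLegacyPing
	}
	// ... normal length-prefixed frame parsing would continue here
	return nil, errors.New("not implemented in this sketch")
}
```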
I haven't had this problem in a few months. Seems like that fix did the trick. Closing this issue, thanks again.