atomix / copycat

A novel implementation of the Raft consensus algorithm

Home Page: http://atomix.io/copycat

Under certain disconnection circumstances, the copycat-client-io thread takes 100% CPU

JPWatson opened this issue

Successive jstack dumps of the thread alternate between these two stack traces:

"copycat-client-io-1" #17 prio=5 os_prio=0 tid=0x00007f5e60755800 nid=0x7412 runnable [0x00007f5e4cd57000]
   java.lang.Thread.State: RUNNABLE
        at java.lang.Throwable.fillInStackTrace(Native Method)
        at java.lang.Throwable.fillInStackTrace(Throwable.java:783)
        - locked <0x00000000f10b3008> (a io.atomix.copycat.session.ClosedSessionException)
        at java.lang.Throwable.<init>(Throwable.java:265)
        at java.lang.Exception.<init>(Exception.java:66)
        at java.lang.RuntimeException.<init>(RuntimeException.java:62)
        at java.lang.IllegalStateException.<init>(IllegalStateException.java:55)
        at io.atomix.copycat.session.ClosedSessionException.<init>(ClosedSessionException.java:29)
        at io.atomix.copycat.client.session.ClientSessionSubmitter.submit(ClientSessionSubmitter.java:144)
        at io.atomix.copycat.client.session.ClientSessionSubmitter.access$300(ClientSessionSubmitter.java:51)
        at io.atomix.copycat.client.session.ClientSessionSubmitter$CommandAttempt.lambda$fail$0(ClientSessionSubmitter.java:370)
        at io.atomix.copycat.client.session.ClientSessionSubmitter$CommandAttempt$$Lambda$109/2008009084.run(Unknown Source)
        at io.atomix.catalyst.concurrent.Runnables.lambda$logFailure$0(Runnables.java:20)
        at io.atomix.catalyst.concurrent.Runnables$$Lambda$31/1944702768.run(Unknown Source)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

"copycat-client-io-1" #17 prio=5 os_prio=0 tid=0x00007f5e60755800 nid=0x7412 runnable [0x00007f5e4cd57000]
   java.lang.Thread.State: RUNNABLE
        at java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:328)
        at java.util.concurrent.ScheduledThreadPoolExecutor.schedule(ScheduledThreadPoolExecutor.java:533)
        at java.util.concurrent.ScheduledThreadPoolExecutor.execute(ScheduledThreadPoolExecutor.java:622)
        at io.atomix.catalyst.concurrent.SingleThreadContext$1.execute(SingleThreadContext.java:29)
        at io.atomix.copycat.client.session.ClientSessionSubmitter$CommandAttempt.fail(ClientSessionSubmitter.java:370)
        at io.atomix.copycat.client.session.ClientSessionSubmitter.submit(ClientSessionSubmitter.java:144)
        at io.atomix.copycat.client.session.ClientSessionSubmitter.access$300(ClientSessionSubmitter.java:51)
        at io.atomix.copycat.client.session.ClientSessionSubmitter$CommandAttempt.lambda$fail$0(ClientSessionSubmitter.java:370)
        at io.atomix.copycat.client.session.ClientSessionSubmitter$CommandAttempt$$Lambda$109/2008009084.run(Unknown Source)
        at io.atomix.catalyst.concurrent.Runnables.lambda$logFailure$0(Runnables.java:20)
        at io.atomix.catalyst.concurrent.Runnables$$Lambda$31/1944702768.run(Unknown Source)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

I can reproduce this pretty easily using ONOS.

I have a 3-node ONOS cluster. If I partition one node away, then bring it back to the cluster, that node will have persistent high CPU usage coming from the constant creation of ClosedSessionExceptions as shown in the first stack trace.
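
Read together, the two traces suggest a retry cycle: `submit` finds the session closed and constructs an exception, `fail` hands a retry back to the single I/O thread, and the retry calls `submit` again. The following is only a minimal sketch of that pattern, not Copycat's code. The class `RetryLoopSketch`, its methods, and the `cap` parameter are hypothetical, and a plain `IllegalStateException` stands in for `ClosedSessionException`. The cap exists solely so the demo terminates; without it the retrying variant would spin indefinitely.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class RetryLoopSketch {
    // Hypothetical stand-in for a client session that is closed and never reopens.
    static final boolean SESSION_OPEN = false;

    /** Submits one command and returns how many exceptions were constructed for it. */
    static int run(boolean retryOnClosedSession, int cap) throws Exception {
        // Single I/O thread, analogous to "copycat-client-io-1" in the traces.
        ExecutorService io = Executors.newSingleThreadExecutor();
        AtomicInteger created = new AtomicInteger();
        CompletableFuture<Void> done = new CompletableFuture<>();
        try {
            submit(io, created, done, retryOnClosedSession, cap);
            // Wait for the command to fail, swallowing the failure itself.
            done.handle((result, error) -> null).get();
            return created.get();
        } finally {
            io.shutdownNow();
        }
    }

    static void submit(ExecutorService io, AtomicInteger created,
                       CompletableFuture<Void> done, boolean retry, int cap) {
        if (!SESSION_OPEN) {
            // Each pass pays for a new exception, including fillInStackTrace().
            IllegalStateException error = new IllegalStateException("session closed");
            int count = created.incrementAndGet();
            if (retry && count < cap) {
                // Immediate re-submit on the same thread: no delay, no terminal check.
                io.execute(() -> submit(io, created, done, retry, cap));
            } else {
                // Treat a closed session as terminal and fail the command once.
                done.completeExceptionally(error);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("retrying: " + run(true, 100_000));   // prints "retrying: 100000"
        System.out.println("fail fast: " + run(false, 100_000)); // prints "fail fast: 1"
    }
}
```

In the retrying variant the I/O thread does nothing but allocate exceptions and re-queue itself, which would be consistent with the 100% CPU reported above. Treating the closed session as a terminal failure constructs a single exception per command.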

Interesting - the behaviour I saw was on the client.

According to our tests, this is fixed by #336.