jmx takes ~90 seconds to start receiving connections
bentsi opened this issue · comments
affected versions: master and 4.0
It takes about 90 seconds for JMX to start receiving connections.
2020-04-22T11:12:01+00:00 gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO | systemd: Started Scylla Server.
2020-04-22T11:12:01+00:00 gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO | systemd: Started Scylla JMX.
2020-04-22T11:12:04+00:00 gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO | scylla-jmx: Picked up JAVA_TOOL_OPTIONS:
2020-04-22T11:12:41+00:00 gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO | scylla-jmx: Using config file: /etc/scylla/scylla.yaml
2020-04-22T11:13:05+00:00 gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO | scylla-jmx: Connecting to http://127.0.0.1:10000
2020-04-22T11:13:05+00:00 gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO | scylla-jmx: Starting the JMX server
2020-04-22T11:13:39+00:00 gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO | scylla-jmx: JMX is enabled to receive remote connections on port: 7199
Possibly related to #98
Can we correlate the messages from scylla-jmx to the node's logs?
Was the node bootstrapping for a long time?
starting scylla
2020-04-22T11:12:00+00:00 gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO | scylla: [shard 0] init - Scylla version 4.0.rc2-0.20200421.89e79023aeb with build-id efad5bfb73c2ce2191c69bfbf882efbba4e3d0ac starting ...
Scylla started after 27 seconds
2020-04-22T11:12:27+00:00 gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO | scylla: [shard 0] storage_service - Starting listening for CQL clients on 10.0.67.95:9042 (unencrypted)
2020-04-22T11:12:27+00:00 gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO | scylla: [shard 0] storage_service - Thrift server listening on 10.0.67.95:9160 ...
2020-04-22T11:12:27+00:00 gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO | scylla: [shard 0] init - serving
2020-04-22T11:12:27+00:00 gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO | scylla: [shard 0] init - Scylla version 4.0.rc2-0.20200421.89e79023aeb initialization completed.
Only then did JMX try to connect to port 10000, even though the JMX service had been started much earlier (2020-04-22T11:12:01+00:00), which is a bit strange.
2020-04-22T11:13:05+00:00 gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO | scylla-jmx: Connecting to http://127.0.0.1:10000
2020-04-22T11:13:05+00:00 gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO | scylla-jmx: Starting the JMX server
I investigated it a bit and that's what I found:
Full DB log here: https://cloudius-jenkins-test.s3.amazonaws.com/38f7fe64-67d0-43e1-b6e2-e16ef723e314/20200422_125722/db-cluster-38f7fe64.zip
There are no changes in scylla-jmx between 3.3 and 4.0. We have the following changes between 3.2 and 3.3 that might have some impact on startup:
- 2960125 ("dist/redhat: call systemctl --daemon-reload when upgraded (#92)")
- d8c4760 ("Create a HTTP client per instance (#86)")
The issue could be in scylla.git, of course. @slivne asked about systemd slices, but those were introduced in Scylla 3.2 already.
@bentsi You wouldn't happen to have systemd logs stashed anywhere for the slow JMX startup case? To narrow down the issue, we need to first understand if the problem is jmx not starting, or jmx starting but stalling.
Full DB log here: https://cloudius-jenkins-test.s3.amazonaws.com/38f7fe64-67d0-43e1-b6e2-e16ef723e314/20200422_125722/db-cluster-38f7fe64.zip
Look for log of node gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3
and check timestamps that I mentioned in the issue description
If I looked up the timestamps correctly, we have CQL ready to serve requests at 11:12:27:
2020-04-22T11:12:27+00:00 gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO | scylla: [shard 0] storage_service - Starting listening for CQL clients on 10.0.67.95:9042 (unencrypted)
JMX service starts 14 seconds later:
2020-04-22T11:12:41+00:00 gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO | scylla-jmx: Using config file: /etc/scylla/scylla.yaml
However, as you say, it takes another 24 seconds before JMX is able to connect to Scylla REST API and start serving JMX requests:
2020-04-22T11:13:05+00:00 gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO | scylla-jmx: Connecting to http://127.0.0.1:10000
2020-04-22T11:13:05+00:00 gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO | scylla-jmx: Starting the JMX server
It looks like connecting to the API server takes 24 seconds. If this is a regression, then the first thing I can think of is commit d8c4760. However, this commit is already part of Scylla 3.3, so it's not that recent.
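The gaps quoted above can be double-checked directly from the log timestamps. A minimal sketch, assuming only the timestamps pasted in this thread; `LogGaps` and `secondsBetween` are hypothetical names and not part of scylla-jmx:

```java
import java.time.Duration;
import java.time.OffsetDateTime;

// Hypothetical helper (not part of scylla-jmx): computes the gaps between
// the log timestamps quoted in this issue.
class LogGaps {
    // Whole seconds elapsed between two ISO-8601 timestamps with a UTC offset.
    static long secondsBetween(String from, String to) {
        return Duration.between(OffsetDateTime.parse(from), OffsetDateTime.parse(to)).getSeconds();
    }

    public static void main(String[] args) {
        String jmxUnitStarted = "2020-04-22T11:12:01+00:00"; // systemd: Started Scylla JMX.
        String cqlReady       = "2020-04-22T11:12:27+00:00"; // Starting listening for CQL clients
        String jmxConfigRead  = "2020-04-22T11:12:41+00:00"; // Using config file
        String jmxConnecting  = "2020-04-22T11:13:05+00:00"; // Connecting to http://127.0.0.1:10000
        String jmxListening   = "2020-04-22T11:13:39+00:00"; // JMX is enabled ... on port: 7199

        System.out.println("CQL ready -> config read:   " + secondsBetween(cqlReady, jmxConfigRead) + "s");
        System.out.println("config read -> connecting:  " + secondsBetween(jmxConfigRead, jmxConnecting) + "s");
        System.out.println("connecting -> JMX listening: " + secondsBetween(jmxConnecting, jmxListening) + "s");
        System.out.println("unit start -> JMX listening: " + secondsBetween(jmxUnitStarted, jmxListening) + "s");
    }
}
```

Besides the 14 s and 24 s gaps, this also shows the time between "Starting the JMX server" and the port actually being open, which the analysis above does not cover.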
One way to debug this further is to increase the logging of Jersey for Scylla JMX. It's been too long for me to remember how to configure that, but I think Jersey is using java.util.logging (let's CC @elcallio and @tarzanek who usually remember this stuff).
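If Jersey does log through java.util.logging, something along these lines might raise its verbosity programmatically. This is only a sketch: the logger name `org.glassfish.jersey` is assumed from the Jersey 2.x package namespace, `JerseyLogging` is a hypothetical class, and I have not checked how much Jersey actually logs during client setup or where scylla-jmx would best call this.

```java
import java.util.logging.ConsoleHandler;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper (not part of scylla-jmx): turns up java.util.logging
// output for the assumed Jersey 2.x logger namespace.
class JerseyLogging {
    // Keep a strong reference: java.util.logging holds loggers weakly,
    // so a configured logger could otherwise be garbage collected.
    static final Logger JERSEY = Logger.getLogger("org.glassfish.jersey");

    static void enableVerbose() {
        // The default console handler only prints INFO and above,
        // so attach a dedicated handler that lets everything through.
        ConsoleHandler handler = new ConsoleHandler();
        handler.setLevel(Level.ALL);
        JERSEY.setLevel(Level.ALL);
        JERSEY.addHandler(handler);
        // Avoid printing INFO+ records twice via the root handler.
        JERSEY.setUseParentHandlers(false);
    }

    public static void main(String[] args) {
        enableVerbose();
        JERSEY.fine("verbose Jersey logging enabled"); // printed to stderr
    }
}
```

The same levels could presumably be set from a logging.properties file passed with `-Djava.util.logging.config.file=...`, which avoids touching the code.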
those 24s are spent in static inits for API as Pekka correctly pointed out
private static final APIConfig config = new APIConfig();
public static final APIClient client = new APIClient(config);
public static void main(String[] args) throws Exception {
System.out.println("Connecting to " + config.getBaseUrl());
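To illustrate why a slow static initializer would show up as a delay before the "Connecting to" line: static fields are initialized before the first statement of main runs. A minimal sketch with hypothetical stand-ins (`StaticInitDemo` and `SlowConfig` are not the real scylla-jmx classes):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo (not scylla-jmx code): shows that static field
// initializers run before the first statement of main().
class StaticInitDemo {
    // Declared first so it is initialized before `config` (textual order).
    static final List<String> EVENTS = new ArrayList<>();

    // Stand-in for APIConfig; any slow work in this constructor would be
    // paid before main() prints anything.
    static class SlowConfig {
        SlowConfig() {
            EVENTS.add("config constructed");
        }

        String getBaseUrl() {
            return "http://127.0.0.1:10000";
        }
    }

    private static final SlowConfig config = new SlowConfig();

    public static void main(String[] args) {
        EVENTS.add("main entered");
        System.out.println("Connecting to " + config.getBaseUrl());
        System.out.println(EVENTS);
    }
}
```

So whatever the constructors of the static fields do (reading config, building the HTTP client) is counted in the gap before the first log line from main.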
I checked release notes of jersey and they fixed bugs around this
so it would make sense to update at least to 2.29.1 and retest
that said, I just tested with old 2.22.1 and with master from scylla-jmx and scylla, and for me the jersey connection to API was instant
so I think the problem might be in the environment too
- e.g. I've previously seen weird delays when naming is not set up properly (even resolving localhost or 127.0.0.1 takes time), which is my guess #1. @bentsi can you check on that testing machine (or VM or docker or whatever is used) how fast it resolves localhost or the IP? Ideally with getent hosts localhost (same for the IP), or nslookup worst case.
- the other option is any firewall or OS restriction on creating connections (e.g. some virtualizations might restrict this)
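Since the JVM resolver does not necessarily behave the same as getent, it may also be worth timing the lookup from Java itself. A rough sketch, assuming nothing about the test environment; `ResolveTiming` and `resolveMillis` are hypothetical names:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical diagnostic (not part of scylla-jmx): times how long the JVM
// takes to resolve a host name or IP literal.
class ResolveTiming {
    // Returns elapsed milliseconds for one lookup, or -1 if resolution fails.
    static long resolveMillis(String host) {
        long start = System.nanoTime();
        try {
            InetAddress.getByName(host);
        } catch (UnknownHostException e) {
            return -1;
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        // The JVM caches successful lookups, so only the first call per
        // host in a fresh process reflects the real resolution cost.
        for (String host : new String[] {"localhost", "127.0.0.1"}) {
            System.out.println(host + " resolved in " + resolveMillis(host) + " ms");
        }
    }
}
```

If localhost takes seconds here while 127.0.0.1 is instant, that would point at name resolution on the node rather than at Jersey.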
(btw. I tested 2.29.1, which compiles well and works well for me too, so it might make sense to run this test with 2.29.1 - how do I run just this test, Bentsi?)