scylladb / scylla-jmx

Scylla JMX proxy

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

jmx takes ~90 seconds to start receiving connections

bentsi opened this issue · comments

affected versions: master and 4.0
It takes about 90 seconds JMX to start receiving connections.

2020-04-22T11:12:01+00:00  gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO    | systemd: Started Scylla Server.

2020-04-22T11:12:01+00:00  gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO    | systemd: Started Scylla JMX.

2020-04-22T11:12:04+00:00  gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO    | scylla-jmx: Picked up JAVA_TOOL_OPTIONS:

2020-04-22T11:12:41+00:00  gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO    | scylla-jmx: Using config file: /etc/scylla/scylla.yaml

2020-04-22T11:13:05+00:00  gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO    | scylla-jmx: Connecting to http://127.0.0.1:10000

2020-04-22T11:13:05+00:00  gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO    | scylla-jmx: Starting the JMX server

2020-04-22T11:13:39+00:00  gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO    | scylla-jmx: JMX is enabled to receive remote connections on port: 7199 

Possibly related to #98

Can we correlate the messages form scylla-jmx to the node's logs?
Was the node bootstrapping for a long time?

starting scylla

2020-04-22T11:12:00+00:00  gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO    | scylla: [shard 0] init - Scylla version 4.0.rc2-0.20200421.89e79023aeb with build-id efad5bfb73c2ce2191c69bfbf882efbba4e3d0ac starting ...

Scylla started after 27 seconds

2020-04-22T11:12:27+00:00  gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO    | scylla: [shard 0] storage_service - Starting listening for CQL clients on 10.0.67.95:9042 (unencrypted)
2020-04-22T11:12:27+00:00  gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO    | scylla: [shard 0] storage_service - Thrift server listening on 10.0.67.95:9160 ...
2020-04-22T11:12:27+00:00  gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO    | scylla: [shard 0] init - serving
2020-04-22T11:12:27+00:00  gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO    | scylla: [shard 0] init - Scylla version 4.0.rc2-0.20200421.89e79023aeb initialization completed.

only then JMX tried to connect to the port 10000 (but the service started earlier 2020-04-22T11:12:01+00:00) that is a bit strange.

2020-04-22T11:13:05+00:00  gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO    | scylla-jmx: Connecting to http://127.0.0.1:10000
2020-04-22T11:13:05+00:00  gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO    | scylla-jmx: Starting the JMX server

I investigated it a bit and that's what I found:

image

Full DB log here: https://cloudius-jenkins-test.s3.amazonaws.com/38f7fe64-67d0-43e1-b6e2-e16ef723e314/20200422_125722/db-cluster-38f7fe64.zip

There are no changes in scylla-jmx between 3.3 and 4.0. We have the following changes between 3.2 and 3.3 that might have some impact on startup:

  • 2960125 ("dist/redhat: call systemctl --daemon-reload when upgraded (#92)")
  • d8c4760 ("Create a HTTP client per instance (#86)")

The issue could be in scylla.git, of course. @slivne asked about systems slices, but those were introduced in Scylla 3.2 already.

@bentsi You wouldn't happen to have systemd logs stashed anywhere for the slow JMX startup case? To narrow down the issue, we need to first understand if the problem is jmx not starting, or jmx starting but stalling.

@penberg

Full DB log here: https://cloudius-jenkins-test.s3.amazonaws.com/38f7fe64-67d0-43e1-b6e2-e16ef723e314/20200422_125722/db-cluster-38f7fe64.zip

Look for log of node gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 and check timestamps that I mentioned in the issue description

If I looked up the timestamps, correctly, we have CQL ready to serve requests at 11:12:27:

2020-04-22T11:12:27+00:00  gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO    | scylla: [shard 0] storage_service - Starting listening for CQL clients on 10.0.67.95:9042 (unencrypted)

JMX service starts 14 seconds later:

2020-04-22T11:12:41+00:00  gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO    | scylla-jmx: Using config file: /etc/scylla/scylla.yaml

However, as you say, it takes another 24 seconds before JMX is able to connect to Scylla REST API and start serving JMX requests:

2020-04-22T11:13:05+00:00  gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO    | scylla-jmx: Connecting to http://127.0.0.1:10000
2020-04-22T11:13:05+00:00  gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO    | scylla-jmx: Starting the JMX server

It looks like connecting to the API server takes 24 seconds. If this is a regression, then the first thing I can think of is commit d8c4760. However, this commit is already part of Scylla 3.3, so it's not that recent.

One way to debug this further is to increase the logging of Jersey for Scylla JXM. It's been too long for me to remember how to configure that, but I think Jersey is using java.util logging (let's CC @elcallio and @tarzanek who usually remember this stuff).

those 24s are spent in static inits for API as Pekka correctly pointed out

private static final APIConfig config = new APIConfig();
public static final APIClient client = new APIClient(config);

    public static void main(String[] args) throws Exception {
        System.out.println("Connecting to " + config.getBaseUrl());

I checked release notes of jersey and they fixed bugs around this
so it would make sense to update at least to 2.29.1 and retest

that said, I just tested with old 2.22.1 and
with master from scylla-jmx and scylla and for me the jersey connection to API was instant
so I think the problem might be in the environment too

  • e.g. I've seen previously weird delays when naming is not properly set (even resolving localhost or 127.0.0.1 takes time) - which is my guess # 1 - @bentsi can you check that testing machine(or VM or docker or whatever is used) how fast does it resolve the localhost or the IP? (ideally getent hosts localhost (same for IP), or nslookup worst case)
  • the other option is any firewall or OS restriction or creating connections (e.g. some virtualizations might restrict this)

(btw. I tested 2.29.1 which compiles well and works well for me too, perhaps it might make sense to run this test with 2.29.1 - how do I run just this test Bentsi? )

I tried to reproduce too and failed. @bentsi how can we check the issues @tarzanek pointed out? IOW, is there a problem with address resolution. One possible culprit could be Scylla's HTTP server too.