opensearch-project / opensearch-ci

Enables continuous integration across OpenSearch, OpenSearch Dashboards, and plugins.


Investigate the spike in flaky test failures as a function of the Gradle check configuration and Jenkins runner instance sizing

nknize opened this issue

Is your feature request related to a problem? Please describe

Coming out of this public Slack discussion, I'd like to explore a possible spike in flaky test failures during gradlew check on PRs in the OpenSearch core repository during regular business hours.

The concrete test failures we're noticing are similar to:

Caused by: java.net.ConnectException: Connection refused
	at sun.nio.ch.Net.pollConnect(Native Method) ~[?:?]
	at sun.nio.ch.Net.pollConnectNow(Net.java:672) ~[?:?]
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:946) ~[?:?]
	at org.opensearch.nio.SocketChannelContext.connect(SocketChannelContext.java:157) ~[opensearch-nio-2.9.0-SNAPSHOT.jar:2.9.0-SNAPSHOT]

As can be seen in this one instance, the problem seems mostly related to socket issues in the runner and tends to occur in "aggressive" integration tests (e.g., those using the Scope.TEST cluster scope, which fires up a new cluster for each test method).
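For context, a minimal sketch of what such an aggressive test can look like, assuming the ClusterScope/Scope.TEST annotation from the core test framework (OpenSearchIntegTestCase); the class and index names here are made up for illustration:

import org.opensearch.test.OpenSearchIntegTestCase;
import org.opensearch.test.OpenSearchIntegTestCase.ClusterScope;
import org.opensearch.test.OpenSearchIntegTestCase.Scope;

// Scope.TEST means a brand-new test cluster is started (and torn down) for every test method.
@ClusterScope(scope = Scope.TEST)
public class ExampleAggressiveIT extends OpenSearchIntegTestCase {

    public void testFirstScenario() {
        // runs against its own freshly started cluster
        createIndex("example-index");
        ensureGreen("example-index");
    }

    public void testSecondScenario() {
        // the previous cluster is gone; another one is spun up just for this method
        createIndex("another-index");
        ensureGreen("another-index");
    }
}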

With Jenkins spinning up its own runner for each invocation, I wouldn't expect the high level of activity (e.g., multiple PRs throughout the day) to contribute, so maybe this is more related to the test intensity, the --parallel Gradle invocation, and the size of the runner instance?

Describe the solution you'd like

In parallel with the effort to lean out the intense integration tests in the core repo, I'd like us to see if we can root-cause these timeouts as a function of instance resources (e.g., CPU, memory) and the test configuration (e.g., number of concurrent integration tests, number of sockets).

It may be that we simply aren't closing the sockets in the core IntegrationTest class (we can explore that separately).
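To make that concrete, here is a generic JDK-level sketch (not the actual core IntegrationTest code) of tracking client channels opened during a test and releasing them in teardown; the helper class and method names are hypothetical:

import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.SocketChannel;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the OpenSearch test harness.
public class TrackedSockets {
    private final List<SocketChannel> open = new ArrayList<>();

    // Open a channel and remember it so teardown can release it.
    public SocketChannel connect(String host, int port) throws IOException {
        SocketChannel channel = SocketChannel.open(new InetSocketAddress(host, port));
        open.add(channel);
        return channel;
    }

    // Call from an @After/teardown hook so no channel outlives the test method.
    public void closeAll() {
        for (SocketChannel channel : open) {
            try {
                channel.close();
            } catch (IOException ignored) {
                // best-effort cleanup
            }
        }
        open.clear();
    }
}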

Describe alternatives you've considered

  • Check that the core integration test harness is properly closing sockets (see the descriptor-count sketch after this list).
  • Check the socket pool configuration in the core test framework.
  • ... other core improvements not explicitly mentioned here.
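For the first bullet, one way to spot leaked sockets on a Unix runner is to compare the process's open file descriptor count before and after a test; this is a rough sketch using standard JDK MXBeans, not existing harness code:

import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import com.sun.management.UnixOperatingSystemMXBean;

public class DescriptorSnapshot {

    // Returns the open file descriptor count for this JVM, or -1 if unavailable.
    public static long openDescriptors() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof UnixOperatingSystemMXBean) {
            return ((UnixOperatingSystemMXBean) os).getOpenFileDescriptorCount();
        }
        return -1;
    }

    public static void main(String[] args) {
        long before = openDescriptors();
        // ... run the suspect integration test logic here ...
        long after = openDescriptors();
        System.out.println("open file descriptors before=" + before + ", after=" + after);
    }
}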

Additional context

Thank you!

We will try to create a new runner with @nknize's own environment specs: 32 vCPU / 128 GB, similar to an m5.8xlarge.
It is possible that Nick has 32/128 while we have 96/192, which means --parallel creates 3 times more parallel tasks on our instance, and each task is assigned half as much memory.
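To make the sizing arithmetic concrete, a small sketch of the back-of-the-envelope calculation above; it assumes Gradle's --parallel default of one worker per available processor, and the instance shapes are the ones quoted in this thread, not measured values:

public class RunnerSizing {

    static void report(String name, int vcpus, int memoryGb) {
        // With --parallel, Gradle defaults to one worker per available processor,
        // so the worker count follows the vCPU count.
        int workers = vcpus;
        double memoryPerWorkerGb = (double) memoryGb / workers;
        System.out.printf("%s: %d workers, %.1f GB per worker%n", name, workers, memoryPerWorkerGb);
    }

    public static void main(String[] args) {
        report("desktop / m5.8xlarge-like (32 vCPU, 128 GB)", 32, 128); // 4.0 GB per worker
        report("current runner (96 vCPU, 192 GB)", 96, 192);            // 2.0 GB per worker
        // 96 / 32 = 3x more parallel tasks, each with half the memory headroom.
    }
}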

Also, the desktop environment setup means his CPU's single-core frequency is considerably higher than that of genuine Intel server CPUs. That needs to be taken into account as well. I will start investigating this next week.

Thanks.

[Two screenshots, 2023-07-20 at 7:31 PM]

Several days of data show the new setup has roughly a 90% unstable rate vs. a 10% success rate, but we have yet to see a complete failure.

So it is possible the new m5.8xlarge spec is better than the original c5.24xlarge setup.

Thanks.

We have decided to test switching the default runner to m5.8xlarge next week.

The new spec is live. Monitoring for a bit.

More successful runs.

Closing this issue as the changes were completed.