opensearch-project / opensearch-ci

Enables continuous integration across OpenSearch, OpenSearch Dashboards, and plugins.


Investigate the spike in flaky test failures as a function of the Gradle check configuration and Jenkins runner instance sizing

nknize opened this issue

Is your feature request related to a problem? Please describe

Coming out of this public Slack discussion, I'd like to explore a possible spike in flaky test failures during gradlew check on PRs in the OpenSearch core repository during regular business hours.

The concrete test failures we're noticing are similar to:

Caused by: java.net.ConnectException: Connection refused
	at sun.nio.ch.Net.pollConnect(Native Method) ~[?:?]
	at sun.nio.ch.Net.pollConnectNow(Net.java:672) ~[?:?]
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:946) ~[?:?]
	at org.opensearch.nio.SocketChannelContext.connect(SocketChannelContext.java:157) ~[opensearch-nio-2.9.0-SNAPSHOT.jar:2.9.0-SNAPSHOT]

As can be seen in this one instance, the problem seems mostly related to socket issues in the runner and tends to occur in "aggressive" integration tests (e.g., those using the Scope.TEST cluster scope, which fires up a new cluster for each test method).
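For context, a minimal sketch of what such an aggressive test can look like, assuming the ClusterScope/Scope.TEST annotation from the core test framework (OpenSearchIntegTestCase); the class and index names here are made up for illustration:

import org.opensearch.test.OpenSearchIntegTestCase;
import org.opensearch.test.OpenSearchIntegTestCase.ClusterScope;
import org.opensearch.test.OpenSearchIntegTestCase.Scope;

// Scope.TEST means a brand-new test cluster is started (and torn down) for every test method.
@ClusterScope(scope = Scope.TEST)
public class ExampleAggressiveIT extends OpenSearchIntegTestCase {

    public void testFirstScenario() {
        // runs against its own freshly started cluster
        createIndex("example-index");
        ensureGreen("example-index");
    }

    public void testSecondScenario() {
        // the previous cluster is gone; another one is spun up just for this method
        createIndex("another-index");
        ensureGreen("another-index");
    }
}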

With Jenkins spinning up its own runner for each invocation, I wouldn't expect the high level of activity (e.g., multiple PRs throughout the day) to contribute, so maybe this is more related to the test intensity, the --parallel Gradle invocation, and the size of the runner instance?

Describe the solution you'd like

In parallel with the effort to lean out the intense integration tests in the core repo, I'd like us to see if we can root-cause these timeouts as a function of instance resources (e.g., CPU, memory) and the test configuration (e.g., number of concurrent integration tests, number of sockets).

It may be that we simply aren't closing the sockets in the core IntegrationTest class (we can explore that separately).
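To make that concrete, here is a generic JDK-level sketch (not the actual core IntegrationTest code) of tracking client channels opened during a test and releasing them in teardown; the helper class and method names are hypothetical:

import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.SocketChannel;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the OpenSearch test harness.
public class TrackedSockets {
    private final List<SocketChannel> open = new ArrayList<>();

    // Open a channel and remember it so teardown can release it.
    public SocketChannel connect(String host, int port) throws IOException {
        SocketChannel channel = SocketChannel.open(new InetSocketAddress(host, port));
        open.add(channel);
        return channel;
    }

    // Call from an @After/teardown hook so no channel outlives the test method.
    public void closeAll() {
        for (SocketChannel channel : open) {
            try {
                channel.close();
            } catch (IOException ignored) {
                // best-effort cleanup
            }
        }
        open.clear();
    }
}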

Describe alternatives you've considered

  • Check that the core integration test harness is properly closing sockets (see the descriptor-count sketch after this list).
  • Check the socket pool configuration in the core test framework.
  • ... other core improvements not explicitly mentioned here.
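For the first bullet, one way to spot leaked sockets on a Unix runner is to compare the process's open file descriptor count before and after a test; this is a rough sketch using standard JDK MXBeans, not existing harness code:

import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import com.sun.management.UnixOperatingSystemMXBean;

public class DescriptorSnapshot {

    // Returns the open file descriptor count for this JVM, or -1 if unavailable.
    public static long openDescriptors() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof UnixOperatingSystemMXBean) {
            return ((UnixOperatingSystemMXBean) os).getOpenFileDescriptorCount();
        }
        return -1;
    }

    public static void main(String[] args) {
        long before = openDescriptors();
        // ... run the suspect integration test logic here ...
        long after = openDescriptors();
        System.out.println("open file descriptors before=" + before + ", after=" + after);
    }
}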

Additional context

Thank you!

We will try to create a new runner with @nknize's own environment specs: 32 vCPU / 128 GB, similar to an m5.8xlarge.
It is possible that Nick has 32/128 while we have 96/192, which means --parallel creates 3 times more parallel tasks on our instance, and each task is assigned half as much memory.
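To make the sizing arithmetic concrete, a small sketch of the back-of-the-envelope calculation above; it assumes Gradle's --parallel default of one worker per available processor, and the instance shapes are the ones quoted in this thread, not measured values:

public class RunnerSizing {

    static void report(String name, int vcpus, int memoryGb) {
        // With --parallel, Gradle defaults to one worker per available processor,
        // so the worker count follows the vCPU count.
        int workers = vcpus;
        double memoryPerWorkerGb = (double) memoryGb / workers;
        System.out.printf("%s: %d workers, %.1f GB per worker%n", name, workers, memoryPerWorkerGb);
    }

    public static void main(String[] args) {
        report("desktop / m5.8xlarge-like (32 vCPU, 128 GB)", 32, 128); // 4.0 GB per worker
        report("current runner (96 vCPU, 192 GB)", 96, 192);            // 2.0 GB per worker
        // 96 / 32 = 3x more parallel tasks, each with half the memory headroom.
    }
}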

Also, the desktop environment setup means his CPU's single-core frequency is considerably higher than that of genuine Intel server CPUs. That needs to be taken into account as well. I will start investigating this next week.

Thanks.

[Two screenshots, 2023-07-20 at 7:31 PM]

Several days of data show the new setup has roughly a 90% unstable rate vs. a 10% success rate, but we have yet to see a complete failure.

So it is possible the new m5.8xlarge spec is better than the original c5.24xlarge setup.

Thanks.

We have decided to test switching the default runner to m5.8xlarge next week.

The new spec is live. Monitoring for a bit.

More successful runs.

Closing this issue as the changes were completed.