Investigate the spike in flaky test failures as a function of the gradle check configuration and Jenkins Runner instance sizing
nknize opened this issue
Is your feature request related to a problem? Please describe
Coming out of this public Slack discussion, I'd like to explore a possible spike in flaky test failures during `gradlew check` on PRs in the OpenSearch core repository during regular business hours.
The concrete test failures we're noticing are similar to:
```
Caused by: java.net.ConnectException: Connection refused
    at sun.nio.ch.Net.pollConnect(Native Method) ~[?:?]
    at sun.nio.ch.Net.pollConnectNow(Net.java:672) ~[?:?]
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:946) ~[?:?]
    at org.opensearch.nio.SocketChannelContext.connect(SocketChannelContext.java:157) ~[opensearch-nio-2.9.0-SNAPSHOT.jar:2.9.0-SNAPSHOT]
```
As can be seen in this one instance, this seems mostly related to socket issues in the runner, and it appears to occur on "aggressive" integration tests (e.g., those using the `Scope.TEST` level, which fires up a new cluster for each test method).
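For reference, the `ConnectException` in the trace above is how the JVM surfaces an OS-level `ECONNREFUSED`, i.e., the test tried to connect to a port where no node was listening (because the cluster node hadn't bound yet, or had already shut down). A minimal, plain-JDK sketch (not OpenSearch code; `RefusedDemo` is a hypothetical name) that reproduces the same error:

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class RefusedDemo {
    public static void main(String[] args) throws IOException {
        // Grab an ephemeral port, then release it so nothing is listening there.
        int port;
        try (ServerSocket server = new ServerSocket(0)) {
            port = server.getLocalPort();
        }
        // Connecting to a port with no listener yields ECONNREFUSED,
        // which the JVM raises as java.net.ConnectException.
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress("127.0.0.1", port), 1000);
            System.out.println("unexpectedly connected");
        } catch (ConnectException e) {
            System.out.println("Connection refused");
        }
    }
}
```

This is why the failures point at either the harness (nodes not up yet / ports recycled) or resource pressure delaying node startup, rather than at the test logic itself.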
With Jenkins having its own Runner for each invocation, I wouldn't expect the high level of activity (e.g., multiple PRs throughout the day) to contribute, so maybe this is more related to the test intensity, the `--parallel` gradle invocation, and the size of the Runner instance?
Describe the solution you'd like
As a parallel effort to leaning out the intense integration tests in the core repo, I'd like us to see if we can root cause these timeouts as a function of instance resources (e.g., CPU, memory) and the test configuration (e.g., number of concurrent integration tests, number of sockets).
It may also be that we just aren't closing the sockets in the core IntegrationTest class (we can explore that separately).
Describe alternatives you've considered
- Check that the core Integration Test harness is properly closing sockets.
- Check the socket pool configuration in the core test framework.
- ... other core improvements not explicitly mentioned here.
Additional context
Thank you!
We will try to create a new runner with @nknize's own env specs: 32/128, similar to an m5.8xlarge.
It is possible that Nick has 32 vCPUs/128 GB while we have 96/192; that means `--parallel` creates 3 times more parallel tasks on our instance, so each job is assigned half the memory.
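The arithmetic behind that, assuming Gradle's default of up to one worker per available processor (`--max-workers` defaults to the CPU count):

```java
public class WorkerMemory {
    public static void main(String[] args) {
        // Assumption: --parallel schedules roughly one worker per vCPU.
        int devCpus = 32, devMemGb = 128; // @nknize's desktop: 32/128
        int ciCpus = 96, ciMemGb = 192;   // current CI runner: 96/192
        // Memory available per concurrent worker on each machine.
        System.out.println("desktop GB per worker: " + devMemGb / devCpus);
        System.out.println("runner  GB per worker: " + ciMemGb / ciCpus);
    }
}
```

So the 96/192 runner launches 3x the concurrent tasks with 4 GB vs. 2 GB per worker, which would explain socket/startup timeouts appearing only in CI.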
Also, the desktop env setup means his CPU's single-core frequency is way higher than that of typical Intel server CPUs. That needs to be taken into account as well. I will start investigating this next week.
Thanks.
We have decided to test switching the default runner to m5.8xlarge next week.
New spec is live. Monitoring for a bit.
Closing this issue as the changes were completed.