Flaky tests when transient error/delay happens

Question

Flaky tests when transient error/delay happens

rmt2021 opened this issue 3 years ago · comments

rmt2021 commented 3 years ago

Version Information
21710e9

Describe the bug

We find a few unit tests fail undeterministically due to different reasons when the Azure Storage service is unstable:

Akka.Streams.Azure.StorageQueue.Tests.QueueSinkSpec.A_QueueSink_should_add_elements_to_the_queue
Akka.Streams.Azure.StorageQueue.Tests.QueueSinkSpec.A_QueueSink_should_retry_failing_messages_if_supervision_strategy_is_resume
Akka.Streams.Azure.StorageQueue.Tests.QueueSinkSpec.A_QueueSink_should_skip_failing_messages_if_supervision_strategy_is_restart
Akka.Streams.Azure.StorageQueue.Tests.QueueSourceSpecs.A_QueueSource_should_fail_when_an_error_occurs
Akka.Streams.Azure.StorageQueue.Tests.QueueSourceSpecs.A_QueueSource_should_not_fail_if_the_supervision_strategy_is_not_stop_when_an_error_occurs
Akka.Streams.Azure.StorageQueue.Tests.QueueSourceSpecs.A_QueueSource_should_only_poll_if_demand_is_available
Akka.Streams.Azure.StorageQueue.Tests.QueueSourceSpecs.A_QueueSource_should_poll_for_messages_if_the_queue_is_empty
Akka.Streams.Azure.StorageQueue.Tests.QueueSourceSpecs.A_QueueSource_should_push_available_messages

For example, hardcoded timeout is often used in the test code like

t.Wait(TimeSpan.FromSeconds(3)).Should().BeTrue();

which expects t finishes within 3 seconds.
If the Azure API called by t gets processed slower than usual (when the service is busy), the test will fail.

Another example is in A_QueueSink_should_skip_failing_messages_if_supervision_strategy_is_restart

probe.SendNext("2");
probe.SendComplete();
await task;
(await Queue.ReceiveMessagesAsync()).Value[0].MessageText.Should().Be("2");

If some transient error happens (like 503 server is busy) when processing the Azure Storage request to send message, the test will fail due to System.IndexOutOfRangeException : Index was outside the bounds of the array.

We are not sure whether such flaky tests (caused by transient error/delay) are desired or not.
If not, would it be better to:

wait until t finishes w/o hardcoded timeout in the first example?
check whether Value contains any element before calling Value[0] in the second example?
If these are considered better practice, we are willing to send a PR for the fix.

Aaron Stannard · Answer 1 · Wed Apr 20 2022 22:13:52 GMT+0800 (China Standard Time)

check whether Value contains any element before calling Value[0] in the second example?
If these are considered better practice, we are willing to send a PR for the fix.

This is how I would do it. Let me know if you still are available to submit a PR