Solution for GCP PubSub routine 502's, again

SophisticaSean opened this issue

This PR did not fix this issue, nor does it attempt to retry 502's from Google.

I just tested this in production against latest master/main in this repo and the 502's started showing up again.

503's are not retried, but should be:

Unable to fetch events from Cloud Pub/Sub. Reason: 
Request to "http://pubsub.googleapis.com/v1/projects/my-project/subscriptions/my-topic:pull" failed with status 503, got:

"upstream connect error or disconnect/reset before headers. reset reason: connection termination"

502's are not retried, but should be:

Unable to fetch events from Cloud Pub/Sub. Reason: 
Request to "http://pubsub.googleapis.com/v1/projects/my-project/subscriptions/my_topic:pull" failed with status 502, got:

My solution in this PR has been working for several months in production with zero 502 or 503 errors.

Running the current master code in production resulted in about 1-2 of these errors every couple of hours. While I was running my fix in production, we hadn't had a single one of these errors since March 3rd, 2021.

@wronfim since you wrote the PR I thought you should know.

Here's a Datadog screenshot of these errors over the last month. There were zero before I switched to master, and now we're seeing them routinely:

@SophisticaSean thank you for taking the time to test the master branch and reporting the issue.

I was able to reproduce this with an integration test:

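You're totally right: the retry logic was not executed for HTTP errors. It was only triggered for socket errors such as :nxdomain, because a response with an error status still comes back as {:ok, %{status: code}} rather than as an {:error, ...} tuple.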

The fix is simply:

diff --git a/lib/broadway_cloud_pub_sub/google_api_client.ex b/lib/broadway_cloud_pub_sub/google_api_client.ex
index d86aa1d..8784e68 100644
--- a/lib/broadway_cloud_pub_sub/google_api_client.ex
+++ b/lib/broadway_cloud_pub_sub/google_api_client.ex
@@ -55,7 +55,7 @@ defmodule BroadwayCloudPubSub.GoogleApiClient do
     %{client | adapter: adapter, pre: client.pre ++ pre}
-  defp should_retry?({:error, %{status: code}}), do: code in @retry_codes
+  defp should_retry?({:ok, %{status: code}}), do: code in @retry_codes
   defp should_retry?({:error, _reason}), do: true
   defp should_retry?(_other), do: false
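
To make the effect of the one-line change concrete, here is a minimal sketch that runs the clauses from before and after the fix against a 502 response. The module name `RetryClauseSketch`, the function names, and the `[502, 503]` status list are assumptions for illustration only; the library's actual `@retry_codes` value is not shown in this thread.

```elixir
defmodule RetryClauseSketch do
  # Assumed status list for illustration; not the library's actual @retry_codes.
  @retry_codes [502, 503]

  # Clauses as they were before the fix: status codes are only checked on :error tuples.
  def before_fix({:error, %{status: code}}), do: code in @retry_codes
  def before_fix({:error, _reason}), do: true
  def before_fix(_other), do: false

  # Clauses after the fix: status codes are checked on :ok tuples.
  def after_fix({:ok, %{status: code}}), do: code in @retry_codes
  def after_fix({:error, _reason}), do: true
  def after_fix(_other), do: false
end

# An HTTP 502 arrives as an :ok tuple carrying the status code.
bad_gateway = {:ok, %{status: 502}}

IO.inspect(RetryClauseSketch.before_fix(bad_gateway), label: "before fix")
# before fix: false
IO.inspect(RetryClauseSketch.after_fix(bad_gateway), label: "after fix")
# after fix: true
```

With the old clauses, the 502 response falls through to the catch-all clause and is never retried, which matches the behaviour reported above.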

@wronfim if you, or anyone else, would like to submit the fix along with a proper test setup using Bypass (which doesn't require hardcoding the test URL like I did), that would be much appreciated.

Good catch! I'm on it.

@SophisticaSean meanwhile, you can implement your own should_retry?/1 function and pass it to the retry opts:


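  defp should_retry?({:error, %{status: code}}), do: code in @retry_codes
  # HTTP error responses arrive as {:ok, response}, so the status is checked here too.
  defp should_retry?({:ok, %{status: code}}), do: code in @retry_codes
  # Any other error (for example a socket error such as :nxdomain) is retried.
  defp should_retry?({:error, _reason}), do: true
  defp should_retry?(_other), do: false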
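
As a rough illustration of how a predicate like this gets consulted, here is a self-contained sketch of a retry loop. It is not the library's or the HTTP client's actual retry mechanism: `RetrySketch`, `with_retry/2`, the default of 3 retries, and the `[502, 503]` status list are all hypothetical, and a real setup would normally add a delay between attempts.

```elixir
defmodule RetrySketch do
  # Hypothetical helper; not part of broadway_cloud_pub_sub.
  # Assumed status list for illustration only.
  @retry_codes [502, 503]

  def should_retry?({:ok, %{status: code}}), do: code in @retry_codes
  def should_retry?({:error, _reason}), do: true
  def should_retry?(_other), do: false

  # Calls `request` again while the predicate asks for a retry
  # and there are retries left, then returns the last result.
  def with_retry(request, retries_left \\ 3) when is_function(request, 0) do
    result = request.()

    if retries_left > 0 and should_retry?(result) do
      with_retry(request, retries_left - 1)
    else
      result
    end
  end
end
```

A call such as `RetrySketch.with_retry(fn -> pull_messages() end)` would re-run the request on a 502 or 503, where `pull_messages/0` stands in for whatever function performs the pull.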

Thanks for getting to this so quickly! I'm excited to try the new changes soon. :)

Closing in favour of PR #60