dashbitco / broadway_cloud_pub_sub

A Broadway producer for Google Cloud Pub/Sub

Solution for GCP PubSub routine 502's

SophisticaSean opened this issue

GCP returns a 502 pretty routinely, several times a day, when fetching messages:

<html lang=en>
  <meta charset=utf-8>
  <meta name=viewport content="initial-scale=1, minimum-scale=1, width=device-width">
  <title>Error 502 (Server Error)!!1</title>
  <a href=//www.google.com/><span id=logo aria-label=Google></span></a>
  <p><b>502.</b> <ins>That’s an error.</ins>
  <p>The server encountered a temporary error and could not complete your request.<p>Please try again in 30 seconds.  <ins>That’s all we know.</ins>

The issue is that the library always logs it with Logger.error, which can lead to false alerts around this problem.

I've manually forked the library and I'm going to try wrapping that block in a retry from the retry library. I've also submitted a ticket to GCP about the issue.

Long term, how do we want to solve this problem? It would be nice to be able to use Tesla middleware for these requests like the Retry middleware.

I'm happy to implement the fix; I just want guidance before I do.

fixed by #56 (poorly) :)

I would honestly be worried about automatically hiding those. Even if we retried, I think we should log the possible retries and it should be up to users of the library to address this.

Since Elixir v1.10 you can use Erlang's logger filters to filter error messages. If necessary, we can add more metadata to these errors to make it easier to identify them (although the :app key should be a good enough starting point). But I don't think we should discard them on behalf of the user.
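
A minimal sketch of such a filter, assuming the offending entries carry an identifying metadata key (:app, per the comment above); the filter id and the key being matched are illustrative, not something the library sets today:

# Run once at startup (e.g. in Application.start/2). Requires OTP 21+,
# which Elixir v1.10 already implies.
:logger.add_primary_filter(
  :drop_pubsub_502_errors,
  {fn
     # Drop error events carrying the assumed :app metadata key.
     %{level: :error, meta: %{app: :broadway_cloud_pub_sub}}, _arg -> :stop
     # Let every other event pass through to the remaining filters/handlers.
     _event, _arg -> :ignore
   end, :no_args}
)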

Quick question: does the pipeline also fail when we get those errors? Or does it just log?

Thanks!

Right now, the receive_messages request failing means we don't pull messages on that pipeline run.

I'm on board with logging retries, but I'm not sure how best to go about letting users of the library set/configure this.

Is there any direction you can point me in for how you'd like that retry configuration to look? I can just pass it through the opts in the client init function if that's sufficient.
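
For example, something like the following (purely hypothetical; the retry option and its keys do not exist in the library, this is only the shape that passing it through the producer opts could take):

defmodule MyBroadway do
  use Broadway

  def start_link(_opts) do
    Broadway.start_link(__MODULE__,
      name: __MODULE__,
      producer: [
        module:
          {BroadwayCloudPubSub.Producer,
           subscription: "projects/my-project/subscriptions/my-subscription",
           # Hypothetical options, shown only to illustrate the idea:
           retry: [max_attempts: 3, backoff_ms: 500]}
      ],
      processors: [default: []]
    )
  end

  @impl true
  def handle_message(_processor, message, _context), do: message
end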

I think it's suboptimal that we're losing a run to a spurious 502 from Google, so I would rather this library allow us to retry, and let users configure that retry, instead of logging an error and waiting for the next run to receive new messages.

I would prefer us to Logger.warn("Retrying :retrieve_messages request: retry count of #{retry_count}") and then retry the request several times before logging it as a Logger.error, considering this seems to be a common issue with Pub/Sub at the moment.

Right now, the receive_messages request failing means we don't pull messages on that pipeline run.

Do you mean the pipeline as a whole crashes? If not, what do you mean by "pipeline run"?

I would prefer us to Logger.warn("Retrying :retrieve_messages request: retry count of #{retry_count}")

I am still thinking about the details of how to implement backoff, but I definitely think it is not our job as a library to decide which number of failures is enough to be logged and which is not. We should always log, include the retry as metadata, and allow you to filter it.
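
In other words, something along these lines (the message text and the :retry_count metadata key are illustrative, not an existing API):

# Log every failed attempt, attach the attempt number as metadata, and
# leave the decision about what to silence to the host application.
Logger.error("Failed to receive messages from Cloud Pub/Sub",
  retry_count: retry_count,
  reason: inspect(reason)
)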

Long term, how do we want to solve this problem? It would be nice to be able to use Tesla middleware for these requests like the Retry middleware.

Agreed. Just to re-iterate, our default client BroadwayCloudPubSub.GoogleApiClient uses https://hex.pm/packages/google_api_pub_sub, which in turn indeed uses Tesla. I think exposing an option on BroadwayCloudPubSub.GoogleApiClient that would be passed down to Tesla seems like a good idea. Do you have an idea of what it would look like?
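
For reference, the kind of Tesla configuration being discussed would look roughly like this. These are standard Tesla.Middleware.Retry options; how (or whether) BroadwayCloudPubSub.GoogleApiClient should expose them is exactly the open question here:

middleware = [
  {Tesla.Middleware.Retry,
   delay: 500,
   max_retries: 5,
   max_delay: 4_000,
   should_retry: fn
     # Retry on transient server errors and transport failures.
     {:ok, %Tesla.Env{status: status}} when status in [429, 500, 502, 503] -> true
     {:error, _reason} -> true
     {:ok, _env} -> false
   end}
]

client = Tesla.client(middleware)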

In a similar vein, instead of solving this problem inside BroadwayCloudPubSub.GoogleApiClient, you could create your own client like this:

defmodule MyApp.BroadwayPubSubClient do
  @behaviour BroadwayCloudPubSub.Client

  defdelegate init(opts), to: BroadwayCloudPubSub.GoogleApiClient

  defdelegate acknowledge(ack_ids, opts), to: BroadwayCloudPubSub.GoogleApiClient

  defdelegate prepare_to_connect(name, producer_opts), to: BroadwayCloudPubSub.GoogleApiClient

  defdelegate put_deadline(ack_ids, ack_deadline, opts), to: BroadwayCloudPubSub.GoogleApiClient

  def receive_messages(demand, ack_builder, opts) do
    with_retries(..., fn ->
      BroadwayCloudPubSub.GoogleApiClient.receive_messages(demand, ack_builder, opts)
    end)
  end

  defp with_retries(...)
end

So instead of prescribing a particular handling strategy, I think we should make the library extensible enough that users can do this themselves.
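
For completeness, one possible shape of the elided with_retries helper above. It is purely illustrative: the backoff numbers are made up, it assumes a require Logger in the module, and it assumes the wrapped call surfaces failures as {:error, reason} tuples, which may not match how BroadwayCloudPubSub.GoogleApiClient actually reports errors:

# Illustrative retry helper with naive linear backoff. Adjust the failure
# match to whatever the real client returns.
defp with_retries(retry_opts, fun, attempt \\ 1) do
  max_attempts = Keyword.get(retry_opts, :max_attempts, 3)

  case fun.() do
    {:error, reason} when attempt < max_attempts ->
      Logger.warn("Retrying receive_messages",
        retry_count: attempt,
        reason: inspect(reason)
      )

      # Naive linear backoff; a real implementation would likely add
      # jitter and exponential growth.
      Process.sleep(attempt * 500)
      with_retries(retry_opts, fun, attempt + 1)

    other ->
      other
  end
end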

Nice! Yes, great suggestions. I agree we don't necessarily have to solve this problem for the users, but I do think it's important for us to consider how we can make this experience better.

By forcing extensibility here instead of improving the default behavior, we make the library more cumbersome to consume. I would have to carry a custom Pub/Sub client around in every separate Elixir repo instead of just using the default one with some configuration tweaks. Also, even if I have a custom Pub/Sub client, the library will still call Logger.error even when the request is wrapped in a retry. I think we can make some easy changes here and let everyone reap that benefit.

@josevalim it fails just that run. It doesn't crash the pipeline, but it does mean that if I have a 1-minute frequency for pulling messages, I'm 1 minute behind simply because Google or the transport had an issue fulfilling my request.

It's also hard for me to understand even where to begin on log filtering. It doesn't seem like anyone on the Elixir side of things has done log filtering or documented it, and the Erlang docs are very hard to follow, so I'm not sure that's the best solution here. I don't currently filter logs, and having to do so just because we don't have retry logic here feels like an anti-pattern. Logs emitted by libraries should be useful and contextually rich, and I'm showing that this log isn't always useful; instead it describes relatively frequent 502 behavior from GCP.

I've made the retry configurable and added retries to the other requests as well.

@josevalim Update: I have been using this in production for over a week now and it has effectively solved this issue. The 502s from Google are intermittent and unpredictable, and the retry logic makes this library robust enough that the only errors (and subsequent alerts) we receive are actionable. It has effectively silenced these errors and retried the requests when necessary.

That’s great @SophisticaSean! Given this works, can we explore an option that is based on Tesla configuration as mentioned by @wojtekmach? Thanks!

Another idea: I think we could leverage the retry support built into Tesla; it's just a matter of setting the appropriate middleware inside our library so that users won't have to do anything.

Hey guys, I created a PR adding the @wojtekmach suggestion to use the Tesla Retry middleware.
I was having the same problem described by @SophisticaSean.
I'm running my solution in a dev environment at the moment and it's looking fine.
Let me know if I need to change or add anything.

Closing in favour of #58