dashbitco / broadway_cloud_pub_sub

A Broadway producer for Google Cloud Pub/Sub

Solution for GCP PubSub routine 502's

SophisticaSean opened this issue

GCP returns a 502 pretty routinely, several times a day, when fetching messages:

<html lang=en>
  <meta charset=utf-8>
  <meta name=viewport content="initial-scale=1, minimum-scale=1, width=device-width">
  <title>Error 502 (Server Error)!!1</title>
  <a href=//www.google.com/><span id=logo aria-label=Google></span></a>
  <p><b>502.</b> <ins>That’s an error.</ins>
  <p>The server encountered a temporary error and could not complete your request.<p>Please try again in 30 seconds.  <ins>That’s all we know.</ins>

The issue is that the library always logs it with Logger.error, which can lead to false alerts around this problem.

I've manually forked the library and I'm going to try wrapping that block in a retry from the retry library. I've also submitted a ticket to GCP about the issue.

Long term, how do we want to solve this problem? It would be nice to be able to use Tesla middleware for these requests like the Retry middleware.

I'm happy to implement the fix; I just want guidance before I do.

fixed by #56 (poorly) :)

I would honestly be worried about automatically hiding those. Even if we retried, I think we should log the possible retries and it should be up to users of the library to address this.

Since Elixir v1.10 you can use Erlang's logger filters to filter error messages. If necessary, we can add more metadata to these errors to make it easier to identify them (although the :app key should be a good enough starting point). But I don't think we should discard them on behalf of the user.
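
A minimal sketch of such a filter, assuming the offending entries carry an identifying metadata key (:app, per the comment above); the filter id and the key being matched are illustrative, not something the library sets today:

# Run once at startup (e.g. in Application.start/2). Requires OTP 21+,
# which Elixir v1.10 already implies.
:logger.add_primary_filter(
  :drop_pubsub_502_errors,
  {fn
     # Drop error events carrying the assumed :app metadata key.
     %{level: :error, meta: %{app: :broadway_cloud_pub_sub}}, _arg -> :stop
     # Let every other event pass through to the remaining filters/handlers.
     _event, _arg -> :ignore
   end, :no_args}
)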

Quick question: does the pipeline also fail when we get those errors? Or does it just log?

Thanks!

Right now, the receive_messages request failing means we don't pull messages on that pipeline run.

I'm on board with logging retries, but I'm not sure how best to go about letting users of the library set/configure this.

Is there any direction you can point me in for how you'd like that retry configuration to look? I can just pass it through the opts in the client init function if that's sufficient.
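
For example, something like the following (purely hypothetical; the retry option and its keys do not exist in the library, this is only the shape that passing it through the producer opts could take):

defmodule MyBroadway do
  use Broadway

  def start_link(_opts) do
    Broadway.start_link(__MODULE__,
      name: __MODULE__,
      producer: [
        module:
          {BroadwayCloudPubSub.Producer,
           subscription: "projects/my-project/subscriptions/my-subscription",
           # Hypothetical options, shown only to illustrate the idea:
           retry: [max_attempts: 3, backoff_ms: 500]}
      ],
      processors: [default: []]
    )
  end

  @impl true
  def handle_message(_processor, message, _context), do: message
end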

I think it's suboptimal that we're losing a run to a spurious 502 from Google, so I would rather this library allow us to retry, and let users configure that retry, instead of logging an error and waiting for the next run to receive new messages.

I would prefer us to Logger.warn("Retrying :retrieve_messages request: retry count of #{retry_count}") and then retry the request several times before logging it as a Logger.error, considering this seems to be a common issue with Pub/Sub at the moment.

Right now, the receive_messages request failing means we don't pull messages on that pipeline run.

Do you mean the pipeline as a whole crashes? If not, what do you mean by "pipeline run"?

I would prefer us to Logger.warn("Retrying :retrieve_messages request: retry count of #{retry_count}")

I am still thinking about the details of how to implement backoff, but I definitely think it is not our job as a library to decide which number of failures is enough to be logged and which is not. We should always log, include the retry as metadata, and allow you to filter it.
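
In other words, something along these lines (the message text and the :retry_count metadata key are illustrative, not an existing API):

# Log every failed attempt, attach the attempt number as metadata, and
# leave the decision about what to silence to the host application.
Logger.error("Failed to receive messages from Cloud Pub/Sub",
  retry_count: retry_count,
  reason: inspect(reason)
)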

Long term, how do we want to solve this problem? It would be nice to be able to use Tesla middleware for these requests like the Retry middleware.

Agreed. Just to re-iterate, our default client BroadwayCloudPubSub.GoogleApiClient uses https://hex.pm/packages/google_api_pub_sub, which in turn indeed uses Tesla. I think exposing an option on BroadwayCloudPubSub.GoogleApiClient that would be passed down to Tesla seems like a good idea. Do you have an idea of what it would look like?
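
For reference, the kind of Tesla configuration being discussed would look roughly like this. These are standard Tesla.Middleware.Retry options; how (or whether) BroadwayCloudPubSub.GoogleApiClient should expose them is exactly the open question here:

middleware = [
  {Tesla.Middleware.Retry,
   delay: 500,
   max_retries: 5,
   max_delay: 4_000,
   should_retry: fn
     # Retry on transient server errors and transport failures.
     {:ok, %Tesla.Env{status: status}} when status in [429, 500, 502, 503] -> true
     {:error, _reason} -> true
     {:ok, _env} -> false
   end}
]

client = Tesla.client(middleware)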

In a similar vein, instead of solving this problem inside BroadwayCloudPubSub.GoogleApiClient, you could create your own client like this:

defmodule MyApp.BroadwayPubSubClient do
  @behaviour BroadwayCloudPubSub.Client

  defdelegate init(opts), to: BroadwayCloudPubSub.GoogleApiClient

  defdelegate acknowledge(ack_ids, opts), to: BroadwayCloudPubSub.GoogleApiClient

  defdelegate prepare_to_connect(name, producer_opts), to: BroadwayCloudPubSub.GoogleApiClient

  defdelegate put_deadline(ack_ids, ack_deadline, opts), to: BroadwayCloudPubSub.GoogleApiClient

  def receive_messages(demand, ack_builder, opts) do
    with_retries(..., fn ->
      BroadwayCloudPubSub.GoogleApiClient.receive_messages(demand, ack_builder, opts)
    end)
  end

  defp with_retries(...)
end

So instead of prescribing a particular handling strategy, I think we should make the library extensible enough that users can do this themselves.
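
For completeness, one possible shape of the elided with_retries helper above. It is purely illustrative: the backoff numbers are made up, it assumes a require Logger in the module, and it assumes the wrapped call surfaces failures as {:error, reason} tuples, which may not match how BroadwayCloudPubSub.GoogleApiClient actually reports errors:

# Illustrative retry helper with naive linear backoff. Adjust the failure
# match to whatever the real client returns.
defp with_retries(retry_opts, fun, attempt \\ 1) do
  max_attempts = Keyword.get(retry_opts, :max_attempts, 3)

  case fun.() do
    {:error, reason} when attempt < max_attempts ->
      Logger.warn("Retrying receive_messages",
        retry_count: attempt,
        reason: inspect(reason)
      )

      # Naive linear backoff; a real implementation would likely add
      # jitter and exponential growth.
      Process.sleep(attempt * 500)
      with_retries(retry_opts, fun, attempt + 1)

    other ->
      other
  end
end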

Nice! Yes, great suggestions. I agree we don't necessarily have to solve this problem for the users, but I do think it's important for us to consider how we can make this experience better.

By forcing extensibility here instead of improving the default behavior, we make the library more cumbersome to consume. I would have to carry a custom Pub/Sub client around in every separate Elixir repo instead of just using the default one with some configuration tweaks. Also, even if I have a custom Pub/Sub client, the library will still call Logger.error even when the request is wrapped in a retry. I think we can make some easy changes here and let everyone reap that benefit.

@josevalim it fails just that run. It doesn't crash the pipeline, but it does mean that if I have a 1-minute frequency for pulling messages, I'm 1 minute behind simply because Google or the transport had an issue fulfilling my request.

It's also hard for me to understand even where to begin on log filtering. It doesn't seem like anyone on the Elixir side of things has done log filtering or documented it, and the Erlang docs are very hard to follow, so I'm not sure that's the best solution here. I don't currently filter logs, and having to do so just because we don't have retry logic here feels like an anti-pattern. Logs emitted by libraries should be useful and contextually rich, and I'm showing that this log isn't always useful; instead it describes relatively frequent 502 behavior from GCP.

I've made the retry configurable and added retries to the other requests as well.

@josevalim Update: I have been using this in production for over a week now and it has effectively solved this issue. The 502s from Google are intermittent and unpredictable, and the retry logic makes this library robust enough that the only errors (and subsequent alerts) we receive are actionable. It has effectively silenced these errors and retried the requests when necessary.

That’s great @SophisticaSean! Given this works, can we explore an option that is based on Tesla configuration as mentioned by @wojtekmach? Thanks!

Another idea: I think we could leverage the retry support built into Tesla; it's just a matter of setting the appropriate middleware inside our library so that users won't have to do anything.

Hey guys, I created a PR adding the @wojtekmach suggestion to use the Tesla Retry middleware.
I was having the same problem described by @SophisticaSean.
I'm running my solution in a dev environment at the moment and it's looking fine.
Let me know if I need to change or add anything.

Closing in favour of #58