Without return_immediately: true, acknowledge messages get "stuck"

Question

mcrumm opened this issue 5 years ago · comments

The Problem

Using the default GoogleApiClient without return_immediately: true, there is a high likelihood that messages will not be acknowledged before they reach their deadline.

Background Info

If there are no messages in the topic queue, the subscription "pull request" will be held open for a "bounded amount of time", unless the user specifies return_immediately: true, at which point the server will respond immediately with an empty list† and the producer will have to wait the entire :receive_interval before beginning another pull request.
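For reference, the workaround named in the title could be written roughly as follows. This is only a sketch: the subscription path is a made-up placeholder, and how the resulting tuple is handed to Broadway.start_link depends on the Broadway version in use.

```elixir
# Hypothetical subscription path, for illustration only.
producer_opts = [
  subscription: "projects/my-project/subscriptions/my-subscription",
  # Ask the server to answer at once, even when no messages are available.
  return_immediately: true,
  # Milliseconds the producer waits after an empty response before pulling again.
  receive_interval: 1_000
]

# Building the tuple does not require the producer module to be loaded.
producer = {BroadwayCloudPubSub.Producer, producer_opts}
IO.inspect(producer)
```

The trade-off is the one described above: every empty response costs a full :receive_interval before the next pull.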

Currently, an acknowledge request can get blocked by an in-flight pull request, at which point the messages being acknowledged are almost certainly going to be returned to the queue, as they will have exceeded their deadline by the time the acknowledge request is actually sent.
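To make the timing concrete, here is a toy model of that blocking. It assumes a single connection that serves requests strictly one at a time; SerialConnection is a hypothetical stand-in written for this illustration, not how httpc is actually implemented, and the 300 ms hold and 100 ms deadline are arbitrary numbers.

```elixir
defmodule SerialConnection do
  # Toy stand-in for a connection that handles one request at a time.
  use GenServer

  def start_link, do: GenServer.start_link(__MODULE__, nil)

  # Blocks the caller until every earlier request has been served.
  def request(conn, hold_ms), do: GenServer.call(conn, {:request, hold_ms}, :infinity)

  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_call({:request, hold_ms}, _from, state) do
    # Simulates the server holding the request open.
    Process.sleep(hold_ms)
    {:reply, :ok, state}
  end
end

{:ok, conn} = SerialConnection.start_link()
started = System.monotonic_time(:millisecond)

# A pull against an empty queue, held open for 300 ms.
pull = Task.async(fn -> SerialConnection.request(conn, 300) end)
Process.sleep(50)

# The ack is issued 50 ms later but has to wait behind the pull.
:ok = SerialConnection.request(conn, 0)
ack_sent_after = System.monotonic_time(:millisecond) - started
Task.await(pull)

ack_deadline_ms = 100
IO.puts("ack sent after #{ack_sent_after} ms, deadline missed: #{ack_sent_after > ack_deadline_ms}")
```

In this model the ack cannot leave until the pull returns, so any ack deadline shorter than the hold time is missed.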

As laid out in googleapis/elixir-google-api#98, all GoogleApi connections currently default to using Tesla with the httpc adapter. As of the latest version of Gax (0.1.3), this behavior is still not directly configurable.

Possible Solutions

1) Recommend global adapter configuration

We could recommend that users install and configure a global adapter for Tesla. This would resolve the issues with httpc, but it still won't allow for per-request adapter configuration, which we really need so we can hold open the pull request until a message arrives or the server disconnects us.

2) Wait for vendor adoption

We could wait for Gax to support Tesla 1.2, at which point we should be able to provide runtime adapter configuration. If/when that happens, the next question would be how much of this adapter-related configuration should be exposed to end users?

3) Ship with a working adapter

Ideally, as an end user, I can ignore all of the adapter complexity and just focus on handling messages. One way for that to be true would be to add the following dependencies:

# BroadwayCloudPubSub.MixProject
def deps() do
  [
    {:tesla, "~> 1.2.0"},
    {:hackney, "~> 1.6"}
  ]
end

and then inject the adapter into the Tesla.Client after it's been created:

defmodule BroadwayCloudPubSub.GoogleApiClient.Connection do
  alias GoogleApi.PubSub.V1.Connection, as: V1Connection

  # Builds the generated Pub/Sub connection, then swaps in the hackney adapter.
  def new(token, adapter_opts \\ []) do
    token
    |> V1Connection.new()
    |> override_adapter(adapter_opts)
  end

  defp override_adapter(client, opts) do
    # Let Tesla build the adapter term, then copy it onto the existing client.
    %{adapter: adapter} = Tesla.client([], {Tesla.Adapter.Hackney, opts})
    %{client | adapter: adapter}
  end
end

so that we can set adapter options at runtime:

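defmodule BroadwayCloudPubSub.GoogleApiClient do
  alias BroadwayCloudPubSub.GoogleApiClient.Connection

  # Sketch only: the remaining arguments to the API calls are elided.
  def receive_messages(_demand, opts) do
    opts
    |> token!()
    # Hold the pull request open until a message arrives.
    |> Connection.new(recv_timeout: :infinity)
    |> pubsub_projects_subscriptions_pull()
  end

  def ack({_, ref} = _ack_ref, _successful, _failed) do
    opts = Broadway.TermStorage.get!(ref)

    opts
    |> token!()
    # Acknowledge on a connection with the default timeouts.
    |> Connection.new()
    |> pubsub_projects_subscriptions_acknowledge()
  end
end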

Steps to Reproduce:

1. Follow the Google Cloud Pub/Sub how-to guide "Testing apps locally with the emulator" to run the Pub/Sub emulator and create a topic and subscription.
2. Configure a Broadway pipeline using BroadwayCloudPubSub.Producer. Be sure to give your pipeline some work to do: the bug is harder to reproduce if the processors immediately return a successfully processed message.
3. Override the :base_url for Pub/Sub in the app configuration:

   config :google_api_pub_sub, base_url: "http://localhost:8085"

4. Start the pipeline, and ensure it is connecting to the Pub/Sub emulator instance running locally.
5. Publish messages to the topic. Using the sample lib from the how-to guide, you can run the following command:
```
python publisher.py <PROJECT_ID> publish <TOPIC_NAME>
```
You should see the same messages being processed multiple times, possibly to the point where the queue is never drained.

† When the queue is empty, the PullResponse struct actually returns nil for receivedMessages, which breaks the typespec, but I digress.

Thoughts? @wojtekmach @msaraiva @josevalim

José Valim · Answer 1 · Tue Aug 06 2019 15:30:52 GMT+0800 (China Standard Time)

I like option 3 because I think we should pull such concerns up into the library and shield users from them. Otherwise we have broadway_gcpb that uses pubsub that uses tesla, and at that point the user is many layers deep in figuring out how to make everything work. By keeping it internal as well, we can change it back to option 2 once it is out.

Also, I believe Amazon SQS also defaults to hackney, so maybe we can discuss a unified configuration/API for passing options down to the HTTP client. Another thing we may want to do is always configure the HTTP client pool to have at least number of producers + number of processors connections. I have seen some people set up thousands of processors, though, so we may want a cap on that. Maybe number of producers + 2 * number of cores?
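As a rough illustration of that sizing rule, the helper below takes the smaller of "producers + processors" and the proposed cap. PoolSize is a hypothetical module, not part of Broadway or hackney, and it assumes System.schedulers_online() is an acceptable stand-in for the number of cores.

```elixir
defmodule PoolSize do
  # Hypothetical helper, for illustration only.
  def suggest(producers, processors, cores \\ System.schedulers_online()) do
    # One connection per stage that may talk to the HTTP client.
    wanted = producers + processors
    # Upper bound so that thousands of processors do not inflate the pool.
    cap = producers + 2 * cores
    min(wanted, cap)
  end
end

IO.inspect(PoolSize.suggest(1, 10, 8))    # 11, below the cap of 17
IO.inspect(PoolSize.suggest(1, 2000, 8))  # 17, limited by the cap
```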

Michael Crumm · Answer 2 · Tue Aug 06 2019 23:49:52 GMT+0800 (China Standard Time)

@josevalim are you thinking we should call hackney_pool:start_pool for the pipeline? Is there a good place for this to live? Currently the only place I can think to put it would be in the producer's start_link, but wouldn't that require a separate pool per producer stage?

José Valim · Answer 3 · Wed Aug 07 2019 01:30:56 GMT+0800 (China Standard Time)

Maybe start_pool is idempotent? Then we can always call it from each producer. We could also add a new Broadway callback, if necessary, that is invoked once for all producers... but I guess that would still need idempotency because of restarts and whatnot.

Wojtek Mach · Answer 4 · Wed Aug 07 2019 01:43:37 GMT+0800 (China Standard Time)

@josevalim good call, at least on the outside it looks like it's idempotent indeed:

iex(11)> :hackney_pool.start_pool(:my_pool, [])
:ok
iex(12)> :hackney_pool.start_pool(:my_pool, [])
:ok
iex(13)> :hackney_pool.start_pool(:my_pool, [])
:ok
iex(14)> :hackney_pool.start_pool(:my_pool, [])
:ok
iex(15)> {:ok, 200, _, ref} = :hackney.request(:get, "https://httpbin.org/json", [], "", pool: :my_pool); :hackney.body(ref)
{:ok,
 "{\n  \"slideshow\": {\n    \"author\": \"Yours Truly\", \n    \"date\": \"date of publication\", \n    \"slides\": [\n      {\n        \"title\": \"Wake up to WonderWidgets!\", \n        \"type\": \"all\"\n      }, \n      {\n        \"items\": [\n          \"Why <em>WonderWidgets</em> are great\", \n          \"Who <em>buys</em> WonderWidgets\"\n        ], \n        \"title\": \"Overview\", \n        \"type\": \"all\"\n      }\n    ], \n    \"title\": \"Sample Slide Show\"\n  }\n}\n"}
iex(16)> {:ok, 200, _, ref} = :hackney.request(:get, "https://httpbin.org/json", [], "", pool: :my_pool); :hackney.body(ref)
{:ok,
 "{\n  \"slideshow\": {\n    \"author\": \"Yours Truly\", \n    \"date\": \"date of publication\", \n    \"slides\": [\n      {\n        \"title\": \"Wake up to WonderWidgets!\", \n        \"type\": \"all\"\n      }, \n      {\n        \"items\": [\n          \"Why <em>WonderWidgets</em> are great\", \n          \"Who <em>buys</em> WonderWidgets\"\n        ], \n        \"title\": \"Overview\", \n        \"type\": \"all\"\n      }\n    ], \n    \"title\": \"Sample Slide Show\"\n  }\n}\n"}

There's also a way to use a custom pool.
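Building on that, a shared named pool started from every producer might look something like the sketch below. The PubSubPool module and the pool name are hypothetical, the Mix.install line is only there to make the snippet runnable as a script, and it relies on the repeated start_pool calls returning :ok as in the session above.

```elixir
# Fetches hackney so the script can run on its own (needs network access).
Mix.install([{:hackney, "~> 1.6"}])

defmodule PubSubPool do
  # Hypothetical pool name shared by all producers of a pipeline.
  @pool :broadway_cloud_pub_sub_pool

  # Safe to call from each producer, assuming start_pool is idempotent.
  def ensure_started(max_connections) do
    :hackney_pool.start_pool(@pool, max_connections: max_connections)
  end

  # Options to pass along with each request so it uses the shared pool.
  def request_opts, do: [pool: @pool]
end

:ok = PubSubPool.ensure_started(4)
# A second producer starting later makes the same call.
:ok = PubSubPool.ensure_started(4)
IO.inspect(PubSubPool.request_opts())
```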

Michael Crumm · Answer 5 · Wed Aug 07 2019 01:54:41 GMT+0800 (China Standard Time)

@wojtekmach sweet, thanks for checking on that!

I'll be honest, I felt a little out of my depth looking at the hackney_disp example for a custom pool handler, but if a custom handler feels like the way to go, I can start diving deeper into that.

José Valim · Answer 6 · Wed Aug 07 2019 01:56:13 GMT+0800 (China Standard Time)

I think a regular pool would be just fine! :D

Wojtek Mach · Answer 7 · Wed Aug 07 2019 01:58:11 GMT+0800 (China Standard Time)

I was just mentioning the custom pool option for completeness, agreed about using the regular one!
