holsee / chroxy

Headless Chrome as a Service

Home Page: https://hex.pm/packages/chroxy

Intermittent 500's

adamgotterer opened this issue · comments

I'm seeing random and intermittent failures when trying to interface with Chroxy. I'm testing in an isolated environment with plenty of resources and currently have 8 available port ranges for chrome.

When running one-off requests I will get a handful of 500 errors and then a handful of successful responses.

The log of the failed response is:


21:29:36.171 pid=<0.791.0> module=Plug.Logger [info]  GET /api/v1/connection

21:29:36.171 pid=<0.216.0> module=Chroxy.ChromeManager [info]  Selected chrome server: {:undefined, #PID<0.326.0>, :worker, [Chroxy.ChromeServer]}

21:29:37.175 pid=<0.792.0> module=Chroxy.ChromeProxy [debug] Obtained new page from ChromeServer [#PID<0.326.0>]

21:29:37.175 request_id=2ksqq0mmjsj7v5fgks000181 pid=<0.791.0> module=Plug.Logger [info]  Sent 200 in 1003ms

21:29:37.178 pid=<0.214.0> module=Chroxy.ProxyListener [info]  Connection accepted, spawning proxy server to manage connection

21:29:37.180 pid=<0.793.0> module=Chroxy.ProxyServer [debug] Downstream connection established

21:29:37.181 pid=<0.793.0> module=Chroxy.ProxyServer [debug] Upstream socket closed, terminating proxy

21:29:37.181 pid=<0.654.0> module=Chroxy.ChromeProxy [info]  Proxy connection down - closing page

21:29:37.181 pid=<0.216.0> module=Chroxy.ChromeManager [info]  ChromeManager linked process #PID<0.654.0> exited with reason: :normal

I've been able to reliably reproduce the error. I believe this happens when the proxy server doesn't successfully complete a request because of an error, like the upstream request dying. I reproduced this by forcing my upstream code to exit prematurely before receiving a response. Subsequent requests began to intermittently fail from there. It's almost as if one of the available chrome instances is in a stuck and unavailable state.

@adamgotterer thanks for the report 👍 appreciated!

I think we need to be sure that the Browser is indeed in a bad state and is no longer responsive. For that we will need either the error (+ stack trace) which is causing the 500, or the steps to recreate the issue with chrome verbose logging enabled (mentioned a bit further down).

Client dropped the Connection
The client code has dropped the connection: 21:29:37.181 pid=<0.793.0> module=Chroxy.ProxyServer [debug] Upstream socket closed, terminating proxy
Any further communication over that web-socket connection will not work, as the ProxyServer will ensure both (up/down) stream connections are closed when a socket is closed on either side. As you mentioned, we can see the client has disconnected.
See: https://github.com/holsee/chroxy/blob/master/lib/chroxy/proxy_server.ex#L150-L182

Automatically Cleans up Chrome Tab Resource on DC
There is a callback hook, registered by ChromeProxy, which is invoked when the ProxyServer is going down and which closes the Chrome Tab:
21:29:37.181 pid=<0.654.0> module=Chroxy.ChromeProxy [info] Proxy connection down - closing page
See: https://github.com/holsee/chroxy/blob/master/lib/chroxy/chrome_proxy.ex#L112-L120

But that doesn't explain Chrome going into a "Bad State" - nor the 500s on new connection requests
Chrome is temperamental, unfortunately, especially when running in Docker.
But we are not out of options, firstly:

Let's see what Chrome is Doing
Since you can recreate the issue, can I ask that you do so with Chrome verbose logging enabled, please?
https://github.com/holsee/chroxy/blob/master/config/config.exs#L50 <- set this value to 1.
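For reference, the change should look something like the snippet below in config/config.exs. I'm assuming the value at the linked line is a `verbose_logging` option under `Chroxy.ChromeServer`, so treat the linked file as the source of truth for the exact key name:

```elixir
# config/config.exs (sketch - check the linked line for the real key name)
config :chroxy, Chroxy.ChromeServer,
  verbose_logging: 1
```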

If it is in a bad state, and Chrome is emitting a Log Message that we can match on - Easy Win
If that Chrome Process in the pool has gone into a bad state, the important thing to do is kill the Chrome Process - a new one will be initialised automatically in its place, which will be able to accept and process new connections. This is something Chroxy handles automatically when it can see that the Chrome Process has gone into a "Critical State".

Let's hope 🤞 Chrome is emitting a message, and it will just be a case of adding a condition to match on it and restart Chrome.
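Purely as an illustration of how small that fix would be once we know the message - the marker strings and the restart hook below are assumptions, not something Chroxy matches on today:

```elixir
# Illustrative sketch: match a known "critical" marker in Chrome's verbose
# output and trigger a restart of that browser via the supplied function.
defmodule ChromeLogMatch do
  @critical_markers ["Failed to connect to the bus", "Lost UI shared context"]

  def check(log_line, restart_fun) when is_function(restart_fun, 0) do
    if Enum.any?(@critical_markers, &String.contains?(log_line, &1)) do
      restart_fun.()
      :restarted
    else
      :ok
    end
  end
end
```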

If not -> A new Chrome OS Process Healthcheck?
There was an idea floated whereby an internally scheduled check would take place against all Chrome Processes in the Pool. This would create a new connection to the Browser, ensure a basic operation completed ok, then close the connection (and the resource) - if this failed, the entire browser would be terminated and restarted, putting the pool back into a good state and working around the problem of the bad chrome processes.
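A rough sketch of that scheduled check as a plain GenServer - the module name, interval and the pool/probe helpers are placeholders rather than real Chroxy internals:

```elixir
# Sketch: every @interval, probe each browser in the pool and kill any that
# fail, relying on supervision to start a replacement in its place.
defmodule PoolHealthcheck do
  use GenServer

  @interval :timer.seconds(30)

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts) do
    schedule()
    {:ok, %{}}
  end

  @impl true
  def handle_info(:check, state) do
    Enum.each(chrome_servers(), fn pid ->
      case probe(pid) do
        :ok -> :ok
        {:error, _reason} -> Process.exit(pid, :kill)
      end
    end)

    schedule()
    {:noreply, state}
  end

  defp schedule, do: Process.send_after(self(), :check, @interval)

  # Placeholders: enumerate the pool and run an "open page, close page" probe.
  defp chrome_servers, do: []
  defp probe(_pid), do: :ok
end
```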

Ultimately - we have ideas, and we will be able to address the problems when we have more information from the verbose chrome logging and/or steps to recreate locally, @adamgotterer 🙏. The truth is, Chroxy can scale and allow many concurrent interactions, but the underlying Chrome process stability does mean that client-side code will need to "retry" in the event of failures.
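For client authors, a minimal retry wrapper could look something like this - the helper is purely illustrative and not part of Chroxy:

```elixir
# Sketch: run `fun` (which should request a *fresh* connection from Chroxy
# and do its work), backing off and retrying on failure instead of reusing
# the dead connection.
defmodule RetryClient do
  @max_attempts 3

  def with_retry(fun, attempt \\ 1) do
    case fun.() do
      {:ok, result} ->
        {:ok, result}

      {:error, _reason} when attempt < @max_attempts ->
        Process.sleep(250 * attempt)
        with_retry(fun, attempt + 1)

      {:error, reason} ->
        {:error, reason}
    end
  end
end
```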

I was trying to set up travis testing and ran into this bug - https://travis-ci.org/j-manu/chroxy/jobs/394488723

I have created a branch with verbose logging enabled and I was able to recreate this bug here - https://travis-ci.org/j-manu/chroxy/jobs/394497915. Also encountered a timeout error here - https://travis-ci.org/j-manu/chroxy/jobs/394495151

Instead of having a chrome process healthcheck, we can have an API endpoint which accepts a ws_url and recycles the chrome instance associated with that.
Since the client side needs to retry, it can make this API call when it encounters the error.
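Roughly the shape of that endpoint, sketched with Plug.Router - the route path, the ws_url query param and the recycle_chrome_for/1 helper are made up for illustration, not an existing Chroxy API:

```elixir
# Sketch of a "recycle this browser" hint endpoint.
defmodule RecycleRouter do
  use Plug.Router

  plug :match
  plug :dispatch

  # e.g. POST /api/v1/recycle?ws_url=ws%3A%2F%2Fproxy%3A1331%2Fdevtools%2Fpage%2FABC
  post "/api/v1/recycle" do
    conn = Plug.Conn.fetch_query_params(conn)
    ws_url = conn.query_params["ws_url"]
    :ok = recycle_chrome_for(ws_url)
    send_resp(conn, 202, "recycling")
  end

  match _ do
    send_resp(conn, 404, "not found")
  end

  # Placeholder: look up the ChromeServer that owns this ws_url and restart it.
  defp recycle_chrome_for(_ws_url), do: :ok
end
```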

Not conclusive - but if I start only one chrome process, I've been unable to trigger this bug. Otherwise, running mix test triggers this bug pretty often.

@j-manu yes, that could work. I would say though that a background process which continuously tests the chrome browsers' health would do the same without the end user needing to worry about such things.

I am open to your suggestion of an endpoint which would act as a "hint" that a ChromeServer is unresponsive. It opens up the issue of needing to maintain the websocket urls as an identifier (simple enough), but also that there may be many working connections running against that browser which would be impacted every time an overzealous client script got an error response.

I didn't run into this kind of issue unless I was running in Docker, and trying to address that cost me a few weeks of the time I had to work on this full-time 🤕 - but I am still very interested in getting chrome to run as stably as possible in such an environment, as it is what we target.

There could be a mode of operation implemented whereby browsers are spawned per connection (and terminated after) - maybe use something like poolboy to manage the access / number of browsers which can be spawned (via pool size). But this is an extremely resource-intensive mode when running 1000s of concurrent connections to Browsers versus Tabs, and it also exposes the full browser API (which might not be a bad thing at all tbh, as far as features go).

When running 100s of concurrent tests against the same browser, you are trusting that the chrome browser process is able to manage that number of web-socket connections and schedule all the commands across each of the pages reliably, and I have little doubt that this is not the most stable code in the world within Chrome - if it were, Chroxy would not need to exist 🤷‍♂️

Yeah, a background process works too. Simpler for the client.

Thinking aloud, I'm wondering if, instead of a background process, chroxy could do a sanity check of the websocket connection before returning it to the API client. This will of course add a delay and hence could be optional.
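Something in that direction could be sketched against the Chrome instance's DevTools HTTP endpoint rather than a full websocket round trip - host, port and the 1s timeout below are illustrative, and Chroxy would only return the ws:// url if the probe succeeds:

```elixir
# Sketch: ask the Chrome instance's DevTools HTTP endpoint whether it is
# still responding before handing out a connection that points at it.
defmodule ChromeSanityCheck do
  def responsive?(host, port) do
    :inets.start()
    url = String.to_charlist("http://#{host}:#{port}/json/version")

    case :httpc.request(:get, {url, []}, [timeout: 1_000], []) do
      {:ok, {{_http, 200, _reason}, _headers, _body}} -> true
      _ -> false
    end
  end
end
```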

Btw I'm able to reproduce this problem pretty reliably outside of docker (on a mac) by running mix test. So it is not docker-specific.

I also haven't been able to reproduce this issue with a single running chrome instance.

@adamgotterer @j-manu noted - a single instance with many tabs may be the workaround for stability, with multiple instances behind a load balancer. Can I ask if you are running in Docker? / Have you seen the 500 errors outside of it?

I'm running in Docker. @j-manu said he also can reproduce the problem outside of Docker.

I'm concerned that running a single Chrome instance will end up having the same scaling characteristics as running Chrome without Chroxy at all. In my testing with remote debugging, interfacing with Chrome directly, connections start to drop after a few contexts are created. Not sure of the exact number, but my non-scientific guesstimate puts it somewhere around 5-10.

@holsee If the issue only happens with multiple chrome instances, then it is probably a problem in the chroxy proxy server and not chrome?

I tried setting a different "user-data-dir" for each of the chrome instances but that didn't help. Other than that, why would running multiple chrome instances cause a problem?

@adamgotterer a concern I share. I am very busy with another project delivery at the minute, but I think the unit of work to pick up next will come out of this thread.

  1. Adding the support for running ChromeServer on multiple nodes, (we could create a nice docker template for spinning up the web server node which will house the API endpoint and the ProxyServers for channeling the traffic, and the worker nodes which will run instances of chrome - that will be selected based on a strategy from the pool of available registered ChromeServer processes). I think using elixir clustering would be a great fit for this.
  2. Better chrome OS process restarting when it is no longer responsive.

From these units of work, I think we will address the issue, or at least minimise the impact when chrome gets unstable.

Another work item I would like to add is support for Firefox - but that is another thread that has yet to be started.

@j-manu Ah HAH! I think you might have hit the nail on the head here buddy!

I have a theory... (only a theory at this point 😬) but hear me out:

Hypothesis:

  • If you have a single chrome instance, all pages which are created belong to that same instance, therefore the incoming connections will be routed to the correct Chrome OS process for which there is a Page (Tab) waiting for the connection. Note that the web-socket url returned does have a Page ID which is specific to the underlying chrome instance, and if you are routed to the wrong chrome process you will get a 500 error

  • When there are multiple chrome instances running, the incoming connection might be routed to the wrong chrome process, as the order of connection to the url returned matters! -- We should totally fix this.

If you trace the code from the /connection HTTP handler you will see that the connection is 2-phase, and there might be a RACE CONDITION as outlined above if the client code does not connect in the order in which it requested GET /connection:

  • The GET /connection request is made
  • A chrome process is selected from the pool (at random)
  • A page is created on that chrome process
  • A ChromeProxy is initialised which holds the selected chrome instance, and instructs the TCP listener to await a connection.
  • The ws:// url (which contains a PAGE ID for a specific chrome os process) is returned to the client, and the client will connect to that ws:// url (which will go to the Proxy)

RACE

  • The TCP listener blocks and awaits a connection, but the first connection to come in is from the "second" client to request a connection, which has the PAGE ID for the "second" chrome instance.
  • So the connecting client will be routed to the "first" ChromeProxy which holds a reference to the first chrome os process, and a 500 error will occur.

I think this is the problem.

So, to clarify the theory behind this scenario:

Client 1 - GET /connection
-> Chroxy - "Chrome 1" selected
-> Chroxy - Page created in "Chrome 1" which returns "ws://localhost:$CHROME_1_PORT/devtools/page/$CHROME_1_PAGE_ID_FOO"
-> Chroxy substitutes the proxy address, returning "ws://PROXY_HOST:PROXY_PORT/devtools/page/$CHROME_1_PAGE_ID_FOO"
-> Chroxy instructs the internal TCP Listener to accept a connection on "ws://PROXY_HOST:PROXY_PORT/"

Client 2 - GET /connection
-> Chroxy - "Chrome 2" selected
-> Chroxy - Page created in "Chrome 2" which returns "ws://localhost:$CHROME_2_PORT/devtools/page/$CHROME_2_PAGE_ID_BAR"
-> Chroxy substitutes the proxy address, returning "ws://PROXY_HOST:PROXY_PORT/devtools/page/$CHROME_2_PAGE_ID_BAR"
-> Chroxy instructs the internal TCP Listener to accept a connection on "ws://PROXY_HOST:PROXY_PORT/"

Client 2 connects first, to "ws://PROXY_HOST:PROXY_PORT/devtools/page/$CHROME_2_PAGE_ID_BAR"
-> Chroxy accepts the connection and routes it to "Chrome 1"
-> 500 ERROR 💥

As you can see, there would be no such issue with a single chrome instance.

This means we need to route each incoming connection to the correct ChromeProxy process based on the PAGE_ID 🤞

The connection is accepted here:
https://github.com/holsee/chroxy/blob/master/lib/chroxy/proxy_listener.ex#L87-L97

The proxy_opts are what contain the reference to the Chrome instance, and as such could point to the WRONG Chrome for the incoming client connection.
https://github.com/holsee/chroxy/blob/master/lib/chroxy/chrome_proxy.ex#L98 is where the reference to the ChromeProxy (which itself holds the reference to the selected ChromeServer and chrome os process) is passed to the listener.

The FIX (needs more thought but):

  • On an incoming request, route to the correct ChromeProxy based on an internal registry
  • Hold an internal mapping between PAGE_IDs and the ChromeProxy processes (an ETS table is probably perfect for this!)

And might I add that this would work well for cleaning up Pages in Chrome for which a connection is never made, which is another feature on the backlog 👍
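A minimal sketch of what that PAGE_ID -> ChromeProxy mapping could look like with ETS - module and function names here are illustrative, not the final implementation:

```elixir
# Sketch: a public ETS table mapping page ids to the ChromeProxy pid that
# owns them, so the listener can route an accepted connection correctly.
defmodule PageRegistry do
  @table :chroxy_page_registry

  def init do
    :ets.new(@table, [:named_table, :set, :public, read_concurrency: true])
  end

  # Called when a page is created and its ChromeProxy is known.
  def register(page_id, chrome_proxy_pid) do
    :ets.insert(@table, {page_id, chrome_proxy_pid})
  end

  # Called when the listener parses the page id out of the requested ws:// path.
  def lookup(page_id) do
    case :ets.lookup(@table, page_id) do
      [{^page_id, pid}] -> {:ok, pid}
      [] -> {:error, :not_found}
    end
  end

  def unregister(page_id), do: :ets.delete(@table, page_id)
end
```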

Phew! Yeah, a race condition is what I thought too, but I had assumed you were already routing based on a url-to-process mapping and thought the storing/retrieving of it had some race condition.

Btw, for splitting into worker and server nodes, how will the worker node register with the server node? Also, if chrome crashes, how will the server node know? I think the current architecture works really well. The only hard part is getting it set up (install erlang, install elixir, mix compile). Is there a way to package this as a self-contained binary?

@holsee Once you have connection tracking, is it possible to eliminate the API step? i.e. the client connects to ws://localhost:1331/ and the proxy server spawns a new connection (instead of the API server) and it continues from there.

This makes it easier to load balance multiple chroxy instances because it will be a single connection. Currently it takes two connections (one to the api and one ws) and those can go to different instances. The only way to solve that, from what I see, is to ensure that a client machine always gets routed to the same chroxy instance, but that means a single client machine can only use one chroxy instance.

@j-manu possibly, but the client needs to connect to what it thinks is a Chrome Remote Debug instance (but is, in fact, a proxy) - so this might break the protocol and the client compatibility.

@holsee But that's how it works now also?

Here is how I envisioned it.

  1. Proxy listens for a connection
  2. For a NEW connection, it spawns a new ChromeProxy instance.
  3. The connection is handed over to the ChromeProxy instance.
  4. From this point on it is just like current architecture.

I'm not familiar with elixir, but I guess you can differentiate between new and old connections, and this change doesn't require holding a mapping of connections -> ChromeProxy for multiple chrome instances to work.

What are some Chrome use cases for targeting and reconnecting to the same session via the Chrome-provided WS URL?

@adamgotterer Are you asking for use cases or are you saying that there are use cases for targeting and reconnecting to the same session?

I’m asking for use cases. I’ve personally never needed to access a session a second time or in parallel. So I’m curious what people have used it for.

I am working on the connection routing fix now (while baby sleeps 👶). If I don't get this done today I will pick it up on Wednesday, as I will have time while I travel. Please open other issues for ideas and bugs not related to the 500 issue as defined here, thanks.

@j-manu @adamgotterer I have a fix in place in pull request #10 - I need to tidy this up and perform some testing on my end before merging this.

I also needed to perform #9 first in order to reliably recreate the issue (i.e. each connection request will take the next chrome browser in the pool instead of random selection). The unit test added in #9 recreates the issue reliably, and I can confirm #10 sees that test passing.

You can try the fixes out on fix/issue-5-conn-race develop branch if you wish

@adamgotterer @j-manu version 0.4.0 is pushed to hex.pm, which includes the fix for the issue detailed here. Closing now. Thanks for your reports and help tracking issues down, much appreciated!

Thank you! Now I can deploy this to production :)

@j-manu version tag 0.4.1 for sanity 👍 let me know how you get on.