deis / router

Edge router for Deis Workflow

Home Page: https://deis.com

Random 502 bad gateway

robinmonjo opened this issue · comments

Hello all,

Using Deis v2.11.0, I am seeing random 502 errors on the Deis router. I deployed an app and scaled it up to 5 web processes.

So I've got this service:

NAME     CLUSTER-IP       EXTERNAL-IP   PORT(S)   AGE
my_app   100.65.135.200   <none>        80/TCP    1d

and these endpoints:

Name:		my_app
Namespace:	my_app
Labels:		app=my_app
		heritage=deis
		router.deis.io/routable=true
Subsets:
  Addresses:		10.42.0.12,10.42.0.2,10.43.128.11,10.43.128.12,10.43.128.13
  NotReadyAddresses:	<none>
  Ports:
    Name	Port	Protocol
    ----	----	--------
    http	3000	TCP

No events.

Everything looks good: nginx on the Deis router is properly configured to send requests to my Kubernetes service:

[...]
proxy_pass http://100.65.135.200:80;
[...]

I can also curl each endpoint successfully from within the router pod.

However, I regularly get random 502 Bad Gateway responses when accessing my app (it works most of the time, but 20% of my requests get a 502). Here are some logs from the router:

2017/02/09 14:27:10 [error] 48#0: *8869 connect() failed (113: No route to host) while connecting to upstream, client: 10.42.0.0, server: ~^my_app\.(?<domain>.+)$, request: "GET /en/accounts/1/events HTTP/1.1", upstream: "http://100.65.135.200:80/en/accounts/1/events", host: "my_app.deis.my_host.com"
[2017-02-09T14:27:10+00:00] - my_app - 10.42.0.0 - - - 502 - "GET /en/accounts/1/events HTTP/1.1" - 732 - "-" - "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36" - "~^my_app\x5C.(?<domain>.+)$" - 100.65.135.200:80 - my_app.deis.my_host.com - 2.996 - 2.996
[2017-02-09T14:27:10+00:00] - my_app - 10.42.0.0 - - - 200 - "GET /favicon.ico HTTP/1.1" - 2664 - "http://my_app.deis.my_host.com/en/accounts/1/events" - "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36" - "~^my_app\x5C.(?<domain>.+)$" - 100.65.135.200:80 - my_app.deis.my_host.com - 0.002 - 0.002
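To put a number on how often this happens, the access-log lines above can be tallied by status code. This is only a rough sketch: it assumes the status code is the first three-digit field directly followed by the quoted request line, as in the two sample lines, and that nginx error-log lines (which do not start with "[") should be skipped. The names count_statuses and error_rate are made up for this example.

```python
import re
from collections import Counter

# Assumption: in the router access log, the status code is the first
# three-digit field immediately followed by the quoted request line.
STATUS_RE = re.compile(r' - (\d{3}) - "')


def count_statuses(lines):
    """Count HTTP status codes found in router access-log lines."""
    counts = Counter()
    for line in lines:
        # Access-log lines start with "[timestamp]"; error-log lines do not.
        if not line.startswith("["):
            continue
        match = STATUS_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def error_rate(counts, status="502"):
    """Return the fraction of counted requests that had the given status."""
    total = sum(counts.values())
    return counts[status] / total if total else 0.0
```

Feeding it the output of kubectl logs for the router pod, split into lines, should give a rough 502 ratio to compare before and after a deployment.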

I have no idea how to debug this further ...

Regards,
Robin

Linking this issue: #180. It is pretty hard to reproduce, as it happens randomly.

I've got more information on this. I scaled the Deis router to 2 pods. From within each of the 2 router pods, I started a simple loop that curls my service and outputs the status code every ten seconds.
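For anyone who wants to run the same kind of check, here is a minimal sketch of such a loop in Python instead of curl. It assumes Python 3 is available wherever it runs, and the names probe and watch are invented for this example. It uses http.client so that redirects are not followed and a 302 is reported as a 302.

```python
import http.client
import time


def probe(host, port=80, path="/", timeout=5.0):
    """Return the HTTP status code as a string, or "error: ..." on failure."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        return str(conn.getresponse().status)
    except (OSError, http.client.HTTPException) as err:
        # "No route to host", refused connections and timeouts are OSErrors.
        return f"error: {err}"
    finally:
        conn.close()


def watch(host, interval=10.0, rounds=None, probe_fn=probe, sleep=time.sleep):
    """Probe host every `interval` seconds; loop forever if rounds is None."""
    results = []
    done = 0
    while rounds is None or done < rounds:
        result = probe_fn(host)
        # Print a timestamp so failures can be matched against deployments.
        print(time.strftime("%H:%M:%S"), result, flush=True)
        results.append(result)
        done += 1
        sleep(interval)
    return results
```

Calling watch("100.65.135.200") would probe the service's cluster IP every ten seconds until interrupted.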

For about 1 hour, everything worked fine: only the expected 302 status code was output. Then I launched a deployment, started to see some failures, and managed to catch one:

$> curl 100.65.135.200 -vv
* Rebuilt URL to: 100.65.135.200/
*   Trying 100.65.135.200...
* connect to 100.65.135.200 port 80 failed: No route to host
* Failed to connect to 100.65.135.200 port 80: No route to host
* Closing connection 0
curl: (7) Failed to connect to 100.65.135.200 port 80: No route to host

So, good news: it doesn't seem to come from the Deis router. It looks like my Kubernetes service fails sometimes, and more frequently when its endpoints have changed recently. I have liveness and readiness probes set on my app's web processes.

Does that sound like a reasonable conclusion to you? My cluster is a k8s 1.5.2 cluster set up with kops, running on AWS and using the weave network plugin.

I don't know much about weave, but your conclusion seems reasonable. This sounds like a problem upstream from Nginx, so the overlay network, kube-proxy, or some bug in deployments are all possibilities.

OK, thank you. What do you recommend using? I have had good success with flannel, not so much with weave (hence the issue :) ). I would rather not use the "not software based" networking, as I don't want to have to worry about routing in my VPC route table...

I don't want to be too quick to pin the problem on weave, but since you asked... most of my clusters have used kube-aws from CoreOS. That uses flannel by default and I've never personally witnessed this problem.

Also, more recently, I have created clusters using kops and that uses kubenet by default.

I used to use kube-aws as well, but I have had a terrible experience with it lately, since they introduced node pools. I tried kops and really loved it. I'll try another overlay network and see if my problem happens again.

Robin,

Where did you end up with this? We noticed 502s within our cluster as well, running weave.

We are running kube 1.7.4 and weave 2.1.3. We notice times where things start dropping.

But it is not isolated to a single node.

What would you recommend as next steps? Would one weave pod crash really cause 502 originating on different deis pods on different nodes?