deis / nsq

A kubernetes based docker image for running nsqd

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

nsqd pod exited as 'completed' triggering app health check failures

vdice opened this issue · comments

Deis' own champagne cluster was affected by the nsqd pod exiting w/ Completed status, seemingly triggering errors in both the registry and workflow-manager pods. (The alerts leading us to investigate were health check failures on the k8s-claimer app deployed on this cluster.)

Here is the log tail from said pod:

...
[nsqd] 2016/10/31 21:25:50.576088 200 GET /stats?format=json (10.20.1.9:34742) 39.962µs
[nsqd] 2016/10/31 21:25:51.050988 HTTP: closing [::]:4151
[nsqd] 2016/10/31 21:25:51.052387 TCP: closing [::]:4150
[nsqd] 2016/10/31 21:25:51.052510 NSQ: persisting topic/channel metadata to /opt/nsq/data/nsqd.905.dat
[nsqd] 2016/10/31 21:25:51.534562 NSQ: closing topics
[nsqd] 2016/10/31 21:25:51.534729 TOPIC(metrics): closing
[nsqd] 2016/10/31 21:25:51.534804 TOPIC(metrics): closing ... messagePump
[nsqd] 2016/10/31 21:25:51.534844 CHANNEL(consumer): closing
[nsqd] 2016/10/31 21:25:51.534921 DISKQUEUE(metrics:consumer): closing
[nsqd] 2016/10/31 21:25:51.534989 DISKQUEUE(metrics:consumer): closing ... ioLoop
[nsqd] 2016/10/31 21:25:51.536344 PROTOCOL(V2): [10.20.2.9:53476] exiting ioloop
[nsqd] 2016/10/31 21:25:51.547640 ERROR: client(10.20.2.9:53476) - failed to read command - read tcp 10.20.0.5:4150->10.20.2.9:53476: use of closed network connection
[nsqd] 2016/10/31 21:25:51.549589 PROTOCOL(V2): [10.20.1.9:40500] exiting ioloop
[nsqd] 2016/10/31 21:25:51.549655 ERROR: client(10.20.1.9:40500) - failed to read command - read tcp 10.20.0.5:4150->10.20.1.9:40500: use of closed network connection
[nsqd] 2016/10/31 21:25:51.551025 PROTOCOL(V2): [10.20.1.9:40500] exiting messagePump
[nsqd] 2016/10/31 21:25:51.551086 PROTOCOL(V2): [10.20.0.7:48059] exiting ioloop
[nsqd] 2016/10/31 21:25:51.551144 ERROR: client(10.20.0.7:48059) - failed to read command - read tcp 10.20.0.5:4150->10.20.0.7:48059: use of closed network connection
[nsqd] 2016/10/31 21:25:51.551182 PROTOCOL(V2): [10.20.0.7:48059] exiting messagePump
[nsqd] 2016/10/31 21:25:51.551246 PROTOCOL(V2): [10.20.2.9:53476] exiting messagePump
[nsqd] 2016/10/31 21:25:51.664931 DISKQUEUE(metrics): closing
[nsqd] 2016/10/31 21:25:51.665069 DISKQUEUE(metrics): closing ... ioLoop
[nsqd] 2016/10/31 21:25:51.809062 TOPIC(logs): closing
[nsqd] 2016/10/31 21:25:51.809250 TOPIC(logs): closing ... messagePump
[nsqd] 2016/10/31 21:25:51.809313 CHANNEL(consume): closing
[nsqd] 2016/10/31 21:25:51.809359 DISKQUEUE(logs:consume): closing
[nsqd] 2016/10/31 21:25:51.809416 DISKQUEUE(logs:consume): closing ... ioLoop
[nsqd] 2016/10/31 21:25:51.810602 PROTOCOL(V2): [10.20.2.6:44983] exiting ioloop
[nsqd] 2016/10/31 21:25:51.810664 ERROR: client(10.20.2.6:44983) - failed to read command - read tcp 10.20.0.5:4150->10.20.2.6:44983: use of closed network connection
[nsqd] 2016/10/31 21:25:51.810752 PROTOCOL(V2): [10.20.2.6:44983] exiting messagePump
[nsqd] 2016/10/31 21:25:51.961055 DISKQUEUE(logs): closing
[nsqd] 2016/10/31 21:25:51.961237 DISKQUEUE(logs): closing ... ioLoop
[nsqd] 2016/10/31 21:25:52.081016 QUEUESCAN: closing
[nsqd] 2016/10/31 21:25:52.081297 200 GET /stats?format=json (10.20.1.9:34742) 896.959019ms
[nsqd] 2016/10/31 21:25:52.166461 200 GET /stats?format=json (10.20.2.9:43496) 650.746772ms
[nsqd] 2016/10/31 21:25:52.167008 ID: closing
[nsqd] 2016/10/31 21:25:52.167217 LOOKUP: closing

Cluster state when investigated:

$ kd get po
NAME                                     READY     STATUS      RESTARTS   AGE
deis-builder-808234085-4er8f             1/1       Running     11         14d
deis-controller-3550476522-vjmsw         1/1       Running     0          19d
deis-logger-1750174080-h61sh             1/1       Running     0          19d
deis-logger-fluentd-0s73r                1/1       Running     1          19d
deis-logger-fluentd-m6m7c                1/1       Running     0          19d
deis-logger-fluentd-w2gbi                1/1       Running     0          19d
deis-logger-redis-4282343581-o53zz       1/1       Running     0          19d
deis-monitor-grafana-947410455-7cujl     1/1       Running     0          19d
deis-monitor-influxdb-802438432-eyz7v    1/1       Running     0          19d
deis-monitor-telegraf-6qx76              1/1       Running     3          19d
deis-monitor-telegraf-c8d4t              1/1       Running     0          19d
deis-monitor-telegraf-cz6g7              1/1       Running     0          19d
deis-nsqd-2528936859-oe5je               0/1       Completed   3          19d
deis-registry-2153931602-3jmt9           0/1       Error       0          19d
deis-registry-proxy-s9p27                1/1       Running     0          19d
deis-registry-proxy-vg02u                1/1       Running     1          19d
deis-registry-proxy-vih1z                1/1       Running     0          19d
deis-router-1833288991-6itpr             1/1       Running     0          19d
deis-workflow-manager-3855316227-r1z8o   0/1       Error       0          19d

Log tail from registry pod, unhelpful besides timestamps:

10.20.0.1 - - [31/Oct/2016:21:23:20 +0000] "GET /v2/ HTTP/1.1" 200 2 "" "Go-http-client/1.1"
time="2016-10-31T21:23:20.983954657Z" level=info msg="response completed" go.version=go1.6.2 http.request.host="10.20.0.8:5000" http.request.id=7825c2e9-11a2-45f3-b906-af8f6094dfc2 http.request.method=GET http.request.remoteaddr="10.20.0.1:52909" http.request.uri="/v2/" http.request.useragent="Go-http-client/1.1" http.response.contenttype="application/json; charset=utf-8" http.response.duration=174.29994ms http.response.status=200 http.response.written=2 instance.id=bc726225-3d34-4142-a327-fc7f79f87ccf service=registry version=v2.1.1-601-g0afef00
10.20.0.1 - - [31/Oct/2016:21:23:20 +0000] "GET /v2/ HTTP/1.1" 200 2 "" "Go-http-client/1.1"

Note: Interestingly, unless there is some time skew at play, the last log from registry is at 21:23:20 whereas the nsqd timestamp is a bit later at 21:25:51 -- so it may indeed be a registry issue...

This issue was moved to teamhephy/nsqd#2