Failure to connect to dhstore due to i/o timeout
gammazero opened this issue · comments
Andrew Gillis commented
The indexer intermittently fails to connect to a dhstore node, with an i/o timeout error, when it tries to delete metadata.
{"level":"error","ts":"2023-10-03T01:01:29.390Z","logger":"indexer/ingest","caller":"ingest/ingest.go:1210","msg":"Error while ingesting ad. Bailing early, not ingesting later ads.","publisher":"12D3KooWKa1NQKxveFmppQAjkHG8RGpjoQ2ivubeNuEUgZGgc2T6","adCid":"baguqeeragw6jbdqznqphihw264orde743jmy4hhesrma7p54eq6j3s3ebypq","err":"indexerErr: internal error: failed to remove provider context: Delete \"http://dhstore.internal.prod.cid.contact/metadata/92RXJkNDA2iXjxcAqD82Esxoe5y4Fy1jtaqAM49Z4d2p\": dial tcp 20.10.6.204:80: i/o timeout","adsLeftToProcess":22346}
This only appears to happen:
- after the dhstore node has been running for days
- on dhstore nodes that are only handling reads and deletes, not the current node being written to
This looks like it might be a Kubernetes network issue, since the problem does not appear for a long time after a pod is restarted.
Andrew Gillis commented
This may be related to the bug described here:
https://kubernetes.io/blog/2019/03/29/kube-proxy-subtleties-debugging-an-intermittent-connection-reset/
We should upgrade Kubernetes.