cloudfoundry / loggregator-release

Cloud Native Logging

trafficcontroller fails to communicate with log-cache

JamesClonk opened this issue

  • I am having trouble getting set up, understanding documentation, or using
    Loggregator in some way - please highlight any resources you are using.

After upgrading to the latest cf-deployment v4.3.0, we're currently experiencing problems retrieving app statistics (cf app).
We get the error message "Stats unavailable: Stats server temporarily unavailable."
Tracing this in the Cloud Controller code base led us to the trafficcontroller, and we saw a whole lot of these error messages in its logfiles:

2018/09/19 16:07:45.282634 error communicating with log cache: rpc error: code = Internal desc = connection error: desc = "transport: authentication handshake failed: tls: oversized record received with length 20527"
2018/09/19 16:07:48.334626 LogCache request failed: rpc error: code = Internal desc = connection error: desc = "transport: authentication handshake failed: tls: oversized record received with length 20527"

It seems the trafficcontroller is not able to communicate with log-cache?
I've manually made some gRPC test calls to log-cache, but it does seem to respond.
I'm not sure what the problem is here.

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/160628095

The labels on this github issue will be updated when the story is started.

Hi there. Can you check the environment.sh file located in /var/vcap/jobs/loggregator_trafficcontroller/bin on the loggregator_trafficcontroller vm? What is the LOG_CACHE_ADDRESS that's configured?

@MasslessParticle Sure thing.
The log-cache address is taken from the log-cache bosh-link:

LOG_CACHE_ADDR="q-s0.doppler.management.cloudfoundry.bosh:8080"

Resolving that address works, I can also grpcurl it.

I've been wondering about that, actually: since log-cache's certificates have just log-cache as their CN, doesn't that pose a problem for connecting/verifying SSL?
I've noticed that other log-cache components, like the log-cache-scheduler, even have the instance IPs (via the log-cache bosh-link) set directly as their log-cache target address, which would pose the same issue.
I guess clients don't verify the CN of log-cache's certificates, since they're just simple clients?

edit: just to clarify, the log-cache-scheduler, for example, seems to be able to connect to log-cache just fine.

LOG_CACHE_ADDR="q-s0.doppler.management.cloudfoundry.bosh:8080" looks like the expected address/port, but the error you're seeing indicates that the trafficcontroller is trying to make a TLS connection to a non-TLS endpoint.

When you grpcurl the endpoint, are you providing certs? Which log-cache process is listening on :8080?

The certs are validated client side. We configure the client to expect the provided CN.

Everything is deployed as per https://github.com/cloudfoundry/cf-deployment/blob/v4.4.0/cf-deployment.yml
log-cache is configured to listen on port 8080. No, I did not provide any certs for grpcurl; I simply used the --insecure flag to ignore certificate problems.

What I'm not understanding is how the log-cache certificates work.
According to https://github.com/cloudfoundry/cf-deployment/blob/v4.4.0/cf-deployment.yml#L2050 its CN is simply log-cache.
But if I try to reach log-cache from elsewhere, either using the LOG_CACHE_ADDR that loggregator uses (q-s0.doppler.management.cloudfoundry.bosh) or the instance IP directly as, for example, the log-cache-scheduler does, then of course certificate validation would not work.
So does log-cache on port 8080 actually not have TLS? Why would the trafficcontroller then try to contact it over TLS and get that error message?

@JamesClonk So the CN being log-cache is more of an artifact of how flexible SSL is in Go (specifically grpc-go). grpc-go does not require the common name to match the DNS name. That being said, port 8080 is gRPC via mTLS and therefore would have to have credentials.
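Not actual job config, just a minimal grpc-go sketch of that idea (the certificate paths are placeholders; note the expected server name is pinned explicitly rather than taken from the dialed address):

package main

import (
	"crypto/tls"
	"crypto/x509"
	"io/ioutil"
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"
)

func main() {
	// Client certificate and key for mTLS (placeholder paths).
	cert, err := tls.LoadX509KeyPair("log_cache.crt", "log_cache.key")
	if err != nil {
		log.Fatal(err)
	}

	// CA used to verify the server's certificate (placeholder path).
	caPEM, err := ioutil.ReadFile("log_cache_ca.crt")
	if err != nil {
		log.Fatal(err)
	}
	pool := x509.NewCertPool()
	pool.AppendCertsFromPEM(caPEM)

	creds := credentials.NewTLS(&tls.Config{
		Certificates: []tls.Certificate{cert},
		RootCAs:      pool,
		// The server cert is verified against this name rather than the
		// DNS name that was dialed, which is why a CN of "log-cache"
		// works even when dialing the bosh-dns name or a raw instance IP.
		ServerName: "log-cache",
	})

	conn, err := grpc.Dial(
		"q-s0.doppler.management.cloudfoundry.bosh:8080",
		grpc.WithTransportCredentials(creds),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	// ...build a log-cache egress client on top of conn...
}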

The interesting part of your given error message is:

tls: oversized record received with length 20527

I can recreate this by doing:

$ grpcurl --insecure localhost:8081 Egress/Read
Failed to dial target host "localhost:8081": tls: oversized record received with length 20527

Port 8081 is normally serviced by the log-cache-gateway, which converts gRPC via mTLS to plaintext HTTP 1.1 (which is why it is only ever bound to localhost). I wonder if somehow your configuration has some ports swapped around.
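For what it's worth, the specific length in the error also fits that theory: if a plaintext HTTP/1.1 response gets parsed as a TLS record header, the 4th and 5th bytes of "HTTP/" ("P" and "/") are read as the record length, which is exactly 20527. A tiny illustration of just that arithmetic (not log-cache code):

package main

import (
	"encoding/binary"
	"fmt"
)

func main() {
	// A TLS record header is 5 bytes: type (1), version (2), length (2).
	// If the peer answers with plaintext HTTP, the TLS client ends up
	// parsing "HTTP/..." as a record header and takes bytes 3..4 as the
	// record length.
	resp := []byte("HTTP/1.1 400 Bad Request")
	length := binary.BigEndian.Uint16(resp[3:5]) // 'P', '/' -> 0x502F
	fmt.Println(length)                          // prints 20527
}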

Can you jump on a VM with a log-cache process running (normally the doppler VM) and run the following and give us the output:

lsof -i :8080

@apoydence

hmm, ok. Let's see:

doppler/dc0ab5e6-9be6-4cc7-8277-630b78c92714:~# lsof -i :8080
COMMAND     PID USER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
log-cache 20858 vcap    3u  IPv4 6829284      0t0  TCP *:http-alt (LISTEN)
log-cache 20858 vcap    5u  IPv4 6835556      0t0  TCP dc0ab5e6-9be6-4cc7-8277-630b78c92714.doppler.management.cloudfoundry.bosh:http-alt->90514117-b667-4a4b-813c-4468ad1d9e28.doppler.management.cloudfoundry.bosh:48798 (ESTABLISHED)
log-cache 20858 vcap    6u  IPv4 6844673      0t0  TCP dc0ab5e6-9be6-4cc7-8277-630b78c92714.doppler.management.cloudfoundry.bosh:http-alt->58125d89-0992-4e06-a24d-cd42aad587f3.doppler.management.cloudfoundry.bosh:38382 (ESTABLISHED)
log-cache 20858 vcap    7u  IPv4 6842178      0t0  TCP dc0ab5e6-9be6-4cc7-8277-630b78c92714.doppler.management.cloudfoundry.bosh:http-alt->af953fa4-4853-42fd-823a-c60de1f88864.doppler.management.cloudfoundry.bosh:50768 (ESTABLISHED)
log-cache 20858 vcap    8u  IPv4 6835313      0t0  TCP dc0ab5e6-9be6-4cc7-8277-630b78c92714.doppler.management.cloudfoundry.bosh:http-alt->63174d50-0a60-4eb2-9c28-d201bc2b481f.scheduler.diego-scheduler.cloudfoundry.bosh:58666 (ESTABLISHED)
log-cache 20858 vcap    9u  IPv4 6839506      0t0  TCP dc0ab5e6-9be6-4cc7-8277-630b78c92714.doppler.management.cloudfoundry.bosh:http-alt->c6b793fe-dc2b-406c-82c9-3f7341f1b93a.doppler.management.cloudfoundry.bosh:60026 (ESTABLISHED)
log-cache 20858 vcap   11u  IPv4 6865004      0t0  TCP dc0ab5e6-9be6-4cc7-8277-630b78c92714.doppler.management.cloudfoundry.bosh:http-alt->d4348361-445d-4aae-81f5-be391aff3bd8.log-api.management.cloudfoundry.bosh:43332 (ESTABLISHED)
log-cache 20858 vcap   12u  IPv4 6843609      0t0  TCP dc0ab5e6-9be6-4cc7-8277-630b78c92714.doppler.management.cloudfoundry.bosh:http-alt->3d226791-d145-4adc-ab8d-4c3702843e7f.doppler.management.cloudfoundry.bosh:37198 (ESTABLISHED)
log-cache 20858 vcap   13u  IPv4 6844723      0t0  TCP dc0ab5e6-9be6-4cc7-8277-630b78c92714.doppler.management.cloudfoundry.bosh:57464->3d226791-d145-4adc-ab8d-4c3702843e7f.doppler.management.cloudfoundry.bosh:http-alt (ESTABLISHED)
log-cache 20858 vcap   14u  IPv4 6828418      0t0  TCP localhost:39692->localhost:http-alt (ESTABLISHED)
log-cache 20858 vcap   15u  IPv4 6827663      0t0  TCP localhost:http-alt->localhost:39692 (ESTABLISHED)
log-cache 20858 vcap   16u  IPv4 6828541      0t0  TCP localhost:http-alt->localhost:39742 (ESTABLISHED)
log-cache 20858 vcap   17u  IPv4 6838631      0t0  TCP dc0ab5e6-9be6-4cc7-8277-630b78c92714.doppler.management.cloudfoundry.bosh:http-alt->54447272-0d1d-4879-8248-fcbeea835d41.doppler.management.cloudfoundry.bosh:38626 (ESTABLISHED)
log-cache 20858 vcap   18u  IPv4 6830154      0t0  TCP localhost:http-alt->localhost:39826 (ESTABLISHED)

@JamesClonk That looks as expected.

@toddboom Can you offer some insight? It seems like everything is configured correctly; however, it also appears that log-cache is responding without TLS (this is an assumption based on the error).

@JamesClonk I don't immediately know what would be causing this, but we've got a few additional questions:

  • Could you send us some sample grpcurl commands and their output?
  • Does cf tail work?
  • You had initially mentioned cf-d 4.3.0, but then later referenced 4.4.0 - which version do you currently have deployed? (And what version did you have deployed prior to the upgrade that caused the initial problems?)

@JamesClonk please let us know if this is still an issue. I am going to close this due to inactivity. But if you still have problems, feel free to reopen.

Sorry, I forgot about the open issue.
Strangely, we have not encountered the problem with any cf-deployment > 4.x; it only happened going from 3.x to 4.x.
I was not able to reproduce it anymore.