telepresenceio / telepresence

Local development against a remote Kubernetes or OpenShift cluster

Home Page: https://www.telepresence.io

Slow response times when intercepting pod traffic

alextricity25 opened this issue

Describe the bug
I have a web service that I am intercepting traffic for. The intercept works as expected, but my page load times are 3-5 seconds; when I stop the intercept, they are typically under 1 second.
My web service also connects to the database server in the cluster, which I suspect may be the reason for the slowness. Still, I wouldn't expect this much latency between my intercepted pod (running on my workstation) and the database it connects to in the cluster.
[screenshot]

telepresence_logs.zip

To Reproduce
Steps to reproduce the behavior:

  1. When I run telepresence intercept xrdm-portal --port 80:80 --docker-build ./devops/local-development --docker-build-opt file=./devops/local-development/Dockerfile.portal-web-watch -- --rm --name blah -e WATCH=true -v ./apps/portal:/app/ IMAGE
  2. I see a new pod spin up, and when I visit the service in my Kubernetes cluster, traffic is successfully directed to the service running on my local workstation. However, the app is extremely slow.
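
Before timing page loads, it's worth confirming the intercept is actually active; the standard Telepresence commands for that are (output omitted):

  telepresence list      # shows which workloads are currently intercepted
  telepresence status    # shows the state of the daemons and the traffic manager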

Expected behavior
I expect the response times to be marginally slower, but not significantly slower.

Versions (please complete the following information):

  • Output of telepresence version
OSS Client             : v2.17.0
OSS Daemon in container: v2.17.0
Traffic Manager        : v2.19.0
  • Operating system of workstation running telepresence commands:
macOS 14.3.1
  • Kubernetes environment and Version [e.g. Minikube, bare metal, Google Kubernetes Engine]:
GKE - 1.27.8-gke.1067004

I upgraded to OSS Client v2.18.0 and unfortunately that didn't help :(

OSS Client             : v2.18.0
OSS Daemon in container: v2.18.0
Traffic Manager        : v2.19.0
Traffic Agent          : docker.io/datawire/ambassador-telepresence-agent:1.14.5
[screenshot]

I've managed to pin this down to the connection my web app is making to the postgresql service running in the cluster.

I have created a script to test the latency of SQL requests from the web service to the database, and here are the results.
When I run the script on the web service running in the cluster, I get almost zero latency (as expected, since postgresql is also running in the cluster):
[screenshot: near-zero query latency in-cluster]

However, when I intercept that service and run the same script from the intercepted service running on my laptop, the execution time for this script significantly increases:
[screenshot: much higher execution time via the intercept]

Here, I am connecting to the postgresql database running on the k8s cluster from the intercepted service running on my laptop inside a Docker container.
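
The script itself isn't attached, but a minimal probe along these lines exercises the same path (plain psql; host, user, and database names are illustrative, and PGPASSWORD is assumed to be set):

  # run 50 trivial queries against the cluster DB and time the whole loop
  time for i in $(seq 1 50); do
    psql -h "$DB_HOST" -U "$DB_USER" -d "$DB_NAME" -c 'SELECT 1;' >/dev/null
  done

Each iteration pays connection setup plus one query round trip, so any per-connection overhead added by the tunnel is multiplied by 50 and easy to see.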

Is it normal for Telepresence to have this much latency when an intercepted service is trying to reach another service running in the cluster? The way I see it, this is what is happening:
1 - The client connects to the web service running in the k8s cluster.
2 - The web service receives the connection, which is intercepted by the Telepresence traffic-agent sidecar.
3 - The agent sends that request to the traffic manager.
4 and 5 - The traffic manager sends that request to the daemon running on my laptop.
6 - The web service on my laptop receives the request.
7 - The web service needs to connect to the database, so it opens a connection to the database service running in the k8s cluster.
8 - The traffic manager receives that connection and sends it off to the postgresql pod running in the cluster.
9 - postgresql receives that connection.

[diagram of the traffic flow described above]

Is this a correct understanding of what is happening?

A small update here,

I created a k8s port-forward to my database service, then ran a socat container in Docker to forward traffic to my Docker host machine on port 5473. I also changed the DB_HOST environment variable on my web service to point at my Docker host. This way, connections from my local service to the database go over the k8s port-forward instead of through Telepresence.
I saw only a marginal decrease in latency, so it seems there may be something else going on that is specific to my cluster.
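
For reference, the wiring looks roughly like this (service name, ports, and the relay setup are illustrative, not the exact commands I ran):

  # forward the in-cluster postgresql service to the workstation on port 5473
  kubectl port-forward svc/postgresql 5473:5432

  # relay container that accepts connections on 5473 and forwards them to the
  # port-forward listening on the Docker host (host.docker.internal on
  # Docker Desktop for macOS)
  docker run --rm --name pg-relay alpine/socat \
    TCP-LISTEN:5473,fork,reuseaddr TCP:host.docker.internal:5473

With the web service container on the same Docker network as pg-relay, setting DB_HOST=pg-relay (or pointing it at the Docker host directly) sends the database traffic around Telepresence entirely.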

At this point, I am not convinced that this is a Telepresence issue. This seems more like an issue with the connection to my database running in k8s. Closing this for now.

Your understanding of what's going on is mostly correct, but it reflects the old (pre-2.17.0) behavior. Today, Telepresence sets up a direct tunnel between the client and the traffic-agent; the traffic-manager is not involved and will not route traffic once an intercept is established. There is one exception to that rule: if the client isn't permitted to port-forward to the traffic-agent, Telepresence falls back to the old behavior.
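
For anyone wondering which of the two code paths they are on: whether the client may port-forward to the traffic-agent is ultimately an RBAC question, so a quick check is (namespace illustrative):

  kubectl auth can-i create pods/portforward -n <namespace>

If this prints "no", Telepresence falls back to routing intercepted traffic through the traffic-manager as in the diagram above, with the extra hop and latency that implies.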