owais / istio-grpc-headless-test


Demonstrates gRPC + headless service issues with Istio

issue: istio/istio#49391

Steps to reproduce

  1. Deploy the server components with `kubectl apply -f server.yaml` (a rough sketch of these resources follows the steps).
  2. Switch to the istio-test namespace: `kubens istio-test`
  3. Wait for all three server pods to become healthy.
     (screenshot: step-1)
  4. Deploy the client components with `kubectl apply -f client.yaml`
  5. Observe that the client is able to send messages to the server pods: `kubectl logs -f -l=component=client`
     (screenshot: step-5)
  6. Scale down the server pods: `kubectl scale sts istio-grpc-test-server --replicas 0`
  7. Observe errors in the client pod logs.
     (screenshot: step-7)
  8. Look at the Istio proxy endpoints; most of the time it will still list the old server pod IPs as healthy: `istioctl proxy-config endpoints <client-pod-name> | grep server`
     (screenshot: step-8)
  9. Scale the server pods back up to three replicas, wait for the new pods to come up and become healthy, and take note of the new IPs: `kubectl scale sts istio-grpc-test-server --replicas 3`
     (screenshot: step-9)
  10. Describe the service to confirm that the Kubernetes service has updated its endpoints to the new pod IPs.
      (screenshot: step-10)
  11. List the client proxy endpoints again as in step 8 and notice that they still point to the old IPs.
      (screenshot: step-11)
  12. Look at the client pod logs again and confirm that the errors have not resolved, even though the replacement server pods are up and healthy.
      (screenshot: step-12)
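
For reference, the server side amounts to a headless service in front of a StatefulSet. The sketch below shows the assumed shape of server.yaml; the label keys, port number, and container image are placeholders rather than the repo's actual values (the service name, namespace, and the grpc- port prefix come from the steps and sections above).

```yaml
# Sketch of the assumed server.yaml: a headless service (clusterIP: None)
# in front of a three-replica StatefulSet. The grpc- port name prefix is
# what tells Istio to treat the port as gRPC.
apiVersion: v1
kind: Service
metadata:
  name: istio-grpc-test-server
  namespace: istio-test
spec:
  clusterIP: None
  selector:
    component: server        # placeholder label, mirroring component=client used for the client
  ports:
    - name: grpc-server      # grpc- prefix => Istio protocol selection picks gRPC
      port: 5000             # placeholder port
      targetPort: 5000
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: istio-grpc-test-server
  namespace: istio-test
spec:
  serviceName: istio-grpc-test-server
  replicas: 3
  selector:
    matchLabels:
      component: server
  template:
    metadata:
      labels:
        component: server
    spec:
      containers:
        - name: server
          image: example.com/grpc-test-server:latest   # placeholder image
          ports:
            - containerPort: 5000
```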

At this point, practically the only way to recover the service is to restart the client pod so that it picks up the new server IPs. Occasionally, making changes to the Kubernetes service resource or related Istio resources also triggers an update of the client endpoints to the new IPs, but this does not work reliably.

Deployments have the same issue

I've deployed the servers as a StatefulSet because that is the closest setup to my real-world scenario, but the workload kind doesn't really matter. I've been able to reproduce the issue with a Deployment as well, as long as the pods are exposed with a headless service (clusterIP: None).
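
For illustration, a Deployment variant that reproduces the same behaviour could look roughly like this, reusing the headless service sketched above (again a sketch with placeholder image, labels, and port, not the repo's actual manifest):

```yaml
# Same headless service as in the sketch above; only the workload kind changes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: istio-grpc-test-server
  namespace: istio-test
spec:
  replicas: 3
  selector:
    matchLabels:
      component: server
  template:
    metadata:
      labels:
        component: server
    spec:
      containers:
        - name: server
          image: example.com/grpc-test-server:latest   # placeholder image
          ports:
            - containerPort: 5000
```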

Accessing individual pods vs the service behaves the same

It doesn't matter whether the client tries to connect to the headless service (istio-grpc-test-server.istio-test.svc.cluster.local) or a specific pod (istio-grpc-test-server-0.istio-grpc-test-server.istio-test.svc.cluster.local). Both cases behave exactly the same.

What works

Headful services

When using a headful (regular ClusterIP) service instead of a headless one, none of this happens: the Istio proxy discovers new endpoints as soon as they become healthy, and it removes old endpoints as soon as server pods are deleted. This allows the client to recover once server pods become available again.

It doesn't matter whether the server pods come from a StatefulSet or a Deployment. As long as they are exposed via a headless service (clusterIP: None), I can reliably reproduce the issue.
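
For comparison, the headful variant differs from the headless service sketched earlier only in that clusterIP: None is omitted, so Kubernetes allocates a virtual IP (same placeholder names and port as above):

```yaml
# Regular (headful) ClusterIP service: identical to the headless one
# except that clusterIP: None is dropped, so a virtual IP is allocated.
apiVersion: v1
kind: Service
metadata:
  name: istio-grpc-test-server
  namespace: istio-test
spec:
  selector:
    component: server
  ports:
    - name: grpc-server
      port: 5000
      targetPort: 5000
```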

Marking service port as TCP

This only happens when the service port's application protocol is set to gRPC by prefixing the port name with grpc-. If the port name is prefixed with tcp- instead, everything works as expected, even with a headless service.
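
Concretely, the only difference is the port name that Istio uses for protocol selection. The fragments below show the two variants of the service's ports block (the port number and the name suffix are placeholders):

```yaml
# Broken: grpc- prefix, so Istio treats the port as gRPC.
ports:
  - name: grpc-server
    port: 5000
    targetPort: 5000
```

```yaml
# Works: tcp- prefix, so Istio treats the port as plain TCP,
# and endpoint updates propagate even with a headless service.
ports:
  - name: tcp-server
    port: 5000
    targetPort: 5000
```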
