smallrye / smallrye-stork

SmallRye Stork is a service discovery and client-side load balancing framework.

Home Page: http://smallrye.io/smallrye-stork/


refresh-period not triggering any periodic discovery

weand opened this issue

Hi,

We use Stork on Quarkus with Kubernetes service discovery.

It turns out that Stork does not honour refresh-period when using a configuration like:

quarkus.grpc.clients.fooService.host=foo
quarkus.grpc.clients.fooService.port=9000
quarkus.grpc.clients.fooService.name-resolver=stork


quarkus.stork.foo.service-discovery.type=kubernetes
quarkus.stork.foo.service-discovery.k8s-namespace=${jkube.namespace}
quarkus.stork.foo.service-discovery.application=foo-service-grpc
quarkus.stork.foo.service-discovery.refresh-period=5s

We debugged the application remotely in k8s and saw that
KubernetesServiceDiscovery.fetchNewServiceInstances(List previousInstances) is not invoked regularly, not even when there is traffic on the application.

We expected that new services get discovered automatically (after the refresh-period), e.g. when scaling up a client pod or restarting a pod. But that's obviously not the case.
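For clarity, here is a small illustration of the behaviour we expected from refresh-period, i.e. a cached instance list that is re-fetched from the cluster once it is older than the configured period. This is only our reading of the setting, not Stork's actual CachingServiceDiscovery code, and the class/field names are made up:

import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Illustration only, not Stork's implementation: the semantics we expected
// from quarkus.stork.foo.service-discovery.refresh-period=5s.
class ExpectedRefreshBehaviour {
    private final Duration refreshPeriod = Duration.ofSeconds(5);
    private Instant lastFetch = Instant.EPOCH;
    private List<String> cachedInstances = List.of();

    List<String> getServiceInstances() {
        if (Duration.between(lastFetch, Instant.now()).compareTo(refreshPeriod) >= 0) {
            // stands in for KubernetesServiceDiscovery.fetchNewServiceInstances(previousInstances)
            cachedInstances = fetchNewServiceInstances(cachedInstances);
            lastFetch = Instant.now();
        }
        return cachedInstances;
    }

    List<String> fetchNewServiceInstances(List<String> previousInstances) {
        // would query the Kubernetes endpoints here
        return previousInstances;
    }
}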

The latest Quarkus release uses Stork 1.1.2, but there doesn't seem to be any change to CachingServiceDiscovery in Stork 1.2.0.

Any advice?

Best regards
Andreas

Hi Andreas,
Thanks for reaching out. It should refresh the cache after the refresh period. Do you have a reproducer?

I don't see anything wrong in the config.

I've got nothing ready for GitHub.

But we basically follow this approach:

Inject a channel object as described here

@GrpcClient("fooService")
Channel channel;

Then on application start we create one async stub on the channel:

FooServiceStub fooService = FooServiceGrpc.newStub(channel);

The boundary of the application contains a JAX-RS service. On every REST call, we have some bidirectional communication on the stub:

fooService.someMethod(responseStreamObserver);
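
Putting the pieces together, a minimal sketch of that wiring looks roughly like this (FooRequest/FooResponse, someMethod and the observer bodies are placeholders for our actual generated classes; the jakarta.* imports would be javax.* on Quarkus 2.x):

import io.grpc.Channel;
import io.grpc.stub.StreamObserver;
import io.quarkus.grpc.GrpcClient;
import io.quarkus.runtime.StartupEvent;
import jakarta.enterprise.event.Observes;
import jakarta.ws.rs.GET;
import jakarta.ws.rs.Path;

@Path("/foo")
public class FooResource {

    @GrpcClient("fooService")
    Channel channel;

    private FooServiceGrpc.FooServiceStub fooService;

    // one async stub, created once on application start
    void onStart(@Observes StartupEvent event) {
        fooService = FooServiceGrpc.newStub(channel);
    }

    @GET
    public void call() {
        // response observer for the bidirectional stream
        StreamObserver<FooResponse> responseStreamObserver = new StreamObserver<>() {
            @Override public void onNext(FooResponse value) { /* handle response */ }
            @Override public void onError(Throwable t) { /* handle error */ }
            @Override public void onCompleted() { /* stream finished */ }
        };
        // bidirectional streaming call on every REST request
        StreamObserver<FooRequest> requestObserver = fooService.someMethod(responseStreamObserver);
        requestObserver.onNext(FooRequest.newBuilder().build());
        requestObserver.onCompleted();
    }
}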

OK, can you confirm that the problem is about the refreshing of the cache? I mean: are the service instances collected the first time and the cache populated with them, or are the instances never fetched from the cluster at all?

I will try to set up an equivalent scenario to see if there is a problem, but we have not detected anything recently. Can you please confirm the Quarkus and Stork versions used?

I'll be off at an event for two days next week, so you can expect some delay in my investigation.

If you have access to the cluster, could you please verify that the following command returns the instances that you expect Stork to discover?
kubectl get endpoints foo-service-grpc -n ${jkube.namespace}

OK, can you confirm that the problem is about the refreshing of the cache? I mean: are the service instances collected the first time and the cache populated with them, or are the instances never fetched from the cluster at all?

Yes, it's about refreshing the cache's view, e.g. after scaling up pods. So a basic reproducer is:

  • gRPC service started with 2 pods initially
  • start the application with the gRPC client: it load-balances properly (in a round-robin manner) between both service instances
  • scale the gRPC service up to 3 pods total: the client never calls the new pod.
# service with 2 pods started
$ oc get endpoints foo-service-grpc
NAME                               ENDPOINTS                             AGE
foo-service-grpc   10.128.4.110:9000,10.130.4.248:9000   6m6s

# scale up to 3 pods
$ oc get endpoints foo-service-grpc
NAME                               ENDPOINTS                                               AGE
foo-service-grpc   10.128.4.110:9000,10.130.4.248:9000,10.131.2.171:9000   8m46s

After the recent PR #374 we hoped for some improvement, but the issue remains. Tested with the latest Quarkus, 2.14.2.

Any workaround ideas for how to get service instances discovered properly after scaling up?

Thanks in advance.

Similar issue with Quarkus 3.0.4.

When I deploy a new version, new pods are created and the old ones are destroyed. The client application detects the new pods but then uses the old ones again.

  • Start the client; it connects to pods 2 and 3.
  • Deploy the new version of the gRPC service; pods 4 and 5 are created, pods 2 and 3 are destroyed.

Here are the logs on the client side (the logs are old, from Quarkus v2, but it is the same issue with Quarkus v3):

(client-side log screenshot attached)

Hi, sorry for the delay, I haven't been able to take a look at this because of travelling. I will do so later this week and try to provide a response.

Hi here! I was finally able to reproduce this issue. I will work on a fix at the beginning of next week.

Just to let you know that it also fixes the similar bug I had when using Quarkus v3.2.1.

Thanks a lot for the fix!