smallrye / smallrye-stork

SmallRye Stork is a service discovery and client-side load balancing framework.

Home Page: http://smallrye.io/smallrye-stork/

Business calls keep failing on non-running instance using load-balancer=least-response-time

vsevel opened this issue · comments

My discovery returns 2 service instances:

  • 8080 (running)
  • 8090 (not running)

My client application calls this service repeatedly. The first call succeeds; all subsequent calls fail.

To reproduce:

mvn clean install
java -Dquarkus.http.port=8080 -jar target/quarkus-app/quarkus-run.jar
curl http://localhost:8080/hello/stork => SUCCESS
curl http://localhost:8080/hello/stork => KO
curl http://localhost:8080/hello/stork => KO
curl http://localhost:8080/hello/stork => KO
...

see https://github.com/vsevel/stork-issue/tree/issue_233
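
For context, the client side of the repro presumably looks something like the following (a hypothetical sketch, not the actual code in the linked branch): a MicroProfile REST Client pointed at a stork:// URI, so every call goes through Stork's service discovery and load balancing.

    import javax.ws.rs.GET;
    import javax.ws.rs.Path;
    import org.eclipse.microprofile.rest.client.inject.RegisterRestClient;

    // Hypothetical sketch: the stork:// scheme delegates instance selection to the
    // configured service discovery and load balancer for the "my-service" service.
    @RegisterRestClient(baseUri = "stork://my-service")
    @Path("/hello")
    public interface MyServiceClient {

        @GET
        String hello();
    }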

Isn't this an issue of not preserving the service instance ids?
The code below seems to assign a new id on every service discovery refresh:

    private List<ServiceInstance> map(String list) {
        return Arrays.stream(list.split(",")).map(this::createInstance).collect(Collectors.toList());
    }

    private ServiceInstance createInstance(String hostPort) {
        String[] split = hostPort.split(":");
        String host = split[0];
        int port = Integer.parseInt(split[1]);
        // ServiceInstanceIds.next() hands out a fresh id, so every refresh produces new ids
        return new DefaultServiceInstance(ServiceInstanceIds.next(), host, port, false);
    }

Hmm... with a 600s refresh period, the refresh should not be called at all.

What is the contract on service ids?
I tried to cache the service instance id across refreshes, per host:port:

    MyServiceDiscoveryProviderConfiguration config;
    // cache of ids per host:port, so an instance keeps the same id across refreshes
    private final Map<String, Long> ids = new ConcurrentHashMap<>();
...
    private ServiceInstance createInstance(String hostPort) {

        String[] split = hostPort.split(":");
        String host = split[0];
        int port = Integer.parseInt(split[1]);

        // reuse the id assigned the first time this host:port was seen
        if (!ids.containsKey(hostPort)) {
            ids.putIfAbsent(hostPort, ServiceInstanceIds.next());
        }
        long id = ids.get(hostPort);

        return new DefaultServiceInstance(id, host, port, false);
    }

but the test keeps failing.

On the first call (instance 0, which is up), the call succeeds.
On the second call (instance 1, which is down), the call fails, and we store newTotalTime=60000.0 and newWeightSum=1.0.
On the third call we choose the failing instance because:

  • instance 0 (up): score=3.39907167E7
  • instance 1 (down): score=60000.0, which is the errorPenalty

I think maybe the issue is that the errorPenalty is in ms, whereas timeInNs is in nanos.
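
To make the mismatch concrete, here is a minimal sketch using the scores from the run above (the variable names are mine, not Stork's):

    // Lower score wins. The healthy instance's score comes from nanosecond timings,
    // while the error penalty was parsed straight to milliseconds.
    long healthyScoreNs = 33_990_717L;   // ~34 ms measured for the running instance, in nanoseconds
    long errorPenalty = 60_000L;         // the 60s penalty, parsed as milliseconds
    System.out.println(errorPenalty < healthyScoreNs);    // true -> the failed instance keeps being picked
    // With the penalty converted to nanoseconds, the running instance wins again:
    long errorPenaltyNs = java.util.concurrent.TimeUnit.SECONDS.toNanos(60);   // 60_000_000_000
    System.out.println(errorPenaltyNs < healthyScoreNs);  // false -> the running instance is preferred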

I confirm that changing this line in the LeastResponseTimeLoadBalancer fixes the problem:

    long errorPenalty = DurationUtils.parseDuration(config.getErrorPenalty()).toNanos();

Note that caching the service id is probably a bad idea, because I suspect an instance that failed once would be ignored forever. Changing the service ids after each refresh will lose the stats, but allow failed instances to be retried.

The way we handled this in our implementation was that a failed instance would go on a "not_ok" list for 1 minute, with no calls dispatched to it (unless there were no instances available at all), and come back into the "ok" list after spending that minute on the side.
If the instance was still failing, we would put it back into the "not_ok" list, and continue the cycle...
Since we recognized idempotent situations, we did not mind trying out an instance that had failed, as long as we did not try it more than once per minute, until there was one successful call. Then the instance would go back into the "ok" pool.

The difference is that our default refresh period (3 mins) was unrelated to our purgatory period (1 min).
Right now, the only way to get a failed instance retried is to reset the stats, and my understanding is that the way to do this is to change the service ids during the discovery refresh. But it does not have to be that way.
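
A rough sketch of that penalty-box idea (my own illustration, not Stork code): failed instances are quarantined for a fixed period, and are only used again once the period has elapsed or when nothing else is available.

    import java.time.Duration;
    import java.time.Instant;
    import java.util.List;
    import java.util.Map;
    import java.util.Optional;
    import java.util.concurrent.ConcurrentHashMap;

    // Failed instances go into a "not_ok" map for one minute; a success puts them back
    // into the "ok" pool immediately. If everything is quarantined, fall back to any instance.
    class PenaltyBoxSelector<T> {
        private final Duration quarantine = Duration.ofMinutes(1);
        private final Map<T, Instant> notOkUntil = new ConcurrentHashMap<>();

        void reportFailure(T instance) {
            notOkUntil.put(instance, Instant.now().plus(quarantine));
        }

        void reportSuccess(T instance) {
            notOkUntil.remove(instance);
        }

        Optional<T> select(List<T> discovered) {
            Instant now = Instant.now();
            return discovered.stream()
                    .filter(i -> notOkUntil.getOrDefault(i, Instant.MIN).isBefore(now))
                    .findFirst()
                    .or(() -> discovered.stream().findFirst());
        }
    }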

anyway, I am going to do a PR for the nanos issue.

It's up to the load balancer to decide when to reuse an instance that failed. The serviceIds should be preserved to give the LB a chance to make that decision.

great, thank you!

So I understand that preserving those ids is a must.
I put the code back.

However, least-response-time seems to be designed to never retry a previously failed instance (unless all other instances fail).
Is this correct?
Is this by design (considered a characteristic of this load balancer), or non-optimal behavior that should be fixed?
That would not work for me, because we would be losing load balancing after the first failure. Is that correct?

here is how the least-response-time works now: #225

E.g. if your error-penalty is set to 60s and declining-factor is 0.9, then after another 60 calls are made, its significance will be as if it were a ca. 100 ms response time on the most recent call.
So over time, the significance of data collected in the past is lowered, hence even service instances that were failing, or had a long response time, a while back, will be retried in the future.

Both parameters are configurable.
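
As a quick sanity check of those numbers (a back-of-envelope calculation, not the exact formula the load balancer uses):

    // A 60s error penalty, discounted by a declining factor of 0.9 on each of the
    // next 60 calls, ends up weighing roughly as much as a ~100 ms response time.
    double errorPenaltyMs = 60_000;
    double decliningFactor = 0.9;
    System.out.println(errorPenaltyMs * Math.pow(decliningFactor, 60));   // ~108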

Interesting. So if you have a high rate of outgoing calls (e.g. 20 per sec), the failed instance will be retried more often than if you have a low rate. I prefer the predictable nature of the 1-minute penalty box, but this is not a big concern for me.

The ability to do automatic retries when possible (which means the load balancer should be able to provide a prioritized list of eligible instances, and indicate in which situations calls can be retried) is what I would be looking at next.

I am not an expert on MP Fault Tolerance, but one thing I would like to avoid is having all developers annotate their @GET services with @Retry just because they know those services are idempotent as defined by the HTTP spec. And even if we asked them to do it, they would not know how many retries to set, because only the client-side load balancing component knows how many instances are worth retrying.
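
For illustration, the kind of per-method boilerplate being described (a sketch using MicroProfile Fault Tolerance on a REST client interface; the names are hypothetical):

    import javax.ws.rs.GET;
    import javax.ws.rs.Path;
    import org.eclipse.microprofile.faulttolerance.Retry;
    import org.eclipse.microprofile.rest.client.inject.RegisterRestClient;

    // Every idempotent endpoint hand-annotated with @Retry, with a guessed maxRetries
    // that ignores how many instances the load balancer actually knows about.
    @RegisterRestClient(baseUri = "stork://my-service")
    @Path("/hello")
    public interface HelloClient {

        @GET
        @Retry(maxRetries = 3)   // arbitrary: only the client-side load balancer knows how many instances are worth retrying
        String hello();
    }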

I will expand those thoughts in #232