ValueCache: batch function isn't always triggered on cache miss
albert02021 opened this issue
Describe the bug
I tried to add a Redis cache by leveraging the new `ValueCache` feature. On cache misses, my `get` method would always return an exceptionally completed future. I noticed that everything works as expected when batch loading is disabled. However, if I enable batching, the batch function isn't always triggered to fetch the data from the backend sources.
I see that in `DataLoaderHelper`, the futures are added to a loader queue on cache misses when batching is enabled. Is it possible that dispatch has already been called before the future is added to the queue? We don't manually call dispatch on the data loader. We just define a `DataLoaderRegistry` and pass it to the graphql-java engine in our code. Batch mode works fine if we only set the cache map and max batch size in the data loader options, but not if we also set a new value cache.
To Reproduce
Version: 3.0.1
I could reproduce the issue by adding a 3-second delay before calling `future.completeExceptionally(exception)`, e.g.:
```java
public CompletableFuture<V> get(K key) {
    ...
    CompletableFuture<V> future = new CompletableFuture<>();
    redisGetFuture.onComplete((value, exception) -> {
        delay();
        if (exception == null) {
            if (value == null) {
                // complete exceptionally and stop: without this return,
                // future.complete(value) below would also run
                future.completeExceptionally(new RuntimeException("null value"));
                return;
            }
            future.complete(value);
        } else {
            future.completeExceptionally(exception);
        }
    });
    return future;
}

private void delay() {
    try {
        TimeUnit.SECONDS.sleep(3);
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt flag
    }
}
```
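To make the timing issue concrete without Redis, here is a self-contained toy model (not the real `DataLoaderHelper` internals - the queue and `dispatch` here are invented stand-ins) showing how a dispatch that fires before a slow cache miss resolves leaves that key out of the batch:

```java
import java.util.*;
import java.util.concurrent.*;

public class DispatchRace {
    static final Queue<String> queue = new ConcurrentLinkedQueue<>();

    // A load() whose enqueue happens only after an async cache miss resolves.
    static CompletableFuture<Void> load(String key, long cacheDelayMs) {
        return CompletableFuture.runAsync(
                () -> queue.add(key),
                CompletableFuture.delayedExecutor(cacheDelayMs, TimeUnit.MILLISECONDS));
    }

    // Drain whatever has been enqueued so far into one batch.
    static List<String> dispatch() {
        List<String> batch = new ArrayList<>();
        String k;
        while ((k = queue.poll()) != null) {
            batch.add(k);
        }
        return batch;
    }

    public static void main(String[] args) throws Exception {
        CompletableFuture<Void> fast = load("a", 0);   // near-instant cache miss
        CompletableFuture<Void> slow = load("b", 300); // slow cache miss (Redis + delay)
        Thread.sleep(100);                             // engine calls dispatch "too early"
        System.out.println("batched: " + dispatch());  // prints batched: [a]
        slow.join();                                   // "b" only arrives after the dispatch
        System.out.println("left behind: " + dispatch());
    }
}
```

With a real DataLoader the leftover key is never re-dispatched unless something calls dispatch again, which is exactly the symptom described above.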
Damn - I know what is happening here - it's the dreaded "when is a good time to dispatch" problem.
In graphql-java, the engine tracks the set of fields that are outstanding (per level) and then calls dispatch when they have all completed their fetches (but may not have results):
```graphql
query {
  fieldA {
    fieldA_child
    fieldB_child
  }
  fieldB {
    fieldB_child
    fieldB_child
  }
}
```
Each field above calls `dataLoader.load("fieldX")` and then returns back a `CompletableFuture` value.
When the engine has invoked the fetchers for all of `fieldA` and `fieldB`, it calls dispatch, which nominally (for batched fields that have not completed) causes them to batch dispatch and hence complete their CFs.
This will happen BEFORE the `ValueCache` has completed - but graphql-java is now tracking the child fields and waiting for them to return in their data fetchers so it can call dispatch again for them.
I should have considered this when we put the new `ValueCache` into the code base and made it asynchronous. This is a bad miss on my behalf.
We might have to revert DataLoader 3.x in the upcoming graphql-java 17.0 because of this.
The workarounds right now for 3.x of DataLoader are:
- Call `dispatch` yourself on cache completion
- Use a `org.dataloader.registries.ScheduledDataLoaderRegistry` that will dispatch periodically (you can use the predicates to decide on minimum depth and duration time)
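For the second workaround, a `ScheduledDataLoaderRegistry` can be wired up roughly like this (a configuration sketch written from memory of the 3.x builder and predicate names - check them against your version, and `userLoader` is a stand-in for your own DataLoader):

```java
import java.time.Duration;
import org.dataloader.DataLoader;
import org.dataloader.registries.DispatchPredicate;
import org.dataloader.registries.ScheduledDataLoaderRegistry;

DataLoader<String, Object> userLoader = ...; // your existing loader

ScheduledDataLoaderRegistry registry = ScheduledDataLoaderRegistry.newScheduledRegistry()
        .register("users", userLoader)
        // only dispatch keys that have been waiting at least this long
        .dispatchPredicate(DispatchPredicate.dispatchIfLongerThan(Duration.ofMillis(10)))
        .schedule(Duration.ofMillis(10)) // how often the predicate is checked
        .build();
```

The registry then takes the place of the plain `DataLoaderRegistry` passed to the engine.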
A possible fix (just spitballing right now) is to somehow track in the DataLoader itself that a cache miss happened and hence a dispatch is in order.
Thank you very much for the prompt reply.
I've thought about dispatching manually. However, I am not sure how to track when all the value futures have completed. It seems that I'll have to dispatch after each cache miss. One big reason for adding a cache is to reduce the load on backend systems, which could be more expensive than the external cache. Let's say the batch size is 50 and the hit rate is 50%. If we dispatch after each cache miss, we may send out 25 single requests to the backend instead of a single batch call. This may impact the backend service. Is this also the behavior for the proposed fix?
Regarding the ScheduledDataLoaderRegistry solution, the issue is that many of our fields also have low latency. For example, if we dispatched every 100ms, we would delay requests by 50ms on average, and the latency increase would be large percentage-wise. Are there any negatives for performance (CPU, premature batching) if we have a very short period, e.g. 10ms, for around 10 data loaders?
As a temporary workaround, I've integrated the external cache inside our batch function:
- First fetch cached results for the requests
- Filter out requests with cached results
- Submit the remaining requests to our backend system (our original batch function)
- Insert the cached results into the right positions of the returned list.
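The filter-and-reassemble steps above can be sketched like this (a minimal synchronous model - `cache` and `backendBatchLoad` are hypothetical stand-ins for the external cache and the original batch function):

```java
import java.util.*;

public class CacheAwareBatch {
    // Stand-in for the external cache and its current contents.
    static final Map<String, String> cache = new HashMap<>(Map.of("k1", "v1", "k3", "v3"));

    // Stand-in for the original backend batch function.
    static List<String> backendBatchLoad(List<String> keys) {
        List<String> out = new ArrayList<>();
        for (String k : keys) {
            out.add("backend:" + k);
        }
        return out;
    }

    // The batch function: serve hits from the cache, batch only the misses,
    // then put every value back in the caller's key order.
    static List<String> load(List<String> keys) {
        List<String> misses = new ArrayList<>();
        for (String k : keys) {
            if (!cache.containsKey(k)) {
                misses.add(k);
            }
        }
        List<String> fetched = misses.isEmpty() ? List.of() : backendBatchLoad(misses);
        Map<String, String> byKey = new HashMap<>(cache);
        for (int i = 0; i < misses.size(); i++) {
            byKey.put(misses.get(i), fetched.get(i));
        }
        List<String> result = new ArrayList<>();
        for (String k : keys) {
            result.add(byKey.get(k));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [v1, backend:k2, v3, backend:k4]
        System.out.println(load(List.of("k1", "k2", "k3", "k4")));
    }
}
```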
This temporary workaround has two shortcomings:
- While it doesn't increase the number of batch calls, it can't reduce the number of batch calls effectively. We still need to execute a batch call unless all items in the batch function are cache hits. It can reduce the number of items in each batch call though. Ideally, we would like to reduce both the number of batch calls as well as the total number of items in those calls.
- It adds more complexity to our batch function, although it gives us more options, like batching the external cache requests (ValueCache isn't a batch interface).
Thanks
Updated:
Looking into these lines in DataLoaderHelper, I think we can overcome the first shortcoming by using the default maxBatchSize (-1) in DataLoaderOptions and splitting the calls up by batch size ourselves inside the batch load function. This would allow us to call the external cache and remove cache hits first, before splitting up the remaining calls.
^^^
hmm... it has side effects on caching, error handling, and DataLoader statistics. We may just tune our batch size to a larger value.
Are there any negatives on the performance (CPU, premature batching) if we have a very short period, e.g. 10ms, for around 10 data loaders?
Nominally not - the ScheduledDataLoaderRegistry works by checking every N ms whether something should be dispatched.
So you could set it to 10ms, meaning on average you will incur 5ms of possible delay from something needing to be dispatched to it being dispatched.
However, the closer to ZERO time you get, the less likely it is that N batched things will come together. For example, imagine a call has 5 `dataLoader.load` calls that could be batched (after the async cache look-up) - if they all get to the same state within a 10ms window then they could be dispatched together - but if only 2 do, then only 2 will be dispatched and the others will follow.
So the longer the window, the greater the chance an async `load` call is inside a dispatch event horizon. The shorter the window, the less chance.
The reason the older graphql-java dispatch calls got near-perfect batch loading was that it was in fact a synchronous call - the call to `dataLoader.load` did return a completable future BUT the call itself was synchronous in nature - hence when graphql-java did field tracking, it got the batching perfect.
But now with async `ValueCache` calls, we increase the time window, so batch loading MAY be less efficient if a group of fields have dramatically different load times.
But it will be correct - they will get dispatched (which was the bug I fixed - albeit not as efficiently as I would like).
If we dispatch after each cache miss, we may send out 25 single requests to backend instead of a single batch call. This may impact the backend service. Is this also the behavior for this proposed fix?
Yes - which is why I am not happy with the outcome - it's correct but less efficient than it could be, and hence it needs a rethink at a higher level.
As a temporary workaround, I've integrated the external cache inside our batch function:....
This is pretty much what you needed to do in DataLoader 2.x, because the thing called CacheMap was not a value cache but a future cache. So you would need to do what you did (which is why we wanted to invent the ValueCache).
It adds more complexity to our batch function although we have more options like batching external cache requests (ValueCache isn't a batch interface).
Oooh, now that is interesting - could a future DataLoader 4.x introduce the ValueCache as part of the batch call and not as part of the `load` call - so that it first batch-asks for cached values, only batch-gets the rest, and reassembles the list for you. Hmmmm... anyway, that is not the problem at hand.
Just so people reading along know - DataLoader 2.x did not have a proper ValueCache (it had a future cache, which can't be externalised), and hence the introduction of an async `ValueCache.get` mechanism interferes with optimal batching.
you gave me the answer when you said "It adds more complexity to our batch function although we have more options like batching external cache requests (ValueCache isn't a batch interface)."
This is how we should attack your bug. We should do ValueCache loading before the batch function calls get made.
See - this should fix your bug and also be a better way for DataLoader going forward.
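That direction - batching the ValueCache lookups before the batch function runs - could look roughly like this (a sketch with invented names and a simulated async cache, not the eventual DataLoader API):

```java
import java.util.*;
import java.util.concurrent.*;

public class BatchedValueCacheFlow {
    // Simulated async external cache with a batched getAll (nulls mark misses).
    static CompletableFuture<List<String>> cacheGetAll(List<String> keys) {
        Map<String, String> store = Map.of("a", "cached-a", "c", "cached-c");
        return CompletableFuture.supplyAsync(() -> {
            List<String> out = new ArrayList<>();
            for (String k : keys) {
                out.add(store.get(k)); // null when the key is not cached
            }
            return out;
        });
    }

    // Simulated backend batch load, called only with the cache misses.
    static CompletableFuture<List<String>> backendBatchLoad(List<String> keys) {
        return CompletableFuture.supplyAsync(() -> {
            List<String> out = new ArrayList<>();
            for (String k : keys) {
                out.add("loaded-" + k);
            }
            return out;
        });
    }

    // Proposed dispatch: one batched cache call, one batched backend call for
    // the misses only, and the results reassembled in the original key order.
    static CompletableFuture<List<String>> dispatch(List<String> keys) {
        return cacheGetAll(keys).thenCompose(cached -> {
            List<String> misses = new ArrayList<>();
            for (int i = 0; i < keys.size(); i++) {
                if (cached.get(i) == null) {
                    misses.add(keys.get(i));
                }
            }
            if (misses.isEmpty()) {
                return CompletableFuture.completedFuture(cached);
            }
            return backendBatchLoad(misses).thenApply(loaded -> {
                List<String> out = new ArrayList<>(cached);
                int j = 0;
                for (int i = 0; i < out.size(); i++) {
                    if (out.get(i) == null) {
                        out.set(i, loaded.get(j++));
                    }
                }
                return out;
            });
        });
    }

    public static void main(String[] args) {
        // prints [cached-a, loaded-b, cached-c, loaded-d]
        System.out.println(dispatch(List.of("a", "b", "c", "d")).join());
    }
}
```

Unlike dispatching on every cache miss, this keeps one cache round trip and one backend round trip per batch, which addresses the 25-single-requests concern raised earlier in the thread.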