ValueCache: batch function isn't always triggered on cache miss
albert02021 opened this issue
Describe the bug
I tried to add a Redis cache by leveraging the new `ValueCache` feature. On cache misses, my `get` method would always return an exceptionally completed future. I noticed that everything works as expected when batch loading is disabled. However, if I enable batching, the batch function isn't always triggered to fetch the data from the backend sources.
I see that in `DataLoaderHelper`, the futures are added to a loader queue on cache misses when batching is enabled. Is it possible that dispatch has already been called before the future is added to the queue? We don't manually call dispatch on the data loader. We just define a `DataLoaderRegistry` and pass it to the graphql-java engine in our code. Batch mode works fine if we only set the cache map and max batch size in the data loader options, but not if we also set a new value cache.
To Reproduce
Version: 3.0.1
I could reproduce the issue by adding a 3-second delay before calling `future.completeExceptionally(exception)`, e.g.:
```java
public CompletableFuture<V> get(K key) {
    ...
    CompletableFuture<V> future = new CompletableFuture<>();
    redisGetFuture.onComplete((value, exception) -> {
        delay();
        if (exception == null) {
            if (value == null) {
                // complete exceptionally and stop: without this return,
                // future.complete(value) below would also run
                future.completeExceptionally(new RuntimeException("null value"));
                return;
            }
            future.complete(value);
        } else {
            future.completeExceptionally(exception);
        }
    });
    return future;
}

private void delay() {
    try {
        TimeUnit.SECONDS.sleep(3);
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt flag
    }
}
```
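To make the timing issue concrete without Redis, here is a self-contained toy model (not the real `DataLoaderHelper` internals - the queue and `dispatch` here are invented stand-ins) showing how a dispatch that fires before a slow cache miss resolves leaves that key out of the batch:

```java
import java.util.*;
import java.util.concurrent.*;

public class DispatchRace {
    static final Queue<String> queue = new ConcurrentLinkedQueue<>();

    // A load() whose enqueue happens only after an async cache miss resolves.
    static CompletableFuture<Void> load(String key, long cacheDelayMs) {
        return CompletableFuture.runAsync(
                () -> queue.add(key),
                CompletableFuture.delayedExecutor(cacheDelayMs, TimeUnit.MILLISECONDS));
    }

    // Drain whatever has been enqueued so far into one batch.
    static List<String> dispatch() {
        List<String> batch = new ArrayList<>();
        String k;
        while ((k = queue.poll()) != null) {
            batch.add(k);
        }
        return batch;
    }

    public static void main(String[] args) throws Exception {
        CompletableFuture<Void> fast = load("a", 0);   // near-instant cache miss
        CompletableFuture<Void> slow = load("b", 300); // slow cache miss (Redis + delay)
        Thread.sleep(100);                             // engine calls dispatch "too early"
        System.out.println("batched: " + dispatch());  // prints batched: [a]
        slow.join();                                   // "b" only arrives after the dispatch
        System.out.println("left behind: " + dispatch());
    }
}
```

With a real DataLoader the leftover key is never re-dispatched unless something calls dispatch again, which is exactly the symptom described above.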
Damn - I know what is happening here - it's the dreaded "when is a good time to dispatch" problem.
In graphql-java, the engine tracks the set of fields that are outstanding (per level) and then calls dispatch when they have all completed their fetches (but may not have results):
```graphql
query {
  fieldA {
    fieldA_child
    fieldB_child
  }
  fieldB {
    fieldB_child
    fieldB_child
  }
}
```
Each field above calls `dataLoader.load("fieldX")` and then returns back a `CompletableFuture` value.
When the engine has invoked the fetchers for all of `fieldA` and `fieldB`, it calls dispatch, which nominally (for batched fields that have not completed) causes them to batch dispatch and hence complete their CFs.
This will happen BEFORE the `ValueCache` has completed - but graphql-java is now tracking the child fields and waiting for them to return in their data fetchers so it can call dispatch again for them.
I should have considered this when we put the new `ValueCache` into the code base and made it asynchronous. This is a bad miss on my behalf.
We might have to revert DataLoader 3.x in the upcoming graphql-java 17.0 because of this.
The workarounds right now for 3.x of DataLoader are:
- Call `dispatch` yourself on cache completion
- Use a `org.dataloader.registries.ScheduledDataLoaderRegistry` that will dispatch periodically (you can use the predicates to decide on minimum depth and duration time)
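For the second workaround, a `ScheduledDataLoaderRegistry` can be wired up roughly like this (a configuration sketch written from memory of the 3.x builder and predicate names - check them against your version, and `userLoader` is a stand-in for your own DataLoader):

```java
import java.time.Duration;
import org.dataloader.DataLoader;
import org.dataloader.registries.DispatchPredicate;
import org.dataloader.registries.ScheduledDataLoaderRegistry;

DataLoader<String, Object> userLoader = ...; // your existing loader

ScheduledDataLoaderRegistry registry = ScheduledDataLoaderRegistry.newScheduledRegistry()
        .register("users", userLoader)
        // only dispatch keys that have been waiting at least this long
        .dispatchPredicate(DispatchPredicate.dispatchIfLongerThan(Duration.ofMillis(10)))
        .schedule(Duration.ofMillis(10)) // how often the predicate is checked
        .build();
```

The registry then takes the place of the plain `DataLoaderRegistry` passed to the engine.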
A possible fix (just spitballing right now) is to somehow track in the DataLoader itself that a cache miss happened and hence a dispatch is in order.
Thank you very much for the prompt reply.
I've thought about dispatching manually. However, I am not sure how to track when all the value futures have completed. It seems that I'll have to dispatch after each cache miss. One big reason for adding a cache is to reduce the load on backend systems, which could be more expensive than the external cache. Let's say the batch size is 50 and the hit rate is 50%. If we dispatch after each cache miss, we may send out 25 single requests to the backend instead of a single batch call. This may impact the backend service. Is this also the behavior for the proposed fix?
Regarding the ScheduledDataLoaderRegistry solution, the issue is that many of our fields also have low latency. For example, if we dispatched every 100ms, we would delay requests by 50ms on average, and the latency increase would be large percentage-wise. Are there any negatives for performance (CPU, premature batching) if we have a very short period, e.g. 10ms, for around 10 data loaders?
As a temporary workaround, I've integrated the external cache inside our batch function:
- First fetch cached results for the requests
- Filter out requests with cached results
- Submit the remaining requests to our backend system (our original batch function)
- Insert the cached results into the right positions of the returned list.
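The filter-and-reassemble steps above can be sketched like this (a minimal synchronous model - `cache` and `backendBatchLoad` are hypothetical stand-ins for the external cache and the original batch function):

```java
import java.util.*;

public class CacheAwareBatch {
    // Stand-in for the external cache and its current contents.
    static final Map<String, String> cache = new HashMap<>(Map.of("k1", "v1", "k3", "v3"));

    // Stand-in for the original backend batch function.
    static List<String> backendBatchLoad(List<String> keys) {
        List<String> out = new ArrayList<>();
        for (String k : keys) {
            out.add("backend:" + k);
        }
        return out;
    }

    // The batch function: serve hits from the cache, batch only the misses,
    // then put every value back in the caller's key order.
    static List<String> load(List<String> keys) {
        List<String> misses = new ArrayList<>();
        for (String k : keys) {
            if (!cache.containsKey(k)) {
                misses.add(k);
            }
        }
        List<String> fetched = misses.isEmpty() ? List.of() : backendBatchLoad(misses);
        Map<String, String> byKey = new HashMap<>(cache);
        for (int i = 0; i < misses.size(); i++) {
            byKey.put(misses.get(i), fetched.get(i));
        }
        List<String> result = new ArrayList<>();
        for (String k : keys) {
            result.add(byKey.get(k));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [v1, backend:k2, v3, backend:k4]
        System.out.println(load(List.of("k1", "k2", "k3", "k4")));
    }
}
```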
This temporary workaround has two shortcomings:
- While it doesn't increase the number of batch calls, it can't reduce the number of batch calls effectively. We still need to execute a batch call unless all items in the batch function are cache hits. It can reduce the number of items in each batch call though. Ideally, we would like to reduce both the number of batch calls as well as the total number of items in those calls.
- It adds more complexity to our batch function, although it gives us more options, like batching the external cache requests (ValueCache isn't a batch interface).
Thanks
Updated:
Looking into these lines in DataLoaderHelper, I think we can overcome the first shortcoming by using the default maxBatchSize (-1) in DataLoaderOptions and splitting the calls up by batch size ourselves inside the batch load function. This would allow us to call the external cache and remove cache hits first, before splitting up the remaining calls.
^^^
hmm... it has side effects on caching, error handling, and DataLoader statistics. We may just tune our batch size to a larger value.
Are there any negatives on the performance (CPU, premature batching) if we have a very short period, e.g. 10ms, for around 10 data loaders?
Nominally not - the ScheduledDataLoaderRegistry works by checking every N ms whether something should be dispatched.
So you could set it to 10ms, meaning on average you will incur 5ms of possible delay from something needing to be dispatched to it being dispatched.
However, the closer to ZERO time you get, the less likely it is that N batched things will come together. For example, imagine a call has 5 `dataLoader.load` calls that could be batched (after the async cache look-up) - if they all get to the same state within a 10ms window then they could be dispatched together - but if only 2 do, then only 2 will be dispatched and the others will follow.
So the longer the window, the greater the chance an async `load` call is inside a dispatch event horizon. The shorter the window, the less chance.
The reason the older graphql-java dispatch calls got near-perfect batch loading was that it was in fact a synchronous call - the call to `dataLoader.load` did return a completable future BUT the call itself was synchronous in nature - hence when graphql-java did field tracking, it got the batching perfect.
But now with async `ValueCache` calls, we increase the time window, so batch loading MAY be less efficient if a group of fields have dramatically different load times.
But it will be correct - they will get dispatched (which was the bug I fixed - albeit not as efficiently as I would like).
If we dispatch after each cache miss, we may send out 25 single requests to backend instead of a single batch call. This may impact the backend service. Is this also the behavior for this proposed fix?
Yes - which is why I am not happy with the outcome - it's correct but less efficient than it could be, and hence it needs a rethink at a higher level.
As a temporary workaround, I've integrated the external cache inside our batch function:....
This is pretty much what you needed to do in DataLoader 2.x, because the thing called CacheMap was not a value cache but a future cache. So you would need to do what you did (which is why we wanted to invent the ValueCache).
It adds more complexity to our batch function although we have more options like batching external cache requests (ValueCache isn't a batch interface).
Oooh, now that is interesting - could a future DataLoader 4.x introduce the ValueCache as part of the batch call and not as part of the `load` call - so that it first batch-asks for cached values, only batch-gets the rest, and reassembles the list for you. Hmmmm... anyway, that is not the problem at hand.
Just so people reading along know - DataLoader 2.x did not have a proper ValueCache (it had a future cache, which can't be externalised), and hence the introduction of an async `ValueCache.get` mechanism interferes with optimal batching.
you gave me the answer when you said "It adds more complexity to our batch function although we have more options like batching external cache requests (ValueCache isn't a batch interface)."
This is how we should attack your bug. We should do ValueCache loading before the batch function calls get made.
See - this should fix your bug and also be a better way for DataLoader going forward.
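That direction - batching the ValueCache lookups before the batch function runs - could look roughly like this (a sketch with invented names and a simulated async cache, not the eventual DataLoader API):

```java
import java.util.*;
import java.util.concurrent.*;

public class BatchedValueCacheFlow {
    // Simulated async external cache with a batched getAll (nulls mark misses).
    static CompletableFuture<List<String>> cacheGetAll(List<String> keys) {
        Map<String, String> store = Map.of("a", "cached-a", "c", "cached-c");
        return CompletableFuture.supplyAsync(() -> {
            List<String> out = new ArrayList<>();
            for (String k : keys) {
                out.add(store.get(k)); // null when the key is not cached
            }
            return out;
        });
    }

    // Simulated backend batch load, called only with the cache misses.
    static CompletableFuture<List<String>> backendBatchLoad(List<String> keys) {
        return CompletableFuture.supplyAsync(() -> {
            List<String> out = new ArrayList<>();
            for (String k : keys) {
                out.add("loaded-" + k);
            }
            return out;
        });
    }

    // Proposed dispatch: one batched cache call, one batched backend call for
    // the misses only, and the results reassembled in the original key order.
    static CompletableFuture<List<String>> dispatch(List<String> keys) {
        return cacheGetAll(keys).thenCompose(cached -> {
            List<String> misses = new ArrayList<>();
            for (int i = 0; i < keys.size(); i++) {
                if (cached.get(i) == null) {
                    misses.add(keys.get(i));
                }
            }
            if (misses.isEmpty()) {
                return CompletableFuture.completedFuture(cached);
            }
            return backendBatchLoad(misses).thenApply(loaded -> {
                List<String> out = new ArrayList<>(cached);
                int j = 0;
                for (int i = 0; i < out.size(); i++) {
                    if (out.get(i) == null) {
                        out.set(i, loaded.get(j++));
                    }
                }
                return out;
            });
        });
    }

    public static void main(String[] args) {
        // prints [cached-a, loaded-b, cached-c, loaded-d]
        System.out.println(dispatch(List.of("a", "b", "c", "d")).join());
    }
}
```

Unlike dispatching on every cache miss, this keeps one cache round trip and one backend round trip per batch, which addresses the 25-single-requests concern raised earlier in the thread.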