graphql-java / java-dataloader

A Java 8 port of Facebook DataLoader

DataLoader dispatches together keys from different requests

edacostacambioupgrade opened this issue · comments

When using what I think is a standard setup (graphql-spring-boot with DataLoaderDispatcherInstrumentation and DataLoaderRegistry singleton beans), if two (HTTP) requests from different callers request the same data type by the same key (i.e. use the same DataLoader), all keys are enqueued and dispatched together: BatchLoader.load(List&lt;K&gt; keys) is called with keys merged from both requests.
I have not used the Facebook Node implementation, but from what I understand their DataLoaders are created per request, so this merging doesn't happen.
While this behavior may be desirable in some cases, it comes with some drawbacks:

  • issues with keys on one request affect the other request, which is not very deterministic (unless your backing service is smart enough to return per-key errors)
  • if one request loads 1 key and another loads 1,000 keys, both will have the latency of loading 1,001 keys, and again, this is not very deterministic.
  • if you are propagating authentication and your backing service only takes a global authentication principal (i.e. an Authorization header), you cannot send the requests together anyway; you need to split by requestor (or execution id). (You could live with this if your backing service took a per-key principal, but that would be pretty ugly, I think.)
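To make the merging concrete, here is a stripped-down stand-in for a shared loader (a sketch only; SharedLoaderDemo and its inner class are illustrative names, not the library's API, but the enqueue-then-dispatch shape is the same):

```java
import java.util.*;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

// Stand-in for a singleton DataLoader: keys queued by *all* callers are
// handed to the batch function together in a single dispatch.
public class SharedLoaderDemo {

    static class SharedLoader<K, V> {
        private final Function<List<K>, List<V>> batchFn;
        private final List<K> queue = new ArrayList<>();
        private final List<CompletableFuture<V>> futures = new ArrayList<>();

        SharedLoader(Function<List<K>, List<V>> batchFn) { this.batchFn = batchFn; }

        synchronized CompletableFuture<V> load(K key) {
            queue.add(key);
            CompletableFuture<V> f = new CompletableFuture<>();
            futures.add(f);
            return f;
        }

        synchronized void dispatch() {
            List<V> values = batchFn.apply(new ArrayList<>(queue));
            for (int i = 0; i < futures.size(); i++) futures.get(i).complete(values.get(i));
            queue.clear();
            futures.clear();
        }
    }

    // Returns the batches the batch function saw: both requests' keys, merged.
    static List<List<Integer>> recordedBatches() {
        List<List<Integer>> batches = new ArrayList<>();
        SharedLoader<Integer, String> loader = new SharedLoader<>(keys -> {
            batches.add(keys);
            List<String> out = new ArrayList<>();
            for (Integer k : keys) out.add("user-" + k);
            return out;
        });
        loader.load(1);   // enqueued by "request A"
        loader.load(2);   // enqueued by "request B" -- same loader instance
        loader.dispatch();
        return batches;
    }

    public static void main(String[] args) {
        System.out.println(recordedBatches());  // [[1, 2]]
    }
}
```

A single dispatch sees `[1, 2]`, so a failure or slow load for one request's key affects the other request too.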

I wonder:

  • is this behavior intentional?
  • is this a problem with the way I have it set up?
  • would you be open to a PR that lets devs choose whether or not to merge keys?

If this is an issue with my setup you can skip the rest; otherwise:

these are the options I'm considering at the moment:

  • wrapping the BatchLoader.load(...) method with one that splits by execution id; this solves some interference issues, but it still makes all concurrent requests wait until everyone else's data is available.
  • subclassing DataLoader to implement something like sliceIntoBatchesOfBatches, but doing it by execution id. This could work, but it has a few issues:
    • most of the things I would need to change in the DataLoader class are private, so it would involve either copying code or gaining access by reflection :S
    • this is fine for the DataLoader.dispatch() method because it doesn't wait for the overall result, but dispatchAndJoin() would still wait for every request to finish. I don't mind, because I don't use it and the instrumentation only ends up calling dispatch()
    • while this approach won't make callers wait, it would still sometimes dispatch some keys of other requests "early", maybe even before they are completely enqueued, occasionally resulting in more backend requests in a non-deterministic way
  • another option I considered is making the DataLoader a per-request object so DataLoaders are entirely isolated. This isn't easy though: I would need to provide a means for a DataFetcher to access the right DataLoader for a given request. With some effort I could keep a map by execution id, but its life-cycle is not easy to manage (I fear I would end up with leaked instances).
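For what it's worth, the first option (splitting a merged batch by execution id) could be sketched roughly like this, assuming each key carries its execution id as part of a composite key. All names here are illustrative, not java-dataloader API:

```java
import java.util.*;
import java.util.function.BiFunction;

// Sketch of the "split by execution id" wrapper: each key travels with its
// execution id (here a Map.Entry used as a composite key), and the wrapper
// regroups a merged batch into one backend call per execution.
public class PerExecutionSplit {

    static <K, V> Map<Map.Entry<String, K>, V> loadPerExecution(
            List<Map.Entry<String, K>> mergedKeys,
            BiFunction<String, List<K>, List<V>> backend) {
        // Regroup the merged batch into per-execution sub-batches.
        Map<String, List<Map.Entry<String, K>>> byExec = new LinkedHashMap<>();
        for (Map.Entry<String, K> ek : mergedKeys) {
            byExec.computeIfAbsent(ek.getKey(), id -> new ArrayList<>()).add(ek);
        }
        Map<Map.Entry<String, K>, V> results = new LinkedHashMap<>();
        for (Map.Entry<String, List<Map.Entry<String, K>>> e : byExec.entrySet()) {
            List<K> plainKeys = new ArrayList<>();
            for (Map.Entry<String, K> ek : e.getValue()) plainKeys.add(ek.getValue());
            // One backend call per execution id. Note the calls are still made
            // sequentially here, so callers can wait on each other -- the
            // interference caveat mentioned above.
            List<V> values = backend.apply(e.getKey(), plainKeys);
            for (int i = 0; i < plainKeys.size(); i++) {
                results.put(e.getValue().get(i), values.get(i));
            }
        }
        return results;
    }

    static String demo() {
        List<Map.Entry<String, Integer>> merged = Arrays.asList(
                new AbstractMap.SimpleEntry<>("exec-A", 1),
                new AbstractMap.SimpleEntry<>("exec-B", 2),
                new AbstractMap.SimpleEntry<>("exec-A", 3));
        Map<Map.Entry<String, Integer>, String> out = loadPerExecution(merged,
                (execId, keys) -> {
                    List<String> vs = new ArrayList<>();
                    for (Integer k : keys) vs.add(execId + ":" + k);
                    return vs;
                });
        return out.get(new AbstractMap.SimpleEntry<>("exec-A", 3));
    }

    public static void main(String[] args) {
        System.out.println(demo());  // exec-A:3
    }
}
```

The backend now only ever sees keys belonging to one execution, but as noted, this alone doesn't fix the latency coupling.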

This is what I would like:
option 1

  • DataLoader.dispatch(), DataLoader.dispatchAndJoin() and DataLoaderRegistry.dispatchAll() should take an executionId as a parameter. Depending on a data loader option, either all requests are dispatched or only the requests for that execution id. The DataLoader.load(K key) method would also need to take an execution id (or a DataFetchingEnvironment).
  • DataLoaderDispatcherInstrumentation.dispatch() passes the execution id to DataLoaderRegistry.dispatchAll()
  • DataLoaderDispatcherInstrumentation.beginExecution(instrumentationParameters).onEnd(...) calls a new method DataLoaderRegistry.discardAll(ExecutionId) (which calls a new DataLoader.discard(ExecutionId) method) to make sure appropriate cleanup happens in case of errors/abortion.
  • would that be enough cleanup, or is there any case in which keys may have been queued but beginExecution.onEnd is not called?

option 2
similarly, but without changing DataLoader: make DataLoaderRegistry aware of executions and keep a map of execution id -> DataLoaders (it would need to be built with DataLoader suppliers instead of DataLoaders directly). With this approach only the DataLoaderRegistry.dispatchAll() method needs to be modified to take the execution id. In this case the DataLoaderRegistry would need to expose a means of retrieving the DataLoader for a specific execution, for DataFetchers to use.
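A rough sketch of what option 2 could look like, using plain maps and suppliers (ExecutionScopedRegistry, register, getLoader and discardAll are hypothetical names, not the library's API):

```java
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Sketch: a registry built from loader *suppliers* that materialises an
// isolated loader instance per execution id, plus an explicit discard hook
// so instances don't leak (the life-cycle concern raised above).
public class ExecutionScopedRegistry<L> {
    private final Map<String, Supplier<L>> suppliers = new LinkedHashMap<>();
    private final Map<String, Map<String, L>> perExecution = new ConcurrentHashMap<>();

    public void register(String name, Supplier<L> supplier) {
        suppliers.put(name, supplier);
    }

    // DataFetchers would ask for the loader bound to their execution id.
    public L getLoader(String executionId, String name) {
        return perExecution
                .computeIfAbsent(executionId, id -> new ConcurrentHashMap<String, L>())
                .computeIfAbsent(name, n -> suppliers.get(n).get());
    }

    // Called when the execution ends (e.g. from the instrumentation's onEnd).
    public void discardAll(String executionId) {
        perExecution.remove(executionId);
    }

    static boolean demo() {
        ExecutionScopedRegistry<Object> reg = new ExecutionScopedRegistry<>();
        reg.register("users", Object::new);
        Object a1 = reg.getLoader("exec-A", "users");
        Object a2 = reg.getLoader("exec-A", "users"); // same instance within an execution
        Object b  = reg.getLoader("exec-B", "users"); // different instance per execution
        reg.discardAll("exec-A");
        Object a3 = reg.getLoader("exec-A", "users"); // fresh instance after discard
        return a1 == a2 && a1 != b && a1 != a3;
    }

    public static void main(String[] args) {
        System.out.println(demo());  // true
    }
}
```

dispatchAll(executionId) would then only touch the loaders in the inner map for that execution.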

option 3
same thing, but managed by the instrumentation: change DataLoaderDispatcherInstrumentation to take a DataLoaderRegistry supplier instead of a DataLoaderRegistry. This supplier or the instrumentation would need to expose a method returning the DataLoaderRegistry associated with an execution id, so that DataFetchers can get the right one.

Wait... why are you making DataLoaderRegistry a singleton if you want it per request? A singleton DataLoaderRegistry is only applicable to a very specific use-case and is not common at all.

What is normally done is having a DataLoaderRegistry created per request and stored into the global context for the execution, e.g.

DataLoaderRegistry dataLoaderRegistry = ...; // create per request

//Transform the pre-configured GraphQL instance or create a new one
GraphQL runtime = graphQL.transform(builder -> builder.instrumentation(
  new DataLoaderDispatcherInstrumentation(dataLoaderRegistry)));

//Make dataLoaderRegistry accessible to fetcher functions
ExecutionInput.newExecutionInput()
  .query(...)
  .context(dataLoaderRegistry)
  .build();

This is very simple and requires no low-level concurrency control nor keeping track of executions. So I think there's nothing wrong with the current implementation.

oh I see... thanks! I didn't realize that creating a GraphQL object was so lightweight. So it is indeed a problem with my setup.

By looking at graphql-spring-boot's GraphQLWebAutoConfiguration.graphQLServlet(...), it looks like I have to declare my instrumentation and data loader registry beans with @RequestScope and then add a GraphQLContextBuilder that creates a context with the request-scoped registry. Is that right?

closing this, it's a non-issue, thanks a lot!

although I found it a bit unintuitive, so I'm leaving this here for other noobs like me. I had to do this:

    @Bean
    @RequestScope
    public DataLoaderRegistry dataLoaderRegistry() {
        ...
    }

    @Bean
    @RequestScope
    public Instrumentation instrumentation(DataLoaderRegistry dataLoaderRegistry) {
        return new DataLoaderDispatcherInstrumentation(dataLoaderRegistry);
    }

but while ExecutionInput is OK with any Object as the context, GraphQLServlet.createContext(..) wants a GraphQLContext instance (see SimpleGraphQLServlet's GraphQLContextBuilder field too), so I had to create a GraphQLContextBuilder implementation that returns a subclass of GraphQLContext instead of setting the registry directly. Not a big deal, but I wonder if everyone is doing this, or if people are using singletons without realizing the consequences? (Or perhaps there is a more straightforward way that I'm not seeing.)

To set the registry in the context I had to do this:

    @Bean
    public GraphQLContextBuilder graphQLContextBuilder(DataLoaderRegistry dataLoaderRegistry) {
        // note that dataLoaderRegistry is a request scoped proxy.
        return new GraphQLRequestContextBuilder(dataLoaderRegistry);
    }

and here are my context builder and context subclass:

public class GraphQLRequestContextBuilder implements GraphQLContextBuilder {
    private final DataLoaderRegistry dataLoaderRegistry;

    public GraphQLRequestContextBuilder(DataLoaderRegistry dataLoaderRegistry) {
        this.dataLoaderRegistry = dataLoaderRegistry;
    }

    @Override
    public GraphQLContext build(Optional<HttpServletRequest> request, Optional<HttpServletResponse> response) {
        return new GraphQLRequestContext(request, response, dataLoaderRegistry);
    }
}

public class GraphQLRequestContext extends GraphQLContext {
    private final DataLoaderRegistry dataLoaderRegistry;
    
    public GraphQLRequestContext(Optional<HttpServletRequest> request, Optional<HttpServletResponse> response, DataLoaderRegistry dataLoaderRegistry) {
        super(request, response);
        this.dataLoaderRegistry = dataLoaderRegistry;
    }
    public DataLoaderRegistry getDataLoaderRegistry() {
        return dataLoaderRegistry;
    }
}

then my DataFetchers do:

DataFetchingEnvironment environment = ...;
GraphQLRequestContext context = environment.getContext();
DataLoaderRegistry registry = context.getDataLoaderRegistry();
DataLoader<K, V> dataLoader = registry.getDataLoader(name);
return dataLoader.load(key);

@edacostacambioupgrade I found the servlet overly convoluted, so I normally advise a simple Spring controller, as it's a lot more obvious.

I'm just curious, since your DataLoaderRegistry is already request scoped, you could directly inject it instead of keeping it in the context, right?

@kaqqao I didn't try, but I'm not sure I can inject them into my DataFetchers, because (I think) the request-scoped proxies are somehow bound by Spring to the current thread; I suspect Spring will not find the right loader if the fetching happens on a different thread.
(I guess something like this answer would be needed, or I would need to decorate the tasks of the executors to pass that around anywhere an async task/thread is fired, but I didn't like either solution.)
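The task-decoration idea amounts to capturing the thread-bound state on the submitting thread and restoring it inside the worker. A minimal sketch, with a plain ThreadLocal standing in for Spring's request-context holder (all names here are illustrative):

```java
import java.util.concurrent.*;

public class ContextPropagation {
    // Stand-in for Spring's thread-bound request context.
    static final ThreadLocal<String> REQUEST_CONTEXT = new ThreadLocal<>();

    // Decorate a task so it runs with the *submitter's* context.
    static Runnable wrap(Runnable task) {
        final String captured = REQUEST_CONTEXT.get();   // read on the submitting thread
        return () -> {
            REQUEST_CONTEXT.set(captured);               // restore on the worker thread
            try {
                task.run();
            } finally {
                REQUEST_CONTEXT.remove();                // don't leak into pooled threads
            }
        };
    }

    static String demo() throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        REQUEST_CONTEXT.set("request-42");
        String[] seen = new String[2];
        pool.submit(() -> { seen[0] = REQUEST_CONTEXT.get(); }).get();       // plain task: context lost
        pool.submit(wrap(() -> { seen[1] = REQUEST_CONTEXT.get(); })).get(); // wrapped: context carried over
        pool.shutdown();
        REQUEST_CONTEXT.remove();
        return seen[0] + " / " + seen[1];
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo());  // null / request-42
    }
}
```

The un-wrapped task sees no context on the worker thread, which matches the suspicion above about request-scoped proxies and async fetching.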

yeah, maybe having my own controller would be better, but by briefly looking at the servlet I see it has a lot of stuff (multipart handling, callbacks, etc.) and I don't know enough to tell whether I will need that, and I don't want to reimplement it all in my controller if I eventually do.
I think I will end up subclassing SimpleGraphQLServlet, where I can create a new instrumentation and create the context without having to use request-scoped beans.