OTel Context is getting lost in GraphQL manual instrumentation

Question

govi20 opened this issue 3 months ago

I have an async GraphQL resolver that I am using along with a GraphQL data loader. The GraphQL service performs manual instrumentation.
The problem I am facing is that the OpenTelemetry context is getting lost in the load method. This happens even before the task is offloaded to the context-wrapped executor.

Steps to reproduce

1. Clone: https://github.com/govi20/dgs-otel
2. Build and run the project.
3. Access localhost:9090/graphiql
4. Execute the following GraphQL query and check the logs, where I've printed the Context in DepartmentDataLoader:

query employees {
  employees {
    id
    name
    department {
      id
      name
    }
  }
}

What did you expect to see?
The OTel Span and Context should be available in DepartmentDataLoader's load() method, as there is no thread switch in between. The span's end() method has not been called, so I assume the span is not closed either.

What did you see instead?
OTel Context is not getting propagated.

What version and what artifacts are you using?
Version: 1.38.0. I use a custom implementation of SpanExporter.

Environment
This is not an environment-specific issue; it's reproducible on macOS as well as CentOS.

Additional context
I've reported this issue to the GraphQL DGS Framework that I use: Netflix/dgs-framework#1928
The Netflix DGS folks recommended that I check with the tracing framework team.

Govinda Sakhare · Answer 1 · Tue Jul 16 2024 13:44:36 GMT+0800 (China Standard Time)

Let me know if a sample project with minimal reproducible code is required.

jack-berg · Answer 2 · Wed Jul 17 2024 03:47:36 GMT+0800 (China Standard Time)

> Let me know if a sample project with minimal reproducible code is required.

That would be very useful 🙂

Govinda Sakhare · Answer 3 · Sun Jul 21 2024 18:44:19 GMT+0800 (China Standard Time)

@jack-berg here is a sample project with a reproducible example https://github.com/govi20/dgs-otel

This is where the OTel Context gets lost => DepartmentDataLoader

The data loaders get called from here => EmployeeDataFetcher, and there is no thread switch in between. The thread pool I have configured uses an executor that wraps each task using the Context.taskWrapping API.

I've added steps to reproduce in the bug report.

John Watson · Answer 4 · Mon Jul 22 2024 01:29:36 GMT+0800 (China Standard Time)

It looks like dgs is based on graphql-java, which has library instrumentation (https://github.com/open-telemetry/opentelemetry-java-instrumentation/tree/main/instrumentation/graphql-java/graphql-java-20.0/library). I wonder whether a bunch of these issues would go away if you just plugged in that instrumentation library. Worth a shot, at least, to see how it does!

Govinda Sakhare · Answer 5 · Mon Jul 22 2024 01:41:50 GMT+0800 (China Standard Time)

@jkwatson Does this library perform auto-instrumentation? I need to rely on manual instrumentation logic.

John Watson · Answer 6 · Mon Jul 22 2024 01:53:40 GMT+0800 (China Standard Time)

The "library" instrumentation doesn't require the javaagent. You can just plug it in programmatically. You might need to figure out how to hook it into dgs, but it doesn't need the agent.

John Watson · Answer 7 · Mon Jul 22 2024 03:27:59 GMT+0800 (China Standard Time)

Given this documentation, I would guess it'll work just fine: https://netflix.github.io/dgs/advanced/instrumentation/

Govinda Sakhare · Answer 8 · Mon Jul 22 2024 13:12:09 GMT+0800 (China Standard Time)

@jkwatson That's exactly what I am doing in my code: manually instrumenting using a SimpleInstrumentation implementation.

John Watson · Answer 9 · Mon Jul 22 2024 22:25:14 GMT+0800 (China Standard Time)

> @jkwatson That's exactly what I am doing in my code: manually instrumenting using a SimpleInstrumentation implementation.

I recommend trying the library instrumentation then, and raising issues in the instrumentation repo if it has holes.

Govinda Sakhare · Answer 10 · Tue Jul 23 2024 01:34:23 GMT+0800 (China Standard Time)

@jkwatson Unfortunately, I can't use the library because it lacks the ability to instrument GraphQL data resolvers.

Still, I've tried it out, and the issue is indeed reproducible with this library. I plan to report it to the instrumentation repo. However, I'm unsure whether the problem lies within the core libraries or in the instrumentation.

John Watson · Answer 11 · Tue Jul 23 2024 01:45:23 GMT+0800 (China Standard Time)

The issue won't be with the core libraries. It's definitely an issue with making the instrumentation correctly propagate the context where it needs to go.

Harshit Rajput · Answer 12 · Wed Sep 11 2024 00:04:45 GMT+0800 (China Standard Time)

Continuing @kilink's comment in Netflix/dgs-framework#1928 (comment),
I don't think it's a bug in the OTel Java instrumentation either.
We know that to pass context through a CompletableFuture, we need to wrap the Executor with Context.taskWrapping(), which is already done for the dgsAsyncTaskExecutor.
However, the catch is this:
since batch loading is in action here, there is another CompletableFuture, which uses the default executor, for the batch loader wrapping the DepartmentLoader. This is why the context is reset. If you debug at line 39 (EmployeeDataFetcher.department()) and keep stepping in, you can see DataLoaderHelper.queueOrInvokeLoader().

To me, it seems parameters.get(0).getContext().makeCurrent() is the only way to propagate the context without changing the current code.
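
A minimal sketch of that workaround, using only the JDK so it runs without dependencies. A plain ThreadLocal stands in for the thread-bound OTel Context, and DepartmentKey, Scope, and makeCurrent are hypothetical stand-ins, not the OpenTelemetry or DataLoader APIs. The assumption is that each key captures the context at the point where load() is called, and the batch loader restores it from the first key:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

public class KeyCarriedContextSketch {
    // Stand-in for the thread-bound OTel Context (assumption, not the OTel API).
    static final ThreadLocal<String> CURRENT = new ThreadLocal<>();

    // Hypothetical key: the department id plus the context captured at load() time.
    static final class DepartmentKey {
        final int id;
        final String context;

        DepartmentKey(int id) {
            this.id = id;
            this.context = CURRENT.get(); // captured on the calling thread
        }
    }

    // Closeable scope that restores the previous context, similar in spirit to OTel's Scope.
    interface Scope extends AutoCloseable {
        @Override
        void close();
    }

    // Stand-in for Context.makeCurrent(): install the context, undo it on close.
    static Scope makeCurrent(String context) {
        String previous = CURRENT.get();
        CURRENT.set(context);
        return () -> {
            if (previous == null) CURRENT.remove(); else CURRENT.set(previous);
        };
    }

    // Batch loader body: the worker thread has no context, so restore it from the first key.
    static List<String> loadBatch(List<DepartmentKey> keys) {
        try (Scope ignored = makeCurrent(keys.get(0).context)) {
            return keys.stream()
                    .map(k -> "dept-" + k.id + "@" + CURRENT.get())
                    .collect(Collectors.toList());
        }
    }

    // Simulates the dispatch: keys are built on the caller, the batch runs on the default executor.
    static List<String> dispatch() {
        List<DepartmentKey> keys = List.of(new DepartmentKey(1), new DepartmentKey(2));
        return CompletableFuture.supplyAsync(() -> loadBatch(keys)).join();
    }

    public static void main(String[] args) {
        CURRENT.set("span-1");
        System.out.println(dispatch());
    }
}
```

Restoring from the first key only makes sense if every key in the batch was created under the same request context, which is an assumption of this sketch.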

Govinda Sakhare · Answer 13 · Wed Sep 11 2024 00:18:30 GMT+0800 (China Standard Time)

> since batch loading is in action here, there is another CompletableFuture, which uses default executor

If I remember correctly, it doesn’t work even if the batch contains only 1 parameter.

> To me it seems parameters.get(0).getContext().makeCurrent() is the only way to propagate context

Yes, that fixes the issue, but it is a workaround and doesn’t look elegant because ‘parameter’ is a domain object.

Harshit Rajput · Answer 14 · Wed Sep 11 2024 00:38:50 GMT+0800 (China Standard Time)

> If I remember correctly, it doesn’t work even if the batch contains only 1 parameter.

It doesn't matter whether there is 1 parameter or more. It's the DataLoader framework that hands out the CompletableFuture with a fixed default executor.

> Yes, that fixes the issue, but it is a workaround and doesn’t look elegant because ‘parameter’ is a domain object.

This doesn't seem to be within OpenTelemetry's control. OpenTelemetry already provides context propagation by letting you run a CompletableFuture on an OTel context-wrapped executor, but the DataLoader framework uses a fixed default executor in between.
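
To illustrate the difference between the two executors, here is a JDK-only sketch. A plain ThreadLocal stands in for the OTel Context, and wrapping() is a hypothetical helper that only mimics the idea behind Context.taskWrapping(Executor): capture the caller's context at submission time and restore it around the task. It is not the OpenTelemetry implementation.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.ForkJoinPool;

public class ContextLossSketch {
    // Stand-in for the thread-bound OTel Context (assumption, not the OTel API).
    static final ThreadLocal<String> CURRENT = new ThreadLocal<>();

    // Hypothetical helper mimicking the idea of Context.taskWrapping(Executor).
    static Executor wrapping(Executor delegate) {
        return task -> {
            String captured = CURRENT.get(); // read on the submitting thread
            delegate.execute(() -> {
                String previous = CURRENT.get();
                CURRENT.set(captured); // restore on the worker thread
                try {
                    task.run();
                } finally {
                    if (previous == null) CURRENT.remove(); else CURRENT.set(previous);
                }
            });
        };
    }

    // A future on the default executor: the worker thread never saw the caller's context.
    static String readOnDefaultExecutor() {
        return CompletableFuture.supplyAsync(CURRENT::get).join();
    }

    // The same future on a wrapped executor: the caller's context is visible to the task.
    static String readOnWrappedExecutor() {
        Executor wrapped = wrapping(ForkJoinPool.commonPool());
        return CompletableFuture.supplyAsync(CURRENT::get, wrapped).join();
    }

    public static void main(String[] args) {
        CURRENT.set("span-1");
        System.out.println("default executor: " + readOnDefaultExecutor());
        System.out.println("wrapped executor: " + readOnWrappedExecutor());
    }
}
```

In this sketch the default-executor read comes back null while the wrapped read returns the caller's value, which matches the behaviour described above: any CompletableFuture created in between on an unwrapped executor starts without the context.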