open-telemetry / opentelemetry-java

OpenTelemetry Java SDK

Home Page:https://opentelemetry.io

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

OTel Context is getting lost in GraphQL manual instrumentation

govi20 opened this issue · comments

I have an async GraphQL resolver which I am using along with a GraphQLdata loader. The GraphQL service performs manual instrumentation.
The problem I am facing here is that OpenTelemetry context is getting lost in the load method. This is happening even before offloading the task on the context-wrapped executor.

Steps to reproduce

  1. Clone: https://github.com/govi20/dgs-otel
  2. Build and run the project.
  3. Access localhost:9090/graphiql
  4. execute the following GraphQL query and see the logs where I've printed Context in DepartmentDataLoader
query emploees {
  employees {
    id
    name
    department {
      id
      name
    }
  }
}

What did you expect to see?
Otel Span and context should be available in the DepartmentDataLoader's load() method as there is no thread switch in between. the span's end method is not called so I assume the span is not closed either

What did you see instead?
OTel Context is not getting propagated.

What version and what artifacts are you using?
Version: 1.38.0, I use custom implementation of SpanExporter.

Environment
This is not environment specific issue, it's reproducible on MacOS as well as CentOS.

Additional context
I've reported this issue on GraphQL DGS Framework that I use: Netflix/dgs-framework#1928
The Netflix DGS Folks have recommended me to check with tracing framework team.

Let me know if a sample project with minimal reproducible code is required.

Let me know if a sample project with minimal reproducible code is required.

That would be very useful 🙂

@jack-berg here is a sample project with a reproducible example https://github.com/govi20/dgs-otel

This is where the OTel Context gets lost => DepartmentDataloader

and the data loaders get called from here => EmployeeDataFetcher and there is no thread switch in between. The thread pool that I have configured executor that wraps the task using Context.taskWrapping API.

I've added steps to reproduce in the bug report.

It looks like dgs is based on graphql-java, which has library instrumentation (https://github.com/open-telemetry/opentelemetry-java-instrumentation/tree/main/instrumentation/graphql-java/graphql-java-20.0/library). I wonder if you just plug in that instrumentation library if a bunch of these issues would go away. Worth a shot, at least, to see how it does!

@jkwatson does this library perform auto instrumentation? I need to rely on manual instrumentation logic

The "library" instrumentation doesn't require the javaagent. You can just plug it in programmatically. You might need to figure out how to hook it into dgs, but it doesn't need the agent.

Given this documentation, I would guess it'll work just fine: https://netflix.github.io/dgs/advanced/instrumentation/

@jkwatson That's exactly what I am doing in my code, manually instrumenting using SimpleInstrumention implementation

@jkwatson That's exactly what I am doing in my code, manually instrumenting using SimpleInstrumention implementation

I recommend trying the library instrumentation then, and raising issues in the instrumentation repo if it has holes.

@jkwatson Unfortunately, I can't use the library because it lacks the ability to instrument GraphQL data resolvers.

But still, I've tried it out, and the issue can indeed be reproducible with this library. I plan to report this issue to the instrumentation repo. However, I'm unsure if the problem lies within the core libraries or in the instrumentation.

The issue won't be with the core libraries. It's definitely an issue with making the instrumentation correctly propagate the context where it needs to go.

Continuing @kilink 's comment Netflix/dgs-framework#1928 (comment) ,
I don't think it's a bug in otel java instrumentation either.
We know that to pass context in CompletableFuture we need to wrap Executor in Context.taskWrapping(), which is already done for the dgsAsyncTaskExecutor.
However the catch is:
since batch loading is in action here, there is another CompletableFuture, which uses default executor, for the Batch Loader wrapping the DepartmentLoader. This is the reason why context is reset. If you debug line 39 EmployeeDataFetcher.department() and keep going in, you can see DataLoaderHelper.queueOrInvokeLoader()

Screenshot 2024-09-10 at 9 11 55 PM

To me it seems parameters.get(0).getContext().makeCurrent() is the only way to propagate context, without changing the current code.

since batch loading is in action here, there is another CompletableFuture, which uses default executor

If I remember correctly, it doesn’t work even if the batch contains only 1 parameter.

To me it seems parameters.get(0).getContext().makeCurrent() is the only way to propagate context

Yes that fixes the issue but it is a workaround, doesn’t look elegant because ‘parameter’ is a domain object.

If I remember correctly, it doesn’t work even if the batch contains only 1 parameter.

Doesn't matter if it's 1 or more parameter. It's the DataLoader framework which is spitting out the CompletableFuture with fixed default executor.

Yes that fixes the issue but it is a workaround, doesn’t look elegant because ‘parameter’ is a domain object.

Doesn't seem that this is in opentelemetry control. Opentelemetry already provides context propagation by customizing CompletableFuture with otel context wrapped executor, but the DataLoader framework is using fixed default executor in between.