Jeff-Lewis / cls-hooked

cls-hooked : CLS using AsynWrap or async_hooks instead of async-listener for node 4.7+

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Context occasionally gets corrupted

Quantumplation opened this issue · comments

Hello!

Trying to debug a pretty tricky issue. We use this library to thread a distributed traceId down to our logs and shared HTTP library across many many microservices. In rare cases, we're seeing logs and requests hop from one traceId to another. In essence, this code:

      const oldTrace = clsHooked.getNamespace('DEFAULT_NAMESPACE').get('context').traceId;
      const res = await httpClient.get(someUrl);
      const newTrace = clsHooked.getNamespace('DEFAULT_NAMESPACE').get('context').traceId;
      if(oldTrace !== newTrace) {
        console.log('~~ TRACE ID MISMATCH: ', oldTrace, newTrace);
        process.exit(1);
      }

Fails pretty consistently, and I'm trying to figure out why. I'm still grappling with how async_hooks and cls-hooked work in general, but my understanding is that things like older or non-standard promise libraries or custom thenables (as mentioned in #37) can cause this to happen.

Any advice on tracking down exactly what's happening?

Worth noting that the httpClient here is an axios instance, and doesn't seem to be using custom thenables.

(@SupernaviX is a coworker of mine 😅)

Another thing worth noting is that this service has babel transpilation with the following plugins/config:

"babel-polyfill": "^6.26.0",
"babel-preset-bluebird": "^1.0.1",
"babel-preset-env": "^1.6.0",
"babel-register": "^6.26.0",
{
  "presets": ["bluebird", [
    "env", {
      "targets": {
        "node": "current"
      }
    }
  ]]
}

I'm thinking this is related to #36. Which version of node are you running?

@Jeff-Lewis

/srv/package $ node --version
v10.17.0

@Jeff-Lewis

It's worth noting that our reproduce this involves hammering the service with GET requests. As far as I know there are no POST requests hitting the service.

@Jeff-Lewis any updates?

+1 to this issue.

We are currently using cls-hooked with Express. We have it as a middleware and are setting a few request specific values such as a client and tenant ID which are used for routing the connection to the correct database.

We have noticed our service is generating 404's (as part of validation from database queries) as the connections were being routed to the wrong databases. As part of debugging this issue we wrote some scripts to load the server calling GET's with valid information but quickly swapping between two different tenants and thus the connections should be routed to different databases.

Within 30 seconds or less, we would get a 404 and upon inspecting the logs, the request would be using the tenant ID from the previous request.

We have not been able to make any inroads on fixing this bug except for the scripts which easily reproduce it. We don't have any problems in times of low concurrent requests.

We are running Node 12.13.0.

@william00179 hello, do you managed your issue somehow? Because we want to start use cls-hooked library to track tenant-id too.

We have not been able to resolve this problem and removed cls-hooked.

@william00179 What are you using instead?

We just went back to using req.local

I've encountered similar issues.

Node v8.17.0

@Quantumplation / @william00179 ... not sure if this wil help however ... we have found that things can get messed up when there are complex promise chains and/or awaits within try/catch blocks because of the way in which the promises are resolved.

We ended up working with async_hooks directly and managing the storage differently for different types of async resource.

We also created "frames" (think of them as storage that can be linked to multiple async resources) and then attaching the frame for a resource to any child async resource that is created (via the triggerAsyncId).

This enabled us to retain the storage even if the triggering async resource (aka the parent async resource) was destroyed prior to the child async resource.

Trust me when I say it is even more complex to write the code than it is to explain :)

TBH I would not say that we are 10000% certain that we have completely nailed it, but this work along with some coding rules to keep promise chains short and try/catch with await calls as clean as possible, means that we haven't hit any issues.

My advice would be to to look at the async resource types that are being initialised and destroyed, and also what's happening in the promiseResolve hook, that will give you some direction on where things are going awry.

Has anyone found a resolution to this? We appear to be experiencing the same issue as described here. Still running some tests to see if I can triage the issue further, but the description sounds exactly like what we see - data that is leaking between requests.

Same issue here! Is there any other soluction other than stop using cls-hooked?