nats-io / nats.js

Node.js client for NATS, the cloud native messaging system.

Home Page: https://nats.io

Unhandled NatsError: DISCONNECTED

ArmorDarks opened this issue

  • Client version: 2.8.0
  • Node version: 16.14.2

Sometimes we receive the following unhandled rejection on our servers when running nats.js:

NatsError: DISCONNECT
    at Function.errorForCode (/code/node_modules/nats/lib/nats-base-client/error.js:100:16)
    at /code/node_modules/nats/lib/nats-base-client/protocol.js:116:40
    at Array.forEach (<anonymous>)
    at ProtocolHandler.resetOutbound (/code/node_modules/nats/lib/nats-base-client/protocol.js:115:15)
    at ProtocolHandler.prepare (/code/node_modules/nats/lib/nats-base-client/protocol.js:133:14)
    at ProtocolHandler.<anonymous> (/code/node_modules/nats/lib/nats-base-client/protocol.js:179:31)
    at Generator.next (<anonymous>)
    at /code/node_modules/nats/lib/nats-base-client/protocol.js:8:71
    at new Promise (<anonymous>)
    at __awaiter (/code/node_modules/nats/lib/nats-base-client/protocol.js:4:12)

The error originates here: https://github.com/nats-io/nats.deno/blob/177c3da18319cbd0ec6066228e08f6709feb0511/nats-base-client/protocol.ts#L197

There are two issues with that:

  • It happens inside NATS, and there's no way to catch the rejected promise, so on Node 16+ it crashes the whole server.
  • It's unclear why it happens in the first place.
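As a stopgap for the first point, a process-level `unhandledRejection` listener keeps Node from treating the rejection as a fatal uncaught exception. The sketch below is only a safety net, not a fix for the underlying cause. `installRejectionLogger` is a hypothetical helper name, and reading `reason.code` assumes the rejected value is an error object carrying a `code` property, as `NatsError` does.

```javascript
// Hypothetical helper: installs a process-wide listener for unhandled
// rejections and returns a function that removes it again.
function installRejectionLogger(log = console.error) {
  const handler = (reason) => {
    // Assumption: NATS errors carry a `code` such as 'DISCONNECT'.
    const code = reason && reason.code ? reason.code : 'UNKNOWN';
    log(`unhandled rejection (code=${code})`, reason);
  };
  process.on('unhandledRejection', handler);
  return () => process.off('unhandledRejection', handler);
}
```

Calling `installRejectionLogger()` once at startup is enough. With a listener registered, Node's default handling no longer raises the rejection as an uncaught exception, so the process keeps running and the error is logged instead.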

NATS config:

      const connection = await connect({
        maxReconnectAttempts: -1,
        name: 'some-name',
        pass: 'some-pass',
        servers: ['...servers'],
        user: 'some-user',
        inboxPrefix: 'some.inbox',
      })

What we tried:

  • All NATS async methods are wrapped in try/catch blocks, so I believe the error is thrown somewhere in a callback and can't be caught.

Some observations:

  • There are no other logs before or after that message.
  • It seems to happen mostly when the client reconnects to the NATS server.
  • I wasn't able to reproduce it locally despite doing many bad things to the NATS server and connection.
  • The last time we restarted one of the NATS nodes, it caused a reconnect on hundreds of our servers. Most of them didn't have any issues, but about 30% received that error, so some specific condition seems to trigger it.
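Since there are no other logs around the error, one way to narrow down the triggering condition is to log the client's connection lifecycle events and correlate them with the rejection. This is a minimal sketch, assuming the async iterable returned by the connection's `status()` method is passed in, yielding objects with `type` (for example `disconnect` or `reconnect`) and `data` fields. `logStatus` is a hypothetical helper name.

```javascript
// Hypothetical helper: logs every status event with a timestamp.
// `statusIterable` is assumed to be what `connection.status()` returns.
async function logStatus(statusIterable, log = console.log) {
  for await (const s of statusIterable) {
    log(`${new Date().toISOString()} nats status: ${s.type}`, s.data);
  }
}
```

With the connection from the config above, this would be started as `logStatus(connection.status()).catch(console.error)`, without awaiting it, so that it runs alongside the rest of the application.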

So in this case I realized that the trace there is somewhat misleading, because all it is doing is telling you that the request that was pending (in this case a pong) was rejected with the error. The contents of the trace at that point are useless to you. I added code to remove the stack from that error because it can be confusing.

https://github.com/nats-io/nats.deno/blob/015c0306f241c6b5765f1aa23e0845a194356de6/nats-base-client/protocol.ts#L197-L198
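The mechanism described above can be modelled roughly as follows. This is a simplified sketch and not the library's code: `PendingPongs`, `expectPong`, and `reset` are hypothetical names, and a plain `Error` with a `code` property stands in for `NatsError`.

```javascript
// Simplified model of a client that tracks promises waiting for a PONG.
class PendingPongs {
  constructor() {
    this.pongs = [];
  }

  // Called when a PING is sent; the promise settles when a PONG arrives
  // or when the connection is reset.
  expectPong() {
    return new Promise((resolve, reject) => {
      this.pongs.push({ resolve, reject });
    });
  }

  // Called when the connection drops: every waiter gets the same error.
  // Returns the number of waiters that were rejected.
  reset() {
    const pending = this.pongs;
    this.pongs = [];
    const err = new Error('DISCONNECT');
    err.code = 'DISCONNECT';
    // The trace would only point at this reset, not at the code that
    // was waiting, so it is cleared to avoid misleading the reader.
    err.stack = '';
    pending.forEach((p) => p.reject(err));
    return pending.length;
  }
}
```

In this model, any promise returned by `expectPong()` that has no rejection handler attached when `reset()` runs is what surfaces as an unhandled rejection.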

@aricart 👍 thank you