amqp-node / amqplib

AMQP 0-9-1 library and client for Node.JS

Home Page: https://amqp-node.github.io/amqplib/

SDK connecting without errors after RabbitMQ service is shut down

vancejc-mt opened this issue

Hello,

I'm currently running amqplib as part of my Apollo Server deployment, developing locally using Docker containers.

When I start up all my services amqplib connects to the RabbitMQ server just fine. It correctly creates the connection, then creates the channel, and is able to send messages without issue.

Once I kill the RabbitMQ server, I get the following events: error on the connection, followed by close on the connection. After waiting a bit for the RabbitMQ server to fully shut down (normally 2 seconds, but increasing this didn't seem to make any difference), I try to reconnect. When I reconnect, the SDK sometimes appears to connect just fine despite the RabbitMQ server being down, and continues without error when I ask it to create a channel and even to verify that an exchange exists. No errors are given at all.

This seems to occur more on some machines than others, but it's inconsistent. Sometimes it will correctly error after the connection attempt, while other times it will think a connection is made when there's no RabbitMQ server up.

Code snippet:

    async connect(initialDelay = false) {
        let retries = this.connectionRetries;
        let backoff = this.connectionRetryBackoff;

        // If we're already connecting, do nothing.
        if (this.connecting) return;
        else this.connecting = true;

        if (initialDelay) {
            // TESTING back off for 15 seconds when reconnecting.
            await setTimeout(15000);
        }

        // Attempt to connect to the RabbitMQ server.
        do {
            await amqp
                .connect(
                    {
                        protocol: this.protocol,
                        hostname: this.hostname,
                        port: this.port,
                        username: this.username,
                        password: this.password,
                        heartbeat: 30
                    },
                    {
                        clientProperties: {
                            connection_name: this.connectionName
                        }
                    }
                )
                .then((connection) => {
                    this.connection = connection;
                    // Successful connection, no need to retry.
                    retries = 0;
                    // If our connection is closed, retry the connection.
                    this.connection.on('close', () => {
                        this.connection = null;
                        this.connect(true).catch((err) => {
                          console.log('error reconnecting')
                        });
                    });
                    // If our connection gets an error, retry the connection.
                    this.connection.on('error', (err) => {
                        this.connection = null;
                        this.connect(true).catch((err) => {
                            console.log('Unable to reconnect');
                        });
                    });
                })
                .catch(async (err) => {
                    this.connection = null;
                    // If no retries, error out.
                    if (!--retries)
                        throw new AssetUsageConnectionError(
                            `Unable to connect to server; ${err}`
                        );
                    else {
                        // Back off the requested amount before re-attempting.
                        await setTimeout(backoff);
                    } // end of else - retrying connection
                });
        } while (!this.connection);

        // Done connecting.
        this.connecting = false;

        // Create our channel.
        this.channel = await this.connection.createChannel();
        await this.channel.checkExchange(this.exchange);
    }

Am I missing something in my error or close handlers that I need to do in order for the SDK to... clear its cache, or something? I don't understand how it could report that the following connect succeeded when the server isn't even up (and, even then, why it happens so inconsistently).

Thank you for any help!
Jeff

Running:

  • amqplib version 0.10.3
  • rabbitmq version 3.12.3
  • node.js version 18.17.1

Hi @vancejc-mt,

Thanks for taking the time to post such a clear example. I'm very sure that the createChannel and checkExchange calls would fail if the broker was still unavailable, but will try it later to confirm.

In the meantime, I can see one problem with the code you've written (though I don't think it would cause the symptoms you describe). You have registered permanent handlers (via on) for both the close and error events, and both of them trigger a reconnect. I believe you can get both of these events for the same failure, and you can also get multiple error events. With your code it will therefore be possible to reconnect multiple times for the same connection error. I see you've attempted to prevent this with a boolean flag; however, if you were unlucky, the first reconnection attempt might succeed before the second reconnection attempt was initiated.

Instead, I tend to do something like...

this.connection.on('error', (err) => {
  console.log('Connection error', err);
  connection.emit('lost');
});
this.connection.on('close', () => {
  console.log('Connection closed');
  connection.emit('lost');
});
this.connection.once('lost', () => {
  this.connection = null;
  this.connect(true).catch((err) => {
    console.log('Error reconnecting', err);
  });
});
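
As a minimal, self-contained sketch of why this pattern results in a single reconnect, the snippet below uses a plain Node.js EventEmitter as a stand-in for the amqplib connection. That stand-in is an assumption made for illustration, no broker is involved, and watchConnection, fakeConnection and reconnects are hypothetical names. It fires error and then close, the sequence reported in this thread, and counts how often the lost handler runs.

```javascript
const { EventEmitter } = require('node:events');

// Funnel both 'error' and 'close' into a single custom 'lost' event,
// and react to 'lost' only once per connection object.
function watchConnection(connection, onLost) {
  connection.on('error', (err) => connection.emit('lost', err));
  connection.on('close', () => connection.emit('lost'));
  connection.once('lost', onLost);
}

// Stand-in for an amqplib connection (hypothetical, for illustration only).
const fakeConnection = new EventEmitter();
let reconnects = 0;
watchConnection(fakeConnection, () => {
  reconnects += 1; // a real handler would call this.connect(true) here
});

// Simulate a broker shutdown: 'error' first, then 'close'.
fakeConnection.emit('error', new Error('read ECONNRESET'));
fakeConnection.emit('close');

console.log(`reconnect attempts: ${reconnects}`);
```

The second lost event is emitted with no listener left, so it is simply ignored. Each new connection needs its own call to watchConnection, because once removes the handler after the first lost event.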

I will get back to you after I've had a chance to try your example. Something to try in the meantime is using Wireshark to debug what's going on.

I also notice you don't set a socket timeout, which you can do as follows:

await amqp.connect({
  protocol: this.protocol,
  hostname: this.hostname,
  port: this.port,
  username: this.username,
  password: this.password,
  heartbeat: 30
}, {
  timeout: 1000,
  clientProperties: {
    connection_name: this.connectionName
  }
});

"Once I kill the RabbitMQ server"

Out of interest how are you killing the RabbitMQ server?

"When I reconnect the SDK sometimes appears to connect just fine, despite the RabbitMQ server being down, and will continue without error as I request him to create a channel, and even verify an exchange exists."

And how are you verifying that a channel was created and that the exchange was checked?

I'm unable to reproduce using docker kill $CONTAINER_ID. Your best option for debugging is Wireshark. When everything works and you filter by amqp, you should see something similar to the following:

5	0.000738	::1	::1	AMQP	84	Protocol-Header 0-9-1
7	0.003534	::1	::1	AMQP	589	Connection.Start 
9	0.006700	::1	::1	AMQP	426	Connection.Start-Ok 
11	0.008094	::1	::1	AMQP	96	Connection.Tune 
13	0.008683	::1	::1	AMQP	96	Connection.Tune-Ok 
15	0.008847	::1	::1	AMQP	92	Connection.Open vhost=/ 
17	0.010071	::1	::1	AMQP	89	Connection.Open-Ok 
19	0.014325	::1	::1	AMQP	89	Channel.Open 
21	0.015889	::1	::1	AMQP	92	Channel.Open-Ok 
23	0.016933	::1	::1	AMQP	111	Exchange.Declare x=issue737 
25	0.018190	::1	::1	AMQP	88	Exchange.Declare-Ok 

When the broker is killed you should not see any more traffic (assuming you keep filtering by amqp). If you remove this filter and instead filter by tcp.dstport == 5672 you should see the SYN packets attempting to establish a connection, i.e.

80	115.160883	::1	::1	TCP	88	60154 → 5672 [SYN] Seq=0 Win=65535 Len=0 MSS=16324 WS=64 TSval=239402240 TSecr=0 SACK_PERM=1
82	116.165167	::1	::1	TCP	88	60155 → 5672 [SYN] Seq=0 Win=65535 Len=0 MSS=16324 WS=64 TSval=1919497636 TSecr=0 SACK_PERM=1
84	117.168391	::1	::1	TCP	88	60156 → 5672 [SYN] Seq=0 Win=65535 Len=0 MSS=16324 WS=64 TSval=1707563111 TSecr=0 SACK_PERM=1
86	118.174436	::1	::1	TCP	88	60157 → 5672 [SYN] Seq=0 Win=65535 Len=0 MSS=16324 WS=64 TSval=2464711875 TSecr=0 SACK_PERM=1
88	119.180208	::1	::1	TCP	88	60158 → 5672 [SYN] Seq=0 Win=65535 Len=0 MSS=16324 WS=64 TSval=2489639874 TSecr=0 SACK_PERM=1
90	120.183648	::1	::1	TCP	88	60159 → 5672 [SYN] Seq=0 Win=65535 Len=0 MSS=16324 WS=64 TSval=3295603972 TSecr=0 SACK_PERM=1
92	121.187520	::1	::1	TCP	88	60160 → 5672 [SYN] Seq=0 Win=65535 Len=0 MSS=16324 WS=64 TSval=4084392921 TSecr=0 SACK_PERM=1
94	122.190422	::1	::1	TCP	88	60162 → 5672 [SYN] Seq=0 Win=65535 Len=0 MSS=16324 WS=64 TSval=365481072 TSecr=0 SACK_PERM=1
96	123.196929	::1	::1	TCP	88	60163 → 5672 [SYN] Seq=0 Win=65535 Len=0 MSS=16324 WS=64 TSval=4004402570 TSecr=0 SACK_PERM=1

What I am confident about, though, is that there isn't any caching within amqplib. However unlikely, maybe your code is connecting to a different broker? Alternatively, have you somehow mocked or memoized any functions?
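
One way to check the "different broker" possibility, offered only as a sketch: ask Node.js which addresses the configured hostname resolves to on the affected machine before connecting. dns.lookup is the standard Node.js API and uses the operating system's resolver; formatAddresses and resolveBrokerAddresses are hypothetical helper names, and 'localhost' is a placeholder for whatever this.hostname is set to.

```javascript
const dns = require('node:dns/promises');

// Turn dns.lookup results into readable strings for logging.
function formatAddresses(results) {
  return results.map(({ address, family }) => `${address} (IPv${family})`);
}

// Resolve the hostname through the operating system's resolver,
// returning every address it maps to.
async function resolveBrokerAddresses(hostname) {
  const results = await dns.lookup(hostname, { all: true });
  return formatAddresses(results);
}

// 'localhost' is a placeholder; use the hostname passed to amqp.connect().
resolveBrokerAddresses('localhost')
  .then((addresses) => console.log('Broker hostname resolves to:', addresses))
  .catch((err) => console.log('Lookup failed:', err.message));
```

Running this on two machines that behave differently and comparing the logged addresses would show whether they are actually talking to the same host.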

@vancejc-mt any further update or OK to close?

Apologies - I went on vacation shortly after I added this issue (didn't expect you guys would get back to me so fast - super awesome); I'll be looking at all your comments and working through where my issues may have been. If this doesn't rectify things then I'll open a new issue with the problem hopefully narrowed down.

Thanks again!

Sorry to get back to you so late, it's been a crazy week.

So I modified my code to use the lost once event, preventing what could have been a race condition (though the boolean should have prevented a double connection; but this implementation is a lot cleaner, thanks!) and also added a socket timeout. Unfortunately it still happens: it works just fine on my machine, but a co-worker's machine is still exhibiting the same problems. My guess is there's something about the way his RabbitMQ server (hosted on Docker) is shutting down that is causing some kind of race condition. My first guess was that it was just reconnecting to RabbitMQ as it was going down, but even adding a long enough pause before reconnection (15 seconds) to allow the server to fully go down doesn't seem to fix the issue.

Currently our setup is just running both our apollo server (using amqplib) and the RabbitMQ server in separate docker containers on the same network; in order to kill RabbitMQ we just stop the container with docker.

"And how are you verifying that a channel was created and that the exchange was checked?"

So... currently I'm just trusting that amqplib.connect() throws if the connection failed; I don't see it throwing, so I assumed that it's connecting correctly. Is there some follow-up check which I should do on the connection in order to verify that it's a working connection? For the exchange I'm using channel.checkExchange() in order to verify that the exchange exists; it succeeds without throwing. Is there a different call I should be using, on the channel maybe, to verify that things were set up correctly? connection.createChannel() similarly appears to succeed without throwing any errors.

The events that we get back when we shut the RabbitMQ Docker container down are first the error event (ECONNRESET), followed by the close event. When these happen we're correctly getting just a single call to the lost handler; after a waiting period we go back to connect, and it still appears to succeed (on my coworker's machine; on my machine it correctly fails to connect and then just loops trying to reconnect).

Going to dive into piecing apart what might be happening on his machine using Wireshark starting next week; if I get any more information I'll be posting it here.

Thank you again for all of your help,
Jeff

My best guess is there's something very funky going on with your colleague's machine. Hopefully Wireshark will confirm you have actually connected to a real broker despite the Docker container shutting down.

Alright - you were completely right; we were hitting a separate RabbitMQ server when connected to the company VPN (I didn't realize my coworker was on it and I wasn't; it was just random coincidence that we had an alias on the company network with the same name as our Docker container, zzz), which explains why we were getting the reconnection. Sorry for taking up so much of your time, and thank you for all your help. Going to close the issue.