brianmario / mysql2

A modern, simple and very fast Mysql library for Ruby - binding to libmysql

Home Page: http://github.com/brianmario/mysql2

Problems with Amazon RDS failover

NoSync opened this issue · comments

Hi,
in a production application I'm using an Amazon RDS Aurora cluster as the endpoint. The way it works, when something happens to the master DB, its CNAME is changed to point to the new master, and the old IP value for the CNAME is pointed to the slave. The problem is, whenever the master DB fails I get this error:

"The MySQL server is running with the --read-only option so it cannot execute this statement"

This seems to indicate that the connections are made to the old IP address, as if it were cached, while the webserver itself correctly resolves to the updated IP.

Is any DNS caching taking place in mysql2?

Thanks!

No DNS caching is explicitly taking place in mysql2, and even in libmysqlclient I don't think there's an explicit DNS lookup; resolution is passed on to the system libraries. There may be a local caching name service on your system to track down and force a lower TTL on.

Thanks for your answer, sodabrew. The problem is that even when the local DNS gives an updated record (verified through host/ping xxxx.xxx), the app will still use the old IP until I restart the app server (Puma, in my case).

For now, in case anybody stumbles upon the same problem, I made it work thanks to this gem: https://github.com/sonots/mysql2-reconnect_with_readonly

It took me a while to think this one through. My hunch is that there is no reconnecting happening at all. I suspect the connection to the server never fails. The master rotates to slave position, is set to read-only, but never drops its clients.

Here's where reconnect happens in the client library:
https://github.com/mysql/mysql-server/blob/4f1d7cf5fcb11a3f84cff27e37100d7295e7d5ca/sql-common/client.cc#L4766

Note that it just calls mysql_real_connect again with the original hostname, so it calls getaddrinfo fresh each time:
https://github.com/mysql/mysql-server/blob/4f1d7cf5fcb11a3f84cff27e37100d7295e7d5ca/sql-common/client.cc#L4347

That said, getaddrinfo is implemented in the C library, and if the C library is doing some extra DNS caching, then you may be exactly right after all, and would need to look at your C library options to see how to disable that extra caching. See https://sourceware.org/bugzilla/show_bug.cgi?id=15862 for some discussion of this (in the context of an inconsistency in nscd).
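If you want to check whether that extra caching layer is what's biting you, one quick diagnostic (just a sketch; the hostname is a placeholder) is to compare the system resolver path that libmysqlclient uses against a direct DNS query from Ruby:

require 'socket'
require 'resolv'

host = 'my-cluster.cluster-xxxx.us-east-1.rds.amazonaws.com' # placeholder

# Same resolution path libmysqlclient uses: getaddrinfo via nsswitch (and nscd, if present).
system_ips = Socket.getaddrinfo(host, nil, Socket::AF_INET).map { |ai| ai[3] }.uniq

# Bypasses nsswitch/nscd and asks the configured DNS servers directly.
dns_ips = Resolv::DNS.new.getaddresses(host).map(&:to_s)

puts "getaddrinfo: #{system_ips.inspect}"
puts "direct DNS:  #{dns_ips.inspect}"

If the two keep disagreeing for longer than the record's TTL, the stale answer is coming from the local resolver layer rather than from mysql2 or libmysqlclient.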

I had no idea this issue existed until my production server went down for several minutes because of an AWS power outage last week. I have been trying to debug the issue like @NoSync, but there doesn't seem to be much to go on. In fact, Amazon themselves didn't know what to tell me. I spent the last week digging around.

Here is exactly what happened: when our instance went away we started getting the "gone away" errors and the Aurora failover moved us to our replica. However, our system was still using the old IPs. @sodabrew I don't think it's getaddrinfo that is the issue, because the TTL of the Aurora DNS record is 5 seconds, so it should recover within 5 seconds even if it's cached. Right?

@NoSync That gem won't work in all cases. The read-only case happens when you manually switch over, but not when AWS goes down. Then you get "gone away" or "Lost connection" errors, which this gem does NOT cover.

I found this gist for creating a middleware that correctly identifies the scenarios in which to reconnect: https://gist.github.com/matsukaz/fb0e0bc47752ebb6f751c79268e1b6f5/revisions

However, I was still getting read-only errors.

I believe the best solution is to use the gem that @NoSync found and combine it with all the scenarios in the above GIST.
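As a rough sketch of that combination (hypothetical, not code from the gem or the gist), a Rack middleware could catch both the read-only and the lost-connection flavours and drop the pooled connections, so the next request reconnects against the current DNS answer:

# Hypothetical sketch: clear the ActiveRecord pool on failover-style errors.
class Mysql2FailoverMiddleware
  FAILOVER_PATTERNS = [
    /read-only/i,
    /server has gone away/i,
    /lost connection/i
  ].freeze

  def initialize(app)
    @app = app
  end

  def call(env)
    @app.call(env)
  rescue ActiveRecord::StatementInvalid, Mysql2::Error => e
    raise unless FAILOVER_PATTERNS.any? { |pattern| e.message =~ pattern }
    # Drop every pooled connection; they get reopened (and re-resolved) lazily.
    ActiveRecord::Base.clear_all_connections!
    raise
  end
end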

@sodabrew As more and more people use Amazon Aurora, I think it makes sense to include this in this gem. If you agree, I can create a pull request that does this.

Update: It may even make sense to add this to the C wrapper and create a reconnect method that deletes and recreates the client.
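In Ruby terms the idea would look roughly like this (purely a hypothetical sketch; mysql2 does not expose such a method today): keep the connect options around and rebuild the client from them, which also forces a fresh DNS lookup:

require 'mysql2'

# Hypothetical wrapper, not part of mysql2: remembers the connect options so
# the client can be thrown away and rebuilt after a failover.
class ReconnectableClient
  def initialize(opts)
    @opts   = opts
    @client = Mysql2::Client.new(opts)
  end

  # Close the (possibly stale) connection and open a fresh one; the new
  # connection does a new DNS lookup and so lands on the promoted writer.
  def reconnect!
    @client.close rescue nil
    @client = Mysql2::Client.new(@opts)
  end

  def query(sql, options = {})
    @client.query(sql, options)
  end
end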

As I'm reading https://github.com/sonots/mysql2-reconnect_with_readonly/blob/master/lib/mysql2/reconnect_with_readonly.rb I can only stop and wonder if there are some underlying methods I could expose to make this easier to implement /cc @sonots

Is there something we can get going in the meantime to make this possible?

Reply to #948 (comment)

In my understanding, in the case of "Lost connection" or "gone away" errors, the connection is disconnected, so simply reconnecting is enough. In the case of Rails, we can use reconnect: true in database.yml to cover these errors.

However, the problem we have in the case of failover is that the MySQL connection is not disconnected; instead the MySQL server changes to read-only status. My gem https://github.com/sonots/mysql2-reconnect_with_readonly covers that case.
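For reference, the same distinction shows up with the plain mysql2 API: reconnect: true turns on libmysqlclient's auto-reconnect, so it only helps once the connection actually drops, not while a still-open connection is talking to a node that has merely turned read-only (host and credentials below are placeholders):

require 'mysql2'

client = Mysql2::Client.new(
  host:      'my-cluster.cluster-xxxx.us-east-1.rds.amazonaws.com', # placeholder
  username:  'app',
  password:  ENV['DB_PASSWORD'],
  database:  'app_production',
  reconnect: true # transparently reopens dropped connections, nothing more
)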

Reply to #948 (comment)

The first thing I want is a proper way to reconnect instead of this code https://github.com/sonots/mysql2-reconnect_with_readonly/blob/846e92e979b1695d8c959e465bbcac5d973c1b28/lib/mysql2/reconnect_with_readonly.rb#L77-L84.

I am not sure which other APIs should be exposed. I may want retry logic, but that looks out of scope for the mysql2 gem. I have no opinion.

Hello!

I too have encountered this issue. I work on a project where we deploy many small apps and have many testing environments. We switched a bunch of the apps to Aurora a few months back, and yesterday one of our apps encountered the same scenario as in this issue. The app with the issue is a Rails 4.2.5 app using Active Record and version 0.3.18 of the mysql2 driver.

Thanks @sonots for making that gem. I'll investigate it further. I'm a bit hesitant, though, to add something that monkey-patches, for fear that if we later upgrade mysql2 the patch could cause side effects with newer code.

Also, looking into it a bit, I'm not sure the semantics of automatically reconnecting are what I'd want. I've historically avoided the regular "reconnect: true" because it doesn't work with transactions.

The better option perhaps would be to raise an error of a type that causes upstream libraries such as active record to know that the connection has gone bad and allow the application level reconnection to occur. Though, it seems a bit strange to have logic in a generic library to mitigate an issue with a particular deployment strategy.

Given that I'm getting and logging an error already I'm guessing that type of fix should happen at the active record and/or rails level. Converting to a connection error could be done at the ActiveRecord::ConnectionAdapters::Mysql2Adapter level for example.
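A minimal sketch of that idea, assuming a Rails 4/5-era adapter where execute takes (sql, name); this is hypothetical, not an official patch:

require 'active_record/connection_adapters/mysql2_adapter'

# Hypothetical module: translate the Aurora read-only error into a
# connection-level error so pools and retry layers treat the node as dead.
module ReadOnlyMeansDeadConnection
  def execute(sql, name = nil)
    super
  rescue ActiveRecord::StatementInvalid => e
    raise ActiveRecord::ConnectionNotEstablished, e.message if e.message =~ /--read-only/
    raise
  end
end

ActiveRecord::ConnectionAdapters::Mysql2Adapter.prepend(ReadOnlyMeansDeadConnection)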

I encountered this too.

I have the same opinion about @sonots's gem; furthermore, it is not working with Puma.
So, until we find something better, we will point the health checker at an endpoint that updates something in the DB.

UPDATE (just some side notes):

  • I tried Failover many times, but this problem only occurred once.
  • In puma.rb we are using
before_fork do
  ActiveRecord::Base.connection_pool.disconnect! if defined?(ActiveRecord)
end
on_worker_boot do
  ActiveRecord::Base.establish_connection if defined?(ActiveRecord)
end

I tried @sonots's patch too, but reconnecting was causing really bizarre behaviour (on highly concurrent servers). Instead, I simplified the patch to a plain connection close and rely on the DB pool to reopen connections:

if ActiveRecord::Base.connection.instance_values['config'][:adapter] == 'mysql2'
  require 'mysql2'
  require 'mysql2/client'
  module Mysql2
    class Client
      orig_query = instance_method(:query)

      define_method(:query) do |sql, options = {}|
        begin
          orig_query.bind(self).call(sql, options)
        rescue Mysql2::Error => e
          if e.message =~ /read-only/
            your.logger.error 'MySQL Read-Only error detected - closing connection, router should reroute to new node'
            close
          end
          raise e
        end
      end
    end
  end
end

This is still not fool-proof, though, as read-only transactions will successfully continue against the evicted node until one of the write queries ends up killing that connection.
Another alternative, if you can afford the cost, is to not allow your connections to be reused.

Finally, this is really an issue with the router. I tested it with MySQL Router v2.1.6 (I haven't checked whether the problem is also present in 8.0). Otherwise, I was told that the ProxySQL router supports connection resets on node eviction, but I haven't had the chance to test it yet.

@kemper
Hello, I also had the same concern.

So, I created a clever (lol) monkey patch.
https://github.com/alfa-jpn/mysql2-aurora

The basic idea is based on @sonots's patch. (Many thanks!! 🙇‍♂️)
https://github.com/alfa-jpn/mysql2-aurora/blob/master/lib/mysql2/aurora.rb

The difference is that this gem has tests and CI.

It also pins the mysql2 version:
https://github.com/alfa-jpn/mysql2-aurora/blob/master/mysql2-aurora.gemspec

Well, I already use it in a production environment.

Hello,

The same issue here, it seems. We also use makara to split database reads from writes; they go to different Aurora clusters. After setting up auto-scaling on the Aurora read cluster, I can see that failover doesn't actually happen, even though Aurora does its part.

The problem is that even though a new instance is started in the read cluster when required, the app still remains connected to the original instance for some reason. See the graphs in the referenced issue above: 0 connections to the new instance...

I'm considering calling ActiveRecord::Base.clear_all_connections! before the most expensive SELECT queries. Hopefully the new connections will be routed to the new instance appropriately.

@alfa-jpn Thanks! I like your solution. When I get around to revisiting this issue I'll try out your gem in one of the apps. I cloned it, ran the tests, and played around with it a bit. I wasn't sure why the StandardError rescue was being squelched on @client&.close. I guess a network error could happen and you want to be able to set a new client so that the retry can continue? Either way, it seems pretty safe. Thanks for making the gem!

@kemper
Hi, thanks for the review!

Because I could not predict the library's behaviour during an Aurora failover, the rescue clause is intended to retry the reconnect no matter what happens.

Hi people, I want to give an update, or at least my feedback, on this issue.
I was unable to produce a persistent failure, and my Rails services were able to switch over without doing anything special.

Here is our setup:

  • Puma server with Ruby 2.5.7 running in an AWS ECS container with the Amazon Linux 2 base image
  • mysql2 version 0.5.3
  • AWS RDS Aurora MySQL cluster set up with 2 nodes
  • The application is configured to use only the writer RDS DNS

I performed different tests using the failover option from the AWS RDS console or the command line.
The service was switching to read-only for some seconds until the DNS was properly switched on the network.
The service was otherwise available and working fine with READ and WRITE into the DB.
Only one kind of background job failed, and ECS handled that by restarting a new task.

If someone has a new scenario or test case to reproduce the issue, that would be a good thing.

Assumptions

  • AWS is now taking care of that DNS failover in a better way
  • The mysql2 library has received some updates since this issue was first raised