jhthorsen / mojo-mysql

Mojolicious and Async MySQL

Delayed closing of idle connections

osvathzo opened this issue

Hi,
Hi,
I am writing a REST API that uses MySQL as a backend. I ran a load test, which seemed to work fine at first, but after some time all new connections to the MySQL server started to fail (then worked again for a while, then failed again).

The error was: "DBI connect failed: Can't connect to MySQL server on ... (99)".

I first suspected that there were too many concurrent connections, but it turned out the problem is not the open connections but the closed ones: they are left in a TIME-WAIT state on the MySQL server, and after a while no ports are left for new connections. (The problem is described here in detail.)
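
(On a Linux host you can watch the pile-up on the server side with something like:)

    ss -tan state time-wait | wc -l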

The Mojo::mysql module has a basic connection pooling feature, and the cause of the problem is that close_idle_connections() immediately closes enqueued connections whenever there are more than max_connections of them in the queue.
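
Simplified, the current behaviour amounts to something like this (a sketch only, not the actual module source; the internal queue layout is illustrative):

    # Connections handed back to the pool are closed right away once the
    # queue holds more than max_connections of them.
    sub _enqueue {
      my ($self, $dbh) = @_;
      my $queue = $self->{queue} ||= [];
      push @$queue, $dbh;
      # Under bursty load this keeps churning TCP connections, each one
      # leaving a TIME-WAIT socket behind on the server.
      shift(@$queue)->disconnect while @$queue > $self->max_connections;
    }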

I tried increasing the max_connections value, but it only made the problem appear somewhat later.

To test my theory, I made a subclass of Mojo::mysql that stores time() alongside the connection in _enqueue, and modified close_idle_connections() to close unused connections only after they have been idle for some time. This seems to work.
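
Roughly like this (a simplified sketch of my subclass, not a tested patch; max_idle_time is just the name I used here, and a complete version would also need to unpack the timestamp again in _dequeue):

    package Mojo::mysql::DelayedClose;
    use Mojo::Base 'Mojo::mysql';

    # Seconds a connection may sit unused in the queue before it is closed.
    has max_idle_time => 60;

    # Remember when the connection was put back, next to the handle.
    sub _enqueue {
      my ($self, $dbh) = @_;
      push @{$self->{queue} ||= []}, [time, $dbh];
    }

    # Close only connections that have been idle longer than max_idle_time.
    sub close_idle_connections {
      my $self  = shift;
      my $queue = $self->{queue} || [];
      my $now   = time;
      @$queue = grep {
        my ($enqueued, $dbh) = @$_;
        my $keep = $now - $enqueued < $self->max_idle_time;
        $dbh->disconnect unless $keep;
        $keep;
      } @$queue;
    }

    1;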

Do you think that some kind of delayed closing could be introduced in the official package?

Thanks,
Zoltán

I'm not quite sure if this is something I want. I don't want to overcomplicate the code base if this is more of a theoretical issue than a practical one. I'd probably take a PR if the feature is well tested, though.

Thanks for the quick reply!

It may have practical implications. If the load is such that there are more connections in rotation than the max_connections setting, then the module is constantly closing old connections and opening new ones instead of reusing existing ones.
This can be solved by setting a high enough max_connections value, but then in low-load situations those connections are kept open even when they are not really needed.
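
For reference, this is the knob we are talking about (the DSN is just an example):

    my $mysql = Mojo::mysql->new('mysql://user:secret@localhost/mydb');
    $mysql->max_connections(20);    # keep up to 20 idle handles in the pool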

Is there a problem with keeping many connections around? Not sure what “many” is though. I’m not entirely sure why a process would need a lot of connections to MySQL, because they are being reused.

The more I think about it, the more I think there might be something wrong with your schemas, sql queries, or something else.

Today I encountered a similar problem.
The program queues 5 parallel connections using AsyncAwait and the ReadWriteFork module.
It runs once a week and has done so for a month. Today I found the process hanging, so I attached to it with strace:

restart_syscall(<... resuming interrupted poll ...>^Cstrace: Process 33863 detached
 

After killing the session on the MariaDB server using mysqladmin kill, the script continued without any hangs.
So I started to investigate and found out that Marc Lehmann had pushed a new version of EV that omits the "buggy" Perl assert and uses his own implementation instead. This could be the root of the problem, since assert is used by any C programmer to find runtime problems and then abort accordingly.
I updated EV to version 4.33 and will continue to monitor the program.
Will keep you informed!

Best regards
Franz

Thanks @jhthorsen
I also found out that the DBI module had some bugs.
I repackaged the DBI library too.
It is still running.
After that, I will force the POLL reactor and report back.
Since I use your ReadWriteFork module, all the stderr output (DBI_TRACE=1) goes to the logfile, which makes debugging a lot easier!

Thx.
Franz

So,
after upgrading the DBI library, I get an error:

[2020-04-10 13:39:43.54031] [99877] [debug] _run_cmd_ref: buffer from stdin/stderr:     <- execute= ( 2 ) [1 items] at Database.pm line 65
    <- DELETE('HandleError')= ( CODE(0x559dc70ad060) ) [1 items] at Database.pm line 68
    !! ERROR: 2000 'fetch() without execute()' (err#0)
    <- fetchall_arrayref= ( [ ] ) [1 items] at Results.pm line 55
    -> HandleError on DBI::st=HASH(0x559dc7c74958) via CODE(0x559dc70ad060) (ARRAY(0x559dd1e998d8))
    <- can(CARP_TRACE) = 0 (? 0)
    <- HandleError= 0 (ARRAY(0x559dd1e998d8))
       ERROR: 2000 'fetch() without execute()' (err#0)
    <- DESTROY(DBI::st=HASH(0x559dc7c74958))= ( undef ) [1 items] at Reporter.pm line 710

Will try Mojo::Reactor::Poll next.
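
(The reactor can be forced through the MOJO_REACTOR environment variable; the script name below is just a placeholder.)

    MOJO_REACTOR=Mojo::Reactor::Poll perl myscript.pl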

It works when running with Mojo::Reactor::Poll. Tried it twice (40k selects/inserts per run).
Rgds.
Franz

Sorry, but I didn't check the logs in depth. I get 17k errors when using the POLL backend:

[2020-04-10 14:14:18.27473] [124003] [debug] _run_cmd_ref: buffer from stdin/stderr:        ERROR: 2006 'MySQL server has gone away' (err#0)
    <- DELETE('HandleError')= ( CODE(0x5610a9f82bc8) ) [1 items] at Database.pm line 65
       ERROR: 2006 'MySQL server has gone away' (err#0)
    <- DESTROY(DBI::st=HASH(0x5610aab6ad08))= ( undef ) [1 items] at Reporter.pm line 710

We checked all the variables and enabled the error logs, but there are no errors on the client side. Now it's getting complicated. When using EPOLL, there are no errors.
The query is a simple select checking whether an id already exists. Nothing fancy or complex. Only reports.
I will try some things, but I am at a loss right now ;-)
Rgds.
Franz

After a lot of debugging I found the error.
When a select(_p) returns 0 rows, subsequent sth->fetch calls result in an error ("fetch() without execute()").
To avoid this, you have to check $results->rows first.
If rows > 0 there are results and you can continue; otherwise you have to skip the $results fetch methods.
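
In other words, something along these lines (a sketch; the DSN, table and column names are made up):

    use Mojo::mysql;

    my $mysql = Mojo::mysql->new('mysql://user:secret@localhost/mydb');

    $mysql->db->select_p('items', ['name'], {id => 42})->then(sub {
      my $results = shift;
      return warn "no matching row\n" unless $results->rows;   # nothing matched, do not fetch
      warn 'found: ', $results->hash->{name}, "\n";            # safe to read the row now
    })->wait;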
@osvathzo
You'll see only one connection on the MySQL server, due to the fact that DBD::mysql (MariaDB) uses the SO_REUSEADDR socket option, so all connections share the same source TCP port!
To debug, you have to issue something like:

 
lsof -n -i|grep -i established

I'm pretty sure that when using delay loops in conjunction with non-blocking calls, you'll end up in trouble, because callbacks won't be called after hitting an error!
My solution was quite simple.
1.) I created an async queue routine which creates one callback for every code reference passed to the async call.
2.) I switched to Mojo::IOLoop::Subprocess, because Mojo::IOLoop::ReadWriteFork doesn't play well when calling CODE references in an async (async/await) setting; see the sketch below.
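
For reference, the subprocess pattern looks roughly like this (a minimal sketch; the child just returns a timestamp as a stand-in for the real blocking DBI work):

    use Mojo::Base -strict;
    use Mojo::IOLoop;

    Mojo::IOLoop->subprocess(
      sub {                                    # runs in the forked child
        my $subprocess = shift;
        return scalar localtime;               # stand-in for the blocking query
      },
      sub {                                    # runs in the parent afterwards
        my ($subprocess, $err, @results) = @_;
        return warn "Subprocess error: $err" if $err;
        say "Child returned: @results";
      },
    );

    Mojo::IOLoop->start unless Mojo::IOLoop->is_running;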

But beware: use the latest DBD::mysql module (v4.050)! 4.049 has several bugs; see the Changelog for details.

For me, case closed.
Rgds.
Franz


DBD::mysql ... 4.049 has several bugs, see the Changelog for details.

Shall we set version 4.050 in the cpanfile? At the moment we are at 4.042.
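
If we bump it, the cpanfile line would simply be:

    requires 'DBD::mysql' => '4.050';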

@jhthorsen
It is certainly possible that there is something wrong with how I do things.

Most of my REST API endpoints are used for bulk operations. Some of the required SELECTs can be merged into a single query with multiple WHERE conditions or IN (...) conditions, while those that can't be merged are executed with Mojo::Promise->map({concurrency => 5}, ...). In a single bulk request there are several queries started with Mojo::Promise->all(...), which in total use ~8 connections.
When lots of such bulk requests come in (e.g. during a stress test), the number of connections used goes well above 100.
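
To illustrate the access pattern (a sketch; $mysql, the table name and @ids are placeholders):

    # Up to 5 queries run at once; each callback checks a fresh handle out
    # of the pool via $mysql->db, so bursts use many connections at a time.
    Mojo::Promise->map(
      {concurrency => 5},
      sub { $mysql->db->select_p('items', undef, {id => $_}) },
      @ids
    )->then(sub { warn scalar(@_), " queries finished\n" })->wait;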

With the current implementation, if at any moment the number of enqueued connections goes above the threshold, then they're immediately closed and new ones are created when the next request comes in.

I can increase the max_connections value to 15, 20 or 30, and that certainly helps in burst situations, but with a prefork of 8 processes this keeps 120, 160 or 240 idle connections in rotation even when they are only lightly used, which seems a waste of resources.

Unfortunately the schema of the database can be considered legacy and cannot be changed at the moment.

@fskale
Sorry, but I don't understand how our problems are related; I have no stuck connections on the MySQL server.

The TIME-WAIT connections that I experience are on the MySQL server, not on my client, so I don't think DBD::mysql (MariaDB) using SO_REUSEADDR is relevant here.
I am not using AsyncAwait, ReadWriteFork or Subprocess, only Promises.

@osvathzo: I don't consider idle connections an issue. Please do let me know why they are a problem, or I won't see any reason to change the code.

Also, do you need to run the SQL concurrently? Why not create an "SQL queue" and run the statements sequentially (as opposed to using Mojo::Promise->map)?
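
Something like this, in other words (a sketch; @statements stands in for whatever SQL you are running):

    # One handle, statements chained one after the other instead of being
    # fanned out with Mojo::Promise->map.
    my $db = $mysql->db;
    my $p  = Mojo::Promise->resolve;
    for my $sql (@statements) {
      $p = $p->then(sub { $db->query_p($sql) });
    }
    $p->then(sub { warn "all statements done\n" })->wait;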

Going to close this issue, since it seems to have gone idle. I might consider a PR if it includes additional tests and not too much complexity.