google / grr

GRR Rapid Response: remote live forensics for incident response

Home Page: https://grr-doc.readthedocs.io/

MySQL performance issues causing frontend lockups

atkinsj opened this issue

Hi all,

This is a pretty complicated one that I'd appreciate some help on. I think the root cause might be a performance issue in either the MySQL schema definition or the frontend insertion logic.

Environment

  • AWS deployment. ALB in front of admin/frontend. Auto-scaling admin, worker and frontend instances. Workers scale on average CPU utilisation, frontend instances scale on average # of inbound HTTP requests per second.
  • RDS deployed as db.m5.4xlarge (8 core/16 vCPU, 64 GB RAM, ~3500 Mbps of bandwidth, ~2000 write IOPS)
  • max_allowed_packet is 1073741824

Situation
I haven't found a smaller subset that reproduces this, but doing a recursive directory listing of / across ~100 hosts triggers it 100% of the time for me.

Issue
Frontend instances scale up automatically to handle the load of inbound client requests returning the results of a RecursiveDirectoryList flow over /. Frontends eventually become completely non-responsive to all inbound connectivity (e.g., a curl to /server.pem times out). Frontend instances are terminated by AWS after failing their health check, instances are replaced, rinse and repeat, very slowly gaining progress each time.

The frontend instances are continuously throwing this error message:

_mysql_exceptions.OperationalError: (1205, 'Lock wait timeout exceeded; try restarting transaction')

On a whim I tried to set the MySQL param innodb_lock_wait_timeout to 100 (up from 50). This did not help.
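For reference, roughly what that change looks like; since this is RDS the value is normally changed through a DB parameter group rather than SET GLOBAL, so treat the statements below as a sketch:

-- Check the current value; the MySQL default is 50 seconds.
SHOW VARIABLES LIKE 'innodb_lock_wait_timeout';

-- Raise the timeout to 100 seconds (on RDS: edit the parameter group instead).
SET GLOBAL innodb_lock_wait_timeout = 100;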

RDS shows ~60% CPU usage and write IOPS of ~1500, so we're not hitting resource constraints on that side. I decided to interrogate MySQL directly with SHOW ENGINE INNODB STATUS, which shows a lot of very long-duration transactions.

---TRANSACTION 704030533, ACTIVE 2098 sec fetching rows
mysql tables in use 1, locked 1
35016 lock struct(s), heap size 3481808, 3517149 row lock(s), undo log entries 14744
MySQL thread id 4843, OS thread handle 47279362615040, query id 79187948 10.10.21.111 grr updating
UPDATE client_paths
        SET last_stat_entry_timestamp = FROM_UNIXTIME('1588350215.910328')
        WHERE (client_id = 13407277259504602580 AND path_type = 1 AND path_id = '|\?Q3?j\?\?*\?#\?\?\?eϚ?\'M????{??.??') OR (client_id = 13407277259504602580 AND path_type = 1 AND path_id = '\?\?a\???\?\????\?O߮????\?)1???)\?
                    ') OR (client_id = 13407277259504602580 AND path_type = 1 AND path_id = 'Ǻ}\?\\??s@b?y?X?

That transaction has been actively running for ~34 minutes on a simple UPDATE statement.

---TRANSACTION 704228753, ACTIVE 98 sec inserting
mysql tables in use 1, locked 1
LOCK WAIT 4 lock struct(s), heap size 1136, 2 row lock(s)
MySQL thread id 5465, OS thread handle 47281214134016, query id 89272830 10.10.20.244 grr update
INSERT INTO client_paths(client_id, path_type, path_id,
                                 timestamp,
                                 path, directory, depth)
        VALUES (455317224786704125, 1, 'ʟY?Nn3?7@`ۘ\?\?z???f?6?\???J	', FROM_UNIXTIME('1588352221.078742'), '/ormal/path/redacted', 0, 3)
        ON DUPLICATE KEY UPDATE
          timestamp = VALUES(timestamp),
          directory = directory OR VALUES(directory)

This took ~100 seconds to insert.

I don't know where else to go from here. Has anyone experienced this before?

Just to have a full picture here: how big is your client_paths table?

Big. A count(*) is still running, but:

MySQL [grr]> SELECT table_name AS "Table",
    -> ROUND(((data_length + index_length) / 1024 / 1024), 2) AS "Size (MB)"
    -> FROM information_schema.TABLES
    -> WHERE table_schema = "grr"
    -> ORDER BY (data_length + index_length) DESC;
+--------------------------------+-----------+
| Table                          | Size (MB) |
+--------------------------------+-----------+
| flow_responses                 |   6725.19 |
| client_paths                   |   5198.30 |
| client_stats                   |   4502.16 |
| blobs                          |   3855.58 |
| client_path_stat_entries       |   3672.95 |
| flow_results                   |   2869.97 |

Update:

MySQL [grr]> select COUNT(*) from client_paths;
+----------+
| COUNT(*) |
+----------+
| 10841536 |
+----------+

I'm happy to nuke the tables and start fresh, but this will likely just happen again eventually. I'll leave it for now in case you need to do some more testing.

It's unexpected that GRR has performance issues at 10 million entries already; that's definitely an amount that we should be able to handle. Something that we have to look into for sure.

One more question: what GRR version are you running?

Maybe the optimizer does not recognize that the WHERE clause in the first update that you listed (transaction 704030533) is really 3 point lookups and instead begins a table scan. This would explain why it is 'fetching rows' for a long time, and probably also read-locks the whole table. Then even the simple point inserts get stuck behind transactions of this form and the whole thing grinds to a halt. A query plan for an update of the form first listed should tell us if this is happening.

Could you give us the output of:
EXPLAIN UPDATE ...
SHOW WARNINGS

And if your MySQL is recent enough, "EXPLAIN ANALYZE UPDATE ..."? I don't know how mature the MySQL optimizer is, but if the OR clauses cause it to resort to a table scan, we could experiment until we have a query which doesn't, or simply split it into 3 updates within the same transaction.
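Something along these lines, with placeholder literals (and if your server version refuses EXPLAIN ANALYZE on a single-table UPDATE, an equivalent SELECT over the same WHERE clause should show the same access path):

-- Plan for the problematic multi-OR update (placeholder values).
EXPLAIN
UPDATE client_paths
   SET last_stat_entry_timestamp = NOW(6)
 WHERE (client_id = 1 AND path_type = 1 AND path_id = 'a')
    OR (client_id = 1 AND path_type = 1 AND path_id = 'b')
    OR (client_id = 1 AND path_type = 1 AND path_id = 'c');
SHOW WARNINGS;

-- Fallback if EXPLAIN ANALYZE is rejected for the UPDATE form:
EXPLAIN ANALYZE
SELECT * FROM client_paths
 WHERE (client_id = 1 AND path_type = 1 AND path_id = 'a')
    OR (client_id = 1 AND path_type = 1 AND path_id = 'b')
    OR (client_id = 1 AND path_type = 1 AND path_id = 'c');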

Edited (again) to add: Actually the data about that first transaction already pretty much confirms the hypothesis. It claims the transaction has 3.5m row locks, when it really should only need 3. The insert is then waiting for write locks on 2 rows, but cannot get them because of the first transaction's table scan.
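If it helps to watch the blocking live, the InnoDB transaction and lock-wait views can be queried directly (view names vary a little by MySQL version; sys.innodb_lock_waits assumes the sys schema that 5.7/8.0 ship with):

-- Long-running transactions and how many row locks they hold.
SELECT trx_id, trx_state, trx_started, trx_rows_locked, trx_query
  FROM information_schema.innodb_trx
 ORDER BY trx_started;

-- Which statement is waiting on which.
SELECT wait_age, waiting_query, blocking_query
  FROM sys.innodb_lock_waits;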

I'm running the GRR 3.4.0.1 Docker images.

@bgalehouse I'll reproduce today and try to issue the statement with an EXPLAIN ANALYZE.

So, I did some experimentation. For a certain initial database state I managed to reproduce the behaviour that @bgalehouse mentioned. For a query like:

UPDATE client_paths
   SET last_stat_entry_timestamp = now(6)
 WHERE (client_id = 10571145980656244476 AND path_type = 1 AND path_id = 'foo')
    OR (client_id = 10571145980656244476 AND path_type = 1 AND path_id = 'bar')
    OR (client_id = 10571145980656244476 AND path_type = 1 AND path_id = 'baz')

the database decided to use the client_paths_idx index (rather than the primary key), which does not include the path_id column. The problem did not occur when there was only one WHERE condition (i.e. no ORs).
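For comparison, the split suggested earlier (one point update per path inside the same transaction) sidesteps the multi-OR plan; in my tests the single-condition form did not hit this problem:

START TRANSACTION;

-- Each UPDATE matches exactly one row by its full key, so no multi-OR plan is involved.
UPDATE client_paths
   SET last_stat_entry_timestamp = NOW(6)
 WHERE client_id = 10571145980656244476 AND path_type = 1 AND path_id = 'foo';

UPDATE client_paths
   SET last_stat_entry_timestamp = NOW(6)
 WHERE client_id = 10571145980656244476 AND path_type = 1 AND path_id = 'bar';

UPDATE client_paths
   SET last_stat_entry_timestamp = NOW(6)
 WHERE client_id = 10571145980656244476 AND path_type = 1 AND path_id = 'baz';

COMMIT;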

I also tried to populate the database with a large number of entries. However, even at 20 million rows I did not observe any serious performance loss, because the database started to pick the right index. Doing the same with a specific path prefix (that could roughly correspond to what your database might have looked like after recursive directory listing) also did not lead to any observable degradation in speed.

However, the optimizer's decisions clearly can depend on the database contents. Maybe in your case, because of some very specific conditions, it picks the wrong index. I would suggest forcing the index and checking whether it helps. In the grr/server/grr_response_server/databases/mysql_paths.py file, in the _MultiWritePathInfos method, you could change both UPDATE queries to use the following:

UPDATE client_paths FORCE INDEX (PRIMARY)
   SET (...)
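To double-check that the hint is picked up, an EXPLAIN of the hinted statement (placeholder values below) should report PRIMARY in the key column instead of client_paths_idx:

EXPLAIN
UPDATE client_paths FORCE INDEX (PRIMARY)
   SET last_stat_entry_timestamp = NOW(6)
 WHERE (client_id = 10571145980656244476 AND path_type = 1 AND path_id = 'foo')
    OR (client_id = 10571145980656244476 AND path_type = 1 AND path_id = 'bar');
-- The plan's key column should now read PRIMARY.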

Hopefully, this fixes the issue for you. Let us know if it does not; we will need to investigate further.

GRR has been forcing these indices for some time now, so I am going to close this issue. Feel free to reopen if the problem persists.