quay / clair

Vulnerability Static Analysis for Containers

Home Page: https://quay.github.io/clair/



GC does not run unless "matcher.update_retention" is set explicitly

majewsky opened this issue

Description of Problem / Feature Request

Backstory: The database of my Clair instance is growing much faster than it used to. Unfortunately, I don't have metrics going back that far, so I cannot pinpoint when this started, but I set the instance up in February on a 50 GiB volume, and usage hovered around 40 GiB for most of that time (after the initial updater runs had completed and all vulnerabilities were in the database). In July, I got disk-full alerts and extended the volume from 50 GiB to 100 GiB. This week, just 6 weeks later, the 100 GiB volume was full yet again and I had to upsize it another time. So while I don't have exact numbers, growth has clearly been stronger in the last two months than before. I upgraded from 4.0.x to 4.1.0 at the beginning of June, so there is a timing correlation, but I cannot provide strong evidence one way or the other.

Expected Outcome

I would expect the DB size to grow slowly, but steadily (as new images get scanned and the updaters pull in new vulnerabilities).

Actual Outcome

As described above, the database is growing much faster than it used to. I had a look around in psql to see what is using all that storage: the sizes of most tables are rounding errors (including manifest and indexreport and such, since I only have about 25000 images in the DB at the moment). The vuln table takes up ~25% of all storage, at just shy of 10 million entries (which is a lot, but I don't think it's that unusual since I have a lot of updaters enabled). The remaining ~75% is occupied by the uo_vuln table:

clair=# SELECT COUNT(*) FROM uo_vuln;
   count
-----------
 723471358
(1 row)
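
A query along these lines gives the per-table breakdown (plain PostgreSQL catalog functions, nothing Clair-specific):

-- Total on-disk size per table (including indexes and TOAST), largest first.
SELECT relname AS table_name,
       pg_size_pretty(pg_total_relation_size(oid)) AS total_size
FROM pg_class
WHERE relkind = 'r'
ORDER BY pg_total_relation_size(oid) DESC
LIMIT 10;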

My understanding of this table is limited by what I can grasp from the database schema, but over 700 million entries seem excessive given that there are just 10 million vuln entries and about 30000 update_operation entries. There is apparently a ton of duplication going on here. For example, I grabbed one update operation's fingerprint at random and got this:

clair=# SELECT (SELECT COUNT(*) FROM uo_vuln WHERE uo = id) AS uo_vuln_row_count, * FROM update_operation WHERE fingerprint = '"206c95-5c39b36cc655d-gzip"';
 uo_vuln_row_count |  id   |                 ref                  |        updater        |         fingerprint         |             date              |     kind
-------------------+-------+--------------------------------------+-----------------------+-----------------------------+-------------------------------+---------------
               946 | 41537 | 243cab3d-397d-4367-9aab-b1058f884562 | debian-jessie-updater | "206c95-5c39b36cc655d-gzip" | 2021-05-31 10:44:59.386241+00 | vulnerability
               946 | 41547 | 1ccfad3e-c2d6-4afe-b309-88fdab6e245f | debian-jessie-updater | "206c95-5c39b36cc655d-gzip" | 2021-05-31 11:44:59.403178+00 | vulnerability
               946 | 41560 | 3dad7bda-c6d9-4993-a450-b8ce7b77ee0f | debian-jessie-updater | "206c95-5c39b36cc655d-gzip" | 2021-05-31 12:44:59.367305+00 | vulnerability
               946 | 41568 | 15d4307d-01ef-4851-9f14-639d217c31a1 | debian-jessie-updater | "206c95-5c39b36cc655d-gzip" | 2021-05-31 13:14:59.441119+00 | vulnerability
               946 | 41543 | 5aba8d2d-acb4-433d-9da0-941bf5ccc89d | debian-jessie-updater | "206c95-5c39b36cc655d-gzip" | 2021-05-31 11:15:28.283515+00 | vulnerability
               946 | 41553 | 3b6db3cf-6b2c-4975-82e3-f84de1d4a7b5 | debian-jessie-updater | "206c95-5c39b36cc655d-gzip" | 2021-05-31 12:14:59.472253+00 | vulnerability
               946 | 41576 | b5f68513-5366-44c5-afd8-917f62b1b2e5 | debian-jessie-updater | "206c95-5c39b36cc655d-gzip" | 2021-05-31 13:44:59.429989+00 | vulnerability
               946 | 41533 | 58362dcb-1f11-433b-9aeb-33b4165f6713 | debian-jessie-updater | "206c95-5c39b36cc655d-gzip" | 2021-05-31 10:14:28.178319+00 | vulnerability
(8 rows)

Those are 8 update operations with the same fingerprint, and for each of them, we have 946 entries in the uo_vuln table. That leads me to the following questions:

  1. Is that kind of growth in the update_operation and uo_vuln tables expected by design, or the symptom of a bug?
  2. Regardless of whether it's by design or a bug, what would be a safe method for pruning those tables of old entries?
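
To gauge how widespread the duplication is, a grouping query like the one below can be used; it only relies on the update_operation columns visible in the output above:

-- Update operations that share a fingerprint, per updater, most duplicated first.
SELECT updater, fingerprint, COUNT(*) AS ops
FROM update_operation
GROUP BY updater, fingerprint
HAVING COUNT(*) > 1
ORDER BY ops DESC
LIMIT 20;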

Environment

  • Clair version/image: 4.1.1 (self-compiled from release tag on Alpine 3.13 with Go 1.16)
  • Clair client name/version: N/A
  • Host OS: Alpine in container, CoreOS on host
  • Kernel (e.g. uname -a): Linux $HOSTNAME 4.19.123-coreos #1 SMP Fri May 22 19:21:11 -00 2020 x86_64 Linux
  • Kubernetes version (use kubectl version): N/A
  • Network/Firewall setup: N/A

Can you share the non-sensitive parts of your Clair config?

The garbage collection should run to clean up older update operations (https://github.com/quay/clair/blob/main/Documentation/reference/config.md#matcherupdate_retention); the default is to retain 10 updates per updater. It sounds like it isn't running, so it is possibly a config issue; otherwise there should be logs in the matcher that point to why that garbage collection isn't happening.
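
A quick way to check whether that retention is actually being enforced (this only uses the update_operation table shown above) is to count operations per updater; with GC running, each updater should end up with roughly the retention count:

-- With GC working, each updater should have roughly "update_retention" rows here.
SELECT updater, COUNT(*) AS ops
FROM update_operation
GROUP BY updater
ORDER BY ops DESC;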

Can you share the non-sensitive parts of your Clair config?

This is the template for the config file. (Templating is mostly only used for inserting passwords at startup time.) Clair is invoked as clair -conf $CONFIG_PATH -mode combo.

Based on the documentation that you linked to, I would assume that GC should be active since I don't set matcher.update_retention to another value explicitly.

or there should be logs in the matcher that might point to why that garbage collection isn't happening.

The only thing in the logs that looks related to garbage collection is this (repeating every 20 minutes):

{"level":"info","component":"notifier/keymanager/Manager.gc","time":"2021-08-13T13:27:02Z","message":"gc starting"}
{"level":"info","component":"notifier/keymanager/Manager.gc","deleted":0,"time":"2021-08-13T13:27:02Z","message":"gc complete"}

But "notifier/keymanager" does not sound related to the GC task in question. Would it help to increase the log level? (I'm currently at info.)

Anyway, I plan to move forward with two things:

  1. I will set the matcher.update_retention value explicitly on my QA instance and check on Monday if that causes GC to run.
  2. If that does not help, I will start cleaning up old update operations and vulnerabilities manually, following the strategy of claircore/internal/vulnstore/postgres/gc.go (a rough sketch of such a query is shown below), in order to mitigate the current DB growth until the root cause of this issue is determined. I'm also considering decreasing the number of update operations by setting matcher.period to e.g. 1d instead of 30m.
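
For step 2, this is roughly the kind of statement I have in mind. It is only a sketch of what gc.go does: it assumes that uo_vuln rows referencing a deleted update_operation are removed via the foreign key (otherwise they need a separate DELETE first), and orphaned vuln rows would still have to be cleaned up afterwards:

-- Sketch only: keep the 2 newest update operations per updater and drop the rest.
-- Verify the foreign-key/cascade behavior on uo_vuln before running anything like this.
DELETE FROM update_operation
WHERE id IN (
    SELECT id
    FROM (
        SELECT id,
               ROW_NUMBER() OVER (PARTITION BY updater ORDER BY date DESC) AS rn
        FROM update_operation
    ) ranked
    WHERE rn > 2
);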

Okay, I have now set matcher.update_retention to 5 (instead of the default value, which is supposed to be 10), and I can see this in the log:

{"level":"info","component":"libvuln/updates/Manager.Run","retention":5,"time":"2021-08-13T13:54:13Z","message":"GC started"}
{"level":"info","component":"libvuln/updates/Manager.Run","remaining_ops":29497,"retention":5,"time":"2021-08-13T13:57:51Z","message":"GC completed"}

Thanks for sharing, @majewsky; let us know if the DB storage plateaus with the more aggressive garbage collection settings.

The GC has worked through the backlog by now, so I ran a full vacuum on the database:

$ df -h /postgresql # before vacuum
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdc        246G   99G  147G  41% /postgresql
$ df -h /postgresql # after vacuum
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdc        246G  8.9G  237G   4% /postgresql
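
For reference, a full vacuum is just the standard PostgreSQL command below (give or take VERBOSE or per-table options); it rewrites every table and takes an exclusive lock on each one while doing so, so it is best run in a maintenance window:

-- Rewrites tables on disk to return the space freed by the GC deletes to the OS.
-- Takes an ACCESS EXCLUSIVE lock per table; schedule accordingly.
VACUUM FULL;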

So yeah, GC plus a full vacuum is quite effective. The question remains why GC was not running at all.