quay / clair

Vulnerability Static Analysis for Containers

Home Page: https://quay.github.io/clair/



GC does not run unless "matcher.update_retention" is set explicitly

majewsky opened this issue

Description of Problem / Feature Request

Backstory: The database of my Clair instance is growing much faster than it used to. Unfortunately, I don't have metrics going back that far, so I cannot pinpoint when this started, but I set the instance up in February on a 50 GiB volume, and usage hovered around 40 GiB for most of that time (after the initial updater runs had completed and all vulnerabilities were in the database). In July, I got disk-full alerts and extended the volume from 50 GiB to 100 GiB. This week, just 6 weeks later, the 100 GiB volume was full yet again and I had to upsize it another time. So while I don't have exact numbers, growth has clearly been stronger in the last two months than before. I upgraded from 4.0.x to 4.1.0 at the beginning of June, so there is a timing correlation, but I cannot provide strong evidence one way or the other.

Expected Outcome

I would expect the DB size to grow slowly, but steadily (as new images get scanned and the updaters pull in new vulnerabilities).

Actual Outcome

As described above, the database is growing much faster than it used to. I had a look around in psql to see what is using all that storage: the sizes of most tables are rounding errors (including manifest and indexreport and such, since I only have about 25000 images in the DB at the moment). The vuln table takes up ~25% of all storage, at just shy of 10 million entries (which is a lot, but I don't think it's that unusual since I have a lot of updaters enabled). The remaining ~75% is occupied by the uo_vuln table:

clair=# SELECT COUNT(*) FROM uo_vuln;
   count
-----------
 723471358
(1 row)
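
A query along these lines gives the per-table breakdown (plain PostgreSQL catalog functions, nothing Clair-specific):

-- Total on-disk size per table (including indexes and TOAST), largest first.
SELECT relname AS table_name,
       pg_size_pretty(pg_total_relation_size(oid)) AS total_size
FROM pg_class
WHERE relkind = 'r'
ORDER BY pg_total_relation_size(oid) DESC
LIMIT 10;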

My understanding of this table is limited by what I can grasp from the database schema, but over 700 million entries seem excessive given that there are just 10 million vuln entries and about 30000 update_operation entries. There is apparently a ton of duplication going on here. For example, I grabbed one update operation's fingerprint at random and got this:

clair=# SELECT (SELECT COUNT(*) FROM uo_vuln WHERE uo = id) AS uo_vuln_row_count, * FROM update_operation WHERE fingerprint = '"206c95-5c39b36cc655d-gzip"';
 uo_vuln_row_count |  id   |                 ref                  |        updater        |         fingerprint         |             date              |     kind
-------------------+-------+--------------------------------------+-----------------------+-----------------------------+-------------------------------+---------------
               946 | 41537 | 243cab3d-397d-4367-9aab-b1058f884562 | debian-jessie-updater | "206c95-5c39b36cc655d-gzip" | 2021-05-31 10:44:59.386241+00 | vulnerability
               946 | 41547 | 1ccfad3e-c2d6-4afe-b309-88fdab6e245f | debian-jessie-updater | "206c95-5c39b36cc655d-gzip" | 2021-05-31 11:44:59.403178+00 | vulnerability
               946 | 41560 | 3dad7bda-c6d9-4993-a450-b8ce7b77ee0f | debian-jessie-updater | "206c95-5c39b36cc655d-gzip" | 2021-05-31 12:44:59.367305+00 | vulnerability
               946 | 41568 | 15d4307d-01ef-4851-9f14-639d217c31a1 | debian-jessie-updater | "206c95-5c39b36cc655d-gzip" | 2021-05-31 13:14:59.441119+00 | vulnerability
               946 | 41543 | 5aba8d2d-acb4-433d-9da0-941bf5ccc89d | debian-jessie-updater | "206c95-5c39b36cc655d-gzip" | 2021-05-31 11:15:28.283515+00 | vulnerability
               946 | 41553 | 3b6db3cf-6b2c-4975-82e3-f84de1d4a7b5 | debian-jessie-updater | "206c95-5c39b36cc655d-gzip" | 2021-05-31 12:14:59.472253+00 | vulnerability
               946 | 41576 | b5f68513-5366-44c5-afd8-917f62b1b2e5 | debian-jessie-updater | "206c95-5c39b36cc655d-gzip" | 2021-05-31 13:44:59.429989+00 | vulnerability
               946 | 41533 | 58362dcb-1f11-433b-9aeb-33b4165f6713 | debian-jessie-updater | "206c95-5c39b36cc655d-gzip" | 2021-05-31 10:14:28.178319+00 | vulnerability
(8 rows)

Those are 8 update operations with the same fingerprint, and for each of them, we have 946 entries in the uo_vuln table. That leads me to the following questions:

  1. Is that kind of growth in the update_operation and uo_vuln tables expected by design, or the symptom of a bug?
  2. Regardless of whether it's by design or a bug, what would be a safe method for pruning those tables of old entries?
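
To gauge how widespread the duplication is, a grouping query like the one below can be used; it only relies on the update_operation columns visible in the output above:

-- Update operations that share a fingerprint, per updater, most duplicated first.
SELECT updater, fingerprint, COUNT(*) AS ops
FROM update_operation
GROUP BY updater, fingerprint
HAVING COUNT(*) > 1
ORDER BY ops DESC
LIMIT 20;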

Environment

  • Clair version/image: 4.1.1 (self-compiled from release tag on Alpine 3.13 with Go 1.16)
  • Clair client name/version: N/A
  • Host OS: Alpine in container, CoreOS on host
  • Kernel (e.g. uname -a): Linux $HOSTNAME 4.19.123-coreos #1 SMP Fri May 22 19:21:11 -00 2020 x86_64 Linux
  • Kubernetes version (use kubectl version): N/A
  • Network/Firewall setup: N/A

Can you share the non-sensitive parts of your Clair config?

The garbage collection should run to clean up older update operations (https://github.com/quay/clair/blob/main/Documentation/reference/config.md#matcherupdate_retention); the default is to retain 10 updates per updater. It sounds like it isn't running, so it is possibly a config issue; otherwise there should be logs in the matcher that point to why that garbage collection isn't happening.
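
A quick way to check whether that retention is actually being enforced (this only uses the update_operation table shown above) is to count operations per updater; with GC running, each updater should end up with roughly the retention count:

-- With GC working, each updater should have roughly "update_retention" rows here.
SELECT updater, COUNT(*) AS ops
FROM update_operation
GROUP BY updater
ORDER BY ops DESC;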

Can you share the non-sensitive parts of your Clair config?

This is the template for the config file. (Templating is mostly only used for inserting passwords at startup time.) Clair is invoked as clair -conf $CONFIG_PATH -mode combo.

Based on the documentation that you linked to, I would assume that GC should be active since I don't set matcher.update_retention to another value explicitly.

or there should be logs in the matcher that might point to why that garbage collection isn't happening.

The only thing in the logs that looks related to garbage collection is this (repeating every 20 minutes):

{"level":"info","component":"notifier/keymanager/Manager.gc","time":"2021-08-13T13:27:02Z","message":"gc starting"}
{"level":"info","component":"notifier/keymanager/Manager.gc","deleted":0,"time":"2021-08-13T13:27:02Z","message":"gc complete"}

But "notifier/keymanager" does not sound related to the GC task in question. Would it help to increase the log level? (I'm currently at info.)

Anyway, I plan to move forward with two things:

  1. I will set the matcher.update_retention value explicitly on my QA instance and check on Monday if that causes GC to run.
  2. If that does not help, I will start cleaning up old update operations and vulnerabilities manually, following the strategy of claircore/internal/vulnstore/postgres/gc.go (a rough sketch of such a query is shown below), in order to mitigate the current DB growth until the root cause of this issue is determined. I'm also considering decreasing the number of update operations by setting matcher.period to e.g. 1d instead of 30m.
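
For step 2, this is roughly the kind of statement I have in mind. It is only a sketch of what gc.go does: it assumes that uo_vuln rows referencing a deleted update_operation are removed via the foreign key (otherwise they need a separate DELETE first), and orphaned vuln rows would still have to be cleaned up afterwards:

-- Sketch only: keep the 2 newest update operations per updater and drop the rest.
-- Verify the foreign-key/cascade behavior on uo_vuln before running anything like this.
DELETE FROM update_operation
WHERE id IN (
    SELECT id
    FROM (
        SELECT id,
               ROW_NUMBER() OVER (PARTITION BY updater ORDER BY date DESC) AS rn
        FROM update_operation
    ) ranked
    WHERE rn > 2
);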

Okay, I have now set matcher.update_retention to 5 (instead of the default value, which is supposed to be 10), and I can see this in the log:

{"level":"info","component":"libvuln/updates/Manager.Run","retention":5,"time":"2021-08-13T13:54:13Z","message":"GC started"}
{"level":"info","component":"libvuln/updates/Manager.Run","remaining_ops":29497,"retention":5,"time":"2021-08-13T13:57:51Z","message":"GC completed"}

Thanks for sharing, @majewsky; let us know if the DB storage plateaus with the more aggressive garbage collection settings.

The GC has worked through the backlog by now, so I ran a full vacuum on the database:

$ df -h /postgresql # before vacuum
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdc        246G   99G  147G  41% /postgresql
$ df -h /postgresql # after vacuum
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdc        246G  8.9G  237G   4% /postgresql
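
For reference, a full vacuum is just the standard PostgreSQL command below (give or take VERBOSE or per-table options); it rewrites every table and takes an exclusive lock on each one while doing so, so it is best run in a maintenance window:

-- Rewrites tables on disk to return the space freed by the GC deletes to the OS.
-- Takes an ACCESS EXCLUSIVE lock per table; schedule accordingly.
VACUUM FULL;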

So yeah, GC plus a full vacuum is quite effective. The question remains why GC was not running at all.