jetstack / tally


Remove dependence on BigQuery

ribbybibby opened this issue

BigQuery is a barrier to entry for a couple of reasons:

  1. It requires the user to have a Google Cloud account, which not everyone does, of course.
  2. BigQuery costs money, so it requires the user to pay. It can actually get quite expensive if you're processing more than a handful of BOMs with any regularity.

I think we can follow the example of trivy and grype here and regularly publish the parts of the datasets we need as a sqlite/bolt database.

If the compressed size is acceptable then tally can fetch it when it first runs and then only update it when a new db is published.
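A minimal sketch of that "only update when a new db is published" check, assuming the digest of the published artifact is cached next to the database (the cache path and digest value here are placeholders, not tally's actual layout):

```shell
#!/bin/sh
# Hypothetical sketch: compare a cached digest against the digest of the
# published database and only download when they differ. In practice the
# remote digest would come from the registry manifest; here it is a
# placeholder value, and the cache path is an assumption.
cache_dir="${TMPDIR:-/tmp}/tally-db-sketch"
mkdir -p "$cache_dir"

remote_digest="sha256:abc123"        # placeholder for the published manifest digest
digest_file="$cache_dir/digest"

if [ -f "$digest_file" ] && [ "$(cat "$digest_file")" = "$remote_digest" ]; then
  echo "database is up to date"
else
  echo "pulling new database"        # this is where tally would fetch and unpack the db
  printf '%s' "$remote_digest" > "$digest_file"
fi
```

On the first run the digest file doesn't exist, so the database is fetched; on subsequent runs nothing happens until the published digest changes.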

The design I have in my head at the moment is:

  • Add a tally db create command that will create and populate the database.
  • Add a tally db push command that pushes the database up to an OCI registry (i.e. ghcr.io/jetstack/tally/db).
  • Run those two commands every 7 days or so in a GitHub Actions workflow.
  • Modify the root tally command to pull the database when it first runs.
  • Modify the root tally command to check for database updates.
  • Use the sqlite database rather than BigQuery to find results.
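The scheduled job could look something like the workflow below. This is a sketch only, assuming tally and oras are available on the runner; the workflow name, cron expression, and paths are all guesses, not the repository's actual workflow:

```yaml
# Hypothetical sketch of the scheduled publish job; names, paths and flags
# are assumptions.
name: publish-db
on:
  schedule:
    - cron: "0 0 */7 * *"   # roughly every 7 days
jobs:
  publish:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Create the database
        run: tally db create                 # assumes tally is on the PATH
      - name: Push it to the OCI registry
        run: |                               # assumes the runner is logged in to ghcr.io
          gzip -k ~/.cache/tally/db/tally.db
          oras push ghcr.io/jetstack/tally/db:latest \
            ~/.cache/tally/db/tally.db.gz
```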

I had a quick go at this and the size compares pretty favourably to grype:

# tally db uncompressed and compressed
$ du -hs /tmp/tally.db
532M    /tmp/tally.db
$ du -hs /tmp/tally.db.gz
127M    /tmp/tally.db.gz

# grype db uncompressed and compressed
$ du -hs ~/.cache/grype/db/4/vulnerability.db
810M    /home/ribbybibby/.cache/grype/db/4/vulnerability.db
$ du -hs ~/.cache/grype/db/4/vulnerability.db.gz
107M    /home/ribbybibby/.cache/grype/db/4/vulnerability.db.gz
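The same du/gzip comparison can be reproduced on any candidate database. A throwaway sketch with a synthetic file (the path is arbitrary, and a file of zeros compresses far better than real data would):

```shell
# Sketch of the size comparison above on a synthetic file; a real sqlite
# database will compress far less well than this run of zeros does.
f="${TMPDIR:-/tmp}/tally-size-check.db"
dd if=/dev/zero of="$f" bs=1024 count=1024 2>/dev/null   # 1 MiB sample file
gzip -kf "$f"                                            # keep the original, write $f.gz
du -h "$f" "$f.gz"
```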

There are probably some optimisations we could make (dropping unused columns or indexes, running VACUUM before publishing) to get the sizes down a bit more too.


This is now done:

  • There's a tally db create command which creates the database in .cache/tally/db
  • A GitHub Actions workflow runs every 7 days that:
    • Generates the database with db create
    • Pushes the database to ghcr.io/jetstack/tally/db:latest with oras
  • The tally command checks the latest tag for an updated database and pulls it down if there is one
  • There's also a tally db pull command that does the same
  • The tally command uses the pulled database to generate results