omnibor / spec

A draft standard for communicating a cryptographic record of build inputs for software artifacts.

Home Page: https://omnibor.io


Question: Using gitbom to report/remediate vulnerable dependencies?

imjasonh opened this issue · comments

From gitbom.dev#why?

By constructing a complete, concise, and verifiable artifact tree for every software artifact, GitBOM enables:

  • Run-time detection of potential vulnerabilities, regardless of the depth in a dependency tree from which that vulnerability originated
  • Post-exploit forensics
    ...

In short, it would let anyone easily answer the question, “Does this product contain log4j?”

From reading around the rest of the site though, it's unclear to me how folks anticipate realizing that vision.

It's still early days, so I expect this is probably just an open area of active discussion and design, and if so, I'd love to hear ideas!


My understanding is that gitbom aims to take the hashes of source files (and inputs in general, but typically at the leaves, ideally that's source files?), and concatenate them in a Git-like form into a string, which is then also hashed to produce the ID of the collected thing. If any input's contents change, its hash changes, so the hash of the concatenated data changes, and therefore the output hash changes. Like how a Git commit's ID changes based on changes to file contents, commit message, etc.
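To make sure I have the model right, here's a minimal sketch of that idea in Python. The blob-ID construction is Git's actual blob hashing; the document-ID part is just my illustration of "hash the sorted list of input hashes", not the exact OmniBOR/GitBOM serialization:

```python
import hashlib

def blob_id(data: bytes) -> str:
    """Git-style blob identifier: SHA-1 over a 'blob <len>\\0' header plus contents."""
    return hashlib.sha1(b"blob %d\x00" % len(data) + data).hexdigest()

def doc_id(input_blobs) -> str:
    """Illustrative GitBOM-style document ID: hash the sorted list of input blob IDs.

    NOTE: a sketch of the concept only, not the spec's actual serialization.
    """
    doc = "".join("blob %s\n" % blob_id(b) for b in sorted(input_blobs, key=blob_id))
    return blob_id(doc.encode())

a = b"int main() { return 0; }\n"
b_src = b'#include "util.h"\n'

id_before = doc_id([a, b_src])
# Changing any single input changes its blob ID, which changes the document ID.
id_after = doc_id([a + b"// comment\n", b_src])
```

So the final ID is fully determined by (and only by) the input contents, which is the property the site describes.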

All of this is ideally done transparently by build tooling (excellent, love it ❤️), and the final single gitbom ID is available alongside (inside?) the artifact.

My question is, what am I then intended to do with this gitbom hash to determine if it contains log4j? There's no way of telling whether some opaque hash a1b2c3... "contains" any particular other component. Is there some index of these IDs that I should consult?

To complicate things even further, "log4j" could mean lots of things -- presumably I'm trying to identify some vulnerable version of log4j, but there are also presumably any number of perfectly acceptable log4j versions. "Versions" isn't even well defined; ideally I'd depend on a specific official release, but I might carry patches, consume a released version from some intermediary that carries patches, or depend on an unreleased codebase from head. I know you're aware of all this, and gitbom's approach absolutely seems to make the problem of identifying versions less painful, since you don't really care about "versions", just inputs/source files. But it still makes it hard to tell whether my artifact contains vulnerable inputs, since vulnerability reporting still tends to think in terms of released version ranges (vulnerability introduced in v1.2.3, fixed in v1.2.6).

Is the idea that vulnerability reporting should switch to source-based reporting (vulnerability exists in source file with sha f9c1d3...), and gitbom would let me look up whether my artifact contains that source file? That quickly becomes infeasible, since trivial changes to the file (e.g., formatting, unrelated code changes) would change the hash without affecting the vulnerable code. A vulnerability report would have to report the hashes of vulnerable_code.h, vulnerable_code_with_one_trailing_whitespace.h, vulnerable_code_with_two_trailing_whitespaces.h, for every line, combinatorially, out to ~infinity. And that's just whitespace. Even more subtly, some other unrelated change could fix (or not!) the vulnerability, so every possibly-trivial change to the input would need to be inspected to tell whether it's vulnerable. If code is "vulnerable" when x == 5, then var x = 5 is vulnerable, as is var x = 2 + 3. But theoretically, due to compiler shenanigans, maybe one could be vulnerable while the other isn't.
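Concretely, here's the hash-sensitivity problem I mean, using Git's real blob hashing (the example source strings are obviously made up):

```python
import hashlib

def blob_id(data: bytes) -> str:
    """Git-style blob identifier: SHA-1 over a 'blob <len>\\0' header plus contents."""
    return hashlib.sha1(b"blob %d\x00" % len(data) + data).hexdigest()

vulnerable   = b"var x = 5\n"
same_meaning = b"var x = 2 + 3\n"   # semantically identical (if x is what matters)
trailing_ws  = b"var x = 5 \n"      # differs only by one trailing space

# All three blob IDs differ, even though two of the files mean the same thing,
# so a report keyed on a single exact hash misses the variants.
ids = {blob_id(vulnerable), blob_id(same_meaning), blob_id(trailing_ws)}
```

Exact-hash matching can only ever catch byte-for-byte identical inputs.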

I see in the bomsh repo an example of detecting log4j given some gitbom data, but I'm not sure I understand yet how it answers the questions above. I haven't had a chance to dig deeper into it; if the answer is "RTFM" I'll accept that. 😅

Anyway, at this point I'm very likely missing something about how this is supposed to work end-to-end, for both BOM generation and inspection. A bunch of smart folks are thinking about it, and I trust y'all to have come up with something that makes vulnerability reporting and remediation as simple as you're making BOM generation by putting it in build tools. Help educate me!

Thanks for the excellent summary on this issue.

Yes, in order to query CVE vulnerabilities for built artifacts, we need to consult a separate database (a CVE DB). The bomsh_create_cve.py script was developed to create such a CVE DB. For example, it can scan the official OpenSSL git repo and create a CVE DB covering all the CVE-relevant source files. The Fedora/CentOS distros have their own git repos, which apply additional patches on top of the official OpenSSL source tarballs, possibly creating new blob IDs that do not exist in the official OpenSSL git repo. For this scenario, bomsh_create_cve.py has been enhanced to scan the Fedora/CentOS git repos and evaluate all the new source-file blobs with the help of CVE checking rules. In addition, the bomsh_hook2.py script can evaluate all source files directly during gitBOM tree generation, and generates CVE metadata for every source file in the resulting gitBOM tree. Please see issue #11 for more details.
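The lookup side of that flow can be sketched as a simple intersection of an artifact's gitBOM tree blob IDs with the CVE DB. The DB schema and IDs below are illustrative placeholders, not bomsh's actual on-disk format:

```python
# Hypothetical CVE DB: maps the blob ID of a known-vulnerable source file
# to the CVEs it carries.  IDs here are fake, for illustration only.
CVE_DB = {
    "deadbeef" * 5: ["CVE-2021-44228"],
    "cafef00d" * 5: ["CVE-2022-0001", "CVE-2022-0002"],
}

def cves_in_artifact(artifact_tree_blob_ids):
    """Return {blob_id: [CVEs]} for every blob in the artifact's gitBOM tree
    that appears in the CVE DB."""
    return {bid: CVE_DB[bid] for bid in artifact_tree_blob_ids if bid in CVE_DB}
```

With a DB like this, "does this product contain log4j's vulnerable source?" reduces to a set-membership check over the artifact tree's leaf IDs.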

These CVE checking rules work in both scenarios because: 1. a new blob created from a back-ported CVE fix is very close to the baseline blob, so the rules still match; 2. if the source file has been refactored too heavily, the rules will not match, so the file is ignored in effect, and such heavily refactored files are indeed not vulnerable to these CVEs.
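To illustrate the idea (the rule format and patterns below are invented for this sketch, not bomsh's actual rule syntax): a rule can look for the vulnerable code pattern and the fix pattern in the blob's contents, so it keeps matching across the trivially different blobs that back-ported patches create, and simply doesn't fire on heavily refactored files:

```python
import re

# Hypothetical CVE checking rule: a blob is flagged vulnerable if the
# vulnerable pattern appears and the fix pattern does not.
RULE = {
    "cve": "CVE-2022-0000",                                   # placeholder CVE
    "vulnerable_pattern": re.compile(rb"memcpy\(buf, src, len\)"),
    "fixed_pattern": re.compile(rb"if \(len > sizeof\(buf\)\)"),
}

def evaluate_blob(source: bytes, rule=RULE) -> str:
    if rule["fixed_pattern"].search(source):
        return "fixed"        # back-ported fix present, even if blob ID is new
    if rule["vulnerable_pattern"].search(source):
        return "vulnerable"
    return "no-match"         # refactored beyond recognition: rule ignored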

Using OpenSSL as an example, CVE search via gitBOM has proven to be quite accurate.

BTW, I envision that the new way of searching for CVEs with gitBOM and the old way of searching for CVEs by version should co-exist. Each method has its advantages and disadvantages, and they complement each other.