CVEProject / cve-schema

Summary: the conflicting x_generator use cases complicate use of popular search tools, with a risk of full data loss for some CVE Records

Some parts of the JSON 5 schema allow x_ fields, e.g.,

cve-schema/schema/v5.0/CVE_JSON_5.0_schema.json

Lines 556 to 557 in f8f54d5

    "patternProperties": {
        "^x_[^.]*$": {}

These can have different data types. The most common difference is that some CNAs use the x_generator field with the object data type, but other CNAs use the x_generator field with the string data type. Schema validation succeeds for either one, and this typically hasn't caused problems with publishing or retrieving CVE Records.
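As a rough illustration of why both forms pass validation, the sketch below applies the pattern from the schema excerpt to a few key names. The helper name `is_extension_key` and the sample keys `x_a.b` and `generator` are made up for this example. The empty schema `{}` attached to the pattern places no constraint on the value, so only the key name is checked.

```python
import re

# Pattern copied from the patternProperties excerpt above.
X_KEY = re.compile(r"^x_[^.]*$")

def is_extension_key(key: str) -> bool:
    """Return True if the key name matches the schema's x_ pattern."""
    return X_KEY.search(key) is not None

# x_a.b and generator are hypothetical keys, used only to show non-matches.
for key in ("x_generator", "x_legacyV4Record", "x_a.b", "generator"):
    print(key, is_extension_key(key))
```

Because the match depends only on the key, a string value and an object value under `x_generator` are treated identically by the schema.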

The problem is that some tools ingest sets of JSON documents and require each field to have the same data type in every document. They reject a document in which a field has a different data type than it had in earlier documents.
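The constraint can be checked offline before any ingest is attempted. The following is a minimal sketch, not the logic of any particular tool: it assumes that only the object versus non-object distinction matters, that arrays are transparent (their elements share the parent path), and that null values are skipped. The function name `find_type_conflicts` and the sample values are hypothetical.

```python
def _walk(value, path, first_kind, conflicts):
    """Record whether each dotted path holds an object or a scalar."""
    if value is None:
        return  # assumption: nulls do not fix a field's type
    if isinstance(value, list):
        # Assumption: array elements are indexed under the parent path.
        for item in value:
            _walk(item, path, first_kind, conflicts)
        return
    kind = "object" if isinstance(value, dict) else "scalar"
    if path:
        # The first document to use a path decides its kind.
        if first_kind.setdefault(path, kind) != kind:
            conflicts.add(path)
    if isinstance(value, dict):
        for key, child in value.items():
            _walk(child, f"{path}.{key}" if path else key, first_kind, conflicts)

def find_type_conflicts(documents):
    """Return the sorted dotted paths whose kind differs between documents."""
    first_kind, conflicts = {}, set()
    for doc in documents:
        _walk(doc, "", first_kind, conflicts)
    return sorted(conflicts)

docs = [
    {"containers": {"cna": {"x_generator": {"engine": "hypothetical-tool"}}}},
    {"containers": {"cna": {"x_generator": "hypothetical-tool 1.0"}}},
]
print(find_type_conflicts(docs))
```

Running such a check over a full download of CVE Records would list every path that needs attention, rather than discovering the conflicts one rejected document at a time.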

This matters because, in effect, many CVE Records go missing when people try the simplest approaches for loading CVE data into some of the most popular search applications. Capturing 100% of CVE Records often requires writing custom code, not just a configuration change. One example, out of several affected applications, is Elasticsearch: https://www.elastic.co/guide/en/elasticsearch/reference/current/ignore-malformed.html

Here, if an index contains one document with an x_generator object, and then there's an attempt to add a document with an x_generator string, the latter document is ignored because of an object mapping error. At present, anyone wishing to use CVE Record data with such a tool can delete the containers.cna.x_generator and containers.cna.x_legacyV4Record fields before ingest. No other fields are affected today.
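A minimal sketch of that workaround, assuming each CVE Record has already been parsed into a Python dict. The function name `strip_known_conflicts` and the sample record are hypothetical; only the two field names come from the text above.

```python
import copy

# The two fields named above as the only ones affected today.
FIELDS_TO_DROP = ("x_generator", "x_legacyV4Record")

def strip_known_conflicts(record: dict) -> dict:
    """Return a copy of a CVE Record without the two conflicting cna fields."""
    cleaned = copy.deepcopy(record)  # leave the caller's record untouched
    cna = cleaned.get("containers", {}).get("cna")
    if isinstance(cna, dict):
        for name in FIELDS_TO_DROP:
            cna.pop(name, None)  # no error if the field is absent
    return cleaned

# Hypothetical, heavily abbreviated record used only for illustration.
record = {"containers": {"cna": {"x_generator": "E1", "title": "example"}}}
print(strip_known_conflicts(record))
```

The cleaned copy is what would be sent to the search tool; the original record is kept so that no data is lost at the source.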

The issue is somewhat more general than the data types of the x_ fields themselves. Anything nested under an x_ field can also be affected. For example, if one CNA uses:

"x_generator": {"engine": {"application": "A1", "OS": "Linux"}}

then the issue ("object mapping failure for containers.cna.x_generator.engine") occurs when the next CNA uses:

"x_generator": {"engine": "E1"}

As discussed in the #217 issue, this situation can also occur even without x_ - for example, both of these are allowed by the CVE JSON 5 schema but cause such tools to report "object mapping failure for containers.cna.references.myUnofficialField" if one CNA used the first format and then another CNA used the second format:

"references": [{"myUnofficialField": {"a": "b"}, "url": "https://example.com/1"}]

"references": [{"myUnofficialField": "c", "url": "https://example.com/2"}]

Options include:

1. The QWG could set up a way for providers to register x_ field names and their associated data formats. Even though alternate data formats would comply with the JSON 5 schema, any provider using a conflicting data format would be in violation of the CVE Program rules. This means, for example, that because x_generator was first used with an object value, any provider using a string value is in violation. Similarly, unofficial fields without an x_ would be a violation.
2. The QWG could predict that interest in x_ fields will be low enough that a registry isn't needed. Instead, the Secretariat could advise CNAs about data-type conflicts over email.
3. Advise users of these tools to traverse the complete JSON document before ingest, deleting all x_ fields and all undefined fields. Similarly, advise providers to use only official fields for any data that's potentially of widespread interest. This is sufficient but would discourage innovation in which seeing x_ content is desirable in a full text search.
4. Advise users of these tools to develop complex error handling, e.g., when an object mapping failure actually occurs, automatically rewrite some or all of the affected documents, or automatically rename fields, to force data-type consistency. This is also sufficient but probably more expensive to implement.
5. Ignore the general problem, and simply advise the tool users to delete the containers.cna.x_generator and containers.cna.x_legacyV4Record fields before ingest.
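The traversal option (deleting all x_ fields before ingest) could look roughly like the sketch below. It covers only the x_ part: removing undefined fields that lack the x_ prefix would additionally require comparing each key against the schema, which is not attempted here. The function name `strip_x_fields` and the sample keys `x_note`, `affected`, and `product` in the usage line are chosen for illustration.

```python
import re

# Same pattern as the schema's patternProperties entry for x_ fields.
X_KEY = re.compile(r"^x_[^.]*$")

def strip_x_fields(value):
    """Return a copy of a parsed JSON value with every x_ key removed, at any depth."""
    if isinstance(value, dict):
        return {
            key: strip_x_fields(child)
            for key, child in value.items()
            if not X_KEY.search(key)
        }
    if isinstance(value, list):
        return [strip_x_fields(item) for item in value]
    return value  # strings, numbers, booleans, and null pass through unchanged

# Hypothetical, abbreviated record used only for illustration.
sample = {"containers": {"cna": {"x_generator": "E1", "affected": [{"x_note": 1, "product": "p"}]}}}
print(strip_x_fields(sample))
```

Because whole subtrees are dropped at the x_ key, nested conflicts such as the x_generator.engine case are removed along with their parent.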

This will be handled by the website team (likely solution: ignore x_generator during ES ingestion).
No schema changes required.

x_ usage interferes with some JSON applications