ovis-hpc / ovis

OVIS/LDMS High Performance Computing monitoring, analysis, and visualization project.

Home Page: https://github.com/ovis-hpc/ovis-wiki/wiki

Many decomposition fields should be optional

morrone opened this issue · comments

The decomposition json format is very verbose, and I think that many of the fields should be made optional.

In static decomposition:

  • dst: If "dst" is omitted, we can simply default to using the same name as "src". If both "src" and "rec_member" are provided, default to the rec_member name (or maybe "<src>.<rec_member>").
  • type/array_len: LDMS already knows the data type. Making humans manually transcribe it is just asking for trouble. As the documentation is written, it is not clear that these should ever be needed. But assuming they have some undocumented role, if those fields are omitted they should just default to the types that LDMS already knows. (A sketch of what this would mean follows this list.)
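
For example, a minimal sketch of the idea (assuming MemFree is a u64 scalar in the source set; the names here are purely illustrative): a column that today has to be written as

    { "src": "MemFree", "dst": "MemFree", "type": "u64", "array_len": 1 }

could shrink to just

    { "src": "MemFree" }

with "dst", "type", and "array_len" filled in from what LDMS already knows about the metric.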

That will probably cover 99% of use cases.

Another usability issue with ldms and decomposition is that "ldms_ls" doesn't show the data type for any of the fields in a record in a list (and maybe not for anything in a list at all). So users are expected to guess the type, or maybe read the source code?

"type" definitely needs to have a sane default, and not be a required field.

@narategithub what is the purpose of the "type" attribute of the decomposition list? @morrone points out that the LDMS type of the source set could be used to infer the type of the destination row.

@tom95858

I think it was driven by the heterogeneous meminfo use case (i.e. when meminfo schemas do not look the same across all meminfo sets due to heterogeneous system architecture).

For static decomposition, the idea is that the user statically defines the storage schema so that only the selected metrics that the user cares about are stored. The type has to be there too because otherwise, with a heterogeneous LDMS schema (like meminfo), the storage schema information may be incomplete, since any one LDMS schema may not have all the metrics that the user cares about. And I think @tom95858 shares @morrone's concern that leaving a huge configuration effort to the user is not good. I think this is one of the reasons why @tom95858 mentioned having a full static decomposition configuration file for each of the samplers we have as a starting point. Then the user can take those files and remove the metrics he/she does not want.
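
A minimal sketch of the kind of column this implies (the metric name and fill value are illustrative assumptions): a metric that is present on only some systems has to carry an explicit type so the storage row can still be built when the metric is absent, for example

    { "src": "HugePages_Total", "dst": "HugePages_Total", "type": "u64", "fill": 0 }

where "fill" supplies the value to store when the set does not contain the metric.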

The as_is decomposition does not require type because it takes the set as-is and gets the type from the LDMS set schema. To avoid heterogeneous schema name collisions (e.g. the same meminfo schema name but different collected metrics due to different system architectures), as_is decomposition appends a short hash after the schema name, e.g. meminfo-a7b958.

Maybe we want something in-between? Like as_is, but able to filter down to only the metrics that the user wants?

> For static decomposition, the idea is that the user statically defines the storage schema so that only the selected metrics that the user cares about are stored.

Are we specifying the storage's "schema" (data type)? It looks to me like we are specifying the source data type in all of the examples. And in the case of "store_csv", everything is a string in the "storage schema", but I still need to give u64/d64 source data types.

I think the "type" is the source type, correct?

> The type has to be there too because otherwise in the heterogeneous LDMS schema (like meminfo) the storage schema information may be incomplete as an LDMS schema may not have all the metrics that the user cares about.

Ah, so this tells me that the type is the source data type in the ldms metric set. It is not telling the end store what type to use in the end storage (e.g. database data type).

But keep in mind that the missing data type is the exception, not the rule. You can throw a warning and do nothing if you don't know the type, and that will let the user know they absolutely need to provide the type for that field. It is best in that case to check all of the fields, warning on any that have a missing type, so the user can fill them in all at once rather than iterating one-by-one.

In 99% of the cases I do not think that I am going to need that.

> I think this is one of the reasons why @tom95858 mentioned having a full static decomposition configuration file for each of the samplers we have as a starting point.

As a user who needs to configure ldms, I think that would be a very unpleasant approach as well. It isn't even a very feasible approach for you, since you probably don't have every architecture and every piece of hardware supported by the samplers. And even if you did, there is no way I want to have to manually edit a file with hundreds of fields that I'm not using (some samplers literally offer hundreds of possible fields).

Using sane defaults is the only practical way that I can think to do this. Any approach that requires me to configure ldms and log in to every architecture in my center as a prerequisite to configuring ldms presents me with a frustrating chicken-and-egg situation.

Yes, the type is the LDMS data type. static decomposition needs it for the reason described above. I have a hard time coming up with the "sane default" for static decomposition. The idea of the static decomposition is that you know exactly what you want. No guessing. If the complete list for static decomposition is provided, you can just use it and there is no need to log in to every architecture in the cluster, right?

Or, maybe use as_is decomposition to store everything.

For down-selecting metrics, I think we can extend as_is decomposition to receive a list of the metrics that the user wants, and the as_is decomposition will create rows containing the metrics that appear in both the LDMS schema and the given list. And we probably want to change the name of this decomposition altogether with this down-select capability, if that makes sense.
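
To illustrate the idea, a hypothetical sketch only (the "metrics" attribute does not exist today, and the surrounding as_is attributes are from memory and may not be exact):

    "meminfo": {
      "type": "as_is",
      "metrics": [ "MemFree", "MemAvailable", "MemTotal" ],
      "indices": [
        { "name": "time_comp", "cols": [ "timestamp", "component_id" ] }
      ]
    }

Only the listed metrics that actually exist in a given set's schema would end up in the generated rows.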

The "chicken and egg" problem, precisely in the cases of meminfo and vmstat and papi and high speed networking, is impossible to avoid. Putting such a burden on the sampler writer is also wrong, because many samplers are dealing with data that are not well-defined at the time the sampler is written.

Rather than manually curating some batch of example files that can never be up to date planetarily, it would be a trivial matter to add an option to ldmsd_controller or (better) ldms_ls that discovers the unique schemas within all existent sets and dumps the "default" static mapping for each. Such a dumper might also annotate metrics which are defined as 'meta' in ldms in some way (these tend to turn into index-related fields).

I agree with Chris that reviewing and editing schema dumps is potentially a time sink (order hours/year), but I don't see a way around it unless we changed to a pipeline that is entirely schema-less from the administrators' points of view.

@morrone, I played with some of the ideas you mentioned and have this implemented for static:

    "meminfo.*": {
      "type": "static",
      "rows": [
        {
          "schema": "meminfo_tom",
          "cols": [
            { "src":"timestamp" },
            { "src":"producer" },
            { "src":"instance" },
            { "src":"component_id" },
            { "src":"job_id" },
            { "src":"MemFree" },
            { "src":"MemAvailable" },
            { "src":"Active" },
            { "src":"MemTotal" }
          ],
          "indices": [
            { "name":"job_comp_time", "cols":["job_id", "component_id", "timestamp"] },
            { "name":"timestamp", "cols":["timestamp"] }
          ]
        }
      ]
    },
...

"dst" name defaults to "src", "type" defaults to the "src" metric's type. "array_len" defaults to the metric's array len and so on.

As @narategithub points out, if you're trying to use "fill", this won't work because "fill" stuffs values into the row when the metric is not present in the set. Obviously, if the metric's not present, you can't inherit its type, etc...

In any event, is this more in line with what you are thinking?

> @morrone, I played with some of the ideas you mentioned and have this implemented for static:
> [cut]
> In any event, is this more in line with what you are thinking?

Yes, that is great!

I had another thought, which could perhaps be split into a separate ticket if we like.

It would probably be a good thing to add a field to each column to express whether or not that field is optional. The default value would be that the field is required.

When creating these column entries, a person already needs to think about which fields they expect to always be there, and which are optional (sometimes there, sometimes not), so I don't expect that to be much of a burden.

Currently, in effect, all fields are optional. If a human makes a typo in a "src" name, ldms currently interprets that as an optional field and adds it to the data. I think most of us would prefer to get a warning that the field isn't valid when we are expecting the field to always be there, rather than have a typo-ed field inserted into the schema. It would also alert us when fields we genuinely expect to be there disappear for some reason, and give us a sign that we need to address it.
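
As a rough sketch (the attribute name "optional" here is just a placeholder, not an existing option), a column list might then look like:

    { "src": "MemTotal" },
    { "src": "HugePages_Total", "optional": true }

Columns without the attribute would be treated as required, so a typo in "MemTotal" would produce a warning instead of silently adding a bogus column, while "HugePages_Total" could come and go without complaint.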

Something to consider at least.

Oh, what happens when "rec_member" is also supplied? Honestly, I would prefer that the rec_member name be the "dst" name any time there isn't a name conflict (as opposed to prepending "src" to the "rec_member" name in some way).

> it would be a trivial matter to add an option to ldmsd_controller or (better) ldms_ls that discovers the unique schemas within all existent sets and dumps the "default" static mapping for each.

@baallan, that is none too trivial either, when a number of samplers need to be running on the target hardware to probe available values before their schemas can be determined, and the schemas are also influenced by configuration values.

It is a tricky problem at the moment.

@morrone, yeah, the "maybe" indication is a good point. Maybe not another tag but just an annotation like "@src" : "could_be_there", or some such.

I'm also fixing up the parser error handling so that we'll get errors with line numbers, column numbers and source text in the output. Something like this:

Expected a ':' at line 421, column: 24
    { "src" "biffle", ... },
              ^

I don't know what to do about comments, other than:

     "__comment__" : "This is a comment...yuck",

@morrone, I am open to simplifying the "rec_member" syntax. A problem is that LDMS is very permissive with metric names. But something like this would be more compact.
replace this:

{ "src" : "netdev_list", "rec_member" : "rx_packets", ... }

with this:

{ "src" : "netdev_list[rx_packets]", ... }

{ "src" : "netdev_list", "rec_member" : "rx_packets", ... }

I'm not opposed to that, but what I was getting at was more about what to do with "dst" in the default case where "dst" is not supplied. The as_is plugin would combine the above into a dst name of "netdev_list.rx_packets", whereas in the vast majority of cases I would prefer the dst name to be "rx_packets". I could see that being a top-level option that applies to all columns if we don't all agree on which is preferable (it may be largely subjective). Granted, if only the rec_member string is used for dst, there is a possibility that it conflicts, but the code could just throw a warning/error if the human hasn't addressed the conflict.
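
For illustration only (the "rec_member_dst" option and its values are entirely made up to show the shape of the idea; nothing like it exists today), such a top-level option might look like:

    "netdev.*": {
      "type": "static",
      "rec_member_dst": "member_only",
      "rows": [
        {
          "schema": "netdev_rates",
          "cols": [
            { "src": "netdev_list", "rec_member": "rx_packets" },
            { "src": "netdev_list", "rec_member": "tx_packets" }
          ]
        }
      ]
    }

Here "member_only" would yield dst names "rx_packets" and "tx_packets", while something like "src_dot_member" would yield "netdev_list.rx_packets" and "netdev_list.tx_packets".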