ovis-hpc / ovis

OVIS/LDMS High Performance Computing monitoring, analysis, and visualization project.

Home Page: https://github.com/ovis-hpc/ovis-wiki/wiki

Kafka topic patterns should be per-schema

morrone opened this issue

Right now we can only set the Kafka topic name pattern once, when store_avro_kafka is configured. It would be much better to allow us to configure different topic patterns per-schema, per-decomposition, or maybe per-strgp.

A particular use case we have: data from some schemas we want to store in topic names that are independent across all clusters, while for other schemas we want all of the clusters to feed into topics of the same name.

This would be resolved by the multi-configuration plugin support. As a reminder, what is proposed is as follows:

    load name=plugin
    config name=plugin x=a y=b z=c
    load name=plugin as=plugin-1
    config name=plugin-1 x=d y=e z=f
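
Applied to the topic use case above, the proposal might look like the following sketch (the topic= option and its %S schema-name format specifier are assumptions for illustration, not confirmed store_avro_kafka syntax):

    # One instance writes per-cluster topic names; a second instance
    # writes cluster-independent topic names shared by all clusters.
    load name=store_avro_kafka
    config name=store_avro_kafka topic=cluster1-%S
    load name=store_avro_kafka as=kafka-global
    config name=kafka-global topic=%S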

We need this for streams subscription too. We are developing a generic "stream sampler" that consumes JSON and produces LDMS metric sets that can then be decomposed for storage. Consider that this sampler may want to consume from different streams and may need a different configuration for each stream, as you suggest.

If creating a stream-to-set transform plugin, please see also #1217 for the functionality needed to deal with JSON ambiguities.

Also, why are we creating a new keyword/usage "as=foo" when, for plugins and objects in general, we already have "name=foo" for object pointer resolution and "plugin=bar" for the object type/instantiator (which have been shorthanded in the v2/v3/v4 load implementation)?
Multiple instances of plugins should have different names, just like all our other object types.
Repeated 'load' commands for the same library just return quietly for already-loaded plugins.

Finally, it's not at all clear to me why I would want full-blown sets made out of streams (and the implication that downstream aggregators will see them and move them, with the timing and sizing issues that entails for user-derived streams). Are you proposing making pseudo-set objects for the purpose of the store plugin->store command and storage policy logic? Or are you proposing full-on set instances with all the inter-daemon work this entails?

@baallan are you quibbling with the syntax? If so, the reason is that using as=<name> allows us to be backward compatible with the existing load command syntax.

I suppose we could create a whole new command:

    new-command name=<config-name> plugin=<plugin-name>

Personally, I think that would be more confusing than adding as= to the existing command. We could certainly change as= to another keyword, maybe "config=" or something.

I think you are confused about load; it does not create a configuration object. That is what the proposed change would do.

I couldn't parse the remainder of your comment(s).

@morrone I'm wrong. The multi-config thing doesn't fix this at all. I think you need additional syntax to bind schemas to topic formats. Something like:

topic="":"","":""

You would need this kind of binding with or without multiple configurations per plugin.


@tom95858 re "I couldn't parse the remainder of your comments."

The remainder of my comments (syntax quibbling aside, you win there) could be:

If creating a stream-to-set transform plugin, please see also https://github.com/ovis-hpc/ovis/discussions/1217 for the functionality needed to deal with JSON ambiguities.
...

Finally, it's not at all clear to me why I would want full-blown sets made out of streams (and the implication that downstream aggregators will see them and move them, with the timing and sizing issues that entails for user-derived streams). Are you proposing making pseudo-set objects for the purpose of the store plugin->store command and storage policy logic? Or are you proposing full-on set instances with all the inter-daemon work this entails?

Are you saying the comments in #1217 are unparseable, or just the 'Finally' paragraph?

If just the 'Finally' paragraph didn't parse, then more directly about a "stream to set" plugin:
(a) The 'set' argument to the store plugin commit function appears to be unnecessary in any of the current store plugins that use the strgp and commit() interface instead of store() (store_csv excepted, which uses it to get at udata values). It appears that incoming stream messages could be decomposed into rows and committed through the strgp without any intermediate set instance being required. So why convert streams to sets? Streams have no udata.
(b) Currently we have a rule/implementation that sets are not visible to store plugins on their originating node; this would mean that I cannot convert a stream to a set on an aggregator and also have it stored there. It implies millions of transient-existence sets being handled across aggregation levels. So why convert streams to sets? Just create the arguments for commit(ldmsd_strgp_t strgp, NULL, ldmsd_row_list_t row_list, int row_count) and commit using plugins that support commit() at the storage destination.
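
To make that concrete, here is a minimal C sketch of the proposed call path; the stand-in types and the decompose_stream() helper are hypothetical placeholders for ldmsd internals, and only the commit() signature is taken from the paragraph above:

    /*
     * Hypothetical sketch: the types below are opaque stand-ins for
     * ldmsd internals; only the commit() signature comes from the
     * discussion above.
     */
    #include <stddef.h>

    typedef struct ldmsd_strgp *ldmsd_strgp_t;       /* storage policy */
    typedef struct ldmsd_row_list *ldmsd_row_list_t; /* decomposed rows */
    typedef struct ldms_set *ldms_set_t;             /* metric set */

    /* Signature as quoted above; the set argument may be NULL. */
    int commit(ldmsd_strgp_t strgp, ldms_set_t set,
               ldmsd_row_list_t row_list, int row_count);

    /* Hypothetical helper: decompose a JSON stream message into rows. */
    int decompose_stream(const char *json, ldmsd_row_list_t *rows,
                         int *row_count);

    /* Store a stream message without any intermediate set instance. */
    int store_stream_msg(ldmsd_strgp_t strgp, const char *json)
    {
        ldmsd_row_list_t rows;
        int count;
        int rc = decompose_stream(json, &rows, &count);
        if (rc)
            return rc;
        /* NULL set argument, per (a) and (b) above. */
        return commit(strgp, NULL, rows, count);
    }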

@baallan Are you talking about the generic JSON storage plugin? What does that have to do with the avro-kafka store?

@tom95858 yes i am referring to

We need this for streams subscription too. We are developing a generic "stream sampler" that consumes JSON and produces LDMS metric sets that can then be decomposed for storage. Consider that this sampler may want to consume from different streams and may need a different configuration for each stream, as you suggest.

@baallan, please have that discussion in another ticket.

Multi-config could almost solve it, because we can use different strgps with different regexes to map to the different store configs. But strgp regexes only match LDMS schemas, not decomposed schemas. We might want different topic patterns for each decomposed schema.
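
For illustration, a sketch of that strgp-based mapping, building on the earlier multi-config sketch (the regex= schema match on strgp_add and the other argument values are assumptions, not verified ldmsd syntax):

    # Each strgp routes LDMS schemas matching its regex to one of the
    # two configured store instances from the earlier sketch.
    strgp_add name=local_sg plugin=store_avro_kafka container=kafka regex=^meminfo$
    strgp_add name=global_sg plugin=kafka-global container=kafka regex=^slurm

Note the limitation stated above: every decomposed schema produced from one matching LDMS schema still lands in the same instance, and therefore gets the same topic pattern.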

In my ideal world, there would be a way in the decomposition configuration file to pass store-specific configuration options along to the store. For instance, under a row's mapping we might have an additional mapping named "store_options".

    { 
        "schema": "foo",
        "cols": [],
        "store_options": {
            "topic_pattern": ""
        }
    }

If we did it that way, we would have to come up with rules for adding store-specific options that don't conflict. Stores would just have to ignore options they don't recognize, which means we basically can't catch typos, since we can't throw errors on unrecognized options. So we could also get more verbose:

    {
        "schema": "foo",
        "cols": [],
        "store_options": {
            "store_avro_kafka": {
                "topic_pattern": ""
            }
        }
    }

We could also move "indices" under the new "store_options".
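
For example, a sketch combining the two (the "indices" entry shape is assumed to match the existing decomposition syntax, and store_sos is used only as a plausible second store):

    {
        "schema": "foo",
        "cols": [],
        "store_options": {
            "store_sos": {
                "indices": [
                    { "name": "time_job", "cols": ["timestamp", "job_id"] }
                ]
            },
            "store_avro_kafka": {
                "topic_pattern": ""
            }
        }
    }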