hpgrahsl / kafka-connect-mongodb

**Unofficial / Community** Kafka Connect MongoDB Sink Connector -> integrated in 2019 into the official MongoDB Kafka Connector here: https://www.mongodb.com/kafka-connector

Add a write model strategy to inject Avro schema name and version when Schema Registry is used

nazr opened this issue

commented

When Avro/Schema Registry is used, the information used to deserialize a message is lost.

While the resulting document structure still reflects the original Avro schema, it would be extremely useful to be able to track it back to that schema even after the message is sunk into MongoDB.

Hi @nazr again :) Could you maybe elaborate a bit on your specific use case? Typical scenarios I've seen so far need to get the correct data according to the defined Avro schema into the sink, i.e. the MongoDB collection. So far no one has expressed a similar need to yours, but you could give me a concrete example of how you want the resulting MongoDB document to look for what you have in mind. Irrespective of that, it cannot be done with a specific write model strategy, since the write models are built just before the actual writing of the data, which at that point consists only of the converted and post-processed BsonDocuments. This means the Avro schema isn't available at this stage any longer. It might be doable earlier in the pipeline though.
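To illustrate why: a write model strategy only ever sees the already-converted BSON. A minimal sketch (method and field names are illustrative, not the connector's exact API):

```java
import com.mongodb.client.model.InsertOneModel;
import com.mongodb.client.model.WriteModel;
import org.bson.BsonDocument;

public class SchemaVersionIllustration {

    // By the time a write model is created, only the converted and post-processed
    // BsonDocument is available -- the original Avro schema and its registry id
    // are no longer part of the data handed to this stage.
    public WriteModel<BsonDocument> createWriteModel(BsonDocument valueDoc) {
        // A field like "schemaVersionId" (hypothetical name) would only exist here
        // if a converter, SMT, or post processor had injected it earlier.
        return new InsertOneModel<>(valueDoc);
    }
}
```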

commented

In general, if you think about it, the Avro schema name/version is a piece of meta information that gets lost by the sink connector. You can easily imagine a requirement where, once data is written into MongoDB, it needs to be retrieved/processed differently based on the message schema. If nothing else, two messages based on two different schemas would result in different document structures.

In our case, we have extra metadata defined for each Avro schema; it's used to enrich the MongoDB historical view. A workaround is to create extra fields and make producers populate them with the Avro schema name/version, which duplicates the information and makes it more error prone.
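To make that workaround concrete, a rough sketch of what producers would have to do today (the field names are made up for illustration and would have to exist in the value schema):

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;

public class SchemaStampWorkaround {

    // Duplicates the schema identity into ordinary data fields so it survives the
    // sink into MongoDB. The version number has to be kept in sync by hand, which
    // is exactly the error-prone part mentioned above.
    public static void stampSchemaIdentity(GenericRecord record, Schema valueSchema, int version) {
        record.put("schemaName", valueSchema.getFullName()); // hypothetical field
        record.put("schemaVersion", version);                // hypothetical field
    }
}
```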

Would you see this metadata being copied into the actual document that is persisted in MongoDB? If so, how would this happen? As a "header" or "schema" complex object?

commented

This metadata is stored separately and joined on schema name/version on retrieval from MongoDB.

Stored separately in MongoDB, or elsewhere?

commented

In our case it's stored in a separate MongoDB collection at the moment.

hi again @nazr + @ryancrawcour

So after investigating this a bit further, I've come to the conclusion that as a first step it would probably be best to just encode the version id of the Avro schema (as it is identified in the schema registry) into the document structure. Based on that, you can run a separate sink connector instance that streams all records from the _schema topic itself into a separate MongoDB collection. Then, anytime you want, you can join the actual data-bearing collection with the schema collection to bring data + schema together for any purpose you want.

What do you think? Would that work for you guys?
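For the join step, something along these lines should work once the schema version id is in each document and the schema records land in their own collection. A sketch using the MongoDB Java driver; the collection and field names ("events", "schemas", "schemaVersionId", "versionId") are placeholders, not anything the connector produces out of the box:

```java
import static com.mongodb.client.model.Aggregates.lookup;
import static com.mongodb.client.model.Aggregates.unwind;

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import java.util.Arrays;

public class JoinDataWithSchema {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> events =
                    client.getDatabase("sinkdb").getCollection("events");

            // $lookup joins each data document with the schema document that was
            // streamed from the schema topic into its own collection, matching on
            // the schema version id embedded in the event.
            events.aggregate(Arrays.asList(
                    lookup("schemas", "schemaVersionId", "versionId", "schemaMeta"),
                    unwind("$schemaMeta")
            )).forEach(doc -> System.out.println(doc.toJson()));
        }
    }
}
```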

commented

Hi @hpgrahsl, I didn't realise there was a _schema topic (it seems it's _schemas actually), thanks for pointing that out. Yes, that should work for us.

Thx for your feedback @nazr! You may have seen that there is an official mongo-kafka connector, the sink part of which is a repackaged and slightly modified version of this one but otherwise the same. Most likely this feature will be implemented there (as well).