hpgrahsl / kafka-connect-mongodb

**Unofficial / Community** Kafka Connect MongoDB Sink Connector -> integrated in 2019 into the official MongoDB Kafka Connector here: https://www.mongodb.com/kafka-connector

Add a write model strategy to inject Avro schema name and version when Schema Registry is used

nazr opened this issue

commented

When Avro/Schema Registry is used, the information used to deserialize a message is lost.

While the resulting document structure still reflects the original Avro schema, it would be extremely useful to be able to track it back to that schema even after the message is sunk into MongoDB.

Hi @nazr again :) Could you maybe elaborate a bit on your specific use case? Typical scenarios I've seen so far need to get the correct data according to the defined Avro schema into the sink, i.e. the MongoDB collection. So far no one has expressed a similar need to yours, but you could give me a concrete example of how you want the resulting MongoDB document to look for what you have in mind. Irrespective of that, it cannot be done with a specific write model strategy, since the write models are built just before the actual writing of the data, which at that point consists only of the converted and post-processed BsonDocuments. This means the Avro schema isn't available at this stage any longer. It might be doable earlier in the pipeline though.
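To illustrate why: a write model strategy only ever sees the already-converted BSON. A minimal sketch (method and field names are illustrative, not the connector's exact API):

```java
import com.mongodb.client.model.InsertOneModel;
import com.mongodb.client.model.WriteModel;
import org.bson.BsonDocument;

public class SchemaVersionIllustration {

    // By the time a write model is created, only the converted and post-processed
    // BsonDocument is available -- the original Avro schema and its registry id
    // are no longer part of the data handed to this stage.
    public WriteModel<BsonDocument> createWriteModel(BsonDocument valueDoc) {
        // A field like "schemaVersionId" (hypothetical name) would only exist here
        // if a converter, SMT, or post processor had injected it earlier.
        return new InsertOneModel<>(valueDoc);
    }
}
```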

commented

In general, if you think about it, the Avro schema name/version is a piece of meta information that gets lost by the sink connector. You can easily imagine a requirement where, once data is written into MongoDB, it needs to be retrieved/processed differently based on the message schema. If nothing else, two messages based on two different schemas would result in different document structures.

In our case, we have extra metadata defined for each Avro schema; it's used to enrich the MongoDB historical view. A workaround is to create extra fields and make producers populate them with the Avro schema name/version, which duplicates the information and makes it more error prone.
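To make that workaround concrete, a rough sketch of what producers would have to do today (the field names are made up for illustration and would have to exist in the value schema):

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;

public class SchemaStampWorkaround {

    // Duplicates the schema identity into ordinary data fields so it survives the
    // sink into MongoDB. The version number has to be kept in sync by hand, which
    // is exactly the error-prone part mentioned above.
    public static void stampSchemaIdentity(GenericRecord record, Schema valueSchema, int version) {
        record.put("schemaName", valueSchema.getFullName()); // hypothetical field
        record.put("schemaVersion", version);                // hypothetical field
    }
}
```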

Would you see this metadata being copied into the actual document that is persisted in MongoDB? If so, how would this happen? As a "header" or "schema" complex object?

commented

This metadata is stored separately and joined on schema name/version on retrieval from MongoDB.

Stored separately in MongoDB, or elsewhere?

commented

In our case it's stored in a separate MongoDB collection at the moment.

hi again @nazr + @ryancrawcour

So after investigating this a bit further, I've come to the conclusion that as a first step it would probably be best to just encode the version id of the Avro schema (as it is identified in the schema registry) into the document structure. Based on that, you can run a separate sink connector instance that streams all records from the _schema topic itself into a separate MongoDB collection. Then, anytime you want, you can join the actual data-bearing collection with the schema collection to bring data + schema together for any purpose you want.

What do you think? Would that work for you guys?
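For the join step, something along these lines should work once the schema version id is in each document and the schema records land in their own collection. A sketch using the MongoDB Java driver; the collection and field names ("events", "schemas", "schemaVersionId", "versionId") are placeholders, not anything the connector produces out of the box:

```java
import static com.mongodb.client.model.Aggregates.lookup;
import static com.mongodb.client.model.Aggregates.unwind;

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import java.util.Arrays;

public class JoinDataWithSchema {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> events =
                    client.getDatabase("sinkdb").getCollection("events");

            // $lookup joins each data document with the schema document that was
            // streamed from the schema topic into its own collection, matching on
            // the schema version id embedded in the event.
            events.aggregate(Arrays.asList(
                    lookup("schemas", "schemaVersionId", "versionId", "schemaMeta"),
                    unwind("$schemaMeta")
            )).forEach(doc -> System.out.println(doc.toJson()));
        }
    }
}
```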

commented

Hi @hpgrahsl, I didn't realise there was a _schema topic (it seems it's _schemas actually), thanks for pointing that out. Yes, that should work for us.

Thx for your feedback @nazr! You may have seen that there is an official mongo-kafka connector, the sink part of which is a repackaged and slightly modified version of this one but otherwise the same. Most likely this feature will be implemented there (as well).