hpgrahsl / kafka-connect-mongodb

**Unofficial / Community** Kafka Connect MongoDB Sink Connector -> integrated in 2019 into the official MongoDB Kafka Connector: https://www.mongodb.com/kafka-connector

Unable to delete records using "mongodb.delete.on.null.values" property

limadelrey opened this issue

Hey,

First of all, let me thank you for this brilliant connector. I've been considering using it for a while, so as a first use case I decided to build a data pipeline that captures change data from a PostgreSQL database (source) and replicates that content into a MongoDB database (sink).

I've managed to capture change data by using the debezium/debezium-connector-postgresql:0.9.4 connector and some simple configurations. It records every create, read, update and delete operation. This is an example of a read event saved in Kafka:

{ "topic": "ics.i2s.concept", "key": { "id": "c3d92200-07eb-4f23-abef-07bf63eff435" }, "value": { "before": null, "after": { "ics.i2s.concept.Value": { "id": "c3d92200-07eb-4f23-abef-07bf63eff435", "id_trans_label": "1d88ce47-40e8-4505-a5e6-8404429f578b", "id_classification_type": "subject of insurance" } }, "source": { "version": { "string": "0.8.3.Final" }, "name": "ics", "db": "ics", "ts_usec": { "long": 1560350975241000 }, "txId": { "long": 1312 }, "lsn": { "long": 77773136 }, "schema": { "string": "i2s" }, "table": { "string": "concept" }, "snapshot": { "boolean": true }, "last_snapshot_record": { "boolean": false } }, "op": "r", "ts_ms": { "long": 1560350975241 } }, "partition": 0, "offset": 0 }

In order to sink this very same content into MongoDB I'm using the hpgrahsl/kafka-connect-mongodb:1.3.1 connector and I've applied the following properties:

name=mongodb-sink-test
connector.class=at.grahsl.kafka.connect.mongodb.MongoDbSinkConnector
topics=ics.i2s.concept
tasks.max=1
mongodb.connection.uri=mongodb://root:example@db-functions:27017/ics?authSource=admin
mongodb.collection=test
transforms=unwrap,RenameField,ReplaceField

# Unwraps the CDC event in order to sink only the "after" object inside the event's value
transforms.unwrap.type=io.debezium.transforms.UnwrapFromEnvelope

# Changes the key field name from "id" to "_id"
transforms.RenameField.type=org.apache.kafka.connect.transforms.ReplaceField$Key
transforms.RenameField.renames=id:_id

# Removes the value field "id" in order to avoid duplicate data
transforms.ReplaceField.type=org.apache.kafka.connect.transforms.ReplaceField$Value
transforms.ReplaceField.blacklist=id

# Properties necessary to delete documents in MongoDB when receiving a delete event
mongodb.document.id.strategy=at.grahsl.kafka.connect.mongodb.processor.id.strategy.ProvidedInKeyStrategy
mongodb.delete.on.null.values=true
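
(In case it helps with reproducing this: I'm listing the configuration in properties form, but when running Connect in distributed mode the same settings can be POSTed to the worker's REST API, e.g. http://localhost:8083/connectors with the host/port adjusted to your setup. The equivalent JSON payload would simply be a 1:1 translation of the properties above:)

{
    "name": "mongodb-sink-test",
    "config": {
        "connector.class": "at.grahsl.kafka.connect.mongodb.MongoDbSinkConnector",
        "topics": "ics.i2s.concept",
        "tasks.max": "1",
        "mongodb.connection.uri": "mongodb://root:example@db-functions:27017/ics?authSource=admin",
        "mongodb.collection": "test",
        "transforms": "unwrap,RenameField,ReplaceField",
        "transforms.unwrap.type": "io.debezium.transforms.UnwrapFromEnvelope",
        "transforms.RenameField.type": "org.apache.kafka.connect.transforms.ReplaceField$Key",
        "transforms.RenameField.renames": "id:_id",
        "transforms.ReplaceField.type": "org.apache.kafka.connect.transforms.ReplaceField$Value",
        "transforms.ReplaceField.blacklist": "id",
        "mongodb.document.id.strategy": "at.grahsl.kafka.connect.mongodb.processor.id.strategy.ProvidedInKeyStrategy",
        "mongodb.delete.on.null.values": "true"
    }
}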

With these properties I've been able to sink the transformed content into MongoDB. This is an example of a generated MongoDB document:

{
    "_id" : "c3d92200-07eb-4f23-abef-07bf63eff435",
    "id_trans_label" : "1d88ce47-40e8-4505-a5e6-8404429f578b",
    "id_classification_type" : "subject of insurance"
}

However, I haven't been successful handling delete events. It works perfectly with read, create and update events, but since this use case's goal is to replicate the very same content from PostgreSQL into MongoDB, I'm writing this issue hoping you can help me find my mistake.

Thank you in advance!

@limadelrey THX for your kind words!

From what I've seen, you seem to be missing the fact that the sink connector has "native" support for Debezium CDC events. Please read again through this README section. Given that you configure the CDC handler for Postgres, you don't need to take any special care about the deletion configuration. Likewise, there is no need to use the Unwrap SMT or others. It should just work out of the box :)

Please let me know in case you succeed!

Thank you for the prompt answer @hpgrahsl!

I tried using it before, but I'll give it another try.

From my understanding, using this handler won't transform the event into the format I posted previously. I think it just replicates the event saved by Debezium, with the "before", "after" and several other fields.

Am I wrong?

> From my understanding, using this handler won't transform the event into the format I posted previously. I think it just replicates the event saved by Debezium, with the "before", "after" and several other fields.

Actually no. By native CDC support I mean that the end result should give you the actual data only, instead of replicating the full Debezium event info. That's what it was built for, after all. But again, please report back if anything doesn't work as expected / documented. Happy to hear from you.

Addendum: the only actual difference is that you would get a different representation for the _id field, as shown below:

{ "_id": {
       "id":"..."
   },
   ...
}

So I went through the documentation and I ended up with the following properties:

name=test
connector.class=at.grahsl.kafka.connect.mongodb.MongoDbSinkConnector
topics=ics.i2s.concept
tasks.max=1
mongodb.connection.uri=mongodb://root:example@db-functions:27017/ics?authSource=admin
mongodb.collection=test

key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081

mongodb.change.data.capture.handler=at.grahsl.kafka.connect.mongodb.cdc.debezium.rdbms.postgres.PostgresHandler

However, I'm getting a NullPointerException:

org.apache.kafka.connect.errors.ConnectException: Exiting WorkerSinkTask due to unrecoverable exception.
at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:586)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:322)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:225)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:193)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:175)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:219)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
at at.grahsl.kafka.connect.mongodb.MongoDbSinkTask.buildWriteModelCDC(MongoDbSinkTask.java:245)
at at.grahsl.kafka.connect.mongodb.MongoDbSinkTask.processSinkRecords(MongoDbSinkTask.java:142)
at at.grahsl.kafka.connect.mongodb.MongoDbSinkTask.lambda$null$0(MongoDbSinkTask.java:118)
at java.util.ArrayList.forEach(ArrayList.java:1257)
at at.grahsl.kafka.connect.mongodb.MongoDbSinkTask.lambda$put$1(MongoDbSinkTask.java:117)
at java.util.HashMap.forEach(HashMap.java:1289)
at at.grahsl.kafka.connect.mongodb.MongoDbSinkTask.put(MongoDbSinkTask.java:112)
at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:564)
... 10 more

Am I using an erroneous configuration?

You might be hitting exactly this issue: #82 (comment). It is already resolved in the latest master and thus not yet released. But I've shown a temporary workaround by means of re-configuration in the comment linked above.

Hey again,

I was indeed hitting the same issue that was reported earlier. I ended up with the following properties and it worked:

name=test-connector
connector.class=at.grahsl.kafka.connect.mongodb.MongoDbSinkConnector
topics=ics.i2s.concept
tasks.max=1
mongodb.connection.uri=mongodb://root:example@db-functions:27017/ics?authSource=admin

value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081

mongodb.collections=test
mongodb.collection.ics.i2s.concept=test
mongodb.change.data.capture.handler.test=at.grahsl.kafka.connect.mongodb.cdc.debezium.rdbms.RdbmsHandler

At this moment it is possible to sink the topic events into MongoDB by using the CDC handler configuration, so let me thank you for your support! However, just like you said, I'm getting documents with the embedded id:

{
    "_id": {
        "id" : "c3d92200-07eb-4f23-abef-07bf63eff435"
    },
    "id_trans_label": "1d88ce47-40e8-4505-a5e6-8404429f578b",
    "id_classification_type": "subject of insurance"
}

Is there a way to flatten the document before sinking it in order to match the following structure?

{
    "_id" : "c3d92200-07eb-4f23-abef-07bf63eff435",
    "id_trans_label" : "1d88ce47-40e8-4505-a5e6-8404429f578b",
    "id_classification_type" : "subject of insurance"
}

Thank you once again!

@limadelrey glad to hear you got it working! As for the flat _id field structure I have to say no, it's not possible. There is a reason for this, based on a design decision:
a) the _id field itself is special and used in MongoDB only - it is unknown to other databases and thus to CDC sources. If a source database used a primary key column named xyz, you would lose the information about the original column name, which is typically not what you want. Thus it gets wrapped as a sub-document into the _id field.
b) it should work generically enough to also support different cases. Think about compound primary keys in source databases, e.g. a PK spanning 2 columns (ca and cb). These then form a compound _id field by means of a sub-document in MongoDB as well.
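
To illustrate with the hypothetical ca / cb columns from the example above, such a row would end up with a compound _id sub-document along these lines:

{
    "_id": {
        "ca": "...",
        "cb": "..."
    },
    ...
}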

I hope this clarifies the ideas behind the current behaviour of the CDC mode. If you have suggestions and want to improve this, feel free to implement it and send a PR. Do you think we can close this issue then? If so, please do - THX.

It does make sense. The reality is that my use case differs from the intended behaviour of the CDC mode. Thank you so much for your explanations, I'll close the issue then. Cheers!

I hope this doesn't mean you won't try to leverage Kafka Connect and the MongoDB sink connector, since the _id structure not being a flat one actually doesn't change much. Others are doing it like this as well. And I think you won't find an easier and more reliable way to get proper CDC event replication towards MongoDB anywhere else :) Thanks also to the Debezium source connectors, of course!

Of course not. I've finished my use case of capturing change data from a PostgreSQL database (by using the Debezium PostgreSQL source connector) and replicating its content into a MongoDB database (by using your very own MongoDB sink connector). I'm very happy with the end result.

Once again, thank you so much for sharing it with the community.