alibaba / MongoShake

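MongoShake is a universal data replication platform based on MongoDB's oplog. Redundant replication and active-active replication are its two most important functions. It is an oplog-based cluster replication tool for MongoDB that covers migration and synchronization needs and can further support disaster recovery and active-active deployments.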

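When a member of the target replica set is in an abnormal state, incremental sync hangs. Logs print normally, but data is not synchronized and neither ack nor ckpt advances.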

Steloj-shaw opened this issue · comments

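MongoShake: v2.6.4
Source: MongoDB 3.4 Community Edition, replica set with 1 primary, 1 secondary, and 1 arbiter
Target: MongoDB 3.4 Community Edition, replica set with 1 primary, 1 secondary, and 1 arbiter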

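Symptoms:
When the secondary of the target replica set goes down (leaving the target with 1 primary and 1 arbiter, with the replica set still healthy), MongoShake keeps printing logs normally, but no data is synchronized and neither ack nor ckpt advances.
As soon as I use rs.remove() to remove the abnormal node from the target replica set, MongoShake quickly resumes syncing.
Further testing showed that MongoShake also hangs when the target secondary is in an abnormal state such as STARTUP2 or RECOVERING, but it recovers quickly once the abnormal node is removed with rs.remove().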
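
One possible explanation, which nothing in this thread confirms, is that writes to the target are issued with write concern "majority". MongoDB documents the majority for that write concern as the smaller of (a) the majority of all voting members, arbiters included, and (b) the number of data-bearing voting members. The sketch below only works through that arithmetic for the topologies described above. The function names are illustrative, and it is an assumption that MongoShake or the replica set defaults use "majority" here and that a secondary in STARTUP2 or RECOVERING cannot acknowledge writes.

```python
def write_majority_count(voting_members: int, arbiters: int) -> int:
    """Members that must acknowledge a w:"majority" write (per MongoDB's documented rule)."""
    majority_of_voters = voting_members // 2 + 1
    data_bearing_voters = voting_members - arbiters
    return min(majority_of_voters, data_bearing_voters)


def majority_write_can_ack(voting_members: int, arbiters: int, healthy_data_bearing: int) -> bool:
    """True if enough healthy data-bearing members exist to acknowledge the write."""
    return healthy_data_bearing >= write_majority_count(voting_members, arbiters)


# (voting members, arbiters, healthy data-bearing members)
scenarios = {
    "1 primary + 1 secondary + 1 arbiter, all healthy": (3, 1, 2),
    "same set, secondary down or recovering": (3, 1, 1),
    "after rs.remove() of the secondary (primary + arbiter)": (2, 1, 1),
}

results = {name: majority_write_can_ack(*args) for name, args in scenarios.items()}
for name, ok in results.items():
    print(f"{name}: majority write can be acknowledged = {ok}")
```

Under these assumptions the middle scenario is the only one in which a majority write cannot be acknowledged, which would be consistent with sync resuming right after rs.remove().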

The source already has the required oplog entry from 10:56:
ecpay:SECONDARY> db.oplog.rs.find({"ns":"online.---.trans"}).sort({ "$natural":-1}).limit(1)
{ "ts" : Timestamp(1686538562, 12), "t" : NumberLong(12), "h" : NumberLong("-4256625249891637272"), "v" : 2, "op" : "u", "ns" : "online.---.trans", "o2" : { "_id" : ObjectId("64868941a11a2c5a*****6") }, "o" : { "$set" : { "re---e" : "00", "res---age" : "Suss" } } }
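
To relate this oplog entry to the MongoShake log, the seconds part of the oplog Timestamp can be converted to wall-clock time and compared with the seconds part of lsn_ack. This is a minimal sketch: both numbers are copied from this report, the UTC+8 offset is assumed from the "CST" label in the log lines, and the helper name is illustrative.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "CST" in the MongoShake log means China Standard Time (UTC+8).
CST = timezone(timedelta(hours=8))


def oplog_seconds_to_cst(seconds: int) -> str:
    """Format the seconds part of an oplog Timestamp as local wall-clock time."""
    return datetime.fromtimestamp(seconds, tz=CST).strftime("%Y-%m-%d %H:%M:%S")


source_ts_seconds = 1686538562  # from Timestamp(1686538562, 12) on the source
lsn_ack_seconds = 1686538432    # from lsn_ack=...[1686538432, 13] in the MongoShake log

source_time = oplog_seconds_to_cst(source_ts_seconds)
ack_time = oplog_seconds_to_cst(lsn_ack_seconds)
lag_seconds = source_ts_seconds - lsn_ack_seconds

print(f"newest source oplog entry: {source_time}")
print(f"last acknowledged oplog:   {ack_time}")
print(f"ack is behind by {lag_seconds} seconds")
```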

Log output:

[2023/06/12 10:56:26 CST] [INFO] [name=default-0, stage=incr, get=309680, filter=303384, write_success=96, tps=0, ckpt_times=31, lsn_ckpt={7243627082469605381[1686538356, 5], 2023-06-12 10:52:36}, lsn_ack={7243627408887119885[1686538432, 13], 2023-06-12 10:53:52}]]
[2023/06/12 10:56:31 CST] [INFO] [name=default-0, stage=incr, get=309680, filter=303384, write_success=96, tps=0, ckpt_times=31, lsn_ckpt={7243627082469605381[1686538356, 5], 2023-06-12 10:52:36}, lsn_ack={7243627408887119885[1686538432, 13], 2023-06-12 10:53:52}]]
[2023/06/12 10:56:36 CST] [INFO] [name=default-0, stage=incr, get=309680, filter=303384, write_success=96, tps=0, ckpt_times=31, lsn_ckpt={7243627082469605381[1686538356, 5], 2023-06-12 10:52:36}, lsn_ack={7243627408887119885[1686538432, 13], 2023-06-12 10:53:52}]]
[2023/06/12 10:56:41 CST] [INFO] [name=default-0, stage=incr, get=309680, filter=303384, write_success=96, tps=0, ckpt_times=31, lsn_ckpt={7243627082469605381[1686538356, 5], 2023-06-12 10:52:36}, lsn_ack={7243627408887119885[1686538432, 13], 2023-06-12 10:53:52}]]
[2023/06/12 10:56:46 CST] [INFO] [name=default-0, stage=incr, get=309680, filter=303384, write_success=96, tps=0, ckpt_times=31, lsn_ckpt={7243627082469605381[1686538356, 5], 2023-06-12 10:52:36}, lsn_ack={7243627408887119885[1686538432, 13], 2023-06-12 10:53:52}]]
[2023/06/12 10:56:51 CST] [INFO] [name=default-0, stage=incr, get=309680, filter=303384, write_success=96, tps=0, ckpt_times=31, lsn_ckpt={7243627082469605381[1686538356, 5], 2023-06-12 10:52:36}, lsn_ack={7243627408887119885[1686538432, 13], 2023-06-12 10:53:52}]]
[2023/06/12 10:56:56 CST] [INFO] [name=default-0, stage=incr, get=309680, filter=303384, write_success=96, tps=0, ckpt_times=31, lsn_ckpt={7243627082469605381[1686538356, 5], 2023-06-12 10:52:36}, lsn_ack={7243627408887119885[1686538432, 13], 2023-06-12 10:53:52}]]
[2023/06/12 10:57:01 CST] [INFO] [name=default-0, stage=incr, get=309680, filter=303384, write_success=96, tps=0, ckpt_times=31, lsn_ckpt={7243627082469605381[1686538356, 5], 2023-06-12 10:52:36}, lsn_ack={7243627408887119885[1686538432, 13], 2023-06-12 10:53:52}]]
[2023/06/12 10:57:06 CST] [INFO] [name=default-0, stage=incr, get=309680, filter=303384, write_success=96, tps=0, ckpt_times=31, lsn_ckpt={7243627082469605381[1686538356, 5], 2023-06-12 10:52:36}, lsn_ack={7243627408887119885[1686538432, 13], 2023-06-12 10:53:52}]]
[2023/06/12 10:57:11 CST] [INFO] [name=default-0, stage=incr, get=309680, filter=303384, write_success=96, tps=0, ckpt_times=31, lsn_ckpt={7243627082469605381[1686538356, 5], 2023-06-12 10:52:36}, lsn_ack={7243627408887119885[1686538432, 13], 2023-06-12 10:53:52}]]
[2023/06/12 10:57:16 CST] [INFO] [name=default-0, stage=incr, get=309680, filter=303384, write_success=96, tps=0, ckpt_times=31, lsn_ckpt={7243627082469605381[1686538356, 5], 2023-06-12 10:52:36}, lsn_ack={7243627408887119885[1686538432, 13], 2023-06-12 10:53:52}]]
[2023/06/12 10:57:21 CST] [INFO] [name=default-0, stage=incr, get=309680, filter=303384, write_success=96, tps=0, ckpt_times=31, lsn_ckpt={7243627082469605381[1686538356, 5], 2023-06-12 10:52:36}, lsn_ack={7243627408887119885[1686538432, 13], 2023-06-12 10:53:52}]]
[2023/06/12 10:57:26 CST] [INFO] [name=default-0, stage=incr, get=309680, filter=303384, write_success=96, tps=0, ckpt_times=31, lsn_ckpt={7243627082469605381[1686538356, 5], 2023-06-12 10:52:36}, lsn_ack={7243627408887119885[1686538432, 13], 2023-06-12 10:53:52}]]
[2023/06/12 10:57:31 CST] [INFO] [name=default-0, stage=incr, get=309680, filter=303384, write_success=96, tps=0, ckpt_times=31, lsn_ckpt={7243627082469605381[1686538356, 5], 2023-06-12 10:52:36}, lsn_ack={7243627408887119885[1686538432, 13], 2023-06-12 10:53:52}]]
[2023/06/12 10:57:36 CST] [INFO] [name=default-0, stage=incr, get=309680, filter=303384, write_success=96, tps=0, ckpt_times=31, lsn_ckpt={7243627082469605381[1686538356, 5], 2023-06-12 10:52:36}, lsn_ack={7243627408887119885[1686538432, 13], 2023-06-12 10:53:52}]]
[2023/06/12 10:57:41 CST] [INFO] [name=default-0, stage=incr, get=309680, filter=303384, write_success=96, tps=0, ckpt_times=31, lsn_ckpt={7243627082469605381[1686538356, 5], 2023-06-12 10:52:36}, lsn_ack={7243627408887119885[1686538432, 13], 2023-06-12 10:53:52}]]
[2023/06/12 10:57:46 CST] [INFO] [name=default-0, stage=incr, get=309680, filter=303384, write_success=96, tps=0, ckpt_times=31, lsn_ckpt={7243627082469605381[1686538356, 5], 2023-06-12 10:52:36}, lsn_ack={7243627408887119885[1686538432, 13], 2023-06-12 10:53:52}]]
[2023/06/12 10:57:51 CST] [INFO] [name=default-0, stage=incr, get=309680, filter=303384, write_success=96, tps=0, ckpt_times=31, lsn_ckpt={7243627082469605381[1686538356, 5], 2023-06-12 10:52:36}, lsn_ack={7243627408887119885[1686538432, 13], 2023-06-12 10:53:52}]]
[2023/06/12 10:57:56 CST] [INFO] [name=default-0, stage=incr, get=309680, filter=303384, write_success=96, tps=0, ckpt_times=31, lsn_ckpt={7243627082469605381[1686538356, 5], 2023-06-12 10:52:36}, lsn_ack={7243627408887119885[1686538432, 13], 2023-06-12 10:53:52}]]
[2023/06/12 10:58:01 CST] [INFO] [name=default-0, stage=incr, get=309680, filter=303384, write_success=96, tps=0, ckpt_times=31, lsn_ckpt={7243627082469605381[1686538356, 5], 2023-06-12 10:52:36}, lsn_ack={7243627408887119885[1686538432, 13], 2023-06-12 10:53:52}]]
[2023/06/12 10:58:06 CST] [INFO] [name=default-0, stage=incr, get=309680, filter=303384, write_success=96, tps=0, ckpt_times=31, lsn_ckpt={7243627082469605381[1686538356, 5], 2023-06-12 10:52:36}, lsn_ack={7243627408887119885[1686538432, 13], 2023-06-12 10:53:52}]]
[2023/06/12 10:58:11 CST] [INFO] [name=default-0, stage=incr, get=309680, filter=303384, write_success=96, tps=0, ckpt_times=31, lsn_ckpt={7243627082469605381[1686538356, 5], 2023-06-12 10:52:36}, lsn_ack={7243627408887119885[1686538432, 13], 2023-06-12 10:53:52}]]
[2023/06/12 10:58:16 CST] [INFO] [name=default-0, stage=incr, get=309680, filter=303384, write_success=96, tps=0, ckpt_times=31, lsn_ckpt={7243627082469605381[1686538356, 5], 2023-06-12 10:52:36}, lsn_ack={7243627408887119885[1686538432, 13], 2023-06-12 10:53:52}]]
[2023/06/12 10:58:21 CST] [INFO] [name=default-0, stage=incr, get=309680, filter=303384, write_success=96, tps=0, ckpt_times=31, lsn_ckpt={7243627082469605381[1686538356, 5], 2023-06-12 10:52:36}, lsn_ack={7243627408887119885[1686538432, 13], 2023-06-12 10:53:52}]]
[2023/06/12 10:58:26 CST] [INFO] [name=default-0, stage=incr, get=309680, filter=303384, write_success=96, tps=0, ckpt_times=31, lsn_ckpt={7243627082469605381[1686538356, 5], 2023-06-12 10:52:36}, lsn_ack={7243627408887119885[1686538432, 13], 2023-06-12 10:53:52}]]
[2023/06/12 10:58:31 CST] [INFO] [name=default-0, stage=incr, get=309680, filter=303384, write_success=96, tps=0, ckpt_times=31, lsn_ckpt={7243627082469605381[1686538356, 5], 2023-06-12 10:52:36}, lsn_ack={7243627408887119885[1686538432, 13], 2023-06-12 10:53:52}]]
[2023/06/12 10:58:36 CST] [INFO] [name=default-0, stage=incr, get=309680, filter=303384, write_success=96, tps=0, ckpt_times=31, lsn_ckpt={7243627082469605381[1686538356, 5], 2023-06-12 10:52:36}, lsn_ack={7243627408887119885[1686538432, 13], 2023-06-12 10:53:52}]]
[2023/06/12 10:58:41 CST] [INFO] [name=default-0, stage=incr, get=309680, filter=303384, write_success=96, tps=0, ckpt_times=31, lsn_ckpt={7243627082469605381[1686538356, 5], 2023-06-12 10:52:36}, lsn_ack={7243627408887119885[1686538432, 13], 2023-06-12 10:53:52}]]
[2023/06/12 10:58:46 CST] [INFO] [name=default-0, stage=incr, get=309680, filter=303384, write_success=96, tps=0, ckpt_times=31, lsn_ckpt={7243627082469605381[1686538356, 5], 2023-06-12 10:52:36}, lsn_ack={7243627408887119885[1686538432, 13], 2023-06-12 10:53:52}]]
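
The stall can be read off these lines mechanically: write_success, lsn_ckpt, and lsn_ack are identical in every sample. The following sketch parses such lines and reports how long the counters stayed frozen. It also unpacks the large lsn integer as (seconds << 32) | increment, which matches the bracketed pairs printed in the log. The regular expression assumes the exact line layout shown in this report, and the function names are illustrative rather than part of MongoShake.

```python
import re
from datetime import datetime

LINE_RE = re.compile(
    r"\[(?P<logged>[\d/]+ [\d:]+) CST\].*?"
    r"write_success=(?P<write_success>\d+).*?"
    r"lsn_ckpt=\{(?P<ckpt>\d+)\[.*?"
    r"lsn_ack=\{(?P<ack>\d+)\["
)


def unpack_lsn(lsn: int) -> tuple:
    """Split a packed lsn into (seconds, increment)."""
    return lsn >> 32, lsn & 0xFFFFFFFF


def parse_line(line: str):
    """Extract the fields of interest from one log line, or None if it does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    return {
        "logged": datetime.strptime(m.group("logged"), "%Y/%m/%d %H:%M:%S"),
        "write_success": int(m.group("write_success")),
        "ckpt": unpack_lsn(int(m.group("ckpt"))),
        "ack": unpack_lsn(int(m.group("ack"))),
    }


def stalled_seconds(lines) -> float:
    """Seconds between first and last sample if no counter moved, else 0."""
    samples = [s for s in map(parse_line, lines) if s is not None]
    if len(samples) < 2:
        return 0.0
    first, last = samples[0], samples[-1]
    frozen = all(
        s["ack"] == first["ack"]
        and s["ckpt"] == first["ckpt"]
        and s["write_success"] == first["write_success"]
        for s in samples
    )
    return (last["logged"] - first["logged"]).total_seconds() if frozen else 0.0


# First and last lines of the log excerpt in this report.
SAMPLE = [
    "[2023/06/12 10:56:26 CST] [INFO] [name=default-0, stage=incr, get=309680, filter=303384, write_success=96, tps=0, ckpt_times=31, lsn_ckpt={7243627082469605381[1686538356, 5], 2023-06-12 10:52:36}, lsn_ack={7243627408887119885[1686538432, 13], 2023-06-12 10:53:52}]]",
    "[2023/06/12 10:58:46 CST] [INFO] [name=default-0, stage=incr, get=309680, filter=303384, write_success=96, tps=0, ckpt_times=31, lsn_ckpt={7243627082469605381[1686538356, 5], 2023-06-12 10:52:36}, lsn_ack={7243627408887119885[1686538432, 13], 2023-06-12 10:53:52}]]",
]

print(f"counters frozen for {stalled_seconds(SAMPLE):.0f} seconds")
```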

Key configuration parameters (sanitized):
! The target database and the checkpoint storage database are the same instance !

mongo_urls = mongodb://mor:passwd@bkk-.com:23636,bkkcom:23636,bkk***.com:23636/admin

tunnel = direct

tunnel.address = mongodb://moe:passwd@bkk-.com:23636,bkk-.com:23636,bkk*com:23636/settle

tunnel.message = raw

mongo_connect_mode = secondaryPreferred

filter.namespace.white = ^online.****.trans$

filter.ddl_enable = true
checkpoint.storage.url = mongodb://moe:passwd@bkk-.com:23636,bkk-.com:23636,bkk*com:23636/settle
checkpoint.storage.db = mongoshake
checkpoint.storage.collection = bkk_mgo2mgo_online2settle.log
transform.namespace = online:settle
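
Because the checkpoint store and the target are the same replica set, anything that blocks writes on the target would also block checkpoint updates. The sketch below shows one way to detect that setup from a key = value config. It assumes "#" starts a comment line and uses placeholder hosts and credentials, not the masked values above. parse_conf is a hypothetical helper, not a MongoShake function.

```python
def parse_conf(text: str) -> dict:
    """Parse simple 'key = value' lines, skipping blanks and '#' comment lines."""
    conf = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        conf[key.strip()] = value.strip()
    return conf


# Placeholder values for illustration only.
EXAMPLE_CONF = """
tunnel = direct
tunnel.address = mongodb://user:passwd@target-a.example:27017,target-b.example:27017/settle
checkpoint.storage.url = mongodb://user:passwd@target-a.example:27017,target-b.example:27017/settle
checkpoint.storage.db = mongoshake
"""

conf = parse_conf(EXAMPLE_CONF)
checkpoint_shares_target = (
    conf.get("tunnel") == "direct"
    and conf.get("tunnel.address") == conf.get("checkpoint.storage.url")
)
print(f"checkpoint store is the target replica set: {checkpoint_shares_target}")
```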

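When the problem occurs, have you tried manually writing data to the target database yourself? Does it succeed?

> When the problem occurs, have you tried manually writing data to the target database yourself? Does it succeed?

Yes, I tried, and the write did not succeed. However, based on our analysis of an actual production incident, during the several hours of the outage a handful of oplog entries were occasionally synchronized successfully, though very few.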