openark / orchestrator

MySQL replication topology management and HA

region or datacenter causes failover to fail

lazzyfu opened this issue

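Preconditions for triggering the bug (either one is enough, or both enabled together)

  • PreventCrossRegionMasterFailover is true
  • PreventCrossDataCenterMasterFailover is true

When it triggers

It comes down to luck (timing).

Result

The failover fails.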

Error screenshots

image
image

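Analyzing the error

Reviewing the code:

image
In the code, the value after "in" is expected to be populated, but in the log output the value after "in" is empty. The check therefore fails, the region protection is triggered, and the failover fails.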
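
The check described above can be sketched roughly as follows. This is a minimal illustration, not orchestrator's actual code: the function name checkCrossRegion, the region value "bj", and the exact error wording are assumptions made for the example. It only shows why an empty recorded region for the dead master makes every candidate look like a cross-region promotion.

```go
package main

import "fmt"

// checkCrossRegion is a hypothetical helper (not orchestrator's real function).
// It returns an error when promoting the candidate would cross regions.
func checkCrossRegion(preventCrossRegion bool, candidateRegion, deadMasterRegion string) error {
	if preventCrossRegion && candidateRegion != deadMasterRegion {
		return fmt.Errorf("will not promote server in %q when dead server in %q",
			candidateRegion, deadMasterRegion)
	}
	return nil
}

func main() {
	// Region recorded correctly: a same-region promotion is allowed.
	fmt.Println(checkCrossRegion(true, "bj", "bj"))
	// Region lost: the dead master's region is "", so the guard trips
	// even though the candidate really is in the same region.
	fmt.Println(checkCrossRegion(true, "bj", ""))
}
```

With the guard disabled (first argument false), both calls return nil, which matches the observation that the bug only appears when one of the two Prevent... options is enabled.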

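How to reproduce

Source file: go/inst/instance_dao.go
Add a for loop below instanceFound = true (any number of seconds works, but keep it small: too long a delay slows detection, and the effect takes longer to show).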
image
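
Since the screenshot of the debug loop is not reproduced here, the following is a rough sketch of what such a loop could look like. The helper name debugDelay and its log line are assumptions for illustration; the original patch may simply inline a for loop with time.Sleep below instanceFound = true.

```go
package main

import (
	"fmt"
	"time"
)

// debugDelay is a hypothetical stand-in for the loop added below
// `instanceFound = true`. It prints one line per tick, sleeps for the
// given tick duration, and returns how many ticks it completed.
func debugDelay(ticks int, tick time.Duration) int {
	done := 0
	for i := 1; i <= ticks; i++ {
		fmt.Printf("debug loop %d/%d\n", i, ticks)
		time.Sleep(tick)
		done++
	}
	return done
}

func main() {
	// Five one-second ticks: the window during which the master is shut down.
	debugDelay(5, time.Second)
}
```

The printed lines make it easy to see when the window is open, so the shutdown can be issued while the loop is still running.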

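Shut down the master node

Timing matters here: the debug code loops for 5 seconds, and the shutdown command must be run during that 5-second loop.

Shut down the master while the output below is being printed: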
image

Observe the topology

At this point the topology recovery has failed.
image

Check the logs

image

Check the records in the orchestrator (ORC) tables

image
The region recorded for the crashed master node 10.10.1.220 is now empty.
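
One plausible reading of how the region ends up empty is sketched below. It is a simplified model under stated assumptions, not orchestrator's code: readInstance, healthyQuery, deadQuery, the in-memory backend map, and the region value "bj" are all hypothetical. The assumption is that an instance already marked as found is written back as read, even when the later region query failed because the server had just gone down.

```go
package main

import (
	"errors"
	"fmt"
)

// Instance holds only the fields relevant to this issue.
type Instance struct {
	Key    string
	Region string
}

// healthyQuery and deadQuery stand in for running DetectRegionQuery
// against a live server and against one that has just been shut down.
func healthyQuery() (string, error) { return "bj", nil }
func deadQuery() (string, error)    { return "", errors.New("connection refused") }

// readInstance is a hypothetical model of one discovery pass.
func readInstance(key string, queryRegion func() (string, error)) (Instance, bool) {
	inst := Instance{Key: key}
	found := true // basic probing succeeded, so the instance is marked found
	// If the server goes away at this point, the region query below fails
	// and Region keeps its zero value "".
	if region, err := queryRegion(); err == nil {
		inst.Region = region
	}
	return inst, found
}

func main() {
	backend := map[string]Instance{}
	for _, q := range []func() (string, error){healthyQuery, deadQuery} {
		if inst, found := readInstance("10.10.1.220", q); found {
			backend[inst.Key] = inst // a found instance is stored as read
		}
		fmt.Printf("stored region: %q\n", backend["10.10.1.220"].Region)
	}
}
```

In this model the second pass overwrites the previously stored region with an empty string, which is the state seen in the table above.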

Here is the fix (skipping the details of how I got there)

The fix itself is simple: just change where the DetectRegionQuery code runs.

image

Where to add the code

image

Code to add

// Get datacenter, region, etc.
func() {
    var getMetaWaitGroup sync.WaitGroup
    if config.Config.DetectDataCenterQuery != "" && !isMaxScale {
        getMetaWaitGroup.Add(1)
        go func() {
            defer getMetaWaitGroup.Done()
            err := db.QueryRow(config.Config.DetectDataCenterQuery).Scan(&instance.DataCenter)
            logReadTopologyInstanceError(instanceKey, "DetectDataCenterQuery", err)
        }()
    }
    if config.Config.DetectRegionQuery != "" && !isMaxScale {
        getMetaWaitGroup.Add(1)
        go func() {
            defer getMetaWaitGroup.Done()
            err := db.QueryRow(config.Config.DetectRegionQuery).Scan(&instance.Region)
            logReadTopologyInstanceError(instanceKey, "DetectRegionQuery", err)
        }()
    }
    if config.Config.DetectPhysicalEnvironmentQuery != "" && !isMaxScale {
        getMetaWaitGroup.Add(1)
        go func() {
            defer getMetaWaitGroup.Done()
            err := db.QueryRow(config.Config.DetectPhysicalEnvironmentQuery).Scan(&instance.PhysicalEnvironment)
            logReadTopologyInstanceError(instanceKey, "DetectPhysicalEnvironmentQuery", err)
        }()
    }
 
    if config.Config.DetectInstanceAliasQuery != "" && !isMaxScale {
        getMetaWaitGroup.Add(1)
        go func() {
            defer getMetaWaitGroup.Done()
            err := db.QueryRow(config.Config.DetectInstanceAliasQuery).Scan(&instance.InstanceAlias)
            logReadTopologyInstanceError(instanceKey, "DetectInstanceAliasQuery", err)
        }()
    }
 
    if config.Config.DetectSemiSyncEnforcedQuery != "" && !isMaxScale {
        getMetaWaitGroup.Add(1)
        go func() {
            defer getMetaWaitGroup.Done()
            err := db.QueryRow(config.Config.DetectSemiSyncEnforcedQuery).Scan(&instance.SemiSyncPriority)
            logReadTopologyInstanceError(instanceKey, "DetectSemiSyncEnforcedQuery", err)
        }()
    }
    getMetaWaitGroup.Wait()
}()