planetscale / vitess-operator

Kubernetes Operator for Vitess

vttablet: enable_replication_reporter should be configurable

mdkent opened this issue · comments

Right now in vttablet, enable_replication_reporter defaults to true in externalDatastoreFlags. This causes issues when we add an rdonly tablet backed by RDS: Vitess tries to check replication status on a tablet whose underlying database isn't running replication.

In our tests, I believe replication reporter still worked when the tablet was backed by RDS. @PrismaPhonic can you confirm?

I'm not too familiar with RDS myself. Is there a setting that might determine whether or not replication reporter works? If it's more common for RDS to be configured in such a way that replication reporter doesn't work, we should probably change the default for external datastores to off.

In the meantime, you can always override the flag with extraFlags, in case this is a blocker for you.
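For reference, the override could look like the following. This is a hedged sketch based on the tablet pool spec shown in this thread; the pool values (cell, type, replicas) are placeholders to adapt to your cluster, and note that vttablet boolean flags are passed as strings in extraFlags:

```yaml
# Sketch: disable the replication reporter for one external tablet pool.
tabletPools:
- cell: useast1
  type: externalrdonly
  replicas: 1
  vttablet:
    extraFlags:
      enable_replication_reporter: "false"
```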

When I tested this it did seem to work fine. @mdkent what actual error state are you experiencing with replication reporter on?

Sorry, I should have specified we're seeing this on Aurora.

> When I tested this it did seem to work fine. @mdkent what actual error state are you experiencing with replication reporter on?

Sure! Given the following config:

        shards:
        - keyRange: {}
          databaseInitScriptSecret:
            name: foo-cluster-config
            key: init_db.sql
          replication:
            enforceSemiSync: false
          tabletPools:
          - cell: useast1
            type: externalmaster
            replicas: 1
            vttablet:
              extraFlags:
                db_charset: utf8mb4
                queryserver-config-pool-size: "5"
                queryserver-config-stream-pool-size: "5"
                queryserver-config-transaction-cap: "5"
              resources:
                requests:
                  cpu: 250m
                  memory: 256Mi
                limits:
                  memory: 384Mi
            externalDatastore:
              user: admin
              host: foo-vitess.cluster-xn3akgqwgtt8.us-east-1.rds.amazonaws.com
              port: 3306
              database: foo_production
              credentialsSecret:
                name: foo-cluster-config
                key: db_creds.json
          - cell: useast1
            type: externalrdonly
            replicas: 1
            vttablet:
              extraFlags:
                db_charset: utf8mb4
                queryserver-config-pool-size: "5"
                queryserver-config-stream-pool-size: "5"
                queryserver-config-transaction-cap: "5"
              resources:
                requests:
                  cpu: 250m
                  memory: 256Mi
                limits:
                  memory: 384Mi
            externalDatastore:
              user: admin
              host: foo-vitess.cluster-ro-xn3akgqwgtt8.us-east-1.rds.amazonaws.com
              port: 3306
              database: foo_production
              credentialsSecret:
                name: foo-cluster-config
                key: db_creds.json

The externalrdonly tablets hitting the read-only endpoint that Aurora provides never enter normal service:

$ kubectl get pods
NAME                                           READY   STATUS    RESTARTS   AGE
foo-etcd-cf3bf24b-1                            1/1     Running   0          4m16s
foo-etcd-cf3bf24b-2                            1/1     Running   0          4m16s
foo-etcd-cf3bf24b-3                            1/1     Running   0          4m16s
foo-useast1-vtctld-305faec9-7db9949599-tcjmr   1/1     Running   2          4m16s
foo-useast1-vtgate-7ca66dd7-64bd485f47-26skv   1/1     Running   3          4m16s
foo-vttablet-useast1-0876642563-eb6ce985       1/1     Running   2          4m16s
foo-vttablet-useast1-2417040302-e6f2f418       0/1     Running   2          4m16s
foo-vttablet-useast1-2793830880-0ec53a62       0/1     Running   2          4m16s
foo-vttablet-useast1-3998625595-3da2ec65       1/1     Running   2          4m16s
vitess-operator-7f885997cb-ggdbd               1/1     Running   0          4m50s

vtgate says:

I0925 21:34:12.242053 1 tablet_health_check.go:110] HealthCheckUpdate(Serving State): tablet: useast1-2417040302 (10.119.15.162) serving false => false for mainunsharded/- (RDONLY) reason: healthCheck update error: vttablet error: no slave status

This read-only endpoint is a bit tricky. In single-node operation it just hits the primary, but it spreads load to additional replicas as we add them. None of those replicas will ever respond to SHOW SLAVE STATUS though.
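This can be seen from the database side by running the statement the health check relies on against the reader endpoint (a hedged illustration; the second query is Aurora-specific):

```sql
-- On an Aurora reader, binlog-style replication status is not exposed,
-- so this returns an empty result set -- which vttablet's replication
-- reporter reports as "no slave status".
SHOW SLAVE STATUS;

-- Aurora tracks replica lag through its own mechanism instead, e.g.:
SELECT server_id, replica_lag_in_milliseconds
FROM information_schema.replica_host_status;
```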

With

"enable_replication_reporter": false,

in the operator I'm able to use these read-only endpoints normally.

> In the meantime, you can always override the flag with extraFlags, in case this is a blocker for you.

Didn't realize I could override a default option like this. Thanks!

Working via:

          - cell: useast1
            type: externalrdonly
            replicas: 1
            vttablet:
              extraFlags:
                db_charset: utf8mb4
                queryserver-config-pool-size: "5"
                queryserver-config-stream-pool-size: "5"
                queryserver-config-transaction-cap: "5"
                enable_replication_reporter: "false"