medic / cht-watchdog

Configuration for deploying a monitoring/alerting stack for CHT

Support deployments that have multiple couch2pg writing to a single database

mrjones-plip opened this issue · comments

There is at least one major partner that has many CHT Core instances (>40), each with its own couch2pg instance, but all of them write to a single Postgres database instead of each CHT Core instance having its own database.

We should see whether couch2pg backlog monitoring can easily support this setup, so that alerting can be configured for each discrete CHT Core instance's backlog.

As a POC, I:

  1. set up two docker helper CHT Core instances.
  2. in the cht-couch2pg repo, made a new compose file that's a copy:
     services:
         cht-couch2pg2:
             image: medicmobile/cht-couch2pg:main-node-10
             environment:
                 COUCHDB_URL: ${COUCHDB_URL2:-https://medic:password@localhost:5988/medic}
                 POSTGRES_USER_NAME: ${COUCH2PG_USER:-cht_couch2pg}
                 POSTGRES_PASSWORD: ${COUCH2PG_USER_PASSWORD:-cht_couch2pg_password}
                 POSTGRES_SERVER_NAME: ${POSTGRES_SERVER_NAME:-postgres}
                 POSTGRES_DB_NAME: ${POSTGRES_DB_NAME:-cht}
                 COUCH2PG_CHANGES_LIMIT: ${COUCH2PG_CHANGES_LIMIT:-100}
                 COUCH2PG_SLEEP_MINS: ${COUCH2PG_SLEEP_MINS:-60}
                 COUCH2PG_DOC_LIMIT: ${COUCH2PG_DOC_LIMIT:-1000}
                 COUCH2PG_RETRY_COUNT: ${COUCH2PG_RETRY_COUNT:-5}
  3. started two couch2pg instances with:
     COUCH2PG_SLEEP_MINS=0.1 \
         COUCHDB_URL2=https://medic:password@192-168-68-17.local-ip.medicmobile.org:10460/medic \
         COUCHDB_URL=https://medic:password@172-17-0-1.local-ip.medicmobile.org:10464/medic \
         docker compose -f docker-compose.yml -f compose.just-couch2pg.yml up -d
    
  4. ran the default SQL for getting sequence counts, which failed as expected: rows from both instances come back mixed together, with nothing to tell them apart:
    sequence  db
    132       medic
    72        medic-sentinel
    232       medic
    130       medic-sentinel
    3         medic-users-meta
    20        medic-logs
    6         medic-users-meta
    5         _users
    19        medic-logs
    4         _users
  5. but if we tweak the SQL, we can actually extract the CHT Instance for free (instead of depending on the cht-instances.yml or sql-servers.yml):
     SELECT
       split_part(seq,'-',1) as sequence,
       split_part(source,'/',2) as db,
       split_part(source,'/',1) as cht_instance
     FROM
       couchdb_progress
     WHERE
       source like '%/%' and
       seq like '%-%'
     ORDER BY
       cht_instance, db
  6. this results in a nice table, as expected:
    sequence  db                cht_instance
    232       medic             172-17-0-1.local-ip.medicmobile.org:10464
    19        medic-logs        172-17-0-1.local-ip.medicmobile.org:10464
    130       medic-sentinel    172-17-0-1.local-ip.medicmobile.org:10464
    6         medic-users-meta  172-17-0-1.local-ip.medicmobile.org:10464
    4         _users            172-17-0-1.local-ip.medicmobile.org:10464
    132       medic             192-168-68-17.local-ip.medicmobile.org:10460
    20        medic-logs        192-168-68-17.local-ip.medicmobile.org:10460
    72        medic-sentinel    192-168-68-17.local-ip.medicmobile.org:10460
    3         medic-users-meta  192-168-68-17.local-ip.medicmobile.org:10460
    5         _users            192-168-68-17.local-ip.medicmobile.org:10460
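
To make the tweaked SQL in step 5 easier to follow, here is a minimal Python sketch of what the three `split_part` calls do to a single `couchdb_progress` row. It assumes `source` is stored as `host:port/db` and `seq` as `number-suffix`, which is what the `like` filters and the result table suggest. The `parse_progress_row` helper and the `abc123` suffix are made up for illustration; they are not part of couch2pg or watchdog.

```python
def split_part(text: str, delimiter: str, n: int) -> str:
    """Mimic PostgreSQL split_part for positive n: 1-indexed, '' if the field is missing."""
    if n < 1:
        raise ValueError("this sketch only handles positive field positions")
    parts = text.split(delimiter)
    return parts[n - 1] if n <= len(parts) else ""


def parse_progress_row(source: str, seq: str) -> dict:
    """Hypothetical helper: derive the three columns the tweaked query selects."""
    return {
        "sequence": int(split_part(seq, "-", 1)),    # leading number of the seq
        "db": split_part(source, "/", 2),            # database name after the slash
        "cht_instance": split_part(source, "/", 1),  # host:port before the slash
    }


# Host, port, db and sequence number come from the table above; the seq suffix is invented.
row = parse_progress_row(
    "172-17-0-1.local-ip.medicmobile.org:10464/medic",
    "232-abc123",
)
print(row)
```

The hyphens in the hostname are harmless here because `source` is only ever split on `/`, while `seq` is the only value split on `-`.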

Next I'll see if we can update watchdog to natively use this informed approach by default, instead of the prior naive approach. This way cht_instance will be ignored most of the time because there will only ever be one, but it will be forwards compatible in the rare circumstance where there's more than one.

Success! By updating the dashboard JSON to use the new field, we can reduce the amount of configuration in sql_instances.yml and still be backwards compatible.

So, in Grafana, the query target just needed to be updated to filter on cht_instance:

sum(couch2pg_progress_sequence{cht_instance=~"$cht_instance", db=~"medic|medic-sentinel|medic-users-meta"})
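
As a rough cross-check of what that query returns, the sketch below applies the same filter and sum to the rows from the POC table in plain Python. The `sum_sequences` function is hypothetical and only illustrates the selection logic. It uses `re.fullmatch` because Prometheus regex matchers are fully anchored, which is why `medic-logs` is not picked up by the `medic` alternative. Prometheus uses RE2 rather than Python's `re`, so treat this as an approximation.

```python
import re

# Rows copied from the POC result table: (sequence, db, cht_instance).
ROWS = [
    (232, "medic", "172-17-0-1.local-ip.medicmobile.org:10464"),
    (19, "medic-logs", "172-17-0-1.local-ip.medicmobile.org:10464"),
    (130, "medic-sentinel", "172-17-0-1.local-ip.medicmobile.org:10464"),
    (6, "medic-users-meta", "172-17-0-1.local-ip.medicmobile.org:10464"),
    (4, "_users", "172-17-0-1.local-ip.medicmobile.org:10464"),
    (132, "medic", "192-168-68-17.local-ip.medicmobile.org:10460"),
    (20, "medic-logs", "192-168-68-17.local-ip.medicmobile.org:10460"),
    (72, "medic-sentinel", "192-168-68-17.local-ip.medicmobile.org:10460"),
    (3, "medic-users-meta", "192-168-68-17.local-ip.medicmobile.org:10460"),
    (5, "_users", "192-168-68-17.local-ip.medicmobile.org:10460"),
]

DB_PATTERN = "medic|medic-sentinel|medic-users-meta"


def sum_sequences(rows, instance_pattern: str, db_pattern: str = DB_PATTERN) -> int:
    """Hypothetical stand-in for sum(metric{cht_instance=~..., db=~...})."""
    return sum(
        sequence
        for sequence, db, instance in rows
        # fullmatch mirrors the fully anchored behaviour of the =~ matcher
        if re.fullmatch(instance_pattern, instance) and re.fullmatch(db_pattern, db)
    )


one_instance = re.escape("172-17-0-1.local-ip.medicmobile.org:10464")
print(sum_sequences(ROWS, one_instance))  # 232 + 130 + 6 = 368
print(sum_sequences(ROWS, ".*"))          # both instances: 368 + 207 = 575
```

With a single instance selected in the dashboard variable, the total covers only that instance's three databases. With a match-all value, the sum spans every instance, which matches the old behaviour when only one instance exists.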

Still outstanding: get fake-cht working with the new syntax.