apple / foundationdb

FoundationDB - the open source, distributed, transactional key-value store

Home Page:https://apple.github.io/foundationdb/

DR constantly growing disk space

doublex opened this issue

FDB 7.1 server with disaster recovery (DR): the used disk space is constantly growing.

Statistics on the machine running DR (the source cluster):

Sum of key-value sizes - 339.880 GB
Disk space used        - 441.932 GB

DR target (the numbers are the same when restoring from a backup):

Sum of key-value sizes - 127.438 GB
Disk space used        - 181.399 GB

A similar problem has been reported:
https://forums.foundationdb.org/t/key-value-sizes-at-dr-source-and-destination-have-a-big-difference/3351

fdbcli --exec 'status json' (truncated):

{
  "cluster" : {
    [...]
    "layers" : {
      "_valid" : true,
      "backup" : {
        "blob_recent_io" : {
          "bytes_per_second" : 0,
          "bytes_sent" : 0,
          "requests_failed" : 0,
          "requests_successful" : 0
        },
        "instances" : {
          "f9f2d06cd5ded70cc0d60baf4e1ea6d8" : {
            "blob_stats" : {
              "recent" : {
                "bytes_per_second" : 0,
                "bytes_sent" : 0,
                "requests_failed" : 0,
                "requests_successful" : 0
              },
              "total" : {
                "bytes_sent" : 0,
                "requests_failed" : 0,
                "requests_successful" : 0
              }
            },
            "configured_workers" : 10,
            "id" : "f9f2d06cd5ded70cc0d60baf4e1ea6d8",
            "last_updated" : 1711573622.9675598,
            "main_thread_cpu_seconds" : 616332.49406500009,
            "memory_usage" : 141631488,
            "process_cpu_seconds" : 623611.24382099998,
            "resident_size" : 25047040,
            "version" : "7.1.49"
          }
        },
        "instances_running" : 1,
        "last_updated" : 1711573622.9675598,
        "paused" : false,
        "tags" : {
          "default" : {
            "current_container" : "file:///media/hdd4000/database/backup-2024-02-03-05-00-01.415519",
            "current_status" : "has been started",
            "mutation_log_bytes_written" : 0,
            "mutation_stream_id" : "8a41d0171e2fd8060cc8b682788c23a0",
            "range_bytes_written" : 0,
            "running_backup" : true,
            "running_backup_is_restorable" : false
          }
        },
        "total_workers" : 10
      },
      "dr_backup" : {
        "instances" : {
          "09f8ae181b62a48f843fd5be73881577" : {
            "configured_workers" : 10,
            "id" : "09f8ae181b62a48f843fd5be73881577",
            "last_updated" : 1711573633.2240255,
            "main_thread_cpu_seconds" : 332468.32462600002,
            "memory_usage" : 841007104,
            "process_cpu_seconds" : 336727.93240300001,
            "resident_size" : 724078592,
            "version" : "7.1.49"
          }
        },
        "instances_running" : 1,
        "last_updated" : 1711573633.2240255,
        "paused" : false,
        "tags" : {
          "default" : {
            "backup_state" : "is differential",
            "mutation_log_bytes_written" : 218154876973,
            "mutation_stream_id" : "d04069c450c9ebea7158b3582ffc0be2",
            "range_bytes_written" : 115953778604,
            "running_backup" : true,
            "running_backup_is_restorable" : true,
            "seconds_behind" : 0.61766700000000008
          }
        },
        "total_workers" : 10
      },
      "dr_backup_dest" : {
        "instances" : {
          "ba2549dfed10b9e11a2d8f6ee32be230" : {
            "configured_workers" : 10,
            "id" : "ba2549dfed10b9e11a2d8f6ee32be230",
            "last_updated" : 1711573727.3237493,
            "main_thread_cpu_seconds" : 8302.7225830000007,
            "memory_usage" : 198774784,
            "process_cpu_seconds" : 8827.1877419999983,
            "resident_size" : 23576576,
            "version" : "7.1.49"
          }
        },
        "instances_running" : 1,
        "last_updated" : 1711573727.3237493,
        "paused" : false,
        "tags" : {
          "default" : {
            "backup_state" : "is differential",
            "mutation_log_bytes_written" : 220237696485,
            "mutation_stream_id" : "7df4fc98573a8b4ab8a019bd4e92f7b6",
            "range_bytes_written" : 123375102718,
            "running_backup" : true,
            "running_backup_is_restorable" : true,
            "seconds_behind" : -1747510.757253
          }
        },
        "total_workers" : 10
      }
    },
    [...]
  }
}

This question is better raised on https://forums.foundationdb.org/; GitHub issues are for tracking specific bugs or problems.

How much lag does the DR report when you run fdbdr status? Once the destination cluster has caught up (i.e., the lag is only a few seconds), the data sizes should be about the same. If the lag is large, e.g. the destination cluster still has a lot of data to copy, a big difference is expected.
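For reference, a minimal sketch of checking the lag; the cluster-file paths below are hypothetical placeholders, and fdbdr is given both the source and destination cluster files:

# Report DR status, including how far the destination is behind the source.
fdbdr status -s /etc/foundationdb/source.cluster -d /etc/foundationdb/destination.cluster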

The other possibility is mutation logs buffered at the source cluster, which can be estimated by the size of the \xff\x02 keyspace.
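One possible way to get that estimate on the source cluster, sketched under the assumption that your fdbcli build provides the getestimatedrangesize command and allows it on the system keyspace (the \xff\x02 prefix is where the backup/DR mutation logs are buffered):

# ACCESS_SYSTEM_KEYS may be required to read \xff-prefixed keys; the result is an estimate in bytes.
fdbcli --exec 'option on ACCESS_SYSTEM_KEYS; getestimatedrangesize \xff\x02 \xff\x03'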

Oh, the status reports "backup_state" : "is differential", so it might be that the \xff\x02 keyspace is large, i.e., the DR is lagging a lot.

    "tags" : {
          "default" : {
            "backup_state" : "is differential",
            "mutation_log_bytes_written" : 220237696485,
            "mutation_stream_id" : "7df4fc98573a8b4ab8a019bd4e92f7b6",
            "range_bytes_written" : 123375102718,
            "running_backup" : true,
            "running_backup_is_restorable" : true,
            "seconds_behind" : -1747510.757253
          }

Actually, the status says the lag is large: 1747510.757253 seconds behind, which is about 20 days. Do you have DR agents running on the destination cluster?

Sorry for the inconvenience.
Yes, there are DR agents running on the destination cluster (which in turn runs a DR agent as well).
Is this an invalid deployment?

You are right. Totally my fault.
Thank you so much for your answer - and sorry for the inconvenience.

Yes, there are DR agents running on the destination cluster (which in turn runs a DR agent as well).

DR agents are needed on the destination cluster, so maybe you didn't have enough agents, and that caused the DR lag.
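In case it helps, a sketch of starting more agents against the destination cluster, assuming the same -s/-d cluster-file flags as fdbdr and hypothetical paths; several dr_agent processes can be run in parallel to increase copy throughput:

# Start an additional DR agent; each agent needs both cluster files. Repeat to scale out.
dr_agent -s /etc/foundationdb/source.cluster -d /etc/foundationdb/destination.cluster &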