apple / foundationdb

FoundationDB - the open source, distributed, transactional key-value store

Home Page:https://apple.github.io/foundationdb/

DR constantly growing disk space

doublex opened this issue

FDB 7.1 server with disaster recovery (DR): the used disk space is constantly growing.

Statistics on the machine running DR (the source cluster):

Sum of key-value sizes - 339.880 GB
Disk space used        - 441.932 GB

DR target (the numbers are the same when restoring from a backup):

Sum of key-value sizes - 127.438 GB
Disk space used        - 181.399 GB

A similar problem has been reported:
https://forums.foundationdb.org/t/key-value-sizes-at-dr-source-and-destination-have-a-big-difference/3351

fdbcli --exec 'status json' (truncated):

{
  "cluster" : {
    [...]
    "layers" : {
      "_valid" : true,
      "backup" : {
        "blob_recent_io" : {
          "bytes_per_second" : 0,
          "bytes_sent" : 0,
          "requests_failed" : 0,
          "requests_successful" : 0
        },
        "instances" : {
          "f9f2d06cd5ded70cc0d60baf4e1ea6d8" : {
            "blob_stats" : {
              "recent" : {
                "bytes_per_second" : 0,
                "bytes_sent" : 0,
                "requests_failed" : 0,
                "requests_successful" : 0
              },
              "total" : {
                "bytes_sent" : 0,
                "requests_failed" : 0,
                "requests_successful" : 0
              }
            },
            "configured_workers" : 10,
            "id" : "f9f2d06cd5ded70cc0d60baf4e1ea6d8",
            "last_updated" : 1711573622.9675598,
            "main_thread_cpu_seconds" : 616332.49406500009,
            "memory_usage" : 141631488,
            "process_cpu_seconds" : 623611.24382099998,
            "resident_size" : 25047040,
            "version" : "7.1.49"
          }
        },
        "instances_running" : 1,
        "last_updated" : 1711573622.9675598,
        "paused" : false,
        "tags" : {
          "default" : {
            "current_container" : "file:///media/hdd4000/database/backup-2024-02-03-05-00-01.415519",
            "current_status" : "has been started",
            "mutation_log_bytes_written" : 0,
            "mutation_stream_id" : "8a41d0171e2fd8060cc8b682788c23a0",
            "range_bytes_written" : 0,
            "running_backup" : true,
            "running_backup_is_restorable" : false
          }
        },
        "total_workers" : 10
      },
      "dr_backup" : {
        "instances" : {
          "09f8ae181b62a48f843fd5be73881577" : {
            "configured_workers" : 10,
            "id" : "09f8ae181b62a48f843fd5be73881577",
            "last_updated" : 1711573633.2240255,
            "main_thread_cpu_seconds" : 332468.32462600002,
            "memory_usage" : 841007104,
            "process_cpu_seconds" : 336727.93240300001,
            "resident_size" : 724078592,
            "version" : "7.1.49"
          }
        },
        "instances_running" : 1,
        "last_updated" : 1711573633.2240255,
        "paused" : false,
        "tags" : {
          "default" : {
            "backup_state" : "is differential",
            "mutation_log_bytes_written" : 218154876973,
            "mutation_stream_id" : "d04069c450c9ebea7158b3582ffc0be2",
            "range_bytes_written" : 115953778604,
            "running_backup" : true,
            "running_backup_is_restorable" : true,
            "seconds_behind" : 0.61766700000000008
          }
        },
        "total_workers" : 10
      },
      "dr_backup_dest" : {
        "instances" : {
          "ba2549dfed10b9e11a2d8f6ee32be230" : {
            "configured_workers" : 10,
            "id" : "ba2549dfed10b9e11a2d8f6ee32be230",
            "last_updated" : 1711573727.3237493,
            "main_thread_cpu_seconds" : 8302.7225830000007,
            "memory_usage" : 198774784,
            "process_cpu_seconds" : 8827.1877419999983,
            "resident_size" : 23576576,
            "version" : "7.1.49"
          }
        },
        "instances_running" : 1,
        "last_updated" : 1711573727.3237493,
        "paused" : false,
        "tags" : {
          "default" : {
            "backup_state" : "is differential",
            "mutation_log_bytes_written" : 220237696485,
            "mutation_stream_id" : "7df4fc98573a8b4ab8a019bd4e92f7b6",
            "range_bytes_written" : 123375102718,
            "running_backup" : true,
            "running_backup_is_restorable" : true,
            "seconds_behind" : -1747510.757253
          }
        },
        "total_workers" : 10
      }
    },
    [...]
  }
}

This question is better raised on https://forums.foundationdb.org/; GitHub issues are for tracking specific bugs or problems.

How much lag does the DR report when you run fdbdr status? Once the destination cluster has caught up (i.e., the lag is only a few seconds), the data sizes should be about the same. If the lag is large, e.g. the destination cluster still has a lot of data to copy, a big difference is expected.
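For reference, a minimal sketch of checking the lag; the cluster-file paths below are hypothetical placeholders, and fdbdr is given both the source and destination cluster files:

# Report DR status, including how far the destination is behind the source.
fdbdr status -s /etc/foundationdb/source.cluster -d /etc/foundationdb/destination.cluster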

The other possibility is mutation logs buffered at the source cluster, which can be estimated by the size of the \xff\x02 keyspace.
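One possible way to get that estimate on the source cluster, sketched under the assumption that your fdbcli build provides the getestimatedrangesize command and allows it on the system keyspace (the \xff\x02 prefix is where the backup/DR mutation logs are buffered):

# ACCESS_SYSTEM_KEYS may be required to read \xff-prefixed keys; the result is an estimate in bytes.
fdbcli --exec 'option on ACCESS_SYSTEM_KEYS; getestimatedrangesize \xff\x02 \xff\x03'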

Oh, the status reports "backup_state" : "is differential", so it might be that the \xff\x02 keyspace is large, i.e., the DR is lagging a lot.

    "tags" : {
          "default" : {
            "backup_state" : "is differential",
            "mutation_log_bytes_written" : 220237696485,
            "mutation_stream_id" : "7df4fc98573a8b4ab8a019bd4e92f7b6",
            "range_bytes_written" : 123375102718,
            "running_backup" : true,
            "running_backup_is_restorable" : true,
            "seconds_behind" : -1747510.757253
          }

Actually, the status says the lag is large: 1747510.757253 seconds behind, which is about 20 days. Do you have DR agents running on the destination cluster?

Sorry for the inconvenience.
Yes, there are DR agents running on the destination cluster (which in turn runs a DR agent as well).
Is this an invalid deployment?

You are right. Totally my fault.
Thank you so much for your answer - and sorry for the inconvenience.

Yes, there are DR agents running on the destination cluster (which in turn runs a DR agent as well).

DR agents are needed on the destination cluster, so maybe you didn't have enough agents, and that caused the DR lag.
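In case it helps, a sketch of starting more agents against the destination cluster, assuming the same -s/-d cluster-file flags as fdbdr and hypothetical paths; several dr_agent processes can be run in parallel to increase copy throughput:

# Start an additional DR agent; each agent needs both cluster files. Repeat to scale out.
dr_agent -s /etc/foundationdb/source.cluster -d /etc/foundationdb/destination.cluster &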