flyingcircusio / backy


Support migration of backups across multiple servers

ctheune opened this issue

We regularly need to retire old servers. Instead of copying data around we could just let an old server sit around for, e.g., 3 months and start placing new backups on a new server. However, operators may then stumble over suddenly not finding older backups and having to hunt them down manually.

It would be nice if a `backy status` call also showed backups that are located on other backup servers (maybe even in remote locations?!?) so operators can go there to operate on them.

In this distributed setup we also need to expire the old backups on the other servers and not just let them sit there indefinitely.

This could also help rebalance backups between servers automatically.

An interesting architecture could be:

  • Only the schedulers talk to each other (let's do something simple; authentication can be based on our directory's shared password generator).
  • The primary scheduler of a backup polls the other schedulers for info about the revisions they hold.
  • We extend the revision file format to record which scheduler a revision is placed on and replicate the revision files from there (see the sketch after this list).
  • The primary scheduler can mark revisions for purging by setting a tombstone marker in the local revision file (which can also be done manually). Upon the next sync those tombstones cause the remote scheduler to actually purge them, after which the local tombstones can be removed.
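To make the last two points concrete, here is a minimal sketch of what the extended revision metadata and the tombstone sync could look like. All the names (`location`, `tombstone`, `sync_tombstones`, the `purge` call) are hypothetical, not existing backy code:

```python
# Sketch only: field and function names are made up for illustration.
from dataclasses import dataclass, field


@dataclass
class Revision:
    uuid: str
    timestamp: float
    tags: set = field(default_factory=set)
    location: str = "local"   # which scheduler holds the actual data
    tombstone: bool = False   # marked for purging on the owning server


def sync_tombstones(local_revisions, remote_scheduler):
    """Ask the remote scheduler to purge tombstoned revisions it owns,
    then drop the local tombstone record once the purge succeeded."""
    for rev in list(local_revisions):
        if rev.tombstone and rev.location == remote_scheduler.name:
            remote_scheduler.purge(rev.uuid)  # hypothetical API call
            local_revisions.remove(rev)       # tombstone no longer needed
```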

The beauty here is that the CLI tools never have to wait for remote server interaction and only operate on local files. Locking needs to be respected, however.

Aaaaand. This also means that the schedulers will have an API that we might want to integrate with the directory so that admins and customers could check which backups exist and in the future may even be able to browse backups, extract files, etc.

API:

  • fetch revs: filter location == local; optionally update the mtime of the backup dir
  • update_tags(rev, old, new): accept only if the revision is not active, always accept new manual tags, validate that old matches the actual tags, delete the rev if new is empty (sketched below)
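A rough sketch of the `update_tags` semantics as a compare-and-swap; the method names on `rev` and the `manual:` tag prefix are assumptions, not the actual backy API:

```python
# Sketch only: not the actual backy API, just the rules listed above.
def update_tags(rev, old: set[str], new: set[str]):
    if rev.active:
        raise RuntimeError("refusing to retag an active revision")
    if old != rev.tags:
        # Caller's view is stale. Newly added manual tags are still
        # accepted, everything else is rejected.
        manual = {t for t in new if t.startswith("manual:")}
        if not manual:
            raise ValueError("tags changed concurrently, refusing update")
        new = rev.tags | manual
    if not new:
        rev.remove()   # an empty tag set deletes the revision
    else:
        rev.tags = new
        rev.write()
```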

TODO:

  • mark servers as "lost" -> all servers should then delete all revisions/tombstones with location == lost server (see the sketch below)
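And a sketch of the "lost server" cleanup, reusing the hypothetical `location` field from above:

```python
# Sketch only: drop all knowledge about a server that was marked "lost".
def forget_lost_server(revisions, lost: str):
    for rev in list(revisions):
        if rev.location == lost:   # the data is gone anyway
            revisions.remove(rev)  # drops revision records and tombstones alike
```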

There's also the integration code for the platform that we need to prepare before rolling this out!

It's a long story, but we're getting there.

What we identified during review today:

  • Gracefully dealing with two servers that both consider themselves active for a job requires some coordination. We could likely get away with a server-based spread and ensure that we don't violate the SLA at that point; this would reduce the risk of interfering backups. It requires the servers to check their neighbours for running backups (see the sketch after this list). It's not as robust as real locking but avoids complicated interdependencies.
  • Consider what happens to snapshot names that are generated and cleaned up by both servers; also, we can only integrate diffs between revisions that are locally available.
  • We currently only have a single backup server in DEV; we'll need to add a second one to properly try this out beyond the NixOS functional testing.
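A minimal sketch of that neighbour check, assuming a hypothetical HTTP status endpoint on each peer scheduler (backy has no such endpoint today):

```python
# Sketch only: the peer URL layout and the /jobs/<job>/status endpoint
# are assumptions, not an existing backy API.
import json
import urllib.request


def job_running_elsewhere(job: str, peers: list[str]) -> bool:
    """Ask every peer scheduler whether it is currently backing up `job`."""
    for peer in peers:
        try:
            with urllib.request.urlopen(f"{peer}/jobs/{job}/status", timeout=5) as resp:
                status = json.load(resp)
        except OSError:
            # Unreachable peer: skip it. A conservative variant would treat
            # this as "maybe running" and back off instead.
            continue
        if status.get("running"):
            return True
    return False


# Before starting a backup, a scheduler would check its neighbours:
# if job_running_elsewhere("vm-test01", ["http://backup02:8080"]):
#     delay this run and retry later
```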