dgraph-io / badger

Fast key-value DB in Go.

Home Page: https://dgraph.io/badger

Have a way to create backup of Badger

manishrjain opened this issue

Look into how RocksDB is doing things. Use that knowledge to inform how we should do backups in Badger. We don't do versioning, but we do have write-ahead logs, which can be used to ensure that a backup is consistent up to a certain point.

Some rough notes:

• Unsorted iteration in Badger
◦ Go over the value log first, then check the keys in LSM.
◦ If sync=true, then this should work.
◦ If sync=false, then some key-values won’t be in vlog.
◦ Then iterate over LSM tree as well.

These notes say unsorted iteration because sorted iteration might be too slow (verify that with GOMAXPROCS=128 or 256).
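
As a rough sketch of that unsorted pass (vlog.Iterate, lsm.Get, lsm.Iterate and the entry/lsmValue types are hypothetical stand-ins, not real Badger APIs):

    // Hypothetical sketch of the value-log-first backup iteration described above.
    func unsortedBackup(vlog *valueLog, lsm *levels, emit func(k, v []byte)) error {
        // Pass 1: walk the value log sequentially (cheap, append-only reads).
        err := vlog.Iterate(func(e entry) error {
            if latest, ok := lsm.Get(e.key); ok && latest.vptr == e.vptr {
                emit(e.key, e.value) // this vlog entry is still the live version
            }
            return nil
        })
        if err != nil {
            return err
        }
        // Pass 2: with sync=false some recent values exist only in the LSM/memtables,
        // so walk the LSM as well and emit anything whose value is stored inline.
        return lsm.Iterate(func(k []byte, v lsmValue) error {
            if v.inline { // value not backed by a vlog pointer
                emit(k, v.data)
            }
            return nil
        })
    }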

We could write this out to just one file. Or, we could write this out to a bunch of files, which might be consistent with how Badger stores things (LSM + value log). If we do the latter, we could potentially open up one Badger store and just write to it -- but then, compactions would slow us down.

OTOH, we should allow a way to do incremental backups as well. How would that work?

For incremental:

  • Our global incrementing CAS counter is generally useful for filtering values / tombstones that were already backed up.
  • Tombstone entries need to be kept around at the highest numbered level. I don't know off the top of my head whether we do that. (Or: Tombstone entries with CAS counter greater than some threshold -- that of the last successful backup, as informed by the user -- need to be kept around.)
  • It will be useful to retain some general knowledge of the max CAS value seen in any given .sst file, so we can skip files that were already fully covered by the last backup.
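
A small sketch of how that filtering might look (tableInfo, entry, maxCAS and casCounter are assumed names, not existing Badger fields):

    // Hypothetical incremental-backup filters. lastBackupCAS is the CAS value of the
    // last successful backup, as supplied by the user.
    func skipTable(t tableInfo, lastBackupCAS uint64) bool {
        // Every entry in this .sst was already covered by the previous backup.
        return t.maxCAS <= lastBackupCAS
    }

    func includeEntry(e entry, lastBackupCAS uint64) bool {
        // Values and tombstones written after the last backup must be emitted;
        // tombstones are needed so deletions propagate into the backup.
        return e.casCounter > lastBackupCAS
    }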

Tombstone entries need to be kept around at the highest numbered level. I don't know off the top of my head whether we do that.

For sync=true, they're written to value log. So, they'd be around (unless garbage collection causes them to be removed). In fact, the whole backup thing works nicely if we can count on value log (need to think through how GC would affect all this a bit more).

Had some ideas about making successive backups fast when little data has changed.

• For the first run, divide up Badger data into key ranges of equal size. Have one file per key range.
• From the second run onwards, determine the key ranges from the existing files.
• When Backup starts, divide up the task into these key ranges, one goroutine per KR.
• Before outputting anything, create a checksum for this key range, by doing a fast key-only iteration.
• Compare this against a checksum stored in the KR file.
• If no change, then we’re done for the KR.
• Also, run an estimate of how big the KR is, using EstimateSize during the key-only iteration.
• If the estimate exceeds the per-file size limit, do a binary split of the KR.
• Write the KR.
• This design ensures that we don’t rewrite every key range every time. We only write data if this key-range had changes since the last backup.
• This would run well with external tools like rsync etc.

This design assumes that we're doing a key-value iteration for the backup. With GOMAXPROCS set to 128 or so, our key-value iteration is pretty fast. So, we don't need to create a value log iterator. We can use the existing iterator for this.
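
Roughly, the per-key-range step of the design above could look like this (backupKR, keyOnlyChecksum, writeRange, keyRange, krFile and maxKRFileSize are all hypothetical helpers, not part of Badger):

    // Hypothetical per-key-range backup step; the caller runs one goroutine per range.
    func backupKR(kv *KV, kr keyRange, prev krFile) error {
        // Fast key-only pass: checksum the keys (and cas counters) in this range and
        // estimate its size at the same time (using EstimateSize).
        sum, size := keyOnlyChecksum(kv, kr)
        if sum == prev.checksum {
            return nil // nothing changed since the last backup; keep the old file
        }
        if size > maxKRFileSize {
            // Range grew too large: split it in half and back up both halves.
            left, right := kr.split()
            if err := backupKR(kv, left, prev); err != nil {
                return err
            }
            return backupKR(kv, right, prev)
        }
        // Full key-value pass over just this range, rewriting only this range's file.
        return writeRange(kv, kr, sum)
    }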

Tracking the max cas counter value of each .sst file (or perhaps of key ranges) will suffice as an identically behaving alternative to computing hash values -- any key range with a newer cas counter value will have a non-matching hash anyway (because the hash computation must include cas counter values in its input).

So just by looking at the new sst's we can see the keys of new writes. And we can precisely filter out new vs. old writes based on the cas counter value.
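
Concretely, that argument only holds if the per-range checksum folds in the cas counters. A minimal sketch (assumes hash/fnv and encoding/binary are imported; all other names are hypothetical):

    // Hashing (key, cas counter) pairs means any write with a newer cas counter in
    // this range changes the sum, which is why tracking the max cas counter behaves
    // identically to comparing hashes.
    func rangeChecksum(entries []entry) uint64 {
        h := fnv.New64a()
        var buf [8]byte
        for _, e := range entries {
            h.Write(e.key)
            binary.BigEndian.PutUint64(buf[:], e.casCounter)
            h.Write(buf[:])
        }
        return h.Sum64()
    }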

So my plan is this:

  1. Track the max cas counter value of each table, keeping that info in the manifest file
  2. Double-check that tombstone entries have cas counter values.
  3. Double-check that tombstone entries aren't purged in the highest level.
  4. When performing a backup, with the last successful backup being at cas value C, we iterate all keys (iterating all levels simultaneously, as you'd do), but for any key K we only output entries with cas value > C. (In doing so, we can skip any table files whose max cas counter is <= C.)
  5. Also we make checkpoints saying we've successfully backed up key ranges such as A..M at cas value C. This way we don't lose all our work if we fail halfway through. So our last backup, that we're resuming from, is not just from "cas value C" -- it's from some contiguous sequence of key ranges KR1, KR2, KR3, ... with associated max cas values C1, C2, C3, ...
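
To make step 5 concrete, the checkpoint state could be recorded as something like this (a sketch only; the names are made up):

    // Hypothetical checkpoint record persisted after each finished key range, so a
    // crashed backup can resume from the last completed range instead of restarting.
    type backupCheckpoint struct {
        Start, End []byte // key range covered, e.g. "" .. "m"
        MaxCAS     uint64 // cas value this range is consistent up to
    }

    // A backup in progress is then a contiguous list of finished ranges:
    //   [{"", "g", C1}, {"g", "n", C2}, {"n", "", C3}, ...]
    // Resuming means restarting the iteration at the End key of the last record.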

Edit: Also, about restoration:

A backup archive will be able to answer the question "what was the state of the store on key range KR at cas value C?" for selected values of KR and C. But for the user API we're only interested in the case where KR = ["", infinity) and there is a single cas counter for the entire key range.

I'm looking at this user API (abstractly):

// MakeBackupStore makes an empty backup store for the KV store kv.
func MakeBackupStore(kv *KV) BackupStore

// CreateBackup backs up new changes from kv into bs. Returns a value identifying the backup snapshot.
func CreateBackup(kv *KV, bs *BackupStore) BackupSpec

// RestoreBackup creates a new KV from the backup at a given backup state (in the directory kvDir).
func RestoreBackup(bs *BackupStore, spec BackupSpec) *KV
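
For illustration, a hypothetical use of that API might look like this (kv is an already-open *KV; nothing here is final):

    bs := MakeBackupStore(kv)             // empty backup store tied to this KV
    spec1 := CreateBackup(kv, &bs)        // first backup: everything
    // ... more writes to kv ...
    spec2 := CreateBackup(kv, &bs)        // incremental: only changes since spec1
    restored := RestoreBackup(&bs, spec2) // fresh *KV rebuilt from the backup state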

Edit 2: Another thing we'll want is to prevent misapplied backups. We will want a unique identifier on each Badger store. RestoreBackup creates a new KV with a new identifier. The BackupStore we restore from could then switch to the new identifier, or keep the old one. (It could also support forking into a tree of histories, and it would be doable, but not something planned for the first version -- instead you have to copy the backup if you want to continue both badger histories.)

What if somebody just cp -R's a badger directory and tries to backup from it to the same backup store? We can put a unique identifier on each backup checkpoint that both the KV and backup store know about. That way a copy of the Badger dir won't be able to screw up backup state by having two conflicting backup forks.
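
A minimal sketch of that guard, assuming hypothetical storeID / lastCheckpointID fields on both sides (not real Badger state):

    // Both the KV store and the backup store remember the id of the last checkpoint
    // they agreed on; a cp -R'd Badger dir diverges after its first independent
    // backup and gets rejected here instead of corrupting the backup history.
    func canBackup(kv *KV, bs *BackupStore) bool {
        if kv.storeID != bs.storeID {
            return false // backup store belongs to a different Badger store
        }
        return kv.lastCheckpointID == bs.lastCheckpointID
    }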

Looking into this PR and I'm trying to figure out how a backup operation will affect any already running operations. Will it block writes until it's done, or just ignore any changes to the LSM tree with a higher cas counter?

Moving this back to the v1.0 milestone. In short, we need a way to back up both pre-v0.8 and post-v0.8 data. For pre-v0.8 data, we need to backport the backup API to the 0.8 branch.