sourcefrog / conserve

🌲 Robust file backup tool in Rust

parallelize `referenced_blocks`

sourcefrog opened this issue

`referenced_blocks` builds a set of all blocks referenced by all indexes. In 0.6.8 this is done on a single thread, but it could fairly easily be parallelized:

  • Read multiple bands in parallel.
  • Read hunks of indexes in parallel.

This will significantly help the performance of `conserve gc` and `conserve delete`. (A rough sketch of the band-level parallelism follows the code excerpt below.)

Not strictly the same but related:

  • There's no need to stitch indexes when finding referenced blocks, because stitching amounts to reading the same hunk twice. (However, truncated indexes are probably fairly rare, so this is not a high priority.)

conserve/src/archive.rs, lines 238 to 261 in 9f405dc:

```rust
pub fn referenced_blocks(&self) -> Result<BTreeSet<BlockHash>> {
    self.iter_referenced_blocks().map(Iterator::collect)
}

/// Iterate all blocks referenced by all bands.
///
/// The iterator returns repeatedly-referenced blocks repeatedly, without
/// deduplicating.
///
/// This shows a progress bar as indexes are iterated.
fn iter_referenced_blocks(&self) -> Result<impl Iterator<Item = BlockHash>> {
    let archive = self.clone();
    let mut progress_bar = ProgressBar::new();
    progress_bar.set_phase("Find referenced blocks...".to_owned());
    let band_ids = self.list_band_ids()?;
    let num_bands = band_ids.len();
    Ok(band_ids
        .into_iter()
        .enumerate()
        .inspect(move |(i, _)| progress_bar.set_fraction(*i, num_bands))
        .map(move |(_i, band_id)| Band::open(&archive, &band_id).expect("Failed to open band"))
        .flat_map(|band| band.iter_entries().expect("Failed to iter entries"))
        .flat_map(|entry| entry.addrs)
        .map(|addr| addr.hash))
}
```
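For illustration, here is a minimal sketch of the band-level parallelism from the first bullet, written against the APIs visible in the excerpt above (`list_band_ids`, `Band::open`, `iter_entries`, `entry.addrs`). The use of rayon, the `referenced_blocks_parallel` name, and the `try_reduce` error handling are all assumptions, not the actual implementation; hunk-level parallelism would additionally need the index reader to expose individual hunks, which the excerpt doesn't show.

```rust
// Hypothetical sketch, not conserve's actual code: fan bands out across
// rayon workers, collect a per-band set of block hashes, and merge them.
use std::collections::BTreeSet;

use rayon::prelude::*; // assumed dependency

pub fn referenced_blocks_parallel(archive: &Archive) -> Result<BTreeSet<BlockHash>> {
    let band_ids = archive.list_band_ids()?;
    band_ids
        .par_iter() // read multiple bands in parallel (first bullet above)
        .map(|band_id| -> Result<BTreeSet<BlockHash>> {
            // Open and scan one band's index entirely on this worker.
            let band = Band::open(archive, band_id)?;
            Ok(band
                .iter_entries()?
                .flat_map(|entry| entry.addrs)
                .map(|addr| addr.hash)
                .collect())
        })
        // Union the per-band sets; duplicate references collapse here,
        // and the first error, if any, propagates out.
        .try_reduce(BTreeSet::new, |mut acc, set| {
            acc.extend(set);
            Ok(acc)
        })
}
```

Since `gc` and `delete` ultimately want the deduplicated set anyway, collecting per-band `BTreeSet`s and taking their union sidesteps the non-deduplicating iterator entirely; the progress bar is omitted from the sketch and would need a thread-safe counterpart.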