Deleting objects is very slow (and single threaded)
arifd opened this issue · comments
Arif Driessen commented
Please take a look at this example:
// [dependencies]
// lopdf = { version = "=0.29.0", default-features = false, features = ["nom_parser", "rayon"] }

use lopdf::{Document, Object, ObjectId};
use std::time::Instant;

pub fn strip_annotations(
    doc: &Document,
    page_id: ObjectId,
) -> Result<Vec<u8>, Box<dyn std::error::Error>> {
    let mut doc = doc.clone();
    let mut result = Vec::new();
    let mut annots = Vec::new();

    let now = Instant::now();
    if let Ok(page) = doc.get_dictionary(page_id) {
        match page.get(b"Annots") {
            Ok(Object::Reference(ref id)) => doc
                .get_object(*id)
                .and_then(Object::as_array)
                .unwrap()
                .iter()
                .flat_map(Object::as_reference)
                // .flat_map(|id| doc.get_dictionary(id))
                .for_each(|a| annots.push(a)),
            Ok(Object::Array(ref a)) => a
                .iter()
                .flat_map(Object::as_reference)
                // .flat_map(|id| doc.get_dictionary(id))
                .for_each(|a| annots.push(a)),
            _ => {}
        }
    }
    println!("annotaions parsed duration: {:?}", now.elapsed());

    let now = Instant::now();
    let annots_len = annots.len();
    for (i, a) in annots.into_iter().enumerate() {
        println!("deleting annotation: {}/{}", i + 1, annots_len);
        let now = Instant::now();
        doc.delete_object(a);
        println!(
            "deleting annotation {}/{} duration: {:?}",
            i + 1,
            annots_len,
            now.elapsed()
        );
    }
    println!("stripping annotations duration: {:?}", now.elapsed());

    let now = Instant::now();
    doc.save_to(&mut result)?;
    println!("saving doc duration {:?}", now.elapsed());

    Ok(result)
}

fn main() {
    let now = Instant::now();
    let pdf = std::fs::read("2022-AI-Index-Report_Master.pdf").unwrap();
    println!("bytes read from FS duration: {:?}", now.elapsed());

    let now = Instant::now();
    let doc = Document::load_mem(&pdf).unwrap();
    println!("loading doc duration: {:?}", now.elapsed());

    let now = Instant::now();
    let p52 = *doc.get_pages().get(&52).unwrap();
    println!("getting pages duration: {:?}", now.elapsed());

    let now = Instant::now();
    let _pdf = strip_annotations(&doc, p52).unwrap();
    println!("stripping annots duration: {:?}", now.elapsed());
}
On my reasonably performant CPU (AMD Ryzen 5 4600G), this is the output:
bytes read from FS duration: 20.853687ms
loading doc duration: 2.885832266s
getting pages duration: 926.86µs
annotaions parsed duration: 22.698µs
deleting annotation: 1/3
deleting annotation 1/3 duration: 101.403896297s
deleting annotation: 2/3
deleting annotation 2/3 duration: 100.875888265s
deleting annotation: 3/3
deleting annotation 3/3 duration: 102.366654197s
stripping annotations duration: 304.646500919s
saving doc duration 1.506718584s
stripping annots duration: 307.510035521s
Could deleting objects be made any faster?
I have compressed the attached file to get around the 25 MB file size limit:
2022-AI-Index-Report_Master.pdf.tar.gz
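The three deletions above each take roughly the same time (about 100 s), which would be consistent with every `delete_object` call scanning the whole document to strip references to the deleted id. That is an assumption about lopdf's internals, not something confirmed in this thread. The sketch below is a toy model using only the standard library, not lopdf code: `Node`, `sample`, `delete_one_by_one`, and `delete_batch` are hypothetical names. It shows why one scan per deleted object costs more than a single scan for the whole batch.

```rust
use std::collections::{BTreeMap, BTreeSet};

type Id = u32;

/// Toy stand-in for a PDF object: it only records which objects it references.
#[derive(Debug, Clone, PartialEq)]
struct Node {
    refs: Vec<Id>,
}

/// Hypothetical document: object 1 plays the page, 10..=12 the annotations.
fn sample() -> BTreeMap<Id, Node> {
    let mut objects = BTreeMap::new();
    objects.insert(1, Node { refs: vec![10, 11, 12, 2] });
    objects.insert(2, Node { refs: vec![] });
    for id in 10..=12 {
        objects.insert(id, Node { refs: vec![1] });
    }
    objects
}

/// One full scan per deleted id (assumed to mirror the slow path).
/// Returns the number of full scans performed.
fn delete_one_by_one(objects: &mut BTreeMap<Id, Node>, ids: &[Id]) -> usize {
    let mut scans = 0;
    for id in ids {
        for node in objects.values_mut() {
            node.refs.retain(|r| r != id);
        }
        scans += 1;
        objects.remove(id);
    }
    scans
}

/// A single scan that strips references to every id in the batch.
/// Returns the number of full scans performed.
fn delete_batch(objects: &mut BTreeMap<Id, Node>, ids: &[Id]) -> usize {
    let doomed: BTreeSet<Id> = ids.iter().copied().collect();
    for node in objects.values_mut() {
        node.refs.retain(|r| !doomed.contains(r));
    }
    objects.retain(|id, _| !doomed.contains(id));
    1
}

fn main() {
    let annots = [10, 11, 12];
    let (mut a, mut b) = (sample(), sample());
    let slow = delete_one_by_one(&mut a, &annots);
    let fast = delete_batch(&mut b, &annots);
    println!("scans: one-by-one = {}, batched = {}", slow, fast);
    println!("same result: {}", a == b);
}
```

Both functions leave the toy document in the same state; only the number of passes over the object map differs. Whether lopdf could expose a batched deletion like this is a question for the maintainers.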
Arif Driessen commented
Closing because, stupidly, I forgot to run a release build; in release mode the timings are much more respectable:
bytes read from FS duration: 16.376765ms
loading doc duration: 242.326198ms
getting pages duration: 248.774µs
annotaions parsed duration: 7.543µs
deleting annotation: 1/3
deleting annotation 1/3 duration: 915.52336ms
deleting annotation: 2/3
deleting annotation 2/3 duration: 915.440878ms
deleting annotation: 3/3
deleting annotation 3/3 duration: 921.86179ms
stripping annotations duration: 2.752877291s
saving doc duration 168.660159ms
stripping annots duration: 3.250700536s
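To avoid repeating the debug-versus-release mix-up, a timing harness can label its output with the build profile. The sketch below uses only the standard library; `timed` and `build_profile` are hypothetical helper names, and it assumes Cargo's default profiles, where `debug_assertions` is on for `cargo run` and off for `cargo run --release`.

```rust
use std::time::{Duration, Instant};

/// Returns "debug" when debug assertions are enabled, "release" otherwise.
/// Assumes default Cargo profiles; a custom profile can change this setting.
fn build_profile() -> &'static str {
    if cfg!(debug_assertions) {
        "debug"
    } else {
        "release"
    }
}

/// Hypothetical helper: runs `f`, prints the elapsed time tagged with the
/// build profile, and returns the closure's result together with the duration.
fn timed<T>(label: &str, f: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let value = f();
    let elapsed = start.elapsed();
    println!("[{}] {} duration: {:?}", build_profile(), label, elapsed);
    (value, elapsed)
}

fn main() {
    if cfg!(debug_assertions) {
        eprintln!("warning: debug build, timings are not representative; rerun with --release");
    }
    // Placeholder workload standing in for the document operations.
    let (sum, _elapsed) = timed("summing", || (0..1_000_000u64).sum::<u64>());
    println!("sum = {}", sum);
}
```

Wrapping each step of the example in `timed` would keep the log format used above while making the build profile visible on every line.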