J-F-Liu / lopdf

A Rust library for PDF document manipulation.

Deleting objects is very slow (and single threaded)

arifd opened this issue

Please take a look at this example:

// [dependencies]
// lopdf = { version = "=0.29.0", default-features = false, features = ["nom_parser", "rayon"] }

use lopdf::{Document, Object, ObjectId};
use std::time::Instant;

pub fn strip_annotations(
    doc: &Document,
    page_id: ObjectId,
) -> Result<Vec<u8>, Box<dyn std::error::Error>> {
    let mut doc = doc.clone();
    let mut result = Vec::new();
    let mut annots = Vec::new();

    let now = Instant::now();
    if let Ok(page) = doc.get_dictionary(page_id) {
        match page.get(b"Annots") {
            Ok(Object::Reference(ref id)) => doc
                .get_object(*id)
                .and_then(Object::as_array)?
                .iter()
                .flat_map(Object::as_reference)
                // .flat_map(|id| doc.get_dictionary(id))
                .for_each(|a| annots.push(a)),
            Ok(Object::Array(ref a)) => a
                .iter()
                .flat_map(Object::as_reference)
                // .flat_map(|id| doc.get_dictionary(id))
                .for_each(|a| annots.push(a)),
            _ => {}
        }
    }
    println!("annotaions parsed duration: {:?}", now.elapsed());

    let now = Instant::now();
    let annots_len = annots.len();
    for (i, a) in annots.into_iter().enumerate() {
        println!("deleting annotation: {}/{}", i + 1, annots_len);
        let now = Instant::now();
        doc.delete_object(a);
        println!(
            "deleting annotation {}/{} duration: {:?}",
            i + 1,
            annots_len,
            now.elapsed()
        );
    }
    println!("stripping annotations duration: {:?}", now.elapsed());

    let now = Instant::now();
    doc.save_to(&mut result)?;
    println!("saving doc duration {:?}", now.elapsed());

    Ok(result)
}

fn main() {
    let now = Instant::now();
    let pdf = std::fs::read("2022-AI-Index-Report_Master.pdf").unwrap();
    println!("bytes read from FS duration: {:?}", now.elapsed());
    let now = Instant::now();
    let doc = Document::load_mem(&pdf).unwrap();
    println!("loading doc duration: {:?}", now.elapsed());
    let now = Instant::now();
    let p52 = *doc.get_pages().get(&52).unwrap();
    println!("getting pages duration: {:?}", now.elapsed());
    let now = Instant::now();
    let _pdf = strip_annotations(&doc, p52).unwrap();
    println!("stripping annots duration: {:?}", now.elapsed());
}

On my reasonably performant CPU (AMD Ryzen 5 4600G), this is the output:

bytes read from FS duration: 20.853687ms
loading doc duration: 2.885832266s
getting pages duration: 926.86µs
annotaions parsed duration: 22.698µs
deleting annotation: 1/3
deleting annotation 1/3 duration: 101.403896297s
deleting annotation: 2/3
deleting annotation 2/3 duration: 100.875888265s
deleting annotation: 3/3
deleting annotation 3/3 duration: 102.366654197s
stripping annotations duration: 304.646500919s
saving doc duration 1.506718584s
stripping annots duration: 307.510035521s

Could deleting objects be made any faster?
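
The three deletions above take almost exactly the same time (about 101 s each). That is what one would expect if every `delete_object` call does work proportional to the size of the whole document rather than to the object being removed. Whether lopdf really behaves that way is an assumption here and is not confirmed in this thread. The sketch below uses a hypothetical toy object graph (`Graph`, `Node`, `delete_one`, `delete_many` and `sample` are illustrative names, not lopdf API) to show how removing a set of ids in one pass avoids repeating a full scan per deletion.

```rust
use std::collections::{BTreeMap, HashSet};

/// Hypothetical stand-in for a PDF object id (not lopdf's `ObjectId`).
type Id = u32;

/// Hypothetical toy object: it only stores references to other objects.
#[derive(Debug, Clone, Default, PartialEq)]
struct Node {
    refs: Vec<Id>,
}

/// Hypothetical toy document: a map from id to object.
#[derive(Debug, Default)]
struct Graph {
    objects: BTreeMap<Id, Node>,
}

impl Graph {
    /// Removes one object and every reference to it.
    /// Each call scans all objects, so k deletions cost k full scans.
    fn delete_one(&mut self, id: Id) -> Option<Node> {
        for node in self.objects.values_mut() {
            node.refs.retain(|r| *r != id);
        }
        self.objects.remove(&id)
    }

    /// Removes a whole set of objects with a single scan over the document.
    /// Returns how many objects were actually removed.
    fn delete_many(&mut self, ids: &HashSet<Id>) -> usize {
        let before = self.objects.len();
        self.objects.retain(|id, _| !ids.contains(id));
        for node in self.objects.values_mut() {
            node.refs.retain(|r| !ids.contains(r));
        }
        before - self.objects.len()
    }
}

/// Builds a small graph: object 1 plays the page and references objects 2 to 5.
fn sample() -> Graph {
    let mut g = Graph::default();
    g.objects.insert(1, Node { refs: vec![2, 3, 4, 5] });
    for id in 2..=5 {
        g.objects.insert(id, Node::default());
    }
    g
}

fn main() {
    let mut g = sample();
    let removed = g.delete_many(&HashSet::from([2, 3, 4]));
    println!("removed {} objects, page refs now {:?}", removed, g.objects[&1].refs);
}
```

Both methods leave the toy graph in the same state; the batched one just touches every remaining object once instead of once per deleted id. If the same cost pattern applies to a real document, collecting the annotation ids first and removing them together would be the analogous change.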

I have compressed the attached file to get around the 25 MB file size limit:
2022-AI-Index-Report_Master.pdf.tar.gz

Closing because, stupidly, I forgot to run a release build. In release mode the timings are much more respectable:

bytes read from FS duration: 16.376765ms
loading doc duration: 242.326198ms
getting pages duration: 248.774µs
annotaions parsed duration: 7.543µs
deleting annotation: 1/3
deleting annotation 1/3 duration: 915.52336ms
deleting annotation: 2/3
deleting annotation 2/3 duration: 915.440878ms
deleting annotation: 3/3
deleting annotation 3/3 duration: 921.86179ms
stripping annotations duration: 2.752877291s
saving doc duration 168.660159ms
stripping annots duration: 3.250700536s
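
Between the two runs, a single deletion dropped from about 101.4 s to about 0.92 s, roughly 110 times faster, purely from the build profile. A small guard can make that mistake harder to repeat when timing code. The sketch below relies on `cfg!(debug_assertions)`, which is enabled by default in Cargo's dev profile and disabled in the release profile; `build_profile` and `warn_if_unoptimized` are hypothetical helper names, and a customised profile could make the check misleading.

```rust
/// Reports which kind of build is running, based on `debug_assertions`.
fn build_profile() -> &'static str {
    if cfg!(debug_assertions) {
        "debug"
    } else {
        "release"
    }
}

/// Prints a warning when the binary was built with debug assertions on,
/// which usually means an unoptimised build. Returns true if it warned.
fn warn_if_unoptimized() -> bool {
    let debug = cfg!(debug_assertions);
    if debug {
        eprintln!(
            "warning: debug build; timings are not representative, try `cargo run --release`"
        );
    }
    debug
}

fn main() {
    // Call this once before any timing code runs.
    warn_if_unoptimized();
    println!("build profile: {}", build_profile());
}
```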