J-F-Liu / lopdf

A Rust library for PDF document manipulation.

Is there any more docs or help available?

markusdd opened this issue · comments

Hi,

thanks for providing this lib, but I find myself unable to use it.

I need a small tool to crop a certain area out of a PDF page, rotate it, and save it into a new pdf.

This library seemed like a good match, but the modify example is so simple that it doesn't help me at all, and the documentation isn't very verbose either.

I'm already struggling to understand why the page_ids are a BTreeMap and how to get pages out of that.
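
For readers with the same question, here is a minimal standard-library sketch of the BTreeMap part. It assumes the page map goes from a page number to an object ID shaped like the `(u32, u16)` pairs in lopdf's filter callback signature. `ObjectId`, `sample_pages`, `first_page_id`, and `page_numbers` are hypothetical stand-ins for illustration, not lopdf's API, and the IDs are made up.

```rust
use std::collections::BTreeMap;

// Stand-in for a PDF object ID: (object number, generation number).
// Hypothetical alias; it mirrors the (u32, u16) pairs that appear in
// lopdf's filter callback signature.
type ObjectId = (u32, u16);

// Builds a sample page map: page number -> object ID of that page.
// The IDs are invented; real ones depend on how the PDF was written.
fn sample_pages() -> BTreeMap<u32, ObjectId> {
    let mut pages = BTreeMap::new();
    pages.insert(2, (12, 0));
    pages.insert(1, (7, 0));
    pages.insert(3, (15, 0));
    pages
}

// Returns the object ID of the lowest-numbered page, if any.
// A BTreeMap keeps its keys sorted, so the first value belongs to page 1 here.
fn first_page_id(pages: &BTreeMap<u32, ObjectId>) -> Option<ObjectId> {
    pages.values().next().copied()
}

// Lists page numbers in ascending order, regardless of insertion order.
fn page_numbers(pages: &BTreeMap<u32, ObjectId>) -> Vec<u32> {
    pages.keys().copied().collect()
}
```

A sorted map means iterating yields pages in page order, and a single page can be looked up with `pages.get(&1)`. The resulting ID is presumably what you then use to fetch the page object from the document.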

Is there any additional help available?

What I am trying here is probably catastrophically wrong, but how is this supposed to work?
I managed to guess that the pages are somehow at the arcane entry (7,0); now I am trying to somehow get to the MediaBox property to modify it.
When I print the arcane LinkedHashMap, all the keys are lists that look like the ASCII-encoded characters of what the keys actually should be.
I mean... this can't be the recommended way to work with this?
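
The lists of numbers are most likely just how Rust's `Debug` formatting renders byte-string keys: a `Vec<u8>` or `&[u8]` prints as a list of integers. Below is a small standard-library sketch of that effect and of turning such a key back into readable text. The helpers `key_to_text` and `key_debug` are hypothetical and not part of lopdf.

```rust
// Renders a byte-string key as text. Invalid UTF-8 sequences are replaced
// with the Unicode replacement character rather than causing an error.
fn key_to_text(key: &[u8]) -> String {
    String::from_utf8_lossy(key).into_owned()
}

// Shows what the Debug formatter prints for the same key:
// a bracketed list of the byte values.
fn key_debug(key: &[u8]) -> String {
    format!("{:?}", key)
}
```

This also suggests why dictionary lookups are written with byte-string literals such as `b"MediaBox"` rather than plain `&str` keys.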

So, a simple question by way of example: I have a PDF (assume a single page only) that I need to rotate 90 degrees to the right and then essentially crop down to a predefined area.
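
As a rough sketch of what such a change amounts to at the PDF level: rotation and cropping are normally expressed as entries in the page dictionary, `/Rotate` (a multiple of 90, clockwise) and `/CropBox` (a rectangle `[llx lly urx ury]`). The code below models that with standard-library types only. `Value`, `PageDict`, `rotate_page`, and `crop_page` are hypothetical stand-ins, not lopdf's types or functions. With lopdf the same edits would presumably be made on the page's dictionary object before saving.

```rust
use std::collections::BTreeMap;

// Simplified stand-in for a PDF value; hypothetical, not lopdf's Object type.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Integer(i64),
    Array(Vec<f64>),
}

// Simplified stand-in for a page dictionary keyed by byte strings.
type PageDict = BTreeMap<Vec<u8>, Value>;

// Adds `degrees` (clockwise) to the page's /Rotate entry and normalizes the
// result into 0..360. Returns None, leaving the page untouched, if `degrees`
// is not a multiple of 90, which /Rotate requires.
fn rotate_page(page: &mut PageDict, degrees: i64) -> Option<i64> {
    if degrees % 90 != 0 {
        return None;
    }
    // A missing /Rotate entry is treated as 0 degrees.
    let current = match page.get(&b"Rotate"[..]) {
        Some(Value::Integer(r)) => *r,
        _ => 0,
    };
    let rotated = (current + degrees).rem_euclid(360);
    page.insert(b"Rotate".to_vec(), Value::Integer(rotated));
    Some(rotated)
}

// Sets /CropBox to [llx, lly, urx, ury]. The rectangle is given in the
// page's unrotated coordinate system; /Rotate is applied when displaying.
fn crop_page(page: &mut PageDict, llx: f64, lly: f64, urx: f64, ury: f64) {
    page.insert(b"CropBox".to_vec(), Value::Array(vec![llx, lly, urx, ury]));
}
```

Because the crop rectangle lives in unrotated page coordinates, an area chosen on the rotated view has to be translated back before it is written to `/CropBox`.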

If it helps here is my code that parses a PDF and extracts all page text:

use std::collections::BTreeMap;

#[derive(Debug, Default)]
pub struct PdfPages {
    pub pages: BTreeMap<usize, Page>,
    pub errors: Vec<lopdf::Error>,
}

impl PdfPages {
    pub fn open(file_path: impl AsRef<std::path::Path>) -> Result<Self, Box<dyn std::error::Error>> {
        let doc = lopdf::Document::load_filtered(file_path, filter_func)?;
        Self::extract_document_pages(doc)
    }
    fn extract_document_pages(doc: lopdf::Document) -> Result<Self, Box<dyn std::error::Error>> {
        let mut pages: Vec<Page> = Default::default();
        let mut errors: Vec<lopdf::Error> = Default::default();
        for (page_num, _) in doc.get_pages().into_iter() {
            match doc.extract_text(&[page_num]) {
                Ok(text) => {
                    pages.push(Page {
                        page_number: page_num as usize,
                        page_content: text,
                    });
                }
                Err(error) => {
                    errors.push(error);
                }
            }
        }
        let pages = pages
            .into_iter()
            .map(|page| (page.page_number, page))
            .collect::<BTreeMap<_, _>>();
        let payload = Self { pages, errors };
        Ok(payload)
    }
}

#[derive(Debug, Clone)]
pub struct Page {
    pub page_number: usize,
    pub page_content: String,
}

static IGNORE: &[&str] = &[
    "Length",
    "BBox",
    "FormType",
    "Matrix",
    "Resources",
    "Type",
    "XObject",
    "Subtype",
    "Filter",
    "ColorSpace",
    "Width",
    "Height",
    "BitsPerComponent",
    "Length1",
    "Length2",
    "Length3",
    "PTEX.FileName",
    "PTEX.PageNumber",
    "PTEX.InfoDict",
    "FontDescriptor",
    "ExtGState",
    "Font",
    "MediaBox",
    "Annot",
];

fn filter_func(object_id: (u32, u16), object: &mut lopdf::Object) -> Option<((u32, u16), lopdf::Object)> {
    if IGNORE.contains(&object.type_name().unwrap_or_default()) {
        return None;
    }
    if let Ok(d) = object.as_dict_mut() {
        d.remove(b"Font");
        d.remove(b"Resources");
        d.remove(b"Producer");
        d.remove(b"ModDate");
        d.remove(b"Creator");
        d.remove(b"ProcSet");
        d.remove(b"XObject");
        d.remove(b"MediaBox");
        d.remove(b"Annots");
        if d.is_empty() {
            return None;
        }
    }
    Some((object_id, object.to_owned()))
}

You use Rust iterators to iterate over most data structures, e.g.:

let pdf_path = std::path::PathBuf::from("tmp/MyPdf.pdf");
let pdf_pages = pdf2text::PdfPages::open(&pdf_path)?;

for page in pdf_pages.pages.values() {
    println!("Page #{}", page.page_number)
}

Overall the API is very easy to understand, but it's very canonical Rust: you have to know the Rust std, general conventions, and whatnot.