jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Inconsistent results when cropping an already cropped page

samkit-jain opened this issue · comments

Describe the bug

When cropping an already cropped page, the objects are not preserved.

Code to reproduce the problem

import pdfplumber

# Make sure the file is downloaded at file.pdf
pdf = pdfplumber.open("file.pdf")
page = pdf.pages[0]

# Crop and save the top page and keep only the bottom 20%.
bottom = page.crop((0, 0.8 * float(page.height), page.width, page.height))
im = bottom.to_image(resolution=150)
im.save("bottom.png", format="PNG")

# Now, crop and save the left half of the cropped page.
bottom_left = bottom.crop((0, 0, 0.5 * float(bottom.width), bottom.height))
im = bottom_left.to_image(resolution=150)
im.save("bottom_left.png", format="PNG")

# Now, crop and save the right half of the cropped page.
bottom_right = bottom.crop((0.5 * float(bottom.width), 0, bottom.width, bottom.height))
im = bottom_right.to_image(resolution=150)
im.save("bottom_right.png", format="PNG")

PDF file

examples/pdfs/ag-energy-round-up-2017-02-24.pdf

Expected behavior

  • bottom.png - The bottom portion of the page is saved.
  • bottom_left.png - The left half of bottom portion of the page is saved.
  • bottom_right.png - The right half of the bottom portion of the page is saved.

Actual behavior

  • bottom.png - The bottom portion of the page is saved.
  • bottom_left.png - The left half of the top portion of the page is saved.
  • bottom_right.png - The right half of the top portion of the page is saved.

Screenshots

bottom.png
bottom

bottom_left.png
bottom_left

bottom_right.png
bottom_right

Environment

  • pdfplumber version: 0.5.22
  • Python version: 3.8.2
  • OS: Ubuntu 18.04 LTS

Additional context

The issue was found when working on #244

Thanks for flagging this, @samkit-jain! I think this is what's happening:

  • pdfplumber does not adjust coordinates after a crop (this is intentional, but open to discussion).

  • The second and third crop commands in the example assume (understandably) that the coordinates have been adjusted after the initial crop.

I can see a handful of potential solutions:

  • (a) Do nothing, but communicate better to users that .crop's bbox should be in terms of the original PDF, not the crop.

  • (b) Add a parameter to .crop, such as relative = True, that would let users indicate that they're providing relative coordinates, not absolute ones.

  • (c) Change the default, so that .crop assumes a relative-position bbox, but provide a parameter (e.g., relative = False) that reverts to the original approach.

  • (d) Make it so that pdfplumber automatically adjusts all coordinates (not just of the page's bbox, but of all extracted objects as well) when cropping.

Thoughts?

pdfplumber does not adjust coordinates after a crop (this is intentional, but open to discussion).

What is the reasoning behind this?

(d) Make it so that pdfplumber automatically adjusts all coordinates (not just of the page's bbox, but of all extracted objects as well) when cropping.

If by this you mean that the cropped page would be treated as a "real" page and all the operations like extract_text, extract_words, etc can be run on it as if it was the parent page, then yes, I'd prefer option D.

pdfplumber does not adjust coordinates after a crop (this is intentional, but open to discussion).

What is the reasoning behind this?

Great question; I should have elaborated earlier. The answer: Mostly for simplicity. A page's bounding box should, I think, exist in the same coordinate system as each object on the page. So changing the coordinates of the bounding box would mean changing the coordinates of all objects in the resulting cropped page. To me, ensuring that all coordinates relevant to all objects were moved post-crop seemed a tricky task.

To take an oversimplified example: Let's say a page's original bounding box was 0, 0, 20, 20, and we have a single point 10, 10. If we crop the page to 10, 10, 20, 20 ...

  • Currently: the cropped page's bounding box would be 10, 10, 20, 20 and the point would remain at 10, 10

  • If we readjusted all coordinates post-crop: the cropped page's new bounding box would be 0, 0, 10, 10 and the point would need to be moved to 0, 0.

The more I think about this, I think I prefer the following combination of improvements:

  • Keeping the current cropping system as-is — i.e., it does not alter the coordinate system

  • Throwing an error (or warning) if a user tries to crop part of a cropped page (or any page) that is not fully within the page's bounding box

  • Adding a relative parameter to both .crop and .within_bbox that allows users to pass a relative bounding box instead of using the absolute-coordinate system. E.g., to crop the example I mentioned above, to get the bottom half of the cropped page, they could call either cropped_page.crop((10, 15, 20, 20)) or cropped_page.crop((0, 5, 10, 10), relative = True) — the two would be equivalent.

Thanks for the explanation. +1 for the warning. Also, adding a new relative parameter would also be the most backwards compatible.

Closing now, as resolved in two commits referenced above and available in v0.5.23. Thanks again for raising the issue!