jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.

The rotation configuration is set to an IndirectObject, which prevents the PDF from being uploaded.

zzhangyun opened this issue

Describe the bug

The rotation configuration for the PDF file is set to IndirectObject(12, 0, 4419697344). Uploading this file produces the following error:

2024-07-19 14:40:21,146 - werkzeug - INFO - 10.131.0.1 - - [19/Jul/2024 14:40:21] "POST /api/v1/project/6698c33997df8cbbfd8770c0/document/?collection=6698c40f97df8cbbfd8770c1 HTTP/1.0" 500 -
Traceback (most recent call last):
  File "/app/server/service/pdf_svc.py", line 767, in get_doc_pages_raw_data
    for num_page_index in range(0, len(pdf.pages)):
  File "/usr/local/lib/python3.9/site-packages/pdfplumber/pdf.py", line 142, in pages
    p = Page(self, page, page_number=page_number, initial_doctop=doctop)
  File "/usr/local/lib/python3.9/site-packages/pdfplumber/page.py", line 226, in __init__
    self.rotation = _rotation % 360
TypeError: unsupported operand type(s) for %: 'NoneType' and 'int'

Have you tried repairing the PDF?

No

Code to reproduce the problem

Paste it here, or attach a Python file.

PDF file

Split_Part_1.pdf.zip

If you need to redact text in a sensitive PDF, you can run it through JoshData/pdf-redactor.

Expected behavior

What did you expect the result should have been?

Actual behavior

What actually happened, instead?

Screenshots

If applicable, add screenshots to help explain your problem.

Environment

  • pdfplumber version: [e.g., 0.5.22]
  • Python version: [e.g., 3.8.1]
  • OS: [e.g., Mac, Linux, etc.]

Additional context

Add any other context/notes about the problem here.

Hmmmm, as far as I'm aware, IndirectObject(12, 0, 4419697344) is not a valid value for a PDF's rotation. Given that the PDF loads without problems when using pdfplumber.open(path, repair=True), I'm closing this issue, but feel free to continue the discussion here.
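
For anyone who wants to apply that workaround only when needed, here is a rough sketch. It assumes the failure surfaces as a TypeError when pdf.pages is first accessed (as in the traceback above) and that Ghostscript is installed, which repair=True relies on as far as I know. The open_pdf helper and its opener parameter are made up for this example and are not part of pdfplumber.

```python
def open_pdf(path, opener=None):
    """Open a PDF, retrying with repair=True if building the pages fails.

    `opener` is a hypothetical hook (defaults to pdfplumber.open) so the
    fallback logic can be exercised without a real PDF.
    """
    if opener is None:
        import pdfplumber  # imported lazily; only needed for real files

        opener = pdfplumber.open

    pdf = opener(path)
    try:
        # Accessing .pages constructs every Page object, which is where
        # the rotation error in the traceback was raised.
        len(pdf.pages)
        return pdf
    except TypeError:
        pdf.close()
        # Assumption: the repaired copy no longer triggers the error.
        return opener(path, repair=True)
```

Usage would be pdf = open_pdf("Split_Part_1.pdf"), followed by the usual pdf.pages access.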

mkl-public commented

Generally, any value in a PDF can be given as either a direct or an indirect object, with a few exceptions explicitly mentioned in the spec.
The spec mentions no such restriction for the page rotation, so it may be indirect.
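
To illustrate, here is a minimal sketch of resolving a possibly indirect /Rotate value before applying the modulo. IndirectRef, resolve, and page_rotation are hypothetical names made up for this example, and the object table is a plain dict; this is not pdfplumber's or pdfminer's actual code. With pdfminer.six, resolve1 from pdfminer.pdftypes should play the role of resolve here.

```python
from dataclasses import dataclass
from typing import Any, Dict, Mapping, Tuple


@dataclass(frozen=True)
class IndirectRef:
    """Hypothetical stand-in for a PDF indirect reference such as `12 0 R`."""

    objid: int
    generation: int = 0


ObjectTable = Dict[Tuple[int, int], Any]


def resolve(value: Any, objects: ObjectTable) -> Any:
    """Follow indirect references until a direct value is reached."""
    seen = set()
    while isinstance(value, IndirectRef):
        key = (value.objid, value.generation)
        if key in seen:
            raise ValueError(f"circular reference at object {key}")
        seen.add(key)
        value = objects[key]
    return value


def page_rotation(page_attrs: Mapping[str, Any], objects: ObjectTable) -> int:
    """Return the page rotation normalized to the range 0..359."""
    # /Rotate is optional; a missing entry means no rotation.
    raw = resolve(page_attrs.get("Rotate", 0), objects)
    if not isinstance(raw, int):
        raise TypeError(f"/Rotate must resolve to an integer, got {raw!r}")
    return raw % 360


# Object 12 0 holds the rotation, as in the file from this issue.
objects: ObjectTable = {(12, 0): 90}
direct_page = {"Rotate": 450}
indirect_page = {"Rotate": IndirectRef(12, 0)}
```

Both page_rotation(direct_page, objects) and page_rotation(indirect_page, objects) go through the same code path, so the caller never has to care which form the file used.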

Thank you, @mkl-public and my apologies @zzhangyun; I misunderstood the issue, thinking that the value was literally set to IndirectObject(12, 0, 4419697344). This should now be fixed in c20cd3b.