pdfcpu / pdfcpu

A PDF processor written in Go.

Home Page: http://pdfcpu.io/

pdfcpu form list: text value not extracted

sbourlon opened this issue

  • Your issue is based on the latest commit
% git --no-pager log --oneline -1
fd34b05 (HEAD -> master, origin/master, origin/HEAD) Fix #821
  • State your OS and OS version
% lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 23.10
Release:	23.10
Codename:	mantic
  • When reporting a problem with a specific PDF input file please avoid stating the organization responsible for the PDFWriter - just refer to the PDFWriter

Hello Horst, I just found that pdfcpu does not extract the value of the text field "form1[0].Page1[0].P2_sf[0].P2_Ln305_sf[0].Ln305_inpt[0]" from the PDF https://www.canada.ca/content/dam/cra-arc/formspubs/pbg/t2sch141/t2sch141-fill-23e.pdf. However, pdfcpu does retrieve the value once the document has been decompressed with pdftk.

PDF Info:

% go run cmd/pdfcpu/* info ~/tmp/t2sch141-fill-23e.pdf 
/home/stefan/tmp/t2sch141-fill-23e.pdf:
              Source: /home/stefan/tmp/t2sch141-fill-23e.pdf
         PDF version: 1.7
          Page count: 2
          Page sizes: 612.00 x 792.00 points
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
               Title: General Index of Financial Information (GIFI) - Additional Information (2021 and later tax years)
              Author: 
             Subject: 
        PDF Producer: Designer 6.3
     Content creator: Designer 6.3
       Creation date: D:20230123141724-05'00'
   Modification date: D:20230822133657-04'00'
        Viewer Prefs: DisplayDocTitle = true
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
              Tagged: Yes
              Hybrid: No
          Linearized: No
  Using XRef streams: Yes
Using object streams: Yes
         Watermarked: No
          Thumbnails: No
                Form: Yes
     SignaturesExist: Yes
          AppendOnly: Yes
            Outlines: No
               Names: Yes
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
           Encrypted: Yes
         Permissions:
permission bits: 101100110100 (xB34)
Bit  3: true (print(rev2), print quality(rev>=3))
Bit  4: false (modify other than controlled by bits 6,9,11)
Bit  5: true (extract(rev2), extract other than controlled by bit 10(rev>=3))
Bit  6: true (add or modify annotations)
Bit  9: true (fill in form fields(rev>=3)
Bit 10: true (extract(rev>=3))
Bit 11: false (modify(rev>=3))
Bit 12: true (print high-level(rev>=3))
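
For reference, the bit table above can be reproduced from the hex value alone. This is a minimal sketch, assuming bits are numbered from 1 starting at the least significant bit, which is what the listing suggests. The helper name `permissionSet` is made up for this illustration and is not part of pdfcpu.

```go
package main

import "fmt"

// permissionSet reports whether the 1-based bit n is set in p.
// Assumption: bit 1 is the least significant bit, as the listing implies.
func permissionSet(p uint32, n uint) bool {
	return p&(uint32(1)<<(n-1)) != 0
}

func main() {
	const p uint32 = 0xB34 // "permission bits: 101100110100 (xB34)"
	for _, n := range []uint{3, 4, 5, 6, 9, 10, 11, 12} {
		fmt.Printf("Bit %2d: %t\n", n, permissionSet(p, n))
	}
}
```

Bit 9 (fill in form fields) and bit 10 (extract) both come out as set, so the permission flags by themselves do not seem to explain the missing value.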

pdfcpu form list:

% go run cmd/pdfcpu/* form list ~/tmp/t2sch141-fill-23e.pdf | grep P2_Ln305
     Textfield │ 365.367.374.403.404 │ form1[0].Page1[0].P2_sf[0].P2_Ln305_sf[0].Ln305_inpt[0]         │       │ 

pdfcpu form list after decompressing the pdf document:

% pdftk t2sch141-fill-23e.pdf output t2sch141-fill-23e.uncompressed.pdftk.pdf uncompress
WARNING: The creator of the input PDF:
   t2sch141-fill-23e.pdf
   has set an owner password (which is not required to handle this PDF).
   You did not supply this password. Please respect any copyright.

% go run cmd/pdfcpu/* form list ~/tmp/t2sch141-fill-23e.uncompressed.pdftk.pdf | grep P2_Ln305                                                                   
     Textfield │ 531.533.538.552.344 │ form1[0].Page1[0].P2_sf[0].P2_Ln305_sf[0].Ln305_inpt[0]         │ line305 │
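
To check other fields for the same symptom, the two listings can be compared row by row. The following is only a sketch: it does not call pdfcpu, it just splits the printed rows on the "│" separator. It assumes the third cell is the field name and the fourth cell is the text this report refers to as the value. The helper names `cells` and `lostValue` are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// cells splits one row of the `form list` output on the "│" separator
// and trims the padding around each cell.
func cells(row string) []string {
	parts := strings.Split(row, "│")
	for i, p := range parts {
		parts[i] = strings.TrimSpace(p)
	}
	return parts
}

// lostValue reports whether both rows name the same field (third cell)
// while only the second row has text in the fourth cell.
// The meaning of the columns is an assumption based on the rows quoted above.
func lostValue(before, after string) bool {
	b, a := cells(before), cells(after)
	if len(b) < 4 || len(a) < 4 {
		return false
	}
	return b[2] == a[2] && b[3] == "" && a[3] != ""
}

func main() {
	before := "     Textfield │ 365.367.374.403.404 │ form1[0].Page1[0].P2_sf[0].P2_Ln305_sf[0].Ln305_inpt[0]         │       │ "
	after := "     Textfield │ 531.533.538.552.344 │ form1[0].Page1[0].P2_sf[0].P2_Ln305_sf[0].Ln305_inpt[0]         │ line305 │"
	fmt.Println(lostValue(before, after)) // prints "true" for the two rows above
}
```

Running this over every row of both listings would show whether Ln305_inpt is the only affected field or one of several.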

Thanks for reporting this!