pdfcpu form list: text value not extracted
Stefan Bourlon commented
- Your issue is based on the latest commit
% git --no-pager log --oneline -1
fd34b05 (HEAD -> master, origin/master, origin/HEAD) Fix #821
- State your OS and OS version
% lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 23.10
Release: 23.10
Codename: mantic
- When reporting a problem with a specific PDF input file please avoid stating the organization responsible for the PDFWriter - just refer to the PDFWriter
Hello Horst, I just found that pdfcpu does not extract the value of the text field "form1[0].Page1[0].P2_sf[0].P2_Ln305_sf[0].Ln305_inpt[0]" from the PDF https://www.canada.ca/content/dam/cra-arc/formspubs/pbg/t2sch141/t2sch141-fill-23e.pdf. However, pdfcpu does retrieve the value once the PDF has been decompressed with pdftk.
PDF Info:
% go run cmd/pdfcpu/* info ~/tmp/t2sch141-fill-23e.pdf
/home/stefan/tmp/t2sch141-fill-23e.pdf:
Source: /home/stefan/tmp/t2sch141-fill-23e.pdf
PDF version: 1.7
Page count: 2
Page sizes: 612.00 x 792.00 points
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Title: General Index of Financial Information (GIFI) - Additional Information (2021 and later tax years)
Author:
Subject:
PDF Producer: Designer 6.3
Content creator: Designer 6.3
Creation date: D:20230123141724-05'00'
Modification date: D:20230822133657-04'00'
Viewer Prefs: DisplayDocTitle = true
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Tagged: Yes
Hybrid: No
Linearized: No
Using XRef streams: Yes
Using object streams: Yes
Watermarked: No
Thumbnails: No
Form: Yes
SignaturesExist: Yes
AppendOnly: Yes
Outlines: No
Names: Yes
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Encrypted: Yes
Permissions:
permission bits: 101100110100 (xB34)
Bit 3: true (print(rev2), print quality(rev>=3))
Bit 4: false (modify other than controlled by bits 6,9,11)
Bit 5: true (extract(rev2), extract other than controlled by bit 10(rev>=3))
Bit 6: true (add or modify annotations)
Bit 9: true (fill in form fields(rev>=3)
Bit 10: true (extract(rev>=3))
Bit 11: false (modify(rev>=3))
Bit 12: true (print high-level(rev>=3))
pdfcpu form list:
% go run cmd/pdfcpu/* form list ~/tmp/t2sch141-fill-23e.pdf | grep P2_Ln305
Textfield │ 365.367.374.403.404 │ form1[0].Page1[0].P2_sf[0].P2_Ln305_sf[0].Ln305_inpt[0] │ │
pdfcpu form list after decompressing the PDF document:
% pdftk t2sch141-fill-23e.pdf output t2sch141-fill-23e.uncompressed.pdftk.pdf uncompress
WARNING: The creator of the input PDF:
t2sch141-fill-23e.pdf
has set an owner password (which is not required to handle this PDF).
You did not supply this password. Please respect any copyright.
% go run cmd/pdfcpu/* form list ~/tmp/t2sch141-fill-23e.uncompressed.pdftk.pdf | grep P2_Ln305
Textfield │ 531.533.538.552.344 │ form1[0].Page1[0].P2_sf[0].P2_Ln305_sf[0].Ln305_inpt[0] │ line305 │
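The difference between the two rows above can also be checked programmatically. The following Go sketch splits a row on the "│" separator and reads the value column. The column order (type, id, name, value) is an assumption based only on the two rows quoted in this issue, and fieldValue is a hypothetical helper, not a pdfcpu API.

```go
package main

import (
	"fmt"
	"strings"
)

// fieldValue returns the name and value columns of one "form list" row.
// Assumed column order (from the rows quoted in this issue only):
// type │ id │ name │ value
func fieldValue(row string) (name, value string, ok bool) {
	cols := strings.Split(row, "│")
	if len(cols) < 4 {
		return "", "", false
	}
	return strings.TrimSpace(cols[2]), strings.TrimSpace(cols[3]), true
}

func main() {
	rows := []string{
		// Original (compressed) file: value column is empty.
		"Textfield │ 365.367.374.403.404 │ form1[0].Page1[0].P2_sf[0].P2_Ln305_sf[0].Ln305_inpt[0] │ │",
		// After pdftk uncompress: value column holds "line305".
		"Textfield │ 531.533.538.552.344 │ form1[0].Page1[0].P2_sf[0].P2_Ln305_sf[0].Ln305_inpt[0] │ line305 │",
	}
	for _, r := range rows {
		name, value, ok := fieldValue(r)
		if !ok {
			continue // not a table row in the assumed layout
		}
		fmt.Printf("%s: value=%q\n", name, value)
	}
}
```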
Horst Rutter commented
Thanks for reporting this!