J-F-Liu / lopdf

A Rust library for PDF document manipulation.

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Unable to extract text from PDF generated by Word.

msuiche opened this issue · comments

I saved a Word document as a PDF, and when I try to extract the text I get the following errors:

[2023-01-24T13:57:48Z ERROR lopdf::reader] Object load error: Parse { offset: 0 }
[2023-01-24T13:57:48Z ERROR lopdf::reader] Object load error: Parse { offset: 0 }
[2023-01-24T13:57:48Z ERROR lopdf::reader] Object load error: Parse { offset: 0 }
[2023-01-24T13:57:48Z ERROR lopdf::reader] Object load error: Parse { offset: 0 }
[2023-01-24T13:57:48Z ERROR lopdf::reader] Object load error: Parse { offset: 0 }
[2023-01-24T13:57:48Z ERROR lopdf::reader] Object load error: Parse { offset: 0 }

And the output content looks like this:

"R\n\"\n\"\n.0$((\" A*\" &1\" $++&’&51\" ’5\" $((\" 5’0*,\" ,*-*+&*.\" $2$&($A(*\" $’\" ($>\" 5,\" &1\" *K/&’=F\" ./A]*:’\" ’5\" $1=\" *Q%,*..\" \n*Q:(/.&51.\"5,\"(&-&’$’&51. \n\"\n&1\"’0&.\"B3,**-*1’\"’5\"’0*\":51’,$,=C \n\"\n\"\n\"\n!L\n\"\n!\n!’1$3’&%&><(&)’ \n+\n\"\n9*,2&:*\" ;,52&+*,\" .0$((\" +*4*1+F\" &1+*-1&4=\" $1+\" 05(+\" 0$,-(*..\" \n’0*\" #5-%$1=\" \n$1+\"\n&’. \n\"\n./A.&+&$,&*.F\" \n$44&(&$’*.F\" $1+\" ,*.%*:’&2*\" 544&:*,.F\" +&,*:’5,.F\" *-%(5=**.F\" $3*1’.F\" . \n/::*..5,.\" $1+\" %*,-&’’*+\" $..&31. \n\"\nG*$:0F\" $ \n\"\n7\n#5-%$1= \n\"\n@1+*-1&’**8H\" 4,5-\" $1+\" $3$&1.’\" $((\" (5..*.F\" +$-$3*.F\" (&$A&(&’&*.F\" +*4&:&*1:&*.F\" \n$:’&51.F\"]/+3-*1’.F\"&1’*,*.’F\"$>$,+.F\"%*1$(’&*.F\"4&1*.F\":5.’.\"5,\"*Q%*1.*.\"54\ (...)

I tried using pdfutil with the extract_text subcommand` and I get the same errors. Any recommendations on the steps I can do to debug the code to understand why parsing fails?

Can you share the document?

Ádding myself here. It looks like Word generates different PDFs.

; curl https://www.africau.edu/images/default/sample.pdf

%PDF-1.3
%����

1 0 obj
<<
...

Now, one generated with Word (original source URL):

; head ./Sozialismusvorstellungen-der-DKP.pdf
%����1.2
�treamr /LZWDecode
 ��P�[�������7�8����d6��+�шҸ6�ׅ�1���m����T�#���̆�(��;:Pgf3Ft�l��������=�M��Y��i:`k�A�s���,Ƞú����HO��+�
                   WRgy��������-��<lAZ��
�̰�p�2�pb�.��Z#��2����
streamr /LZWDecode    ��v5�Ø�7�Ø�9B�`¥%n@���
...

I imagine that there needs to be additional decoding?

Similarly, docs generated with LibreOffice seems to also not work.
For example, running this PDF from Richard Stallman website through extract_text will output this:

{"text":{"1":["","","","","","","","","","","","","!\"#$%&","\"#","’(\"#","\"#",")","*\"#","\"#+,","","!-./","\"#","0112\"#345267","\"#","","","*","8","","9",":","","","#","&",";","$(","&<<<<))","%","","8$=:3%","","","","’.>&","&<<<<","!015550575?.(!!\"@1",""]},"errors":[]}

Both of those pdfs now work with https://github.com/jrmuizel/pdf-extract

I have also noticed that if you create the PDF from Word using the Print option, Microsoft Print to PDF versus exporting or saving the file as a PDF, you get two different types of PDF, the latter works fine.

Although both of these types of PDFs work fine with Python based PDF libraries.

Here’s another that won’t work, taken from https://old.cbic.gov.in/htdocs-cbec/customs/cs-act/notifications/notfns-2023/cs-nt2023/csnt44-2023.pdf: csnt44-2023.pdf. I can open it in Acrobat with no issues.