Unable to extract text from PDF generated by Word.

Question

msuiche opened this issue 2 years ago

I saved a Word document as a PDF, and when I try to extract the text I get the following errors:

[2023-01-24T13:57:48Z ERROR lopdf::reader] Object load error: Parse { offset: 0 }
[2023-01-24T13:57:48Z ERROR lopdf::reader] Object load error: Parse { offset: 0 }
[2023-01-24T13:57:48Z ERROR lopdf::reader] Object load error: Parse { offset: 0 }
[2023-01-24T13:57:48Z ERROR lopdf::reader] Object load error: Parse { offset: 0 }
[2023-01-24T13:57:48Z ERROR lopdf::reader] Object load error: Parse { offset: 0 }
[2023-01-24T13:57:48Z ERROR lopdf::reader] Object load error: Parse { offset: 0 }

And the output content looks like this:

"R\n\"\n\"\n.0$((\" A*\" &1\" $++&’&51\" ’5\" $((\" 5’0*,\" ,*-*+&*.\" $2$&($A(*\" $’\" ($>\" 5,\" &1\" *K/&’=F\" ./A]*:’\" ’5\" $1=\" *Q%,*..\" \n*Q:(/.&51.\"5,\"(&-&’$’&51. \n\"\n&1\"’0&.\"B3,**-*1’\"’5\"’0*\":51’,$,=C \n\"\n\"\n\"\n!L\n\"\n!\n!’1$3’&%&><(&)’ \n+\n\"\n9*,2&:*\" ;,52&+*,\" .0$((\" +*4*1+F\" &1+*-1&4=\" $1+\" 05(+\" 0$,-(*..\" \n’0*\" #5-%$1=\" \n$1+\"\n&’. \n\"\n./A.&+&$,&*.F\" \n$44&(&$’*.F\" $1+\" ,*.%*:’&2*\" 544&:*,.F\" +&,*:’5,.F\" *-%(5=**.F\" $3*1’.F\" . \n/::*..5,.\" $1+\" %*,-&’’*+\" $..&31. \n\"\nG*$:0F\" $ \n\"\n7\n#5-%$1= \n\"\n@1+*-1&’**8H\" 4,5-\" $1+\" $3$&1.’\" $((\" (5..*.F\" +$-$3*.F\" (&$A&(&’&*.F\" +*4&:&*1:&*.F\" \n$:’&51.F\"]/+3-*1’.F\"&1’*,*.’F\"$>$,+.F\"%*1$(’&*.F\"4&1*.F\":5.’.\"5,\"*Q%*1.*.\"54\ (...)

I tried using pdfutil with the extract_text subcommand and I get the same errors. Any recommendations on steps I can take to debug the code and understand why parsing fails?
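As a first debugging step, it can help to check whether the file has the basic markers a PDF is expected to have before digging into the parser. The sketch below uses only the Rust standard library and does not reproduce what lopdf does internally. The function names `find_pdf_header` and `has_eof_marker` are hypothetical, and the 1024-byte search windows are an assumption based on the tolerance many PDF readers apply.

```rust
/// Returns the byte offset of the `%PDF-` marker if it appears within the
/// first 1024 bytes of the file, or `None` if it is missing.
/// (Hypothetical helper; the 1024-byte window is an assumption.)
fn find_pdf_header(data: &[u8]) -> Option<usize> {
    let window = &data[..data.len().min(1024)];
    window.windows(5).position(|w| w == b"%PDF-")
}

/// Returns true if an `%%EOF` marker appears within the last 1024 bytes.
/// (Hypothetical helper; the 1024-byte window is an assumption.)
fn has_eof_marker(data: &[u8]) -> bool {
    let start = data.len().saturating_sub(1024);
    data[start..].windows(5).any(|w| w == b"%%EOF")
}
```

To use it on a real file, read the bytes with `std::fs::read(path)` and pass the resulting buffer to both functions. A header at a non-zero offset or a missing `%%EOF` would suggest the file is unusual or truncated, which is worth knowing before blaming the text extraction itself.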

Jeff Muizelaar · Answer 1 · Mon Feb 13 2023 03:01:45 GMT+0800 (China Standard Time)

Can you share the document?

Cthulhux · Answer 2 · Tue Mar 07 2023 09:54:23 GMT+0800 (China Standard Time)

Adding myself here. It looks like Word generates different PDFs.

; curl https://www.africau.edu/images/default/sample.pdf

%PDF-1.3
%����

1 0 obj
<<
...

Now, one generated with Word (original source URL):

; head ./Sozialismusvorstellungen-der-DKP.pdf
%����1.2
�treamr /LZWDecode
 ��P�[�������7�8����d6��+�шҸ6�ׅ�1���m����T�#���̆�(��;:Pgf3Ft�l��������=�M��Y��i:`k�A�s���,Ƞú����HO��+�
                   WRgy��������-��<lAZ��
�̰�p�2�pb�.��Z#��2����
streamr /LZWDecode    ��v5�Ø�7�Ø�9B�`¥%n@���
...

I imagine that there needs to be additional decoding?
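One way to check which decoders a file would need is to scan its raw bytes for stream filter names such as the `/LZWDecode` visible in the dump above. The following is a minimal sketch using only the Rust standard library. The helper names `count_occurrences` and `filter_usage` are hypothetical, and the list of filters is a small selection of standard PDF filter names, not a complete one. A raw scan like this will miss filters declared inside compressed object streams, so treat the counts as a hint only.

```rust
/// Counts how many times `needle` occurs in `data`.
/// Returns 0 for an empty needle or when `data` is shorter than `needle`.
fn count_occurrences(data: &[u8], needle: &[u8]) -> usize {
    if needle.is_empty() || data.len() < needle.len() {
        return 0;
    }
    data.windows(needle.len()).filter(|w| *w == needle).count()
}

/// Reports how often each of a few standard PDF stream filter names
/// appears in the raw bytes of a file. (Hypothetical helper.)
fn filter_usage(data: &[u8]) -> Vec<(&'static str, usize)> {
    const FILTERS: [&str; 5] = [
        "/FlateDecode",
        "/LZWDecode",
        "/ASCII85Decode",
        "/ASCIIHexDecode",
        "/DCTDecode",
    ];
    FILTERS
        .iter()
        .map(|name| (*name, count_occurrences(data, name.as_bytes())))
        .collect()
}
```

Reading the file with `std::fs::read(path)` and passing the buffer to `filter_usage` gives a quick picture of the encodings in play. If a filter shows up here and the extracted text is garbage, the handling of that filter is a reasonable place to start looking.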

Mike Myers · Answer 3 · Tue Apr 11 2023 05:37:23 GMT+0800 (China Standard Time)

Similarly, documents generated with LibreOffice also seem not to work.
For example, running this PDF from Richard Stallman's website through extract_text outputs this:

{"text":{"1":["","","","","","","","","","","","","!\"#$%&","\"#","’(\"#","\"#",")","*\"#","\"#+,","","!-./","\"#","0112\"#345267","\"#","","","*","8","","9",":","","","#","&",";","$(","&<<<<))","%","","8$=:3%","","","","’.>&","&<<<<","!015550575?.(!!\"@1",""]},"errors":[]}

Jeff Muizelaar · Answer 4 · Tue Apr 11 2023 22:28:47 GMT+0800 (China Standard Time)

Both of those PDFs now work with https://github.com/jrmuizel/pdf-extract

Gregory King · Answer 5 · Fri May 12 2023 17:50:03 GMT+0800 (China Standard Time)

I have also noticed that creating the PDF from Word with the Print option (Microsoft Print to PDF) and exporting or saving the file as a PDF give you two different types of PDF; the latter works fine.

Both of these types of PDF work fine with Python-based PDF libraries, though.

Shiv Jha-Mathur · Answer 6 · Mon Jul 31 2023 05:02:27 GMT+0800 (China Standard Time)

Here’s another that won’t work, taken from https://old.cbic.gov.in/htdocs-cbec/customs/cs-act/notifications/notfns-2023/cs-nt2023/csnt44-2023.pdf: csnt44-2023.pdf. I can open it in Acrobat with no issues.