Branch poppler-rewrite does not extract any text

lpla opened this issue 5 years ago

Leopoldo Pla Sempere commented 5 years ago

I tested the Java build included in the poppler-rewrite branch (runnable-jar/PDFExtract.jar) on a machine running Ubuntu 16.04 (as of today, the only OS on which it works, because of #22) with several PDFs I own and some Internet Archive files, and the only output I get is:

<html>
<head>
<defaultLang abbr="en" />
<languages>
</languages>
</head>
<body>
<div id="page0" class="page">
</div>
</body>
</html>

Is it just me? The master code (based on pdfbox) works.
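
To check many output files for this symptom without opening each one, something like the following might help. It is only a minimal sketch: it assumes the output always has the structure shown above (a body containing page divs), and the names PageTextCounter and has_extracted_text are made up for illustration and are not part of PDFExtract.

```python
from html.parser import HTMLParser


class PageTextCounter(HTMLParser):
    """Counts non-whitespace characters that appear inside <body>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.chars = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this handler.
        if tag == "body":
            self.in_body = True

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False

    def handle_data(self, data):
        if self.in_body:
            # Ignore whitespace so that empty, indented divs count as empty.
            self.chars += len("".join(data.split()))


def has_extracted_text(html):
    """Return True if the extractor output contains any text in its body."""
    parser = PageTextCounter()
    parser.feed(html)
    parser.close()
    return parser.chars > 0
```

Passing the contents of an output file to has_extracted_text should return False for the empty result shown above and True once a page div contains text.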

Dion Wiggins · Answer 1 · Fri Feb 21 2020 06:32:57 GMT+0800 (China Standard Time)

Please provide the PDFs. We have tested about 20K files without issue, so we need to reproduce the problem with the file you are having trouble with. The file could be protected, or there could be a number of other reasons why it returns empty output. With a sample, we can diagnose it.

Leopoldo Pla Sempere · Answer 2 · Fri Feb 21 2020 07:47:09 GMT+0800 (China Standard Time)

This one, for example, extracts text with the master code (pdfbox) but nothing with poppler-rewrite: https://www.dlsi.ua.es//~mlf/docum/forcada16j.pdf

Leopoldo Pla Sempere · Answer 3 · Fri Feb 21 2020 17:23:53 GMT+0800 (China Standard Time)

The command I use is:

~/pdf-extract$ java -jar runnable-jar/PDFExtract.jar -I ~/forcada16j.pdf -O test

Dion Wiggins · Answer 4 · Sat Feb 22 2020 12:05:30 GMT+0800 (China Standard Time)

This is resolved in the latest release above.