Branch poppler-rewrite does not extract any text
lpla opened this issue · comments
I tested the JAR included in the poppler-rewrite branch (runnable-jar/PDFExtract.jar) on a machine running Ubuntu 16.04 (as of today, the only OS on which it works, because of #22) with several PDFs I own and some Internet Archive files, and I only get:
<html>
<head>
<defaultLang abbr="en" />
<languages>
</languages>
</head>
<body>
<div id="page0" class="page">
</div>
</body>
</html>
Is it just me? The master code (based on pdfbox) works.
Please provide the PDFs. We have tested about 20K files without issue, so we need to reproduce the problem on the file that you are having an issue with. The file could be protected, or there could be a number of other reasons why it returns empty. With a sample we can diagnose.
This one, for example, extracts text with the master code (pdfbox) but nothing in poppler-rewrite: https://www.dlsi.ua.es//~mlf/docum/forcada16j.pdf
The command I use is:
~/pdf-extract$ java -jar runnable-jar/PDFExtract.jar -I ~/forcada16j.pdf -O test
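To spot affected files when testing many PDFs, a check like the one below might help. It is only a minimal sketch: it assumes the output has the same HTML shape as the empty result shown above, and the names `PageTextCounter` and `is_empty_extraction` are hypothetical helpers, not part of PDFExtract. It takes the output as a string, since how `-O` names the output is not assumed here.

```python
from html.parser import HTMLParser


class PageTextCounter(HTMLParser):
    """Counts non-whitespace characters that appear inside <body>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.text_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False

    def handle_data(self, data):
        # Ignore whitespace so that indentation and newlines between
        # empty <div class="page"> elements do not count as text.
        if self.in_body:
            self.text_chars += len("".join(data.split()))


def is_empty_extraction(html_text: str) -> bool:
    """Return True when the extracted HTML has no text inside <body>."""
    parser = PageTextCounter()
    parser.feed(html_text)
    parser.close()
    return parser.text_chars == 0
```

Running `is_empty_extraction` on the contents of each output file would flag results like the one pasted at the top of this issue, which makes it easier to send the maintainers a list of PDFs that reproduce the problem.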
This is resolved in the latest release above.