bitextor / pdf-extract

PDF parser and converter to HTML

Branch poppler-rewrite does not extract any text

lpla opened this issue

I tested the Java build included in the poppler-rewrite branch (runnable-jar/PDFExtract.jar) on a machine running Ubuntu 16.04 (as of today, the only OS on which it works, because of #22) with several PDFs I own and some Internet Archive files, and all I get is:

<html>
<head>
<defaultLang abbr="en" />
<languages>
</languages>
</head>
<body>
<div id="page0" class="page">
</div>
</body>
</html>

Is it just me? The master code (based on pdfbox) works.

Please provide the PDFs. We have tested about 20K files without issue, so we need to reproduce the problem on the file you are having trouble with. It could be protected, or there could be a number of other reasons why it returns empty output. With a sample we can diagnose it.

This one, for example: the master code (pdfbox) extracts its text, but poppler-rewrite extracts nothing: https://www.dlsi.ua.es//~mlf/docum/forcada16j.pdf

The command I use is:

~/pdf-extract$ java -jar runnable-jar/PDFExtract.jar -I ~/forcada16j.pdf -O test
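
In case it helps with triage across many files, here is a minimal sketch for flagging outputs that came back empty like the one above. It is not part of pdf-extract: the helper name `has_extracted_text` is hypothetical, and it only assumes the output is HTML-like markup with the page content inside `<body>`, as in the sample I pasted.

```python
from html.parser import HTMLParser


class _BodyTextCollector(HTMLParser):
    """Collects non-whitespace text that appears inside <body>."""

    def __init__(self):
        super().__init__()
        self._in_body = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self._in_body = True

    def handle_endtag(self, tag):
        if tag == "body":
            self._in_body = False

    def handle_data(self, data):
        # Ignore the newlines and indentation between tags.
        if self._in_body and data.strip():
            self.chunks.append(data.strip())


def has_extracted_text(html: str) -> bool:
    """Return True if the extracted HTML has any text inside <body>.

    Hypothetical helper for illustration; not part of pdf-extract.
    """
    parser = _BodyTextCollector()
    parser.feed(html)
    parser.close()
    return bool(parser.chunks)
```

Assuming the command above writes its result to a file named `test` (I have not checked how `-O` is interpreted in every case), the check would be `has_extracted_text(open("test", encoding="utf-8").read())`.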

This is resolved in the latest release above.