bitextor / pdf-extract

PDF parser and converter to HTML

Sentence join fails when using a batch file

zuny26 opened this issue · comments

When using PDFExtract-2.0.jar with the -B option to process a list of files, the sentence join model is only applied to the first file. After that, PDFExtract writes the following error message to stdout for each line that is passed to sentence join:

execute sentence join [es] failed. ... ,Stream closed 

(where ... is the content of the line)

Processing the same files separately works fine, so it looks like the sentence join process is closed after finishing with the first file.
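To illustrate the kind of failure I suspect, here is a minimal sketch. It is not PDFExtract's actual code: the class `SharedStreamSketch`, the method `processBatch`, and the in-memory sink standing in for the sentence join helper's input are all hypothetical. It only shows that if one writer is shared across a batch and gets closed after the first file, every later write fails with an `IOException`, which would match the symptom above.

```java
import java.io.BufferedWriter;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SharedStreamSketch {

    /**
     * Sends each "file" through one shared writer.
     * closeAfterEach = true mimics the suspected bug (assumption, not confirmed
     * from the PDFExtract source): the shared stream is closed per file instead
     * of once at the end of the batch.
     */
    static List<String> processBatch(List<String> files, boolean closeAfterEach) {
        // Hypothetical stand-in for the pipe to a long-lived sentence join helper.
        ByteArrayOutputStream helperInput = new ByteArrayOutputStream();
        BufferedWriter writer = new BufferedWriter(
                new OutputStreamWriter(helperInput, StandardCharsets.UTF_8));

        List<String> results = new ArrayList<>();
        for (String file : files) {
            try {
                writer.write(file);
                writer.newLine();
                writer.flush();
                results.add(file + ": ok");
            } catch (IOException e) {
                // Writing to an already closed writer ends up here.
                results.add(file + ": failed, " + e.getMessage());
            }
            if (closeAfterEach) {
                closeQuietly(writer); // closing twice is harmless
            }
        }
        if (!closeAfterEach) {
            closeQuietly(writer); // close once, after the whole batch
        }
        return results;
    }

    private static void closeQuietly(BufferedWriter writer) {
        try {
            writer.close();
        } catch (IOException ignored) {
            // Nothing useful to do in this sketch.
        }
    }

    public static void main(String[] args) {
        List<String> batch = List.of("one.pdf", "two.pdf");
        System.out.println(processBatch(batch, true));  // second file fails
        System.out.println(processBatch(batch, false)); // both files succeed
    }
}
```

If this is what happens, keeping the stream open until the whole batch is done (or starting a fresh helper per file) should avoid the error.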

My config file is: pdfextract.json.txt
Sentence join models downloaded from http://data.statmt.org/paracrawl/sentence-join/
The PDFs that I tested with are: one and two

Hi @zuny26 ,
Please update the source code and reinstall PDFExtract.jar to resolve the issue.

Thanks

Hi @ramoelee
Thank you, the issue is solved now for the batch file use case. However, if this function is used:

public ByteArrayOutputStream Extract(ByteArrayInputStream inputStream, int keepBrTags, int getPermission)

the issue still persists. It would be nice to fix this function as well, because this is what we use for our Python wrapper and C++ wrapper (currently in development).

Hi @zuny26 ,
Please update the source code and reinstall PDFExtract.jar to resolve the issue.

Thanks

Yes, it seems to be working now, thank you!
The only remaining problem I see is that when verbose mode is activated, PDFExtract prints a lot of lines that just say "null".

Hi @zuny26 ,
Please update the source code and reinstall PDFExtract.jar to resolve the issue.

Thanks