bitextor / pdf-extract

PDF parser and converter to HTML

Bad redirection of kenlm stderr

Proyag opened this issue

proc.redirectErrorStream(true); // setting true
This call redirects the stderr of the sentence-join subprocess to its stdout.
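
To illustrate the effect of that flag, here is a minimal sketch, not taken from pdf-extract. The class name `StreamMergeDemo` and the helper `readStdout` are hypothetical; only the `ProcessBuilder` calls are standard JDK API (Java 9 or later for `readAllBytes`). With merging on, anything the child writes to stderr arrives on the same pipe as its stdout. With merging off, this sketch lets stderr pass through to the parent's own stderr instead.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical demo class, not part of pdf-extract.
final class StreamMergeDemo {

    /** Runs the command and returns whatever arrives on its stdout pipe. */
    static String readStdout(List<String> command, boolean mergeStderr) {
        ProcessBuilder pb = new ProcessBuilder(command);
        if (mergeStderr) {
            // Same setting as in the issue: stderr is merged into stdout.
            pb.redirectErrorStream(true);
        } else {
            // Keep stderr apart; it goes to the parent's stderr (assumed choice).
            pb.redirectError(ProcessBuilder.Redirect.INHERIT);
        }
        try {
            Process proc = pb.start();
            try (InputStream in = proc.getInputStream()) {
                String out = new String(in.readAllBytes(), StandardCharsets.UTF_8);
                proc.waitFor();
                return out;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

Calling `readStdout(List.of("sh", "-c", "echo out; echo err 1>&2"), true)` returns both lines, while passing `false` returns only the stdout line.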

kenlm writes informational messages such as the following to the sentence-join stderr:

This binary file contains trie with quantization and array-compressed pointers.

StringBuilder sbError = _inputStreamGobbler.getOutputBuffer();
if (_inputStreamGobbler.GetErrorFlag()) {
    throw new Exception(sbError.toString());
} else {
This check treats the kenlm messages as errors from starting sentence-join and throws an exception that looks like:

java.lang.Exception: This binary file contains This binary file contains trie with quantization and array-compressed pointerstrie with quantization and array-compressed pointers..

    at pdfextract.SentenceJoin.start(SentenceJoin.java:106)
    at pdfextract.PDFExtract.sentenceJoin(PDFExtract.java:1579)
    at pdfextract.PDFExtract.sentenceJoin(PDFExtract.java:1083)
    at pdfextract.PDFExtract.Extract(PDFExtract.java:384)
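
One possible alternative to treating stderr output as a failure is to decide by the subprocess exit code and keep stderr only as diagnostic text. The sketch below is an assumption about how that could look, not the logic in `SentenceJoin.java`, which is only partly quoted above. `ExitCodeCheck`, `Result`, and `run` are hypothetical names.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch, not the pdf-extract implementation.
final class ExitCodeCheck {

    /** Captured outcome of one subprocess run. */
    static final class Result {
        final int exitCode;
        final String stdout;
        final String stderr;

        Result(int exitCode, String stdout, String stderr) {
            this.exitCode = exitCode;
            this.stdout = stdout;
            this.stderr = stderr;
        }

        /** Failure is decided by the exit code, not by stderr being non-empty. */
        boolean failed() {
            return exitCode != 0;
        }
    }

    static Result run(List<String> command) {
        try {
            Process proc = new ProcessBuilder(command).start();
            // Drain stderr on another thread so a full pipe cannot block the child.
            CompletableFuture<String> err =
                    CompletableFuture.supplyAsync(() -> readAll(proc.getErrorStream()));
            String out = readAll(proc.getInputStream());
            int code = proc.waitFor();
            return new Result(code, out, err.join());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    private static String readAll(InputStream in) {
        try (in) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this approach a kenlm notice on stderr would end up in `Result.stderr` for logging, and an exception would only be raised when `failed()` is true. Whether sentence-join reliably signals errors through its exit code is not confirmed in this thread.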

Hi @Proyag,
could you please provide the PDF file that triggered the above error?

Thank you.

I can't reproduce this deterministically, so my diagnosis was probably over-simplified, but this pops up when processing large numbers of files, for example while running many instances of bitextor-warc2htmlwarc.py in parallel.

I can work around this issue for now by redirecting kenlm's error output to /dev/null here. I'll produce a reproducible example if I run into it again.
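
For reference, the same kind of workaround could also be applied on the Java side when the subprocess is launched. This is only a sketch under that assumption; the place where the redirect was actually added is the one linked above and is not reproduced here. `QuietStderr` and `runDiscardingStderr` are hypothetical names.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical sketch of a Java-side workaround, not the change made for this issue.
final class QuietStderr {

    /** Runs the command with stderr thrown away and returns its stdout. */
    static String runDiscardingStderr(List<String> command) {
        ProcessBuilder pb = new ProcessBuilder(command);
        // Equivalent to "2>/dev/null": the child's stderr is never read or buffered.
        pb.redirectError(ProcessBuilder.Redirect.DISCARD);
        try {
            Process proc = pb.start();
            try (InputStream in = proc.getInputStream()) {
                String out = new String(in.readAllBytes(), StandardCharsets.UTF_8);
                proc.waitFor();
                return out;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

`ProcessBuilder.Redirect.DISCARD` needs Java 9 or later. The trade-off is that genuine kenlm error messages are discarded along with the informational ones.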