Bad redirection of kenlm stderr
Proyag opened this issue
The stderr of sentence-join picks up informational messages from kenlm, such as:
This binary file contains trie with quantization and array-compressed pointers.
pdf-extract/src/pdfextract/SentenceJoin.java
Lines 103 to 107 in b24fc2d
java.lang.Exception: This binary file contains This binary file contains trie with quantization and array-compressed pointerstrie with quantization and array-compressed pointers..
    at pdfextract.SentenceJoin.start(SentenceJoin.java:106)
    at pdfextract.PDFExtract.sentenceJoin(PDFExtract.java:1579)
    at pdfextract.PDFExtract.sentenceJoin(PDFExtract.java:1083)
    at pdfextract.PDFExtract.Extract(PDFExtract.java:384)
Hi @Proyag,
Could you please provide the PDF file that triggered the above error?
Thank you.
I can't reproduce this deterministically, so my diagnosis was probably over-simplified, but this pops up when processing large numbers of files, for example while running many instances of bitextor-warc2htmlwarc.py in parallel.
I can work around this issue for now by redirecting kenlm error output to /dev/null here - I'll produce a reproducible example if I run into it again.
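For anyone wanting the same workaround on the Java side, here is a minimal sketch, assuming kenlm is launched as a child process through `ProcessBuilder` (I have not checked that this matches what SentenceJoin.java actually does). The `QuietProcess` class and its `run` helper are hypothetical names for illustration, not part of pdf-extract. It discards the child's stderr, which is the equivalent of `2>/dev/null`, and relies on the exit status rather than on the presence of stderr output to detect failure.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.stream.Collectors;

public class QuietProcess {
    /**
     * Hypothetical helper: runs a command, discards its stderr,
     * and returns its stdout as a single string.
     */
    public static String run(List<String> command) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        // Discard stderr (requires Java 9+); equivalent to "2>/dev/null".
        pb.redirectError(ProcessBuilder.Redirect.DISCARD);
        Process p = pb.start();

        String out;
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
            // Read stdout to the end so the child never blocks on a full pipe.
            out = r.lines().collect(Collectors.joining("\n"));
        }

        // Treat a non-zero exit status, not stderr noise, as the failure signal.
        int exit = p.waitFor();
        if (exit != 0) {
            throw new IOException("command exited with status " + exit);
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a tool that prints a banner on stderr (assumes a POSIX sh).
        System.out.println(run(List.of("sh", "-c", "echo 'loading model' 1>&2; echo result")));
    }
}
```

The trade-off is that genuine kenlm error messages are lost as well, so redirecting stderr to a log file with `ProcessBuilder.Redirect.appendTo(...)` may be preferable to discarding it outright.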