jalan / pdftotext

Simple PDF text extraction

Intercept poppler errors from appearing in stderr

mrooding opened this issue

Hi

We're using pdftotext within a Docker container and we're noticing that a lot of poppler errors are popping up in stderr. I verified in the source code of pdftotext and poppler that they really originate from poppler. For example, we frequently see this one

The poppler errors are spamming our logs quite heavily and I'd like to be able to intercept them and silence them based on a specific log level. Unfortunately, Python isn't able to do so. I've tried several approaches, including the technique described here, but nothing seems to prevent the errors from showing up in the container's stdout/stderr.

I'm not sure if there's anything pdftotext can do about this so please do let me know if I should raise an issue with poppler itself.

Thank you!

Marc

Huh, I thought I had already silenced these via

#if POPPLER_CPP_AT_LEAST_0_30_0
static void do_nothing(const std::string&, void*) {}
#endif

and

    #if POPPLER_CPP_AT_LEAST_0_30_0
    poppler::set_debug_error_function(do_nothing, NULL);
    #endif

Maybe poppler's new version scheme broke my version detection. So that I can reproduce this, what base Docker image are you using, and which version of poppler? And could you attach a PDF that causes the issue, if you are able? Thanks.

Hey Jason, thanks for getting back to me so quickly.

I thought I'd help by setting up a tiny example. While doing so, I noticed that it wasn't actually the call to pdf = pdftotext.PDF(f) but the call to list(pdf) that produces the following errors:

poppler/error (121146): No current point in closepath
poppler/error (109249): No current point in closepath

The service in which we first noticed this is based on an internal centos image, which I cannot share with you. I can, however, share the demo, which contains a Dockerfile based on the official python:3.7 image. The PDF itself is an internal document template, so I'd prefer to share that with you privately.

By finding the actual line, I was able to suppress the logs from appearing by using the technique described in the article I shared earlier. However, if this is something that pdftotext should suppress anyway, it may be worth looking into.
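The article mentioned above is not identified in this thread, so the following is only a guess at the kind of technique meant. Native libraries such as poppler write straight to file descriptor 2, which is why swapping sys.stderr (for example with contextlib.redirect_stderr) has no effect on them. A minimal sketch that redirects the descriptor itself around the list(pdf) call might look like this. The names suppress_native_stderr and extract_pages are hypothetical and not part of pdftotext.

```python
import contextlib
import os
import sys


@contextlib.contextmanager
def suppress_native_stderr():
    """Temporarily point file descriptor 2 at os.devnull.

    Hypothetical helper, not part of pdftotext. It silences output that
    C/C++ code writes directly to stderr, bypassing sys.stderr.
    """
    sys.stderr.flush()                      # push out pending Python-level output
    saved_fd = os.dup(2)                    # keep a copy of the real stderr
    devnull_fd = os.open(os.devnull, os.O_WRONLY)
    try:
        os.dup2(devnull_fd, 2)              # fd 2 now discards everything
        yield
    finally:
        sys.stderr.flush()
        os.dup2(saved_fd, 2)                # restore the real stderr
        os.close(devnull_fd)
        os.close(saved_fd)


def extract_pages(path):
    """Read all pages of a PDF while hiding poppler's stderr output."""
    import pdftotext                        # imported lazily; requires the library

    with open(path, "rb") as f:
        pdf = pdftotext.PDF(f)
        with suppress_native_stderr():      # the messages appeared during list(pdf)
            return list(pdf)
```

Note that file descriptor 2 is shared by the whole process, so anything other threads write to stderr while the block is active is discarded as well.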

Let me know if and how I can share my demo setup with you.

Yeah, I'd like to take a look and suppress the warning output if I can. If it reproduces on the python:3.7 image, then all I need is the PDF file to try it out.

I'll send you an email to see about the PDF.

Thanks for the files to reproduce. I can confirm that the problem happens on a centos:7 image. That makes sense, since centos 7 has poppler 0.26.5 (from 2014). Poppler added the poppler::set_debug_error_function function in version 0.30.0, so on 0.26.5 there is nothing for this library to hook into. This Python library already detects that function and uses it to suppress the warnings when available 👍
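To make the version reasoning concrete, here is a small Python sketch of the comparison. The library's own check is done at compile time in C++ (the POPPLER_CPP_AT_LEAST_0_30_0 macro quoted earlier), so parse_version and has_error_hook are hypothetical names used purely for illustration.

```python
def parse_version(text):
    """Turn a dotted version string such as '0.26.5' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


# 0.30.0 is the release named in this thread as the first one that
# provides poppler::set_debug_error_function.
FIRST_VERSION_WITH_HOOK = (0, 30, 0)


def has_error_hook(poppler_version):
    """Return True if this poppler version is new enough to offer the hook."""
    return parse_version(poppler_version) >= FIRST_VERSION_WITH_HOOK


if __name__ == "__main__":
    for version in ("0.26.5", "0.30.0"):
        print(version, has_error_hook(version))
```

Tuples compare element by element, so 0.26.5 sorts below 0.30.0 without any string tricks.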

Thanks for getting back to me so quickly. That's pretty sad to hear. Do you know if there's an easy way to get a newer version on centos without building it from source? If not, feel free to close the issue and thanks a lot for the support!

Hmm, the only easy way I know of to get a newer poppler on centos is to upgrade to centos 8 😄