kermitt2 / grobid

A machine learning software for extracting information from scholarly documents

Home Page: https://grobid.readthedocs.io

Grobid server with latest CRF-only docker image lfoppiano/grobid:0.7.3 fails to clean up pdfalto processes exhausting memory

sanchay-hai opened this issue · comments

  • What is your OS and architecture? Windows is not supported and Mac OS arm64 is not yet supported. For non-supported OS, you can use Docker (https://grobid.readthedocs.io/en/latest/Grobid-docker/) ---- Amazon Linux

  • What is your Java version (java --version)? ---- I used the 0.7.3 docker image lfoppiano/grobid:0.7.3

We ran a server and processed a few batches of files using grobid_client_python. After a while the server stopped responding and we saw Out of Memory errors, even though plenty of RAM was still available. On further investigation we found more than 37k defunct pdfalto processes. Restarting the server cleaned up the defunct processes.
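For anyone trying to confirm the same symptom, here is a minimal sketch of how the defunct processes could be counted. The helper name `count_defunct` is hypothetical, and it assumes a procps-style `ps` whose state column starts with `Z` for zombies.

```shell
#!/bin/sh
# Count zombie (defunct) processes whose command name matches a pattern.
# Reads lines in the shape produced by `ps -eo stat=,comm=` on stdin:
# first column is the process state, second is the command name.
# count_defunct is a hypothetical helper, not part of Grobid.
count_defunct() {
    pattern="$1"
    awk -v pat="$pattern" '$1 ~ /^Z/ && $2 ~ pat { n++ } END { print n + 0 }'
}

# Typical use on the Docker host (assumes a procps-style ps):
#   ps -eo stat=,comm= | count_defunct pdfalto
```

A count that keeps growing while batches are processed suggests that nothing is reaping the finished pdfalto children.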

Hi @sanchay-hai !

Thank you for the issue. I was indeed able to reproduce the problem of the defunct pdfalto processes:

  1. it appears only when running with the docker image lfoppiano/grobid:0.7.3 - stopping the container sends the zombies to limbo.

  2. the problem was not present with lfoppiano/grobid:0.7.2 (at least that's a solution ;)

  3. building and running the current server without docker does not result in this problem: pdfalto subprocesses are killed normally (so that's another short-term solution) - the problem only occurs with docker

to be continued...

  1. using the recommended docker image grobid/grobid:0.7.3, the problem also does not appear in my tests. The pdfalto zombies seem to happen only with the particular image lfoppiano/grobid:0.7.3.

As an immediate solution, using grobid/grobid:0.7.3 provides better results thanks to the Deep Learning models, but you can also run it with CRF models only to save memory and time, with the default Grobid config file:

docker run --rm --gpus all -p 8080:8070 -p 8081:8071 -v /home/lopez/grobid/grobid-home/config/grobid.yaml:/opt/grobid/grobid-home/config/grobid.yaml:ro grobid/grobid:0.7.3

  1. I rebuilt the CRF-only docker image (I personally never use it):

docker build -t grobid/grobid-crf:0.8.0-SNAPSHOT --build-arg GROBID_VERSION=0.8.0-SNAPSHOT --file Dockerfile.crf .

And the zombies problem appears. It might be related to the change of base runtime image of the CRF-only docker image.

The problem for this particular image is that we removed tini, so the container needs to be launched with the --init parameter:

  --init                           Run an init inside the container that forwards signals and reaps processes

So:

docker run -t --rm --init -p 8070:8070 lfoppiano/grobid:0.7.3

Then the pdfalto processes are properly terminated.

I updated the documentation to include the --init flag when running the container. We might need to add tini back in the future to avoid having to use --init.
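To double-check that a running container actually has an init process as PID 1, something along these lines may help. This is only a sketch: `pid1_reaps_zombies` is a hypothetical helper, the list of init names is an assumption and not exhaustive, and it relies on `docker exec` being able to read `/proc/1/comm` inside the container.

```shell
#!/bin/sh
# Decide whether the command name of PID 1 looks like a minimal init
# that reaps orphaned child processes. The names below are assumptions:
# docker-init is what `docker run --init` is expected to install, tini
# and dumb-init are common alternatives baked into images.
pid1_reaps_zombies() {
    case "$1" in
        docker-init|tini|dumb-init) return 0 ;;
        *) return 1 ;;
    esac
}

# Typical use (CONTAINER is a placeholder for the container name or id):
#   pid1_reaps_zombies "$(docker exec CONTAINER cat /proc/1/comm)" \
#       && echo "init present" || echo "no init: start with --init"
```

If PID 1 turns out to be the Java server itself, exited pdfalto children are likely to stay defunct until the container stops.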