kermitt2 / grobid

A machine learning software for extracting information from scholarly documents

Home Page:https://grobid.readthedocs.io

Question on config

zakariabouachra opened this issue

Ubuntu
Java 8

I have a machine with 32 CPUs and 64 GB of memory, but I can't extract my PDFs.
I want to process 79,545 PDFs with the GROBID Python client.

ERROR [2023-11-12 07:18:28,906] org.grobid.service.process.GrobidRestProcessFiles: Could not get an engine from the pool within configured time. Sending service unavailable.
70.49.92.133 - - [12/Nov/2023:07:18:28 +0000] "POST /api/processFulltextDocument HTTP/1.1" 503 0 "-" "python-requests/2.31.0" 1794

Hi @zakariabouachra !

You can use the python client https://github.com/kermitt2/grobid_client_python - then you don't need to understand how to use the service web API directly.

Otherwise, the 503 error is well documented and is used to manage parallel requests; see https://grobid.readthedocs.io/en/latest/Grobid-service/#apiprocessfulltextdocument
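For anyone calling the web API directly, here is a minimal sketch of how a client might react to the 503: wait a little, then send the same request again. The helper name `send_with_retry`, the retry count, and the wait time are illustrative assumptions, not part of GROBID or its Python client. The commented usage also assumes a server on `localhost:8070` and a local file named `paper.pdf`.

```python
import time


def send_with_retry(send, max_retries=5, wait_seconds=5.0, sleep=time.sleep):
    """Call send() until the status is not 503 or the retries run out.

    send: zero-argument callable returning (status_code, body).
    Returns the last (status_code, body) pair.
    """
    status, body = send()
    for _ in range(max_retries):
        if status != 503:
            break
        # 503 means no engine was free in the pool: back off, then retry.
        sleep(wait_seconds)
        status, body = send()
    return status, body


# Hypothetical usage with the `requests` library (URL and file path are assumptions):
#
#   import requests
#
#   def send():
#       with open("paper.pdf", "rb") as f:
#           r = requests.post(
#               "http://localhost:8070/api/processFulltextDocument",
#               files={"input": f},
#           )
#       return r.status_code, r.text
#
#   status, tei = send_with_retry(send)
```

The Python client is meant to handle this situation for you, so a helper like this should only be needed when you write your own client.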

Yes, I used that client, but I don't know how to configure the server concurrency, the n parameter, and the config file.

I read the documentation, but a short explanation would be appreciated.

You don't have many PDFs and you have a large machine, so the default settings will run just fine, e.g. use n=10 in the client command line.
If you really want to use all CPUs, change concurrency in the server config and set n to something like 30.
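As a rough sketch of the two settings above, the snippet below builds the client command line for the default case (n=10) and for the larger case (n=30). The helper `build_client_command` and the directory names `./pdfs` and `./tei` are hypothetical. The flag names are how I recall them from the grobid_client_python README, so check them against `grobid_client --help`. On the server side, the matching value would be the `concurrency` entry in the GROBID config file (assumed here to be `grobid-home/config/grobid.yaml`).

```python
def build_client_command(input_dir, output_dir, n=10, config_path="./config.json"):
    """Return the argument list for one grobid_client run over a directory of PDFs."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [
        "grobid_client",
        "--input", input_dir,      # directory containing the PDFs (assumed path)
        "--output", output_dir,    # directory for the results (assumed path)
        "--config", config_path,   # client config file (server URL, timeouts, ...)
        "--n", str(n),             # number of parallel requests sent by the client
        "processFulltextDocument",
    ]


# Default settings: n=10 on the client, server config left untouched.
default_cmd = build_client_command("./pdfs", "./tei")

# Larger setting: raise the server concurrency as well before using n=30.
large_cmd = build_client_command("./pdfs", "./tei", n=30)

print(" ".join(default_cmd))
print(" ".join(large_cmd))
```

If n is much higher than the server concurrency, expect more 503 responses, so change the two values together.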

In any case, use the large Docker image (docker pull grobid/grobid:0.7.3) rather than the small one or your own build; the Deep Learning models give more accurate results. Even without a GPU, it should run fine with only 80K PDFs.