kermitt2 / entity-fishing

A machine learning tool for fishing entities

Home Page: http://nerd.readthedocs.io/

Running concurrent clients

mitchelldehaven opened this issue · comments

Is there any documentation on the correct way to run concurrent clients? The README.md contains runtime performance using 6 concurrent clients, but looking through the documentation I didn't see anything on this.

Hello @mitchelldehaven !

I made the runtime benchmarks using shell scripts and I am using the service with various Java tools, but there are at least two clients managing concurrent calls that could help you more easily:

(disclaimer: I've not tested them)
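For illustration only, here is a minimal sketch of what client-side concurrency could look like in Python. It is not taken from either of the clients mentioned above. The `run_concurrently` helper and the `send` callable are hypothetical names: `send` is assumed to wrap a single HTTP request to your running service instance and return its response. The default of 6 workers simply mirrors the number of concurrent clients used in the README benchmark.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_concurrently(items, send, max_workers=6):
    """Call send(item) for every item, using at most max_workers threads.

    `send` is a hypothetical callable supplied by the caller; it is assumed
    to perform one request against the service and return the response.
    Items must be hashable (for example, file paths as strings).

    Returns a dict mapping each item to ("ok", result) or ("error", exception).
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit everything up front; the pool limits how many run at once.
        futures = {pool.submit(send, item): item for item in items}
        for future in as_completed(futures):
            item = futures[future]
            try:
                results[item] = ("ok", future.result())
            except Exception as exc:
                # Record the failure and keep processing the other items.
                results[item] = ("error", exc)
    return results
```

Because every worker talks to the same running service over HTTP, this approach needs only one project directory and one copy of the resource data, whatever the number of client threads.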

I want to run this in an HPC environment to process thousands of PDFs, but when I attempt to run it on different worker nodes from the same project directory, Maven seems to dislike this. The naive approach would be to copy the project directory several times, but the project directory is about 100 GB, so I'm unsure whether the approach you were using would avoid this.

Sorry, I think I found the mistake I was making. It was unrelated to concurrent threads. Thanks!

@mitchelldehaven I am actually also trying to run the tool in an HPC environment. It's challenging because the tool is seen more as a service deployed in an environment like an AWS cloud. The issue with the 100 GB of resources is that a shared disk will hurt performance a lot. It works fine on an attached SSD because it uses memory-mapped files, but with shared disk access it could be a disaster :)
So I am interested in your feedback on this!

Also note that there is a new release with updated resource databases (Wikidata and Wikipedia as of the end of May 2020) and some fixes, and Gradle is now used instead of Maven.