slub / ocrd_manager

frontend for ocrd_controller and adapter towards ocrd_kitodo

Some questions on folder structure

BartChris opened this issue · comments

Hi,

I was able to test your setup with an installation of Kitodo (without Docker), and it is great work so far. I made some local adjustments to get it working for me, and while doing that a few questions arose.

  1. I tried to reduce the copy operations, which, if I am not mistaken, currently do the following:
  • copy the images from Kitodo process folder to the "WORKDIR" which is located on the manager server
  • copy the images from the "WORKDIR" to the "REMOTE_DIR" on the processing server
  • after the OCR is done copy the whole OCR data back to the "WORKDIR"
  • copy the OCR results (ALTO) from the "WORKDIR" to the Kitodo process folder

What is the rationale behind the "WORKDIR", for example, and why does the data have to be copied so many times? I reduced the number of copy operations by using shared volumes between the servers, e.g. copying directly from the process folder to the remote folder, but I would like to be sure that I am not violating some deeper architectural ideas here.

  2. I am running the ocrd_manager standalone right now. For that I run docker compose up for the ocrd_manager component and for the ocrd_monitor component separately. The idea is presumably that both services use a shared volume to store job data:

https://github.com/markusweigelt/ocrd_manager/blob/317de6b17e6f1701ea2f6d1bda16277d9eaaf24a/docker-compose.yml#L40-L41

https://github.com/markusweigelt/ocrd_manager/blob/317de6b17e6f1701ea2f6d1bda16277d9eaaf24a/docker-compose.yml#L36

But right now I have two folders in /var/lib/docker/volumes named ocrd_manager_shared and ocrd_monitor_shared. What can I do so that both services actually use the same shared volume?

Thanks a lot for the support.

Hi @BartChris,

thanks a lot for your report and questions!

As to 1., please bear in mind that the data is not actually copied each time:

  • copy the images from Kitodo process folder to the "WORKDIR" which is located on the manager server

This uses reflink copies where the filesystem supports them (e.g. Btrfs or XFS): essentially a lazy copy-on-write clone, i.e. a new inode pointing to the same data blocks.

Of course, if the user runs Kitodo and the Manager on distinct physical filesystems, the copy will incur the full I/O cost.
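To illustrate (this is just a generic sketch, not code from this repository), GNU `cp` exposes reflink copying directly; with `--reflink=auto` it clones blocks on CoW-capable filesystems and silently falls back to a regular copy elsewhere (e.g. ext4):

```shell
# Demonstrate a reflink (copy-on-write) copy: on Btrfs/XFS the new file
# shares data blocks with the original; --reflink=auto falls back to a
# plain copy on filesystems without CoW support.
src=$(mktemp)
echo "image data" > "$src"
cp --reflink=auto "$src" "$src.copy"
cat "$src.copy"
```

Either way, the copy completes with identical content; only the amount of physical I/O differs.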

  • copy the images from the "WORKDIR" to the "REMOTE_DIR" on the processing server

That's unavoidable. (But if you configure the Controller and Manager to use the same volume on the same host, then rsync will not do any actual work.)

  • after the OCR is done copy the whole OCR data back to the "WORKDIR"

Yes, but rsync will skip all data that is already present (including the full-size images). Retrieving the results from the remote side (where storage might be fast but short-lived) is also unavoidable.

  • copy the OCR results (ALTO) from the "WORKDIR" to the Kitodo process folder

Yes, for the moment it's only the ALTO files. In the future, we might try to add more (like structMap or other file formats), perhaps in a later workflow stage, when Production already exported the final METS.

The ALTO files are small, so that should do no harm.

The actual OCR-D workspace, on the other hand, will be preserved – it might be needed for re-processing with another workflow, or visual inspection in the Monitor. We have not decided yet when to delete these workspaces from the Manager. (It would probably make sense to tie them to the lifetime of the process in Kitodo, and then archive or delete.)

What is the rationale behind the "WORKDIR", for example, and why does the data have to be copied so many times? I reduced the number of copy operations by using shared volumes between the servers, e.g. copying directly from the process folder to the remote folder, but I would like to be sure that I am not violating some deeper architectural ideas here.

See the answers above. And you don't need to change any code to reduce the amount of copying/synchronisation: just set up your environment variables (see make help and .env) to suit your all-local use case.
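A minimal sketch of what such an `.env` override might look like. The variable names below are purely illustrative, not the actual ones shipped with the repository; check `make help` and the provided `.env` for the real names:

```shell
# Hypothetical .env fragment for an all-local setup: point Manager and
# Controller at the same directory, so the rsync step between
# WORKDIR and REMOTE_DIR becomes a no-op.
MANAGER_DATA=/srv/ocrd/data      # illustrative variable name
CONTROLLER_DATA=/srv/ocrd/data   # illustrative variable name
```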

2. I am running the ocrd_manager standalone right now. For that I run docker compose up for the ocrd_manager component and for the ocrd_monitor component separately. The idea is presumably that both services use a shared volume to store job data

Yes, that's exactly the reason.

But right now i have two folders in /var/lib/docker/volume named ocrd_manager_shared and ocrd_monitor_shared. What can i do that both services are actually using the same shared folder?

You probably did not use the top-level repo https://github.com/markusweigelt/kitodo_production_ocrd for the integrated docker-compose setup. The top level is where we provide most of the documentation and the easiest Makefile entry points; the submodules have only very limited documentation and flexibility. (In this case, you would need to combine docker-compose.yml and ocrd_monitor/docker-compose.yml in one compose call, so that they share the same network and volumes.)
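For reference, merging two Compose files into one invocation is done with Compose's standard `-f` flag (the file paths below are the ones mentioned above; everything else is plain Docker Compose usage, not repo-specific):

```shell
# Run both services as one Compose project: with a single project name,
# a volume named "shared" in both files resolves to the same
# <project>_shared volume instead of two separate ones.
docker compose \
  -f docker-compose.yml \
  -f ocrd_monitor/docker-compose.yml \
  up -d
```

When the files are brought up separately from their own directories, each run gets a different default project name, which is exactly why two volumes (ocrd_manager_shared and ocrd_monitor_shared) appear under /var/lib/docker/volumes.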

Hope that helps.

@BartChris so I think the main issue here is slub/ocrd_kitodo#35

Would you agree? Can we close here?

Yes, sounds good, I think you can close here. Thanks!