Some questions on folder structure
BartChris opened this issue
Hi,
I was able to test your setup with an installation of Kitodo (without Docker); great work so far. I made some local adjustments to make it work for me. While doing that, a few questions arose.
1. I tried to reduce the copy operations, which, if I am not mistaken, currently do the following:
   - copy the images from Kitodo process folder to the "WORKDIR" which is located on the manager server
   - copy the images from the "WORKDIR" to the "REMOTE_DIR" on the processing server
   - after the OCR is done copy the whole OCR data back to the "WORKDIR"
   - copy the OCR results (ALTO) from the "WORKDIR" to the Kitodo process folder
What is the rationale behind the "WORKDIR", for example, and why does the data have to be copied so many times? I reduced the number of copy operations by using shared volumes between the servers and, e.g., copying directly from the process folder to the remote folder, but I would like to be sure that I am not violating some deeper architectural ideas here.
2. I am running the ocrd_manager standalone right now. For that I run docker compose up for the ocrd_manager component and for the ocrd_monitor component separately. The idea is probably that both services use a shared volume to store job data. But right now I have two folders in /var/lib/docker/volume, named ocrd_manager_shared and ocrd_monitor_shared. What can I do so that both services actually use the same shared folder?
Thanks a lot for the support.
Hi @BartChris,
thanks a lot for your report and questions!
As to 1., please bear in mind that the data is not actually copied each time:
- copy the images from Kitodo process folder to the "WORKDIR" which is located on the manager server
This uses reflink copies if possible on the FS. Essentially, that is a lazy copy (just another inode pointing to the same blocks).
Of course, if the user starts up Kitodo and Manager on distinct physical filesystems, then it will take the full cost of copy I/O.
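To illustrate what a reflink copy looks like, here is a minimal sketch using GNU cp. This is not necessarily the exact command the Manager runs; the temporary directory and file names are made up for the demonstration. With --reflink=auto, cp shares the data blocks on filesystems that support it (such as Btrfs or XFS) and silently falls back to a regular copy elsewhere.

```shell
# Hypothetical scratch directory standing in for process folder and WORKDIR
tmp=$(mktemp -d)
printf 'image data\n' > "$tmp/src.tif"

# --reflink=auto: clone blocks if the filesystem can, otherwise copy normally
cp --reflink=auto "$tmp/src.tif" "$tmp/dst.tif"

# Either way the destination has the same content as the source
cmp "$tmp/src.tif" "$tmp/dst.tif" && echo "identical content"
```

Whether blocks are actually shared depends on the filesystem, and source and destination must live on the same one for the clone to be possible.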
- copy the images from the "WORKDIR" to the "REMOTE_DIR" on the processing server
That's unavoidable. (But if you configure the Controller and Manager to use the same volume on the same host, then rsync will not do any actual work.)
- after the OCR is done copy the whole OCR data back to the "WORKDIR"
Yes, but the transfer skips all data that is already there (including the full-size images). Retrieving the results from the remote side (where storage might be fast but short-lived) is also unavoidable.
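The skipping behaviour can be tried out locally with plain rsync. The following is only a sketch: the directory layout and file names are invented, both sides are local directories rather than separate hosts, and the options the Manager really passes may differ.

```shell
# Hypothetical local stand-ins for WORKDIR and REMOTE_DIR
work=$(mktemp -d)
mkdir -p "$work/local/ws" "$work/remote/ws"
printf 'page image\n' > "$work/local/ws/0001.tif"

# Push to the "remote" side; -a preserves modification times
rsync -a "$work/local/ws/" "$work/remote/ws/"

# Pretend the OCR run produced a new result file remotely
printf '<alto/>\n' > "$work/remote/ws/0001.xml"

# Pull back; -i itemizes what is transferred
changes=$(rsync -ai "$work/remote/ws/" "$work/local/ws/")
echo "$changes"
```

Because the image has the same size and modification time on both sides, rsync's quick check passes it over, so the itemized list should mention the new 0001.xml but not 0001.tif.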
- copy the OCR results (ALTO) from the "WORKDIR" to the Kitodo process folder
Yes, for the moment it's only the ALTO files. In the future, we might try to add more (like structMap or other file formats), perhaps in a later workflow stage, when Production already exported the final METS.
The ALTO files are small, so that should do no harm.
The actual OCR-D workspace, on the other hand, will be preserved – it might be needed for re-processing with another workflow, or visual inspection in the Monitor. We have not decided yet when to delete these workspaces from the Manager. (It would probably make sense to tie them to the lifetime of the process in Kitodo, and then archive or delete.)
What is the rationale behind the "WORKDIR", for example, and why does the data have to be copied so many times? I reduced the number of copy operations by using shared volumes between the servers and, e.g., copying directly from the process folder to the remote folder, but I would like to be sure that I am not violating some deeper architectural ideas here.
See the answers above. You don't need to change any code to reduce the amount of copying/synchronization: just set up your environment variables (see make help and .env) to suit your all-local use case.
2. I am running the ocrd_manager standalone right now. For that I run docker compose up for the ocrd_manager component and for the ocrd_monitor component separately. The idea is probably that both services use a shared volume to store job data
Yes, that's exactly the reason.
But right now I have two folders in /var/lib/docker/volume, named ocrd_manager_shared and ocrd_monitor_shared. What can I do so that both services actually use the same shared folder?
You probably did not use the top-level repo https://github.com/markusweigelt/kitodo_production_ocrd for the integrated docker-compose. The top-level is where we provide most documentation and the easiest makefile entrypoints. The submodules only have very limited documentation and flexibility. (In this case, you would need to combine docker-compose.yml and ocrd_monitor/docker-compose.yml as one compose call, so they get the same network and volumes.)
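As a rough sketch of such a combined call: Compose prefixes named volumes with the project name, which by default comes from the directory, and that would explain the two differently prefixed folders. Passing both files to a single invocation puts them in one project. The snippet below only assembles and prints the command; it assumes it is run from the directory holding the first file, and the file paths are the two named above.

```shell
# Collect one -f option per compose file (paths as named in this thread)
compose_args=""
for f in docker-compose.yml ocrd_monitor/docker-compose.yml; do
  compose_args="$compose_args -f $f"
done

# Print the resulting command for inspection; run it by hand once it looks right
cmd="docker compose$compose_args up"
echo "$cmd"
```

Note that with several -f options, Compose resolves relative paths inside the files against the directory of the first file, so build contexts or bind mounts in the second file may need checking.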
Hope that helps.
@BartChris so I think the main issue here is slub/ocrd_kitodo#35
Would you agree? Can we close here?
Yes, sounds good. I think you can close here. Thanks!