OCR-D / core

Collection of OCR-related python tools and wrappers from @OCR-D

Home Page: https://ocr-d.de/core/

library or not: ocrd_network + reference WebAPI implementation

MehmedGIT opened this issue

Discussion origin.

According to my previous understanding, all endpoints (workspace, workflow, discovery...) of the WebAPI were going to be part of the ocrd_network package, not just the processing endpoint. That's why we were talking about completely separating the Deployer from the Processing Server here. The Deployer would then be responsible for deploying everything (including the Workspace and Workflow servers) programmatically, based on a single configuration file, the same way the RabbitMQ and MongoDB containers are deployed.
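
For illustration only, here is a minimal sketch of what such a single-config Deployer could look like; the class name, method names, and config keys are hypothetical and not the actual ocrd_network API:

```python
# Hypothetical sketch - none of these names are the real ocrd_network API.
# It only illustrates the idea: one Deployer reads a single config file and
# starts every component (queue, database, and the individual servers).
import yaml  # assumes PyYAML is available


class NetworkDeployer:  # hypothetical class name
    def __init__(self, config_path: str):
        with open(config_path, encoding="utf-8") as f:
            self.config = yaml.safe_load(f)

    def deploy_all(self) -> None:
        # Each step would wrap whatever mechanism is actually used
        # (Docker SDK, SSH, subprocess, ...) - omitted here.
        self.deploy_rabbitmq(self.config.get("queue", {}))
        self.deploy_mongodb(self.config.get("database", {}))
        # e.g. processing, workspace, workflow servers
        for server_cfg in self.config.get("servers", []):
            self.deploy_server(server_cfg)

    def deploy_rabbitmq(self, cfg: dict) -> None: ...
    def deploy_mongodb(self, cfg: dict) -> None: ...
    def deploy_server(self, cfg: dict) -> None: ...


if __name__ == "__main__":
    NetworkDeployer("ocrd-network-config.yml").deploy_all()
```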

@tdoan2010 has a different opinion on that: only the processing endpoint should stay inside ocrd_network, while the other endpoints (workspace, workflow...) should be implemented separately by other projects, forking the latest state of the reference WebAPI implementation.

In our meeting today it was decided that more thought is needed before reaching a final conclusion based on the advantages/disadvantages of both approaches.
@kba, please join the discussion.

According to my previous understanding, all endpoints (workspace, workflow, discovery...) of the WebAPI were going to be part of the ocrd_network package

No, that was never the intention. As written in the spec from the beginning, only the processing endpoints are provided via OCR-D/core. Since the implementations of the other endpoints differ greatly from project to project, it doesn't make sense to include them in OCR-D/core.

That's why we were talking about completely separating the Deployer from the Processing Server here.

I was not aware of it, and I actually don't care, because whether they are separated or not is just an implementation detail. From the decision viewpoint, it was decided that the Processing Server should not be responsible for managing the queue and database anymore, and that we are trying to create a docker-compose file for our setup.

To the original discussion: to me, the ocrd-webapi-implementation is not a library in its current state. If it is to become a library at some point in the future, I would expect at least the following (a rough sketch of what this could look like follows the list):

  1. It must be versioned.
  2. It must expose some "public" methods to manage the server (e.g. start, stop, add routes, disable routes, etc.)
  3. Users must be able to inherit the server base class and customize its behaviors (e.g. add/delete/override routes)
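
Purely as illustration of points 2 and 3, here is a hedged sketch of what such a server library interface could look like, using FastAPI as the example web framework; all class and method names here are invented:

```python
# Invented names throughout - this is not the ocrd-webapi-implementation API,
# just a sketch of a "manageable" and "inheritable" server library.
from fastapi import APIRouter, FastAPI
import uvicorn


class WebAPIServerBase:  # hypothetical base class
    def __init__(self, host: str = "localhost", port: int = 8000):
        self.app = FastAPI()
        self.host, self.port = host, port

    def add_router(self, router: APIRouter, prefix: str = "") -> None:
        # Point 2: a public hook for adding (or replacing) routes.
        self.app.include_router(router, prefix=prefix)

    def start(self) -> None:
        # Point 2: a public method to run the server (blocking call).
        uvicorn.run(self.app, host=self.host, port=self.port)


class MyProjectServer(WebAPIServerBase):  # point 3: subclass and customize
    def __init__(self):
        super().__init__(port=9000)
        router = APIRouter()

        @router.get("/workflow")  # project-specific route
        def list_workflows():
            return {"workflows": []}

        self.add_router(router)


if __name__ == "__main__":
    MyProjectServer().start()
```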

It cannot be considered a library just because some lower-level methods (database/queue communication, workflow/workspace manager) are re-used somewhere else (like OPERANDI). It makes more sense to me to create a separate library for those methods instead (something like ocrd-network-utils).

I am more than happy with the WebAPI not being a library, and I am not arguing in favor of making it one; I am just trying to clarify things. I agree with you that keeping things as they are now is the easier and more pleasant option. However, it is unfortunate, to say the least, that this topic is only being raised now...

No, that was never the intention. As written in the spec from the beginning, only the processing endpoints are provided via OCR-D/core.

That spec has been modified so many times that I no longer recall its previous states; I would have to check the entire edit history. However, the architecture itself hints at the need for Workflow and Workspace servers, which I assumed would be provided by the ocrd_network package. The processing endpoint could not be implemented separately, the way the workflow and workspace endpoints are in the reference ocrd-webapi-implementation, simply because that was impossible without forking and modifying core. Another reason was the architectural change of using a process queue.

Since the implementations of the other endpoints differ greatly from project to project, it doesn't make sense to include them in OCR-D/core.

Well, exactly. Nor does it make sense to integrate the WebAPI into implementation projects, or, in the case of Operandi, to implement the processor endpoint only because it is among the WebAPI endpoints. We had problems on the Operandi side with integrating the WebAPI implementation as a library (i.e., the different workflow endpoint and the missing processor endpoint), and it was mentioned many times in the past during Operandi reviews that we are trying to reuse the WebAPI implementation as a library. This was also mentioned a few times in coordination meetings. It would have been a lot better if it had been raised back then that using the WebAPI as a library is a bad idea.

I was not aware of it, and I actually don't care, because whether they are separated or not is just an implementation detail. From the decision viewpoint, it was decided that the Processing Server should not be responsible for managing the queue and database anymore

You may not care, but it's important to clarify things so we don't waste time on fixes and patches later. The separation would have been especially hard to do now if the Deployer module had not been introduced to separate the responsibilities between processing and deployment. I don't agree that it's just an implementation detail, not when taking into account the overall error handling of all endpoints functioning together - #1015.

  1. It must be versioned.

It is, and has been for some time. I'm not sure when you last checked the repo in detail. It may not be versioned like core, but there is still some versioning via GitHub tags. Operandi currently relies on those tags when installing specific WebAPI versions to reuse the routers and managers of that version.

  2. It must expose some "public" methods to manage the server (e.g. start, stop, add routes, disable routes, etc.)

True, it does not expose "public" methods to manage the server itself. It does expose "public" routers, managers, DB methods, and RabbitMQ methods for the low-level implementation details, to make it easier for implementers to extend the routes. Routers that an implementation project can reuse directly are used as-is; endpoints that cannot be reused directly can be overridden by registering a handler for the same routing path.
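
As a hedged illustration of that last point, here is a minimal FastAPI sketch of reusing a provided router while overriding one of its paths; the router and endpoint names are made up and only stand in for whatever the reference implementation actually exposes. FastAPI/Starlette match routes in registration order, so a route registered before the included router takes precedence for the same path:

```python
from fastapi import APIRouter, FastAPI

# Stand-in for a router that would be imported from the reference
# implementation (hypothetical - the real import path may differ).
workspace_router = APIRouter()


@workspace_router.get("/workspace")
def list_workspaces():
    return {"workspaces": ["reference behaviour"]}


app = FastAPI()


# Project-specific override: registered first, so it wins for GET /workspace
# over the route with the same path in the reused router below.
@app.get("/workspace")
def list_workspaces_custom():
    return {"workspaces": ["project-specific behaviour"]}


# Everything else defined on the reused router stays available unchanged.
app.include_router(workspace_router)
```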

  3. Users must be able to inherit the server base class and customize its behaviors (e.g. add/delete/override routes)

It cannot be considered a library just because some lower-level methods (database/queue communication, workflow/workspace manager) are re-used somewhere else (like OPERANDI). It makes more sense to me to create a separate library for those methods instead (something like ocrd-network-utils).

It's not just the lower-level methods... The routers are still reusable, extensible, and overridable. Adding or overriding routes does not require inheriting from the server base class, at least in my experience. However, there may be other technical issues that I am not aware of yet.
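
A small sketch of that point, assuming the implementation exposes its FastAPI app object (the paths and handler below are hypothetical): routes can be added to or removed from an existing app without subclassing anything:

```python
from fastapi import FastAPI

# Stand-in for an app object imported from the reference implementation.
app = FastAPI()


async def delete_workspace(workspace_id: str):
    # Hypothetical project-specific handler.
    return {"deleted": workspace_id}


# Add a new endpoint to the existing app - no subclassing involved.
app.add_api_route("/workspace/{workspace_id}", delete_workspace, methods=["DELETE"])

# Or drop an endpoint (all methods on that path) that this project
# does not want to expose.
app.router.routes = [
    r for r in app.router.routes if getattr(r, "path", None) != "/workspace"
]
```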

Anyway, in conclusion: the good part, at least, is that Operandi can continue development without relying on the WebAPI implementation and without all the integration problems that came with it. The endpoints will be implemented separately instead of reusing the ones provided by the WebAPI implementation.

I was under the assumption that the ocrd_network package would eventually contain the reference implementation and serve both as a library (with extensible/overridable endpoints/routers/managers) and as a standalone application (via the ocrd network CLI).

However, considering where we are now in the project, we need to be pragmatic and avoid wasting time. If it helps us for ocrd_network to contain only the processing endpoints, accompanied by a well-documented, separate reference implementation, so be it.