OCR-D / core

Collection of OCR-related python tools and wrappers from @OCR-D

Home Page: https://ocr-d.de/core/

Get not available ocrd_tool in ocrd_utils/os.py

joschrew opened this issue · comments

The function get_ocrd_tool_json(executable) tries to load the ocrd-tool description by calling the processor. If it is called with an unavailable processor (because of a typo, for example), it is IMO too hard to distinguish that from a successful call: by default the returned dict is filled with some default values, whereas I would expect it to be empty when the processor is not available. So I would move the code that extends the ocrd_tool dict inside the try block and return an empty dict if the processor is not available. But I don't know whether that could cause other problems.

def get_ocrd_tool_json(executable):
    """
    Get the ``ocrd-tool`` description of ``executable``.
    """
    executable_name = Path(executable).name
    try:
        ocrd_tool = loads(run([executable, '--dump-json'], stdout=PIPE).stdout)
    except (JSONDecodeError, OSError) as e:
        getLogger('ocrd_utils.get_ocrd_tool_json').error(f'{executable} --dump-json produced invalid JSON: {e}')
        ocrd_tool = {}
    if 'resource_locations' not in ocrd_tool:
        ocrd_tool['resource_locations'] = ['data', 'cwd', 'system', 'module']
    return ocrd_tool
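
The change proposed above could be sketched roughly as follows. This is only an illustration, not the actual patch: the name get_ocrd_tool_json_strict and the constant DEFAULT_RESOURCE_LOCATIONS are hypothetical, and the default list is simply copied from the function quoted above.

```python
from json import JSONDecodeError, loads
from logging import getLogger
from subprocess import PIPE, run

# Assumption: same default search path as in the quoted function.
DEFAULT_RESOURCE_LOCATIONS = ['data', 'cwd', 'system', 'module']


def get_ocrd_tool_json_strict(executable):
    """
    Hypothetical variant: return the ``ocrd-tool`` description of
    ``executable``, or an empty dict if it cannot be obtained.
    """
    try:
        ocrd_tool = loads(run([executable, '--dump-json'], stdout=PIPE).stdout)
        # Only a successfully loaded description gets the default entry.
        if 'resource_locations' not in ocrd_tool:
            ocrd_tool['resource_locations'] = list(DEFAULT_RESOURCE_LOCATIONS)
    except (JSONDecodeError, OSError) as e:
        getLogger('ocrd_utils.get_ocrd_tool_json').error(
            f'{executable} --dump-json failed or produced invalid JSON: {e}')
        # Missing or misbehaving executable: nothing is filled in.
        return {}
    return ocrd_tool
```

With this variant, a caller could tell the two cases apart with a plain truth test, for example `if not get_ocrd_tool_json_strict('ocrd-typo'): ...`.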

Yes, that's a problem IMO. I don't recall exactly why we decided to catch this error in the first place. (It could have been that some processors misbehaved, and we did not want them to spoil the overall result of an ocrd resmgr list-available "*" for example. Also, we cannot be sure whether every ocrd-* program in PATH actually is a processor CLI...)

We currently use the function in

  • cli.resmgr.download – where we require the resource_locations entry, so the proposed change would currently break (i.e. a misbehaving or non-processor would trigger an uncaught exception, preventing the command from running through)
    if not location:
        location = get_ocrd_tool_json(this_executable)['resource_locations'][0]
    elif location not in get_ocrd_tool_json(this_executable)['resource_locations']:
        log.error("The selected --location {location} is not in the {this_executable}'s resource search path, refusing to install to invalid location")
        sys.exit(1)
  • OcrdResourceManager.list_available – (for the case of dynamic discovery based on the registered resources decentralised via tool json) where we default to an empty result, so nothing would break
    self.log.debug(f"Inspecting '{exec_path} --dump-json' for resources")
    ocrd_tool = get_ocrd_tool_json(exec_path)
    for resdict in ocrd_tool.get('resources', ()):
  • ProcessorTask.validate – (so far used only for prior syntax checking of workflows in ocrd process) where an empty tool json simply results in an empty parameter specification, having the validator deny any runtime parameters, so again nothing would break
    param_validator = ParameterValidator(self.ocrd_tool_json)
    report = param_validator.validate(self.parameters)
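
To make the difference between the first two call sites concrete, here is a small stand-alone sketch of how each access pattern reacts to an empty dict. It does not import ocrd; the variable names are made up for illustration, and the ParameterValidator case is left out because it needs the ocrd package itself.

```python
# What the proposed change would return for an unavailable processor.
ocrd_tool = {}

# cli.resmgr.download style: direct indexing raises KeyError on an empty dict.
try:
    first_location = ocrd_tool['resource_locations'][0]
    download_lookup_ok = True
except KeyError:
    first_location = None
    download_lookup_ok = False

# OcrdResourceManager.list_available style: .get() with a default
# just iterates over nothing.
resources = [resdict for resdict in ocrd_tool.get('resources', ())]
```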

Judging by that, I'd say yes, we should change it as you proposed, but we then need to gracefully handle the case where ocrd resmgr download "*" finds an ocrd-* executable that misbehaves. (For example, by rewriting the dict lookup as a .get('resource_locations', RESOURCE_LOCATIONS) – allowing every location and preferring data.)
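
A minimal sketch of that .get() fallback, assuming RESOURCE_LOCATIONS holds the same four entries as the default in the quoted function. The helper name pick_location is hypothetical, and it raises ValueError where the CLI code quoted above logs an error and exits.

```python
# Assumption: same values and order as the default in get_ocrd_tool_json.
RESOURCE_LOCATIONS = ['data', 'cwd', 'system', 'module']


def pick_location(ocrd_tool, location=None):
    """Choose an install location, tolerating an empty tool description."""
    # An empty dict falls back to all known locations.
    allowed = ocrd_tool.get('resource_locations', RESOURCE_LOCATIONS)
    if not location:
        # The first entry wins, so 'data' is preferred by default.
        return allowed[0]
    if location not in allowed:
        raise ValueError(
            f"The selected --location {location} is not in the resource search path")
    return location
```

Called with an empty dict and no explicit location, this returns 'data', which matches the "preferring data" behaviour described above.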

At first I wanted to use get_ocrd_tool_json to verify that a processor is available. Then I created this issue. Later I realized that this doesn't fit my needs anyway, but I didn't want to (or forgot to) close or remove the issue again.
I wondered whether there is a reason why the dict extension is not part of the try block; I thought it was just an easy fix. But your findings show me that I should simply have searched for the function's invocations to see that the change could cause some errors. Now I wonder if the correction of this "feature" is worth the effort.

> Now I wonder if the correction of this "feature" is worth the effort.

It's a minute change.

> Also, we cannot be sure whether every ocrd-* program in PATH actually is a processor CLI...)

That's the main reason why we have this check, tools like ocrd-cis-data or other non-processors break it.

I agree that returning an empty dict for non-processor ocrd-* executables seems better. IIUC the only change in the function would be to not set resource_locations in that case. We'd have to adapt cli.resmgr to handle this though, by just defaulting to data for location.

If so, let's move ahead.

> I agree that returning an empty dict for non-processor ocrd-* executables seems better. IIUC the only change in the function would be to not set resource_locations in that case. We'd have to adapt cli.resmgr to handle this though, by just defaulting to data for location.

Yes, that's what I meant by …

> (For example, by rewriting the dict lookup as a .get('resource_locations', RESOURCE_LOCATIONS) – allowing every location and preferring data.)