OCR-D / core

Collection of OCR-related python tools and wrappers from @OCR-D

Home Page: https://ocr-d.de/core/

workspace clone: always copy local-only file paths

bertsky opened this issue

When you run ocrd workspace clone /some/path/to/mets.xml (without the indiscriminate download option) on a workspace which contains local files, the following happens:

  1. a mets:file with remote FLocat will still keep its (now defunct) local FLocat
  2. a mets:file with only local path FLocat will not be copied

IMO, workspace clone from a relative path should either always copy all local files, or at least copy the ones in 2 (and remove the local refs in 1).
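The second option could be sketched as a small planning step. This is only an illustration under assumptions: FileEntry, its url and local_filename fields, and plan_clone are hypothetical names and not part of the OCR-D core API.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class FileEntry:
    """Hypothetical stand-in for a mets:file with up to two FLocat entries."""
    file_id: str
    url: Optional[str] = None             # remote FLocat, if any
    local_filename: Optional[str] = None  # local (relative path) FLocat, if any


def plan_clone(files: List[FileEntry]) -> Tuple[List[str], List[str]]:
    """Return (IDs of files to copy, IDs whose local reference should be dropped)."""
    to_copy, to_unlink = [], []
    for f in files:
        if f.local_filename and not f.url:
            # case 2: the local path is the only reference, so the file must be copied
            to_copy.append(f.file_id)
        elif f.local_filename and f.url:
            # case 1: the remote reference suffices, so drop the stale local one
            to_unlink.append(f.file_id)
        # files with only a remote reference need no action without --download
    return to_copy, to_unlink
```

Files with only a remote FLocat are left untouched here, which matches cloning without the download option.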

Copying of the content files itself could also attempt to do CoW (zero-cost) copies, in case the filesystem permits it.
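A rough sketch of such a copy, assuming Linux and a filesystem with reflink support (for example Btrfs, or XFS created with reflinks enabled): it tries the FICLONE ioctl first and falls back to a plain copy everywhere else. The helper name copy_cow is hypothetical and not something OCR-D core provides.

```python
import shutil

try:
    import fcntl
except ImportError:  # fcntl is unavailable on Windows
    fcntl = None

FICLONE = 0x40049409  # Linux ioctl request number for cloning a whole file


def copy_cow(src: str, dst: str) -> bool:
    """Copy src to dst; return True if a reflink (CoW) clone was made."""
    if fcntl is not None:
        try:
            with open(src, "rb") as fsrc, open(dst, "wb") as fdst:
                # ask the filesystem to share the data blocks of src with dst
                fcntl.ioctl(fdst.fileno(), FICLONE, fsrc.fileno())
            return True
        except OSError:
            # not supported here (other OS, other filesystem, cross-device copy)
            pass
    # fallback: ordinary byte-for-byte copy
    shutil.copyfile(src, dst)
    return False
```

The return value only reports which path was taken; the destination holds the same content either way.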

Also:

When you run ocrd workspace clone --download /some/path/to/mets.xml (with the download option) on a workspace which contains local files, the following happens:

  1. a mets:file with only local path FLocat will get an additional remote FLocat with an absolute path (combining the baseurl prefix with the relative path).

@kba this is a severe problem IMO.

Another example of this (trying to get ocrd_tesserocr tests to work on v3):

    @fixture
    def workspace_kant_binarized(tmpdir):
        initLogging()
        with pushd_popd(tmpdir):
    >           yield Resolver().workspace_from_url(METS_KANT_BINARIZED, dst_dir=tmpdir, download=True)

    test/conftest.py:15:
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
    ../core/src/ocrd/resolver.py:229: in workspace_from_url
        workspace.download_file(f)
    ../core/src/ocrd/workspace.py:222: in download_file
        f.local_filename = self.resolver.download_to_directory(self.directory, f.url, subdir=f.fileGrp, basename=basename)
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
    E               FileNotFoundError: File path passed as 'url' to download_to_directory does not exist: 'OCR-D-GT-WORD/INPUT_0017.xml

Because METS_KANT_BINARIZED is only a local workspace to "download" from, the baseurl mechanism does not work: by the time the download is attempted, the information about where the original absolute path was is already lost.
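One way to keep that information would be to resolve every relative reference against the directory of the source METS before the working directory changes. The following is a minimal sketch of that idea, not how the resolver currently works; resolve_local_ref is a hypothetical helper name.

```python
from pathlib import Path


def resolve_local_ref(src_mets: str, local_ref: str) -> Path:
    """Resolve a relative file reference against the source METS directory.

    Must be called while the original working directory is still in effect,
    because src_mets itself may be a relative path.
    """
    src_dir = Path(src_mets).resolve().parent
    return src_dir / local_ref
```

With absolute source paths computed up front, a later copy into the destination directory no longer depends on the current working directory.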