workspace clone: always copy local-only file paths
bertsky opened this issue
When you run ocrd workspace clone /some/path/to/mets.xml (without the indiscriminate download option) on a workspace which contains local files, the following happens:
1. a mets:file with a remote FLocat will still keep its (now defunct) local FLocat
2. a mets:file with only a local path FLocat will not be copied
IMO, workspace clone from a relative path should either always copy all local files, or at least copy the ones in the second case (and remove the defunct local refs in the first case).
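To make the "at least" variant concrete, here is a minimal sketch of the decision per file. It is not the ocrd API: FileEntry, plan_clone and the field names are hypothetical stand-ins for a mets:file and its FLocat entries, used only to illustrate the proposal.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class FileEntry:
    """Hypothetical stand-in for a mets:file (not the ocrd API)."""
    file_id: str
    remote_url: Optional[str] = None   # remote FLocat, if any
    local_path: Optional[str] = None   # local (relative) FLocat, if any


def plan_clone(files: List[FileEntry]) -> Tuple[List[str], List[str]]:
    """Return (local paths to copy, file IDs whose local ref should be dropped)."""
    to_copy, to_unref = [], []
    for f in files:
        if f.local_path and not f.remote_url:
            # only reachable locally: must be copied into the clone
            to_copy.append(f.local_path)
        elif f.local_path and f.remote_url:
            # still reachable remotely: drop the local ref that would be defunct
            to_unref.append(f.file_id)
    return to_copy, to_unref
```

The "always copy all local files" variant would simply put every entry with a local_path into the copy list and keep all refs.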
Copying of the content files itself could also attempt to do CoW (zero-cost) copies, in case the filesystem permits it.
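A rough sketch of what such a CoW attempt could look like, assuming Linux and a filesystem with reflink support (e.g. Btrfs or XFS). The helper name copy_cow is made up for illustration, and the FICLONE request number is the value from the Linux headers on common architectures, so treat it as an assumption. On any failure it falls back to a regular copy.

```python
import shutil

try:
    import fcntl  # POSIX only; not available on Windows
except ImportError:
    fcntl = None

# Assumed value of the Linux FICLONE ioctl request (common architectures)
FICLONE = 0x40049409


def copy_cow(src, dst):
    """Copy src to dst, trying a reflink (CoW) clone first.

    Returns True if the clone succeeded, False if a regular copy was made.
    """
    if fcntl is not None:
        try:
            with open(src, 'rb') as fsrc, open(dst, 'wb') as fdst:
                # ask the filesystem to share the data blocks of src with dst
                fcntl.ioctl(fdst.fileno(), FICLONE, fsrc.fileno())
            return True
        except OSError:
            # filesystem or platform does not support reflinks
            pass
    # regular copy (also copies metadata, which the clone above does not)
    shutil.copy2(src, dst)
    return False
```

On the command line, cp --reflink=auto gives the same try-then-fall-back behaviour with GNU coreutils.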
Also:
When you run ocrd workspace clone --download /some/path/to/mets.xml (with the download option) on a workspace which contains local files, the following happens:
- a mets:file with only local path FLocat will get an additional remote FLocat with an absolute path (combining the baseurl prefix with the relative path).
@kba this is a severe problem IMO.
Another example of this (trying to get ocrd_tesserocr tests to work on v3):
    @fixture
    def workspace_kant_binarized(tmpdir):
        initLogging()
        with pushd_popd(tmpdir):
>           yield Resolver().workspace_from_url(METS_KANT_BINARIZED, dst_dir=tmpdir, download=True)

test/conftest.py:15:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../core/src/ocrd/resolver.py:229: in workspace_from_url
    workspace.download_file(f)
../core/src/ocrd/workspace.py:222: in download_file
    f.local_filename = self.resolver.download_to_directory(self.directory, f.url, subdir=f.fileGrp, basename=basename)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
E   FileNotFoundError: File path passed as 'url' to download_to_directory does not exist: 'OCR-D-GT-WORD/INPUT_0017.xml
Because METS_KANT_BINARIZED is only a local workspace to "download" from, the baseurl mechanism does not work: by the time the download is attempted, the information about where the files originally lived (the absolute path) is already lost.
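One conceivable way to keep that information would be to resolve relative refs against the directory of the source METS before the working directory changes. The sketch below only illustrates the idea; resolve_local_ref is a hypothetical helper, not something that exists in ocrd, and it assumes the path of the source mets.xml is still known at that point.

```python
import os
from urllib.parse import urlparse


def resolve_local_ref(ref, src_mets_path):
    """Resolve a relative file ref against the directory of the source METS.

    Remote URLs and absolute paths are returned unchanged.
    """
    if urlparse(ref).scheme in ('http', 'https'):
        return ref
    if os.path.isabs(ref):
        return ref
    # directory containing the source mets.xml, made absolute up front
    src_dir = os.path.dirname(os.path.abspath(src_mets_path))
    return os.path.join(src_dir, ref)
```

With the absolute source location in hand, the later copy step would no longer depend on the current working directory.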