OCR-D / core

Collection of OCR-related python tools and wrappers from @OCR-D

Home Page: https://ocr-d.de/core/

[feature request] Enhance `ocrd workspace bulk-add` to add URL entries instead of files

stweil opened this issue

`ocrd workspace bulk-add` can be used to add ALTO XML files to an existing METS file (see example). The new entries refer to the local files that were added, but for production use on a web server the METS file must contain a URL for each ALTO XML file. So after running the `ocrd` command, additional processing like that in the example is required. It would be nice if that additional processing could be avoided.
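The kind of post-processing meant here could look roughly like the following. This is only a minimal sketch, not the script from the linked example: the function name `local_paths_to_urls`, the base URL and the assumption that every local path maps directly onto a path below that base URL are all hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {
    "mets": "http://www.loc.gov/METS/",
    "xlink": "http://www.w3.org/1999/xlink",
}
HREF = "{http://www.w3.org/1999/xlink}href"


def local_paths_to_urls(mets_root, file_grp, base_url):
    """Turn local mets:FLocat paths of one fileGrp into URLs.

    Hypothetical helper: assumes the local path of each file is also
    its path below base_url. Returns the number of rewritten entries.
    """
    changed = 0
    xpath = f".//mets:fileGrp[@USE='{file_grp}']/mets:file/mets:FLocat"
    for flocat in mets_root.findall(xpath, NS):
        href = flocat.get(HREF, "")
        if "://" in href:
            continue  # already a URL, leave it alone
        if href.startswith("./"):
            href = href[2:]
        flocat.set(HREF, base_url.rstrip("/") + "/" + href.lstrip("/"))
        flocat.set("LOCTYPE", "URL")
        # OTHERLOCTYPE only makes sense together with LOCTYPE="OTHER"
        flocat.attrib.pop("OTHERLOCTYPE", None)
        changed += 1
    return changed
```

To apply it to a real file, one would parse the METS with `ET.parse`, call the function on `tree.getroot()` and write the result back with `tree.write`. Calling `ET.register_namespace` for the `mets` and `xlink` prefixes beforehand keeps the prefixes readable in the output.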

Good point. I need to fix the bulk-add mechanism to be compatible with #1079. I'll implement it in such a way that it will be possible to set both the local filename and the remote URL.

It took some time, but I have now been able to use the new code in a real use case. Running `ocrd workspace bulk-add` works fine as long as I provide both the `--url` and `--local-filename` arguments. It then writes two `mets:FLocat` entries for each page, one with the file path and one with the URL. As the presentation only needs the URL, not the file path, I then tried running the command without a `--local-filename` argument. I had expected that it would write only the desired `mets:FLocat` entry with the URL, but it again wrote the other entry, too, in this case with a completely unusable file path:

    <mets:fileGrp USE="FULLTEXT">
      <mets:file ID="file-alto-idp251240016" MIMETYPE="text/xml">
        <mets:FLocat xlink:href="      &lt;mets:div ID=&quot;struct-physical-idp251240016&quot; CONTENTIDS=&quot;http:/diglib.hab.de/drucke/li-1876-1/start.htm?image=00001&quot; TYPE=&quot;page&quot; ORDER=&quot;1&quot;&gt;" LOCTYPE="OTHER" OTHERLOCTYPE="FILE"/>
        <mets:FLocat xlink:href="https://ub-backup.bib.uni-mannheim.de/~stweil/d-gt/data/DE-23/urn_nbn_de_gbv_23-drucke_li-1876-12/alto/00001.xml" LOCTYPE="URL"/>
      </mets:file>
    [...]

Would it be reasonable to change that, so that only a URL entry is added if only the --url argument is given?
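Until bulk-add behaves that way, the unwanted entries can be stripped afterwards. The following is a minimal sketch of such a workaround, not part of OCR-D: the function name `drop_file_flocats` is hypothetical, and it assumes that the unwanted entries are exactly those with `LOCTYPE="OTHER"` and `OTHERLOCTYPE="FILE"`, as in the output above.

```python
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}


def drop_file_flocats(mets_root, file_grp):
    """Remove local-file mets:FLocat entries from one fileGrp.

    Hypothetical helper: an entry is only removed if the same mets:file
    also has a LOCTYPE="URL" entry, so no file loses its only location.
    Returns the number of removed entries.
    """
    removed = 0
    xpath = f".//mets:fileGrp[@USE='{file_grp}']/mets:file"
    for mets_file in mets_root.findall(xpath, NS):
        flocats = mets_file.findall("mets:FLocat", NS)
        if not any(f.get("LOCTYPE") == "URL" for f in flocats):
            continue  # keep the local entry if there is no URL to fall back on
        for flocat in flocats:
            if (flocat.get("LOCTYPE") == "OTHER"
                    and flocat.get("OTHERLOCTYPE") == "FILE"):
                mets_file.remove(flocat)
                removed += 1
    return removed
```

The check is done on the attributes rather than on the `xlink:href` value, so it also catches entries whose path is garbage like the one shown above.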

Note: you can do that with mm-update, too. (It's what we use in the OCR-D Manager.)

But I am also in favor of making bulk-add more flexible. That was also mentioned in #1150. (And somewhat related: #1179.)