altoxml / schema

ALTO XML schema - latest and all former versions

Fragment identifier API for ALTO

jpmoreux opened this issue · comments

The ALTO Fragment Identifier API is a proposal for a web service that, in response to a standard HTTP or HTTPS request:

  • references arbitrary content within an ALTO file through the use of fragment identifiers (referencing),
  • returns the XML contents referenced by such identifiers (dereferencing).

This service aims to facilitate reuse of ALTO resources in digital libraries (bookmarks, annotations...). It could be used to embody the concept of hyperlinking within ALTO documents, and to access the content itself.

The URI could specify any portion of an ALTO file (paragraph, string, illustration...) referenced by various mechanisms (ID, spatial offset, order...), a range of contents (paragraphs 2 to 5), etc.
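As a rough illustration of how such a URI might be taken apart on the server side, here is a minimal Python sketch. It assumes the layout seen in the use cases below, `<resource>.alto/<mode>/<selector>`, with `id` and `xml` as the only modes. The names `AltoFragment` and `parse_fragment_url` are hypothetical, and the selector is left uninterpreted because its syntax is not specified yet.

```python
from typing import NamedTuple


class AltoFragment(NamedTuple):
    resource: str  # URL of the ALTO page, up to and including ".alto"
    mode: str      # "id" (list matching IDs) or "xml" (return the elements)
    selector: str  # everything after the mode, kept as an opaque string


def parse_fragment_url(url: str) -> AltoFragment:
    """Split a URL of the assumed form <resource>.alto/<mode>/<selector>."""
    head, sep, tail = url.partition(".alto/")
    if not sep:
        raise ValueError("no '.alto/' segment in URL")
    # Only the first slash separates mode from selector; the selector
    # itself may contain further slashes (e.g. "PrintSpace/*[@CONTENT]").
    mode, slash, selector = tail.partition("/")
    if mode not in ("id", "xml") or not slash or not selector:
        raise ValueError(f"unsupported fragment: {tail!r}")
    return AltoFragment(head + ".alto", mode, selector)
```

Keeping the selector opaque at this stage means the same splitter would work whichever selector syntax (ID, spatial offset, path expression) is eventually specified.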

Note: the ALTO schema is not impacted. The idea is to write a specification that digital libraries can implement if they wish to.

Use cases

See: http://prezi.com/6fvgzri_z3b3/?utm_campaign=share&utm_medium=copy

a. A digital library user wants to reference a specific marginal note on a specific page of a digital document, given its spatial position:
-> http://gallica.bnf.fr/ark:/12148/bpt6k96006893/f20.alto/id/@89:485
RETURNS a list of block IDs: ("PAG_00000020_TB000010")

-> http://gallica.bnf.fr/ark:/12148/bpt6k96006893/f20.alto/xml/TextBlock[ID=PAG_00000020_TB000010]
RETURNS: the TextBlock XML element
<TextBlock ID="PAG_00000020_TB000010" WIDTH="1386" HEIGHT="287" VPOS="1090" HPOS="1303" STYLEREFS="TXT_18" LANG="fr">
<TextLine ID="PAG_00000020_TL000016" WIDTH="1383" HEIGHT="63" VPOS="1090" HPOS="1304" STYLEREFS="TXT_18">
<String ID="PAG_00000020_ST000071" ...
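The spatial lookup in use case (a) could be implemented as a simple bounding-box test. The sketch below is only one possible reading: it assumes that `@x:y` is a point expressed in the same unit as the HPOS/VPOS/WIDTH/HEIGHT attributes (the proposal does not fix the unit or the order of the two numbers), and it runs on a made-up sample rather than on the Gallica file. `blocks_at` and `BLOCK_TYPES` are hypothetical names.

```python
import xml.etree.ElementTree as ET

# ALTO block-level elements considered by this sketch.
BLOCK_TYPES = {"TextBlock", "Illustration", "GraphicalElement", "ComposedBlock"}

# Made-up sample, not taken from a real ALTO file.
SAMPLE = """<PrintSpace>
  <TextBlock ID="TB1" HPOS="100" VPOS="100" WIDTH="400" HEIGHT="200"/>
  <TextBlock ID="TB2" HPOS="50" VPOS="400" WIDTH="100" HEIGHT="150"/>
</PrintSpace>"""


def blocks_at(alto_xml: str, x: float, y: float) -> list[str]:
    """Return the IDs of block elements whose bounding box contains (x, y)."""
    hits = []
    for el in ET.fromstring(alto_xml).iter():
        # Strip a possible "{namespace}" prefix from the tag name.
        if el.tag.rsplit("}", 1)[-1] not in BLOCK_TYPES:
            continue
        try:
            left, top = float(el.attrib["HPOS"]), float(el.attrib["VPOS"])
            width, height = float(el.attrib["WIDTH"]), float(el.attrib["HEIGHT"])
            block_id = el.attrib["ID"]
        except KeyError:
            continue  # skip blocks without an ID or a full bounding box
        if left <= x <= left + width and top <= y <= top + height:
            hits.append(block_id)
    return hits
```

On the sample, `blocks_at(SAMPLE, 89, 485)` returns `["TB2"]`, since only the second block covers that point.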

b. An application wants to list all the images on a specific page of a digital document:
-> http://gallica.bnf.fr/ark:/12148/bpt6k96128443/f26.alto/id/Illustration
RETURNS a list of block IDs: ("PAG_00000026_IL000001")

-> http://gallica.bnf.fr/ark:/12148/bpt6k96128443/f26.alto/xml/Illustration[ID=PAG_00000026_IL000001]
RETURNS the XML element:
<Illustration ID="PAG_00000026_IL000001" HPOS="744" VPOS="707" HEIGHT="3410" WIDTH="819"/>

From this XML content, the application can then extract the illustration using IIIF:
-> http://gallica.bnf.fr/iiif/ark:/12148/bpt6k96128443/f26/744,707,819,3569/full/0/native.jpg
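A client could derive the IIIF region from the returned element along these lines. This is a sketch under two assumptions: that the ALTO coordinates are in pixels of the image served by IIIF, and that the region is built as `x,y,w,h` from HPOS, VPOS, WIDTH and HEIGHT. The function name `iiif_region_url` is hypothetical. With the element shown above, the sketch takes the height from the HEIGHT attribute (3410), so its output differs from the example URL, which uses 3569.

```python
import xml.etree.ElementTree as ET


def iiif_region_url(iiif_base: str, element_xml: str) -> str:
    """Build an IIIF Image API URL cropping the image to an ALTO element.

    iiif_base is the image identifier URL, without region or size parameters.
    """
    el = ET.fromstring(element_xml)
    # IIIF regions are x,y,w,h; ALTO stores them as HPOS, VPOS, WIDTH, HEIGHT.
    x, y, w, h = (round(float(el.attrib[name]))
                  for name in ("HPOS", "VPOS", "WIDTH", "HEIGHT"))
    # "full/0/native.jpg" mirrors the size, rotation and quality used above.
    return f"{iiif_base}/{x},{y},{w},{h}/full/0/native.jpg"


illustration = ('<Illustration ID="PAG_00000026_IL000001" HPOS="744" '
                'VPOS="707" HEIGHT="3410" WIDTH="819"/>')
url = iiif_region_url(
    "http://gallica.bnf.fr/iiif/ark:/12148/bpt6k96128443/f26", illustration)
```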

c. An application wants to extract all the text within the print space of a specific page:
-> http://gallica.bnf.fr/ark:/12148/bpt6k96128443/f26.alto/id/PrintSpace/*[@CONTENT]
RETURNS a list of block IDs: ("PAG_00000026_TB000002","PAG_00000026_TB000003","PAG_00000026_TB000004"...)

From these IDs, the application can then extract the XML elements and filter the text blocks to access the text itself.
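The last step, turning a dereferenced TextBlock into plain text, might look like the following sketch. It assumes the usual ALTO layout (TextLine elements containing String elements with a CONTENT attribute), joins words with single spaces, and ignores hyphenation (HYP) handling. `block_text` and the sample block are hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Drop a possible "{namespace}" prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def block_text(text_block_xml: str) -> str:
    """Return the text of a TextBlock, one output line per TextLine."""
    root = ET.fromstring(text_block_xml)
    lines = []
    for line in root.iter():
        if local_name(line.tag) != "TextLine":
            continue
        # Collect the CONTENT of every String in reading order.
        words = [s.attrib.get("CONTENT", "")
                 for s in line.iter() if local_name(s.tag) == "String"]
        lines.append(" ".join(words))
    return "\n".join(lines)


# Made-up sample block, not taken from the Gallica document.
sample_block = """<TextBlock ID="TB1">
  <TextLine><String CONTENT="Hello"/><SP/><String CONTENT="world"/></TextLine>
  <TextLine><String CONTENT="again"/></TextLine>
</TextBlock>"""
text = block_text(sample_block)
```

Here `text` is `"Hello world\nagain"`. Non-text blocks such as Illustration simply yield an empty string, which gives the application an easy way to filter them out.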

Inspiration

The IIIF Image API (http://iiif.io/api/image/2.0) specifies a web service that returns an image. The HTTP request can specify the region, size, rotation, quality characteristics and format of the requested image:
-> http://gallica.bnf.fr/iiif/ark:/12148/bpt6k65372641/f1/1165.4351015801358,833.7189616252821,969.8363431151238,964.1647855530472/171,170/0/native.jpg

The EPUB format has a recommended specification on Fragment Identifiers (http://www.idpf.org/epub/linking/cfi/epub-cfi.html) that helps to express paths to specific locations within the content:
-> book.epub#epubcfi(/6/4[chap01ref]!/4[body01]/10[para05]/3:10)

Related work:
http://pro.europeana.eu/blogpost/europeana-aligns-with-the-international-image-interoperability-framework-iiif
http://pro.europeana.eu/files/Europeana_Professional/Projects/Project_list/Europeana_Cloud/Deliverables/D4.4%20Recommendations%20For%20Enhancing%20EDM%20to%20Support%20Research%20Oriented%20Content.pdf

Actions

  1. Use cases survey
  2. Contact with IIIF?
  3. Syntax specs

In the IIIF Presentation API, segments of XML files may be extracted with URL-embedded XPath expressions.
See http://iiif.io/api/presentation/2.1/#segments

IIIF Newspaper Implementation Notes: http://bit.ly/2a63PR6

IIIF Issues: https://github.com/IIIF/iiif-stories/issues
See #77, #78, #79, #80

Issue renamed and repurposed. Closed.