WordPress / openverse

Openverse is a search engine for openly-licensed media. This monorepo includes all application code.

Home Page: https://openverse.org

Attribution: XML/RDF/Turtle please.

midijohnny opened this issue

It would be handy if there were some additional formats for the "Credit the creator" section.
In particular, I would suggest it should at least include a simple, well-formed XML record.

Better: one that corresponds to the Dublin Core specification: https://www.dublincore.org/specifications/dublin-core/dcmi-terms/

Or RDF in general - including a 'Turtle format'.

(Although publishing in Dublin Core XML format would probably be enough for others to automatically translate this to other forms of RDF.)

I would suggest this would also encourage more compliance with attribution, since it would be easier for the author to automatically credit creators.

I added "frontend" label because I think this refers to the frontend single result page's "Credit the creator" section, not the API's "attribution" property.

Attribution formats like XML should be supported by the API as well imo, so having both is 👌.

DC sounds great! CC REL already uses DC terms, and the rich-text/HTML version of the attribution would be relatively easy to translate into a DC XML fragment.

I'd love to take on this.
Quick question off the top of my head: should the generation of the XML attributions be done at the API level or on the frontend?

Currently all frontend attribution generation happens in JavaScript: https://github.com/WordPress/openverse/blob/main/frontend/src/utils/attribution-html.ts. The Python openverse-attribution package also exists, but we can back-port this feature there later on, if it's needed. For now, just add it to the frontend.

The frontend's attribution-html module generates the HTML for each type of attribution. Rich text is the same as the HTML, but we render the HTML directly, rather than displaying the HTML as code to copy. Plain text is the same, but without any markup.

The XML snippet should just be another output option. You can use the existing methods for generating HTML to generate the XML.

Are you familiar with DC or RDF @madewithkode? There are a lot of resources online about both, but DublinCore's own documentation tends to be the best, and here's their documentation about RDF/XML specifically: https://www.dublincore.org/specifications/dublin-core/usageguide/#rdfxml and https://www.dublincore.org/specifications/dublin-core/dc-xml-guidelines/

The snippet there already gives a good idea of how to add the parts we'd need, it's essentially 1:1 with that, except we'd also populate dc:rights. Something like this, using https://openverse.org/image/feb91b13-422d-46fa-8ef4-cbf1e6ddee9b?q=galah as an example:

<rdf:RDF
   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
   xmlns:dc="http://purl.org/dc/elements/1.1/">

   <rdf:Description rdf:about="https://www.flickr.com/photos/126953422@N04/40593461235">

      <dc:creator>Graham Winterflood</dc:creator>
      <dc:title>Galah in Darwin (Eolophus roseicapilla)</dc:title>
      <dc:rights>"Galah in Darwin (Eolophus roseicapilla)" by Graham Winterflood is licensed under CC BY-SA 2.0. To view a copy of this license, visit https://creativecommons.org/licenses/by-sa/2.0/?ref=openverse.</dc:rights>

   </rdf:Description> 
</rdf:RDF>

That interprets dc:rights as the broadest possible rights statement, and makes things relatively "uncomplicated" for us, when it comes to deciding how to represent CC with just DC. If we want to bring in CC REL, that's a separate story. I believe we could offer that, but if we want just the most basic RDF representation with just DC, this is probably it. Users can edit down dc:rights to whatever makes sense for their use case. This also has us ignoring a bunch of DC's recommendations for how to format DC XML, including not using DC (with XSI) to designate the type of resource, the type of resource identifier, and more detailed information about the rights statement.

However, I think we shouldn't create the full RDF XML, and instead just offer the DC elements as XML (and we could follow this up by offering different formats like Turtle or JSON-LD in the future, as separate issues). So then, we'd just have a copyable snippet, with some explanatory text. Maybe like this:

<dc:creator>Graham Winterflood</dc:creator>
<dc:title>Galah in Darwin (Eolophus roseicapilla)</dc:title>
<dc:rights>"Galah in Darwin (Eolophus roseicapilla)" by Graham Winterflood is licensed under CC BY-SA 2.0. To view a copy of this license, visit https://creativecommons.org/licenses/by-sa/2.0/?ref=openverse.</dc:rights>
<dc:identifier>https://www.flickr.com/photos/126953422@N04/40593461235</dc:identifier>
<dc:type>StillImage</dc:type>

dc:type should be Sound for audio.
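
The media-type mapping described above could be captured in a small helper. This is only a sketch: `MediaType`, `DC_TYPES` and `dcTypeFor` are hypothetical names for illustration, not part of the existing frontend code. `StillImage` and `Sound` are terms from the DCMI Type Vocabulary.

```typescript
// Assumed media types for this sketch (hypothetical, not the frontend's real type).
type MediaType = "image" | "audio";

// DCMI Type Vocabulary terms to use as the dc:type value.
const DC_TYPES: Record<MediaType, string> = {
  image: "StillImage",
  audio: "Sound",
};

// Look up the dc:type value for a given media type.
function dcTypeFor(mediaType: MediaType): string {
  return DC_TYPES[mediaType];
}
```

Keeping the mapping in one lookup table means a new media type only needs one new entry.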

It can be that simple, if we like. @midijohnny please let me know if I've got this wrong... I'm basing this on just 6 months of Library and Information Services courses I took recently, and only did a small amount of DC, but never anything in XML.

I don't think we should try to use DCMI terms (like implementing RightsStatements) because ultimately DC is so flexible, every institution or system realistically has its own approach to how they want to use it. Listing the DC terms like this as an XML snippet is my guess at the most flexible version of what we could do here.

Assigned to you, @madewithkode, but it's probably a good idea to wait for @midijohnny to give more input before going too strongly in one direction (snippet, full RDF, which terms to use, etc). I do think it's best to stick with just an XML snippet for this first issue.

I agree with you @sarayourfriend, any extra or more specific details regarding what's required would be appreciated. And thank you for the really in-depth insights on this topic. I'll be sure to check out the resources you shared, as I do not have any prior experience with the other markups/specifications being discussed aside from XML. I'll stand by on this for a bit to see if @midijohnny has anything more to add before getting started.

Great discussion! I'm not an expert in RDF or Dublin Core either, but I would say the example above ("Maybe like this...") is going to be good enough, with one minor alteration: include a root element with a namespace identifier.
That way, we would have a well-formed XML document in a specific namespace.

So something like:

<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
	<dc:creator>Graham Winterflood</dc:creator>
	<dc:title>Galah in Darwin (Eolophus roseicapilla)</dc:title>
	<dc:rights>"Galah in Darwin (Eolophus roseicapilla)" by Graham Winterflood is licensed under CC BY-SA 2.0. To view a copy of this license, visit https://creativecommons.org/licenses/by-sa/2.0/?ref=openverse.</dc:rights>
	<dc:identifier>https://www.flickr.com/photos/126953422@N04/40593461235</dc:identifier>
	<dc:type>StillImage</dc:type>
</metadata>

It doesn't have to be 'metadata' - it could be (say) 'attribution' or whatever you think is best.

Having a well-formed document like this - with the namespace included (so people can look up the vocabulary based on the namespace) would provide a large benefit I think.
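
To make the proposal concrete, here is a minimal sketch of how such a document could be generated. Everything named here (`AttributableMedia`, `escapeXmlText`, `toDublinCoreXml`, and the field names) is hypothetical and not taken from the existing attribution-html module; only the namespace URI and the element names come from the example above.

```typescript
// Hypothetical input shape; field names are assumptions for this sketch.
interface AttributableMedia {
  creator: string;
  title: string;
  rights: string; // the full plain-text attribution sentence
  identifier: string; // URL of the work at its source
  dcType: "StillImage" | "Sound";
}

// Escape the characters that must not appear raw in XML element text.
// Quotes are left alone because they are legal inside element content.
function escapeXmlText(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Build a well-formed document: a root element declaring the dc namespace,
// with one tab-indented dc:* child per field.
function toDublinCoreXml(media: AttributableMedia, root = "metadata"): string {
  const fields: [string, string][] = [
    ["creator", media.creator],
    ["title", media.title],
    ["rights", media.rights],
    ["identifier", media.identifier],
    ["type", media.dcType],
  ];
  const body = fields
    .map(([term, value]) => `\t<dc:${term}>${escapeXmlText(value)}</dc:${term}>`)
    .join("\n");
  return `<${root} xmlns:dc="http://purl.org/dc/elements/1.1/">\n${body}\n</${root}>`;
}
```

The root element name is a parameter because the thread leaves it open ('metadata', 'attribution', or something else). Escaping matters because titles and creator names come from third-party sources and may contain `&` or `<`.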

It means (for instance) somebody downstream can build an XSLT to transform this to what suits them.
You could even consider using this (or something similar) as the 'base' information and use XSLT to transform to the HTML/plain-text format to be displayed on the website - but that is just a suggestion.

For my purposes: I was collecting images to display in an XHTML (i.e. well-formed XML) environment, so if I had the format above it would have made my life easier.

For additional context - here's why I logged the original request.
I was building a small example that needed some example images and I wanted to make sure I displayed the attribution (of course) - I had to build my own representation in a file images.xml, but if the original attribution information was already available in a relatively simple well-formed document, I would have just been able to use that (perhaps with minor edit) straight-off.

Perfect, thanks very much @midijohnny! I was wondering how best to include the namespace, that looks great. And makes things more flexible for the future if we want to implement CC REL.

@madewithkode how do you feel about starting on this, when you have time? Do you feel you have enough to go on to get started?

Sorry I'm late guys, been battling a flu. Really great insights and extra context, @midijohnny.
@sarayourfriend sure, I should be able to start off something with the information at hand, once I'm fully back.

No worries at all, take your time and get well soon! There's no rush or pressure with this.

I wanted to share some prior art here concerning XML. The Dublin Core we're adding in #4499 looks good. I also remembered today that Creative Commons' own License Chooser offers Extensible Metadata Platform (XMP) format, which is XML in a .xmp file.

Fun fact: @obulat implemented it a few years ago in this PR: creativecommons/chooser#272. A small change was made to that implementation shortly after.

I wonder if we should support that format as well?

What's the use case for downloading an XMP snippet? Wouldn't you use your image editor software to add that data, either embedded as an EXIF extension or as a sidecar file? I didn't know it could (or would) be used for attribution; I'm only familiar with its use for describing the immediate work. I guess if you can add arbitrary additional metadata, attributions would just go there? I'm not sure how you would structure attributions in DC. An array of RightsStatements?

The only (possible) use case I can think of is in the context of remixing works, where you might want to store or modify the original XMP for your new, derived work.

Even that feels somewhat contrived though. Probably best to wait on that until someone with a clear use case asks for it, as happened with Dublin Core here! 😄