Aligning Citation File Format and CodeMeta

sdruskat opened this issue 6 years ago · comments

TL;DR

The Citation File Format is a software citation metadata input format that is tailored to support credit-based use cases (1, 2, 15) as described in the principles paper, and to enforce adoption of the principles. CodeMeta is a general exchange format for software metadata. Both can be used to provide citation metadata, and concerns have been voiced about possible duplication of effort. In my opinion, however, the formats do different things.

I propose that it's fine to have both for the initial provision of citation metadata, letting users pick their favourite, and that downstream in the citation workflow existing CFF files should simply be converted to codemeta.json to leverage its advantages as a multi-purpose exchange format.

Introduction

At the recent [SCIWG meetup in Berlin (during RDA)](https://github.com/force11/force11-sciwg/blob/master/meetings/20180322-Notes.md), we discussed the relationship between CodeMeta and the Citation File Format (CFF). I had wanted to do this for quite some time, as I felt the two were too close in at least a subset of their purposes to simply ignore their co-existence, and I wanted to make an effort to align/reconcile the formats and their respective places in the software citation workflow.

During the meeting itself I feel I've failed to create enough understanding of the purpose of CFF and the difference between it and CodeMeta. Subsequently, I've discussed their relationship at the SSI Collaborations Workshop 2018 both within a dedicated mini-workshop and in several personal discussions.

In this issue I'll try to summarize what's been discussed, as far as necessary, and would like the working group to continue the discussion here to find the optimal way to align the formats with each other, in order to avoid unnecessary duplication of effort. I'll also make the case that both can co-exist without necessarily harming each other's progress and uptake.

Background

I guess that most members of the working group will be familiar with CodeMeta, but possibly know little about CFF, so what follows is a little background information and a brief comparison.

CFF is a YAML-based format for software citation metadata. It is the indirect outcome of a discussion group at WSSSPE5.1, which looked at replacing free-text CITATION files with something machine-readable.

CFF focuses on the "simpler" citation use cases (1, 2, 15) from the principles paper. It enforces application of the principles by requiring specific keys. It provides context and is "self-descriptive" by including 1) a mandatory message key which should contain usage instructions, 2) scopes for secondary references for a software (e.g., a software paper, a paper describing an algorithm implemented in the software). It is compatible with CodeMeta in that it has a column in the crosswalk table. It is both "more generic" (@danielskatz) and more specific than CodeMeta, in that a) it doesn't specify what it can relate to, which can be more than just "a software/version/object with a DOI", i.e., packages within a project, single source code files, even specific LOC, single commits, etc.; b) it provides more fine-grained keys for, e.g., commits vs. software versions.
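To make the "required keys" and "scoped references" ideas more concrete, here is a minimal Python sketch. It assumes the CFF 1.0-era key names (`cff-version`, `message`, `title`, `version`, `date-released`, `authors`, and a `scope` key on references); the helper names and the example values are hypothetical and only for illustration, not part of any CFF tooling.

```python
# Illustrative sketch only: the required key set is an assumption based on
# the CFF 1.0-era layout; check the CFF specification for the actual rules.
REQUIRED_KEYS = {
    "cff-version", "message", "title", "version", "date-released", "authors",
}


def missing_required_keys(cff: dict) -> set:
    """Return the required keys that are absent from a CFF-like mapping."""
    return REQUIRED_KEYS - set(cff)


def references_by_scope(cff: dict) -> dict:
    """Map each scoped secondary reference's scope text to its title."""
    return {
        ref["scope"]: ref["title"]
        for ref in cff.get("references", [])
        if "scope" in ref
    }


# Hypothetical CFF content, written as the dict a YAML parser would return.
example = {
    "cff-version": "1.0.3",
    "message": "If you use this software, please cite it as below.",
    "title": "My Research Tool",
    "version": "1.0.0",
    "date-released": "2018-04-11",
    "authors": [{"family-names": "Doe", "given-names": "Jane"}],
    "references": [
        {
            "type": "article",
            "title": "An algorithm paper",
            "scope": "Cite this paper when referring to the algorithm.",
        },
    ],
}

print(missing_required_keys(example))
print(references_by_scope(example))
```

The point of the sketch is that a CFF file without, say, a `message` can be rejected mechanically, which is how the format nudges authors towards the basic requirements of the principles.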

There are some tools for CFF available from the GitHub org (doi2cff resolver/converter, github/-lab2cff extractor, generic converter (CFF to BibTeX, CodeMeta, EndNote, RIS); Python, Ruby, Java tooling). A generator web app prototype was created during the CW18 hack day (release forthcoming).

There has also been some uptake, particularly by the Netherlands eScience Center, where CFF is used for providing citation metadata for their software directory.

Discussion

During the SCIWG working meeting mentioned above, concerns were voiced that, for a small community such as ours, developing and maintaining two different metadata formats might be too much of a strain on resources. What I have therefore taken away from that meeting are three options for CFF to align with CodeMeta:

1. Let CFF die
2. Transform CFF into a CodeMeta YAML representation
3. Achieve and maintain full compatibility

Before and during the Collaborations Workshop 2018, I have juggled pros and cons of these options and have discussed them, partly in great depth, with CW18 attendees.

Considerations including feedback from CW18

CFF as CodeMeta YAML representation

As for the above point 2 (Transform CFF into a CodeMeta YAML representation), this is something that @mfenner had suggested during the SCIWG working meeting. As YAML is a superset of JSON, it would be great if CFF could represent CodeMeta as codemeta.yaml, and be convertible via base libraries for JSON/YAML in programming languages. However, YAML is not a representation format for schema.org, and hence it's impossible to convert without loss, or - for YAML to JSON-LD conversion - without manipulation during the conversion process.
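As a rough illustration of why a plain YAML-to-JSON dump is not enough, the following Python sketch shows the kind of manipulation a converter has to do: add JSON-LD `@context`/`@type` entries and rename keys. The key mapping covers only a handful of fields and is an assumption for illustration, not the full crosswalk; `cff_to_codemeta` is a hypothetical helper, not the API of any existing tool.

```python
import json

# Assumed CodeMeta 2.0 JSON-LD context identifier.
CODEMETA_CONTEXT = "https://doi.org/10.5063/schema/codemeta-2.0"

# Tiny, illustrative subset of a CFF -> CodeMeta key mapping (assumption).
KEY_MAP = {"title": "name", "version": "version", "date-released": "datePublished"}
AUTHOR_KEY_MAP = {"given-names": "givenName", "family-names": "familyName"}


def cff_to_codemeta(cff: dict) -> dict:
    """Convert a CFF-like mapping to a CodeMeta-like JSON-LD mapping."""
    doc = {"@context": CODEMETA_CONTEXT, "@type": "SoftwareSourceCode"}
    for cff_key, codemeta_key in KEY_MAP.items():
        if cff_key in cff:
            doc[codemeta_key] = cff[cff_key]
    authors = []
    for author in cff.get("authors", []):
        person = {"@type": "Person"}
        for cff_key, codemeta_key in AUTHOR_KEY_MAP.items():
            if cff_key in author:
                person[codemeta_key] = author[cff_key]
        authors.append(person)
    if authors:
        doc["author"] = authors
    return doc


# Hypothetical CFF content, as the dict a YAML parser would return.
cff = {
    "cff-version": "1.0.3",
    "message": "If you use this software, please cite it as below.",
    "title": "My Research Tool",
    "version": "1.0.0",
    "date-released": "2018-04-11",
    "authors": [{"family-names": "Doe", "given-names": "Jane"}],
}

print(json.dumps(cff_to_codemeta(cff), indent=2))
```

Note that keys without a counterpart in the mapping (here `message` and `cff-version`) are silently dropped, which is exactly the kind of loss discussed above.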

On the other hand, one of the next steps for CFF will be to create a "CFF-Meta" module (cf. discussion in this issue), which will add support for those fields in CodeMeta that aren't yet represented in CFF, hence allowing for lossless conversion* (although not simple transformation) between the two formats.

This leaves us with two options: no CFF, or a fully compatible CFF.

Discard CFF

The simplest option, arguably, but I'd like to make a case against it for the following reasons.

1. CFF and CodeMeta are not the same thing and are not doing the same thing
   - While CodeMeta is an exchange format, CFF is an input/provision/"documentation" format.
   - As such, one of CFF's use cases, as the direct successor of the free-text CITATION files, is to be distributed with artifacts, similar to a README or LICENSE file.
   - Additionally, CFF is self-descriptive in that it must contain a message that tells the user what to do with the provided metadata.

   Some of the feedback collected during CW18 suggested that while some communities would not know where to start with a codemeta.json file, they'd be happy to write CFF files for their software. This is of course highly subjective, but as it has been mentioned quite often, it may stand as a valid point. So perhaps I should rephrase: CodeMeta is the better exchange format (undeniably), CFF is the better input format.

2. CodeMeta is a multi-purpose format, CFF (Core) is very much citation-centric.
   - I think the strongest support for this claim is that CFF actually enforces application of the software citation principles by requiring data for the basic requirements from the principles paper (table 2) for the use cases it is mainly meant to support.
   - Additionally, it supports the provision of citation metadata not only for whole software projects/versions, but also for smaller units (see above), e.g., single commits.
   - With fine-grained key sets for, e.g., different types of repositories, CFF is attractive for corner cases such as providing citation metadata for legacy software (as suggested by @drjwbaker).
   - And CFF supports lists of scoped secondary references, e.g., algorithm papers (see above).
3. CFF is human-centric (in terms of writability/readability), CodeMeta is - arguably - more machine-centric, by design.
4. The community does actually want CFF to exist!
   Apart from the "simplicity" feedback noted above, this claim is mostly based on personal feedback from CW18. CFF is recognized as a thing to use for providing software citation metadata, partly by virtue of its name (which has been described - not by me - as sounding "official, longstanding, authoritative"), whereas the same understanding of CodeMeta did not seem to have permeated the group of attendees at CW18. (This is obviously not a very strong point, as changing it is a matter of publicity.)

In addition to this, there has already been decent uptake, see above.

Proposal: Let CFF and CodeMeta co-exist in the primary tier of software citation

During the CW18 mini workshop, @danielskatz has provided the following comment, which I think is very much to the point:

> I want to figure out how we put CFF and CodeMeta together, so we don’t have two unrelated duplicative things running around at the same time.

I'd like to make the following proposal to solve this, up for discussion.

Let both formats do what they do best, as alternative solutions, while enabling downstream conversion to CodeMeta.

In my opinion, there are no downsides to letting the user choose which format to use for the primary provision of software citation metadata. If a user feels that s/he prefers one over the other, that's perfectly fine. If I was forced to pick which one should be preferred, I'd say CFF just because (IMHO) it is more user-friendly - and thus makes the whole software citation workflow more accessible to possibly less informed individuals - and better suited specifically to the referred simpler citation use cases, but I strongly believe that this is a decision to be made by the actual user.

Also, I think it's fine that either format can inform end user-facing tools that process the provided information, such as code platforms, reference managers, or applications themselves (which may read, format and display via cite() calls or similar).

As stated, I believe it is fine to have both options at the primary stage, i.e., direct or mediated provision of software citation metadata by the initial supplier ("authors") of a software. However, as CodeMeta is clearly the exchange format of choice, the crucial factor in all this is that conversion from CFF to CodeMeta should be implemented as soon as metadata exchange is in preparation, or actually happens.

So: Users should be able to choose which format to write, or generate, initially, but should be encouraged and supported in transforming CFF to CodeMeta downstream.

This can happen via user-initiated conversion, and there's already a tool to do that. More importantly though, this should be automatable at certain steps in the development/release/share workflow, e.g., at deploy time (Maven Release Plugin, twine, etc.), in CI/CD (Travis, Jenkins, etc.), in the GitHub-Zenodo bridge, etc. Some efforts related to this have already been made, others are underway. And I don't think that these efforts actually drain resources from the SCIWG; instead they seem to help with onboarding further parties to software citation implementation.

* "Lossless" conversion in that all of the actual software metadata can be converted. I'm not sure whether CodeMeta supports multiple (scoped) secondary references as is, so perhaps we should discuss whether this is something that could be useful to have in CodeMeta as well.

Martin Fenner · Answer 1 · Wed Apr 11 2018 17:14:17 GMT+0800 (China Standard Time)

Thank you @sdruskat. Very helpful. I agree with your basic conclusions, but I think it is important that everyone clearly understands when to use CFF and when to use CodeMeta. In my experience choice is not always preferable, as it can create fragmentation and confusion. CFF as a user-friendly input format makes a lot of sense to me.

Stephan Druskat · Answer 2 · Wed Apr 11 2018 17:58:00 GMT+0800 (China Standard Time)

Thanks @mfenner, glad (and somewhat relieved) that you agree :)!

D'accord, we must make clear when to use which. I'll see to updating, e.g., cite.research-software.org to include this very clearly. Will also think about how to represent it in CFF's user-facing tooling somehow.

Robert Haines · Answer 3 · Wed May 09 2018 15:26:09 GMT+0800 (China Standard Time)

I'm keen to work on this. I have written a CFF Ruby library, which may be of some use in integration with, e.g., bolognese? https://github.com/citation-file-format/ruby-cff

Jurriaan H. Spaaks · Answer 4 · Thu Sep 13 2018 19:59:47 GMT+0800 (China Standard Time)

When people ask me about the difference between CodeMeta and CFF I tell them the following:

- CodeMeta is used to describe what some Thing is
- CFF is used to help cite the Thing

The first is useful for example when a search engine indexer/crawler comes along and wants to know what the Thing is, in order to know when to return it as a possible search result.

The second is useful if you, e.g., developed a piece of software and want to make it easy for people who use that work to give you credit. Arguably, doing so is already possible using only CodeMeta, but in my view CFF is more precise and more suitable for software citation. Additionally, it offers:

- (scoped) referencing
- potential transitive crediting (although I'm not aware of, e.g., a credit property that you can assign to each reference)

Anyway, because of their different purposes, I don't see why we should get rid of one or the other.

I share Stephan's view that CFF is a little bit easier to write and read, so I normally just write CFF and then generate a codemeta.json (and a .zenodo.json) using cffconvert. I realize that this gives me a relatively small subset of the CodeMeta spec, but so far that's been sufficient (but maybe I just don't know what I'm missing out on). Others may prefer to go the reverse route, writing CodeMeta and generating the other files based on that, which should be fine as long as we have some tools that convert between them.