mozillascience / code-research-object

Project between GitHub, figshare and Mozilla Science Lab.

Home Page: https://mozillascience.github.io/code-research-object/

How should software be cited?

hubgit opened this issue

Here's an example of a software citation. Does it include all the appropriate information, and can it be improved? When a snapshot of the code has been archived somewhere, how should that be included in the citation?

Goddard TD, Kneller DG. 2007. SPARKY 3 (v3.114, Windows). San Francisco: University of California. Available from http://www.cgl.ucsf.edu/home/sparky/

The citation in JATS XML:

<element-citation publication-type="software">
  <person-group person-group-type="author">
    <name><surname>Goddard</surname><given-names>TD</given-names></name>
    <name><surname>Kneller</surname><given-names>DG</given-names></name>
  </person-group>
  <source>SPARKY 3</source>
  <edition designator="3.114">v3.114, Windows</edition>
  <year iso-8601-date="2007">2007</year>
  <publisher-loc>San Francisco</publisher-loc>
  <publisher-name>University of California</publisher-name>
  <comment>Available from <uri>http://www.cgl.ucsf.edu/home/sparky/</uri></comment>
</element-citation>

We currently capture this on figshare based on the data citation principles as follows. We will follow the advice of the community here:

Sparks, Adam (2014): Global-Late-Blight-Modelling. figshare.
http://dx.doi.org/10.6084/m9.figshare.963593

The DataCite metadata page for that code/dataset has a link to the XML that describes it:

<?xml version="1.0" encoding="UTF-8"?>
<resource xmlns="http://datacite.org/schema/kernel-2.2"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://datacite.org/schema/kernel-2.2
    http://schema.datacite.org/meta/kernel-2.2/metadata.xsd">
  <identifier identifierType="DOI">10.6084/M9.FIGSHARE.963593</identifier>
  <creators>
    <creator>
      <creatorName>Adam Sparks</creatorName>
    </creator>
  </creators>
  <titles>
    <title>Global-Late-Blight-Modelling</title>
  </titles>
  <publisher>Figshare</publisher>
  <publicationYear>2014</publicationYear>
</resource>

Some comments:

  • A piece of software might be created by dozens, if not thousands, of contributors. Do all of them get cited? This seems like the classic "attribution stacking" problem. Perhaps it would be better to have variations, with or without authors and/or with a pointer to contributors.
  • Publisher name/location is similarly difficult. If I collaborate on a project with another developer and it's published on GitHub, then who is the publisher and which location do we use? Is it GitHub? Is it the repo owner (which might be an organisation)?
  • For software I'd be interested in getting a snapshot (for archival purposes), but I'm also interested in the main, live repo (if there is one), as that is likely to have more context, e.g. for identifying/reporting bugs, future collaboration, etc.
  • For edition, what if that's just a GitHub commit version?

It seems unclear whether we are discussing software (a relatively stable, executable package) or a code base (a rapidly evolving, versioned object). While in many cases referring to the software is enough, in other cases there is actually no software, just an evolving code base.

@seinecle that's a good point - maybe it would be useful to identify what the citation is intended to support? E.g.:

  • accessing a pre-built piece of software, e.g. in order to re-run some analysis, a simulation, or open some files
  • accessing a code base, e.g. in order to inspect and possibly build/run the application/code/analysis
  • finding a code base in order to perform new analyses with the same software

The final one is perhaps not a typical goal for scientific citations, but I think for both data and software citations, the "live"/current version ought to be discoverable from the citation.

With the fidgit project progressing in parallel to this, would it make sense to include a meta pointer to the DOI assigned by figshare? This would eliminate having to put a link to the code in a tag manually.

One comment from looking at the metadata, especially in light of the comments in a previous thread from @bobbledavidson, @npch and others about the minimal information we need to capture: is the other project metadata being captured anywhere? The intermediary page currently asks you to input language, platform, maintainer, description and (probably most importantly) license, so was that input into the above example?

@ldodds RE citation etiquette, I don't know how many authors the system can handle before it breaks down (we've added >90 to some DataCite DOIs), but if you follow the practice of journals and the style of the human genome project, you could list a massive group of contributors as a single author, e.g. [x consortium] or [x community of developers]. DataCite metadata can describe research objects at different levels of granularity if you wanted to credit separate units of code, and its RelatedIdentifier field can precisely describe relationships to other research objects through values like IsSupplementTo/IsContinuedBy/IsNewVersionOf/IsDocumentedBy/IsCompiledBy/etc.
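As a rough sketch of how that might look in a DataCite record for an archived code snapshot (the DOIs and relation choices below are made up for illustration, not taken from the figshare record above):

<relatedIdentifiers>
  <!-- hypothetical DOI of an earlier snapshot of the same code -->
  <relatedIdentifier relatedIdentifierType="DOI" relationType="IsNewVersionOf">10.6084/M9.FIGSHARE.000000</relatedIdentifier>
  <!-- hypothetical DOI of the article the code supplements -->
  <relatedIdentifier relatedIdentifierType="DOI" relationType="IsSupplementTo">10.1000/example.article</relatedIdentifier>
</relatedIdentifiers>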

@ldodds:

A piece of software might be created by dozens, if not thousands, of contributors. Do all of them get cited? This seems like the classic "attribution stacking" problem. Perhaps it would be better to have variations, with or without authors and/or with a pointer to contributors.

There's also the difference between the maintainer(s) of a project (who is/are currently responsible for it), and the contributors (those who have committed code to the project, or made other contributions). I imagine it would be the maintainers (and possibly also previous maintainers) that would be cited, though there are bound to be exceptions.

Publisher name/location is similarly difficult. If I collaborate on a project with another developer and it's published on GitHub, then who is the publisher and which location do we use? Is it GitHub? Is it the repo owner (which might be an organisation)?

I think this can be optional, but it might be useful when software is produced solely by a specific university or software company. A better analogy to book publishers, though, might be the code hosting (e.g. "GitHub") or archiving (e.g. "figshare") service. The geographic location is probably irrelevant, unless it's necessary for distinguishing between multiple entities with the same name.

For software I'd be interested in getting a snapshot (for archival purposes), but I'm also interested in the main, live repo (if there is one), as that is likely to have more context, e.g. for identifying/reporting bugs, future collaboration, etc.

Yes, I think being able to cite the snapshot as well as providing details of the current codebase (even if it's just the equivalent of an "Available from" or "Accessed at" URL) needs to be in there.

For edition, what if that's just a GitHub commit version?

I guess the version would be the hash, in that case, and it would be nice to add a URL for it…
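As a sketch of what that could look like, reusing the element-citation structure from the example at the top of this thread (the author, project name, commit hash and repository URL are all invented for illustration):

<element-citation publication-type="software">
  <person-group person-group-type="author">
    <name><surname>Doe</surname><given-names>J</given-names></name>
  </person-group>
  <source>example-analysis</source>
  <!-- hypothetical commit hash used as the version designator -->
  <edition designator="1a2b3c4">commit 1a2b3c4</edition>
  <year iso-8601-date="2014">2014</year>
  <comment>Available from <uri>https://github.com/example/example-analysis/tree/1a2b3c4</uri></comment>
</element-citation>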

@pbulsink Yes, the DOI should definitely be in there, though there is perhaps ambiguity about whether it's an identifier for the snapshot, an identifier for the specific release, or an identifier for the software as a whole (which could be assigned a separate DOI, linked to DOIs for specific releases using versioning metadata).

@ScottBGI Some of the metadata that could be attached to the project is most useful for discovery, rather than citation (which just needs to identify the software specifically enough that a reader could find it). "platform" should be in the citation, I think, and possibly "maintainer" (see the comment above) but the code language, description and license probably don't need to be.

It looks like we are discussing a couple of different problems in parallel:

  1. How, specifically, should code be cited, i.e. what are the differences with respect to citing papers or datasets?
  2. How should work in progress be cited?
  3. How to cite work by a large and/or indefinite group of authors?

As for 1), I see two distinct cases. Software cited for the scientific record (we used package X) should exist in an archive and be cited with a DOI. The reference should be to a precise version. Software cited as a recommendation for use (we implemented our algorithm in package X, ...) should be on a development site such as GitHub, and referenced there.

Point 2) is already a standard situation in citing Web resources such as Wikipedia. The habit is to state the date at which the resource was consulted.
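In JATS that consultation date could sit alongside the URL; a minimal sketch, assuming the date-in-citation element is available for this and using an invented date (other fields omitted):

<element-citation publication-type="software">
  <source>SPARKY 3</source>
  <comment>Available from <uri>http://www.cgl.ucsf.edu/home/sparky/</uri></comment>
  <!-- hypothetical access date, for illustration only -->
  <date-in-citation content-type="access-date" iso-8601-date="2014-03-01">March 1, 2014</date-in-citation>
</element-citation>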

Point 3) doesn't have a good solution in the academic tradition. We are very attached to citing specific people, maybe companies, but not communities.

@mikej888 did some work for the Software Sustainability Institute looking at citing software in traditional outputs: http://software.ac.uk/so-exactly-what-software-did-you-use

This includes a summary of what various journals ask for, as well as some software platforms like R.

@seinecle's and @khinsen's comments are very insightful - the "citation" metadata associated with a piece of software conflates a number of issues.

Taking @khinsen's points in order:

  1. There are many similarities to datasets, which also have a sense of "versions"; however, datasets typically have clearer boundaries / collection hierarchy and authorship (though they may also suffer from many authors). There's also not always a direct analogy for a publisher ("self-published"?). However, I do think that if we disregard "who was an author on this code and what contribution did they make", then code can be cited using the following metadata:

  • Author list
  • Code naming identifier (some human-readable name for "the code")
  • Code version identifier (e.g. a tag or a release version, ideally uniquely identifying the set of files which collectively form "the code")
  • Code location identifier (e.g. a DOI or URL that can be dereferenced to get to the code)
  • [optionally] A release date

Now this means that for 2), work in progress, citation is no different - the code version identifier and code location identifier will just point to a work-in-progress version. However, by advertising that version through a citation, you're effectively identifying a new version of the code. Given that most repositories (like GitHub) provide some sort of hash identifier for each commit, you could simply use that as an (automatically generated) version identifier (see the sketch below).
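For instance, filling in the fields above with made-up values (the authors, project name, commit hash and URL are invented), a free-text citation might read:

Doe J, Roe R. 2014. example-analysis (commit 1a2b3c4). Available from https://github.com/example/example-analysis/tree/1a2b3c4

Here the commit hash serves as the code version identifier and the commit URL as the code location identifier.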

Point 3) is always going to be an issue. Should an author drop off if all their contributions have been removed from the code base? I know that some projects insist on naming only the project, and then maintain the author list on the project website, but my issue with that approach is that it's not easily machine-understandable.

I don't think that platform, language or potentially even license information should be part of the citation metadata (though I might be willing to budge on license). When we were undertaking the SoftwareHub project for Jisc, looking at creating "showcase catalogues" of software funded by Jisc, we quickly realised that things like platform or programming language were not useful either for citation or for first-level discovery. They are useful for categorisation and filtering, but they aren't as useful as they first appear.

Another example:

The PERMANOVA+ add-on for PRIMER is often referenced as a book citation:

Anderson MJ, Gorley RN, Clarke KR. 2008. PERMANOVA+ for PRIMER: guide to software and statistical methods. PRIMER-E Ltd.

The makers of the software don't provide a citation example for that specific add-on, but they do provide one for the main software package, citing the user manual:

Clarke, KR, Gorley, RN, 2006. PRIMER v6: User Manual/Tutorial. PRIMER-E, Plymouth.

The citation most commonly used for the PERMANOVA add-on doesn't include information about which version of the software was used, or on which platform - this is usually described in the Methods section instead.

Users of R are also asked to cite the manual:

@Manual{,
  title        = {R: A Language and Environment for Statistical
                  Computing},
  author       = {{R Core Team}},
  organization = {R Foundation for Statistical Computing},
  address      = {Vienna, Austria},
  year         = 2013,
  url          = {http://www.R-project.org}
}

JATS 1.1 provides <version> and <data-title> elements:

<element-citation publication-type="software">
  <person-group person-group-type="author">
    <name><surname>Goddard</surname><given-names>TD</given-names></name>
    <name><surname>Kneller</surname><given-names>DG</given-names></name>
  </person-group>
  <data-title>SPARKY 3</data-title><!-- could be "software-title"? -->
  <version designator="3.114">3.114, Windows</version><!-- needs a "platform" element or attribute? -->
  <!-- use a "source" element for the host, e.g. "GitHub"? -->
  <year iso-8601-date="2007">2007</year>
  <publisher-loc>San Francisco</publisher-loc>
  <publisher-name>University of California</publisher-name>
  <uri>http://www.cgl.ucsf.edu/home/sparky/</uri>
</element-citation>

Does this still stand as the best source for this? I will recommend your suggestions to JATS4R.

@Melissa37 This probably needs updating to take into account the Force11 Software Citation Principles.