fairtracks / fairtracks_standard

FAIRtracks is a JSON Schema defining a minimal standard for genomic track metadata.

Home Page: https://fairtracks.net

Create make target to create new releases?

sveinugu opened this issue

Issue to fix: Versioned reference URLs cause the documentation build to fail (doesn't recognize sub-schema types).

There should probably be a field for noting the version of FAIRtracks used to validate a document. Perhaps even two fields, earliest version and latest version?

Using absolute URLs for ids causes issues for releases, which then need to fix those absolute URLs to a particular version. However, relative URLs break the doc generation.

If we adopt versioning principles similar (but not identical) to those of semantic versioning, we could achieve this.

For instance, if we are at version x.y.z, "minor" fixes to restrictions could render already validated tracks invalid. Such changes should then not lead to an update of the fix number, but to an update of the minor number (i.e. x.y+1.0).

We could adopt version numbering similar to the one used in the Vulkan API definition. Any change (fix, disruptive feature, whatever) leads to an increase in the z number, and depending on how backward compatible the changes are, either no change in x and y, an increase in y, or an increase in x and a reset of y to 0. So the z number becomes the number of releases of the API definition.
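As a rough sketch of the numbering just described (not an agreed FAIRtracks rule), the bump logic could look like this in Python. The compatibility labels `"full"`, `"partial"` and `"none"` are hypothetical names chosen for illustration only.

```python
from typing import NamedTuple


class Version(NamedTuple):
    x: int  # bumped on backward-incompatible changes
    y: int  # bumped on partially compatible changes
    z: int  # running count of releases, never reset


def next_release(v: Version, compatibility: str) -> Version:
    """Return the next version; z increases on every release."""
    if compatibility == "full":      # fully backward compatible
        return Version(v.x, v.y, v.z + 1)
    if compatibility == "partial":   # hypothetical middle level
        return Version(v.x, v.y + 1, v.z + 1)
    if compatibility == "none":      # breaking: bump x, reset y
        return Version(v.x + 1, 0, v.z + 1)
    raise ValueError(f"unknown compatibility level: {compatibility!r}")
```

Because z is never reset, it doubles as a release counter, which is the property the scheme is after.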

@jmfernandez I am a bit unsure which issue you propose a fix for. The abovementioned comments are mostly technical issues to fix later. To be more clear:

  • Currently, internal URLs are absolute, so we need to fix them for every tagged release (they cannot use v1/current links, as the schema pointed to by these might change). To automate this, we could write a make target for creating releases.
  • However, we should really move from absolute to relative URLs for the internal references, but this currently breaks the documentation generator. Relative URLs should make extra release changes unnecessary, so a make target is probably not needed then. A release would then just be a git tag, nothing more, which was my idea initially.
    - [ ] Independently of this, there should be one or two properties in the JSON docs to annotate the minimal (and perhaps maximal) FAIRtracks schema version required to validate the document.
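To make the first point concrete, here is a minimal sketch of what a release step could do: walk a schema and rewrite absolute `$id` and `$ref` URLs from a floating prefix to a version-pinned one. The `example.org` URLs, the file names and the function name `pin_urls` are all hypothetical; the actual FAIRtracks URL layout is not assumed here.

```python
def pin_urls(node, floating_prefix, pinned_prefix):
    """Return a copy of `node` with "$id"/"$ref" URLs moved to the pinned prefix."""
    if isinstance(node, dict):
        pinned = {}
        for key, value in node.items():
            if (key in ("$id", "$ref") and isinstance(value, str)
                    and value.startswith(floating_prefix)):
                # Swap only the prefix, keep the rest of the URL unchanged.
                pinned[key] = pinned_prefix + value[len(floating_prefix):]
            else:
                pinned[key] = pin_urls(value, floating_prefix, pinned_prefix)
        return pinned
    if isinstance(node, list):
        return [pin_urls(item, floating_prefix, pinned_prefix) for item in node]
    return node  # strings, numbers, booleans, None are left as they are


# Hypothetical schema fragment using a floating "current" link.
schema = {
    "$id": "https://example.org/schemas/current/track.schema.json",
    "properties": {
        "experiment": {"$ref": "https://example.org/schemas/current/experiment.schema.json"},
        "label": {"type": "string"},
    },
}

released = pin_urls(
    schema,
    "https://example.org/schemas/current/",
    "https://example.org/schemas/v1.0.2/",
)
```

With relative internal references, a step like this would not be needed at all, which is the argument for the second point.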

Sorry about the confusion, the notes were primarily written to myself, and thus not easy to understand.

Edit: removed point about minimal version annotation. I don't know whether we will need this. Time will tell.

I see no problems adopting Semantic Versioning (https://semver.org/spec/v2.0.0.html) directly, if used strictly, as argued e.g. in https://medium.com/javascript-scene/software-versions-are-broken-3d2dc0da0783, as follows:

Breaking.Feature.Fix

Breaking: backwards-incompatible changes. That is, if a FAIRtracks document that validates under one version stops doing so under the next, the Breaking number should increase
Feature: additional features that do not break validation of documents
Fix: bug-fixes
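The "Breaking" rule can be checked mechanically against a corpus of known documents. The sketch below assumes validators are plain callables returning `True` or `False`; in practice they would wrap a JSON Schema validator, which is left out here to keep the example self-contained. The function names are illustrative only.

```python
from typing import Callable, Iterable

Validator = Callable[[dict], bool]


def is_breaking(old_ok: Validator, new_ok: Validator, documents: Iterable[dict]) -> bool:
    """True if any document valid under the old schema fails under the new one."""
    return any(old_ok(doc) and not new_ok(doc) for doc in documents)


def bump(version: str, breaking: bool, feature: bool) -> str:
    """Apply Breaking.Feature.Fix numbering to a 'B.F.X' version string."""
    b, f, x = (int(part) for part in version.split("."))
    if breaking:
        return f"{b + 1}.0.0"
    if feature:
        return f"{b}.{f + 1}.0"
    return f"{b}.{f}.{x + 1}"


# Toy validators: the new schema adds a required property "b".
def old_schema_ok(doc: dict) -> bool:
    return "a" in doc


def new_schema_ok(doc: dict) -> bool:
    return "a" in doc and "b" in doc


corpus = [{"a": 1}, {"a": 1, "b": 2}]
breaking = is_breaking(old_schema_ok, new_schema_ok, corpus)
new_version = bump("1.0.1", breaking=breaking, feature=False)
```

A check like this can only prove that a change is breaking for the documents in the corpus; a clean run does not guarantee that no valid document elsewhere is affected.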

This means that what has until now been called v1.1 of FAIRtracks should really be v2.0.0, as it includes breaking changes. The major problem with this approach is that there is no way to signify the size of a breaking change. Let's say that we introduce a version v3.0.0 after a series of user feedback rounds, with a lot of changes. How do we state that this version is a major update compared to the relatively minor changes included in the increase from v1.0.1 to v2.0.0?

This is an issue for all software these days. But as SemVer is the de facto standard, I don't think we need to solve this, rather just be aware of the issue. I can add something in the documentation about this.

As argued above (https://medium.com/javascript-scene/software-versions-are-broken-3d2dc0da0783), one should rather use release names, or code names, for major releases. Since v2.0.0 will be the first major release, we should also define a set of code names. And since I am currently trying to release this alone, working overtime while the other Norwegians have started their vacation, I also take it upon myself to launch the rules for the code names. First, some considerations:

  • We should try to follow the alphabet, just like the hurricanes. In this way, the order of the releases is apparent. But we should reserve the right to skip some letters if there is not a good name available.
  • As we are handling genome track data, it should be something from a domain of relevance to life science, but domains like animals, or species names, have been done to death.
  • The draft standard is all about interoperability, and there is a large focus on using ontologies. Hence, we should pick names from ontologies
  • When first entering bioinformatics, I was a bit taken aback by the heavy use of corny names for tools and such (e.g. Bowtie, Tophat, Cufflinks). However, I am now convinced that one should rather embrace such quirkiness, as it helps people remember the tools.
  • The ontology that is most directly focused on genomic track data is the "Sequence types and features ontology" (https://www.ebi.ac.uk/ols/ontologies/so), so I suggest fetching code names from that
  • We should not limit ourselves much beyond that, in order to be able to find good names!

Hence, after a bit of browsing through the Sequence Ontology, behold the first release codename:

FAIRtracks v2.0.0 "Assembly" (http://purl.obolibrary.org/obo/SO_0001248)

  • Assembly, in the most direct interpretation, points to the fact that this is the first assembly of all the thoughts and concepts behind the draft standard
  • A genome assembly is also central to the very definition of a genomic track; it is the coordinate system that everything is based upon
  • Genome assembly information is often missing from track files, a problem which my colleagues have also written a paper about: "Genome build information is an essential part of genomic track files" (https://doi.org/10.1186/s13059-017-1312-1). In a sense, this is the core metadata issue for tracks, causing a bunch of problems over the years, not least due to the UCSC vs rest of the world split on genome assembly and chromosome naming.
  • A good solution for naming genome assemblies has still not been accepted by the community (ref these issues in the Galaxy community: galaxyproject/idc#7 and galaxyproject/idc#8). Btw: our treatment of this issue is available here: #16 (comment). I have been thinking of pushing these ideas also towards the Galaxy community (this comment representing the first laid-back attempt in that direction).
  • Adding the term ID is of course a tongue-in-cheek way of pushing a FAIR principle (specifically "I2: (Meta)data use vocabularies that follow the FAIR principles": https://www.go-fair.org/fair-principles/i2-metadata-use-vocabularies-follow-fair-principles/)