legumeinfo / datastore-specifications

Specifications for directory naming, file naming, file contents in the LIS datastore

Any of the file-containing directories can contain a README file and a CHANGES file.

README YAML files

Every file-containing directory, AKA "collection", in the LIS datastore should contain a README file in YAML format.

Filename: README.[collection].yml
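
As a minimal sketch of the naming rule above, assuming the collection name is simply the base name of the containing directory, the expected README filename could be derived as follows. The helper name expected_readme_name and the example path are hypothetical, used only for illustration.

```python
from pathlib import PurePosixPath


def expected_readme_name(collection_dir):
    """Return README.[collection].yml for a collection directory path.

    Assumption: the collection name is the last path component.
    """
    collection = PurePosixPath(collection_dir).name
    return f"README.{collection}.yml"


# Hypothetical directory path, for illustration only.
print(expected_readme_name("/genomes/Zh13.gnm2.LV9P"))
```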

Examples:

Validation

The basic README structure (acceptable field names, strings vs. lists vs. dates) can be validated using the following command:

ajv -s readme.schema.json -d README.[collection].yml --all-errors --coerce-types=array --remove-additional=all --changes

using the JSON schema definition readme.schema.json.

This schema must be kept up to date along with the sample template README.collection.yml when any changes are made to the README spec.

Content requirements

READMEs must be YAML-compliant, which means they pass the test on http://www.yamllint.com/ or using the yamllint command-line utility. Here are some, but not all, requirements for a valid LIS README:

  • identifier at the top repeats the name of the collection, i.e. the name of the containing directory.
  • synopsis should be short, 100 characters or less.
  • genotype is a YAML array, but for bi-parental crosses use a single "strain1 x strain2" value.
  • publication_doi (and any other DOI) is a DOI, not a URL (e.g. 10.1534/g3.118.200521).
  • Dates are in the format 2020-03-23.
  • Use spaces, not tabs (tabs may not appear anywhere in a YAML file).
  • Enclose values in quotes when they contain a colon or quotes (use single or double quotes, whichever differs from the quotes in the content).
  • Do not include empty keys - leave them out entirely. All keys must have values.
  • publication_doi is REQUIRED. If the data were generated by LIS, use the default LIS publication:
publication_doi: 10.1093/nar/gkv1159
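
Some of the requirements above can be spot-checked with a few lines of Python. The following is only a rough sketch, not part of the LIS tooling: it scans top-level "key: value" lines with regular expressions instead of using a real YAML parser, so it does not replace yamllint or the schema validation. The function name check_readme, the assumption that every DOI key ends in "_doi", and the date_keys parameter (the spec does not name the date fields here) are all illustrative choices.

```python
import re

DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")      # bare DOI, e.g. 10.1534/g3.118.200521
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")   # e.g. 2020-03-23
KEY_RE = re.compile(r"^([A-Za-z_][A-Za-z0-9_]*):(.*)$")


def check_readme(text, collection, date_keys=()):
    """Return a list of problems found by simple line-based checks."""
    problems = []
    seen = set()
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if "\t" in line:
            problems.append(f"line {i + 1}: contains a tab")
        match = KEY_RE.match(line)
        if not match:
            continue  # list item, comment, "---", or nested content
        key = match.group(1)
        value = match.group(2).strip().strip("'\"")
        seen.add(key)
        if not value:
            # A key with no inline value is fine only if a block follows it.
            following = lines[i + 1] if i + 1 < len(lines) else ""
            if not following.startswith((" ", "- ")):
                problems.append(f"{key}: empty key, leave it out instead")
            continue
        if key == "identifier" and value != collection:
            problems.append(f"identifier: {value!r} does not match {collection!r}")
        if key == "synopsis" and len(value) > 100:
            problems.append("synopsis: longer than 100 characters")
        # Assumption: all DOI-valued keys end in "_doi".
        if key.endswith("_doi") and not DOI_RE.match(value):
            problems.append(f"{key}: {value!r} is not a bare DOI")
        if key in date_keys and not DATE_RE.match(value):
            problems.append(f"{key}: {value!r} is not in YYYY-MM-DD format")
    if "publication_doi" not in seen:
        problems.append("publication_doi: required key is missing")
    return problems


sample = """---
identifier: Zh13.gnm2.LV9P
synopsis: Example synopsis
publication_doi: https://doi.org/10.1093/nar/gkv1159
"""
for problem in check_readme(sample, "Zh13.gnm2.LV9P"):
    print(problem)
```

In this sample, the only reported problem is the DOI given as a URL. Trailing comments and multi-line values are not handled by the sketch.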

Gotchas

  • READMEs may share content. For example, the README for a genome assembly (under /genomes/) often cites the same publication as the README for its annotations (under /annotations/). The publication details must match exactly in both. Otherwise, the mine loader will fail with an error such as "Conflicting values for field Publication.title between Zh13.gnm2.LV9P (value "Update soybean Zhonghuang 13 genome to a golden reference. Sci China" in database with ID 99000176) and Zh13.gnm2.ann1.FJ3G.cds (value "Update soybean Zhonghuang 13 genome to a golden reference" being stored)."
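
A mismatch of this kind can be caught before loading. The sketch below assumes the READMEs have already been parsed into Python dictionaries, and that the publication title lives under a key named publication_title (an assumption; only publication_doi is named in this document). The function name and the placeholder DOI are hypothetical.

```python
from collections import defaultdict


def find_publication_conflicts(readmes, fields=("publication_title",)):
    """Report fields that differ between READMEs sharing a publication_doi.

    readmes: mapping of collection name -> dict of README keys and values.
    Returns a list of (doi, field, {value: [collections]}) tuples.
    """
    by_doi = defaultdict(list)
    for collection, data in readmes.items():
        doi = data.get("publication_doi")
        if doi:
            by_doi[doi].append(collection)

    conflicts = []
    for doi, collections in by_doi.items():
        for field in fields:
            values = defaultdict(list)
            for collection in collections:
                values[readmes[collection].get(field)].append(collection)
            if len(values) > 1:  # more than one distinct value: mismatch
                conflicts.append((doi, field, dict(values)))
    return conflicts


# Titles taken from the error message above; the DOI is a placeholder.
readmes = {
    "Zh13.gnm2.LV9P": {
        "publication_doi": "10.0000/placeholder",
        "publication_title": "Update soybean Zhonghuang 13 genome to a golden reference. Sci China",
    },
    "Zh13.gnm2.ann1.FJ3G": {
        "publication_doi": "10.0000/placeholder",
        "publication_title": "Update soybean Zhonghuang 13 genome to a golden reference",
    },
}
for doi, field, values in find_publication_conflicts(readmes):
    print(doi, field, values)
```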

MANIFEST files

A directory may contain a MANIFEST.collection.correspondence.yml file which lists the current filenames and prior filenames:

---
# filename in this repository: previous names
glyma.Wm82.gnm2.DTC4.genome_hardmasked.fna.gz: Gmax_275_v2.0.hardmasked.fa.gz
glyma.Wm82.gnm2.DTC4.genome_softmasked.fna.gz: Gmax_275_v2.0.softmasked.fa.gz

... and also a MANIFEST.collection.descriptions.yml file which briefly describes the files:

---
# filename in this repository: description
glyma.Wm82.gnm2.DTC4.genome_hardmasked.fna.gz: "Genome assembly: masked with 'N's"
glyma.Wm82.gnm2.DTC4.genome_softmasked.fna.gz: "Genome assembly: masked with lowercase"
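
If the two MANIFEST files are meant to cover the same set of files (an assumption; the spec does not state this explicitly), a quick cross-check is possible. The sketch below reads only flat "filename: value" lines and is not a YAML parser; the function names are hypothetical.

```python
def manifest_keys(text):
    """Collect the filenames (keys) from a flat 'filename: value' manifest."""
    keys = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or line == "---":
            continue  # skip blanks, comments, and the document marker
        name, sep, _ = line.partition(": ")
        if sep:
            keys.append(name)
    return keys


def manifest_mismatches(correspondence_text, descriptions_text):
    """Return (only_in_correspondence, only_in_descriptions), both sorted."""
    a = set(manifest_keys(correspondence_text))
    b = set(manifest_keys(descriptions_text))
    return sorted(a - b), sorted(b - a)


correspondence = """---
# filename in this repository: previous names
glyma.Wm82.gnm2.DTC4.genome_hardmasked.fna.gz: Gmax_275_v2.0.hardmasked.fa.gz
glyma.Wm82.gnm2.DTC4.genome_softmasked.fna.gz: Gmax_275_v2.0.softmasked.fa.gz
"""
# Illustrative descriptions file that omits the softmasked entry.
descriptions = """---
# filename in this repository: description
glyma.Wm82.gnm2.DTC4.genome_hardmasked.fna.gz: "Genome assembly: masked with 'N's"
"""
print(manifest_mismatches(correspondence, descriptions))
```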

CHANGES files

A directory may contain a CHANGES.collection.txt file which lists file transformations and changes. For example:

file transformations:

seqlen.awk vigan.Gyeongwon.a3.v1.cds.fa | perl -pe 's/(\w+\.\w+)\.(\d+) (\d+)/$1\t$2\t$3/' | sort -k1,1 -k3nr,3nr | top_line.awk | awk '{print ">" $1 "." $2}' | sort > tmp.longest

fasta_to_zero_lines.awk vigan.Gyeongwon.a3.v1.cds.fa | sort > tmp.fa.1ln

join tmp.longest tmp.fa.1ln | perl -pe 's/ zqz /\n/' > vigan.Gyeongwon.gnm3.ann1.3Nz5c.cds_primaryTranscript.fna

seqlen.awk vigan.Gyeongwon.a3.v1.peptide.fa | perl -pe 's/(\w+\.\w+)\.(\d+) (\d+)/$1\t$2\t$3/' | sort -k1,1 -k3nr,3nr | top_line.awk | awk '{print ">" $1 "." $2}' | sort > tmp.longest

fasta_to_zero_lines.awk vigan.Gyeongwon.a3.v1.peptide.fa | sort > tmp.fa.1ln

join tmp.longest tmp.fa.1ln | perl -pe 's/ zqz /\n/' > vigan.Gyeongwon.gnm3.ann1.3Nz5f.protein_primaryTranscript.faa

changes: 

2018-03-03 Added MANIFEST files
2018-09-15 Changed fastas to include full prefixing (s/vigan/vigan.Gyeongwon.gnm3.ann1/)
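
The file transformations in the example above appear to keep, for each gene, the longest transcript (the helper scripts seqlen.awk and top_line.awk are not shown here, so this reading is an assumption). The same idea can be sketched in Python, assuming sequence IDs of the form gene.N where N is the transcript number. The function name and the sample IDs are hypothetical.

```python
def longest_per_gene(records):
    """Keep the longest sequence for each gene.

    records: dict of sequence ID -> sequence, with IDs of the form 'gene.N'.
    On ties, the first record encountered is kept (the shell pipeline may
    break ties differently).
    """
    best = {}
    for seq_id, seq in records.items():
        gene, _, _ = seq_id.rpartition(".")  # strip the transcript number
        current = best.get(gene)
        if current is None or len(seq) > len(current[1]):
            best[gene] = (seq_id, seq)
    # Return the winners as an ID -> sequence mapping, sorted by ID.
    return dict(sorted(best.values()))


# Hypothetical IDs and sequences, for illustration only.
records = {
    "sp.gene1.1": "ATGAAA",
    "sp.gene1.2": "ATGAAACCC",
    "sp.gene2.1": "ATGC",
}
for seq_id, seq in longest_per_gene(records).items():
    print(f">{seq_id}\n{seq}")
```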
