biostream / bioschemas

ga4gh, gdc and bmeg in one place

bioschemas

Common data structures and APIs.

This repo contains

  • git submodules from ga4gh, gdc and bmeg
  • A utility that reads the schemas and produces alternate output formats (jsonschema and cerberus)

packaging

The schemas are packaged into a Python module, bioschemas. The justification for the packaging is threefold:

  • It moves the complexity of git submodule management from the end user to the package release process.
  • Each of the referenced submodules has many components other than the schemas themselves. Packaging lets us trim everything except the schema source.
  • The generated snapshot is checked into git. This lets us tag the package explicitly and lets clients install it without dealing with submodules.

pip install git+https://github.com/ohsu-computational-biology/bioschemas

package release

cd bin
./package-all.sh
 ... generates schema snapshot ...
 ... runs setup tests ...
----------------------------------------------------------------------
Ran 4 tests in 0.100s

OK

usage

$ bioschemas-snapshot --help
usage: bioschemas-snapshot [-h] [-o OUTPUT] [-v]

Extract bioschemas schema directory [ga4gh,bmeg,gdc]

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Extract to this directory name. Must not already
                        exist; it will be created as well as missing parent
                        directories.
  -v, --version         Print git hashes

The snapshot can be used by any language context and has the following structure:

.
├── cerberus
│   ├── bmeg
│   ├── ga4gh
│   │   ├── ga4gh
│   │   └── google
│   │       ├── api
│   │       └── protobuf
│   └── gdc
├── jsonschema
│   ├── bmeg
│   ├── ga4gh
│   │   ├── ga4gh
│   │   └── google
│   │       ├── api
│   │       └── protobuf
│   └── gdc
└── proto
    ├── bmeg
    └── ga4gh
        ├── ga4gh
        └── google
            └── api
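
Because the snapshot is a plain directory tree, any language can discover its contents by walking it. The following is a minimal Python 3 sketch of that idea; `list_snapshot` is a hypothetical helper, not part of bioschemas, and it makes no assumption about the file names inside each directory.

```python
import os


def list_snapshot(root):
    """Map each top-level format directory (e.g. cerberus, jsonschema, proto)
    to the sorted relative paths of the files found beneath it."""
    listing = {}
    for fmt in sorted(os.listdir(root)):
        fmt_dir = os.path.join(root, fmt)
        if not os.path.isdir(fmt_dir):
            continue  # ignore stray files at the top level
        files = []
        for dirpath, _dirnames, filenames in os.walk(fmt_dir):
            for name in filenames:
                full = os.path.join(dirpath, name)
                # Use forward slashes so the result is the same on every OS.
                files.append(os.path.relpath(full, fmt_dir).replace(os.sep, "/"))
        listing[fmt] = sorted(files)
    return listing
```

With the package installed, the directory returned by bioschemas.schema_path() or the one written by bioschemas-snapshot -o could be passed as root.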

python usage

import bioschemas

bioschemas.schema_path()
>> '/home/someuser/bioschemas/bioschemas/snapshot'

bioschemas.json_schema('Resource')
>> {u'properties': {u'checksum': {u'type': u'string'}, u'class': {u'type': u'string'}, u'created': {u'type': u'string'}, u'datasetID': {u'type': u'string'}, u'description': {u'type': u'string'}, u'format': {u'type': u'string'}, u'gid': {u'type': u'string'}, u'id': {u'type': u'string'}, u'info': {u'type': u'object'}, u'location': {u'type': u'string'}, u'mimeType': {u'type': u'string'}, u'name': {u'type': u'string'}, u'size': {u'type': u'integer'}, u'type': {u'type': u'string'}}, u'type': u'object'}  

bioschemas.cerberus_schema('Resource')
>> {u'checksum': {u'type': u'string'}, u'class': {u'type': u'string'}, u'created': {u'type': u'string'}, u'datasetID': {u'type': u'string'}, u'description': {u'type': u'string'}, u'format': {u'type': u'string'}, u'gid': {u'type': u'string'}, u'id': {u'type': u'string'}, u'info': {u'type': {u'type': u'dict'}}, u'location': {u'type': u'string'}, u'mimeType': {u'type': u'string'}, u'name': {u'type': u'string'}, u'size': {u'type': u'integer'}, u'type': {u'type': u'string'}}

bioschemas.git_hashes()
>> {u'bioschemas': u'f40f653', u'bmeg': u'537f94a', u'created_at': u'2016-11-18T17:47:56.858397Z', u'gdc': u'288f042'}

bioschemas.gdc_submission_template('file')
>> {u'aliquots': {u'submitter_id': None}, u'analytes': {u'submitter_id': None}, u'archives': {u'submitter_id': None}, u'cases': {u'submitter_id': None}, u'centers': {u'code': None}, u'data_formats': {u'name': None}, u'data_subtypes': {u'name': None}, u'derived_files': {u'submitter_id': None}, u'described_cases': {u'submitter_id': None}, u'experimental_strategies': {u'name': None}, u'file_name': None, u'file_size': None, u'md5sum': None, u'platforms': {u'name': None}, u'portions': {u'submitter_id': None}, u'project_id': None, u'related_files': {u'submitter_id': None}, u'samples': {u'submitter_id': None}, u'slides': {u'submitter_id': None}, u'state_comment': None, u'submitter_id': None, u'tags': {u'name': None}, u'type': u'file'}
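
The dictionaries returned above are ordinary data, so they can drive simple checks directly. The following Python 3 sketch checks a record's field types against a schema's `properties` map. `RESOURCE_SCHEMA` is a subset copied from the json_schema('Resource') output above, while `JSON_TYPES` and `type_errors` are hypothetical names introduced for illustration. It only covers flat type checks; a full validator such as the jsonschema or Cerberus package would normally be used instead.

```python
# Subset of the Resource schema, copied from the json_schema('Resource') output.
RESOURCE_SCHEMA = {
    "type": "object",
    "properties": {
        "id": {"type": "string"},
        "name": {"type": "string"},
        "size": {"type": "integer"},
        "info": {"type": "object"},
    },
}

# JSON Schema type name -> Python types accepted for it (illustrative mapping).
JSON_TYPES = {
    "string": (str,),
    "integer": (int,),
    "number": (int, float),
    "boolean": (bool,),
    "object": (dict,),
    "array": (list,),
}


def type_errors(schema, record):
    """Return the sorted names of fields in record whose value does not
    match the type declared for that field in schema['properties']."""
    bad = []
    properties = schema.get("properties", {})
    for field, value in record.items():
        expected = properties.get(field, {}).get("type")
        allowed = JSON_TYPES.get(expected)
        if allowed is None:
            continue  # unknown field or unsupported type: not checked here
        # bool is a subclass of int in Python, so reject it for non-boolean fields.
        if isinstance(value, bool) and expected != "boolean":
            bad.append(field)
        elif not isinstance(value, allowed):
            bad.append(field)
    return sorted(bad)
```

For example, type_errors(RESOURCE_SCHEMA, {"id": "r1", "size": "12"}) reports the size field, because the schema declares it as an integer.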

utility

The canonical ga4gh and bmeg schemas are maintained in protobuf. bin/custom-plugin.py processes the schemas for alternate uses (jsonschema, cerberus). The bioschemas/snapshot directory contains output from protoc. Please do not edit it by hand; instead, change custom-plugin.py or json-to-cerberus.py.
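
To give a feel for the kind of translation involved, here is a minimal Python 3 sketch of turning a JSON Schema object into Cerberus-style rules. It is not the project's json-to-cerberus.py, and its output may differ in detail from the checked-in snapshot. `TYPE_MAP` and `to_cerberus` are hypothetical names. The sketch relies only on the fact that Cerberus calls its mapping and sequence types `dict` and `list`, where JSON Schema says `object` and `array`, and that nested mappings are described under a `schema` rule.

```python
# JSON Schema type names that Cerberus spells differently.
TYPE_MAP = {"object": "dict", "array": "list"}


def to_cerberus(json_schema):
    """Convert the 'properties' of a JSON Schema object into a
    Cerberus-style rules dictionary (types and nested objects only)."""
    rules = {}
    for field, spec in json_schema.get("properties", {}).items():
        rule = {}
        json_type = spec.get("type")
        if json_type is not None:
            rule["type"] = TYPE_MAP.get(json_type, json_type)
        # Nested objects become a nested Cerberus schema.
        if json_type == "object" and "properties" in spec:
            rule["schema"] = to_cerberus(spec)
        rules[field] = rule
    return rules
```

Other JSON Schema keywords (required, enum, items and so on) are deliberately ignored here.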

License: MIT License

