sb43 / cgp_seq_input_val

Sequence data and manifest validation code.

cgp_seq_input_val

This package contains tools to validate manifests and the various types of sequence data files commonly used for NGS data.

Design

Many components of this system are heavily driven by configuration files. This is to allow new validation code to be added and incorporated without modifying the driver code.
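As an illustration of this design, the sketch below keeps the checks in a registry and lets a config decide which ones apply to which field. All names here (`CHECKS`, `register`, `run_checks`, the `fields`/`check`/`arg` keys and the example column names) are hypothetical and are not taken from the package's actual config schema.

```python
# Hypothetical registry: check names (as they might appear in a config file)
# mapped to the functions that implement them.
CHECKS = {}


def register(name):
    """Decorator that adds a check function to the registry under `name`."""
    def decorator(func):
        CHECKS[name] = func
        return func
    return decorator


@register("not_empty")
def not_empty(value, _arg=None):
    return value.strip() != ""


@register("one_of")
def one_of(value, allowed):
    return value in allowed


def run_checks(config, row):
    """Return a list of (field, check_name) pairs that failed for one row."""
    failures = []
    for field, rules in config["fields"].items():
        for rule in rules:
            check = CHECKS[rule["check"]]
            if not check(row.get(field, ""), rule.get("arg")):
                failures.append((field, rule["check"]))
    return failures


# Illustrative config; in practice this would be loaded with json.load().
example_config = {
    "fields": {
        "Sample": [{"check": "not_empty"}],
        "Type": [{"check": "one_of", "arg": ["WGS", "WXS"]}],
    }
}
```

With this shape, adding a new check means registering one more function and naming it in a config file; `run_checks` itself does not change.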

Tools

cgpSeqInputVal has multiple sub commands, listed with cgpSeqInputVal --help.

cgpSeqInputVal man-norm

Takes input in one of several formats and converts it to tsv. If the input is already tsv, the file is simply copied to the output location (to simplify usage in workflows). Valid input types include:

  • xls - Excel workbook pre-2007
  • xlsx - Office Open XML workbook (Excel 2007+)
  • csv - Comma separated values
  • tsv - Tab separated values

Absolutely no validation is carried out here.
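A minimal sketch of the normalisation step for the two text formats, using only the standard library. The function name `normalise_manifest` is hypothetical, and Excel input is deliberately left out here (it would need a reader such as xlrd); this is not the package's own implementation.

```python
import csv
import shutil
from pathlib import Path


def normalise_manifest(src, dest):
    """Write a tsv copy of `src` to `dest` without validating the content."""
    suffix = Path(src).suffix.lower()
    if suffix == ".tsv":
        # Already tsv: copy unchanged so workflows can always call this step.
        shutil.copyfile(src, dest)
    elif suffix == ".csv":
        with open(src, newline="") as fin, open(dest, "w", newline="") as fout:
            writer = csv.writer(fout, delimiter="\t", lineterminator="\n")
            writer.writerows(csv.reader(fin))
    else:
        # xls/xlsx are not handled in this sketch.
        raise ValueError("unsupported input type: " + suffix)
```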

cgpSeqInputVal man-valid

Takes the tsv representation of a manifest and performs validation of the structure and data values. The checks applied are managed by the cgp_seq_input_val/config/*.json files. Each class+version of manifest will have a config file where different requirements and allowed values are defined.

The output is a lightly modified version of the input, adding:

  • Our Ref - A UUID to identify this dataset

And a json version of the file ready for use by downstream systems.
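The two outputs could be produced along these lines. This is only a sketch: the function name `annotate`, the `rows` key and the choice of a single version 4 UUID per manifest are assumptions, not the package's documented output layout.

```python
import csv
import io
import json
import uuid


def annotate(tsv_text):
    """Parse tsv text and attach a generated 'Our Ref' UUID (assumed layout)."""
    rows = [dict(r) for r in csv.DictReader(io.StringIO(tsv_text), delimiter="\t")]
    return {"Our Ref": str(uuid.uuid4()), "rows": rows}


manifest = annotate("Sample\tType\nS1\tWGS\n")
# JSON form for downstream systems.
manifest_json = json.dumps(manifest, indent=2)
```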

cgpSeqInputVal seq-valid

Takes an interleaved FASTQ file or a pair of paired FASTQ files and produces a simple report such as:

{
    "interleaved": false,
    "pairs": 722079,
    "valid_q": true
}

Various exceptions can occur for malformed files.

The primary purpose is to confirm Sanger/Illumina 1.8+ quality scores.
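The paired-file case can be sketched as below. The helper names are hypothetical, and treating `valid_q` as "every quality character lies between '!' (ASCII 33) and 'J' (ASCII 74)" is an assumption about what the tool checks, not a description of its actual logic.

```python
from itertools import islice, zip_longest


def read_fastq(handle):
    """Yield (name, seq, qual) from a 4-line-per-record FASTQ stream."""
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if not record:
            return
        if (len(record) != 4 or not record[0].startswith("@")
                or not record[2].startswith("+")):
            raise ValueError("malformed FASTQ record")
        if len(record[1]) != len(record[3]):
            raise ValueError("sequence and quality lengths differ")
        yield record[0][1:], record[1], record[3]


def qual_in_range(qual, low=33, high=74):
    """True if every quality character falls in the assumed Phred+33 range."""
    return all(low <= ord(c) <= high for c in qual)


def report_pairs(handle_1, handle_2):
    """Count read pairs across two files and check their quality characters."""
    pairs = 0
    valid_q = True
    for rec_1, rec_2 in zip_longest(read_fastq(handle_1), read_fastq(handle_2)):
        if rec_1 is None or rec_2 is None:
            raise ValueError("files contain different numbers of reads")
        pairs += 1
        valid_q = valid_q and qual_in_range(rec_1[2]) and qual_in_range(rec_2[2])
    return {"interleaved": False, "pairs": pairs, "valid_q": valid_q}
```

A pure range check cannot reject a Phred+64 file whose qualities all happen to fall at or below 'J', so a real check would likely need an additional heuristic.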

FASTQ not BAM/CRAM

The service's data flow requires splitting any multi-lane BAM/CRAM files into individual lanes, which we would write out as interleaved FASTQ. There is currently no need to parse BAM/CRAM files to check the quality encoding directly, as the spec technically disallows other encodings. It is still possible for BAM files to be incorrectly encoded, though.

INSTALL

Installation is via easy_install. Simply execute with the path to the compiled 'egg':

easy_install bundles/cgp_seq_input_val-0.1.0-py3.6.egg

Package Dependencies

easy_install will install the relevant dependencies, listed here for convenience:

Development environment

This project uses git pre-commit hooks. As these will execute on your system it is entirely up to you if you activate them.

If you want tests, coverage reports and lint-ing to automatically execute before a commit you can activate them by running:

git config core.hooksPath git-hooks

Only a test failure will block a commit; lint-ing is not enforced (but please consider following the guidance).

You can run the same checks manually without a commit by executing the following in the base of the clone:

./run_tests.sh

Development Dependencies

Setup VirtualEnv

cd $PROJECTROOT
hash virtualenv || pip3 install virtualenv
virtualenv -p python3 env
source env/bin/activate
pip install progressbar2
pip install xlrd
python setup.py develop # so bin scripts can find module

For testing/coverage (./run_tests.sh)

source env/bin/activate # if not already in env
pip install pytest
pip install pytest-cov
pip install pep8
pip install radon

Also see Package Dependencies

Cutting a release

Make sure the version is incremented in ./setup.py

The release is handled by wheel:

$ source env/bin/activate # if not already
$ python setup.py bdist_wheel -d dist
# this creates a wheel archive which can be copied to a deployment location, e.g.
$ scp cgp_seq_input_val-1.1.0-py3-none-any.whl user@host:~/wheels
# on host
$ pip install --find-links=~/wheels cgp_seq_input_val
