A Django database for storing genetic variants found within a genetics laboratory.
VariantDatabase allows the following:
- Basic sample tracking capabilities. Organise Projects, Runs and Samples.
- Store variants that have been discovered.
- Parse and store run QC data from Illumina InterOp files.
- Parse and store sample QC data from SamStats.
- Search for previously seen variants.
- View variants that have been found within a specific sample.
- Visualise variant annotation data.
- Integrate IGV.js to allow VCF and BAM viewing.
- Store the evidence and comments that Clinical Scientists make when analysing variants.
CentOS7
Python 2.7.11
Pip
virtualenv
Django==1.10.5
django-auditlog==0.4.3
gunicorn==19.7.1
interop==1.0.25
numpy==1.12.1
pysam==0.10.0
To serve VCF and BAM files through IGV.js, a web server capable of HTTP range requests is required. A typical deployment uses Nginx for this, paired with Gunicorn to handle dynamic requests.
To annotate VCFs, VEP is required (tested with API and cache version 90):
http://www.ensembl.org/info/docs/tools/vep/index.html
Within your Python virtualenv, type:
git clone https://github.com/WMRGL/VariantDatabase.git
pip install -r requirements.txt
python manage.py migrate
python manage.py makemigrations VariantDatabase
python manage.py migrate
python manage.py createsuperuser
- follow the prompts to create a superuser.
python manage.py loaddata db_setup.json
python manage.py test
python manage.py runserver
Go to http://127.0.0.1:8000/ in your web browser to see the welcome page.
The main utility for uploading data into the database is the master_upload management command.
For help using this command, type:
python manage.py master_upload -h
For example, to upload all data for a worksheet (SampleSheet, Variants, Run QC, Sample QC, Gene Coverage and Exon Coverage), enter the following:
python manage.py master_upload --worksheet_dir /home/cuser/Documents/Project/DatabaseData/worksheet_dir/ --output_dir /home/cuser/Documents/Project/DatabaseData/MPN_213837/ --sample_sheet --run_qc --sample_qc --coverage --variants
It is important that the directories specified by the -w/--worksheet_dir and -o/--output_dir options are structured correctly.
--worksheet_dir: The path to the worksheet directory.
This is the Illumina directory containing the file SampleSheet.csv. It should be structured as shown below. Only the files needed for the VariantDatabase to function correctly are shown. Folder and file names are case sensitive.
worksheet_dir
│ SampleSheet.csv
│ RunParameters.xml
│ RunInfo.xml
│ RunCompletionStatus.xml
│ CompletedJobInfo.xml
│ GenerateFASTQRunStatistics.xml
│
└───InterOp
│ │ ControlMetricsOut.bin
│ │ CorrectedIntMetricsOut.bin
│ │ ExtractionMetricsOut.bin
│ │ IndexMetricsOut.bin
│ │ QMetricsOut.bin
│ │ TileMetricsOut.bin
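The layout above can be checked before running an upload. The following is a minimal sketch, not part of VariantDatabase: the helper name `missing_worksheet_files` and the two constant lists are assumptions made for illustration, and the file names are copied from the tree above. It compares names against directory listings so that the match stays case sensitive, and it is written to run on both Python 2.7 and Python 3.

```python
import os

# File names copied from the worksheet_dir tree above (case sensitive).
REQUIRED_TOP_LEVEL = [
    "SampleSheet.csv",
    "RunParameters.xml",
    "RunInfo.xml",
    "RunCompletionStatus.xml",
    "CompletedJobInfo.xml",
    "GenerateFASTQRunStatistics.xml",
]
REQUIRED_INTEROP = [
    "ControlMetricsOut.bin",
    "CorrectedIntMetricsOut.bin",
    "ExtractionMetricsOut.bin",
    "IndexMetricsOut.bin",
    "QMetricsOut.bin",
    "TileMetricsOut.bin",
]


def _names(path):
    """Return the exact entry names in a directory, or an empty set."""
    return set(os.listdir(path)) if os.path.isdir(path) else set()


def missing_worksheet_files(worksheet_dir):
    """Hypothetical helper: list expected entries absent from worksheet_dir.

    Paths are returned relative to worksheet_dir, in the order of the tree.
    """
    top = _names(worksheet_dir)
    interop = _names(os.path.join(worksheet_dir, "InterOp"))
    missing = [name for name in REQUIRED_TOP_LEVEL if name not in top]
    missing += [
        os.path.join("InterOp", name)
        for name in REQUIRED_INTEROP
        if name not in interop
    ]
    return missing
```

An empty list means every file shown in the tree is present. Any other result names the entries to add before calling master_upload.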
--output_dir: The path to the pipeline output directory.
The directory should be structured as shown below. Only the files needed for the VariantDatabase to function correctly are shown. Folder and file names are case sensitive.
* = wildcard
sample_name = The unique sample name specified in the SampleSheet.csv file
output_dir
│
└───alignments*
│ │ sample_name.bwa.drm.realn.sorted.bam
│ │ sample_name.bwa.drm.realn.sorted.bam.bai
│ │ ...
│
└───archive*
│ │
│ └───*QC_stats.zip
│ │
│ └───*QC_stats
│ │ sample_name.bwa.drm.realn.sorted.bam.stats
│ │ sample_name.bwa.drm.realn.sorted.bam.stats-acgt-cycles.png
│ │ sample_name.bwa.drm.realn.sorted.bam.stats-coverage.png
│ │ sample_name.bwa.drm.realn.sorted.bam.stats-gc-content.png
│ │ sample_name.bwa.drm.realn.sorted.bam.stats-gc-depth.png
│ │ sample_name.bwa.drm.realn.sorted.bam.stats-indel-cycles.png
│ │ sample_name.bwa.drm.realn.sorted.bam.stats-indel-dist.png
│ │ sample_name.bwa.drm.realn.sorted.bam.stats-insert-size.png
│ │ sample_name.bwa.drm.realn.sorted.bam.stats-quals.png
│ │ sample_name.bwa.drm.realn.sorted.bam.stats-quals2.png
│ │ sample_name.bwa.drm.realn.sorted.bam.stats-quals3.png
│ │ sample_name.bwa.drm.realn.sorted.bam.stats-quals-hm.png
│ │ ...
│
└───reanalysis_data*
│ │ sample_name.exon-count-data.tsv.gz
│ │ sample_name.gene-count-data.tsv.gz
│ │ ...
│
└───vcfs*
│ │ sample_name*.vcf.gz
│ │ sample_name*.vcf.gz.tbi
│ │ ...
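The wildcards in the tree above map naturally onto glob patterns. The sketch below is an illustration only and is not how VariantDatabase itself locates files: `SAMPLE_PATTERNS` and `find_sample_files` are hypothetical names, and the nesting of the QC_stats folder inside archive* is an assumption based on one reading of the tree. It runs on both Python 2.7 and Python 3.

```python
import glob
import os

# Glob patterns read off the output_dir tree above; "{s}" is the sample name.
# The archive*/*QC_stats nesting is an assumed reading of that tree.
SAMPLE_PATTERNS = {
    "bam": os.path.join("alignments*", "{s}.bwa.drm.realn.sorted.bam"),
    "bam_index": os.path.join("alignments*", "{s}.bwa.drm.realn.sorted.bam.bai"),
    "sam_stats": os.path.join(
        "archive*", "*QC_stats", "{s}.bwa.drm.realn.sorted.bam.stats"
    ),
    "exon_coverage": os.path.join("reanalysis_data*", "{s}.exon-count-data.tsv.gz"),
    "gene_coverage": os.path.join("reanalysis_data*", "{s}.gene-count-data.tsv.gz"),
    "vcf": os.path.join("vcfs*", "{s}*.vcf.gz"),
    "vcf_index": os.path.join("vcfs*", "{s}*.vcf.gz.tbi"),
}


def find_sample_files(output_dir, sample_name):
    """Hypothetical helper: map each file kind to the sorted paths found."""
    found = {}
    for kind, pattern in SAMPLE_PATTERNS.items():
        full_pattern = os.path.join(output_dir, pattern.format(s=sample_name))
        found[kind] = sorted(glob.glob(full_pattern))
    return found
```

A kind that maps to an empty list is missing for that sample. Because the VCF pattern ends the sample name with a wildcard, a sample name that is a prefix of another sample name would match both, so the results are worth inspecting when names overlap.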
--sample_sheet: Add this option to import the information within the SampleSheet.csv file into the database. This will import a new worksheet and create sample objects as specified in the SampleSheet.
--run_qc: Add this option to import the run QC information into the database. This is the information contained within the InterOp files.
--sample_qc: Add this option to import the sample QC information into the database. This is the information created by the SamStats program.
--coverage: Add this option to import the Gene and Exon coverage information into the database.
--variants: Add this option to import the variant information contained within the VEP annotated VCF files.
--single_sample: Use this option to upload the data for a single sample. Example below:
python manage.py master_upload --worksheet_dir /home/cuser/Documents/Project/DatabaseData/worksheet_dir/ --output_dir /home/cuser/Documents/Project/DatabaseData/MPN_213837/ --sample_qc --coverage --variants --single_sample 213837-2-D17-26177-HP_S2
Note that the --sample_sheet and --run_qc options are not available when using the --single_sample option.
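When uploads are scripted, the option rules above can be enforced before the command is launched. The following is a sketch under stated assumptions: `build_master_upload_args` is a hypothetical wrapper, not part of VariantDatabase, and it uses only the long option names shown in this document. It builds the argument list and does not run anything itself.

```python
def build_master_upload_args(worksheet_dir, output_dir, sample_sheet=False,
                             run_qc=False, sample_qc=False, coverage=False,
                             variants=False, single_sample=None):
    """Hypothetical helper: build the argv list for master_upload.

    Raises ValueError for the combination this document says is unavailable:
    --sample_sheet or --run_qc together with --single_sample.
    """
    if single_sample is not None and (sample_sheet or run_qc):
        raise ValueError(
            "--sample_sheet and --run_qc are not available with --single_sample"
        )
    args = [
        "python", "manage.py", "master_upload",
        "--worksheet_dir", worksheet_dir,
        "--output_dir", output_dir,
    ]
    # Append each boolean flag only when it was requested.
    flags = [
        ("--sample_sheet", sample_sheet),
        ("--run_qc", run_qc),
        ("--sample_qc", sample_qc),
        ("--coverage", coverage),
        ("--variants", variants),
    ]
    args += [flag for flag, enabled in flags if enabled]
    if single_sample is not None:
        args += ["--single_sample", single_sample]
    return args
```

The returned list can be passed to `subprocess.check_call` from the directory containing manage.py. Passing a list rather than a single string avoids shell quoting problems with paths and sample names.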
For the VCF files to be correctly parsed by the VariantDatabase parser (parsers/vcf_parser.py), they must be annotated by VEP.
Once VEP is installed, annotate your VCFs with the following command:
vep -i input_vcf -o output.vcf --cache --fork 4 --refseq --vcf --flag_pick --exclude_predicted --everything --dont_skip --total_length --offline --fasta fasta_location
Other VCF annotations that are required include: INFO/Caller, FORMAT/AD, INFO/TCF, INFO/TCR and INFO/VAFS
The annotated VCFs can then be compressed with bgzip and indexed with tabix in preparation for database import:
bgzip file_name
tabix file_name.gz
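Before import, it can be useful to confirm that a VCF header declares the required annotations listed above (INFO/Caller, FORMAT/AD, INFO/TCF, INFO/TCR and INFO/VAFS). The sketch below is an assumption-laden illustration, not the check performed by parsers/vcf_parser.py: `missing_vcf_fields` is a hypothetical name, and it only inspects header declarations, so it does not prove that every record carries the fields.

```python
import re

# Required annotations as (header section, ID), taken from the list above.
REQUIRED_FIELDS = [
    ("INFO", "Caller"),
    ("FORMAT", "AD"),
    ("INFO", "TCF"),
    ("INFO", "TCR"),
    ("INFO", "VAFS"),
]

# Matches header lines such as: ##INFO=<ID=Caller,Number=1,...>
_HEADER_ID = re.compile(r"^##(INFO|FORMAT)=<ID=([^,>]+)")


def missing_vcf_fields(header_lines):
    """Hypothetical helper: return required fields not declared in the header.

    header_lines is any iterable of text lines from an uncompressed VCF.
    """
    declared = set()
    for line in header_lines:
        if not line.startswith("##"):
            break  # the "#CHROM" column line ends the meta-information block
        match = _HEADER_ID.match(line)
        if match:
            declared.add((match.group(1), match.group(2)))
    return [field for field in REQUIRED_FIELDS if field not in declared]
```

An empty list means all five fields are declared. The function stops reading at the first line that does not start with `##`, so it can be given an open file handle without scanning the variant records.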
Coming Soon - Wiki
Coming Soon - Guide to Nginx/Gunicorn Setup
Coming Soon - Setting basic security settings
- Joseph Halstead