dwheelerau / dalmations

Scanning for variants in NGS data

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

dalmations

Introduction

Multi-locus sequence typing (MLST) is a highly discriminating Candida albicans strain typing method. It is usually applied to one colony per patient sample. However, multiple strains can coexist in the same site in a patient. We therefore developed 100+1 NGS-MLST (dalmations), a next generation sequencing (NGS) modification of the existing C. albicans MLST method. It analyzes DNA extracted from a pool of 100 colonies from a sample plus DNA from one colony and bioinformatically infers the genotypes present and their frequency. It does so at a sequencing cost, per patient sample, four times lower than that of conventional MLST. For the directly typed single colonies its discriminating power is 0.998, comparable to that of conventional MLST. Its predictions of the ratio of different strains in a sample were fairly accurate - within 14±16% of the ratio between the numbers of colonies from two known strains combined to generate DNA pools for testing the method’s accuracy.

Details of our proof of principle experiment using 100+1 NGS-MLST can be found in the our recent publication XXXX.

To cite dalmations: XXXX et al:DIO:XXXXX

Requirements

This software has been tested on Ubuntu 14.04 and 16.06. This scripts are written in pure python and only require python 2.7 installed on the machine. The software may run on Mac and windows as long as python 2.7 is installed, but this has not been tested.

Installation

Create a base directory and clone the repo into.

git clone https://github.com/dwheelerau/dalmations
cd dalmations

setup the directories

mkdir final_genotypes/
mkdir final_sequences/
mkdir final_results/
mkdir genotype_data/
mkdir samples/

File and folder/structure required

The following file/folder structure is required to run dalmations. These should exist if you followed the installion instructions shown above:

BASE_DIR--samples--sample1_data
                 --sample2_data
                 --sample...
        --final_genotpyes 
        --final_sequences 
        --genotype_data  
        --reference_mlst--mlst.fa
        --reate_sequences.py  
        --extract_seq_from_sheet.py 
        --run_aligner.py  
        --demultplex.py  
        --genotyper_iter.py 

Running dalmations (gui version)

The GUI is under current development (see github branch). Stay tuned.....

Running dalmations (non-gui version)

  1. Either use the included demultplex.py script to demultplex your samples into the samples directory (in this case the BASE DIRECTORY is called dalmations), with the child directories named after the sample. For example, a sample called 1161NK_S75, which was sequenced using the 7 MLST primer combinations in paired end mode (ft = R1 and rt = R2), would have the following folder/file structure.
dalmations/samples/1161NK_S75/AAT1apft.fastq  
dalmations/samples/1161NK_S75/AAT1aprt.fastq  
dalmations/samples/1161NK_S75/SYA1pft.fastq  
dalmations/samples/1161NK_S75/SYA1prt.fastq  
dalmations/samples/1161NK_S75/ACC1pft.fastq  
dalmations/samples/1161NK_S75/ACC1prt.fastq   
dalmations/samples/1161NK_S75/VPS13pft.fastq  
dalmations/samples/1161NK_S75/VPS13prt.fastq   
dalmations/samples/1161NK_S75/ADP1pft.fastq  
dalmations/samples/1161NK_S75/ADP1prt.fastq   
dalmations/samples/1161NK_S75/ZWF1bpft.fastq  
dalmations/samples/1161NK_S75/ZWF1bprt.fastq  
dalmations/samples/1161NK_S75/MPIpft.fastq    
dalmations/samples/1161NK_S75/MPIprt.fastq

If sequences need demultiplexing then run the demultiplex.py using the following command:

usage:  
    python2 demultplex.py 12-0039_S10_L001_R2_001.fastq \
        forward_primers.txt reverse_primers.txt | tee -a log.txt

The forward_primers.txt and reverse_primers.txt files are included in this repo.

  1. Run the run_aligner.py script.
usage:
    python2 run_aligner.py samples/

If the above script works correctly each of the sample folders should now contain files ending in .aln, .csv, .data. The final table of genotype frequencies should be found in the final_results directory in a file called final_table_python.csv.

  1. Run the genotyper_iter.py script.

In the example below, SINGLE_COLONY_NAME and MIX_COLONY_NAME would correspond to sample folders found in the sample directory; they should also be found in the first column of the final_table_python.csv.

usage:
    python2 genotyper_iter.py <SINGLE_COLONY_NAME> <MIX_COLONY_NAME>  

example:
    python2 genotyper_iter.py FJ9-S_S16 P1-50-50_S35    

For convenience, if you have multiple pairs that you would like to process, place the pairs into a tab-separated text file, as follows:

SINGLE_COLONY_NAME1     MIX_COLONY_NAME1
SINGLE_COLONY_NAME2     MIX_COLONY_NAME2
SINGLE_COLONY_NAME3     MIX_COLONY_NAME3

Then process these automatically using the included shell script run_genotyper.sh using the following command:

./run_genotyper.sh pair.txt

Where pair.txt repressents the tab-separated file that you saved the pairs.

  1. Run the create_sequences.py script to generate the derived MLST sequence for each sample.
usage:
    python2 create_sequences.py

The final sequence file is saved in final_sequences/sequences.fa.

Important output files

  • final_sequences/sequences.fa which contains the dervied single and mix colony concatinated MLST sequences. These sequences can be compared to previous results using alignments or via phylogenetic trees.

  • final_results/final_table_python.csv the allele calls for each sample

Test data

The test_samples directory contains test data that can be used to test this package.

  1. Uncompress test data to the samples directory
tar xzf test_samples/hp11vw-S_S20.tar.gz -C samples/  
tar xzf test_samples/P1-5050_S83.tar.gz -C samples/  
  1. Follow the instructions above from step 2.

The results of running these samples are provided in the final_results and final_sequences directories found in test_samples.

License

This software is released under an MIT open source license. Please see LICENSE.txt for details.

About

Scanning for variants in NGS data

License:Other


Languages

Language:Python 99.6%Language:Shell 0.4%