DBA

DNA barcoding analysis pipeline designed to perform phylogenetic analysis on raw sanger sequencing data.

Requirements
Installation
Example commands
Full usage
Method
Interpeting output
Acknowledgements
License

Requirements and dependencies

DBA is designed to run on linux-based systems. The pipeline was specifcally tested on Ubuntu 22.04.2 LTS. Some tools and most dependencies used during analysis will be installed automatically when the conda environment is being constructed.

The following prerequisites should be installed prior to using DBA

Anaconda3
MEGA11
Python 3.8+
Java
Jalview
- Note: Jalview should be a system-wide executable as: jalview
MEGA11

Installation

Install DBA:

git clone https://github.com/mdcjansen/DBA
cd path/to/DBA
conda env create -f environment.yml
chmod a+x DBA.py

Allow DBA to be executed from anywhere within the anaconda environment:

ln -s /path/to/DBA.py /path/to/anaconda3/envs/DBA/bin/DBA

Example commands

Activate conda environment for analysis:

conda activate DBA

Display options:

This command will display all required and optional arguments in the format presented below

DBA
DBA -h "or" --help

Minimum required input:

DBA -i <input folder> -n <genbank reference NC_ID> -y <genbank reference YC_ID> -g <genbank outgroup NC_ID>

Example input:

The input below was used to produce the case data available within this repository:

DBA -i case_data -n NC_009065 -y YP_001054869 -g NC_02842

Full usage

usage: DBA -i <inputfolder> -n <genbank reference NC_ID> -y <genbank reference YP_ID> -g <genbank outgroup NC_ID> [options]

DBA is designed to automate the process of DNA barcoding by utilising standard Sanger sequencing data and user provided reference gene and protein accession numbers.
This pipeline obtains nucleic and protein sequences from genbank, followed by chromatogram production in .pdf format by utilising sangerseq viewer.
Next, BLASTx and BLASTn are run to assess query coverage against the reference gene.
MUSCLE is used to align the sequences, followed by manual review in jalview and phylogenetic analysis by MEGA.

List of arguments:
  -h, --help            show this help message and exit
  -v, --version         Prints program version
  -i [input folder]     Input folder containing per species folders, a single concatenated fasta file for analysis and a MEGA mao file for phylogenetic analysis
  -n [genbank reference nucl acc no.]
                        genbank nucleotide accession number for the gene to be used as reference, usually starts with the identifier NC_
  -y [genbank reference prot acc no.]
                        genbank protein accession number for the protein to be used as reference, usually starts with the identifier YC_
  -g [genbank nucl acc no. for out group]
                        genbank nucleotide accession number for outgroup gene used during phylogenetic analysis, usually starts with the identifier NC_
  -o [output folder]    Output directory. 
                        A default output folder will be produced in the following format will be created if it hasn't been specified:
                        barcoding_output_current-date_current-time
  -t [cpu threads]      Maximum amount of threads to be utilised during analysis.
                        Default: 20
  -keep []              Keep all files produced during analysis.
                        Files are stored in 'workdir' folder at the location where this scripthas been executed.
                        Default: False
  -rev []               Reverse compliment the fasta input.
                        Reverse complimented file is saved as an additional file.
                        Default: False

Method

DBA requires a specific folder structure in order to perform the analysis. A visualisation of this folder structure is given below where case_data is given as -i argument:

	.
	├── ...
	├── case_data               
	│	├── sanger.fasta				# FASTA file containing all sanger sequences to be used during analysis
	│	├── phylo.mao					# MEGA11 generated file used to perform phylogenetic analysis
	│	├── species_01					# Folder with sequencing data of first species to be analysed
	│	│	├── sample_01				# Folder containing sanger sequencing data of the first sample
	│	│	│	├── sanger_data.ab1
	│	│	│	└── ...
	│	│	└── ...
	│	├── species_02					# Folder with sequencing data of second species to be analysed
	│	│	├── sample_01				# Folder containing sanger sequencing data of the first sample
	│	│	│	├── sanger_data.ab1
	│	│	│	└── ...
	│	│	└── ...
	│	└── ...
	└── ...

DBA will perform phylogenetic analysis as presented below:

If specified, DBA will reverse complement the input fasta file
DBA will obtain genbank information on input genbank accession numbers
sangerseq_viewer is utilised to produce a chromatogram for each of the input species found
BLAST is performed on the input fasta sequence against the input reference sequences
Multple sequence alignment is performed with MUSCLE
DBA initates Jalview for manual review and trimming of alignment data
MEGA11 is used to perform phylogentic analysis and produce a consensus tree
If specified, supplemental data used during analysis is moved to the output folder and the analysis ends.

Interpeting output

DBA produces a single output folder where all results can be found. All procedures during analysis are logged and stored in the barcoding.log file

BLAST output

BLAST results are placed into two tsv files named blastn.tsv and blastx.tsv for BLASTn and BLASTx results respectively. Standard BLAST tabular output format 6 is used to generate the output. An example with case data and manually added header is presented below:

qseqid	sseqid	pident	length	mismatch	gapopen	qstart	qend	sstart	send	bitscore
RC_EF71414350_EF71414350	NC_009065.1	98.214	672	8	4	13	680	5513	6184	1171
RC_EF71414351_EF71414351	NC_009065.1	99.519	624	3	0	5	628	5507	6130	1136
RC_EF71414352_EF71414352	NC_009065.1	97.765	671	9	6	5	670	5507	6176	1151
RC_EF71414353_EF71414353	NC_009065.1	98.336	661	6	5	10	670	5513	6168	1155
RC_EF71414354_EF71414354	NC_009065.1	99.522	628	3	0	6	633	5507	6134	1144

Sangerseq_viewer output

Sangerseq_viewer produces a chromatogram in pdf format for each species folder present in the input folder. These chromatograms display all sequences found inside a particular species folder and present mapped genbank genes. An example produced from case data is presented below:

MUSCLE output

MUSCLE produces multiple alignments files, all presented with the .fa suffix. The alignment with the highest column confidence is extracted from the diversified ensemble and opened in jalview to be reviewed by the user. Additionally, an html file will be produced for the diversified ensemble containing the letter confidence values of the ensemble.

MEGA output

MEGA generates phylogenetic trees in newick format using user specified parameters specified within the mao file. Newick trees and consensus tree are both present in the output folder. Summary and partition text files are also saved within the output.

Acknowledgements

DBA uses the following tools in the pipeline:

License

MIT license

mdcjansen / DBA

DBA

Table of contents