DepthSizer v1.9.0
For a better rendering and navigation of this document, please download and open ./docs/depthsizer.docs.html,
or visit https://slimsuite.github.io/depthsizer/.
Documentation can also be generated by running DepthSizer with the dochtml=T option. (R and pandoc must be
installed - see below.)
DepthSizer is an updated version of the genome size estimate methods of Diploidocus. DepthSizer needs a genome
assembly (fasta format, seqin=FILE), a set of long read (ONT, PacBio or HiFi) data for the assembly
(reads=FILELIST and readtype=LIST) (or readbp=INT), and a BUSCO/BUSCOMP full table of results (busco=TSVFILE).
DepthSizer works on the principle that Complete BUSCO genes should represent predominantly single copy (diploid
read depth) regions along with some poor quality and/or repeat regions. Assembly artefacts and collapsed repeats etc.
are predicted to deviate from diploid read depth in an inconsistent manner. Therefore, even if less than half the
region is actually diploid coverage, the modal read depth is expected to represent the actual single copy
read depth.
DepthSizer uses samtools mpileup (or samtools depth if quickdepth=T) to calculate the per-base read depth.
This is converted into an estimated single copy read depth using a smoothed density plot of BUSCO single copy genes.
Genome size is then estimated based on a crude calculation using the total combined sequencing length.
This will be calculated from reads=FILELIST unless provided with readbp=INT.
BUSCO single-copy genes are parsed from a BUSCO full results table, given by busco=TSVFILE (default
full_table_$BASEFILE.busco.tsv). This can be replaced with any table matching the BUSCO fields:
['BuscoID','Status','Contig','Start','End','Score','Length']. Entries are reduced to those with Status = Complete
and the Contig, Start and End fields are used to define the regions that should be predominantly single copy.
Output from BUSCOMP is also compatible with DepthSizer. DepthSizer has been tested with outputs from BUSCO v3 and v5.
NOTE: The basic DepthSizer approach assumes that the raw long read data has a 1:1 correspondence to the
genomic DNA being sequenced, i.e. there is no contamination (including plastids) and no bias towards insertion
or deletion read errors. As a consequence, the unadjusted genome size prediction is expected to be an over-estimate.
DepthSizer will also calculate an estimated lower bound, based on only those reads that map to the assembly (unless
covbases=F). An adjustment for read error profiles is made by calculating the ratio of read:genomic data for
mapped reads from the BAM CIGAR strings ((insertions+matches)/(deletions+matches)), which is reported as the
IndelRatio adjustment. The older MapAdjust method, which uses mapped reads and mapped bases (calculated from
samtools coverage and samtools fasta) to try to correct for read mapping and imbalanced insertion:deletion ratios,
can also be switched on with mapadjust=T (or benchmark=T). Benchmarking of the different adjustments is ongoing.
Read volumes can also be manually adjusted with readbp=INT. All calculated sizes will be reported in the
*.gensize.tdt output, but the adjustment method selected by adjustmode=X (None/CovBases/IndelRatio/MapAdjust,
default IndelRatio) will be used for "the" genome size prediction.
Version 1.1. The core depth calculation shifted in Version 1.1. Legacy mode will use the old code to
calculate the modal read depth for each BUSCO gene along with the overall modal read depth for all gene
regions. These are not recommended.
Version 1.8. Version 1.8 introduced a new reduced=T/F mode, which only processes sequences that have BUSCO
predictions (Complete, Duplicated or Fragmented). This is on (True) by default, and substantially reduces
the disk footprint and processing time for highly fragmented genomes. If the BUSCO Completeness is low, using the
fragmented=T option (introduced in version 1.7, default False) will use Fragmented BUSCO genes as well as
Complete genes to establish the single-copy read depth.
DepthSizer has been published as part of the Waratah genome paper:
Chen SH, Rossetto M, van der Merwe M, Lu-Irving P, Yap JS, Sauquet H, Bourke G, Amos TG, Bragg JG & Edwards RJ (2022). Chromosome-level de novo genome assembly of Telopea speciosissima (New South Wales waratah) using long-reads, linked-reads and Hi-C. Molecular Ecology Resources doi: 10.1111/1755-0998.13574
Please contact the author if you have trouble getting the full text version, or read the bioRxiv preprint version:
Chromosome-level de novo genome assembly of Telopea speciosissima (New South Wales waratah) using long-reads, linked-reads and Hi-C. bioRxiv 2021.06.02.444084; doi: 10.1101/2021.06.02.444084.
DepthSizer is written in Python 2.x and can be run directly from the commandline:
python $CODEPATH/depthsizer.py [OPTIONS]
If running as part of SLiMSuite, $CODEPATH will be the SLiMSuite tools/ directory. If running from the standalone
DepthSizer git repo, $CODEPATH will be the path to the code/ directory. Please see details in the DepthSizer git
repo for running on example data.
Unless bam=FILE is given, minimap2 must be installed and either added to the environment $PATH or given to
DepthSizer with the minimap2=PROG setting, and samtools needs to be installed. Unless legacy=T depdensity=F,
R will also need to be installed.
To generate documentation with dochtml, R will need to be installed and a pandoc environment variable must be set, e.g.
export RSTUDIO_PANDOC=/Applications/RStudio.app/Contents/MacOS/pandoc
For DepthSizer documentation, run with dochtml=T and read the *.docs.html file generated.
A list of commandline options can be generated at run-time using the -h or help flags. Please see the general
SLiMSuite documentation for details of how to use commandline options, including setting default values with
INI files.
### ~ Main DepthSizer run options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
seqin=FILE : Input sequence assembly [None]
basefile=FILE : Root of output file names [gapspanner or $SEQIN basefile]
summarise=T/F : Whether to generate and output summary statistics sequence data before and after processing [True]
bam=FILE : BAM file of long reads mapped onto assembly [$BASEFILE.bam]
bamcsi=T/F : Use CSI indexing for BAM files, not BAI (needed for v long scaffolds) [False]
reads=FILELIST : List of fasta/fastq files containing reads. Wildcard allowed. Can be gzipped. []
readtype=LIST : List of ont/pb/hifi file types matching reads for minimap2 mapping [ont]
dochtml=T/F : Generate HTML DepthSizer documentation (*.docs.html) instead of main run [False]
tmpdir=PATH : Path for temporary output files during forking (not all modes) [./tmpdir/]
### ~ Genome size prediction options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
busco=TSVFILE : BUSCO full table [full_table_$BASEFILE.busco.tsv]
readbp=INT : Total combined read length for depth calculations (over-rides reads=FILELIST) []
adjustmode=X : Map adjustment method to apply (None/CovBases/IndelRatio/MapBases/MapAdjust/MapRatio/OldAdjust/OldCovBases) [IndelRatio]
quickdepth=T/F : Whether to use samtools depth in place of mpileup (quicker but underestimates?) [False]
depchunk=INT : Chunk input into minimum of INT bp chunks for temp depth calculation [1e6]
deponly=T/F : Cease execution following checking/creating BAM and fastdep/fastmp files [False]
depfile=FILE : Precomputed depth file (*.fastdep or *.fastmp) to use [None]
covbases=T/F : Whether to calculate predicted minimum genome size based on mapped reads only [True]
mapadjust=T/F : Whether to calculate mapadjust predicted genome size based on read length:mapping ratio [False]
benchmark=T/F : Activate benchmarking mode and also output the assembly size and mean depth estimate [False]
legacy=T/F : Whether to perform Legacy v1.0.0 (Diploidocus) calculations [False]
depdensity=T/F : Whether to use the BUSCO depth density profile in place of modal depth in legacy mode [True]
depadjust=INT : Advanced R density bandwidth adjustment parameter [12]
seqstats=T/F : Whether to output CN and depth data for full sequences as well as BUSCO genes [False]
reduced=T/F : Only generate/use fastmp for BUSCO-containing sequences (*.busco.fastmp) [True]
fragmented=T/F : Whether to use Fragmented as well as Complete BUSCO genes for SC Depth estimates [False]
### ~ Forking options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
forks=X : Number of parallel sequences to process at once [0]
killforks=X : Number of seconds of no activity before killing all remaining forks. [36000]
killmain=T/F : Whether to kill main thread rather than individual forks when killforks reached. [False]
logfork=T/F : Whether to log forking in main log [False]
### ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
The main inputs for DepthSizer genome size prediction are:
seqin=FILE : Input sequence assembly to tidy [Required].
reads=FILELIST : List of fasta/fastq files containing long reads. Wildcard allowed. Can be gzipped.
readtype=LIST : List of ont/pb/hifi file types matching reads for minimap2 mapping [ont]
busco=TSVFILE : BUSCO full table [full_table_$BASEFILE.busco.tsv] used for calculating single copy ("diploid") read depth.
The first step is to generate a BAM file by mapping reads on to seqin using minimap2.
A pre-generated BAM file can be given instead using bam=FILE. There should be no secondary mapping of reads, as
these will inflate read depths, so filter these out if they were allowed during mapping. Similarly, the BAM file
should not contain unmapped reads. (These should be filtered during processing if present.) If no BAM file setting
is given, the BAM file will be named $BASEFILE.bam, where $BASEFILE is set by basefile=X.
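The two filtering requirements above correspond to standard SAM flag bits: 0x4 marks an unmapped read and 0x100 marks a secondary alignment. The following is a minimal sketch of that check; the function name keep_alignment and the constant names are hypothetical and are not part of DepthSizer.

```python
# SAM flag bits as defined in the SAM format specification.
FLAG_UNMAPPED = 0x4
FLAG_SECONDARY = 0x100

# Combined mask of records to exclude (decimal 260).
EXCLUDE_MASK = FLAG_UNMAPPED | FLAG_SECONDARY


def keep_alignment(flag):
    """Return True if a record is mapped and is not a secondary alignment."""
    return not (flag & EXCLUDE_MASK)
```

Records with neither bit set, including reverse-strand (16) and supplementary (2048) alignments, pass this check. Whether supplementary alignments should also be removed is not stated here, so they are left in as an assumption.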
DepthSizer works on the principle that Complete BUSCO genes should represent predominantly single copy (diploid
read depth) regions along with some poor-quality and/or repeat regions. Assembly artefacts and collapsed repeats etc.
are predicted to deviate from diploid read depth in an inconsistent manner. Therefore, even if less than half the
region is actually diploid coverage, the modal read depth is expected to represent the actual single copy
read depth. This is estimated using a smoothed density distribution calculated using R density().
BUSCO single-copy genes are parsed from a BUSCO full results table, given by busco=TSVFILE (default
full_table_$BASEFILE.busco.tsv). This can be replaced with any table with the fields:
['BuscoID','Status','Contig','Start','End','Score','Length']. Entries are reduced to those with Status = Complete
and the Contig, Start and End fields are used to define the regions that should be predominantly single copy.
BUSCOMP v0.10.0 and above will generate a *.complete.tsv file that can be used in place of BUSCO results. This can
enable rapid re-annotation of BUSCO genes following, for example, vector trimming with Diploidocus. If fragmented=T
then entries with Status = Fragmented are also used. This is useful when Completeness is low.
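The region selection described above can be sketched in a few lines of Python. This is an illustrative sketch only, not DepthSizer's own parser: it assumes a tab-delimited table with the first five fields in the order BuscoID, Status, Contig, Start, End, and that comment lines start with '#'. The function name busco_regions and the example rows are hypothetical.

```python
import csv
import io


def busco_regions(handle, fragmented=False):
    """Yield (contig, start, end) for Complete (optionally Fragmented) rows."""
    wanted = {"Complete", "Fragmented"} if fragmented else {"Complete"}
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines, '#' comments and rows without coordinates (e.g. Missing).
        if not row or row[0].startswith("#") or len(row) < 5:
            continue
        busco_id, status, contig, start, end = row[:5]
        if status in wanted:
            yield contig, int(start), int(end)


# Made-up table contents, for illustration only.
example = io.StringIO(
    "# BuscoID\tStatus\tContig\tStart\tEnd\tScore\tLength\n"
    "1A\tComplete\tctg1\t100\t900\t512.3\t800\n"
    "2B\tFragmented\tctg2\t50\t300\t88.0\t250\n"
    "3C\tMissing\n"
    "4D\tDuplicated\tctg3\t10\t500\t400.1\t490\n"
)
complete_regions = list(busco_regions(example))
```

Passing fragmented=True mirrors the fragmented=T option by also keeping rows with Status = Fragmented.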
DepthSizer uses samtools mpileup (or samtools depth if quickdepth=T) to calculate the per-base read depth
and extracts the smoothed modal read depth for all single-copy regions (Complete BUSCO genes) using the density()
function of R. To avoid a minority of extremely deep-coverage bases disrupting the density profile, the depth
range is first limited to the range from zero to 1000, or four times the pure modal read depth if over 1000. If
the pure mode is zero coverage, zero is returned. The number of bins for the density function is set to be
greater than 5 times the max depth for the calculation.
By default, the density bandwidth smoothing parameter is set to adjust=12. This can be modified with
depadjust=INT. The raw and smoothed profiles are output to *.plots/*.raw.png and *.plots/*.scdepth.png
to check smoothing if required. Additional checking plots are also output (see Outputs below).
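The idea of taking the mode of a smoothed depth profile can be illustrated with a small pure-Python Gaussian kernel smoother. This is a rough sketch and not a reimplementation of R's density(): the bandwidth follows the common 0.9 * min(sd, IQR/1.34) * n^(-1/5) rule of thumb scaled by adjust, the quartiles are crude, and the upper depth limit is read here as max(1000, 4 * pure mode), which is one interpretation of the text above. The function name smoothed_modal_depth and its parameters are hypothetical.

```python
import math
from collections import Counter


def smoothed_modal_depth(depths, adjust=12.0, points_per_unit=5):
    """Estimate the modal depth from a Gaussian-smoothed depth profile."""
    counts = Counter(depths)
    # Pure mode: the most frequent depth (ties go to the lower depth).
    pure_mode = min(counts, key=lambda d: (-counts[d], d))
    if pure_mode == 0:
        return 0.0
    # Limit the range so a few very deep positions do not distort the profile.
    limit = max(1000, 4 * pure_mode)
    kept = sorted(d for d in depths if d <= limit)
    n = len(kept)
    mean = sum(kept) / float(n)
    sd = math.sqrt(sum((d - mean) ** 2 for d in kept) / max(n - 1, 1))
    iqr = kept[(3 * n) // 4] - kept[n // 4]  # crude quartiles
    spread = min(sd, iqr / 1.34) if iqr > 0 else sd
    # Rule-of-thumb bandwidth scaled by adjust; fall back to 1.0 for constant data.
    bandwidth = adjust * 0.9 * (spread or 1.0) * n ** -0.2
    kept_counts = Counter(kept)
    best_x, best_density = 0.0, -1.0
    # Evaluate the smoothed profile on a grid of points_per_unit points per depth unit.
    for i in range(kept[-1] * points_per_unit + 1):
        x = i / float(points_per_unit)
        density = sum(c * math.exp(-0.5 * ((x - d) / bandwidth) ** 2)
                      for d, c in kept_counts.items())
        if density > best_density:
            best_x, best_density = x, density
    return best_x


# Made-up depths centred on 30X, plus one extreme value that is excluded by the limit.
example_depth = smoothed_modal_depth([28, 29, 29, 30, 30, 30, 30, 31, 31, 32, 5000])
```

The grid uses five evaluation points per unit of depth, echoing the "greater than 5 times the max depth" bin count, although the exact grid used by DepthSizer is not specified here.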
The full set of per-position depths is output to $BAM.fastmp (or $BAM.fastdep if quickdepth=T). The
single-copy read depth is also output to $BAM.fastmp.scdepth. If reduced=T (the default) then the fastmp or
fastdep file will have a $BAM.busco.* prefix and only include the sequences in the BUSCO table. By default,
generation of the fastdep/fastmp data is performed by chunking up the assembly and creating temporary files in
parallel (tmpdir=PATH). Sequences are batched in order such that each batch meets the minimum size criterion set by
depchunk=INT (default 1Mbp). If depchunk=0 then each sequence will be processed individually. This is not
recommended for large, highly fragmented genomes. Unless dev=T or debug=T, the temporary files will be
deleted once the final file is made. If DepthSizer crashes during the generation of the file, it should be
possible to re-run and it will re-use existing temporary files.
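The batching rule described above (sequences taken in order, each batch closed once it reaches the minimum size) can be sketched as follows. This is an illustration under assumptions rather than DepthSizer's code: the function name batch_sequences is hypothetical, and the handling of a final batch that falls short of the minimum is a guess.

```python
def batch_sequences(seq_lengths, depchunk=1000000):
    """Group (name, length) pairs, in order, into batches of at least depchunk bp."""
    batches, current, current_bp = [], [], 0
    for name, length in seq_lengths:
        current.append(name)
        current_bp += length
        # With depchunk=0 this closes a batch after every sequence.
        if current_bp >= depchunk:
            batches.append(current)
            current, current_bp = [], 0
    if current:
        batches.append(current)  # the final batch may fall short of depchunk
    return batches


# Made-up sequence lengths, for illustration only.
example_batches = batch_sequences(
    [("ctg1", 600000), ("ctg2", 500000), ("ctg3", 2000000), ("ctg4", 1000)]
)
```

With many short sequences, a larger minimum means fewer batches and fewer temporary files, which is consistent with the warning against depchunk=0 for highly fragmented genomes.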
NOTE: To generate output that is compatible with DepthKopy, run with reduced=F.
The basic DepthSizer approach (adjustmode=None) assumes that the raw long read data has a 1:1 correspondence to
the genomic DNA being sequenced, i.e. there is no contamination (including plastids) and no bias towards insertion
or deletion read errors. As a consequence, the unadjusted genome size prediction is expected to be an over-estimate.
DepthSizer will also calculate several adjusted estimation values that aim to provide the range in which the true
genome size is expected to lie. Note that benchmarking and refinement of these adjustments is ongoing and will be
expanded in future releases. For most use cases, it is anticipated that CovBases and None will provide the
lower and upper bounds, whilst IndelRatio and MapAdjust should fall in between and be more accurate. (See
notes below for each method.)
Currently, there are four adjustmode settings that can be output, in addition to two benchmarking calculations:
None : The purest DepthSizer mode makes no adjustment to total sequencing depth. (See Step 5.)
CovBases : This uses samtools coverage to calculate the total number of mapped bases as covered bases multiplied by the mean depth: samtools coverage $BAM | grep -v coverage | awk '{sum += ($7 * $5)} END {print sum}'. Very big differences between CovBases and None may indicate a very incomplete assembly and/or an excess of contamination in the raw sequencing data. If BUSCO scores etc. indicate good completeness, it is advisable to carefully check the reads=FILELIST data provided to DepthSizer.
IndelRatio : This mode extracts the CIGAR strings from the BAM file and sums up the insertions (nI), deletions (nD) and mapped bases (nX+nM+n=). The insertion:deletion ratio is then calculated as: (I+X+M+=)/(D+X+M+=). The goal here is to estimate whether the raw sequencing data is biased towards insertion or deletion errors. Insertion bias will inflate the apparent volume of sequencing. The total read volume is therefore adjusted by dividing by the indelratio. An insertion bias will decrease the estimated genome size, whereas a deletion bias will increase the prediction. To reduce issues caused by poor-quality regions of the assembly, mapped regions are reduced to Complete BUSCO genes (in a BED file, $BED): samtools view -h -F 4 $BAM -L $BED | grep -v '^@' | awk '{print $6;}' | uniq. The combined CIGAR counts are saved in $BAM.indelratio.txt to accelerate re-calculation.
MapAdjust : An earlier attempt to model insertion:deletion ratios, this combines the CovBases calculation of total assembly base coverage with samtools fasta to calculate the total number of bases in the mapped reads: samtools view -hb -F 4 $BAM | samtools fasta - | grep -v '^>' | wc | awk '{ $4 = $3 - $2 } 1' | awk '{print $4}'. MapAdjust calculates the general loss of raw sequencing during mapping as CovBases/MapBases. This aims to estimate the proportion of the raw sequencing data that contributed to the single copy read depth and is used as a multiplier for the total read volume. Extreme mapadjust ratios should be treated with caution and may indicate problems with the assembly and/or source data.
Assembly : In benchmark=T mode, the observed assembly size is output.
MeanX : In benchmark=T mode, the mean coverage is calculated as CovBases/AssemblySize and used in place of scdepth for the genome size estimation using the full sequencing volume.
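The IndelRatio arithmetic described in the list above can be illustrated with a short Python sketch that tallies CIGAR operations and applies the (I+X+M+=)/(D+X+M+=) formula. It is a sketch under assumptions and not DepthSizer's implementation: the function names cigar_counts and indel_ratio are hypothetical and the CIGAR strings are made up.

```python
import re
from collections import Counter

# One CIGAR operation: a length followed by an operation code from the SAM specification.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def cigar_counts(cigars):
    """Sum the lengths of each CIGAR operation over a set of CIGAR strings."""
    totals = Counter()
    for cigar in cigars:
        for length, op in CIGAR_OP.findall(cigar):
            totals[op] += int(length)
    return totals


def indel_ratio(totals):
    """Return (I+X+M+=)/(D+X+M+=) from summed CIGAR counts."""
    matched = totals["M"] + totals["X"] + totals["="]
    return (totals["I"] + matched) / float(totals["D"] + matched)


# Made-up CIGAR strings: more inserted than deleted bases gives a ratio above 1.
totals = cigar_counts(["90M10I", "95M5D", "10S100M"])
ratio = indel_ratio(totals)
```

A ratio above 1 indicates an insertion bias, so dividing the read volume by it lowers the genome size estimate, as described for IndelRatio.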
By default, DepthSizer will estimate genome sizes using IndelRatio and CovBases in addition to None. For
speed, CovBases can be switched off with covbases=F and IndelRatio by setting adjustmode=None. The old
MapAdjust calculation is not made by default, but can be switched on with mapadjust=T, adjustmode=MapAdjust,
or benchmark=T. Setting benchmark=T will output all six estimates.
NOTE: v1.5.0 expands the options to None/CovBases/IndelRatio/MapBases/MapAdjust/MapRatio/OldAdjust/OldCovBases. Details to follow.
Genome size is estimated using the total combined sequencing length. This will be calculated from reads=FILELIST
unless provided with readbp=INT. The number of bases for each input $READFILE is saved as
$READFILE.basecount.txt and will be reloaded for future runs unless force=T.
The final genome size is predicted based on the total (adjusted) combined sequencing length and the single-copy
read depth, as: EstGenomeSize = ReadBP / SCDepth. Size predictions will be output to *.gensize.tdt and
as #GSIZE entries in the log file.
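As a worked example of the EstGenomeSize = ReadBP / SCDepth calculation, the sketch below uses made-up numbers (30 Gbp of reads, a single-copy depth of 60X and an indel ratio of 1.02). The function name estimate_genome_size is hypothetical and none of these values come from a real run.

```python
def estimate_genome_size(readbp, scdepth):
    """Genome size (bp) as total read volume divided by single-copy depth."""
    return readbp / float(scdepth)


read_volume = 30e9   # hypothetical total read length (bp)
sc_depth = 60.0      # hypothetical single-copy read depth
indelratio = 1.02    # hypothetical insertion-biased indel ratio

# Unadjusted estimate (adjustmode=None): 30e9 / 60 = 500 Mbp.
raw_size = estimate_genome_size(read_volume, sc_depth)

# IndelRatio adjustment: the read volume is divided by the ratio first.
indel_size = estimate_genome_size(read_volume / indelratio, sc_depth)
```

With an insertion bias (ratio above 1), the adjusted estimate comes out smaller than the unadjusted one.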
The main DepthSizer outputs are:
*.gensize.tdt = Main genome size prediction table.
*.log = DepthSizer log file with key steps and details of any errors or warnings generated.
*.plots/ = Directory of PNG plots (see below)
The primary output is the *.gensize.tdt table, which has the following fields:
SeqFile = Assembly file used for genome size prediction.
DepMethod = Depth estimation method used (mpileup or depth)
Adjust = Read mapping adjustment (see Step 4, above)
ReadBP = Total read volume (see Step 5, above)
MapAdjust = The relevant adjustment ratio.
SCDepth = The single copy read depth used in the prediction.
EstGenomeSize = Genome size prediction (bp).
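A table with these fields can be loaded with Python's csv module. The sketch below assumes the *.gensize.tdt file is tab-delimited with a header row of the field names listed above, which is an assumption; the function name load_gensize and the example rows are hypothetical.

```python
import csv
import io


def load_gensize(handle):
    """Map each Adjust method to its EstGenomeSize from a gensize table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["Adjust"]: float(row["EstGenomeSize"]) for row in reader}


# Hypothetical table contents, for illustration only.
example_table = io.StringIO(
    "SeqFile\tDepMethod\tAdjust\tReadBP\tMapAdjust\tSCDepth\tEstGenomeSize\n"
    "asm.fasta\tmpileup\tNone\t30000000000\t1.0\t60.0\t500000000\n"
    "asm.fasta\tmpileup\tIndelRatio\t30000000000\t1.02\t60.0\t490196078\n"
)
sizes = load_gensize(example_table)
```

For a real run, the same function could be given an open file handle instead of the in-memory example.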
The first time SCDepth is calculated (if not provided with scdepth=NUM), DepthSizer will also generate a
number of plots for additional QC of results, which are output in $BASE.plots/ as PNG files.
First, the raw and smoothed read depth profiles will be output to:
$BASE.plots/$BASE.raw.png = raw depth profile
$BASE.plots/$BASE.scdepth.png = smoothed depth profile with SC depth marked
In addition, violin plots will be generated for BUSCO Complete and Duplicated genes for the following $STAT values:
$BASE.plots/$BASE.MeanX.png = Mean depth of coverage
$BASE.plots/$BASE.MedX.png = Median depth of coverage
$BASE.plots/$BASE.ModeX.png = Pure Modal depth of coverage
$BASE.plots/$BASE.DensX.png = Smoothed density modal depth of coverage
$BASE.plots/$BASE.CN.png = Estimated copy number. (See DepthKopy for more details.)
If seqstats=T then each assembly sequence will also be output in a Sequences violin plot for comparison.
Each point in the plots is a separate gene or sequence. Values for individual genes/sequences are also output as
a density scatter plot named $BASE.plots/$BASE.$REGIONS.$STAT.png, where:
$REGIONS is the type of region plotted (BUSCO complete, Duplicated, or assembly Sequences).
$STAT is the output statistic: MeanX, MedX, ModeX or DensX.
The BUSCO input file will also have *.regcnv.tsv and *.dupcnv.tsv copy number prediction tables generated.
See DepthKopy for more details.
In addition to the BAM file of mapped reads itself, DepthSizer may generate (or reload) the following files associated with the BAM generated/provided:
$BAM.bai = Samtools index
$BAM.fastmp = samtools mpileup read depths per position. This format apes KAT *.cvg format, with a fasta-style sequence header, followed on the next line by a depth value per position.
$BAM.fastdep = samtools depth read depths per position. This format apes KAT *.cvg format, with a fasta-style sequence header, followed on the next line by a depth value per position.
$BAM.fast*.scdepth = Single-copy read depth calculated for the above files.
$BAM.indelratio.txt = Compiled CIGAR strings for IndelRatio calculation.
$BAM.mapratio.txt = CovBases and MapBases numbers for MapAdjust calculation.
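The fastmp/fastdep layout described above (a fasta-style header line followed by one line of per-position depths) could be read with a small parser such as the one below. This is a sketch under assumptions: it treats the depth values as whitespace-separated integers, which is not confirmed here, and the function name read_fastmp and the example content are hypothetical.

```python
import io


def read_fastmp(handle):
    """Yield (sequence_name, [depth, ...]) from a fastmp/fastdep-style file."""
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Fasta-style header: keep the text after '>' as the sequence name.
            name = line[1:].strip()
        elif name is not None:
            # Assumed layout: one whitespace-separated depth value per position.
            yield name, [int(value) for value in line.split()]
            name = None


# Made-up file content, for illustration only.
example_fastmp = io.StringIO(">ctg1\n3 4 4 5\n>ctg2\n0 0 1\n")
depth_by_seq = dict(read_fastmp(example_fastmp))
```

Because one depth is stored per position, the length of each list should equal the length of the corresponding sequence.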
© 2023 Richard Edwards | rich.edwards@uwa.edu.au