DepthSizer: Read-depth based genome size prediction

DepthSizer v1.9.0

For a better rendering and navigation of this document, please download and open ./docs/depthsizer.docs.html, or visit https://slimsuite.github.io/depthsizer/. Documentation can also be generated by running DepthSizer with the dochtml=T option. (R and pandoc must be installed - see below.)

Introduction

DepthSizer is an updated version of the genome size estimate methods of Diploidocus. DepthSizer needs a genome assembly (fasta format, seqin=FILE), a set of long read (ONT, PacBio or HiFi) data for the assembly (reads=FILELIST and readtype=LIST) (or readbp=INT), and a BUSCO/BUSCOMP full table of results (busco=TSVFILE).

DepthSizer works on the principle that Complete BUSCO genes should represent predominantly single copy (diploid read depth) regions along with some poor quality and/or repeat regions. Assembly artefacts and collapsed repeats etc. are predicted to deviate from diploid read depth in an inconsistent manner. Therefore, even if less than half the region is actually diploid coverage, the modal read depth is expected to represent the actual single copy read depth.

DepthSizer uses samtools mpileup (or samtools depth if quickdepth=T) to calculate the per-base read depth. This is converted into an estimated single copy read depth using a smoothed density plot of BUSCO single copy genes. Genome size is then estimated based on a crude calculation using the total combined sequencing length. This will be calculated from reads=FILELIST unless provided with readbp=INT.

BUSCO single-copy genes are parsed from a BUSCO full results table, given by busco=TSVFILE (default full_table_$BASEFILE.busco.tsv). This can be replaced with any table matching the BUSCO fields: ['BuscoID','Status','Contig','Start','End','Score','Length']. Entries are reduced to those with Status = Complete and the Contig, Start and End fields are used to define the regions that should be predominantly single copy. Output from BUSCOMP is also compatible with DepthSizer. DepthSizer has been tested with outputs from BUSCO v3 and v5.

NOTE: The basic DepthSizer approach assumes that the raw long read data has a 1:1 correspondence to the genomic DNA being sequenced, i.e. there is no contamination (including plastids) and no bias towards insertion or deletion read errors. As a consequence, the default genome size prediction is expected to be an over-estimate. DepthSizer will also calculate an estimated lower bound, based on only those reads that map to the assembly (unless covbases=F). An adjustment for read error profiles is made by calculating the ratio of read:genomic data for mapped reads from the BAM CIGAR strings ((insertions+matches)/(deletions+matches)), which is reported as the IndelRatio adjustment. The older MapAdjust method, which uses mapped reads and mapped bases (calculated from samtools coverage and samtools fasta) to try to correct for read mapping and imbalanced insertion:deletion ratios, can also be switched on with mapadjust=T (or benchmark=T). Benchmarking of the different adjustments is ongoing. Read volumes can also be manually adjusted with readbp=INT. All calculated sizes will be reported in the *.gensize.tdt output, but the adjustment method selected by adjustmode=X (None/CovBases/IndelRatio/MapAdjust, default IndelRatio) will be used for "the" genome size prediction.

Version 1.1. The core depth calculation shifted in Version 1.1. Legacy mode will use the old code to calculate the modal read depth for each BUSCO gene along with the overall modal read depth for all gene regions. These are not recommended.

Version 1.8. Version 1.8 introduced a new reduced=T/F mode, which only processes sequences that have BUSCO predictions (Complete, Duplicated or Fragmented). This is on (True) by default, and substantially reduces the disk footprint and processing time for highly fragmented genomes. If the BUSCO Completeness is low, using the fragmented=T option (introduced in version 1.7, default False) will use Fragmented BUSCO genes as well as Complete genes to establish the single-copy read depth.

Citation

DepthSizer has been published as part of the Waratah genome paper:

Chen SH, Rossetto M, van der Merwe M, Lu-Irving P, Yap JS, Sauquet H, Bourke G, Amos TG, Bragg JG & Edwards RJ (2022). Chromosome-level de novo genome assembly of Telopea speciosissima (New South Wales waratah) using long-reads, linked-reads and Hi-C. Molecular Ecology Resources doi: 10.1111/1755-0998.13574

Please contact the author if you have trouble getting the full text version, or read the bioRxiv preprint version:

Chromosome-level de novo genome assembly of Telopea speciosissima (New South Wales waratah) using long-reads, linked-reads and Hi-C. bioRxiv 2021.06.02.444084; doi: 10.1101/2021.06.02.444084.

Running DepthSizer

DepthSizer is written in Python 2.x and can be run directly from the commandline:

python $CODEPATH/depthsizer.py [OPTIONS]

If running as part of SLiMSuite, $CODEPATH will be the SLiMSuite tools/ directory. If running from the standalone DepthSizer git repo, $CODEPATH will be the path to the code/ directory. Please see details in the DepthSizer git repo for running on example data.

Dependencies

Unless bam=FILE is given, minimap2 must be installed and either added to the environment $PATH or given to DepthSizer with the minimap2=PROG setting. samtools also needs to be installed. Unless running with legacy=T depdensity=F, R will also need to be installed.

To generate documentation with dochtml, R will need to be installed and a pandoc environment variable must be set, e.g.

export RSTUDIO_PANDOC=/Applications/RStudio.app/Contents/MacOS/pandoc

For DepthSizer documentation, run with dochtml=T and read the *.docs.html file generated.

Commandline options

A list of commandline options can be generated at run-time using the -h or help flags. Please see the general SLiMSuite documentation for details of how to use commandline options, including setting default values with INI files.

### ~ Main DepthSizer run options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
seqin=FILE      : Input sequence assembly [None]
basefile=FILE   : Root of output file names [gapspanner or $SEQIN basefile]
summarise=T/F   : Whether to generate and output summary statistics sequence data before and after processing [True]
bam=FILE        : BAM file of long reads mapped onto assembly [$BASEFILE.bam]
bamcsi=T/F      : Use CSI indexing for BAM files, not BAI (needed for v long scaffolds) [False]
reads=FILELIST  : List of fasta/fastq files containing reads. Wildcard allowed. Can be gzipped. []
readtype=LIST   : List of ont/pb/hifi file types matching reads for minimap2 mapping [ont]
dochtml=T/F     : Generate HTML DepthSizer documentation (*.docs.html) instead of main run [False]
tmpdir=PATH     : Path for temporary output files during forking (not all modes) [./tmpdir/]
### ~ Genome size prediction options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
busco=TSVFILE   : BUSCO full table [full_table_$BASEFILE.busco.tsv]
readbp=INT      : Total combined read length for depth calculations (over-rides reads=FILELIST) []
adjustmode=X    : Map adjustment method to apply (None/CovBases/IndelRatio/MapBases/MapAdjust/MapRatio/OldAdjust/OldCovBases) [IndelRatio]
quickdepth=T/F  : Whether to use samtools depth in place of mpileup (quicker but underestimates?) [False]
depchunk=INT    : Chunk input into minimum of INT bp chunks for temp depth calculation [1e6]
deponly=T/F     : Cease execution following checking/creating BAM and fastdep/fastmp files [False]
depfile=FILE    : Precomputed depth file (*.fastdep or *.fastmp) to use [None]
covbases=T/F    : Whether to calculate predicted minimum genome size based on mapped reads only [True]
mapadjust=T/F   : Whether to calculate mapadjust predicted genome size based on read length:mapping ratio [False]
benchmark=T/F   : Activate benchmarking mode and also output the assembly size and mean depth estimate [False]
legacy=T/F      : Whether to perform Legacy v1.0.0 (Diploidocus) calculations [False]
depdensity=T/F  : Whether to use the BUSCO depth density profile in place of modal depth in legacy mode [True]
depadjust=INT   : Advanced R density bandwidth adjustment parameter [12]
seqstats=T/F    : Whether to output CN and depth data for full sequences as well as BUSCO genes [False]
reduced=T/F     : Only generate/use fastmp for BUSCO-containing sequences (*.busco.fastmp) [True]
fragmented=T/F  : Whether to use Fragmented as well as Complete BUSCO genes for SC Depth estimates [False]
### ~ Forking options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
forks=X         : Number of parallel sequences to process at once [0]
killforks=X     : Number of seconds of no activity before killing all remaining forks. [36000]
killmain=T/F    : Whether to kill main thread rather than individual forks when killforks reached. [False]
logfork=T/F     : Whether to log forking in main log [False]
### ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###

DepthSizer workflow and options

The main inputs for DepthSizer genome size prediction are:

seqin=FILE : Input sequence assembly [Required].
reads=FILELIST: List of fasta/fastq files containing long reads. Wildcard allowed. Can be gzipped.
readtype=LIST : List of ont/pb/hifi file types matching reads for minimap2 mapping [ont]
busco=TSVFILE : BUSCO full table [full_table_$BASEFILE.busco.tsv] used for calculating single copy ("diploid") read depth.

Step 1: BAM file (read mapping)

The first step is to generate a BAM file by mapping reads on to seqin using minimap2. A pre-generated BAM file can be given instead using bam=FILE. There should be no secondary mapping of reads, as these will inflate read depths, so filter these out if they were allowed during mapping. Similarly, the BAM file should not contain unmapped reads. (These should be filtered during processing if present.) If no BAM file setting is given, the BAM file will be named $BASEFILE.bam, where $BASEFILE is set by basefile=X.
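As a hedged illustration of the filtering described above, the Python 3 sketch below checks the FLAG field of SAM records for the unmapped (0x4) and secondary (0x100) bits defined in the SAM specification. The function name keep_alignment and the example records are hypothetical; this is not DepthSizer's own code.

```python
# Hypothetical helper (not part of DepthSizer): decide whether a SAM record
# should contribute to read depth.
UNMAPPED = 0x4     # SAM FLAG bit: read is unmapped
SECONDARY = 0x100  # SAM FLAG bit: secondary alignment

def keep_alignment(sam_line):
    """Return True for mapped, non-secondary SAM alignment lines."""
    if sam_line.startswith('@'):
        return False  # header line, not an alignment
    flag = int(sam_line.split('\t')[1])
    return not (flag & (UNMAPPED | SECONDARY))

# Made-up example records: read name, FLAG, contig, position, ...
records = [
    "read1\t0\tctg1\t100\t60\t50M\t*\t0\t0\t*\t*",   # primary, forward strand
    "read1\t256\tctg2\t500\t0\t50M\t*\t0\t0\t*\t*",  # secondary alignment
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",           # unmapped read
    "read3\t16\tctg1\t900\t60\t50M\t*\t0\t0\t*\t*",  # primary, reverse strand
]
kept = [r.split('\t')[0] for r in records if keep_alignment(r)]
print(kept)
```

With samtools, the same two bits correspond to excluding flag value 260 (4 + 256) when filtering a BAM file.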

Step 2: BUSCO(MP) results

DepthSizer works on the principle that Complete BUSCO genes should represent predominantly single copy (diploid read depth) regions along with some poor-quality and/or repeat regions. Assembly artefacts and collapsed repeats etc. are predicted to deviate from diploid read depth in an inconsistent manner. Therefore, even if less than half the region is actually diploid coverage, the modal read depth is expected to represent the actual single copy read depth. This is estimated using a smoothed density distribution calculated using R density().

BUSCO single-copy genes are parsed from a BUSCO full results table, given by busco=TSVFILE (default full_table_$BASEFILE.busco.tsv). This can be replaced with any table with the fields: ['BuscoID','Status','Contig','Start','End','Score','Length']. Entries are reduced to those with Status = Complete and the Contig, Start and End fields are used to define the regions that should be predominantly single copy. BUSCOMP v0.10.0 and above will generate a *.complete.tsv file that can be used in place of BUSCO results. This can enable rapid re-annotation of BUSCO genes following, for example, vector trimming with Diploidocus. If fragmented=T then entries with Status = Fragmented are also used. This is useful when Completeness is low.
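The region selection described above can be sketched in Python 3 as follows. This is a minimal illustration, not DepthSizer's parser: it assumes a tab-separated table in which the first five columns are BuscoID, Status, Contig, Start and End, and in which comment lines start with '#'. The function name busco_regions and the example rows are hypothetical.

```python
def busco_regions(lines, fragmented=False):
    """Return (contig, start, end) tuples for BUSCO genes used as single-copy regions."""
    wanted = {'Complete'}
    if fragmented:
        wanted.add('Fragmented')  # mirrors the fragmented=T behaviour described above
    regions = []
    for line in lines:
        if not line.strip() or line.startswith('#'):
            continue  # skip blank and comment lines
        fields = line.rstrip('\n').split('\t')
        if len(fields) < 5 or fields[1] not in wanted:
            continue  # e.g. Missing entries carry no coordinates
        start, end = int(fields[3]), int(fields[4])
        regions.append((fields[2], min(start, end), max(start, end)))
    return regions

# Made-up example table
table = [
    "# Busco id\tStatus\tContig\tStart\tEnd\tScore\tLength",
    "1001at0\tComplete\tctg1\t1000\t2500\t850.1\t500",
    "1002at0\tDuplicated\tctg1\t4000\t5200\t700.0\t400",
    "1002at0\tDuplicated\tctg2\t100\t1300\t690.5\t400",
    "1003at0\tFragmented\tctg2\t9000\t9400\t120.3\t130",
    "1004at0\tMissing",
]
print(busco_regions(table))
print(busco_regions(table, fragmented=True))
```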

Step 3: Single-copy read depth

DepthSizer uses samtools mpileup (or samtools depth if quickdepth=T) to calculate the per-base read depth and extracts the smoothed modal read depth for all single-copy regions (Complete BUSCO genes) using the density() function of R. To avoid a minority of extremely deep-coverage bases disrupting the density profile, the depth range is first limited to the range from zero to 1000, or four times the pure modal read depth if over 1000. If the pure mode is zero coverage, zero is returned. The number of bins for the density function is set to be greater than 5 times the max depth for the calculation.
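The idea of a smoothed modal depth can be illustrated with the pure-Python 3 sketch below. It is not a reimplementation of R's density(): the bandwidth rule is a simplified rule of thumb scaled by adjust (an assumption), the range limit follows one reading of the text (an upper limit of 1000, or four times the pure mode where that exceeds 1000), and the function name and example depths are hypothetical.

```python
import math
from collections import Counter

def smoothed_modal_depth(depths, adjust=12.0):
    """Estimate single-copy depth as the peak of a Gaussian-smoothed depth profile."""
    counts = Counter(depths)
    # Pure mode: the most frequent depth (ties resolved towards the lower depth)
    pure_mode = max(counts, key=lambda d: (counts[d], -d))
    if pure_mode == 0:
        return 0.0
    # Range limit: 0-1000, or 4x the pure mode if that exceeds 1000 (our reading)
    max_depth = max(1000, 4 * pure_mode)
    counts = {d: c for d, c in counts.items() if 0 <= d <= max_depth}
    n = sum(counts.values())
    mean = sum(d * c for d, c in counts.items()) / n
    var = sum(c * (d - mean) ** 2 for d, c in counts.items()) / max(n - 1, 1)
    # ASSUMPTION: simplified rule-of-thumb bandwidth, scaled by adjust
    bandwidth = max(adjust * 0.9 * math.sqrt(var) * n ** -0.2, 0.5)
    # Evaluate the smoothed profile on a grid of more than 5 x max_depth points
    best_x, best_y = 0.0, -1.0
    for i in range(5 * max_depth + 1):
        x = i / 5.0
        y = sum(c * math.exp(-0.5 * ((x - d) / bandwidth) ** 2)
                for d, c in counts.items())
        if y > best_y:
            best_x, best_y = x, y
    return best_x

# Made-up per-base depths: a peak around 40X plus deviating regions
depths = []
for offset in range(-10, 11):
    depths += [40 + offset] * (10 * (11 - abs(offset)))
depths += [80] * 30    # e.g. a collapsed repeat at double depth
depths += [5000] * 20  # extreme outliers, removed by the range limit
print(smoothed_modal_depth(depths))
```

With real per-base data the number of observations is far larger than in this toy example, so the effective bandwidth would be much narrower relative to the spread of depths.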

By default, the density bandwidth smoothing parameter is set to adjust=12. This can be modified with depadjust=INT. The raw and smoothed profiles are output to *.plots/*.raw.png and *.plots/*.scdepth.png to check smoothing if required. Additional checking plots are also output (see Outputs below).

The full output of depths per position is output to $BAM.fastmp (or $BAM.fastdep if quickdepth=T). The single-copy read depth is also output to $BAM.fastmp.scdepth. If reduced=T (the default) then the fastmp or fastdep file will have a $BAM.busco.* prefix and only include the sequences in the BUSCO table. By default, generation of the fastdep/fastmp data is performed by chunking up the assembly and creating temporary files in parallel (tmpdir=PATH). Sequences are batched in order such that each batch meets the minimum size criterion set by depchunk=INT (default 1Mbp). If depchunk=0 then each sequence will be processed individually. This is not recommended for large, highly fragmented genomes. Unless dev=T or debug=T, the temporary files will be deleted once the final file is made. If DepthSizer crashes during the generation of the file, it should be possible to re-run it, and it will re-use existing temporary files.
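The batching rule described above can be sketched in Python 3 as follows. This is an illustration only: the function name batch_sequences and the example sequence lengths are hypothetical, and the handling of a final batch that falls short of depchunk is an assumption.

```python
def batch_sequences(seq_lengths, depchunk=1000000):
    """Group (name, length) pairs, in input order, into batches of at least depchunk bp."""
    batches, current, current_bp = [], [], 0
    for name, length in seq_lengths:
        current.append(name)
        current_bp += length
        if current_bp >= depchunk:  # with depchunk=0 every sequence closes its own batch
            batches.append(current)
            current, current_bp = [], 0
    if current:
        batches.append(current)  # ASSUMPTION: leftover sequences form a final, smaller batch
    return batches

# Made-up assembly: one chromosome and four small contigs
seqs = [('chr1', 2500000), ('ctg2', 400000), ('ctg3', 300000),
        ('ctg4', 500000), ('ctg5', 100000)]
print(batch_sequences(seqs))
print(batch_sequences(seqs, depchunk=0))
```

The second call shows why depchunk=0 is discouraged for highly fragmented genomes: every sequence becomes its own batch, and so its own temporary file.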

NOTE: To generate output that is compatible with DepthKopy, run with reduced=F.

Step 4: Read mapping adjustments

The basic DepthSizer approach (adjustmode=None) assumes that the raw long read data has a 1:1 correspondence to the genomic DNA being sequenced, i.e. there is no contamination (including plastids) and no bias towards insertion or deletion read errors. As a consequence, the default genome size prediction is expected to be an over-estimate.

DepthSizer will also calculate several adjusted estimation values that aim to provide the range in which the true genome size is expected to lie. Note that benchmarking and refinement of these adjustments is ongoing and will be expanded in future releases. For most use cases, it is anticipated that CovBases and None will provide the lower and upper bounds, whilst IndelRatio and MapAdjust should fall in between and be more accurate. (See notes below for each method.)

Currently, there are four adjustmode settings that can be output, in addition to two benchmarking calculations:

None : The purest DepthSizer mode makes no adjustment to total sequencing depth. (See Step 5.)
CovBases : This uses samtools coverage to calculate the total number of mapped bases as covered bases multiplied by the mean depth: samtools coverage $BAM | grep -v coverage | awk '{sum += ($7 * $5)} END {print sum}'. Very big differences between CovBases and None may indicate a very incomplete assembly and/or an excess of contamination in the raw sequencing data. If BUSCO scores etc. indicate good completeness, it is advisable to carefully check the reads=FILELIST data provided to DepthSizer.
IndelRatio : This mode extracts the CIGAR strings from the BAM file and sums up the insertions (nI), deletions (nD) and mapped bases (nX+nM+n=). The insertion:deletion ratio is then calculated as: (I+X+M+=)/(D+X+M+=). The goal here is to estimate whether the raw sequencing data is biased towards insertion or deletion errors. Insertion bias will inflate the apparent volume of sequencing. The total read volume is therefore adjusted by dividing by the indelratio. An insertion bias will decrease the estimated genome size, whereas a deletion bias will increase the prediction. To reduce issues caused by poor-quality regions of the assembly, mapped regions are reduced to Complete BUSCO genes (in a BED file, $BED): samtools view -h -F 4 $BAM -L $BED | grep -v '^@' | awk '{print $6;}' | uniq. The combined CIGAR counts are saved in $BAM.indelratio.txt to accelerate re-calculation.
MapAdjust : An earlier attempt to model insertion:deletion ratios, this combines the CovBases calculation of total assembly base coverage with samtools fasta to calculate the total number of bases in the mapped reads (MapBases): samtools view -hb -F 4 $BAM | samtools fasta - | grep -v '^>' | wc | awk '{ $4 = $3 - $2 } 1' | awk '{print $4}'. MapAdjust calculates the general loss of raw sequencing during mapping as CovBases/MapBases. This aims to estimate the proportion of the raw sequencing data that contributed to the single copy read depth and is used as a multiplier for the total read volume. Extreme mapadjust ratios should be treated with caution and may indicate problems with the assembly and/or source data.
Assembly : In benchmark=T mode, the observed assembly size is output.
MeanX : In benchmark=T mode, the mean coverage is calculated as CovBases/AssemblySize and used in place of scdepth for the genome size estimation using the full sequencing volume.
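To make the IndelRatio calculation concrete, the Python 3 sketch below sums CIGAR operations and applies the (I+X+M+=)/(D+X+M+=) formula given above. It is a minimal illustration rather than DepthSizer's own code: the function name indel_ratio, the example CIGAR strings and the read volume are hypothetical.

```python
import re
from collections import Counter

# CIGAR operation codes as defined in the SAM specification
CIGAR_OP = re.compile(r'(\d+)([MIDNSHP=X])')

def indel_ratio(cigars):
    """Return (I + X + M + =) / (D + X + M + =) summed over all CIGAR strings."""
    totals = Counter()
    for cigar in cigars:
        for length, op in CIGAR_OP.findall(cigar):
            totals[op] += int(length)
    matched = totals['M'] + totals['X'] + totals['=']
    if totals['D'] + matched == 0:
        raise ValueError('No aligned bases found in CIGAR strings')
    return (totals['I'] + matched) / (totals['D'] + matched)

# Made-up CIGAR strings: 493 aligned bases, 8 inserted bases, 2 deleted bases
cigars = ['100M5I95M', '10S50M2D48M3I', '200=']
ratio = indel_ratio(cigars)
readbp = 60000000000              # hypothetical total read volume (bp)
adjusted_readbp = readbp / ratio  # insertion bias (ratio > 1) shrinks the read volume
print(ratio, adjusted_readbp)
```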

By default, DepthSizer will estimate genome sizes using IndelRatio and CovBases in addition to None. For speed, CovBases can be switched off with covbases=F and IndelRatio by setting adjustmode=None. The old MapAdjust calculation is not made by default, but can be switched on with mapadjust=T, adjustmode=MapAdjust, or benchmark=T. Setting benchmark=T will output all six estimates.
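For reference, the CovBases sum described in Step 4 (covered bases multiplied by mean depth, summed over sequences) can be sketched in Python 3 as below. It assumes the default tab-separated samtools coverage output, in which column 5 is covbases and column 7 is meandepth and the header line starts with '#'. The function name covbases_total and the example rows are hypothetical.

```python
def covbases_total(coverage_lines):
    """Sum covbases x meandepth over all sequences in samtools coverage output."""
    total = 0.0
    for line in coverage_lines:
        if line.startswith('#') or not line.strip():
            continue  # skip the header and blank lines
        fields = line.rstrip('\n').split('\t')
        total += float(fields[4]) * float(fields[6])  # covbases * meandepth
    return total

# Made-up samtools coverage rows
coverage = [
    "#rname\tstartpos\tendpos\tnumreads\tcovbases\tcoverage\tmeandepth\tmeanbaseq\tmeanmapq",
    "ctg1\t1\t1000\t50\t900\t90.0\t20.5\t30.1\t60",
    "ctg2\t1\t500\t10\t400\t80.0\t10.0\t29.8\t58",
]
print(covbases_total(coverage))
```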

NOTE: v1.5.0 expands the options to None/CovBases/IndelRatio/MapBases/MapAdjust/MapRatio/OldAdjust/OldCovBases. Details to follow.

Step 5: Total read volume

Genome size is estimated using the total combined sequencing length. This will be calculated from reads=FILELIST unless provided with readbp=INT. The number of bases for each input $READFILE is saved as $READFILE.basecount.txt and will be reloaded for future runs unless force=T.
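The count-and-cache behaviour described above can be sketched in Python 3 as follows. This is an illustration only: the function name count_bases is hypothetical, the contents of the *.basecount.txt file (a single integer) are an assumption, and fastq input is assumed to use four lines per record.

```python
import gzip
import os
import tempfile

def count_bases(readfile, force=False):
    """Count sequence bases in a fasta/fastq file (optionally gzipped), with a cache file."""
    cache = readfile + '.basecount.txt'
    if os.path.exists(cache) and not force:
        with open(cache) as handle:
            return int(handle.read().strip())  # reload the previous count
    opener = gzip.open if readfile.endswith('.gz') else open
    total = 0
    with opener(readfile, 'rt') as handle:
        first = handle.readline()
        if first.startswith('@'):  # fastq: the line after each header is the sequence
            for i, line in enumerate(handle):
                if i % 4 == 0:
                    total += len(line.strip())
        else:  # fasta: every non-header line is sequence
            for line in handle:
                if not line.startswith('>'):
                    total += len(line.strip())
    with open(cache, 'w') as handle:
        handle.write('{0}\n'.format(total))  # ASSUMPTION: cache holds a single integer
    return total

# Demo with a made-up two-read fastq file in a temporary directory
tmpdir = tempfile.mkdtemp()
demo = os.path.join(tmpdir, 'demo.fastq')
with open(demo, 'w') as handle:
    handle.write('@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n')
print(count_bases(demo))  # counted from the file and cached
print(count_bases(demo))  # reloaded from the cache file
```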

Step 6: Genome size prediction

The final genome size is predicted based on the total (adjusted) combined sequencing length and the single-copy read depth, as: EstGenomeSize=ReadBP/SCDepth. Size predictions will be output to *.gensize.tdt and as #GSIZE entries in the log file.
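A worked example of the EstGenomeSize=ReadBP/SCDepth calculation is sketched below in Python 3. All numbers are made up for illustration, and the function name est_genome_size is hypothetical. The adjusted estimate follows the IndelRatio description in Step 4, where the read volume is divided by the ratio before the size is calculated.

```python
def est_genome_size(readbp, scdepth):
    """Genome size (bp) = total read volume (bp) / single-copy read depth."""
    return readbp / float(scdepth)

readbp = 60000000000  # hypothetical: 60 Gbp of long reads
scdepth = 50.0        # hypothetical single-copy read depth
indel_ratio = 1.02    # hypothetical insertion-biased IndelRatio

size_none = est_genome_size(readbp, scdepth)                # no adjustment
size_indel = est_genome_size(readbp / indel_ratio, scdepth)  # IndelRatio adjustment
print(size_none, size_indel)
```

In this example the unadjusted estimate is 1.2 Gbp, and the insertion-biased ratio lowers the adjusted estimate, as described for IndelRatio above.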

Outputs

The main DepthSizer outputs are:

*.gensize.tdt = Main genome size prediction table.
*.log = DepthSizer log file with key steps and details of any errors or warnings generated.
*.plots/ = Directory of PNG plots (see below)

The primary output is the *.gensize.tdt table, which has the following fields:

SeqFile = Assembly file used for genome size prediction.
DepMethod = Depth estimation method used (mpileup or depth)
Adjust = Read mapping adjustment (see Step 4, above)
ReadBP = Total read volume (see Step 5, above)
MapAdjust = The relevant adjustment ratio.
SCDepth = The single copy read depth used in the prediction.
EstGenomeSize = Genome size prediction (bp).
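For downstream use, the table can be loaded with a few lines of Python 3. The sketch below assumes that *.gensize.tdt is tab-delimited with a header row of the field names listed above. The function name load_gensize and the example rows are hypothetical, made-up values rather than real DepthSizer output.

```python
import csv
import io

def load_gensize(handle):
    """Map each Adjust method to its EstGenomeSize (bp) from a gensize table."""
    reader = csv.DictReader(handle, delimiter='\t')
    return {row['Adjust']: float(row['EstGenomeSize']) for row in reader}

# Made-up table contents for illustration only
example = io.StringIO(
    "SeqFile\tDepMethod\tAdjust\tReadBP\tMapAdjust\tSCDepth\tEstGenomeSize\n"
    "assembly.fasta\tmpileup\tNone\t60000000000\t1.0\t50.0\t1200000000\n"
    "assembly.fasta\tmpileup\tCovBases\t60000000000\t0.9167\t50.0\t1100000000\n"
    "assembly.fasta\tmpileup\tIndelRatio\t60000000000\t1.02\t50.0\t1176470588\n"
)
sizes = load_gensize(example)
print(min(sizes.values()), max(sizes.values()))  # range of predictions
```

For a real run, the same function could be given an open file handle for the *.gensize.tdt output in place of the StringIO example.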

DepthSizer plots and additional tables

The first time SCDepth is calculated (if not provided with scdepth=NUM), DepthSizer will also generate a number of plots for additional QC of results, which are output in $BASE.plots/ as PNG files.

First, the raw and smoothed read depth profiles will be output to:

$BASE.plots/$BASE.raw.png = raw depth profile
$BASE.plots/$BASE.scdepth.png = smoothed depth profile with SC depth marked

In addition, violin plots will be generated for BUSCO Complete and Duplicated genes for the following $STAT values:

$BASE.plots/$BASE.MeanX.png = Mean depth of coverage
$BASE.plots/$BASE.MedX.png = Median depth of coverage
$BASE.plots/$BASE.ModeX.png = Pure Modal depth of coverage
$BASE.plots/$BASE.DensX.png = Smoothed density modal depth of coverage
$BASE.plots/$BASE.CN.png = Estimated copy number. (See DepthKopy for more details.)

If seqstats=T then each assembly sequence will also be output in a Sequences violin plot for comparison. Each point in the plots is a separate gene or sequence. Values for individual genes/sequences are also output as a density scatter plot named $BASE.plots/$BASE.$REGIONS.$STAT.png, where

$REGIONS is the type of region plotted (BUSCO complete, Duplicated, or assembly Sequences).
$STAT is the output statistic: MeanX, MedX, ModeX or DensX.

The BUSCO input file will also have *.regcnv.tsv and *.dupcnv.tsv copy number prediction tables generated. See DepthKopy for more details.

BAM file outputs

In addition to the BAM file of mapped reads itself, DepthSizer may generate (or reload) the following files associated with the BAM generated/provided:

$BAM.bai = Samtools index
$BAM.fastmp = samtools mpileup read depths per position. This format apes KAT *.cvg format, with a fasta-style sequence header, followed on the next line by a depth value per position.
$BAM.fastdep = samtools depth read depths per position. This format apes KAT *.cvg format, with a fasta-style sequence header, followed on the next line by a depth value per position.
$BAM.fast*.scdepth = Single-copy read depth calculated for the above files.
$BAM.indelratio.txt = Compiled CIGAR strings for IndelRatio calculation.
$BAM.mapratio.txt = CovBases and MapBases numbers for MapAdjust calculation.
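The fastmp/fastdep layout described above (a fasta-style header line followed by a line of per-position depths) can be read with a short Python 3 parser. The sketch below assumes that depth values are whitespace-separated integers, which is not confirmed by this document. The function name parse_fastmp and the example content are hypothetical.

```python
def parse_fastmp(lines):
    """Return {sequence name: [depth per position]} from fastmp/fastdep-style lines."""
    depths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            name = line[1:].split()[0]  # first word of the header is the sequence name
            depths[name] = []
        elif name is not None:
            # ASSUMPTION: depth values are whitespace-separated integers
            depths[name].extend(int(value) for value in line.split())
    return depths

# Made-up example content
example = [">ctg1", "0 3 5 5 4", ">ctg2 example description", "10 12 11"]
profile = parse_fastmp(example)
print({name: sum(d) / len(d) for name, d in profile.items()})  # mean depth per sequence
```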
