BUSCOMP v1.1.3
For a better rendering and navigation of this documentation, please visit https://slimsuite.github.io/buscomp/ or download and open ./docs/buscomp.docs.html
.
Documentation can also be generated by running BUSCOMP with the dochtml=T
option. (R and pandoc must be installed - see below.)
BUSCOMP is designed to overcome some of the non-deterministic limitations of BUSCO to:
- compile a non-redundant maximal set of complete BUSCOs from a set of assemblies, and
- use this set to provide a "true" comparison of completeness between different assemblies of the same genome with predictable behaviour.
For each BUSCO gene, BUSCOMP will extract the best "Single Complete" sequence from those available, using the
full_table_*.tsv
results table and single_copy_busco_sequences/
directory of hit sequences. BUSCOMP ranks all
the hits across all assemblies by Score and keeps the top-ranking hits. Ties are then resolved by Length, keeping
the longest sequence. Ties for Score and Length will keep an arbitrary entry as the winner. Single complete hits
are given preference over Duplicate hits, even if they have a lower score, because only Single hit have their
sequences saved by BUSCO in the single_copy_busco_sequences/
directory. This set of predicted gene sequences
forms the "BUSCOMPSeq" gene set.
BUSCOMP uses minimap2 to map BUSCOSeq predicted CDS sequences onto genome/transcriptome assemblies, including those not included in the original BUSCO compilation. This way, the compiled set of species-specific BUSCO sequences can also be used to generate a quick-and-dirty assessment of completeness for a new genome assembly. Hits are converted into percentage coverage stats, which are then used to reclassify the BUSCO gene on the basis of coverage and identity. BUSCOMP ratings are designed to mimic the original BUSCO ratings but have different definitions. In addition, two extra classes of low quality hit have been added: "Partial" and "Ghost".
- Complete: 95%+ Coverage in a single contig/scaffold. (Note: accuracy/identity is not considered.)
- Duplicated: 95%+ Coverage in 2+ contigs/scaffolds.
- Fragmented: 95%+ combined coverage but not in any single contig/scaffold.
- Partial: 40-95% combined coverage.
- Ghost: Hits meeting local cutoff but <40% combined coverage.
- Missing: No hits meeting local cutoff.
NOTE: Duplicated
genes are explicitly those with hits on two different contigs/scaffolds. BUSCOMP is focused
on identifying the coding potential of an assembly rather than a detailed completeness assessment. Duplicated
genes on the same contig/scaffold will be marked as Complete
.
In addition to individual assembly stats, BUSCO and BUSCOMP ratings are compiled across user-defined groups of
assemblies with various outputs to give insight into how different assemblies complement each other. Ratings are
also combined with traditional genome assembly statistics (NG50 and LG50) based on a given genomesize=X
to help
identify the "best" assemblies. Details of settings, key results, tables and plots are output to an HTML report
using Rmarkdown.
NOTE: For HTML output, R must be installed and a pandoc environment variable must be set, e.g.
export RSTUDIO_PANDOC=/Applications/RStudio.app/Contents/MacOS/pandoc
NOTE: BUSCOMPSeq sequences can be provided with buscofas=FILE
in place of compilation. This option has not been
tested and might give some unexpected behaviours, as some of the quoted figures will still be based on the
calculated BUSCOMPSeq data. Please report any unexpected behaviour.
BUSCOMP is still under review as part of the Starling genome paper. For now, please cite the preprint:
Stuart KC*, Edwards RJ*, Cheng Y, Warren WC, Burt DW, Sherwin WB, Hofmeister NR, Werner SJ, Ball GF, Bateson M, Brandley MC, Buchanan KL, Cassey P, Clayton DF, De Meyer T, Meddle SL & Rollins LA (preprint): Transcript- and annotation-guided genome assembly of the European starling. bioRxiv 2021.04.07.438753; doi: 10.1101/2021.04.07.438753. [*Joint first authors]
BUSCOMP is written in Python 2.x and can be run directly from the commandline:
python $CODEPATH/buscomp.py [OPTIONS]
If running as part of SLiMSuite, $CODEPATH
will be the SLiMSuite tools/
directory. If running from the standalone BUSCOMP git repo, $CODEPATH
will be the path the to code/
directory. Please see details in the BUSCOMP git repo
for running on example data.
For BUSCOMPSeq analysis, minimap2 must be installed and either added to the
environment $PATH
or given to BUSCOMP with the minimap2=PROG
setting.
A full list of options and the workflow is provided below. The easiest/standard way to run BUSCOMP is to:
- Stick one or more assemblies in a directory, e.g.
fasta/
- Run BUSCO on all of them in another directory, e.g.
busco/
, withrun_
prefixes (-o setting
) - Run BUSCOMP with a
basefile=$PREFIX
setting to get$PREFIX.*
outputs, e.g.
# Generate a name-matched BUSCO run per genome
cd busco/
for GENOME in ../fasta/*.fasta; do
GENBASE=$( basename ${GENOME/.fasta/} )
busco -o run_$GENBASE -i $GENOME -l $LINEAGE --cpu $PPN -m genome
done
cd ../
# Run BUSCOMP
python $CODEPATH/buscomp.py runs="./busco/run_*" fastadir=./fasta/ basefile=$PREFIX
If you want to run the full BUSCOMP comparison, make sure minimap2 is installed and give the number of threads to
use with forks=$THREADNUM
. If you only want the BUSCO compilation, add buscompseq=F
.
A list of commandline options can be generated at run-time using the -h
or help
flags. Please see the general
SLiMSuite documentation for details of how to
use commandline options, including setting default values with INI files. If anything is not clear, or for
questions how to perform a particular analysis, please post a question in the GitHub Issues.
### ~ Input/Output options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
runs=DIRLIST : List of BUSCO run directories (wildcards allowed) [run_*]
fastadir=DIRLIST: List of directories containing genome fasta files (wildcards allowed) [./]
fastaext=LIST : List of accepted fasta file extensions that will be checked for in fastadir [fasta,fas,fsa,fna,fa]
genomes=FILE : File of Prefix and Genome fields for generate user-friendly output [*.genomes.tdt if found]
restrict=T/F : Restrict analysis to genomes with a loaded alias [False]
runsort=X : Output sorting order for genomes and groups (or "Genome","Prefix","Complete","Group") [Group]
stripnum=T/F : Whether to strip numbers ("XX_*") at start of Genome alias in output [True]
groups=FILE : File of Genome and Group fields to define Groups for compilation [*.groups.tdt]
buscofas=FASFILE: Fasta file of BUSCO DNA sequences. Will combine and make (NR) if not given [None]
metaeukfna=T/F : Perform v5 metaeuk nucleotide busco sequences extraction if missing [True]
buscomp=T/F : Whether to run BUSCO compilation across full results tables [True]
dupbest=T/F : Whether to rate "Duplicated" above "Complete" when compiling "best" BUSCOs across Groups [False]
buscompseq=T/F : Whether to run full BUSCO comparison using buscofas and minimap2 [True]
phylofas=T/F : Generate output of compiled and renamed files for BUSCO-based phylogenomics [False]
ratefas=FILELIST: Additional fasta files of assemblies to rate with BUSCOMPSeq (No BUSCO run) (wildcards allowed) []
rmdreport=T/F : Generate Rmarkdown report and knit into HTML [True]
fullreport=T/F : Generate full Rmarkdown report including detailed tables of all ratings [True]
missing=T/F : Generate summary tables for sets of Missing genes for each assembly/group [True]
dochtml=T/F : Generate HTML BUSCOMP documentation (*.docs.html) instead of main run [False]
summarise=T/F : Include summaries of genomes in main `*.genomes.tdt` output [True]
### ~ Mapping/Classification options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
minimap2=PROG : Full path to run minimap2 [minimap2]
endextend=X : Extend minimap2 hits to end of sequence if query region with X bp of end [0]
minlocid=INT : Minimum percentage identity for aligned chunk to be kept (local %identity) [0]
minloclen=INT : Minimum length for aligned chunk to be kept (local hit length in bp) [1]
uniquehit=T/F : Option to use *.hitunique.tdt table of unique coverage for GABLAM coverage stats [True]
mmsecnum=INT : Max. number of secondary alignments to keep (minimap2 -N) [3]
mmpcut=NUM : Minimap2 Minimal secondary-to-primary score ratio to output secondary mappings (minimap2 -p) [0]
mapopt=CDICT : Dictionary of additional minimap2 options to apply (caution: over-rides conflicting settings) []
### ~ Processing options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
forks=X : Number of parallel sequences to process at once [0]
killforks=X : Number of seconds of no activity before killing all remaining forks. [36000]
forksleep=X : Sleep time (seconds) between cycles of forking out more process [0]
### ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
STEP 1 is to set up the input data. BUSCOMP will first identify and check the run directories (given with the
runs=DIRLIST
command). This command can either be a list of directories, or an expandable list with wildcard.
A list of individual directories can be provided as a file if desired.
NOTE: runs=DIRLIST
wants the actual BUSCO run directories (run_$PREFIX[.busco]/
), not the parent
directory. These directories can optionally have a .busco
suffix, which will be trimmed off. By default,
BUSCOMP will look for ./run*
.
From these, the run list will be extracted, consisting of a number of genome $PREFIX
values. It is expected
that the $PREFIX
for each run will match the input genome file, found in a directory set by fastadir=DIRLIST
.
This file can also match the Genome
alias for that genome (see below). If you have not used consistent naming
across files, please check data has been loaded and mapped correctly during the Review phase (below).
The presence of a full_$PREFIX[.busco].tsv
table and single_copy_busco_sequences/
directory within each run
directory will also be checked. If buscocompseq=F
then all of the full_*.tsv
tables in the runs
directories will be loaded for compilation, but no sequence searching will be performed. The presence of
sequences available for compilation will be stored in the Sequences
field of *.genomes.tdt
.
BUSCO v4.x: From v0.9.1
, BUSCOMP should recognise BUSCO v4 output. Due to the reorganisation and altered
naming strategy of version 4, the naming of results files is more strict. The main results directory (set by
-o
when you run BUSCO) should match the genome assembly prefix with an optional .busco
suffix. (A leading
run_
is also permitted and will be ignored. For example, if assembly.fasta
was analysed with BUSCO v4,
BUSCOMP should be able to parse results generated using -o assembly
, -o assembly.busco
, -o run_assembly
, or
-o run_assembly.busco
. If none of these settings were used, the results directory can be manually renamed. For
additional sorting, a XX_ numerical prefix can be used (see below), e.g. run_01_assembly
will still look
for assembly.*
in the fastadir
path.
BUSCO v5.x From v0.11.0
, BUSCOMP should recognised BUSCO v5 output. This will be treated like v4 output.
Additional assemblies can rated using the BUSCOMPSeq analysis (see below) without having BUSCO data analysed.
These are given with the ratefas=LIST
command (wildcards allowed), and will be added to the main Genomes table
after BUSCO compilation has been performed. The default prefix for these files will be their basename (filename,
minus path and extension). This can be linked to an alias and/or have a prefix number stripped, as with the
run directories. When combined with buscofas=FASFILE
, this enables the BUSCOMPSeq analysis to be performed in
the absence of any BUSCO results.
Each genome analysed has a "Prefix" that can be used to help sorting (if runsort=prefix
) and identify relevant
BUSCO files where the genome (Fasta) name and BUSCO run name do not match. It is generally assumed that this
Prefix will match the prefix of the original assembly file on which BUSCO was run. This Prefix is set once as the
data is loaded and is never changed. (Sorting and visualisations can be made more informative altered using the
Genome (a.k.a. Alias) and Description fields.)
The assembly Prefix will be set as:
- If a
full_table_*.tsv
file is found, the core (indicated with*
) will be used as the prefix. - If no BUSCO results are available, the path- and extension-stripped fasta file name will be used for the prefix
Genome Aliases will be parsed initially from the loaded data:
- If a BUSCO run directory is given and it contains a single BUSCO run, the run directory name will be used for
the
Genome
alias, minus the leading "run_
". - If the run directory contains multiple BUSCO runs, the
full_table_*.tsv
core used for thePrefix
will also be used for theGenome
alias. - If no BUSCO results are available, the path- and extension-stripped fasta file name will be used for the
Genome
alias.
If Genome
contains a trailing .busco
, this will be stripped. If Genome
starts with preceding numbers
(XX_
), these will be kept for sorting but stripped for outputs and fasta matching (unless stripnum=F
).
NOTE: If the assembly fasta file does not match the name used for the BUSCO run file, then the Genome
alias
must be set to the basefile
of the fasta file by appropriate naming of the run_
directory, i.e. the
run_*
directory should contain a single BUSCO run and the directory name should match a *.fasta
file (or
other accepted fastaext=LIST
extension) in one of the directories given by fastadir=DIRLIST
.
An optional alias table can be provided (genomes=FILE
) that contains Prefix
and Genome
fields (others
allowed) and can give alternative names for each genome to be used in other outputs. If running interactively,
there will also be an option to check/update these aliases manually. Genome aliases are output as part of the
main *.genomes.tdt
output, which can also be used as subsequent genomes=FILE
input (the default, if found).
This table can also have a Description
field, which will be used to update descriptions for output if found.
(Empty descriptions will not be mapped.) Otherwise, descriptions can be parsed from a description_$PREFIX.txt
file (exactly matching the $PREFIX
parsed from a full_table_*.tsv
file) if found. Alternatively, if the run
directory has a run_
prefix AND there is only one full_table
, or it is a BUSCO v4 output directory, a
description will be parsed from description.txt
. In each case, only the first line will be used. If neither
condition is met, the name of the genome fasta file will be used. Failing that, Genome
will be used. (Note that
for BUSCO v4, description.txt
must be in the same directory as full_table.tsv
.
NOTE: The optional Alias table can be used to change a Genome
name to something other than its Fasta file
- this mapping is performed after fasta files have been identified.
NOTE: Prefix
and Genome
fields must be unique. A common Prefix
and Genome
value is permitted but
these will be assumed to refer to the same assembly. (In other words, do not run BUSCO on two assemblies
and give each BUSCO run the name of the other assembly!)
NOTE: Some Genome
names may cause conflicts with running minimap2 and/or the Rmarkdown output.
Whitespace is not permitted in Genome
names (though can be in R labels) and will be stripped. It is
recommended to keep Genome
names simple and only containing standard characters (letters, numbers, dots and
underscores).
BUSCOMP compilation will take a set of genomes and identify the best rating for each BUSCO across the set. By
default, this is done for all input genomes and assigned to a "BUSCOMP" rating. Ratings are compiled according
to the ratings=LIST
hierarchy, which by default is: Complete
, Duplicated
, Fragmented
, Missing
. The
"best" rating (first in the hierarchy) will be used for that group, even if it is found in only a single
assembly in the group. If dupbest=T
, the hierarchy is Duplicated
, Complete
, Fragmented
, Missing
.
Additional subsets of genomes can also be grouped together for compilation, using the groups=FILE
input. This
must have Genome
and Group
fields. Once generated, a Group becomes a "Genome". Groups can therefore include
other Groups in them, provided there is no circular membership. Genome mapping should also recognise Prefix
values. The "BUSCOMP" group with all genomes will be added unless a conflicting "BUSCOMP" group is provided,
or buscomp=F
.
NOTE: Genome groups are generated mapping onto the Genome
(or Prefix
) values after any alias mapping
but before any manual editing of Genome
aliases.
When running interactively (i>=0
), Group review and editing can be accessed from the main Genome menu. This
will show the genomes currently in each group, and provides the following menu options:
<A>dd
= Create a new Group. First, a group name will need to be chosen, then the<E>dit
menu will be triggered.<D>elete
= Remove the Group. Note that this will not remove Group members from any "parent" groups that contained the deleted group when they were loaded.<R>ename
= Change the Group name.<E>dit
= Edit group membership. This will give options to add/remove each assembly in the analysis.E<x>it menu
= Return to main Genome Menu.<Q>uit
= Quit BUSCOMP.
Once group edits are complete, group data will be saved to the *.groups.tdt
file. If runsort=group
then
genome sorting will also be performed following group edits.
Following initial setup of genomes and groups, a Genome/Group edit menu enables updating loaded information.
(This menu will not be available if running in non-interactive mode, e.g. i=-1
.) The primary use for this is
to convert unwieldy genome and/or BUSCO run names into something more user-friendly for display. It can also be
used for adding descriptions to assemblies, adding or modifying group membership, and sorting genomes using
different criteria. Assemblies can also be dropped from analysis from this menu.
The main menu offers five options:
R
= Review genomesG
= Review/edit genome groupsS
= Sort genomesQ
= Quit BUSCOMPX
= Exit Menu and Proceed [default]
If running in interactive mode, users will have the option to review and manually update loaded data. Each
loaded genome will be displayed in turn, in its current sorted order. (Sorting can be altered or updated via
the Sort genomes
option of the main genome/group menu.) Options will be presented to modify data, or quit
the menu (or BUSCOMP program). Modifications permitted are:
<A>lias
= Sets theGenome
field for the entry. This can have a "silent" numeric prefix for sorting ifrunsort=Genome
orrunsort=Group
. Most importantly, this will be the default label in tables and plots, and used for fields in compiled data.<D>escription
= This gives the option to set a manual description of the assembly, which is used in the final output.<R>emove
= Drop this assembly from the analysis. BUSCOMP will ask for confirmation. This cannot be undone.
If genomes are edited and runsort=Genome
or runsort=Group
, they will be re-sorted upon exiting the menu.
An optional runsort=LIST
parameter will determine the output order. Future releases of BUSCOMP will proceess
a mixed list of Prefix, Genome and Group names. Any data missing from the list will be appended at the end in
Genome order. (NOTE: Group names are a special kind of Genome in this context.) This sorting is preformed after
all manual reviews and updates are complete. Currently, providing a list of genomes (or runsort=Manual
) will
trigger manual ordering via the sort menu.
There are also four special sort modes that can be implemented:
- Genome = sort alphabetically based on Genome alias and Group name. (NOTE: "silent" numerical prefixes will affect this sorting (see below).
- Group = sort alphabetically using
Genome
sorting within each group with the Group summary immediately following the last Group member. [Default] - Prefix = sort alphabetically based on Prefix. (NOTE: "silent" numerical prefixes will affect this sorting (see below).
- Complete = sort in order of most complete to least complete.
- Missing = sort in order of fewest missing to most missing BUSCOs
The runsort=prefix
option is best combined with the addition of a number to the run directory name:
run_01_OneGenome/
containingfull_OneGenome.tsv
and generated fromOneGenome.fasta
run_02_AnotherGenome.busco/
containingfull_AnotherGenome.busco.tsv
and generated fromAnotherGenome.fasta
This example would sort OneGenome before AnotherGenome if runsort=prefix, but AnotherGenome before OneGenome if
runsort=Genome
.
The final setup step is to load any genomes for which Fasta files have been provided. Unless summarise=F
,
summary statistics will be calculated for each genome, enabling assessements of genome completeness and quality
in combination with the BUSCO and BUSCOMP results. Summary statistics are calculated by rje_seqlist
. The
following statistics are calculated for each genome:
- SeqNum: The total number of scaffolds/contigs in the assembly.
- TotLength: The total combined length of scaffolds/contigs in the assembly.
- MinLength: The length of the shortest scaffold/contig in the assembly.
- MaxLength: The length of the longest scaffold/contig in the assembly.
- MeanLength: The mean length of scaffolds/contigs in the assembly.
- MedLength: The median length of scaffolds/contigs in the assembly.
- N50Length: At least half of the assembly is contained on scaffolds/contigs of this length or greater.
- L50Count: The smallest number scaffolds/contigs needed to cover half the the assembly.
- CtgNum: Number of contigs (
SeqNum
+GapCount
). - N50Ctg: At least half of the assembly is contained on contigs of this length or greater.
- L50Ctg: The smallest number contigs needed to cover half the the assembly.
- NG50Length: At least half of the genome is contained on scaffolds/contigs of this length or greater.
This is based on
genomesize=X
. If no genome size is given, it will be relative to the biggest assembly. - LG50Count: The smallest number scaffolds/contigs needed to cover half the the genome.
This is based on
genomesize=X
. If no genome size is given, it will be relative to the biggest assembly. - GapLength: The total number of undefined "gap" (
N
) nucleotides in the assembly. - GapCount: The total number of undefined "gap" (
N
) regions above mingap=X (default 10bp). - GC: The %GC content of the assembly.
NOTE: NG50Length
and LG50Count
statistics use genomesize=X
or the biggest assembly loaded.
If BUSCOMP has been run more than once on the same data (e.g. to update descriptions or sorting),
previously calculates statistics will be read from the genomes=FILE
table, if loaded. This may cause some
inconsistencies between reported NG50 and LG50 values and the given/calculated genomesize=X
. If partial
results are reloaded following a change in genomesize=X
, this will give rise to inconsistencies between
assemblies. When re-running, please make sure that a consistent genome size is used.
If in doubt, run with force=T
and force regeneration of statistics.
Once all the genomes are loaded, BUSCOMP compiles the BUSCO results. This is done for three purposes:
- Report comparative BUSCO statistics across assemblies and plot with other assembly statistics.
- Combine BUSCO results across groups of assemblies to summarise total BUSCO coverage.
- Identify and compile the best
Complete
BUSCO hits for BUSCOMPSeq analysis (below).
Loaded assemblies with an identified full results Table
will be processed and added to the *.full.tdt
table.
The best rating per BUSCO gene will then be used for summary BUSCO counts for each genome. Finally, each Group
will have its combined BUSCO rating calculated. This is performed by choosing the "best" rating for each BUSCO
across the Group's members, where "best" is determined by the dupbest=T/F
setting:
- By default (
dupbest=F
), the rating hierarchy is: 'Complete', 'Duplicated', 'Fragmented', 'Missing'. - If
dupbest=T
, the rating hierarchy is: 'Duplicated', 'Complete', 'Fragmented', 'Missing'.
Where ratings are defined (quoting from the BUSCO v3 User Guide as:
Complete
: Single-copy hits where "BUSCO matches have scored within the expected range of scores and within the expected range of length alignments to the BUSCO profile."Duplicated
: AsComplete
but 2+ copies.Fragmented
: "BUSCO matches ... within the range of scores but not within the range of length alignments to the BUSCO profile."Missing
: "Either no significant matches at all, or the BUSCO matches scored below the range of scores for the BUSCO profile."
Total BUSCO counts (N
) and summed Ratings are added to the main *.genomes.tdt
table for each assembly and
group. A *.busco.tdt
file is also generated that has the rating for each BUSCO gene (rows) in each
assembly/group (columns).
If output is being sorted using runsort=Complete
or runsort=Missing
, data will be sorted at this point.
Raw BUSCO results and compiled numbers will then be output to the log file.
The following tables will then be saved (see Output files, below, for details):
*.genomes.tdt
= summary of loaded data, genome statistics and BUSCO ratings.*.full.tdt
= full BUSCO results across all assemblies.*.busco.tdt
= compiled BUSCO results showing the rating per genome/group for each BUSCO gene.
The final step of the BUSCOMP BUSCO compilation is to extract the best Complete BUSCO sequences from those
available. For all assemblies with BUSCO results and a single_copy_busco_sequences/
, BUSCOMP ranks all the hits
across all assemblies by Score
and keeps the top-ranking hits. Ties are then resolved by Length
, keeping the
longest sequence. Ties for Score
and Length
will keep an arbitrary entry as the winner. Single
complete
hits are given preference over Duplicate
hits, even if they have a lower score, because only Single
hit
have their sequences saved by BUSCO in the single_copy_busco_sequences/
directory.
If buscofas=FASFILE
has been given, this will be used for BUSCOMPSeq searches. Otherwise, the best BUSCO
seqences identified for each BUSCO gene will be saved as *.buscomp.fasta
for BUSCOMPSeq analysis. The
exception is if buscofas=FASFILE
is pointing to the *.buscomp.fasta
and force=T
, or if the
buscofas=FILE
fasta file cannot be found.
NOTE: The buscofas=FASFILE
option has not been tested and might give some unexpected behaviours, as some
of the quoted figures will still be based on the calculated BUSCOMPSeq data.
Optionally, BUSCOMP can compile fasta files for phylogenomic analysis (phylofas=T
). These will be saved in the
phylogenomics/
directory as a series of nucleotide *.fna
files in phylofna
and protein *.faa
files in
phylofaa
. The key organisation of these directories is a single file per BUSCO ID that contains sequences for
every genome with a single Complete rating. Files will be renamed from >$SEQNAME:$START-$END
to
>$PREFIX $SEQNAME:$START-$END:$BUSCOID
where $PREFIX
is the assembly Prefix (see above).
NOTE: for HAQESAC-based phylogenomics, you might want the prefixes to have the format eog_$SPCODE__$ID, though. it should also run OK with unkspec=T allowvar=T gnspacc=F.
Once the gene sequences for BUSCOMPSeq have been established, the next step is to search for them in the assembly
fasta files using a more deterministic (and faster) approach. For this, minimap2 has been chosen. BUSCOMP will
perform a minimap2 search on each assembly fasta file. (Use minimap2=FILE
to tell BUSCOMP where to find
minimap2 if the path is not an environment variable.)
Minimap2 will run with -x splice -uf -C5
and hits will be filtered by length, keeping those that meet the
minloclen=X
cutoff (default 20 bp). A percentage identity cutoff can also be applied with minlocid=X
(default 0%). Hits are also reduced to unique coverage by starting with the local alignment with the largest
number of identical positions,
and then trimming any other local alignments that overlap with the same part(s) of the query (BUSCO) sequence.
This is repeated until all local alignments are non-overlapping (and still meet the identity and length
criteria). Local hits are then compiled into global alignment (GABLAM) statistics of coverage and identity.
By default, Minimap2 searches will keep the Top 3 secondary alignments for each sequence (the minimap2 -N
commandline parameter). In limited tests, this appears to give a good trade-off between search speed and the
identification of Duplicated
BUSCO genes. This default can be adjusted with mmsecnum=INT
to make runs faster
or more sensitive, as required. This can be further modulated using mmpcut=NUM
, which sets the Minimap2 minimal
secondary-to-primary score ratio to output secondary mappings (the minimap2 -p
commandline parameter).
Minimap2 search results are saved in a *.minimap/
directory, with the prefix $BASEFILE.$CUTOFFS
, where
$BASEFILE
is the fasta file basename for assembly, and $CUTOFFS
takes the form NXLXXIDXX
, where NX
is the
max number of secondary hits (mmpcut=NUM
), LXX
is the
length cutoff (minloclen=X
) and IDXX
is the identity cutoff (minlocid=X
) - L20ID0
by default. Unless
uniquehit=F
(see below), $CUTOFFS
will have a U
suffix, e.g. L20ID0U
. The Minimap2 *.paf
file will only
have the .NX
suffix. A *.NX.paf.cmd
file with the actual Minimap2 command used will also be saved, so the
mmpcut=NUM
setting and any other Minimap2 options set using mapopt=CDICT
can be checked if required.
Existing files will be loaded and reused unless force=T
.
Minimap2 hits are first converted into BLAST-style "local" hits, which are then converted into
GABLAM-style summary statistics. For the *.hitsum.tdt
table,
local hits are reduced to unique coverage of each Query, such that each part of the Query is covered by a single
hit. Local hits are selected in order of the number of identical positions between Query and Hit, followed by
Length in the case of a tie. Overlapping regions from other local hits are trimmed, and the process repeated with
the next-best local hit, ranked on trimmed stats where necessary. For the *.gablam.tdt
output, each Query-Hit
pair is considered independently and local hits are reduced to (a) unique coverage of the Query for Qry_*
output, and (b) unique coverage of the Hit for Hit_*
output. (Note that for each Query-Hit pair, only the local
hits between that pair are considered for "unique" local hit reduction - there may be overlapping alignments to
different queries/hits.) Unless uniquehit=F
, hits will first be reduced to be non-overlapping across all
queries before being coverted to Query-Hit pair GABLAM coverage stats.
NOTE: Re-using results does NOT robustly check whether the BUSCOMPSeq data has changed, so this directory
should be deleted if re-running with different sequences. BUSCOMP will save the name of the BUSCO file along
with the md5sum of its contents to *.NX.input.md5
, which will be checked if present and re-using data. Some basic
checks should also be performed during the following results compilation stage, but please note that BUSCO IDs
are used for sequence identifiers and so changes in BUSCO hit sequences will not be identified if files have
been inappropriatley copied/moved between runs etc. - please look out for unexpected behaviour outside a "clean"
run.
NOTE: Minimap2 only works with high sequence identity. BUSCOMP is designed to be used on multiple assemblies
from the same species. If using cross-species BUSCO analysis, odd results might be obtained, biased by the
evolutionary distance from the species with the most BUSCOMP sequences. Under these circumstances, it might be
better to swap out minimap2 for BLAST+. This can be achieved by independently running GABLAM using the
BUSCOMPSeq fasta file as queries and each assembly as the search database, then replacing the *.gablam.tdt
and
*.hitsum.tdt
files before re-running BUSCOMP. Please contact the author if you want help with this.
Minimap2 searches of the compiled BUSCOMP sequences are then compiled in similar vein to the original BUSCO results. The primary difference is that the search results need to first be converted into BUSCO-style ratings. This rating is explicitly more "quick and dirty" than the original BUSCO ratings, and should be considered complementary rather than superior. It aims to provide a quick, consistent assessment, but does have fewer checks and balances as a result.
BUSCOMP uses results from both the *.hitsum.tdt
table (overall coverage in assembly) and the *.gablam.tdt
table (coverage in a single contig/scaffold) to rate sequences as:
- Complete: 95%+ Coverage in a single contig/scaffold. (Note: unlike BUSCO, accuracy/identity is not considered. This will make it more prone to misidentification of closely related paralogues.)
- Duplicated: 95%+ Coverage in 2+ contigs/scaffolds. (Note: unlike BUSCO, multiple hits must be on different sequences to be counted as Duplicated. This will make it prone to missing false tandem repeats etc.)
- Fragmented: 95%+ combined coverage but not in any single contig/scaffold. (Note: as with BUSCOMP
Complete
ratings, these might include part of closely related paralogues.) - Partial: 40-95% combined coverage. (Note: these might be marked as "Fragmented" by BUSCO, which does not discriminate between "split" full-length hits, and single partial hits.)
- Ghost: Hits meeting local cutoff but <40% combined coverage.
- Missing: No hits meeting local cutoff.
When compiling the results for all BUSCO genes, Single/Duplicated Complete hits will also be rated as Identical if they have 100% coverage and 100% identity in at least one contig/scaffold.
Once all the individual assemblies have been rated for the full set of assemblies, results are compiled across Groups as described for the original BUSCO results (above). Because no genes receive an individual "Identical" rating, Groups will not have a summary count for Identical hits.
Individual gene ratings for each genome and group are output to *.LnnIDxx.buscomp.tdt
, where LnnIDxx
records
the minloclen=nn
and minlocid=xx
settings. Compiled ratings are output to *.LnnIDxx.ratings.tdt
.
The final step of the BUSCOMP compilation is to summarise the findings in the log file, akin to the BUSCO summary file. This will first generate a one line summary of the percentages, along with the original number of complete BUSCOs and the number of BUSCOMP sequences contributed by that assembly (i.e. the number with the best score of all the BUSCO searches.) This is followed by a more detailed breakdown of each category. For example:
#BUSCO 00:00:41 Results: C:89.2%[S:87.8%,D:1.4%,I:23.3%],F:5.8%,P:3.8%,G:1.6%,M:0.9%,n:3736 - canetoad_v2.2 (3194 Complete BUSCOs; 102 BUSCOMP Seqs)
#INFO 00:00:41 870 Identical BUSCOs (I) [100% complete and identical]
#INFO 00:00:41 3333 Complete BUSCOs (C) [95%+ coverage in a single contig/scaffold]
#INFO 00:00:41 3281 Complete and single-copy BUSCOs (S) [1 Complete hit]
#INFO 00:00:41 52 Complete and duplicated BUSCOs (D) [2+ Complete hits]
#INFO 00:00:41 217 Fragmented BUSCOs (F) [95%+ coverage spread over 2+ contigs/scaffolds]
#INFO 00:00:41 143 Partial BUSCOs (P) [40-95% coverage]
#INFO 00:00:41 61 Ghost BUSCOs (G) [<40% coverage]
#INFO 00:00:41 34 Missing BUSCOs (M) [No hits]
#INFO 00:00:41 3736 Total BUSCO gene hits searched
There is a risk that performing a low stringency search will identify homologues or pseudogenes of the desired BUSCO gene in error.
If there is a second copy of a gene in the genome that is detectable by the search then we would expect the same
genes that go from Missing
to Complete
in some genomes to go from Single
to Duplicated
in others.
To test this, data is reduced for each pair of genomes to BUSCO-BUSCOMP rating pairs of:
Single
-Single
Single
-Duplicated
Missing
-Missing
Missing
-Single
This is then converted in to G
ain ratings (Single
-Duplicated
& Missing
-Single
) or N
o Gain ratings
(Single
-Single
& Missing
-Missing
). The Single
-Duplicated
shift in one genome is then used to set the expected Missing
-Single
shift in the other, and assess the probability of observing the Missing
-Single
shift using a cumulative binomial
distribution, where:
k
is the number of observedGG
pairs (Single
-Duplicated
andMissing
-Single
)n
is the number ofMissing
-Single
G
ains in the focal genome (NG
+GG
)p
is the proportion ofSingle
-Duplicated
G
ains in the background genome (GN
+GG
/ (GN
+GG
+NN
+NG
))pB
is the probability of observingk+
Missing
-Single
gains, givenp
andn
This is output to *.gain.tdt
, where each row is a Genome and each field gives the probability of the row
genome's Missing
-Single
gains, given the column genome's Single
-Duplicated
gains.
Contents of the main output files are given below. In addition to the tab-delimited text output, a summary HTML
(and source RMarkdown) file will be generated, assuming R is installed. If buscompseq=T
then a fasta file of
the "best" (top-scoring) Single Complete BUSCO genes will also be generate (*.buscomp.fasta
), unless it is
provided using buscofas=FASFILE
.
Unless buscompseq=F
or buscofas=FASFILE
is provided, nucleotide and protein sequences for the "best" BUSCO
gene hits will be output to *.buscomp.fasta
(nucleotide) and *.buscomp.faa
(protein). The *.buscomp.faa
protein file is not used by BUSCOMP but is provided as a useful set of (near-)complete protein sequences. Each
fasta file is made from concatenating individual fasta files from the single_copy_busco_sequences/
directories.
As such, they will be named as with the standard BUSCO output:
>$BUSCOID $FASTAFILE:$CONTIG:$START-$END
NOTE: The $FASTAFILE
path given in this file will be the one from the original BUSCO run, not the path to
the genome file used by BUSCOMP if it has subsequently been moved/renamed.
The *.genomes.tdt
table is the main summary table for the input genomes, their BUSCO rating summaries and genome statistics.
#
= Sorting order for outputDirectory
= Directory for this BUSCO runPrefix
= Prefix identified from BUSCO directory (or full table), corresponding to BUSCO output filesGenome
= Optional alternative name (or "alias") to be used in outputsFasta
= path to genome file, if found- Genome statistics = genome summary statistics if genome found and
summarise=T
. (See above.) Sequences
(True/False) = Whethersingle_copy_busco_sequences/
directory foundTable
= Path tofull_*.tsv
table- BUSCO Ratings summary will be added (assuming
Table
isTrue
) (See above.)
The *.groups.tdt
table has the mappings between Genome
and Group
identifiers to make the groups for combined ratings (see
above).
#
= Arbitrary unique key for tableGenome
= Genome alias or PrefixGroup
= Name for grouped rating
The *.full.tdt
table is a compiled version of the all the individual BUSCO full results tables.
Genome
= Genome alias#
= Arbitrary unique key element for table (adjusting for duplicates) \ -BuscoID
= BUSCO gene identifierStatus
= BUSCO ratingContig
= Contig containing BUSCO hitStart
= Start positionEnd
= End positionScore
= BUSCO scoreLength
= Length of BUSCO hit
The *.busco.tdt
table has the compiled ratings for individual BUSCO genes in each genome/group
BuscoID
= BUSCO EOG gene identifier- Genomes and Groups fields have the rating of that gene in that Genome or Group
The *.buscoseq.tdt
table has the set of best BUSCO genes used for the compiled BUSCOMPSeq sequence set. In
addition to the BUSCO stats for all the BUSCOMP sequences, BUSCOMP ratings for each Genome are output in this
file.
Genome
= Genome aliasBuscoID
= BUSCO gene identifierStatus
= BUSCO ratingContig
= Contig containing BUSCO hitStart
= Start positionEnd
= End positionScore
= BUSCO scoreLength
= Length of BUSCO hit- Genomes fields have the BUSCOMP rating of that gene in that Genome
The *.LnnIDxx.buscomp.tdt
table is the same as the *.busco.tdt
but with revised ratings based on BUSCOMP analysis.
BuscoID
= BUSCO EOG gene identifier- Genomes and Groups fields have the rating of that gene in that Genome or Group
The *.LnnIDxx.ratings.tdt
table has the compiled BUCOMP summary ratings per genome/group.
Genome
= Genome alias or group nameN
= Number of BUSCOMP genesIdentical
= 100% coverage and 100% identity in at least one contig/scaffold. These will also be rated asComplete
orDuplicated
. Groups do not haveIdentical
ratings.Complete
= 95%+ Coverage in a single contig/scaffold. (Note: accuracy/identity is not considered.)Duplicated
= 95%+ Coverage in 2+ contigs/scaffolds.Fragmented
= 95%+ combined coverage but not in any single contig/scaffold.Partial
= 40-95% combined coverage.Ghost
= Hits meeting local cutoff but <40% combined coverage.Missing
= No hits meeting local cutoff.
The *.LnnIDxx.changes.tdt
table has counts of change ratings (BUSCO to BUSCOMP) for each genome, in addition
to a Total count across all genomes.
BUSCO
= BUSCO rating.BUSCOMP
= BUSCOMP rating.- Genome fields have the count of the number of genes with this combination of ratings.
Total
= Summed count over all genomes.
The *.LnnIDxx.changes.full.tdt
table is the complete set of ratings changes (BUSCO to BUSCOMP) for each gene
and genome.
BuscoID
= BUSCO EOG gene identifier- Genome fields have the ratings change for that gene, using the first letters of the ratings.
Complete, Duplicated, Fragmented, Partial, Ghost, Missing
The *.LnnIDxx.unique.tdt
table has Genome
and Group
identifiers for each BuscoID
where that gene is only
Complete
(or Duplicated
) in that genome/group.
BuscoID
= BUSCO EOG gene identifierBUSCO
= Genome/Group that uniquely has a BUSCOComplete
rating.BUSCOMP
= Genome/Group that uniquely has a BUSCOMPComplete
rating.
The *.LnnIDxx.rdata.tdt
table has the BUSCOMP summary ratings (converted into percentage values) and sequence
statistics for each Genome
/Group
, in addition to:
BUSCO
= BUSCOComplete
(Single
andDuplicated
) percentageNoBUSCO
= BUSCOMissing
percentageUniqBUSCO
= Number of unique BUSCOComplete
genesUniqBUSCOMP
= Number of unique BUSCOMPComplete
genescol
= colour field for R plotting.pch
= point type for R plotting.label
= Genome/Group label for R plotting.plot
= TRUE/FALSE, whether to plot the Genome/Group statistics.best
= any categories for which this Genome is rated "best".
Minimap2 will be run for each genome with a Fasta
file, generating:
*.paf
*.paf.cmd
*.input.md5
*.L20ID80.local.tdt
*.L20ID80.hitunique.tdt
*.L20ID80.qryunique.tdt
*.L20ID80.hitsum.tdt
*.L20ID80.gablam.tdt
Output details will be added here.
The final step in BUSCOMP analysis is to generate a summary report. This is produced as an RMarkdown document
(*.Rmd
) and (assuming the path to pandoc is set) is then "knitted" to make an HTML file (*.html
). The
RMarkdown file is retained to make it easy for the user to modify the content and convert the output into a
report of their own. HTML output can be switched off with rmdreport=F
. Unless fullreport=F
, a larger report
with full BUSCO results tables and comparative plots of Missing BUSCO genes will be generated (`*.full.html').
Prior to generation of the document, results are first compiled into an overview stats file with additional plot attributes, before the "best" assemblies are identified under various criteria. Finally, there is an option to modify some of the plotting attributes before the report is generated.
The main BUSCOMP *.ratings.tdt
output is combined with key genome statistics and BUSCO ratings from the
*.genomes.tdt
table. BUSCOMP ratings are converted into percentage values. BUSCO ratings are converted into
percentage values and reduced to BUSCO
(Single and Duplicated Complete BUSCOs) and NoBUSCO
(Missing BUSCOs).
(Full BUSCO results are plotted directly from the *.genomes.tdt
.)
After results are compiled, additional plotting fields are generated:
col
= Plotting colour. (Default, "red". Genomes with 1+ "best" ratings, "blue")pch
= Point type. (Default, 21 [circle]. Genomes with "best" rating will be 22 [square] for best BUSCO(MP) ratings, 24 [triangle] for best contiguity ratings, or 23 [diamond] for best in both categories.label
= Additional text field to be used for labels. In interactive mode, the option will be given to leave this blank for unlabelled plots.plot
= Whether or not to include a genome in the plot. (Default, TRUE)
In interactive mode, the option will be provided to edit plotting attributes for each assembly, following the
calculation of "best" assemblies (below). Compiled data are then saved to *.LnnIDxx.rdata.tdt
for summary plots
and tables in the RMarkdown output.
There are many ways of assessing genome assembly quality, but they can be broadly broken down into four aspects:
-
Coverage. How much of the genome is included in the assembly.
-
Contiguity. How many fragments is the genome in, and how big are they.
-
Accuracy. How accurate is the assembly in terms of sequence errors.
-
Redundancy. How much of the genome has been included multiple times.
Standard reference-free genome statistics (e.g. number, length and gappiness of scaffolds), can only give limited
insights. Whilst assemblies smaller than the genome size are clearly missing bits, assembly size could be
artificially inflated by redundant regions, making it impossible to assess Coverage. Scaffold sizes can give an
indicator of Contiguity, but are also prone to biases introduced by filtering of small scaffolds, for example.
If an estimated Genome Size can be provided, the most useful measures in this respect are NG50
and LG50
.
These statistics sort scaffolds by length (big to small) and identify the contig/scaffold size at which half the
genome (not the assembly) is covered by contigs/scaffolds of that size or bigger. This is the NG50
value
(called NG50Length
in BUSCOMP), with LG50
(LG50Count
) being the number of contigs/scaffolds needed to
cover half the genome. (N50
and L50
are the same statistics only for the assembly.) These statistics can
still be mislead by redundancy in the assembly. None of theses statistics speak to sequence accuracy.
The power of BUSCO is that it speaks to all four aspects of quality. Complete
and Fragmented
genes give
an indication of Coverage, Continuity and Accuracy; Duplicated
genes give an indication of Redundancy;
Missing
genes give an indication of Coverage and Accuracy. However, the weakness is that these different
aspects cannot easily be disentangled. This makes side-by-side comparisons of different assemblies challenging,
as it is not always clear what a difference is indicating.
BUSCOMP is designed on the principle that Coverage and Contiguity are the two most important aspects of assembly quality. Accuracy can, in principle, be improved by additional error-correction steps (including manual curation). Suspected Redundancy can likewise be identified a flagged. Coverage and Contiguity, in contrast, can only be improved by further assembly - either trying again from scratch, or employing a scaffolding or gap-filling alogrithm.
BUSCOMP has therefore identified seven statistics that can be used to rank assemblies on inferred Completeness or Contiguity:
- Completeness:
Complete
= Maximum number of Complete BUSCOMP ratings.BUSCO
= Maximum number of Complete BUSCO ratings.Missing
= Smallest number of Missing BUSCOMP ratings.NoBUSCO
= Smallest number of Missing BUSCO ratings.
- Contiguity:
MaxLength
= Maximum length of contigs/scaffolds.NG50Length
= Half the genome lies on contigs/scaffolds at least this size. (See above.)LG50Count
= Half the genome can be covered by this many contigs/scaffolds. (See above.)
Individual assemblies are rated as "best" under all seven criteria, with ties allowed. Each assembly can be best under multiple criteria and each criterion can have several best assemblies. BUSCOMP will report all such combinations.
The BUSCOMP HTML reports (full and summary) consist of the following sections:
Details of the BUSCOMP run, including the version, directory and commands.
- 1.1 BUSCOMP summary: A brief overview of the "best" assembly ratings.
This section contains summary details of the genome assemblies analysed, including the loaded data, the summary statistics for the assemblies, and coverage/contiguity assessment plots:
- 2.1 Genome statistics
- 2.2 Genome coverage assessment plots
- 2.3 Genome contiguity assessment plots
The compilation of BUSCO ratings, including details of the BUSCOMP Groups is given in this section. In addition to the summary ratings for each genome/group, the full report will have a table of all the individual gene ratings:
- 3.1 Genome Groups
- 3.2 BUSCO Summary
- 3.3 BUSCO Gene Details
Next, BUSCOMP ratings using the compiled BUSCOSeq dataset are reported. In addition to the summary ratings for each genome/group, the full report will have a table of all the individual gene ratings:
- 4.1 BUSCOMP re-rating of genomes
- 4.2 BUSCOMP re-rating full results
The final report section features direct comparisons of the BUSCO and BUSCOMP ratings, reporting
changes in ratings between BUSCO and BUSCOMP. Unique Complete genes are also identified: those rated as
Complete
in only a single genome, or multiple genomes in a single Group.
The full report also has a series of summary ratings plots for subsets of BUSCO genes that are rated as
Missing
in a genome or group.
- 5.1 BUSCO to BUSCOMP ratings changes
- 5.2 Unique BUSCO and BUSCOMP Complete genes
- 5.3 Ratings for Missing BUSCO genes
Details of the BUSCOMP run in terms of when, where and how (e.g. commandline options) are found in the Appendix.
Any errors or warnings generated during the BUSCO run are reported here. Check the *.log
file generated for
details.
- 6.1 BUSCOMP errors
- 6.2 BUSCOMP warnings
© 2019 Richard Edwards | richard.edwards@unsw.edu.au