Yoruba is a toolset to query and manipulate BAM files. Yoruba has an command-option interface reminiscent of samtools and some other tools:
yoruba <command> [options] [<in.bam>] ...
where <command>
is one of several specific commands.
forget
or gbagbe
: Forget unused reference sequences in a BAM file
inside
or inu
: Summarize BAM file contents
readgroup
or kojopodipo
: Add or replace read group information
duplicate
or seda
: Mark and remove duplicate paired-end and single-end reads, under development
Yoruba uses the BamTools C++ API for handling BAM files and SimpleOpt for handling command-line options.
NOTE: yoruba is not yet in production shape. Contact me if you would like to use yoruba and I'll help get you started.
yoruba forget [options] <in.bam>
yoruba gbagbe [options] <in.bam>
Dynamically reduces the number of reference sequences in a BAM file. Gbagbe is the Yoruba (Nigeria) verb for 'to forget'. Either command invokes this function. At most one input BAM file is allowed.
- NOTE:
forget
does not adjust reference sequence mentioned within tags. There are some de facto standards for these mentions, for examplebwa
with multiply-mapped reads, andforget
will handle these as I learn of them.
yoruba gbagbe
will remove reference sequence descriptions from the BAM header
(@SQ lines) that are not mentioned by alignments in the BAM file. This can be
particularly helpful when a BAM containing a subset of reads from a larger BAM
containing alignments mapped to a large set of reference sequences. This can
be wasteful of space and loading time if many reference sequence descriptions
appeared in the original BAM header. For a few hundred reference sequences,
this may not be a problem, but 10 million reference sequence descriptions can
take a while to load...
yoruba gbagbe
makes two passes over the BAM file, the first to determine
which reference sequences are mentioned, and the second to write the output
BAM. If the --usage-only
option is provided, the second pass is skipped
(see below).
A list of reference sequences to keep regardless of whether they are referred
to can be provided with the --list
option. The file can be in BED format, as
a single name per line, or in any other format for which the reference sequence
name is the first whitespace-separated field. Lines beginning with #
are
ignored.
For paired-end reads with an aligned mate, the reference sequence of the
aligned mate is mentioned in the BAM record for the read. By default, yoruba forget
will keep descriptions of reference sequences mentioned for mates.
With the --no-mate
option, these references mentioned only for mates will be
forgotten, and the reference sequence ID for the mate will be changed to -1
,
indicating a missing reference sequence description.
With the --usage-only
option, reference sequence usage is examined in all
reads and all options are applied toward determining the final reference
sequence set, but no output BAM file is produced. The --output
option is
ignored. One possible use of this option is to determine the number of mate
mappings that are lost by restricting the set of reference sequences.
With the --usage-file
option, which does not imply --usage-only
, a
report of reference mentions is written to FILE, containing seven columns for
each reference in the input BAM: (1) ref, the reference name; (2) input_id,
the input reference ID; (3) m_read, the number of mentions of the reference
by reads; (4) m_mate, the number of mentions of the reference by mates of
reads; (5) m_name, 1 if the reference is mentioned by name (--list
), 0
otherwise; (6) no_mate, 1 if the reference was mentioned only by a mate and
was excluded from the output (--no-mate
), 0 otherwise; and (7) output_id,
the output reference ID. The final line (reference name *
) lists totals for
mapped reads/mates missing their reference sequence in the input (input
reference ID is -1).
Option | Description |
---|---|
--no-mate |
forget references for mates of aligned reads |
--usage-only |
analyze reference usage, do not produce output BAM |
--usage-file FILE |
write details of per-reference usage to FILE |
-L FILE or --list FILE |
list of reference sequences to keep (names or BED) |
-o FILE or --output FILE |
output file name [default is stdout] |
-? or --help |
longer help |
--progress INT |
print reads processed mod INT [100000] |
In the options table, FILE indicates a filename, and INT indicates an integer value.
yoruba inside [options] [<in.bam>]
yoruba inu [options] [<in.bam>]
Summarizes the contents of the BAM file. Inu is the Yoruba (Nigeria) noun
for 'inside'. Either command invokes this function. If <in.bam>
is not
supplied, input is read from stdin
. At most one input BAM file is allowed.
No changes to the BAM file are caused by use of this command.
The contents of a BAM file are printed in six sections, the first five comprise the header and the last is the reads. The sections in the order described in the SAM definition (http://samtools.sourceforge.net/SAM1.pdf):
-
the header line (
@HD
) contains BAM metadata -
the reference sequences (
@SQ
) describe the reference sequences to which the reads in the BAM are aligned -
the read group dictionary (
@RG
), described underreadgroup
above -
the program chain (
@PG
) describes programs which have manipulated the BAM file -
comment lines (
@CO
) which are individual text lines -
finally, reads, which may be aligned or unaligned; not printed (for the moment) are read sequences, base-specific qualities, and additional tags
Option | Description |
---|---|
--refs-to-report INT |
number of reference sequences to provide details about [10] |
--reads-to-report INT |
number of reads to provide details about [10] |
--continue |
continue reading after reporting detailed reads, report read number |
--validate |
check header validity using BamTools API; very strict |
-? or --help |
longer help |
In the options table, INT indicates an integer value.
yoruba readgroup [options] [<in.bam>]
yoruba kojopodipo [options] [<in.bam>]
Add or replace read group information in a BAM file. Kojopodipo is the
Yoruba (Nigeria) verb for 'to group'. Either command invokes this function. If
<in.bam>
is not supplied, input is read from stdin
. At most one input BAM
file is allowed.
yoruba readgroup
is faster and uses less memory than picard AddOrReplaceReadGroups
.
For a 208GB BAM file containing 10.4M reference sequences and 2.41B reads,
AddOrReplaceReadGroups
required ~30 h and ~9 GB RAM to complete, while yoruba readgroup
required ~21 h and ~6 GB RAM.
Read group information appears in two places in a BAM file:
-
the read group dictionary, found in the header, which contains definitions of individual read groups including the read group ID and any other information associated with the ID, such as library, sample name, etc.
-
the RG tag on each read, which specifies an ID that appears in the read group dictionary, and declares the read to be part of the identified read group
By default, all reads in the BAM file will be given the supplied read group. If the dictionary already defines a read group with the same ID, its definition will be replaced with the supplied information. If the dictionary contains other read groups, their definitions will remain in the BAM file header (if present) but all reads will be given the supplied read group.
This behaviour can be changed by using the options --replace
and --clear
.
See table below.
The only argument required to specify a valid read group is --ID
or its
synonym --id
.
Option | Description |
---|---|
--ID STR or --id STR |
read group identifier |
--LB STR or --library STR |
read group library |
--SM STR or --sample-name STR |
read group sample name |
--DS STR or --description STR |
read group description |
--DT STR or --date STR |
read group date |
--PG STR or --programs STR |
read group programs used |
--PL STR or --platform STR |
read group sequencing platform |
--PU STR or --platform-unit STR |
read group platform unit |
--PI STR or --predicted-insert STR |
read group predicted median insert size |
--FO STR or --flow-order STR |
read group flow order |
--KS STR or --key-sequence STR |
read group key sequence |
--CN STR or --sequencing-center STR |
read group sequencing center |
-o FILE or --output FILE |
output file name [default is stdout] |
--replace STR |
replace read group STR with --ID |
--clear |
clear all read group information |
-? or --help |
longer help |
--progress INT |
print reads processed mod INT [100000] |
In the options table, STR indicates a string argument, INT indicates an integer value, and FILE indicates a filename.
No formatting restrictions are imposed on any of the read group strings. It is the user's responsibility to ensure that they conform to the SAM definitions (http://samtools.sourceforge.net/SAM1.pdf) or to any other tool requirements.
If the output file is not specified, then output is written to stdout.
The --replace
option will replace the identified read group to have the name
provided in --ID
, in both its dictionary entry and on reads. If only --ID
is provided, then the read group is simply renamed. If any other read group
options are given, then the read group is redefined as well.
The --clear
option removes all read group information from all reads. If
specified with options defining a read group, then the read group dictionary
will be cleared prior to defining the new read group.
Only one of these may be supplied at a time. To summarize the effects of these options on the read group dictionary and the RG tag on reads:
Option | Read Group (RG) tag on reads | RG dictionary | ||
---|---|---|---|---|
no RG | RG matches STR | RG does not match STR | ||
only --ID , etc. |
new RG set for all reads | RG added | ||
--replace STR |
no change | RG changed to --ID |
no change | RG STR updated with --ID ; replaced if any other RG options |
--clear , no --ID |
no change | RG removed | RG removed | cleared |
--clear , with --ID |
new RG set for all reads | cleared, then RG added |
yoruba duplicate [options] <in.bam>
yoruba seda [options] <in.bam>
Under development, unsafe to use, operation will be unpredictable
Determines duplicate reads in a BAM file, marks them as duplicates, and removes them on option. Seda is the Yoruba (Nigeria) verb for 'to copy'. Either command invokes this function. At most one input BAM file is allowed.
Option | Description |
---|---|
--as-single-end |
all reads treated as single-end, ignore pairing |
--single-end-only |
only look for duplicates in single-end reads |
--paired-end-only |
only look for duplicates in paired-end reads |
--remove |
remove reads from the output BAM |
--duplicate-file FILE |
write duplicate reads to BAM file FILE, note this does not currently imply --remove |
-o FILE or --output FILE |
output file name [default is stdout] |
-? |
--help |
--debug INT |
debug info level INT [1] |
--reads INT |
only process INT reads (-1 = all) [-1] |
--progress INT |
print reads processed mod INT [100000] |
--override |
override the non-usage of this command |
In the options table, INT indicates an integer value, and FILE indicates a filename.