RiboTagger is composed of a few scripts. The main script is called ribotagger.pl, which scans one or more fastq/fasta files to extract sequences from a particular variable region and report estimated observations for each reported sequences due to sequencing error. Each of your samples need to be scanned individually by this script. Then there is a script biom.pl to combine the output files from ribotagger.pl and produces a number of output files, including a few tabular files and a BIOM formated file. The tabular files can be used for your 16S/18S profiling analysis, while the BIOM file can be directly used by tools like QIIME.
Please check the poster.pdf for an overview of the package. For more details, please read the official publication at https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1378-x
You need to downaload and unpack the sample data file. You should see 5 fastq files in the "test" folder.
> mkdir out
> ribotagger.pl -r v4 -i test/samp1.fq -o out/samp1.v4
The structure of the output
> head out/samp1.v4
#By: RiboTagger v0.8.0 (PSSM: 2014-08-01)
#Date: 2014-08-23T20:23:05
#CMD: -r v4 -i test/samp1.fq -o out/samp1.v4
#nreads=200000
#ntagreads=1327
#ntags=546
#tag n npos fp long.total.count long1.count long2.count long3.count long1 long2
TTGACTAAGTCGGATGTGAAAGCTCCTGGCTTA 68 28 0.0001 44 43 1 0 TGGGCGTAAAGCGCATGTAGGCGGTTGACTAAGTCGGATGTGAAAGCTCCTGGCTTAACTGGGAGAGGCCATTCGAAACTAGTC TGGGCGTAAAGCGCATGTAGGCGGTTGACTAAGTCGGATGTGAAAGCTCCTGGCTTAGCTGGGAGAGGCCATTCGAAACTAGTC
TTGGGTAAGTCAGATGTGAAATCCCCGGGCTCA 48 24 0.0034 30 21 8 1 TGGGCGTAAAGCGTGCGCAGGCGGTTGGGTAAGTCAGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTTGAGACTGCCC TGGGCGTAAAGCGTGCGCAGGCGGTTGGGTAAGTCAGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTTGAGACTGTCC
CACGTTAAGTCAGGTGTGAAACCCCCGGGCTCA 32 20 0.0000 15 15 0 0 TGGGCGTAAAGCGCACGTAGGCGGCACGTTAAGTCAGGTGTGAAACCCCCGGGCTCAACCTGGGAATGGCATTTGATACTGGCG
The output is a table file with the following fields:
- tag: the tag sequence for the variable region
- n: the number of reads that contains this tag
- npos: the number of different locations of the tag on their source reads (big value of column "n" with small value of "npos" indicates duplicated reads or amplicon sequencing data)
- fp: the number of reads you would expect to see this tag due to sequencing errors alone
- long.total.count: the number of reads containing a longer sequence of this tag (see the -long option)
- long1.count, long2.count, long3.count: number of reads containing the most abundant variants of this tag's long sequences (low long1.count/long.total.count ratio indicates that this tag is very likely repressenting a mixture of "species")
- long1, long2: the most abundant long representive sequences of this tag
If you have more than one files composing one sample (say pair end reads), and you want to make one ribotag report for this sample, you can supply multiple files for the -i or -in option:
> ribotagger.pl -r v4 -i test/samp2_1.fq test/samp2_2.fq -o out/samp2.v4
You can also use compressed files as input. Both gzip and bzip2 are supported. However, tar.gz or tar are NOT supported.
> ribotagger.pl -r v4 -i test/samp3.fq.gz -o out/samp3.v4
> ribotagger.pl -r v4 -i test/samp4.fq.gz -o out/samp4.v4
Make summary table files and a BIOM file for all your four samples:
> ls -lh out
total 252K
-rw-r--r-- 1 xiechao xiechao 64K 2014-08-23 20:23 samp1.v4
-rw-r--r-- 1 xiechao xiechao 67K 2014-08-23 20:25 samp2.v4
-rw-r--r-- 1 xiechao xiechao 57K 2014-08-23 20:26 samp3.v4
-rw-r--r-- 1 xiechao xiechao 60K 2014-08-23 20:27 samp4.v4
> biom.pl -r v4 -i out/samp*.v4 -o out/my.sample
Then you will see 4 files in your out folder:
-rw-r--r-- 1 xiechao xiechao 56K 2014-09-04 17:05 my.sample.anno
-rw-r--r-- 1 xiechao xiechao 132K 2014-09-04 17:05 my.sample.biom
-rw-r--r-- 1 xiechao xiechao 16K 2014-09-04 17:05 my.sample.tab
-rw-r--r-- 1 xiechao xiechao 71K 2014-09-04 17:05 my.sample.xls
The *.tab file contains the ribotag abundance (raw count) in each sample. The *.anno file contains the annotation of each ribotag (only available if you are using default -tag and -long option for ribotagger.pl). The fileds in this file are as follows:
- tag: the ribotag sequence
- use: "tag" or "long", whether the annotation was based on the short tag or long representive sequence
- taxon_level: taxa rank of this annotation of this tag
- taxon_data: taxa rank of the most specific annotation appeared in the database (silva or greengenes) for this tag
- long: the long representive sequence of this tag
- long_total: the number of samples having any long representive sequence
- long_this: the number of samples having this long sequence as its major representive of this tag
- support: the number of database sequences having this this tag or long sequence
- confidence: the proportion of the database sequences agreed on this annotation
- k, p, c, o, f, g, s: annotation for each of the taxa ranks - kingdom/domain, phylum, class, order, family, genus, and species
The *.xls file is the combination of the *.tab and *.anno files.
The *.biom file can be used by QIIME.
From the above files, you can carry on with your usual community profiling routine. For example, PCoA analysis using the provided R script:
> pcoa.r out/my.sample.tab out/my.sample.pcoa
We also provide a batch script to do all the above steps after ribotagger.pl:
batch.pl -r v4 -i out/samp*.v4 -o out/my.sample
Then check your folder out/my.sample.
There are many options can be tweaked for both ribotagger.pl and biom.pl (see the last section of this page for detailed references). Or if you want to use RiboTagger to do more creative things, the following commands for ribotagger.pl might be useful for you:
> ribotagger.pl -r v4 -i test/samp1.fq -o /dev/null --print-head --print-tag --print-long | head
SOURCE: @HWI-ST884:58:1:1101:14005:33196#0/1
TAG: CTTCGTAAGACAGAGGTGAAATCCCCGGGCTCA
LONG:
SOURCE: @HWI-ST884:58:1:1101:14598:33076#0/1
TAG: CCTGTTAAGTCAGATGTGAAAGCTCTGGGCTCA
LONG: TGGGCGTAAAGGGCGCGTAGGCGGCCTGTTAAGTCAGATGTGAAAGCTCTGGGCTCAACCCAGGAATTGCATTTGATACTGGCA
SOURCE: @HWI-ST884:58:1:1101:16245:33219#0/1
TAG: TTAGTCGCGTCGTAAGTGCAAACTCAGGGCTTA
LONG: TGGGCGTAAAGAGCTTGTAGGCGGTTAGTCGCGTCGTAAGTGCAAACTCAGGGCTTAACCCTGAGCCTGCTTTCGATACGGGCT
SOURCE: @HWI-ST884:58:1:1101:16444:33060#0/1
> ribotagger.pl -r v4 -i test/samp1.fq -o /dev/null --print-head --print-seq | head
SOURCE: @HWI-ST884:58:1:1101:14005:33196#0/1
RAW: GCCGCGGTAATACGTAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGCTTCGTAAGACAGAGGTGAAATCCCCGGGCTCAACC
SOURCE: @HWI-ST884:58:1:1101:14598:33076#0/1
RAW: GAATTATTGGGCGTAAAGGGCGCGTAGGCGGCCTGTTAAGTCAGATGTGAAAGCTCTGGGCTCAACCCAGGAATTGCATTTGATACTGGCAGGCTTGAGTT
SOURCE: @HWI-ST884:58:1:1101:16245:33219#0/1
RAW: CCGGATTTATTGGGCGTAAAGAGCTTGTAGGCGGTTAGTCGCGTCGTAAGTGCAAACTCAGGGCTTAACCCTGAGCCTGCTTTCGATACGGGCTGACTAGA
SOURCE: @HWI-ST884:58:1:1101:16444:33060#0/1
RAW: CTTCCGTACTCAAGCCCGCCAGTTTCGGATGCACTTCCTCGGTTAAGCCGAGGGCTTTCACATCCGACATAGCGAACCGCCTACGTGCGCTTTACGCCCAA
SOURCE: @HWI-ST884:58:1:1101:18755:33044#0/1
RAW: GCGTTGTTCGGAATCATTGGGCGTAAAGCGGGTGTAGGTTGCTCTATAAGTCAGATGTGAAAGCCCTGGGCTTAACCCAGGAAGTGCATTTGATACTGCAG
> ribotagger.pl -r v4 -i test/samp1.fq -o /dev/null --print-head --print-seq --print-no-prefix | head
@HWI-ST884:58:1:1101:14005:33196#0/1
GCCGCGGTAATACGTAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGCTTCGTAAGACAGAGGTGAAATCCCCGGGCTCAACC
@HWI-ST884:58:1:1101:14598:33076#0/1
GAATTATTGGGCGTAAAGGGCGCGTAGGCGGCCTGTTAAGTCAGATGTGAAAGCTCTGGGCTCAACCCAGGAATTGCATTTGATACTGGCAGGCTTGAGTT
@HWI-ST884:58:1:1101:16245:33219#0/1
CCGGATTTATTGGGCGTAAAGAGCTTGTAGGCGGTTAGTCGCGTCGTAAGTGCAAACTCAGGGCTTAACCCTGAGCCTGCTTTCGATACGGGCTGACTAGA
@HWI-ST884:58:1:1101:16444:33060#0/1
CTTCCGTACTCAAGCCCGCCAGTTTCGGATGCACTTCCTCGGTTAAGCCGAGGGCTTTCACATCCGACATAGCGAACCGCCTACGTGCGCTTTACGCCCAA
@HWI-ST884:58:1:1101:18755:33044#0/1
GCGTTGTTCGGAATCATTGGGCGTAAAGCGGGTGTAGGTTGCTCTATAAGTCAGATGTGAAAGCCCTGGGCTTAACCCAGGAAGTGCATTTGATACTGCAG
usage: ribotagger.pl [options] -region v4|v5|v6|v7 -out OUTFILE -in INFILE1 [INFILE2 ...]
Required:
-in INFILE1 INFILE2 ... one or more input files,
can be compressed by gz or bz2, but not tar
files must be in fastq, fasta, or plain sequence format
-out OUTFILE output file
-region [v4|v5|v6|v7] variable region
Optional:
-min-score INT minimum score at each position of
the tag or long sequences
default = 30
-min-pos INT minimum number of different start positions
for a tag to be reported
default = 1
-no-bacteria don't use Bacteria recognition profiles
-no-archaea don't use Archaea recognition profiles
-no-eukaryota don't use Eukaryota recognition profiles
-no-supplementary don't use taxa-specific supplementary profiles
-print-head print sequence header to screen for sequences covering a tag
-print-tag print tag sequences to screen
-print-long print long sequences to screen
-print-seq print raw sequences to screen
-print-no-prefix don't print prefix label for the above output
-filetype TYPE sequence format of input files, available options:
fastq, fasta, mlfasta (multi-line fasta), seqfile (plain sequences)
-phredoff INT phred score offset for fastq encoding
default = 33
-se INT number of sequencing errors used to
estimate expected number of observations of a tag
due to sequencing erros
default = 1
-tag INT length of tag to report
default = 33
-long INT length of long sequence to report
(probe sequence is included in long,
but not counted to this length)
default = 60
-before-tag INT length of sequence before the probe
to be included in the reported long sequence
default = 0
-help
usage: biom.pl [options] -region v4|v5|v6|v7 -out OUTFILE -in INFILE1 [INFILE2 ...]
Required:
-in INFILE1 INFILE2 ... one or more files generated from ribotagger.pl
-out OUTFILE output biom file
-region [v4|v5|v6|v7] variable region
Optional:
-min-pos INT minimum number of different start positions
for a tag to be reported
default = 2
-fp minus|nothing report abundance as observed count minus estimated false positive,
or do nothing,
default = minus
For taxa annotation (only works with default -tag, -long, and -before-tag):
-taxonomy TYPE which taxonomy to use for tag annotation, available:
greengenes, silva
default = silva
-long FLOAT a long sequence is called a short tag's representive if
>= FLOAT proportion of the tags have this long sequence
default = 0.7
-use-long FLOAT try to use long sequence for annotation for a tag if
number of samples sharing the same long sequence representative
for the tag / total number of samples >= FLOAT
default = 0.8
-like if no exact tag annotation found in the dictionary,
find alike taxa annotation from tags with one base difference
Usage: batch.pl -in infiles -out outdir -region REGION [options]
-in the output files from ribotagger.pl
accepts multiple files, one for each sample
-out output directory
-region variable regions [v4|v5|v6|v7]
-prefix prefix for output files
default = ribotag
-name perl regex to rename your sample names
default = 's/\.v\d\b//ig'