Processing Metagenomic Sequencing Reads

A pipeline for assembly, annotation and taxonomic classification of metagenomic reads after the sequencing process.

Index

  • Quality control using FastQC and adapter trimming using cutadapt.
  • Assembling the reads using MEGAHIT.
  • Gene prediction and annotation with PROKKA.
  • Additional functional annotation using DIAMOND with the eggNOG database.
  • Taxonomic classification of metagenomic reads.

Initial files

We will start from the final read files resulting from the sequencing process. We will assume that we have two files for each metagenome from a paired-end run:

  • R1.fastq.gz
  • R2.fastq.gz

Dependencies

  • FastQC v0.11.8
  • cutadapt v2.3
  • MEGAHIT v1.1.3
  • PROKKA v1.14.5
  • DIAMOND v0.9.22.123
  • eggNOG-mapper v2 and eggNOG database v5.0
  • KRAKEN2 v2.0.8
  • MaxBin v2.2.6

Step 1: Quality control and trimming

For quality control we are going to use the FastQC toolkit. This is not essential for the rest of the pipeline, but it gives additional information about the quality of the sequencing process and the length of the reads, which can be helpful when choosing the trimming parameters.

fastqc -o $OUT_DIR R1.fastq.gz R2.fastq.gz

Inside the $OUT_DIR directory you will find an .html file which opens in the browser a summary of your reads. The second and most important part is to trim the adapters from the reads using cutadapt. In our case we remove the adapter ligated to the 3' end of both paired reads, which is why we use the -a and -A options, each followed by the adapter sequence to trim. We also discard bases with quality lower than 20 using the -q option on both paired reads (-q 20,20), and use the -m option to remove reads shorter than 20 bases. The final command looks like the following:

cutadapt -q 20,20 -m 20 -a AGTCAA -A AGTCAA -o R1.trimmed.fastq.gz -p R2.trimmed.fastq.gz R1.fastq.gz R2.fastq.gz > $report

The report produced by the program will be saved in the $report file. If you want, you can run FastQC again on the trimmed files to check the differences:

fastqc -o $OUT_DIR R1.trimmed.fastq.gz R2.trimmed.fastq.gz
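
If you have several metagenomes, the whole quality-control and trimming step can be wrapped in a small shell loop. The following is only a minimal sketch: it assumes the samples are named sampleX_R1.fastq.gz / sampleX_R2.fastq.gz and uses a placeholder adapter sequence, so adjust both to your own data.

# Hypothetical sample names and adapter sequence; change them to match your data
SAMPLES="sampleA sampleB"
ADAPTER=AGTCAA

for S in $SAMPLES; do
    # Quality report of the raw reads
    fastqc -o $OUT_DIR ${S}_R1.fastq.gz ${S}_R2.fastq.gz
    # Adapter and quality trimming
    cutadapt -q 20,20 -m 20 -a $ADAPTER -A $ADAPTER \
        -o ${S}_R1.trimmed.fastq.gz -p ${S}_R2.trimmed.fastq.gz \
        ${S}_R1.fastq.gz ${S}_R2.fastq.gz > ${S}.cutadapt.report.txt
    # Quality report of the trimmed reads
    fastqc -o $OUT_DIR ${S}_R1.trimmed.fastq.gz ${S}_R2.trimmed.fastq.gz
done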

Step 2: Assembling the reads using MEGAHIT

Assembly is one of the most important parts of processing metagenomic reads. All gene prediction and annotation will depend on the accuracy of this step. There are many assemblers for metagenomes, each with different advantages, but there is no clear evidence showing any of them to be better than the others. For further information, I would recommend reading the following paper about the current state of the art in metagenome assembly:

  • Martin Ayling, Matthew D Clark, Richard M Leggett. New approaches for metagenome assembly with short reads. Briefings in Bioinformatics, bbz020, https://doi.org/10.1093/bib/bbz020

In our case, we have chosen MEGAHIT as the assembler. Here we start where we left off in the previous step, using the trimmed files produced by cutadapt:

  • R1.trimmed.fastq.gz
  • R2.trimmed.fastq.gz

We will use MEGAHIT with the following command line:

megahit -1 R1.trimmed.fastq.gz -2 R2.trimmed.fastq.gz -o OUT_DIR -t Nr_of_cores --k-list 21,41,61,81,99

Use the -t option only if you want to use more than one CPU core to speed up the calculation. The --k-list option sets the list of k-mer sizes; odd numbers in the range 15-255 with an increment of at most 28 are recommended. The contigs of the final assembly are stored in OUT_DIR in the file final.contigs.fa.
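
Before moving on, it is worth having a quick look at the assembly. The lines below are just a generic FASTA check, not part of MEGAHIT itself: they count the contigs and sum the total assembled length.

# Number of contigs in the final assembly
grep -c ">" OUT_DIR/final.contigs.fa
# Total assembled length in bases (sum of the lengths of all sequence lines)
awk '!/^>/ {total += length($0)} END {print total}' OUT_DIR/final.contigs.fa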

Step 3: Gene prediction and annotation with PROKKA

Once the assembly is complete we need to get the coding genes and make a first functional annotation. To do this we will use PROKKA with the resulting contigs file, final.contigs.fa, from the previous step:

prokka final.contigs.fa --outdir OUT_DIR --norrna --notrna --metagenome --addgenes --cpus Nr_of_cores

Again, you can speed up the calculation by adding CPU cores with the --cpus option. The additional options --norrna and --notrna skip the prediction of rRNA and tRNA genes, --metagenome improves the prediction for highly fragmented genomes, and --addgenes simply adds gene features to the final output. Among all the output files produced by Prokka, the most interesting are the .tsv file, a table describing each coding region with its gene name, EC number and product description, and the FASTA file with the amino acid sequences of the predicted genes (.faa).
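
For a quick look at the annotation you can inspect the .tsv table directly (PROKKA_xxxx stands for whatever prefix Prokka used in your run); the CDS count below is only a rough check based on the feature-type column.

# First annotated features, shown as aligned columns
head PROKKA_xxxx.tsv | column -t -s $'\t'
# Rough count of predicted coding sequences
grep -c -w "CDS" PROKKA_xxxx.tsv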

Step 4: Additional functional annotation using DIAMOND with the eggNOG database

Since the Prokka annotation can be somewhat limited, we complement the functional annotation using the eggNOG database, which combines the functional information of different databases (COG, arCOG, Pfam, ...). eggNOG-mapper can do the task directly, but splitting the search and running DIAMOND ourselves accelerates the process, since DIAMOND is much faster. Therefore, we combine DIAMOND with the eggNOG DIAMOND database, created during the installation of eggNOG-mapper, to annotate our genes based on the eggNOG database.

So, the first step is to launch DIAMOND, for which we need the amino acid FASTA file created by Prokka and the eggNOG DIAMOND database created during the eggNOG-mapper installation (usually: eggnog-mapper/data/eggnog_proteins.dmnd).

diamond blastp -d eggnog_proteins.dmnd -q PROKKA_xxxx.faa --threads Nr_of_cores --out diamond_output_file --outfmt 6 -t /tmp --max-target-seqs 1

There are some options to take into account here. First, --threads sets the number of CPU cores used for the calculation. --out sets the output file name, which we will call diamond.hits.txt from now on. --outfmt and --max-target-seqs are BLAST-style options that set the output format (6 corresponds to tabular format) and the number of matching hits reported per query (here we set 1 to keep just the best hit). Finally, -t sets a directory for temporary files.
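
The tabular format 6 output starts with the query ID, the matched eggNOG sequence ID and the percentage of identity, and ends with the e-value and bit score, so a quick way to inspect the hits is:

# Query, eggNOG hit, % identity and e-value of the first matches
cut -f1,2,3,11 diamond.hits.txt | head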

The next step is to use eggNOG-mapper on the DIAMOND output in diamond.hits.txt, but first we need to transform this output into a file suitable for eggNOG-mapper. To this end we developed a small Perl script, Diamond2eggMapper.pl:

Diamond2eggMapper.pl diamond.hits.txt > eggMapper_input_file

In the first step using DIAMOND we matched our target genes with their closest hits in eggNOG, getting the hit IDs. The resulting output is then adapted for eggNOG-mapper, and the final step is to run eggNOG-mapper to add the full annotation and description corresponding to each of these hits, using the --annotate_hits_table option:

emapper.py --annotate_hits_table eggMapper_input_file -o eggMapper_output_file

Finally, we can combine the PROKKA annotations with the additional annotations we have just created, using the following script:

CombinePROKKAeggMapper.pl PROKKA_xxxx.tsv eggMapper_output_file > Final_annotation_file

Note that here we use the tabular output from PROKKA (the .tsv file).

Step 5: Taxonomic classification of metagenomic reads

There are several programs to classify raw reads or assembled contigs produced by metagenomic sequencing into a lineage. Some of them are based on searching for marker genes in the dataset and classifying them according to a database, as in the case of GraftM (http://geronimp.github.io/graftM). However, most of the programs are based on k-mers, splitting the target sequences into smaller fragments of length k and then processing these k-mers according to their different algorithms. For instance, Kaiju (https://github.com/bioinformatics-centre/kaiju) is based on Maximum Exact Matching (MEM), where target sequences are split into small k-mers and matched directly against the sequences of the reference database, assigning the taxonomy of the hit where the target fragment got the highest number of exact matches. Currently, however, one of the most cited programs for classification of metagenomic reads is Kraken (http://ccb.jhu.edu/software/kraken/) and its new version Kraken2 (https://ccb.jhu.edu/software/kraken2/), which we will use here. It uses a k-mer-based algorithm, mapping every k-mer of a target sequence over the taxonomic tree of all the genomes in the reference database and assigning a taxonomic label according to the Lowest Common Ancestor (LCA) of the genomes containing that k-mer.

5.1: Classification using Kraken2

In order to run Kraken2 you first need to create a reference database against which we will match our sequences. In our case, we will build the database from the NCBI RefSeq genomes of Bacteria and Archaea. First, we download both libraries into a common database folder:

kraken2-build --download-library bacteria --db OurDatabaseName
kraken2-build --download-library archaea --db OurDatabaseName
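
Note that building a custom Kraken2 database also requires the NCBI taxonomy. If it is not already present in the database folder, it can be downloaded with:

kraken2-build --download-taxonomy --db OurDatabaseName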

Additionally, we can add custom sequences to our database with the --add-to-library command, using a FASTA format file. Once the library is created, we have to build the database:

kraken2-build --build --db OurDatabaseName --threads Nr_of_cores

Now we are ready to run Kraken2 against our RefSeq database. For a straightforward use of Kraken2, use the following command line:

kraken2 --threads Nr_of_cores --db OurDatabaseName --output OutputName --report Output2Name --use-names Fasta_Input_file

The --use-names option adds the scientific names of the assigned taxa to the final output, while --report produces a tab-delimited summary alternative to the standard output written to --output. Note that the final argument is the input file in FASTA format, which can be either the raw reads or the assembled contigs.
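
Since the first column of the standard output marks each sequence as classified (C) or unclassified (U), a quick way to check the classification ratio is to count both labels:

# Count classified (C) versus unclassified (U) sequences
cut -f1 OutputName | sort | uniq -c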

5.2: Improving classification by binning reads with MaxBin

When using marker-gene-based algorithms (GraftM, ...) only reads matching those marker genes (e.g., 16S rRNA genes) are classified, while k-mer-based algorithms (Kraken2, ...) try to classify 100% of the reads resulting from sequencing. In this case, a straight classification of reads can sometimes return a high number of reads that were not assigned to any taxonomy. If the percentage of classified sequences is not higher than 70%, you could try extra strategies that may help you raise the number of classified sequences.

One of these strategies is to bin the assembled contigs and raw reads, and then use the resulting bins for classification instead of the raw reads. This strategy tries to recover individual genomes, which can make classification easier. We can do this using MaxBin. This program clusters reads and assembled contigs into bins, each in theory consisting of contigs from a single species.

To run MaxBin we will need the initial R*.fastq.gz read files and the file with the contigs assembled by MEGAHIT, final.contigs.fa.

run_MaxBin.pl -contig final.contigs.fa -out OutputDirectory -reads R1.fastq.gz -reads2 R2.fastq.gz -thread Nr_of_cores

Each resulting bin will be in a FASTA file in the output directory. We can concatenate all the bins into a single FASTA file:

cat OutputDirectory/*.fasta > All.bins.fasta

Now we can use this single file to run Kraken2 again and check whether we get a better classification ratio:

kraken2 --threads Nr_of_cores --db OurDatabaseName --output OutputName --report Output2Name --use-names All.bins.fasta
