shaunpwilkinson / insect

Informatic Sequence Classification Trees

creating new reference datasets for insect

ngeraldi opened this issue · comments

Hi Shaun,
I have generated my own reference libraries for the multiple primer sets I am using. Details and the datasets (FASTA files formatted for dada2::assignTaxonomy and insect::learn, as well as the .rds files produced by learn) are available here: https://github.com/ngeraldi/eDNA_reference_libraries/blob/master/README.md

I have MiSeq data that I then take through the DADA2 pipeline (https://github.com/ngeraldi/Basic_eDNA_pipeline/blob/master/dada2_insect_pipe_v1.R).
I have then run both assignTaxonomy and classify. For the CO1 data, your classify works best using your reference dataset. However, for all the primers where I used my own reference data (dendrogram), assignTaxonomy seems to perform a lot better. My guess is that something is wrong with my custom library for classify. The .rds files (dendrograms) seem to be similar to yours, but I am not familiar with this file type.
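
In outline, the pipeline follows the standard DADA2 steps, roughly like this (just a sketch; the file paths, truncation lengths and filtering parameters below are placeholders rather than the values used in the linked script):

## generic DADA2 ASV workflow (placeholder paths and parameters)
library(dada2)
fnFs <- sort(list.files("fastq", pattern = "_R1_001.fastq", full.names = TRUE))
fnRs <- sort(list.files("fastq", pattern = "_R2_001.fastq", full.names = TRUE))
filtFs <- file.path("filtered", basename(fnFs))
filtRs <- file.path("filtered", basename(fnRs))
## quality filter and trim (primers should already have been removed, e.g. with cutadapt)
filterAndTrim(fnFs, filtFs, fnRs, filtRs, truncLen = c(200, 150), maxEE = c(2, 2))
## learn error rates and denoise
errF <- learnErrors(filtFs, multithread = TRUE)
errR <- learnErrors(filtRs, multithread = TRUE)
ddF <- dada(filtFs, err = errF, multithread = TRUE)
ddR <- dada(filtRs, err = errR, multithread = TRUE)
## merge read pairs, build the sequence table and remove chimeras
merged <- mergePairs(ddF, filtFs, ddR, filtRs)
seqtab <- makeSequenceTable(merged)
seqtab <- removeBimeraDenovo(seqtab, method = "consensus", multithread = TRUE)
## the ASV sequences (column names of seqtab) are what go into the classifiers
asvs <- getSequences(seqtab)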

Below is a folder of all the data with taxa assignment.
Any help/suggestions would be much appreciated.
I know you are busy so please don't hesitate to ask for anything that may facilitate your assistance.
Best
Nathan

Taxonomy-assigned data: taxass50, 70 and 90 are from dada2::assignTaxonomy with the respective minBoot values; taxass_insect is from insect::classify.
https://www.dropbox.com/sh/26wjnk1c90w3xxa/AABa4ROUzgUGoGZLTFwR6EXWa?dl=0

Hi Nathan,

Thanks for the well-structured question and access to all the necessary data.

The most common problem folks tend to encounter with insect is when the input data doesn't have the primers removed. Because it runs strictly global alignments (in default mode), any extra or missing residues will result in a dramatic reduction in sensitivity. I had a quick look at your vertebrate 12S input data and it looks like part of the forward primer sequence and the entire reverse primer sequence are still attached (code below). It's best to trim primers before running the sequences through DADA2, but you can also use insect::trim or insect::shave if you need to (as I have done below).

Another thing I noticed is that some of the insect output is returning e.g. 'uncultured eukaryote' at species level. I would recommend removing all unclassified sequences prior to learning the tree. This is because when insect assigns an input sequence to a cluster of reference sequences, it returns the lowest taxonomic rank common to all members of that cluster, so if the cluster includes an unclassified sequence you will often just get 'Eukaryota' or some other non-informative output. There are various other filters I use to curate the reference database prior to learning the classifier. Here is a link to the script I used recently to create a fish 12S classifier based on the MiFish primers (Miya et al. 2015, DOI: 10.1098/rsos.150088).
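
For example, a quick way to drop those records before running learn might be something like this (assuming the uninformative names appear in the FASTA deflines; if your reference set uses the accession|taxID naming format you would instead need to look the taxon names up via the taxonomy database):

## drop reference sequences with uninformative taxonomy before learning the tree
## (patterns are examples only; adjust to your own naming scheme)
z <- insect::readFASTA("ncbi_12s_euk_only_insect_virtualPCR.fasta")
bad <- grepl("uncultured|unclassified|unidentified|environmental sample",
             names(z), ignore.case = TRUE)
sum(bad)  ## number of records that would be removed
z <- z[!bad]
insect::writeFASTA(z, "ncbi_12s_euk_only_insect_filtered.fasta")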

Another consideration is that it can be difficult to compare the performance of these classifiers, since performance is a function of both precision (reducing false positives) and sensitivity/recall (reducing false negatives). So just because one classifier outputs more low-ranked taxa doesn't necessarily mean it is performing better, e.g. if a high proportion of the output is overclassified or misclassified. For this reason it is a good idea to perform a cross-validation analysis by allocating a portion of the reference database to a training set and a portion to a test/query set, and comparing both precision and recall for each method. Bear in mind that the results will also depend on the allocation method used, and there is still some debate as to the best way to do this (see Edgar 2018, PeerJ 6:e4652). Different applications will have different tolerances for false positives, and the choice of classifier (and confidence threshold) will often depend on the ratio of false positives to false negatives that you are comfortable with. The insect method is designed to err on the side of caution, and from the cross-validation experiments I have done I have found it to be generally more conservative than the RDP naive Bayes classifier (implemented in dada2::assignTaxonomy) and similar methods. That said, there are a few parameters you can adjust to increase the recall rate, e.g. threshold, decay, ping, mincount and offset (run ?insect::classify for more info).
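
If you do go down the cross-validation route, a rough skeleton could look something like this (just a sketch; it assumes the reference sequences are named in the accession|taxID format expected by insect::learn, and that the NCBI taxonomy database has been downloaded with insect::taxonomy()):

## rough cross-validation skeleton (assumes "accession|taxID" sequence names)
refs <- insect::readFASTA("ncbi_12s_euk_only_insect_virtualPCR.fasta")
db <- insect::taxonomy()  ## NCBI taxonomy database (large download)
set.seed(1)
testidx <- sample(seq_along(refs), size = round(0.2 * length(refs)))
train <- refs[-testidx]
test <- refs[testidx]
## learn a classification tree from the training subset only
cvtree <- insect::learn(train, db = db, cores = 4)
## classify the held-out sequences and recover their known taxon IDs
out <- insect::classify(test, cvtree, cores = 4)
truth <- as.integer(sapply(strsplit(names(test), "\\|"), "[", 2))
## precision and recall at each rank can then be computed by comparing the
## assigned lineages (e.g. insect::get_lineage(out$taxID, db)) with the
## true lineages (insect::get_lineage(truth, db))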

Hope that helps! See code below:

## make sure insect version is up to date
devtools::install_github("shaunpwilkinson/insect")
## read in classifier
tree12s <- readRDS("12s_ruiz_learn.rds")
## read in query data
tmp <- read.csv("Dammam__vert_taxass50.csv", stringsAsFactors = FALSE) 
x <- insect::char2dna(tmp$X)
## read in reference data
z <- insect::readFASTA("ncbi_12s_euk_only_insect_virtualPCR.fasta") 
## compare average sequence length
hist(sapply(x, length), breaks = 80:150)
hist(sapply(z, length), breaks = 80:150)
## looks like primers may still be on
## extract profile HMM from classifier and align with input sequences
model <- insect::decodePHMM(attr(tree12s, "model"))
## Viterbi path should start and end with 1's (matches)
aphid::Viterbi(model, x[1], type = "semiglobal")$path
## [1] 2 2 2 1 1 1 ...
## Note 2's on ends of Viterbi path indicate extra residues on input sequences
## Align reverse primer with first input sequence
rprim <- insect::char2dna(insect::rc("TAGAACAGGCTCCTCTAG"))
aphid::Viterbi(rprim, x[1], type = "semiglobal")$path
## [1] 2 2 2 ... 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0
## note 1's in Viterbi path indicate a hit, so reverse primer is still on 
## shave primers off and re-run
test <- insect::shave(x, model)
aphid::Viterbi(model, test[1], type = "semiglobal")$path 
## [1] 1 1 1 ... 1 1 1
## path looks ok
## classify shaved sequences 
y <- insect::classify(test, tree12s)

[Attached: Rplot1 and Rplot2 (sequence length histograms)]

Hi Shaun,
Thanks for the quick and detailed reply.
I am working on filtering the reference dataset as you suggest and will keep you updated on whether this fixes the issue. Below are quick answers/comments to your three sections.

I did remove the primers from the reference sequences using virtual PCR, or at least I thought I did (I use cutadapt to trim the sample sequences). The plots you made indicate that the sample sequences were longer than the references, correct? I will use your code to double-check primer removal from now on.
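
For reference, a virtual PCR primer-trimming step for a reference set might look roughly like this (not necessarily exactly what I ran; the forward primer and file names are placeholders, and the argument names should be checked against ?insect::virtualPCR):

## trim primer-binding sites from a reference set with insect::virtualPCR
## (forward primer and file names below are placeholders)
refs <- insect::readFASTA("my_reference_sequences.fasta")
amplicons <- insect::virtualPCR(refs,
                                up = "ACTGGGATTAGATACCCC",   ## placeholder forward primer
                                down = "TAGAACAGGCTCCTCTAG", ## reverse primer (as above)
                                trimprimers = TRUE,          ## remove the primer-binding sites
                                cores = 4)
insect::writeFASTA(amplicons, "my_reference_sequences_trimmed.fasta")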

Good point about sequences with bad taxonomy. I had this on my to-do list, but given your comments I am redoing it now.

I should have been more specific about what I meant by "better" assignment. This is based on assigning taxonomy to species in mock samples (~16 known species). I not only ran assignTaxonomy with different minBoot values, but also kept the score low in classify (50) so that I can look at the data and determine the most appropriate score later on (likely increasing it to 80 or 90).
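
A quick way to tally the classify output against the mock community is something like this (the species names are placeholders for the ~16 known species, and 'out' stands for the data frame returned by insect::classify):

## compare species-level assignments against the known mock community
known <- c("Species one", "Species two", "Species three")  ## placeholder names
detected <- unique(out$taxon[out$rank == "species"])
intersect(detected, known)  ## true positives: mock species recovered
setdiff(detected, known)    ## potential false positives or overclassifications
setdiff(known, detected)    ## false negatives: mock species missed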

Thanks again
Nathan

Hi again Nathan,
Sorry for the delayed response. Yes, your reference sequences look fine in terms of primer removal; it's just your query set that still seems to have primers attached.
That's great that you are using mock communities to validate the classification methods. Did you manage to get it working okay in the end?
Cheers,
Shaun