biocore / qiime-default-reference

Default reference data files for use with QIIME.

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

reference alignment is incorrect

gregcaporaso opened this issue · comments

A QIIME Forum user working with V2 sequences noticed that many more sequences were failing to be aligned with PyNAST in QIIME 1.9.0 than in QIIME 1.8.0. After some investigation, it turns out that the alignments packaged with Greengenes (and thus in the 0.1.0 and 0.1.1 releases of qiime-default-reference) are not in fact the PyNAST aligned sequences, but rather are SSU aligned, and some positions are filtered. We will be fixing this for qiime-default-reference 0.1.2.

This doesn't seem to be an issue when working with sequences generated from the 515F/806R primers (which is what I tested with when I brought the question of switching from the old core set to the 85% aligned OTUs), but it does appear to be a problem with V2 sequences (presumably from the 27F primer).

For users working with non-515F/806R sequences, this means that PyNAST alignments will need to be regenerated. If you're using OTU tables that were generated with pick_open_reference_otus.py, those should be regenerated, as the no_pynast_failures table will have many OTUs filtered from it. You'll need to first upgrade to qiime-default-reference 0.1.2. Once that goes live, you can do that as follows:

pip install --upgrade qiime-default-reference

We'll follow up on this issue, the QIIME blog, and the QIIME Forum when this is ready.

I'm currently working on a fix for this. With PyNAST-aligned versions of the 13_8 85% OTUs, I'm still seeing slightly lower percent id matches for the V2 sequences (data here, though the results are much better now than when comparing against qiime-default-reference 0.1.1. However... not all of the 85% OTUs were align-able with PyNAST (I think) as I'm finding that filtering as follows results in fewer aligned than unaligned sequences:

filter_fasta.py -f $HOME/data/gg_13_8_otus/rep_set_aligned/gg_13_5_pynast.fasta -o $HOME/data/gg_13_8_otus/rep_set_aligned/85_otus.pynast.fasta -a $HOME/data/gg_13_8_otus/rep_set/85_otus.fasta

and that there are still sometimes very minor differences in the sequences if I unalign the aligned sequence and compare them against the corresponding sequences in gg_13_8/rep_set/85_otus.fasta.

I'm continuing to look into this, and am considering reverting to the core set for the time-being in qiime-default-reference 0.1.2 (if the licensing allows that, which I'll be looking into).

In the meantime, the work-around for QIIME users working with non-515F/806R 16S primers will be to align against the core set, which is available here. This file path can be added as the value for pynast_template_alignment_fp in qiime_config, as described here. See updated work-around below.

@gregcaporaso, the following should do it (pynast aligned here)

gunzip gg_13_5_pynast.fasta.gz
filter_fasta.py -f gg_13_5_pynast.fasta -o $HOME/data/gg_13_8_otus/rep_set_aligned/85_otus.pynast.fasta -a $HOME/data/gg_13_8_otus/rep_set/85_otus.fasta

Is that different than the command I posted above?

Not sure, is that input file the same as the file I linked? It isn't part
of the otus package hence the confusion

A length check may not be sufficient if pynast can trim 5' or 3'
On Apr 14, 2015 17:53, "Greg Caporaso" notifications@github.com wrote:

Is that different than the command I posted above?


Reply to this email directly or view it on GitHub
#14 (comment)
.

Also, there were about 1500 seqs that didn't align with pynast as youve
noted
On Apr 14, 2015 17:57, "Daniel T. McDonald" Daniel.Mcdonald@colorado.edu
wrote:

Not sure, is that input file the same as the file I linked? It isn't part
of the otus package hence the confusion

A length check may not be sufficient if pynast can trim 5' or 3'
On Apr 14, 2015 17:53, "Greg Caporaso" notifications@github.com wrote:

Is that different than the command I posted above?


Reply to this email directly or view it on GitHub
#14 (comment)
.

I think you forgot the link, but I pulled my file from secondgenome's download page, and the md5 is correct:

MD5 (gg_13_5_pynast.fasta.gz) = 24720085027c35132de0588fee518065

Ok, given that it's known that some of the sequences in 13_5 don't align with PyNAST (which would explain why I have fewer), and that the differences that I notice in the sequences are small and (based on spot-checks) in the ends of sequences, I think we're good with using this PyNAST 85% OTU alignment as the default template alignment. I'll work on building and testing this release, then testing with QIIME.

Updated work-around: In the meantime, the work-around for QIIME users working with non-515F/806R 16S primers will be to align against the Greengenes 13_5 PyNAST 85% OTU alignment, which is available here. After downloading and unzipping that file, it's absolute file path can be added as the value for pynast_template_alignment_fp in qiime_config, as described here. This will achieve the same result as when we release qiime-default-reference 0.1.2.

Blog post announcing this bug is now live.

Some of my experiments here.

Apparently github silently doesn't render FTP links, sorry about that. Thanks for getting this together, and yet more motivation for another GG release

In an additional test of PyNAST alignment of V2 sequences with the core set versus the PyNAST 85% OTU alignment added in #15, I find that only slightly more sequences fail to align with the PyNAST 85% OTU alignment than with the old core set. Details of my test follow. Since we don't have an inherent way to say which is better (i.e., it's possible that we want those additional sequences to fail), and we have complete understanding of how the PyNAST 85% OTU alignment was generated (which is not true for the core set), we'll move forward with the plan to use the PyNAST 85% OTU alignment as the default reference template alignment.

V2 sequence collection

My first test was performed with the V2 sequences deposited under study id 449 in the QIIME database (the Costello Whole Body) dataset. I ran the following command to pick OTUs and perform PyNAST alignment against the PyNAST 85% OTU alignment:

$ pick_open_reference_otus.py -i /home/caporaso/analysis/whole-body/study_449_split_library_seqs.fna --suppress_taxonomy_assignment -aO 29 -o /home/caporaso/analysis/whole-body/or-otus
$ count_seqs.py -i "pynast_aligned_seqs/rep_set_*fasta"

204  : pynast_aligned_seqs/rep_set_failures.fasta (Sequence lengths (mean +/- std): 181.9706 +/- 56.2609)
10200  : pynast_aligned_seqs/rep_set_aligned.fasta (Sequence lengths (mean +/- std): 7682.0000 +/- 0.0000)
10404  : Total

This tells us that 204 sequences failed to align with PyNAST out of 10664 (there is one header line).

Then, I re-aligned the representative sequences against the core set

$ parallel_align_seqs_pynast.py -i /home/caporaso/analysis/whole-body/or-otus/rep_set.fna -o /home/caporaso/analysis/whole-body/or-otus/pynast_aligned_seqs_core_set --jobs_to_start 30 -t /data/qiime_software/core_set_aligned.fasta.imputed
$ count_seqs.py -i "pynast_aligned_seqs_core_set/rep_set_*fasta"

171  : pynast_aligned_seqs_core_set/rep_set_failures.fasta (Sequence lengths (mean +/- std): 166.0292 +/- 46.7550)
10233  : pynast_aligned_seqs_core_set/rep_set_aligned.fasta (Sequence lengths (mean +/- std): 7682.0000 +/- 0.0000)
10404  : Total

Also, it's important to note that different sequences failed to align with the two data sets (i.e., the sequences that fail to align against the core set are not a subset of the sequences that fail to align against the 85% PyNAST-aligned OTUs).

# find the sequences that didn't align against the 85% PyNAST-aligned OTUs
# but did align against the core set
$ filter_fasta.py -f pynast_aligned_seqs/rep_set_failures.fasta -o pynast_aligned_seqs/failures-that-aligned-v-core-set.fasta -a pynast_aligned_seqs_core_set/rep_set_aligned.fasta
caporaso@bacon 08:07:58 or-otus$ count_seqs.py -i pynast_aligned_seqs/failures-that-aligned-v-core-set.fasta

47  : pynast_aligned_seqs/failures-that-aligned-v-core-set.fasta (Sequence lengths (mean +/- std): 256.3404 +/- 16.3086)
47  : Total

# find the sequences that didn't align against the core set
# but did align against the 85% PyNAST-aligned OTUs
$ filter_fasta.py -f pynast_aligned_seqs_core_set/rep_set_failures.fasta -o pynast_aligned_seqs_core_set/failures-that-aligned-v-pyn-85-otus.fasta -a pynast_aligned_seqs/rep_set_aligned.fasta
$ count_seqs.py -i pynast_aligned_seqs/failures-that-aligned-v-pyn-85-otus.fasta

14  : pynast_aligned_seqs/failures-that-aligned-v-pyn-85-otus.fasta (Sequence lengths (mean +/- std): 236.9286 +/- 11.0677)
14  : Total

I'm now investigating what those hit, and will follow up on this issue when I have time to analyze.

$ parallel_blast.py -i /home/caporaso/analysis/whole-body/or-otus/pynast_aligned_seqs_core_set/failures-that-aligned-v-pyn-85-otus.fasta -o /home/caporaso/analysis/whole-body/or-otus/pynast_aligned_seqs_core_set/failures-that-aligned-v-pyn-85-otus-blast-results/ -b /data/blast/nt -O 10
$ grep -v '^#' pynast_aligned_seqs/failures-that-aligned-v-core-set-blast-results/failures-that-aligned-v-core-set_blast_out.txt
New.CleanUp.ReferenceOTU10449   gi|238404362|gb|GQ002078.1| 100.00  150 0   0   1   150 317 168 3e-77    297
New.CleanUp.ReferenceOTU10796   gi|296967670|gb|HM274075.1| 100.00  132 0   0   1   132 317 186 1e-66    262
New.CleanUp.ReferenceOTU11004   gi|322225548|gb|JF240143.1| 100.00  148 0   0   1   148 317 170 4e-76    293
New.CleanUp.ReferenceOTU11548   gi|344263683|gb|JN382529.1| 96.95   164 1   3   1   164 304 145 3e-67    264
New.CleanUp.ReferenceOTU12392   gi|322226988|gb|JF241583.1| 100.00  158 0   0   1   158 317 160 4e-82    313
New.CleanUp.ReferenceOTU12511   gi|322128498|gb|JF143092.1| 99.36   156 0   1   1   155 317 162 4e-76    293
New.CleanUp.ReferenceOTU12532   gi|297354760|gb|HM084636.1| 98.94   188 2   0   4   191 202 15  4e-95    357
New.CleanUp.ReferenceOTU13882   gi|326833420|gb|JF487288.1| 96.91   194 1   4   1   194 201 13  3e-80    307
New.CleanUp.ReferenceOTU14202   gi|460693648|gb|JX812596.1| 98.79   165 0   1   1   165 317 155 1e-79    305
New.CleanUp.ReferenceOTU15614   gi|399187736|gb|JQ461075.1| 100.00  141 0   0   1   141 330 190 6e-72    280
New.CleanUp.ReferenceOTU16729   gi|322187448|gb|JF202043.1| 100.00  182 0   0   1   182 305 124 2e-96    361
New.CleanUp.ReferenceOTU17453   gi|297367873|gb|HM097749.1| 99.31   145 1   0   1   145 245 101 6e-72    280
New.CleanUp.ReferenceOTU18041   gi|545339281|gb|KF502513.1| 99.04   209 0   2   1   209 174 380 6e-103   383
New.CleanUp.ReferenceOTU18339   gi|297368394|gb|HM098270.1| 100.00  184 0   0   1   184 237 54  2e-97    365
New.CleanUp.ReferenceOTU18724   gi|385889132|gb|JQ726783.1| 99.44   179 0   1   1   178 312 134 8e-90    339
New.CleanUp.ReferenceOTU1891    gi|399204424|gb|JQ477763.1| 96.24   186 2   3   1   186 348 168 4e-76    293
New.CleanUp.ReferenceOTU2098    gi|290733018|gb|GU536833.1| 100.00  162 0   0   1   162 326 165 2e-84    321
New.CleanUp.ReferenceOTU21175   gi|545342750|gb|KF505982.1| 100.00  141 0   0   1   141 182 322 6e-72    280
New.CleanUp.ReferenceOTU21766   gi|322194586|gb|JF209181.1| 100.00  180 0   0   1   180 308 129 4e-95    357
New.CleanUp.ReferenceOTU21849   gi|238268948|gb|GQ037048.1| 100.00  155 0   0   1   155 316 162 3e-80    307
New.CleanUp.ReferenceOTU22455   gi|270087661|gb|GQ949123.1| 98.45   193 0   2   1   193 229 40  1e-91    345
New.CleanUp.ReferenceOTU22842   gi|238283618|gb|GQ063996.1| 98.63   146 0   2   1   146 298 155 2e-65    258
New.CleanUp.ReferenceOTU22953   gi|214017914|gb|FJ363527.1| 100.00  181 0   0   1   181 340 160 9e-96    359
New.CleanUp.ReferenceOTU2306    gi|297036657|gb|HM343062.1| 98.80   166 0   2   1   166 326 163 3e-77    297
New.CleanUp.ReferenceOTU23446   gi|326814557|gb|JF468425.1| 99.33   150 1   0   1   150 212 63  6e-75    289
New.CleanUp.ReferenceOTU24551   gi|402535884|gb|JX515809.1| 97.95   195 1   3   1   195 192 1   2e-87    331
New.CleanUp.ReferenceOTU24622   gi|238305188|gb|GQ069011.1| 100.00  150 0   0   1   150 317 168 3e-77    297
New.CleanUp.ReferenceOTU24747   gi|322226120|gb|JF240715.1| 100.00  148 0   0   1   148 317 170 4e-76    293
New.CleanUp.ReferenceOTU24753   gi|326824061|gb|JF477929.1| 94.25   226 1   8   1   225 221 7   6e-75    289
New.CleanUp.ReferenceOTU303 gi|295983658|gb|HM021376.1| 100.00  185 0   0   1   185 300 116 3e-98    367
New.CleanUp.ReferenceOTU3067    gi|511487215|gb|KC999379.1| 98.47   196 0   3   1   196 265 73  2e-90    341
New.CleanUp.ReferenceOTU3375    gi|478686098|gb|JX520102.1| 100.00  150 0   0   1   150 278 129 3e-77    297
New.CleanUp.ReferenceOTU3384    gi|297361799|gb|HM091675.1| 96.40   222 8   0   1   222 231 10  3e-101   377
New.CleanUp.ReferenceOTU3739    gi|558620814|gb|KF632504.1| 100.00  188 0   0   1   188 198 11  6e-100   373
New.CleanUp.ReferenceOTU4059    gi|559795125|ref|NR_104711.1|   100.00  139 0   0   1   139 298 160 1e-70    276
New.CleanUp.ReferenceOTU4528    gi|375112268|gb|JQ071822.1| 99.31   144 1   0   1   144 260 117 3e-71    278
New.CleanUp.ReferenceOTU5484    gi|545341529|gb|KF504761.1| 100.00  144 0   0   1   144 181 324 1e-73    285
New.CleanUp.ReferenceOTU6173    gi|409109600|gb|JX489915.1| 96.74   184 6   0   1   184 338 155 3e-83    317
New.CleanUp.ReferenceOTU6680    gi|335334350|emb|FN658915.1|    100.00  144 0   0   1   144 335 192 1e-73    285
New.CleanUp.ReferenceOTU720 gi|547220263|dbj|AB810555.1|    100.00  144 0   0   1   144 236 93  1e-73    285
New.CleanUp.ReferenceOTU7237    gi|326814557|gb|JF468425.1| 99.33   150 1   0   1   150 212 63  7e-75    289
New.CleanUp.ReferenceOTU7455    gi|326833420|gb|JF487288.1| 95.15   206 5   5   1   206 201 1   6e-75    289
New.CleanUp.ReferenceOTU7553    gi|290741669|gb|GU545484.1| 96.46   198 6   1   1   198 197 1   7e-87    329
New.CleanUp.ReferenceOTU7914    gi|322226988|gb|JF241583.1| 100.00  158 0   0   1   158 317 160 5e-82    313
New.CleanUp.ReferenceOTU7942    gi|297357653|gb|HM087529.1| 99.39   164 1   0   1   164 214 51  3e-83    317
New.CleanUp.ReferenceOTU8857    gi|459996585|gb|KC510222.1| 98.26   172 3   0   1   172 334 163 3e-83    317
New.CleanUp.ReferenceOTU9075    gi|326815667|gb|JF469535.1| 96.07   178 6   1   1   177 178 1   7e-75    289
$ parallel_blast.py -i /home/caporaso/analysis/whole-body/or-otus/pynast_aligned_seqs/failures-that-aligned-v-core-set.fasta -o /home/caporaso/analysis/whole-body/or-otus/pynast_aligned_seqs/failures-that-aligned-v-core-set-blast-results/ -b /data/blast/nt -O 10
$ grep -v '^#' pynast_aligned_seqs_core_set/failures-that-aligned-v-pyn-85-otus-blast-results/failures-that-aligned-v-pyn-85-otus_blast_out.txt
New.CleanUp.ReferenceOTU11778   gi|7670301|dbj|AB029368.1|  100.00  203 0   0   1   203 503 301 5e-109   402
New.CleanUp.ReferenceOTU14235   gi|379462481|gb|JQ164969.1| 97.87   188 2   1   1   188 317 132 1e-88    335
New.CleanUp.ReferenceOTU14439   gi|341989644|gb|JF554459.1| 99.55   223 1   0   1   223 331 109 2e-118   434
New.CleanUp.ReferenceOTU15749   gi|545342948|gb|KF506180.1| 97.24   181 4   1   1   181 155 334 2e-81    311
New.CleanUp.ReferenceOTU16523   gi|401661910|emb|FR714868.1|    96.93   228 2   4   1   228 275674  275452  3e-98    367
New.CleanUp.ReferenceOTU20112   gi|545343825|gb|KF507057.1| 99.37   159 0   1   1   159 154 311 6e-78    299
New.CleanUp.ReferenceOTU20493   gi|545343825|gb|KF507057.1| 98.78   164 2   0   1   164 154 317 6e-81    309
New.CleanUp.ReferenceOTU2210    gi|154198989|gb|EF706223.1| 98.78   164 2   0   1   164 255 92  7e-81    309
New.CleanUp.ReferenceOTU2265    gi|301286413|gb|HM785615.1| 98.91   183 2   0   1   183 238 56  3e-92    347
New.CleanUp.ReferenceOTU23577   gi|322201028|gb|JF215623.1| 100.00  179 0   0   1   179 311 133 1e-94    355
New.CleanUp.ReferenceOTU24778   gi|319496516|gb|HQ789377.1| 98.35   182 3   0   1   182 308 127 3e-89    337
New.CleanUp.ReferenceOTU2799    gi|443301425|emb|HE613602.1|    93.94   231 12  2   1   230 353 124 2e-87    331
New.CleanUp.ReferenceOTU5311    gi|322201028|gb|JF215623.1| 100.00  179 0   0   1   179 311 133 1e-94    355
New.CleanUp.ReferenceOTU8921    gi|322201028|gb|JF215623.1| 99.44   179 1   0   1   179 311 133 3e-92    347

V4 sequence collection

My next test was performed with the V4 sequences deposited under study id 550 in the QIIME database (the Moving Pictures) dataset. I ran the following command to pick OTUs and perform PyNAST alignment against the PyNAST 85% OTU alignment:

$ parallel_align_seqs_pynast.py -i $PWD/rep_set.fna -o $PWD/pynast_aligned_seqs_85_pyn --jobs_to_start 30
$ count_seqs.py -i "$PWD/pynast_aligned_seqs_85_pyn/rep_set_*fasta"

17246  : /home/caporaso/analysis/moving-pictures/or_otus/pynast_aligned_seqs_85_pyn/rep_set_failures.fasta (Sequence lengths (mean +/- std): 92.2641 +/- 17.1674)
131449  : /home/caporaso/analysis/moving-pictures/or_otus/pynast_aligned_seqs_85_pyn/rep_set_aligned.fasta (Sequence lengths (mean +/- std): 7682.0000 +/- 0.0000)
148695  : Total

This is 11.6% failing to align against the 85% PyNAST-aligned OTUs.

$ parallel_align_seqs_pynast.py -i $PWD/rep_set.fna -o $PWD/pynast_aligned_seqs_core_set --jobs_to_start 30 -t /data/qiime_software/core_set_aligned.fasta.imputed
$ count_seqs.py -i "pynast_aligned_seqs_core_set/*fasta"

18487  : pynast_aligned_seqs_core_set/rep_set_failures.fasta (Sequence lengths (mean +/- std): 94.8743 +/- 19.5905)
130208  : pynast_aligned_seqs_core_set/rep_set_aligned.fasta (Sequence lengths (mean +/- std): 7682.0000 +/- 0.0000)
148695  : Total

This is 12.4% failing to align against the core set.