Read quality 20 with low percentage of identity
Liukvr opened this issue
Dear NanoPlot developer, I'm using NanoPlot to assess the percent read identity of an ONT plant sequencing run generated on a P2 instrument. Looking at the TSV output from NanoPlot, we noticed that some reads have an average read quality greater than 20 (for which a read identity of around 99% would be expected) but a percent identity far below 99%.
Here are some examples:
239509bc-8a13-411f-8c1d-9b0448e98a58 21.20879 21.478092 91889 73248 1 68.85691
3ab4c17e-cf80-4439-827b-b77c3d952e01 21.44239 21.838873 21769 8510 1 67.87681
1f2ce2c5-632b-4784-8b81-db7a5de56dbd 28.54513 28.556702 56165 56156 20 62.638786
35ba71d8-2952-4924-a4a3-3451909edd27 20.533953 20.673409 30288 30245 47 66.25387
c9cac6d1-d9b2-472f-b0de-6e938f1e1e19 34.925575 35.084343 15144 15018 60 68.29852
Resulting in the following plot:
Have you already encountered a situation like this? If so, how would you explain it?
Thanks in advance,
Luca
Hi Luca,
That Q-score is just something the basecaller made up or calculated. It doesn't know the true accuracy of the read. It just thinks, "well, this signal looks pretty decent, so I'll give it a high quality". Based on what you show here, it is not well-calibrated.
Wouter
Hi Wouter,
Thanks for the explanation. The plot was generated from ONT reads basecalled with Guppy v6.3.8. From a naive point of view, I did not expect so many reads with such a poor correlation between quality and identity. In your experience, is this a typical identity plot for ONT reads? Have you already seen cases where the basecaller overestimated the Q value?
Thanks in advance,
Luca
It seems most of your reads are at the expected accuracies, looking at the top histogram.
It would presumably be more informative to convert those empirical percent identities to the Phred scale and plot the accuracy "according to the basecaller" against the accuracy "according to the aligner". Note also that structural variants can lower the reference identity, which is an argument for using a gap-compressed reference identity (https://lh3.github.io/2018/11/25/on-the-definition-of-sequence-identity).
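A minimal sketch of that conversion, using the standard Phred relation Q = -10 * log10(1 - identity/100). The function names and the cap of Q60 for perfect matches are illustrative choices, not part of NanoPlot. The example rows assume that the second column of the table above is the basecaller's mean quality and the last column is the percent identity.

```python
import math

def identity_to_phred(percent_identity, cap=60.0):
    """Convert an alignment percent identity to a Phred-scaled score.

    The cap (an arbitrary illustrative choice) avoids log10(0)
    for reads that match the reference perfectly.
    """
    error = 1.0 - percent_identity / 100.0
    if error <= 0.0:
        return cap
    return min(cap, -10.0 * math.log10(error))

def phred_to_identity(q):
    """Convert a Phred score to the percent identity it implies."""
    return 100.0 * (1.0 - 10.0 ** (-q / 10.0))

# (basecaller mean Q, aligner percent identity), copied from the rows above;
# the column meaning is assumed, see the note before this block.
rows = [
    (21.20879, 68.85691),
    (21.44239, 67.87681),
    (28.54513, 62.638786),
    (20.533953, 66.25387),
    (34.925575, 68.29852),
]

for basecaller_q, pct_identity in rows:
    aligner_q = identity_to_phred(pct_identity)
    print(f"basecaller Q{basecaller_q:.1f} vs aligner Q{aligner_q:.1f}")
```

On this scale, a read at about 69% identity corresponds to roughly Q5, so the example reads sit far from the diagonal that a well-calibrated basecaller would produce.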