Unexpected relative profile results (possible user error)

Question

Unexpected relative profile results (possible user error)

rsharris opened this issue 3 years ago · comments

Synopsis: I have a contrived example with a short genome and some perfect reads sampled from that genome. I'm using the relative profile feature to get counts of read kmers across the genome. The results are mostly 1s with a few short stretches of higher counts. The 1s conflict with the result I expect.

Details:

The example can be found at https://github.com/rsharris/fake_reads_2021. apple.fa is a random genome of length 3K. orange.fa is 80 fake reads, each 300 bp, sampled from apple with no sequencing error, at random positions and with random orientation. Average depth ≈8X.

Running this
FastK -k40 -t1 apple.fa
FastK -k40 -t1 orange.fa
FastK -k40 -t1 -p:orange.ktab apple.fa
Profex apple.prof 1 > orange_on_apple.profile
I expected to get a count profile that roughly matched the coverage profile I would get from piling up orange to apple alignments (realizing that the 40-mers absent at the end of the reads would reduce avg count to ≈7X).

(orange_on_apple.profile is in that same repo)
Instead, I get mostly 1s with occasional short bursts of higher numbers. I'm at a loss to understand this. Am I missing something in how I am using the tools? I know the motivation for profiles was to profile along reads, not along genomes, so maybe there's a difference I'm not understanding.

I wonder if the fact that the kmer counts in apple itself are all 1 is relevant.

FWIW I'm using a fetch of the repo from about an hour ago.

Eugene W Myers Jr · Answer 1 · Fri Apr 02 2021 06:00:44 GMT+0800 (China Standard Time)

Are you sure about orange.fa, it does not look like an 8x of apple. When I look at the histogram orange.hist produced in the second step with Histex it is: Histogram of unique 40-mers of orange Input: 17,475 unique 40-mers Freq: Count Cum. % 17: 12 0.1% 16: 35 0.3% 15: 1 0.3% 12: 49 0.6% 11: 51 0.8% 10: 3 0.9% 9: 54 1.2% 7: 18 1.3% 6: 156 2.2% 5: 20 2.3% 4: 2 2.3% 3: 96 2.8% 1: 16978 100.0% In otherwords most k-mers occuring once would be consistent with the output your are getting when you build the relative profile. So I looked at the k-mer table for orange.fa with Tabex and its the strangest thing, identical k-mers are not collapsed, e.g.: Opening 40-mer table with 17,475 entries 0: aaaaaaatggaatctcgcagcgcctatagtagcaaggata = 1 1: aaaaaaatggaatctcgcagcgcctatagtagcaaggata = 1 2: aaaaaaatggaatctcgcagcgcctatagtagcaaggata = 1 3: aaaaaatcgcaaactcacgaggttgaccacttaagtacgg = 1 4: aaaaaatcgcaaactcacgaggttgaccacttaagtacgg = 1 5: aaaaaatcgcaaactcacgaggttgaccacttaagtacgg = 1 6: aaaaaatcgcaaactcacgaggttgaccacttaagtacgg = 1 7: aaaaaatgctttttaggcaataaaaaatcgcaaactcacg = 1 8: aaaaaatgctttttaggcaataaaaaatcgcaaactcacg = 1 9: aaaaaatgctttttaggcaataaaaaatcgcaaactcacg = 1 10: aaaaaatgctttttaggcaataaaaaatcgcaaactcacg = 1 11: aaaaaatgctttttaggcaataaaaaatcgcaaactcacg = 1 12: aaaaaatgctttttaggcaataaaaaatcgcaaactcacg = 1 13: aaaaaatggaatctcgcagcgcctatagtagcaaggatac = 1 14: aaaaaatggaatctcgcagcgcctatagtagcaaggatac = 1 15: aaaaaatggaatctcgcagcgcctatagtagcaaggatac = 1 16: aaaaacctcttgcatacccacatagcttcgaagacgactc = 1 17: aaaaacctcttgcatacccacatagcttcgaagacgactc = 1 18: aaaaacctcttgcatacccacatagcttcgaagacgactc = 1 19: aaaaacctcttgcatacccacatagcttcgaagacgactc = 1 20: aaaaacctcttgcatacccacatagcttcgaagacgactc = 1 21: aaaaacctcttgcatacccacatagcttcgaagacgactc = 1 22: aaaaacctcttgcatacccacatagcttcgaagacgactc = 1 23: aaaaacctcttgcatacccacatagcttcgaagacgactc = 1 24: aaaaacctcttgcatacccacatagcttcgaagacgactc = 1 25: aaaaagcacatttccgcacagaccggactcaacactgcgc = 3 26: aaaaagcattttttcctcacaagctcacaggttcgaagct = 1 27: aaaaagcattttttcctcacaagctcacaggttcgaagct = 1 28: aaaaagcattttttcctcacaagctcacaggttcgaagct = 1 Hmm, its midnight here in Europe, I'll look at it tomorrow. Could again be some facet of a super small data set, albeit I did test a couple small ones previously. I appreciate your kicking the tires in ways I obviously haven't, and am committed to solving every problem you report, albeit I hope (for your sake and my pride) there aren't too many more. Best, Gene

…

On 4/1/21, 8:57 PM, Bob Harris wrote: Synopsis: I have a contrived example with a short genome and some perfect reads sampled from that genome. I'm using the relative profile feature to get counts of read kmers across the genome. The results are mostly 1s with a few short stretches of higher counts. The 1s conflict with the result I expect. Details: The example can be found at https://github.com/rsharris/fake_reads_2021. apple.fa is a random genome of length 3K. orange.fa is 80 fake reads, each 300 bp, sampled from apple with no sequencing error, at random positions and with random orientation. Average depth ≈8X. Running this |FastK -k40 -t1 apple.fa| |FastK -k40 -t1 orange.fa| |FastK -k40 -t1 -p:orange.ktab apple.fa| |Profex apple.prof 1 > orange_on_apple.profile| I expected to get a count profile that roughly matched the coverage profile I would get from piling up orange to apple alignments (realizing that the 40-mers absent at the end of the reads would reduce avg count to ≈7X). (orange_on_apple.profile is in that same repo) Instead, I get mostly 1s with occasional short bursts of higher numbers. I'm at a loss to understand this. Am I missing something in how I am using the tools? I know the motivation for profiles was to profile along reads, not along genomes, so maybe there's a difference I'm not understanding. I wonder if the fact that the kmer counts in apple itself are all 1 is relevant. FWIW I'm using a fetch of the repo from about an hour ago. — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#3>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABUSINXDZAHQWSZZRD2AJZDTGS62VANCNFSM42HT5OVA>.

Bob Harris · Answer 2 · Fri Apr 02 2021 07:32:42 GMT+0800 (China Standard Time)

Oh yeah, I see what you mean about duplicates in the kmer table for orange. Very strange.

I'm going to try the same test on a couple different architectures, on the outside change that it's platform-dependent.

Bob Harris · Answer 3 · Fri Apr 02 2021 07:52:18 GMT+0800 (China Standard Time)

I've tried it on three machines, all of which have duplicates in the orange kmer table.

One machine on campus running some version of linux. And two macs here at home. I can send you architecture details, but my guess about platform differences looks to be irrelevant.

Bob Harris · Answer 4 · Fri Apr 02 2021 08:10:09 GMT+0800 (China Standard Time)

I tried a few similar datasets with various sizes. All with same result UNTIL I happened to try FastK -k40 -t2 orange.fa instead of FastK -k40 -t1 orange.fa. The table for -t2 looks more reasonable — by eyeball I didn't notice any duplicated kmers. But the histogram still seems to have too many for freq=1 (even though it shows a distribution that otherwise makes more sense than t=1 did, with a nice peak at freq=7).

I'm probably chasing after cats that are of no use to you, so I'll stop with that one.

Eugene W Myers Jr · Answer 5 · Fri Apr 02 2021 14:10:56 GMT+0800 (China Standard Time)

Its early morning here in Germany. It is a bug I unfortunately introduced last Saturday into the
core sorting routine. I've found it (a one character mistake (again) :-)) and am doing a little more
testing before I check the fix in.

Richard Durbin · Answer 6 · Fri Apr 02 2021 18:11:21 GMT+0800 (China Standard Time)

Let me know when you have checked in the fix. I am planning to do a whole set of FastK runs today and I want them to be correct.

…

On 2 Apr 2021, at 07:11, Eugene W Myers Jr ***@***.***> wrote: Its early morning here in Germany. It is a bug I unfortunately introduced last Saturday into the core sorting routine. I've found it (a one character mistake (again) :-)) and am doing a little more testing before I check the fix in. — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#3 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AA2FXZUICZ4QNGAXJTI7JSTTGVNXZANCNFSM42HT5OVA>.

Eugene W Myers Jr · Answer 7 · Fri Apr 02 2021 21:45:11 GMT+0800 (China Standard Time)

@richarddurbin @rsharris

I just checked in the fixed version of FastK. I'm so sorry, I inadvertently screwed up the core MSD sort
in the trunk version while working on trying to get more speed. And I did this on Saturday the day after
my talk to the VGP gang. So I'm going to also send an e-mail to the group to let them know as there has
been a bump in downloads since the talk. Sheisse.

Anyhow its fixed now (and a little faster than it was before :-) ).

Thanks Bob for catching this.