thegenemyers / FASTK

A fast K-mer counter for high-fidelity shotgun datasets

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Unexpected relative profile results (possible user error)

rsharris opened this issue · comments

Synopsis: I have a contrived example with a short genome and some perfect reads sampled from that genome. I'm using the relative profile feature to get counts of read kmers across the genome. The results are mostly 1s with a few short stretches of higher counts. The 1s conflict with the result I expect.

Details:

The example can be found at https://github.com/rsharris/fake_reads_2021. apple.fa is a random genome of length 3K. orange.fa is 80 fake reads, each 300 bp, sampled from apple with no sequencing error, at random positions and with random orientation. Average depth ≈8X.

Running this
FastK -k40 -t1 apple.fa
FastK -k40 -t1 orange.fa
FastK -k40 -t1 -p:orange.ktab apple.fa
Profex apple.prof 1 > orange_on_apple.profile
I expected to get a count profile that roughly matched the coverage profile I would get from piling up orange to apple alignments (realizing that the 40-mers absent at the end of the reads would reduce avg count to ≈7X).

(orange_on_apple.profile is in that same repo)
Instead, I get mostly 1s with occasional short bursts of higher numbers. I'm at a loss to understand this. Am I missing something in how I am using the tools? I know the motivation for profiles was to profile along reads, not along genomes, so maybe there's a difference I'm not understanding.

I wonder if the fact that the kmer counts in apple itself are all 1 is relevant.

FWIW I'm using a fetch of the repo from about an hour ago.

Oh yeah, I see what you mean about duplicates in the kmer table for orange. Very strange.

I'm going to try the same test on a couple different architectures, on the outside change that it's platform-dependent.

I've tried it on three machines, all of which have duplicates in the orange kmer table.

One machine on campus running some version of linux. And two macs here at home. I can send you architecture details, but my guess about platform differences looks to be irrelevant.

I tried a few similar datasets with various sizes. All with same result UNTIL I happened to try FastK -k40 -t2 orange.fa instead of FastK -k40 -t1 orange.fa. The table for -t2 looks more reasonable — by eyeball I didn't notice any duplicated kmers. But the histogram still seems to have too many for freq=1 (even though it shows a distribution that otherwise makes more sense than t=1 did, with a nice peak at freq=7).

I'm probably chasing after cats that are of no use to you, so I'll stop with that one.

Its early morning here in Germany. It is a bug I unfortunately introduced last Saturday into the
core sorting routine. I've found it (a one character mistake (again) :-)) and am doing a little more
testing before I check the fix in.

@richarddurbin @rsharris

I just checked in the fixed version of FastK. I'm so sorry, I inadvertently screwed up the core MSD sort
in the trunk version while working on trying to get more speed. And I did this on Saturday the day after
my talk to the VGP gang. So I'm going to also send an e-mail to the group to let them know as there has
been a bump in downloads since the talk. Sheisse.

Anyhow its fixed now (and a little faster than it was before :-) ).

Thanks Bob for catching this.