Koeng101 / dnadesign

A Go package for designing DNA.

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Easy performance optimizations for MegaMash

CamelCaseCam opened this issue · comments

I'm currently working on a CUDA implementation for MegaMash, and as I'm re-implementing it I'm finding ways you could make it more efficient in GO. I'll throw them in this thread as I think of them.

I was thinking maybe a single map would be most efficient of kmer -> target. That would be the single best improvement, I think.

Idea 1: change how you generate standardized DNA sequences. Considering the numerical value of the whole string is kind of inefficient, but if that's how you want to do it, you should approach this differently.

Before taking the reverse complement of the sequence, loop through the string and add up both the value of the base and the value of the complement to the base at that position. Since addition is commutative, it doesn't matter that you haven't reversed it at this point.

If the non-complement value is lower, return the string as-is (saving two iterations of the string - since you don't have to calculate the reverse complement or sum it). Otherwise, calculate the complement and return it (still saves one iteration).

I forgot to send that earlier - the map idea seems interesting. Why do you have the multiple maps, anyways?

I forgot to send that earlier - the map idea seems interesting. Why do you have the multiple maps, anyways?

Was easier to implement, didn't know it would be a limiting factor

Since addition is commutative, it doesn't matter that you haven't reversed it at this point.

I would imagine AAAT and AATA would have the same additive effect in this case - or I might be reading this wrong.

I would imagine AAAT and AATA would have the same additive effect in this case - or I might be reading this wrong

Yes, but isn't that already the case? (I'm assuming "alphabetically lesser string" sums up the string and returns the one with the overall lesser value)

As an aside, if this is how it does it, can't we just compare the first base instead? The way I'm reading the code, it doesn't actually matter if the string is alphabetically lesser. It just needs to be a single deterministic representation.

Yes, but isn't that already the case? (I'm assuming "alphabetically lesser string" sums up the string and returns the one with the overall lesser value)

No, that is not the case. It does not sum the string and return the one with the lesser value. It sorts alphabetically the two strings.

Oh, so it already does exactly what I was suggesting. Ignore my suggestion, then. The kmer -> target map seems like a good idea. I'm going to think about an efficient way to implement it in CUDA

The kmer -> target map seems like a good idea. I'm going to think about an efficient way to implement it in CUDA

Would love to learn about what you come up with! Would be great to have a fast megamash algorithm. Thinking more about the minimizers from minimap2, it might be useful to limit the quantity of possible matches for a given sequence.

One question: if you have say N kmers and you're checking them against a bunch of sequences, do you care about which sequence each kmer is in? If you only care if the kmer was found or not, this would be a lot easier because it'd mean I don't have to care about race conditions

if you have say N kmers and you're checking them against a bunch of sequences, do you care about which sequence each kmer is in?

Yes, you do. This is because you're trying to link a sequence's kmers to a set of unique kmers of sequences that you're searching upon. I'm curious what would start a race condition though

Based on your code, it looks like you're checking what fraction of the kmers are present in each sequence, right? When you compile code to increment a variable, it doesn't end up being a single instruction. Each thread needs to grab the value of the variable from memory, increment it, and write it back to memory. If multiple threads do this at the same time, they can overwrite it and you'll loose some fraction of the kmers that were actually there.

There's definitely a way I could implement this to avoid the race condition, but it'll take some thinking.

That is true. The incrementing at the end can cause some problems...

Hey - I've done some thinking and I've got a plan for how to implement this in CUDA. Let me know if you have any issues with it.

The plan:

  1. Implement MegaMashMap as an array of sequences and associated k-mer bloom filters. The bloom filter method introduces a (quantifiable) source of error, but it'll make it much, much easier to transfer the maps to shared memory, which we want since shared memory is on the GPU's cache.
  2. Each block in the kernel is assigned to a specific sequence from the MegaMashMap. The bloom filter is copied into shared memory and each thread compares all the k-mer hashes from some sequence in the input to the bloom filter to see if they're present, with an error rate
  3. Each thread in the block writes the fraction of found hashes for its sequence

This works well for up to 256 input sequences, and for more than that I'd just run the function multiple times. What are your thoughts? Especially on the bloom filter, since I know you avoided that when writing the algorithm.

I'd like to know your thoughts on #64 first actually - there are certain conditions where megamash simply fails.

I think the problem is still the fact that there is a need for longer kmer pair interactions. Ie, in those conditions where megamash fails.

I think there is actually a different way of thinking about the algorithm - instead of comparing the target sequences to the reference kmers, compare the reference kmers to the target sequences. Each reference kmer would be a list of hash groups (usually of size 1, but sometimes of a larger size, to compensate for longer range interactions). This complicates things a little, because I think bloom filters fit worse (you can't do the upfront computation), but also allows for the long range interactions.

Thoughts?

Cont at #64