Easy performance optimizations for MegaMash

Question

CamelCaseCam opened this issue 6 months ago

I'm currently working on a CUDA implementation for MegaMash, and as I'm re-implementing it I'm finding ways you could make it more efficient in Go. I'll throw them in this thread as I think of them.

Koeng101 · Answer 1 · Sun Feb 11 2024 06:06:45 GMT+0800 (China Standard Time)

I was thinking maybe a single map of kmer -> target would be most efficient. That would be the single best improvement, I think.
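For illustration, a minimal Go sketch of what a single kmer -> target map could look like. The names `buildKmerIndex`, `countHits` and `ambiguous` are hypothetical and not taken from the MegaMash code, and the sketch skips reverse-complement standardization for brevity. It assumes that k-mers shared between targets should be ignored.

```go
package main

import "fmt"

// ambiguous marks k-mers that occur in more than one target (hypothetical convention).
const ambiguous = -1

// buildKmerIndex maps each k-mer to the index of the single target it occurs in.
func buildKmerIndex(targets []string, k int) map[string]int {
	index := make(map[string]int)
	for i, seq := range targets {
		for j := 0; j+k <= len(seq); j++ {
			kmer := seq[j : j+k]
			if prev, seen := index[kmer]; seen && prev != i {
				index[kmer] = ambiguous // shared between targets, so not informative
				continue
			}
			index[kmer] = i
		}
	}
	return index
}

// countHits returns, per target, how many k-mers of the query map to it.
func countHits(index map[string]int, query string, k, nTargets int) []int {
	hits := make([]int, nTargets)
	for j := 0; j+k <= len(query); j++ {
		if t, ok := index[query[j:j+k]]; ok && t != ambiguous {
			hits[t]++
		}
	}
	return hits
}

func main() {
	targets := []string{"ACGTAC", "TTGACC"}
	index := buildKmerIndex(targets, 4)
	fmt.Println(countHits(index, "CGTACTT", 4, len(targets)))
}
```

With one map, each query k-mer costs a single lookup regardless of how many targets there are, instead of one lookup per target map.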

Cameron Kroll · Answer 2 · Sun Feb 11 2024 06:16:11 GMT+0800 (China Standard Time)

Idea 1: change how you generate standardized DNA sequences. Considering the numerical value of the whole string is kind of inefficient, but if that's how you want to do it, you should approach this differently.

Before taking the reverse complement of the sequence, loop through the string and add up both the value of the base and the value of the complement to the base at that position. Since addition is commutative, it doesn't matter that you haven't reversed it at this point.

If the non-complement value is lower, return the string as-is (saving two iterations of the string - since you don't have to calculate the reverse complement or sum it). Otherwise, calculate the complement and return it (still saves one iteration).

Cameron Kroll · Answer 3 · Sun Feb 11 2024 06:16:44 GMT+0800 (China Standard Time)

I forgot to send that earlier - the map idea seems interesting. Why do you have the multiple maps, anyways?

Koeng101 · Answer 4 · Sun Feb 11 2024 06:19:44 GMT+0800 (China Standard Time)

> I forgot to send that earlier - the map idea seems interesting. Why do you have the multiple maps, anyways?

It was easier to implement; I didn't know it would be a limiting factor.

> Since addition is commutative, it doesn't matter that you haven't reversed it at this point.

I would imagine AAAT and AATA would have the same additive effect in this case - or I might be reading this wrong.
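A small Go check of that concern. It assumes "the value of the base" means the byte value of each character; `baseSum` is a hypothetical helper, not something from the MegaMash code.

```go
package main

import "fmt"

// baseSum adds up the byte values of a sequence (assumed stand-in for "base value").
func baseSum(seq string) int {
	sum := 0
	for i := 0; i < len(seq); i++ {
		sum += int(seq[i])
	}
	return sum
}

func main() {
	a, b := "AAAT", "AATA"
	// Same bases in a different order give the same sum.
	fmt.Println(baseSum(a) == baseSum(b))
	// A lexicographic comparison still tells the two strings apart.
	fmt.Println(a < b)
}
```

Any sum over the bases ignores their order, so it cannot stand in for an alphabetical comparison.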

Cameron Kroll · Answer 5 · Sun Feb 11 2024 06:22:01 GMT+0800 (China Standard Time)

> I would imagine AAAT and AATA would have the same additive effect in this case - or I might be reading this wrong

Yes, but isn't that already the case? (I'm assuming "alphabetically lesser string" sums up the string and returns the one with the overall lesser value)

As an aside, if this is how it does it, can't we just compare the first base instead? The way I'm reading the code, it doesn't actually matter if the string is alphabetically lesser. It just needs to be a single deterministic representation.

Koeng101 · Answer 6 · Sun Feb 11 2024 06:23:55 GMT+0800 (China Standard Time)

> Yes, but isn't that already the case? (I'm assuming "alphabetically lesser string" sums up the string and returns the one with the overall lesser value)

No, that is not the case. It does not sum the string and return the one with the lesser value. It compares the two strings alphabetically and returns the lesser one.
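As a sketch of that behaviour in Go: pick whichever of the sequence and its reverse complement sorts first. The function names are hypothetical and only the bases A, C, G and T are handled; this is not the actual MegaMash implementation.

```go
package main

import "fmt"

// complement maps each base to its Watson-Crick partner (A, C, G, T only).
var complement = map[byte]byte{'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}

// reverseComplement reverses the sequence and complements every base.
func reverseComplement(seq string) string {
	out := make([]byte, len(seq))
	for i := 0; i < len(seq); i++ {
		out[len(seq)-1-i] = complement[seq[i]]
	}
	return string(out)
}

// canonical returns the alphabetically lesser of seq and its reverse complement.
func canonical(seq string) string {
	rc := reverseComplement(seq)
	if rc < seq {
		return rc
	}
	return seq
}

func main() {
	// A sequence and its reverse complement map to the same representation.
	fmt.Println(canonical("TTGA"), canonical("TCAA"))
}
```

Go's string comparison stops at the first differing byte, so in many cases it does not need to scan the whole string.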

Cameron Kroll · Answer 7 · Sun Feb 11 2024 06:26:28 GMT+0800 (China Standard Time)

Oh, so it already does exactly what I was suggesting. Ignore my suggestion, then. The kmer -> target map seems like a good idea. I'm going to think about an efficient way to implement it in CUDA.

Koeng101 · Answer 8 · Sun Feb 11 2024 06:30:43 GMT+0800 (China Standard Time)

> The kmer -> target map seems like a good idea. I'm going to think about an efficient way to implement it in CUDA

Would love to learn about what you come up with! Would be great to have a fast megamash algorithm. Thinking more about the minimizers from minimap2, it might be useful to limit the quantity of possible matches for a given sequence.
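A rough Go sketch of the minimizer idea, for readers unfamiliar with it: in every window of `w` consecutive k-mers, keep only the smallest one. This version orders k-mers lexicographically for simplicity; as far as I know minimap2 orders them by a hash instead, so treat this as an illustration of the concept rather than of minimap2 itself. The function name and parameters are assumptions.

```go
package main

import "fmt"

// minimizers returns the start positions of the lexicographically smallest
// k-mer in each window of w consecutive k-mers, skipping consecutive repeats.
// Sequences with fewer than w k-mers yield no positions.
func minimizers(seq string, k, w int) []int {
	var positions []int
	last := -1
	numKmers := len(seq) - k + 1
	for start := 0; start+w <= numKmers; start++ {
		best := start
		for j := start + 1; j < start+w; j++ {
			if seq[j:j+k] < seq[best:best+k] {
				best = j
			}
		}
		if best != last {
			positions = append(positions, best)
			last = best
		}
	}
	return positions
}

func main() {
	fmt.Println(minimizers("GATTACA", 3, 3))
}
```

Only the selected k-mers would need to be stored or looked up, which is how this could limit the number of possible matches per sequence.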

Cameron Kroll · Answer 9 · Sun Feb 11 2024 06:41:38 GMT+0800 (China Standard Time)

One question: if you have, say, N kmers and you're checking them against a bunch of sequences, do you care about which sequence each kmer is in? If you only care whether the kmer was found or not, this would be a lot easier because it'd mean I don't have to care about race conditions.

Koeng101 · Answer 10 · Sun Feb 11 2024 07:05:23 GMT+0800 (China Standard Time)

> if you have say N kmers and you're checking them against a bunch of sequences, do you care about which sequence each kmer is in?

Yes, you do. This is because you're trying to link a sequence's kmers to a set of unique kmers of the sequences that you're searching against. I'm curious what would start a race condition, though.

Cameron Kroll · Answer 11 · Sun Feb 11 2024 07:40:44 GMT+0800 (China Standard Time)

Based on your code, it looks like you're checking what fraction of the kmers are present in each sequence, right? When you compile code to increment a variable, it doesn't end up being a single instruction. Each thread needs to grab the value of the variable from memory, increment it, and write it back to memory. If multiple threads do this at the same time, they can overwrite each other's updates and you'll lose some fraction of the kmers that were actually there.

There's definitely a way I could implement this to avoid the race condition, but it'll take some thinking.
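To illustrate the lost-update problem and one standard way around it, here is a minimal Go sketch using goroutines instead of GPU threads. It replaces the plain increment with an atomic add; CUDA offers `atomicAdd` for the same purpose. This is not necessarily the approach the CUDA port ends up using, and `countMatches` is a hypothetical name.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// countMatches has several goroutines bump one shared counter.
// With a plain found++ the read, add and write could interleave and lose updates.
func countMatches(workers, perWorker int) int64 {
	var found int64
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				atomic.AddInt64(&found, 1) // indivisible read-modify-write
			}
		}()
	}
	wg.Wait()
	return found
}

func main() {
	fmt.Println(countMatches(8, 1000))
}
```

Atomics keep the count correct but serialize access to the counter, so giving each thread its own output slot and combining afterwards is often the faster design.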

Koeng101 · Answer 12 · Sun Feb 11 2024 08:18:20 GMT+0800 (China Standard Time)

That is true. The incrementing at the end can cause some problems...

Cameron Kroll · Answer 13 · Fri Feb 16 2024 02:52:32 GMT+0800 (China Standard Time)

Hey - I've done some thinking and I've got a plan for how to implement this in CUDA. Let me know if you have any issues with it.

The plan:

1. Implement MegaMashMap as an array of sequences and associated k-mer bloom filters. The bloom filter method introduces a (quantifiable) source of error, but it'll make it much, much easier to transfer the maps to shared memory, which we want since shared memory is fast on-chip memory.
2. Each block in the kernel is assigned to a specific sequence from the MegaMashMap. The bloom filter is copied into shared memory and each thread compares all the k-mer hashes from some sequence in the input to the bloom filter to see if they're present, subject to the filter's false-positive rate.
3. Each thread in the block writes the fraction of found hashes for its sequence.

This works well for up to 256 input sequences, and for more than that I'd just run the function multiple times. What are your thoughts? Especially on the bloom filter, since I know you avoided that when writing the algorithm.
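For reference, a minimal CPU-side Go sketch of a k-mer bloom filter and the "fraction of found hashes" computation. The type and function names, the FNV-based double hashing and the filter size are all assumptions made for illustration; the CUDA version described above would lay this out differently.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bloom is a fixed-size bit array probed at `hashes` positions per k-mer.
type bloom struct {
	bits   []uint64
	nBits  uint64
	hashes uint64
}

func newBloom(nBits, hashes uint64) *bloom {
	return &bloom{bits: make([]uint64, (nBits+63)/64), nBits: nBits, hashes: hashes}
}

// positions derives the probe positions by double hashing: h1 + i*h2.
func (b *bloom) positions(kmer string) []uint64 {
	h := fnv.New64a()
	h.Write([]byte(kmer))
	sum := h.Sum64()
	h1, h2 := sum&0xffffffff, (sum>>32)|1
	out := make([]uint64, b.hashes)
	for i := uint64(0); i < b.hashes; i++ {
		out[i] = (h1 + i*h2) % b.nBits
	}
	return out
}

func (b *bloom) add(kmer string) {
	for _, p := range b.positions(kmer) {
		b.bits[p/64] |= uint64(1) << (p % 64)
	}
}

// mayContain can return false positives but never false negatives.
func (b *bloom) mayContain(kmer string) bool {
	for _, p := range b.positions(kmer) {
		if b.bits[p/64]&(uint64(1)<<(p%64)) == 0 {
			return false
		}
	}
	return true
}

// fractionFound is the share of the query's k-mers that the filter reports as present.
func fractionFound(b *bloom, query string, k int) float64 {
	total := len(query) - k + 1
	if total <= 0 {
		return 0
	}
	found := 0
	for i := 0; i < total; i++ {
		if b.mayContain(query[i : i+k]) {
			found++
		}
	}
	return float64(found) / float64(total)
}

func main() {
	target, k := "ACGTACGTTGCA", 4
	filter := newBloom(1024, 3)
	for i := 0; i+k <= len(target); i++ {
		filter.add(target[i : i+k])
	}
	fmt.Println(fractionFound(filter, "ACGTACGT", k))
}
```

Because the filter is a flat bit array of known size, it can be copied as one contiguous buffer, which is what makes it attractive for shared memory compared with a hash map.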

Koeng101 · Answer 14 · Fri Feb 16 2024 04:43:10 GMT+0800 (China Standard Time)

I'd like to know your thoughts on #64 first actually - there are certain conditions where megamash simply fails.

Koeng101 · Answer 15 · Sat Feb 17 2024 01:57:12 GMT+0800 (China Standard Time)

I think the problem is still the need for longer kmer pair interactions, i.e., in those conditions where megamash fails.

I think there is actually a different way of thinking about the algorithm - instead of comparing the target sequences to the reference kmers, compare the reference kmers to the target sequences. Each reference kmer would be a list of hash groups (usually of size 1, but sometimes of a larger size, to compensate for longer range interactions). This complicates things a little, because I think bloom filters fit worse (you can't do the upfront computation), but also allows for the long range interactions.

Thoughts?

Koeng101 · Answer 16 · Tue Feb 20 2024 12:25:38 GMT+0800 (China Standard Time)

Cont at #64