CMU-SAFARI / BLEND

In the paper https://arxiv.org/abs/2112.08687 BLEND was tested on only one non-HiFi read dataset. That was a simulated read dataset for one of the smallest eukaryotic genomes — the genome of Saccharomyces cerevisiae.

To test how well BLEND performs on real (non-simulated) datasets of genomes which have more typical sizes, I used it to assemble genomes from these two sets of reads:

Caenorhabditis elegans, PacBio CLR reads used in the article https://www.sciencedirect.com/science/article/pii/S2589004220305770 . For polishing I also used Illumina reads from that article. The nematode genome size is approximately 100 Mbp.
Arabidopsis thaliana, Nanopore reads https://www.ncbi.nlm.nih.gov/sra/?term=ERR5530736 . For polishing I also used Illumina reads https://www.ncbi.nlm.nih.gov/sra/?term=ERR2173372 . The size of arabidopsis' genome is approximately 120 Mbp.

I searched for overlaps, then assembled the genomes with Miniasm using default parameters, then polished the assemblies using long reads with Racon, and then polished the assemblies using both long and short reads with HyPo. The assemblies were compared with references using QUAST.

The search for overlaps was performed with Blend 1.0 and, for comparison, with Minimap 2.22, using 22 threads of Intel Xeon X5670.

For the nematode, results are as follows:

	Minimap2	BLEND
Time to find overlaps	10m	3h 37m
Maximum RAM consumption	20G	44G
N50	2,056,511	1,915,190
NGA50	589,675	563,498
misassemblies	740	707
Genome fraction	99.692%	99.683%
Total length	109,516,352	108,958,103

So, the assemblies of the nematode genome made with Minimap2 and with BLEND are similar. However, Blend required 20x more time to find overlaps and 2x more RAM.

For arabidopsis Minimap found overlaps in 30 minutes using 29G RAM. I terminated BLEND because it didn't finish in 24 hours. At the moment I terminated it, BLEND was using 300G RAM.

So, it seems that on non-HiFi datasets for genomes not as small as the genome of Saccharomyces cerevisiae BLEND is slower than Minimap2 and uses more RAM. This may be so because BLEND doesn't deal efficiently with repetitive seeds.

Hi @shelkmike,

Thanks for testing BLEND out with different read sets. This is very much appreciated. I have several questions and comments:

Can you please provide me with the parameter settings that you used for BLEND and Minimap2?
I agree with your observation that you make regarding working with genomes with repetitive regions. I believe this can actually even be speculated from the results we show in our paper. For example, the HiFi human genome overlapping results do not provide the similar speedup that we observe when using E. coli reads. We have not included this observation in the paper as it is still an ongoing progress to be definite about this observation. We believe the main reason can be inferred from the following code snippet:

BLEND/src/seed.c

Lines 86 to 96 in 51c5e46

    
           //@IMPORTANT: We do not deal with high frequency kmers now because we may intentionally 
        
           //be generating those high frequency kmers due to the blend values 
        
           // if (dist > 0 && max_max_occ > max_occ){ 
        
           // 	// printf("%s1\n", __func__); 
        
           // 	mm_seed_select(n_m0, m, qlen, max_occ, max_max_occ, dist); 
        
           // } else{ 
        
           // 	// printf("%s2\n", __func__); 
        
           // 	for (i = 0; i < n_m0; ++i) 
        
           // 		if (m[i].n > max_occ) 
        
           // 			m[i].flt = 1; 
        
           // }

The above code is not commented out in the original Minimap2 implementation. It filters out the k-mers that occur with a high frequency. Perhaps this is mainly useful for such repetitive genomes. However, since we do not know yet how to efficiently setup such a filter without hurting the accuracy (because BLEND assigns the same hash value for potentially similar seeds and this may increase the number of seeds that fall into the same hash value bucket), we disabled this feature. We are likely to re-enable this feature if we can confidently filter out some seeds based on the number of their occurrence without hurting the accuracy. Once re-enabled, we expect a speedup especially for repetitive genomes as we can hopefully deal with repetitive seeds efficiently.

I cannot currently comment on the performance of the ONT reads. It is currently on my agenda to test ONT reads out and come up with a good preset options for ONT reads, too.

Best,

Can Firtina

Thank you for the answer. The commands for the reads of the nematode were

blend -x ava-pb -t 22 $long_reads $long_reads 2>blend_logs.txt | gzip -1 > overlaps_of_long_reads.paf.gz
minimap2 -x ava-pb -t 22 $long_reads $long_reads 2>minimap2_logs.txt | gzip -1 > overlaps_of_long_reads.paf.gz

For the reads of arabidopsis the commands were the same, except for "ava-ont" instead of "ava-pb".

I think that BLEND has good potential for improvement. Good luck with it!

	//@IMPORTANT: We do not deal with high frequency kmers now because we may intentionally
	//be generating those high frequency kmers due to the blend values
	// if (dist > 0 && max_max_occ > max_occ){
	// // printf("%s1\n", __func__);
	// mm_seed_select(n_m0, m, qlen, max_occ, max_max_occ, dist);
	// } else{
	// // printf("%s2\n", __func__);
	// for (i = 0; i < n_m0; ++i)
	// if (m[i].n > max_occ)
	// m[i].flt = 1;
	// }

A test of BLEND on two real datasets of PacBio CLR and Nanopore reads