bench_leon

Comparison between Leon and Gzip for FastQ compression

Just a little story

In the 1970s, Sanger and colleagues, and Maxam and Gilbert, developed rapid methods to sequence DNA. Twenty years later, Sanger sequencing had become the standard approach and enabled the first whole-genome sequencing of an organism, Haemophilus influenzae, in 1995. In 2004, almost thirty years after Sanger developed his method, the Human Genome Project finished sequencing the entire human genome for the first time. Since 2004, sequencing methods have changed and Next Generation Sequencing (NGS) has emerged. In approximately ten years, the cost and time needed to sequence a whole human genome have decreased considerably. NGS technology makes it possible to routinely sequence a large number of samples, so the amount of data generated by NGS has increased dramatically over the last decade, and the storage and transmission of these data are now a major concern.

Graph from the SRA (http://www.ncbi.nlm.nih.gov/Traces/sra/), retrieved 2016-08-08

What is currently done?

GZIP

Currently, the common way to compress these data is the GZIP format. GZIP is based on the DEFLATE algorithm, which combines the LZ77 algorithm with Huffman coding. This algorithm was developed to compress text data, that is, data with a large set of characters.
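
As a side note, here is a minimal sketch of driving this same DEFLATE compression from Perl (the language of our benchmark scripts), using the core IO::Compress::Gzip module; the input file name is only a placeholder.

    #!/usr/bin/env perl
    # Minimal sketch: gzip a FastQ file with the core IO::Compress::Gzip
    # module, i.e. the same DEFLATE (LZ77 + Huffman coding) used by the
    # gzip command line. 'reads.fastq' is a placeholder name.
    use strict;
    use warnings;
    use IO::Compress::Gzip qw(gzip $GzipError);

    my $in  = shift // 'reads.fastq';
    my $out = "$in.gz";

    gzip $in => $out, Level => 9      # Level => 9: favour ratio over speed
        or die "gzip failed: $GzipError\n";

    printf "%s: %d bytes -> %s: %d bytes\n", $in, -s $in, $out, -s $out;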

What is LEON ?

LEON is a new software tool for compressing NGS data (FastA and FastQ). The approach shares some similarities with methods that use a reference genome. The particularity of LEON is that this reference is built de novo as a de Bruijn graph whose nodes are k-mers. Since the de Bruijn graph must be stored with the compressed data, its size could be a problem. To work around this, the de Bruijn graph needs careful parametrization, and its implementation relies on a probabilistic data structure: built on Bloom filters, the de Bruijn graph is not exact, but it stores large data efficiently.
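
To give an intuition of this probabilistic structure, here is a toy Perl sketch of a Bloom filter used as a k-mer membership index. The filter size, the four hash functions derived from MD5 and the value k = 31 are illustrative assumptions, not LEON's actual parameters.

    #!/usr/bin/env perl
    # Toy Bloom filter for k-mer membership: a bit vector plus several
    # hash functions. Lookups can return false positives but never false
    # negatives, which is why the structure is inexact yet very compact.
    use strict;
    use warnings;
    use Digest::MD5 qw(md5);

    my $BITS   = 8 * 1024 * 1024;        # 8 Mbit bit vector (illustrative)
    my $filter = "\0" x ($BITS / 8);

    # Derive four bit positions from the four 32-bit words of one MD5 digest.
    sub positions {
        my ($kmer) = @_;
        return map { $_ % $BITS } unpack 'N4', md5($kmer);
    }

    # Insert a k-mer: set all of its bits.
    sub add {
        my ($kmer) = @_;
        vec($filter, $_, 1) = 1 for positions($kmer);
    }

    # Query a k-mer: present only if every one of its bits is set.
    sub maybe_present {
        my ($kmer) = @_;
        for my $pos (positions($kmer)) {
            return 0 unless vec($filter, $pos, 1);
        }
        return 1;
    }

    my $k    = 31;                       # illustrative k-mer size
    my $read = 'ACGT' x 20;              # 80 bp toy read
    add(substr $read, $_, $k) for 0 .. length($read) - $k;

    print maybe_present('ACGT' x 7 . 'ACG') ? "probably present\n"
                                            : "definitely absent\n";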

LEON method overview (from: Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph)

Comparison of FastQ compression by Gzip and LEON

With this little magic script we produce some graphs to compare the efficiency of GZIP and LEON. To compare the two tools, we look at the overall compression ratio, the compression ratio as a function of the initial FastQ size, and the compression/decompression time. We use FastQ files of human data, with sizes between 100 MB and 26 GB.
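
For reference, here is a minimal sketch of such a benchmark loop in Perl. The leon options (-file, -c, -lossless) follow the LEON documentation, but they and the output file names are assumptions to check against your installed version; the ratio is reported as the percentage of space saved.

    #!/usr/bin/env perl
    # Benchmark sketch: time gzip and leon on each FastQ given on the
    # command line and report the percentage of space saved. Assumes
    # both tools are on the PATH; the leon flags and output names are
    # assumptions, not verified against every version.
    use strict;
    use warnings;
    use Time::HiRes qw(time);

    for my $fastq (@ARGV) {
        my $orig = -s $fastq or die "cannot stat $fastq\n";

        for my $job (["gzip -k -f $fastq",              "$fastq.gz"],
                     ["leon -file $fastq -c -lossless", "$fastq.leon"]) {
            my ($cmd, $out) = @$job;
            my $t0 = time;
            system($cmd) == 0 or die "command failed: $cmd\n";
            # Compression ratio = space saved relative to the original.
            printf "%-45s %6.2f%% saved in %.1f s\n",
                   $cmd, 100 * (1 - (-s $out) / $orig), time - $t0;
        }
    }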

We can see that the compression ratio of LEON in lossy mode (red) lies between 90 and 95% (i.e. the compressed file takes only 5–10% of the original space), regardless of the size of the FastQ.

Citations

  • International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature 431, 931–945. issn: 1476-4687 (Oct. 2004).
  • Fleischmann, R. D. et al. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science (New York, N.Y.) 269, 496–512. issn: 0036-8075 (July 1995).
  • Sanger, F., Nicklen, S. & Coulson, A. R. DNA sequencing with chain-terminating inhibitors. Proceedings of the National Academy of Sciences of the United States of America 74, 5463–5467. issn: 0027-8424 (Dec. 1977).
  • Zhang, Y. et al. Light-weight reference-based compression of FASTQ data. BMC bioinformatics 16, 188. issn: 1471-2105 (2015).
  • Benoit, G. et al. Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph. BMC bioinformatics 16, 288. issn: 1471-2105 (2015).
  • Van Dijk, E. L., Auger, H., Jaszczyszyn, Y. & Thermes, C. Ten years of next-generation sequencing technology. Trends in genetics: TIG 30, 418–426. issn: 0168-9525 (Sept. 2014).

Boxplot comparing the compression ratios of gzip and LEON with different options

Compression ratio as a function of the original FastQ file size

Compression time as a function of the original FastQ file size
