knmkr / genozip

Compressor for genomic files (VCF/BCF, SAM/BAM, fastq, fasta, GVF, 23andMe), up to 5x better than gzip and faster too

genozip



(also available on Conda and Docker Hub)

genozip is a compressor for genomic files: it compresses VCF/BCF, SAM/BAM, fastq, fasta, GVF and 23andMe files. It can even compress files that are already compressed with .gz, .bz2 or .xz (for the full list of supported file types, see 'genozip --input --help').

It achieves compression ratios 2x to 5x better than gzip by exploiting properties specific to genomic data. It is also considerably faster than gzip.

The compression is lossless - the decompressed file is 100% identical to the original file.

The command line options are similar to those of gzip and bcftools, so users familiar with those tools should find that genozip works in much the same way. To get started, try: genozip --help

Commands:
genozip - compress one or more files
genounzip - decompress one or more files
genols - show metadata of one or more files or the entire directory
genocat - view one or more files

Some advanced options:

Lookups:
genocat -r ^Y,MT file1.vcf -- displays all chromosomes except Y and MT
genocat -r -10000 file1.vcf -- displays positions up to 10000
genocat -s SMPL1,SMPL2 file1.vcf -- displays only samples SMPL1 and SMPL2
Note: there is no need for a separate indexing step or index file
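
The lookups above can also be driven from a script. The following is a minimal Python sketch, assuming genocat is installed and on the PATH. The helper names genocat_args and run_genocat are hypothetical (they are not part of genozip), and only the -r and -s options shown above are used.

```python
import shutil
import subprocess


def genocat_args(path, regions=None, samples=None):
    """Build a genocat argument list (hypothetical helper, not part of genozip).

    Only the -r (regions) and -s (samples) options from the README are used.
    """
    args = ["genocat"]
    if regions:
        # Passed through unchanged, e.g. "^Y,MT" or "-10000"
        args += ["-r", regions]
    if samples:
        # Sample names are joined with commas, e.g. "SMPL1,SMPL2"
        args += ["-s", ",".join(samples)]
    args.append(path)
    return args


def run_genocat(path, regions=None, samples=None):
    """Run genocat and return its standard output as text."""
    if shutil.which("genocat") is None:
        raise FileNotFoundError("genocat not found on PATH")
    result = subprocess.run(
        genocat_args(path, regions, samples),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

For example, genocat_args("file1.vcf", regions="^Y,MT") yields the same command line as the first lookup above.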

Concatenating & splitting:
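genozip file1.vcf file2.vcf -o concat.vcf.genozip -- concatenates both files into one compressed file
genounzip concat.vcf.genozip -O -- splits the concatenated file back into its original components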

Calculating the MD5:
genozip file.vcf --md5

Encryption:
genozip file.vcf --password abc

Even better compression, with some minor modifications of the data:
genozip file.vcf --optimize

Compress and then verify that the compressed file decompresses correctly:
genozip file.vcf --test
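
As an independent cross-check of a lossless round trip, the original file and the decompressed file can be hashed and compared outside genozip. This is a minimal sketch using Python's standard hashlib module; the function names md5_of_file and same_content are hypothetical and not part of genozip, and it assumes the file has already been decompressed to a separate path.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large genomic files need not fit in memory
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(original_path, restored_path):
    """Return True if both files have the same MD5 digest."""
    return md5_of_file(original_path) == md5_of_file(restored_path)
```

A call such as same_content("file.vcf", "restored/file.vcf") would then report whether the two copies match; the paths here are placeholders.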

Do you find genozip helpful in your research? Please support continued development by citing: https://doi.org/10.1093/bioinformatics/btaa290
Feature requests and bug reports: bugs@genozip.com

genozip is free for non-commercial use. For a commercial license, please contact sales@genozip.com

Usage is subject to terms and conditions. The non-commercial license can be viewed with genozip --license

THIS SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE AND NON-INFRINGEMENT. IN NO EVENT SHALL THE COPYRIGHT HOLDERS OR ANYONE DISTRIBUTING THE SOFTWARE BE LIABLE FOR ANY DAMAGES OR OTHER LIABILITY, WHETHER IN CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
