kedartatwawadi / readcompression

Fastq compression

readcompression

HARC - Tool for compression of genomic reads in FASTQ format. Compresses only the read sequences, achieving near-optimal compression ratios and fast decompression. Supports up to 4.29 billion fixed-length reads with lengths of at most 256. Requires around 50 bytes of RAM per read for read length 100. Requires g++ with C++11 support and works on Linux.

Installation

git clone https://github.com/shubhamchandak94/readcompression.git
cd readcompression
./install.sh

Usage

Compression - compresses FASTQ reads. Output written to .tar file

./run_default.sh -c PATH_TO_FASTQ [-p] [-t NUM_THREADS]

-p = Preserve order of reads (compression ratio 2-4x worse if order preserved)

-t NUM_THREADS - default 8

Decompression - decompresses reads. Output written to .dna.d file

./run_default.sh -d PATH_TO_TAR [-p] [-t NUM_THREADS]

-p = Get reads in original order (slower). Only applicable if -p was used during compression.

-t NUM_THREADS - default 8

Help (this message)

./run_default.sh -h
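Without -p, decompression returns the reads in a different order, so a plain diff against the original FASTQ will fail even though no sequence was lost. One way to check losslessness is to compare the sorted multiset of sequences on both sides. A minimal sketch, using small synthetic files in place of an actual FASTQ and the tool's .dna.d output (all file names here are illustrative):

```shell
#!/bin/sh
set -e

# Synthetic stand-ins: an original FASTQ and a reordered .dna.d file,
# as the decompressor would produce without -p (names are illustrative).
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' > reads.fastq
printf 'TTGA\nACGT\n' > reads.dna.d

# Extract the sequence lines (2nd line of every 4-line FASTQ record),
# sort both sides, and compare them as multisets.
awk 'NR % 4 == 2' reads.fastq | sort > orig.sorted
sort reads.dna.d > dec.sorted
if cmp -s orig.sorted dec.sorted; then
    echo "lossless"
else
    echo "MISMATCH"
fi
```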
Downloading datasets
Usual reads
wget -b ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR065/SRR065390/SRR065390_1.fastq.gz
wget -b ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR065/SRR065390/SRR065390_2.fastq.gz
gunzip SRR065390_1.fastq.gz SRR065390_2.fastq.gz
cat SRR065390_1.fastq SRR065390_2.fastq > SRR065390.fastq

Note that wget -b downloads in the background; wait for both downloads to finish before running gunzip.
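After concatenating, a quick sanity check is that the combined file has a line count divisible by 4 (one 4-line record per read). A sketch using a small synthetic file in place of SRR065390.fastq:

```shell
#!/bin/sh
set -e
# Synthetic two-record FASTQ standing in for the concatenated file.
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' > sample.fastq

# A well-formed FASTQ has 4 lines per read.
lines=$(wc -l < sample.fastq)
[ $((lines % 4)) -eq 0 ] && echo "$((lines / 4)) reads"
```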

For some datasets (e.g., SRR327342 and SRR870667), the two FASTQ files may contain reads of different lengths.

Metagenomics data
wget -b http://public.genomics.org.cn/BGI/gutmeta/High_quality_reads/MH0001/081026/MH0001_081026_clean.1.fq.gz
wget -b http://public.genomics.org.cn/BGI/gutmeta/High_quality_reads/MH0001/081026/MH0001_081026_clean.2.fq.gz
gunzip MH0001_081026_clean.1.fq.gz MH0001_081026_clean.2.fq.gz
cat MH0001_081026_clean.1.fq MH0001_081026_clean.2.fq > MH0001_081026_clean.fq
Human genome (hg19 - for generating simulated reads)
wget -b ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz

Within this file, the individual chromosomes are demarcated by FASTA headers.

Generating reads using gen_fastq (from orcom repo)
Error-free reads (without reverse complementation) - 35M reads of length 100 from chrom 22
cd util/gen_fastq_noRC
make
./gen_fastq_noRC 35000000 100 PATH/chrom22.fasta PATH/chrom22_reads.fastq
Reads with a 1% uniform substitution rate (each substituted base is equally likely to become any of the four other symbols, e.g., A can become C, T, G, or N) (without reverse complementation) - 35M reads of length 100 from chrom 22
./gen_fastq_noRC 35000000 100 PATH/chrom22.fasta PATH/chrom22_reads.fastq -e
Typical fastq format
@seq id
read
+
quality score
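Since HARC compresses only the read sequences, it can be useful to extract just the sequence lines from a FASTQ file, i.e., the second line of each 4-line record shown above. A sketch using a small synthetic file (the file names are placeholders):

```shell
#!/bin/sh
set -e
# Synthetic FASTQ in the 4-line format shown above (names illustrative).
printf '@seq1\nACGTACGT\n+\nIIIIIIII\n@seq2\nGGCCTTAA\n+\nIIIIIIII\n' > toy.fastq

# The read sequence is the 2nd line of every 4-line record.
awk 'NR % 4 == 2' toy.fastq > toy.dna
cat toy.dna
```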

Other compressors (for evaluation)

Installing and running orcom (Boost must be installed)
git clone https://github.com/lrog/orcom.git
cd orcom
make boost
cd bin
./orcom_bin e -iPATH/SRR065390_clean.fastq -oPATH/SRR065390_clean.bin
./orcom_pack e -iPATH/SRR065390_clean.bin -oPATH/SRR065390_clean.orcom
Getting orcom ordered file
./orcom_pack d -iPATH/SRR065390_clean.orcom -oPATH/SRR065390_clean.dna

The dna file is of the form:

read1
read2
..
Installing and running Leon (seq-only mode)
git clone https://github.com/GATB/leon.git
cd leon
sh INSTALL
./leon -file PATH_TO_FASTQ -c -seq-only -nb-cores NUM_THR
Leon Decompression (order preserved)
./leon -file PATH_TO_LEON -d -nb-cores NUM_THR

The fasta.d file is of the form:

>1
read1
>2
read2
..
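To compare Leon's output against sequence-only files (such as orcom's .dna or HARC's .dna.d), the >N header lines can be stripped, leaving one read per line. A sketch on a small synthetic fasta.d file (names are illustrative):

```shell
#!/bin/sh
set -e
# Synthetic fasta.d in the form shown above (headers are >1, >2, ...).
printf '>1\nACGT\n>2\nTTGA\n' > toy.fasta.d

# Drop the header lines to obtain one read per line.
grep -v '^>' toy.fasta.d > toy.dna
cat toy.dna
```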
