jaudoux / kamix

Index and query k-mer matrices in BGZF

kamix - Index and query k-mer matrices stored as BGZF files.

kamix leverages the BGZF library from samtools to provide random access into k-mer matrix files that have been compressed with bgzip.

kamix expects k-mer matrices compressed with bgzip. In the matrix, the first line is a header naming the samples, and the k-mers must be sorted in lexicographic order.

Example:

tag	SRR2966453	SRR2966456	SRR2966471	SRR2966474	SRR2966454	SRR2966457	SRR2966472	SRR2966475	SRR2966455	SRR2966458	SRR2966473	SRR2966476
AAAAAATGTTTTGTAAGAAT	3	7	4	0	3	0	0	8	0	5	0	6
AAAAAATGTTTTGTAAGGAC	3	0	0	0	0	0	0	0	0	0	0	0
AAAAAATGTTTTGTAATTGA	5	5	5	7	3	8	4	5	6	0	3	9
AAAAAATGTTTTGTACAAAA	8	6	6	4	5	0	6	6	7	0	6	4
AAAAAATGTTTTGTAGAAAC	0	3	0	0	0	0	0	0	0	0	0	0
AAAAAATGTTTTGTAGAAAT	0	0	6	8	0	0	3	5	0	5	6	0
AAAAAATGTTTTGTAGACAT	6	4	0	6	3	0	0	0	3	0	0	0
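
The layout above implies three properties that can be verified before compression: every row has as many columns as the header, every k-mer has the same length, and the k-mers are in lexicographic order. The following is a minimal sketch and not part of kamix. `check_matrix` is a hypothetical helper name; it assumes a tab-separated file with exactly one header line, and it assumes that byte-wise (C locale) order is the ordering kamix expects.

```shell
# Hypothetical helper (not shipped with kamix): sanity-check a plain-text
# k-mer matrix. Returns 0 if the file looks consistent, 1 otherwise.
check_matrix() {
    # Column count must match the header; k-mer length must match the first row.
    awk -F '\t' '
        NR == 1 { ncol = NF; next }
        NR == 2 { k = length($1) }
        NF != ncol      { printf("line %d: expected %d columns, found %d\n", NR, ncol, NF) > "/dev/stderr"; bad = 1 }
        length($1) != k { printf("line %d: k-mer length is not %d\n", NR, k) > "/dev/stderr"; bad = 1 }
        END { exit bad + 0 }
    ' "$1" || return 1
    # Skip the header, keep the k-mer column, and let sort verify the order
    # (assumption: byte-wise order is what "lexicographic" means here).
    tail -n +2 "$1" | cut -f 1 | LC_ALL=C sort -c 2>/dev/null || {
        echo "k-mers are not in lexicographic order" >&2
        return 1
    }
}
```

A typical call would be `check_matrix counts-matrix.tsv && bgzip counts-matrix.tsv`, so that only a consistent file gets compressed.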

Examples

# 1. Compress the k-mer matrix with bgzip
bgzip counts-matrix.tsv

# 2. Create a kamix index for the compressed matrix
kamix index counts-matrix.tsv.gz

# 3. Query a k-mer
kamix query counts-matrix.tsv.gz AAAAAAAAAAGGCTAAACAT

# 4. Query a sequence (it will be split into k-mers)
kamix query counts-matrix.tsv.gz TGCTGAGCTGGATCGAAACGCTAGCCCCATGTAAAAAGGCTAAACAT

# Query many k-mers from a file containing one k-mer per line
cat kmers.txt | xargs kamix query counts-matrix.tsv.gz

# Extract 100 random k-mers
kamix random counts-matrix.tsv.gz 100

# Is the file bgzipped?
kamix check counts-matrix.tsv.gz

# Get the total number of lines in the file (excluding the header)
kamix size counts-matrix.tsv.gz
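
Step 4 above passes a sequence longer than k and relies on kamix to split it into k-mers. To illustrate what such a split looks like, here is a minimal sketch, assuming overlapping windows that advance one base at a time. `split_kmers` is a hypothetical helper, not a kamix command, and kamix's own splitting (window step, handling of reverse complements) may differ.

```shell
# Hypothetical helper: print every overlapping k-mer of a sequence,
# one per line. Arguments: sequence, k.
split_kmers() {
    awk -v seq="$1" -v k="$2" 'BEGIN {
        # awk strings are 1-indexed; the last window starts at length - k + 1.
        for (i = 1; i + k - 1 <= length(seq); i++)
            print substr(seq, i, k)
    }'
}
```

With the 47-base sequence from step 4 and k = 20, `split_kmers TGCTGAGCTGGATCGAAACGCTAGCCCCATGTAAAAAGGCTAAACAT 20` prints 28 k-mers, the last of which is the k-mer queried in step 3.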

Create a k-mer matrix with Jellyfish and joinCounts

Counting and sorting 32-mers with DSK or Jellyfish

  • with DSK:
dsk -file sample.fastq.gz
dsk2ascii -file sample.h5 -out >(sort -k 1) > counts.tsv
  • with Jellyfish:
jellyfish count -m 32 -s 10000 -o sample.jf <(zcat sample.fastq.gz)
jellyfish dump  -c sample.jf | sort -k 1 > counts.tsv

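Consider passing the -S option (memory buffer size, e.g. -S {resources.ram}G) and the --parallel option (number of threads, e.g. --parallel {threads}) to the sort command to speed up sorting of large k-mer libraries. The {resources.ram} and {threads} placeholders stand for values you supply yourself.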
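
As a sketch of how these sort options fit together, the wrapper below sorts a count file on its k-mer column. `sort_counts` is a hypothetical name, and the buffer size and thread count are placeholders to tune. It assumes GNU sort (for `--parallel`), and forcing the C locale is an assumption made here to get byte-wise ordering, which is usually also faster than locale-aware collation.

```shell
# Hypothetical wrapper around GNU sort for large "kmer count" files.
# Arguments: input file, buffer size (e.g. 4G), number of threads.
sort_counts() {
    # -k1,1 restricts the sort key to the k-mer column;
    # -S caps the in-memory buffer; --parallel sets the number of sort threads.
    LC_ALL=C sort -k1,1 -S "$2" --parallel="$3" "$1"
}
```

For example, `sort_counts raw-counts.tsv 4G 8 > counts.tsv` would sort with a 4 GiB buffer and 8 threads, where `raw-counts.tsv` stands for an unsorted dump.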

Join counts from multiple libraries with joinCounts

Download and install joinCounts to join the count files generated for each library.

joinCounts counts1.tsv counts2.tsv > counts-matrix.tsv
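
To make the join step concrete, the sketch below merges two sorted count files into one row per k-mer, writing 0 where a k-mer is absent from one library. It is an illustrative stand-in built on the standard `join` utility, not joinCounts itself, whose options and output (for example the header line) may differ. `join_two_counts` is a hypothetical name, and the inputs are assumed to be space-separated `kmer count` lines sorted in C-locale order.

```shell
# Illustrative stand-in for the join step (NOT joinCounts):
# full outer join of two sorted "kmer count" files on the k-mer column.
join_two_counts() {
    # -a 1 -a 2 keeps k-mers present in only one file;
    # -e 0 fills the missing count with 0;
    # -o 0,1.2,2.2 prints the k-mer, then the count from each file.
    LC_ALL=C join -a 1 -a 2 -e 0 -o 0,1.2,2.2 "$1" "$2" | tr ' ' '\t'
}
```

The output is tab-separated and has no header line, so a header such as the `tag` line in the example matrix would still need to be added before indexing.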

Credit

kamix is derived from grabix, which uses the BGZF library.

TODO

  • Speed up the random option for large files
  • Add a help message for each subcommand
  • Add an option to output JSON in kamix query
  • Add the version of kamix to the index
  • When indexing, check that the file is valid (same k-mer length, sorted, same number of samples)
  • Check whether the index is newer than the file
  • Check the header line
  • Check that the number of samples is the same in all lines
  • Run performance tests and adjust the chunk size
  • Add an option to kamix index to set the chunk size

About

License: MIT License


Languages

C 71.9%, C++ 23.0%, Shell 3.2%, Python 1.9%, Makefile 0.1%