brentp / hts-nim

nim wrapper for htslib for parsing genomics data files

Home Page: https://brentp.github.io/hts-nim/

fetch contigs and lengths

danielecook opened this issue · comments

Is there a way to fetch the list of chromosomes from the header of a VCF along with their lengths?

there is not currently, but I'd be open to adding this or getting a PR to implement it. there's a method in cyvcf2, seqnames, which checks the header first and then checks for an index and reads from there if it can. it returns only the names, but it can still be a starting point.

vcf.contigs is probably a better name

there's a seqlens too.

Ok - so after digging in a bit it looks like something more involved than this will be required:

var contigs = v.header.get("contig", BCF_HEADER_TYPE.BCF_HL_CTG)

I don't know C, but I might be able to make this work. Are you suggesting the C code be imported from cyvcf2? I can take a stab at it.

I will implement on monday. I just meant to use cyvcf2 as a guide for the C functions to use.

I started this in master. v.contigs will give you a seq of Contig objects, each with a name attribute.
there is also a length attribute, but it is not set (yet).
I'll have a look at that this week.

Wow, thank you! Yeah, no need to rush to do this or anything, I just wanted it so I could add some additional functions to my own little command-line utility.

this would probably be better solved by exposing more/better wrapping for the bcf_hrec_t structs and the header in general. i don't see an obvious, clean solution so i will let this marinate for a bit.

Ok that sounds good. I like how hash tables are returned with info columns but that doesn’t quite work with contigs because there are multiple... I’ll parse the header for now from the string into a list of hash tables.
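For anyone wanting the same stopgap, here is a minimal sketch of parsing the contig lines out of a header string into a seq of tables, using only the Nim standard library. The proc names `parseContigLine` and `parseContigs` are made up for illustration and are not part of hts-nim. It assumes the values contain no commas or quotes, which holds for the ID and length fields but not for every possible contig attribute.

```nim
import std/[strutils, tables]

proc parseContigLine(line: string): Table[string, string] =
  ## Parse one `##contig=<ID=...,length=...>` line into key/value pairs.
  ## Hypothetical helper; assumes values contain no commas or quotes.
  result = initTable[string, string]()
  const prefix = "##contig=<"
  if not line.startsWith(prefix) or not line.endsWith(">"):
    return
  # Drop the leading "##contig=<" and the trailing ">".
  let body = line[prefix.len .. ^2]
  for field in body.split(','):
    let kv = field.split('=', maxsplit = 1)
    if kv.len == 2:
      result[kv[0].strip()] = kv[1].strip()

proc parseContigs(header: string): seq[Table[string, string]] =
  ## Collect one table per contig line, in header order.
  for line in header.splitLines():
    let t = parseContigLine(line.strip())
    if t.len > 0:
      result.add t

when isMainModule:
  let header = """##fileformat=VCFv4.2
##contig=<ID=I,length=15072434>
##contig=<ID=MtDNA,length=13794>
"""
  for c in parseContigs(header):
    # Fall back to -1 when a contig line carries no length.
    echo c["ID"], "\t", parseBiggestInt(c.getOrDefault("length", "-1"))
```

Returning a seq of tables rather than a single table keeps the contigs in header order and avoids the problem of there being multiple contig lines.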

this is now implemented. if lengths are available in the header, they will be set on the contigs returned from vcf.contigs
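A rough usage sketch, assuming a recent hts-nim is installed and that `test.vcf.gz` (a placeholder path) exists in the working directory. It relies only on the `contigs` accessor and the `name` and `length` attributes described in this thread, plus hts-nim's `open` and `close` for VCF files.

```nim
import hts

var v: VCF
# "test.vcf.gz" is a placeholder; substitute your own file.
if not open(v, "test.vcf.gz"):
  quit "could not open VCF"

for c in v.contigs:
  # Print each contig name with the length recorded for it.
  echo c.name, "\t", c.length

v.close()
```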

Thank you! I will make use of this in seq-collection

test.vcf.gz

I am seeing -1 contig lengths:

Contig(name:"I", length:-1'i64)
Contig(name:"II", length:-1'i64)
Contig(name:"III", length:-1'i64)
Contig(name:"IV", length:-1'i64)
Contig(name:"V", length:-1'i64)
Contig(name:"X", length:-1'i64)
Contig(name:"MtDNA", length:-1'i64)

Even though the header looks like this:

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##samtoolsVersion=1.1-4-g9c4099f+htslib-1.1-3-g5b98adc
##samtoolsCommand=samtools mpileup -t DP,DV,DP4,SP -g -f /lscr2/andersenlab/dec211/pyPipeline/genomes/WS245/c_elegans.PRJNA13758.WS245.genomic.fa.gz -r I:1-1000000 v2_snpset/bam/AB1.bam
##reference=file:///lscr2/andersenlab/dec211/pyPipeline/genomes/WS245/c_elegans.PRJNA13758.WS245.genomic.fa.gz
##contig=<ID=I,length=15072434>
##contig=<ID=II,length=15279421>
##contig=<ID=III,length=13783801>
##contig=<ID=IV,length=17493829>
##contig=<ID=V,length=20924180>
##contig=<ID=X,length=17718942>
##contig=<ID=MtDNA,length=13794>

I'll see if I can track down what is going on...

make sure you have the latest version of hts-nim as the change was quite recent.

Thank you very much, I figured it out!