fetch contigs and lengths
danielecook opened this issue
Is there a way to fetch the list of chromosomes from the header of a VCF along with their lengths?
there is not currently, but I'd be open to adding this or getting a PR to implement it. there's a method in cyvcf2, seqnames, which checks the header first and then checks for an index and reads from there if it can. it returns only the names, but it can still be a starting point.
vcf.contigs is probably a better name
there's a seqlens too.
Ok - so after digging in a bit it looks like something more involved than this will be required:
var contigs = v.header.get("contig", BCF_HEADER_TYPE.BCF_HL_CTG)
I don't know C, but I might be able to make this work. Are you suggesting the C code be imported from cyvcf2? I can take a stab at it.
I will implement on monday. I just meant to use cyvcf2 as a guide for the C functions to use.
I started this in master. v.contigs will give you a seq of Contig objects, each with a name attribute. there is also a length attribute, but it is not set (yet).
I'll have a look at that this week.
Wow thank you! Yeah no need to rush to do this or anything, just wanted it to add some additional functions to my own little command-line utility.
this would probably be better solved by exposing more/better wrapping for the bcf_rec_t structs and the header in general. i don't see an obvious, clean solution so i will let this marinate for a bit.
Ok that sounds good. I like how hash tables are returned with info columns but that doesn’t quite work with contigs because there are multiple... I’ll parse the header for now from the string into a list of hash tables.
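For reference, the interim workaround described above (parsing the header string into a list of hash tables) could look roughly like the sketch below. It is written in Python with only the standard library rather than Nim, and parse_contigs is a hypothetical helper name, not part of hts-nim or cyvcf2. It assumes the header text is already available as one string and that contig lines follow the usual ##contig=<key=value,...> form.

```python
import re

# One key=value pair inside the <...> of a structured header line.
# Values may be bare (ID=I) or double-quoted (which may contain commas).
_PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|[^,]*)')


def parse_contigs(header_text):
    """Return one dict per ##contig line, in header order.

    Hypothetical helper for illustration only. 'length' is converted to
    int when it is present and numeric; otherwise it is left as found.
    """
    contigs = []
    for line in header_text.splitlines():
        line = line.strip()
        if not (line.startswith("##contig=<") and line.endswith(">")):
            continue
        body = line[len("##contig=<"):-1]
        fields = {key: value.strip('"') for key, value in _PAIR.findall(body)}
        if fields.get("length", "").isdigit():
            fields["length"] = int(fields["length"])
        contigs.append(fields)
    return contigs


if __name__ == "__main__":
    example_header = "\n".join([
        "##fileformat=VCFv4.2",
        "##contig=<ID=I,length=15072434>",
        "##contig=<ID=MtDNA,length=13794>",
    ])
    for contig in parse_contigs(example_header):
        print(contig["ID"], contig.get("length"))
```

A list of dicts is used instead of a single hash table because the header can hold many contig lines, and their order is usually worth preserving.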
this is now implemented. if lengths are available in the header, they will be set correctly on the contigs that have been returned from vcf.contigs
Thank you! I will make use of this in seq-collection
I am seeing -1 contig lengths:
Contig(name:"I", length:-1'i64)
Contig(name:"II", length:-1'i64)
Contig(name:"III", length:-1'i64)
Contig(name:"IV", length:-1'i64)
Contig(name:"V", length:-1'i64)
Contig(name:"X", length:-1'i64)
Contig(name:"MtDNA", length:-1'i64)
Even though the header looks like this:
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##samtoolsVersion=1.1-4-g9c4099f+htslib-1.1-3-g5b98adc
##samtoolsCommand=samtools mpileup -t DP,DV,DP4,SP -g -f /lscr2/andersenlab/dec211/pyPipeline/genomes/WS245/c_elegans.PRJNA13758.WS245.genomic.fa.gz -r I:1-1000000 v2_snpset/bam/AB1.bam
##reference=file:///lscr2/andersenlab/dec211/pyPipeline/genomes/WS245/c_elegans.PRJNA13758.WS245.genomic.fa.gz
##contig=<ID=I,length=15072434>
##contig=<ID=II,length=15279421>
##contig=<ID=III,length=13783801>
##contig=<ID=IV,length=17493829>
##contig=<ID=V,length=20924180>
##contig=<ID=X,length=17718942>
##contig=<ID=MtDNA,length=13794>
I'll see if I can track down what is going on...
make sure you have the latest version of hts-nim as the change was quite recent.
Thank you very much I figured it out!