quinlan-lab / vcf2db

create a gemini-compatible database from a VCF

skipping 'AC' because it has Number=A

matthdsm opened this issue · comments

Hi,
When I try to load my VEP-annotated VCFs into a database, I get the error above. I traced it back to this code (line 656):

            if d['Number'] in "RA":
                print("skipping '%s' because it has Number=%s" % (d["ID"], d["Number"]), 
                      file=sys.stderr)
                continue

Could you explain why those fields get skipped? Is there a way to force their inclusion? The AC, AF, MLEAC, and MLEAF INFO fields are quite important, and they are compliant with the VCF standard.

Thanks
M

There is a 1:1 mapping of INFO fields to database fields. When your AF value is something like 0.233,0.444 because there are multiple ALTs, we can't store that as a single float.
If you decompose your VCF and adjust the header so that fields that had Number=A are declared with Number=1, those fields will be loaded.
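
The header adjustment described here could be scripted roughly as follows. This is a minimal sketch, not part of vcf2db: the function name set_number_to_one is hypothetical, and it assumes the VCF has already been decomposed so that every INFO field declared with Number=A now carries exactly one value per record.

```python
import re

# Matches the "A" in Number=A inside an ##INFO header line.
_NUMBER_A = re.compile(r"(##INFO=<ID=[^,]+,Number=)A(?=,)")


def set_number_to_one(lines):
    """Yield VCF lines, rewriting Number=A to Number=1 in ##INFO headers.

    Assumption: the input was decomposed beforehand, so each record has a
    single ALT and each Number=A field holds a single value.
    """
    for line in lines:
        if line.startswith("##INFO="):
            # \g<1> keeps the prefix; the trailing 1 is the new Number value.
            line = _NUMBER_A.sub(r"\g<1>1", line)
        yield line


if __name__ == "__main__":
    header = ['##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">']
    for out in set_number_to_one(header):
        print(out)
```

FORMAT lines and data records pass through unchanged; only INFO header declarations are touched.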

If you annotate with vcfanno, you could use a [[postannotation]] section and use max(AF) to get a new field and then save that in the database.
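
To illustrate what collapsing a multi-valued field with max amounts to, here is a small Python sketch. It is not vcfanno syntax or vcf2db code; the helper name max_info_value is hypothetical, and it assumes the value is the comma-separated string as it appears in the INFO column, with "." marking a missing entry.

```python
def max_info_value(raw):
    """Return the largest number in a comma-separated INFO value.

    Entries equal to "." (missing) are ignored; returns None if nothing
    numeric remains.
    """
    values = [float(v) for v in raw.split(",") if v != "."]
    return max(values) if values else None


if __name__ == "__main__":
    # The multi-ALT AF value from the example above collapses to one float.
    print(max_info_value("0.233,0.444"))
```

The result is a single number per record, which fits the 1:1 mapping between INFO fields and database fields.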

I would like to handle this more cleanly, but I haven't thought of anything that works well with RDBMS.

We always decompose and normalize with vt before we do our annotation. If I'm understanding you correctly, I can then safely edit the VCF header, replace the A with 1, and I should be good to go?

Nice, I thought it was going to be more complicated! Looking forward to the next release of vcf2db on bioconda!

Thanks,
M

Yes, that should work. Remember that for the GATK AD tag, you need to run sed 's/ID=AD,Number=./ID=AD,Number=R/' before decomposing.
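
For pipelines that already process the header in Python, the same substitution could be sketched like this. The function name fix_ad_number is hypothetical; the pattern mirrors the sed expression, replacing the first match on a line and treating the character after Number= as a wildcard.

```python
import re

# Same pattern as the sed expression: "." matches any single character.
_AD_NUMBER = re.compile(r"ID=AD,Number=.")


def fix_ad_number(line):
    """Declare the AD tag as Number=R on a VCF header line.

    count=1 mirrors sed's default of replacing only the first match.
    """
    return _AD_NUMBER.sub("ID=AD,Number=R", line, count=1)


if __name__ == "__main__":
    before = '##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths">'
    print(fix_ad_number(before))
```

As with the sed command, this has to be applied before the decomposition step.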

thanks for the tip!

M