genome / pindel

Pindel can detect breakpoints of large deletions, medium-sized insertions, inversions, tandem duplications and other structural variants at single-base resolution from next-gen sequence data. It uses a pattern growth approach to identify the breakpoints of these variants from paired-end short reads.

Question: Disagreement in the coordinates of VCF and internal formats for Pindel

javang opened this issue

I am observing strange discrepancies between the information present in the VCF created by Pindel and the internal file format. Here is what I did:

  • I downloaded Pindel from GitHub two days ago with the command:
    git clone https://github.com/genome/pindel.git

  • I ran Pindel on a sample BAM file, using the human GRCh37 g1k_v37 decoy genome sequence as the reference.

  • I noticed that the VCF file does not contain the support information, score, or quality measures, so I decided to recover them by merging the VCF file with the corresponding lines from the internal format. I extracted those lines with grep ChrID internal_file.

  • However, not all the coordinates matched. For one particular case I had in the VCF:

1 10290621 . ATAGCTGGGATTACAGGTGTGTGCCACCACACCTGGTTAATTTTTGTATTTTTAATAGAGACGGGGTTTCACCGTGTTGGCTAGGCTGGTCTTGAT GTACTTGGGATTACTGGCGTACGCCACCACGCCCAGCTAATTTTTGTATTTTTAGTAGAGACGGGGTTTCACCATGTCAACCAGGCTGGTCTCGAA . PASS END=10290715;HOMLEN=0;SVLEN=-96;SVTYPE=RPL;NTLEN=96 GT:AD 0/0:0,1

and the internal format:

3305 D 96 NT 96 "GTACTTGGGATTACTGGCGTACGCCACCACGCCCAGCTAATTTTTGTATTTTTAGTAGAGACGGGGTTTCACCATGTCAACCAGGCTGGTCTCGAA" ChrID 1 BP 10290620 10290717 BP_range 10290620 10290717 Supports 1 1 + 0 0 - 1 1 S1 2 SUM_MS 99 1 NumSupSamples 1 1 pFDA_simTruth_76x_0.4_FEMALE 0 0 0 0 1 1
  • Notice that the VCF gives the position of the variant as 10290621, while the internal file says 10290620. I checked in the UCSC Genome Browser that the REF sequence is 96 bp long, starts at 10290621 and ends at 10290716.

  • So now I have the following discrepancies for (begin, end): reference (10290621, 10290716), VCF (10290621, 10290715), internal format (10290620, 10290717).

  • I repeated the exercise with a line where the start position matches:

1 10289908 . GGAGTGAGACTCCGTCTCAAAAAAAAAAAAAAAAAAAAAAAAAAAAGAAAAGAAAATTAGGGGCCAGACGTGGTGGCTCACACCTATAATCCCAGC GGAGTGAGACTCCGTCTCAAAAAAAAAAAAAAAAAAAAAAAAAAAAGAAAAGAAAATTAGGGGCCAGACGTGGTGGCTCACACCTATAATCCCAGCTATTCAGGAGGCTGAGGCAGGAGAATCACTTGAACCCAGGAGGTGGAGGTTGCAGTGAGCTGAGATCGCACCACTGCACTCCAGCCTGGGTCACAGAGTGAGACTCCGTCTCAAAAAAAAAAAAAAAAAAAAAAAAAAAAGAAAAGAAAATTAGGGGCCAGACGTGGTGGCTCACACCTATAATCCCAGC . PASS END=10290003;HOMLEN=0;SVLEN=95;SVTYPE=DUP:TANDEM;NTLEN=95 GT:AD 0/0:0,1

and the internal format:

168 TD 95 NT 95 "TATTCAGGAGGCTGAGGCAGGAGAATCACTTGAACCCAGGAGGTGGAGGTTGCAGTGAGCTGAGATCGCACCACTGCACTCCAGCCTGGGTCACA" ChrID 1 BP 10289908 10290004 BP_range 10289908 10290004 Supports 1 1 + 0 0 - 1 1 S1 2 SUM_MS 99 1 NumSupSamples 1 1 pFDA_simTruth_76x_0.4_FEMALE 0 0 0 0 1 1

In this case the (begin, end) coordinates are: reference (10289908, 10290003), VCF (10289908, 10290003), and internal format (10289908, 10290004).

  • I cannot make sense of it.
    • In the first case the VCF coordinates are wrong and the internal format coordinates seem to be flanking.
    • In the second case the VCF coordinates are correct but the internal format coordinates are not flanking.

The user manual does not explain any of this, so I am clueless. Any help is appreciated.
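For anyone who wants to reproduce the comparison, here is a minimal Python sketch that parses the two internal-format lines quoted above and compares the reported event size with the number of bases strictly between the two BP values. The field layout in the regular expression is inferred only from those two lines, the names `parse_summary` and `flanked_length` are made up for illustration, and the NT sequences are shortened for readability. Reading the BP values as flanking positions is an assumption, not something the manual confirms.

```python
import re

# Layout inferred from the two summary lines quoted in this issue; other
# Pindel versions or event types may use a different layout.
SUMMARY_RE = re.compile(
    r'^(?P<index>\d+)\s+(?P<sv_type>\S+)\s+(?P<size>\d+)\s+'
    r'NT\s+(?P<nt_len>\d+)\s+"(?P<nt_seq>[^"]*)"\s+'
    r'ChrID\s+(?P<chrom>\S+)\s+'
    r'BP\s+(?P<bp_start>\d+)\s+(?P<bp_end>\d+)\s+'
    r'BP_range\s+(?P<range_start>\d+)\s+(?P<range_end>\d+)\s+'
    r'Supports\s+(?P<supports>\d+)\s+(?P<unique_supports>\d+)'
)

INT_FIELDS = ("index", "size", "nt_len", "bp_start", "bp_end",
              "range_start", "range_end", "supports", "unique_supports")


def parse_summary(line):
    """Parse one 'ChrID' summary line into a dict (hypothetical helper)."""
    match = SUMMARY_RE.match(line.strip())
    if match is None:
        raise ValueError("line does not look like a Pindel summary line")
    record = match.groupdict()
    for key in INT_FIELDS:
        record[key] = int(record[key])
    return record


def flanked_length(record):
    """Bases strictly between the two BP values (assumes BP values flank the event)."""
    return record["bp_end"] - record["bp_start"] - 1


# NT sequences shortened here; the coordinates are the ones from the issue.
deletion = parse_summary(
    '3305 D 96 NT 96 "GTACTT...CTCGAA" ChrID 1 BP 10290620 10290717 '
    'BP_range 10290620 10290717 Supports 1 1 + 0 0 - 1 1 S1 2 SUM_MS 99 1'
)
tandem = parse_summary(
    '168 TD 95 NT 95 "TATTCA...GTCACA" ChrID 1 BP 10289908 10290004 '
    'BP_range 10289908 10290004 Supports 1 1 + 0 0 - 1 1 S1 2 SUM_MS 99 1'
)

for record in (deletion, tandem):
    print(record["sv_type"], record["size"], flanked_length(record))
```

Under that assumption both lines are internally consistent: 96 bases lie between the BP values of the D event, and 95 between those of the TD event, matching the reported sizes in each case.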

If I remember correctly, whereas biologists start a chromosome at position 1, Pindel starts a genome at position 0; pindel2vcf therefore has to 'shift' the raw Pindel position by one place. That may explain the first discrepancy.
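A quick way to see how far a simple shift goes is to tabulate the differences between the two formats. The short Python sketch below uses only the (begin, end) pairs reported in the question; the names `CASES` and `offsets` are illustrative.

```python
# (begin, end) pairs exactly as reported in the question.
CASES = {
    "RPL": {"vcf": (10290621, 10290715), "internal": (10290620, 10290717)},
    "DUP:TANDEM": {"vcf": (10289908, 10290003), "internal": (10289908, 10290004)},
}


def offsets(case):
    """Return the (begin, end) differences, VCF minus internal format."""
    (vcf_begin, vcf_end) = case["vcf"]
    (int_begin, int_end) = case["internal"]
    return vcf_begin - int_begin, vcf_end - int_end


for name, case in CASES.items():
    print(name, offsets(case))
# RPL (1, -2)
# DUP:TANDEM (0, -1)
```

A one-position shift matches the begin coordinate of the first record, but the begin coordinate of the second record is not shifted at all, so the shift alone does not seem to account for everything reported above.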

In general, the raw Pindel output really is 'raw', and pindel2vcf is meant not only as a simple converter but also to remove duplicates, shift events that have not been reported at the correct place, etc. Working with the raw Pindel data is something you can do, but it is not straightforward.

Note that working with VCF files isn't very straightforward either: while there is an official VCF standard, the details are a bit vague. For example, the GATK format is different from the more general VCF format, hence pindel2vcf has a GATK option.

I do understand your rationale for wanting quality data. The easiest way to achieve that (in my opinion) is to use the different filtering options in pindel2vcf so you can select events which all have a certain minimum support etc.; pindel2vcf has lots of filtering options.
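If you do end up filtering the raw summary lines yourself rather than through pindel2vcf, the idea of a minimum-support cut-off can be sketched in a few lines of Python. This is not how pindel2vcf implements its filters. It assumes that the first number after `Supports` is the count of supporting reads, and the second input line is a made-up example, not real Pindel output.

```python
import re

# Assumption: "Supports N M" holds the supporting-read count N followed by
# the unique supporting-read count M.
SUPPORTS_RE = re.compile(r'\bSupports\s+(\d+)\s+(\d+)')


def support_counts(line):
    """Return (supports, unique_supports), or None if the field is absent."""
    match = SUPPORTS_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def filter_by_support(lines, min_support):
    """Keep summary lines whose supporting-read count is at least min_support."""
    kept = []
    for line in lines:
        counts = support_counts(line)
        if counts is not None and counts[0] >= min_support:
            kept.append(line)
    return kept


lines = [
    # Coordinates from this issue, NT sequence shortened.
    '3305 D 96 NT 96 "GTACTT...CTCGAA" ChrID 1 BP 10290620 10290717 '
    'BP_range 10290620 10290717 Supports 1 1 + 0 0 - 1 1 S1 2 SUM_MS 99 1',
    # Hypothetical line, invented only to show an event that passes the filter.
    '1 D 10 NT 0 "" ChrID 1 BP 1000 1011 '
    'BP_range 1000 1011 Supports 5 4 + 3 2 - 2 2 S1 10 SUM_MS 300 1',
]

well_supported = filter_by_support(lines, min_support=3)
print(len(well_supported))
```

With a threshold of 3, only the hypothetical second line is kept; the event from this issue has a single supporting read and is dropped.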

Hope this helps!