gymrek-lab / EnsembleTR

Tools for merging Tandem Repeat VCF files

inconsistent start/end positions

xbwdk opened this issue

Hi EnsembleTR developers,

I was really excited when I saw this EnsembleTR tool which aims to merge different STR callsets; the merging problem has really bothered me a lot for a long time.

I installed the package and ran some tests using the data you provide in the ensembletr/testsupport/ExampleData folder, and I found a minor issue: the start positions are inconsistent.

Basically, I extracted one entry each from hipstr-chr20.vcf.gz and gangstr-chr20.vcf.gz, both representing the same STR locus (chr20:78835-78850), and then ran EnsembleTR to merge the HipSTR and GangSTR results.
The original HipSTR entry (line number 3439) already has this inconsistency: the POS column (POS=78833) and the START value in the INFO column (START=78835) differ by 2 bases (this has already been logged on the HipSTR page as tfwillems/HipSTR#80; HipSTR tends to include a few extra bases when SNPs occur in the flanking regions).
After merging with EnsembleTR, the resulting sequence follows the longer HipSTR sequence, which has the extra bases ahead of the STR locus, so the start position should be 78833, but EnsembleTR outputs 78835. The end position is also shifted by 2 bases.

Here is the code I used when I found this issue:

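# Copy the header lines of each call set
zcat hipstr-chr20.vcf.gz | grep "^#" > hipstr-chr20_test.vcf
zcat gangstr-chr20.vcf.gz | grep "^#" > gangstr-chr20_test.vcf
# Append the record for the same STR locus from each file (selected by line number)
zcat hipstr-chr20.vcf.gz | sed -n 3439p >> hipstr-chr20_test.vcf
zcat gangstr-chr20.vcf.gz | sed -n 3709p >> gangstr-chr20_test.vcf
# Merge the two single-record VCFs
EnsembleTR --out test.vcf --ref "${REF}" --vcfs hipstr-chr20_test.vcf,gangstr-chr20_test.vcf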
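
In case it helps with reproducing this, the POS/START mismatch described above can be listed with a small awk filter. This is only a rough sketch: the function name pos_start_mismatch is made up for illustration, and it assumes the start coordinate is stored as a START= key in the INFO column (column 8) of a tab-separated VCF.

```shell
# Hypothetical helper (not part of EnsembleTR or HipSTR): reads uncompressed
# VCF text on stdin and prints CHROM, POS, START and the offset START - POS
# for every record where the two values differ.
# Assumption: the start coordinate is a START= key in the INFO column.
pos_start_mismatch() {
  awk -F'\t' '
    /^#/ { next }                          # skip header lines
    {
      n = split($8, kv, ";")               # INFO is column 8
      for (i = 1; i <= n; i++) {
        if (kv[i] ~ /^START=/) {
          start = substr(kv[i], 7) + 0     # text after "START=" as a number
          if (start != $2 + 0)
            printf "%s\t%s\t%d\t%d\n", $1, $2, start, start - $2
        }
      }
    }'
}
```

It could be run as `zcat hipstr-chr20.vcf.gz | pos_start_mismatch`, which should list every record where POS and START disagree, along with the size of the shift.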

I also found that the example data contains multiple entries for the same locus within a single VCF file, and it seems that EnsembleTR only merges across different VCF files (inter-merging) and does not merge records within one file (intra-merging). I don't think this is a very common case, but I'm still wondering whether you plan to implement intra-merging.
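
For anyone who wants to check their own files for this, here is a rough sketch that flags records overlapping an earlier record in the same file. The function name overlapping_records is made up for illustration. It assumes the VCF is sorted by chromosome and position, and it takes each record to span POS through POS + length(REF) - 1.

```shell
# Hypothetical helper (not part of EnsembleTR): reads a position-sorted,
# uncompressed VCF on stdin and reports every record that starts at or
# before the furthest end position seen so far on the same chromosome.
# Assumption: a record covers POS .. POS + length(REF) - 1.
overlapping_records() {
  awk -F'\t' '
    /^#/ { next }                            # skip header lines
    {
      end = $2 + length($4) - 1              # last reference base covered
      if ($1 == chrom && $2 + 0 <= prev_end)
        printf "%s\t%s\toverlaps previous record ending at %d\n", $1, $2, prev_end
      if ($1 != chrom) { chrom = $1; prev_end = end }   # new chromosome
      else if (end > prev_end) prev_end = end           # extend covered span
    }'
}
```

Usage would be along the lines of `zcat hipstr-chr20.vcf.gz | overlapping_records`. Any line it prints is a record that shares reference bases with a preceding one.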

Thanks a lot, I really appreciate your work! :)

Best,
xbwdk

Dear xbwdk,

Thank you for your precise comment. I looked at the example files and noticed that the HipSTR example input is out of date. HipSTR input files can contain multiple overlapping records for the same repeat, and EnsembleTR cannot merge such records correctly, so we wrote a script to address this. The script takes the HipSTR VCF file as input, tries to merge the records from the same repeat, and reports a single record. I'll update the example file to reflect this. Thank you again for bringing up this issue.

Best,
Helia

Dear xbwdk,

I updated the example VCF files in the ExampleData directory and added a note on using HipSTR files. Please let me know if there are any further issues.

Best,
Helia