gymrek-lab / EnsembleTR

Tools for merging Tandem Repeat VCF files

GangSTR X-chromosome filtering errors, HipSTR repeated read ID problem and question on use of PCR data?

Noma-M opened this issue

Hi EnsembleTR Developers

Thank you for developing this tool. I think it is incredible to have a tool where we can ‘merge’ VCFs from different STR tools and thus increase our confidence in the STR calls generated by these tools.

Since EnsembleTR can take VCF files generated by HipSTR, GangSTR, adVNTR, and ExpansionHunter, and my pipeline is heading towards using EnsembleTR as well, I thought I could post some questions about these tools here.
I have managed to generate VCF files for adVNTR and ExpansionHunter and to run them through dumpSTR. My first question is on GangSTR. I have managed to generate VCF files; however, when I use dumpSTR with the GangSTR call-level filter "--gangstr-filter-badCI", I get this error:
Traceback (most recent call last):
  File "dumpSTR", line 8, in <module>
    sys.exit(run())
  File "python3.8/site-packages/trtools/dumpSTR/dumpSTR.py", line 1272, in run
    retcode = main(args)
  File "python3.8/site-packages/trtools/dumpSTR/dumpSTR.py", line 1210, in main
    record = ApplyCallFilters(record, call_filters, sample_info, invcf.samples)
  File "python3.8/site-packages/trtools/dumpSTR/dumpSTR.py", line 594, in ApplyCallFilters
    filt_output = filt(record)
  File "python3.8/site-packages/trtools/dumpSTR/filters.py", line 718, in __call__
    ci = np.stack(ci)
  File "<array_function internals>", line 180, in stack
  File "python3.8/site-packages/numpy/core/shape_base.py", line 426, in stack
    raise ValueError('all input arrays must have the same shape')
ValueError: all input arrays must have the same shape

…which I presume has something to do with the ploidy of the X chromosome in males (after reading up on the error, I saw that dumpSTR filtered only up to chromosome 22 and stopped at the X chromosome).
Since the "--gangstr-filter-badCI" filter is based on REPCN and REPCI, I wanted to check that I understand what it does: does it remove calls where REPCN and REPCI do not match (i.e. remove the call if REPCN is 4,4 but REPCI is 3-4,3-4)?
While I am on the topic of GangSTR filters, I also wanted to ask about "--gangstr-filter-spanbound-only". If I am interested in also uncovering possible repeat expansions, is it necessary to include this filter in dumpSTR?
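
For reference, the ValueError in the traceback above can be reproduced outside dumpSTR. The snippet below is a minimal sketch, assuming (this is not confirmed from the dumpSTR source) that each sample contributes one confidence-interval pair per allele, so a hemizygous male chrX call yields fewer rows than a diploid call. The arrays and the helper `try_stack` are made up for illustration.

```python
import numpy as np

# Hypothetical per-sample confidence intervals: one (low, high) row per allele.
diploid_ci = np.array([[3, 4], [3, 4]])   # two alleles -> shape (2, 2)
haploid_ci = np.array([[10, 12]])         # one allele, e.g. male chrX -> shape (1, 2)

def try_stack(arrays):
    """Return the stacked array, or the error message if the shapes differ."""
    try:
        return np.stack(arrays)
    except ValueError as err:
        return str(err)

ok = try_stack([diploid_ci, diploid_ci])    # same shapes: stacking works
bad = try_stack([diploid_ci, haploid_ci])   # mixed ploidy: np.stack refuses

print(ok.shape)
print(bad)
```

If that assumption holds, any record mixing haploid and diploid calls would trigger the error, which would match the run stopping at the first chrX record.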

My second question is on HipSTR. I noticed that the EnsembleTR team came across the same error I had and posted it on the HipSTR GitHub page (tfwillems/HipSTR#50 "ERROR: Failed to extract passes filters tag from BAM alignment"). Did you manage to resolve this error, and if so, how?

Lastly, I wanted to ask about using PCR data alongside PCR-free data. I know, for example, that ExpansionHunter was developed using PCR-free data. However, I have both PCR and PCR-free data, and if I subset my samples to the PCR-free ones I decrease my sample size. What would your advice be on using PCR NGS data with these different tools, and eventually with EnsembleTR as well?

Best,
Noma

Hi Noma,

I shared the issue with running dumpSTR on GangSTR files with the TRTools team. I'll let you know when I hear back.

As for the GangSTR-specific filters, --gangstr-filter-badCI does the following:

If the 95% confidence interval does not include the maximum likelihood genotype call, the call is filtered.
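
The rule can be sketched in a few lines of Python. This only illustrates the description above and is not dumpSTR's implementation; the function names `parse_repci` and `bad_ci` are hypothetical, and the input strings follow the "4,4" / "3-4,3-4" layout used in the question.

```python
def parse_repci(repci):
    """Parse a string such as "3-4,3-4" into [(3, 4), (3, 4)]."""
    intervals = []
    for part in repci.split(","):
        low, high = part.split("-")
        intervals.append((int(low), int(high)))
    return intervals

def bad_ci(repcn, repci):
    """Return True if any called repeat count lies outside its interval."""
    calls = [int(x) for x in repcn.split(",")]
    intervals = parse_repci(repci)
    if len(calls) != len(intervals):
        raise ValueError("REPCN and REPCI describe different numbers of alleles")
    return any(not (low <= call <= high)
               for call, (low, high) in zip(calls, intervals))

print(bad_ci("4,4", "3-4,3-4"))   # each call is inside its interval
print(bad_ci("4,6", "3-4,3-5"))   # second call (6) is outside 3-5
```

Under this reading, the example from the question (REPCN 4,4 with REPCI 3-4,3-4) would be kept, because each call falls inside its own interval; only a call outside its interval would be filtered.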

As for --gangstr-filter-spanbound-only: if a locus has only spanning or bounding reads, it is usually a flag that something is weird and/or that our estimate is unlikely to be correct. If the repeat is really short, we'd expect some enclosing reads. If it is really long, we'd expect some fully repetitive reads. So I believe it is a good filter in general.
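
The same idea as a small sketch. The dictionary keys and the function `only_spanning_or_bounding` are hypothetical and do not reflect how dumpSTR reads the VCF; treating a call with no reads at all as "not flagged" is also an assumption made here.

```python
def only_spanning_or_bounding(read_counts):
    """Return True if a call is supported solely by spanning/bounding reads.

    read_counts maps a read class name to the number of reads in that class.
    """
    informative = (read_counts.get("enclosing", 0)
                   + read_counts.get("fully_repetitive", 0))
    span_bound = (read_counts.get("spanning", 0)
                  + read_counts.get("bounding", 0))
    # Flag only when there is some evidence, and none of it is enclosing
    # or fully repetitive.
    return informative == 0 and span_bound > 0

short_repeat = {"enclosing": 12, "spanning": 3, "fully_repetitive": 0, "bounding": 5}
suspicious = {"enclosing": 0, "spanning": 4, "fully_repetitive": 0, "bounding": 7}

print(only_spanning_or_bounding(short_repeat))
print(only_spanning_or_bounding(suspicious))
```

A genuine expansion would be expected to show some fully repetitive reads, so under this sketch it would not be flagged.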

For the HipSTR error, I personally did not have this problem, but it seems that HipSTR tries to extract the information in the "PF" tag from your BAM files and fails...

Regarding PCR, our dataset has both PCR-free samples (sequencing data for 1000 Genomes samples) and PCR+ samples (sequencing data for H3Africa samples). When using TR genotypers on PCR+ data, please be aware that, in our experience, the final genotypes tend to be more error-prone, particularly within homopolymers. The consensus EnsembleTR call would therefore have a higher error rate as well.

Best,
Helia

Thank you Helia