Recommended filtering
iranmdl opened this issue
Hi!
Thank you for this extremely useful tool.
I have just used gtc2vcf to convert 300 GTC files from the Infinium Global Screening Array into VCF:
bcftools +gtc2vcf \
-Ov \
--adjust-clusters \
--bpm ${bpm_manifest_file} \
--csv ${csv_manifest_file} \
--egt ${egt_manifest_file} \
--gtcs ${gtc_folder} \
--fasta-ref ${ref} \
--extra ${prefix}.tsv \
--output ${prefix}.vcf
Now I am trying to understand the different quality metrics and how to use them for downstream filtering. It is my first time working with arrays (all my experience is with WES/WGS), and I was advised to follow the quality filters suggested in "Strategies for processing and quality control of Illumina genotyping arrays". In that paper GenomeStudio is used, and I can see that for SNPs with low GenTrain scores they manually realign the cluster positions, which increases the GenTrain score. Is this something that should be done after running gtc2vcf? Or does the --adjust-clusters argument take care of it?
Are there any recommended thresholds for filtering (GenTrain_Score, Cluster_Sep, ...)? Also, I have noticed a lot of SNPs with a GenTrain_Score of 0 while Orig_Score is good. Is this expected? E.g.:
ID GenTrain_Score Orig_Score
10:135332149_CNV_CYP2E1 0 0.85
10:89725294_CNV_PTEN_e9_9 0 0.88
10:43625509_CNV_RET_e20_20 0 0.87
Thank you in advance!
The cluster file information included in the VCF is for informational purposes only, and I do not have much experience with it. When you convert with BCFtools/gtc2vcf, the cluster file is used to compute the normalized intensities, not to recall the genotypes. With --adjust-clusters you might get better normalized intensities, but the genotypes will stay the same, so I do not recommend using it. You can use GenomeStudio to update your cluster centers and then use iaap_cli to generate new genotypes with the updated cluster file, but you have to do that separately, as BCFtools/gtc2vcf does not have a framework for generating cluster files. I personally never bothered to use cluster files other than those provided by Illumina. The variant QC I perform is based only on genotype missingness and HWE.
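A minimal Python sketch of what a QC based only on genotype missingness and HWE could look like for a single biallelic variant. It is an illustration, not the maintainer's actual procedure: the function names are hypothetical, the HWE test is a 1-degree-of-freedom chi-square approximation rather than an exact test, and the thresholds in the usage lines (5% missingness, p < 1e-6) are assumed values, not recommendations from this thread.

```python
import math


def genotype_counts(gts):
    """Count hom-ref, het, hom-alt and missing calls at a biallelic site.

    `gts` is a list of diploid GT strings such as "0/0", "0/1", "1/1", "./.".
    Anything that is not a complete diploid call is counted as missing.
    """
    counts = {"hom_ref": 0, "het": 0, "hom_alt": 0, "missing": 0}
    for gt in gts:
        alleles = gt.replace("|", "/").split("/")
        if len(alleles) != 2 or "." in alleles:
            counts["missing"] += 1
        elif alleles[0] != alleles[1]:
            counts["het"] += 1
        elif alleles[0] == "0":
            counts["hom_ref"] += 1
        else:
            counts["hom_alt"] += 1
    return counts


def missing_fraction(counts):
    """Fraction of samples without a genotype call."""
    total = sum(counts.values())
    return counts["missing"] / total if total else 1.0


def hwe_chi2_pvalue(counts):
    """Chi-square (1 d.f.) p-value for departure from Hardy-Weinberg equilibrium."""
    observed = (counts["hom_ref"], counts["het"], counts["hom_alt"])
    n = sum(observed)
    if n == 0:
        return 1.0
    p = (2 * observed[0] + observed[1]) / (2 * n)  # reference allele frequency
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0  # monomorphic site: no evidence against HWE
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


def passes_qc(gts, max_missing, min_hwe_p):
    """True if the variant passes both the missingness and the HWE filter."""
    counts = genotype_counts(gts)
    return missing_fraction(counts) <= max_missing and hwe_chi2_pvalue(counts) >= min_hwe_p


if __name__ == "__main__":
    # Thresholds below are assumed for illustration only.
    example = ["0/0"] * 60 + ["0/1"] * 30 + ["1/1"] * 8 + ["./."] * 2
    print(passes_qc(example, max_missing=0.05, min_hwe_p=1e-6))
```

For small cohorts or rare variants an exact HWE test is generally preferable to the chi-square approximation used here.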
Thank you for the quick answer :).
OK, so you don't recommend --adjust-clusters? After reading the README I understood it was recommended: "If you convert hundreds of GTC files at once, you can use the --adjust-clusters option which will recenter the genotype clusters rather than using those provided in the EGT cluster file and will compute less noisy LRR values."
So the manual review of the clusters has to be done in GenomeStudio (GS). I guess one could then export a reviewed cluster.egt file from GS, then run iaap_cli, and then bcftools/gtc2vcf? According to several sources, "Genome Studio’s automatic clustering algorithms are reported to be accurate for ~ 99 % of SNPs. The other ~ 1 % need to be manually reviewed". I was hoping to skip this manual review by using gtc2vcf, but I guess this is asking too much! :)
I have never used GenomeStudio, so I have no experience with it. You can use --adjust-clusters if you want, but if you use BAF and LRR values with BCFtools/mocha, it should not make a meaningful difference, as BCFtools/mocha has its own approach to re-center the clusters on the fly.
Aha! I see, thank you! And you only use genotype missingness and HWE for variant filtering, not GenTrain_Score, call frequency, or genotype quality?
Genotypes will become missing when the genotype quality is too low (see the iaap-cli option --gencall-cutoff), so the genotype quality is taken into account that way.
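To make the idea concrete, here is a minimal Python sketch of a score cutoff turning low-quality calls into missing genotypes. It only illustrates the concept and is not how iaap-cli is implemented: the function name is hypothetical, the comparison direction at exactly the cutoff is an assumption, and the cutoff value in the usage lines is an arbitrary example.

```python
def apply_gencall_cutoff(genotypes, scores, cutoff):
    """Return a copy of `genotypes` where calls scoring below `cutoff` are './.'.

    `genotypes` and `scores` are parallel lists for one sample: a GT string and
    its per-call quality score. Calls scoring exactly `cutoff` are kept
    (an assumption made for this sketch).
    """
    if len(genotypes) != len(scores):
        raise ValueError("genotypes and scores must have the same length")
    return [gt if score >= cutoff else "./." for gt, score in zip(genotypes, scores)]


if __name__ == "__main__":
    # Illustrative values only; the cutoff here is not taken from this thread.
    calls = ["0/0", "0/1", "1/1"]
    quality = [0.92, 0.08, 0.41]
    print(apply_gencall_cutoff(calls, quality, cutoff=0.15))
```

Because low-quality calls are already missing in the GTC file, a per-variant missingness filter downstream indirectly filters on genotype quality as well.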
Right, so the genotype quality check is done when the IDAT files are converted into GTC. Thanks!
I intend to use this VCF for phasing+imputation+GWAS, not the full mocha pipeline for the moment, and I am trying to figure out some standardized filtering criteria to apply to the VCF file. For example, in WES/WGS you can find common filtering cutoffs such as DP>=10, QUAL>=30, etc. I'm all ears if you've got any tips or suggestions :)
Also, is there a metric in the INFO field of the VCF with Call_Freq information (the proportion of samples successfully genotyped at each locus)?
As IDAT to GTC is a sample-by-sample conversion, you don't get statistics across samples. But you can easily compute those from the final VCF with different BCFtools plugins. I always perform phasing and imputation using mocha.wdl and impute.wdl from the MoChA WDL pipeline.
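As an illustration of computing a per-locus call frequency from the final multi-sample VCF, here is a minimal Python sketch. It is a stand-in for the BCFtools plugins mentioned above, not a description of them: the function name is hypothetical, it assumes an uncompressed, tab-delimited VCF with a GT entry in FORMAT, and it treats partially missing genotypes such as "./1" as not called.

```python
def call_frequencies(vcf_lines):
    """Yield (variant_id, call_frequency) for each data line of a VCF.

    `vcf_lines` is any iterable of text lines (e.g. an open file handle).
    Call frequency is the proportion of samples with a non-missing GT.
    """
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        variant_id = fields[2]
        format_keys = fields[8].split(":")
        samples = fields[9:]
        gt_index = format_keys.index("GT")
        called = 0
        for sample in samples:
            values = sample.split(":")
            gt = values[gt_index] if gt_index < len(values) else "."
            if "." not in gt:
                called += 1
        yield variant_id, (called / len(samples) if samples else 0.0)


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python call_freq.py prefix.vcf
    with open(sys.argv[1]) as handle:
        for variant_id, freq in call_frequencies(handle):
            print(f"{variant_id}\t{freq:.4f}")
```

The resulting values can then be compared against whatever call-rate threshold the study design calls for.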
Can the MoChA WDL pipeline run on a cluster (specifically, SLURM) or only in the cloud? I would love to try the pipeline's phasing and imputation submodules.
It can work wherever you can get Cromwell to run. Most of my collaborators run it with SLURM. Detailed instructions for Cromwell setup are here
Thank you @freeseek! I will give it a try.
One last question: could you provide a GTC file that I can use to test the pipeline? I tried to find a publicly available one but have had no success so far.