Error(s) during WES analysis on UKBB

Question

aLaine1 opened this issue a year ago

Hello,
I have been working with SAIGE-GENE+ for the last couple of days, learning the basics on a local dataset first, and reading all about the UKBB WES analysis it was able to perform. I have been following the steps you documented here https://saigegit.github.io/SAIGE-doc/docs/UK_Biobank_WES_analysis.html to try and reproduce what you did for a phenotype of interest in our lab.

I used the hard-called genotypes of ~460,000 White participants to build the sparse GRM, taken from the /Bulk/Genotype Results/Genotype calls/ukb22418* PLINK files. I pruned them in the same way as in your analysis.

Then I ran step 1 for a phenotype we're interested in: Apolipoprotein_B (a quantitative trait).
For this I created the input PLINK file from a subset of 2000 markers (taken from the hard-called genotype data).
Step 1 seems to work fine: it outputs the .rda file and the following variance ratios:

0.999998724446117 sparse 1
1.00359091377333 null 1
0.999999058976022 sparse 2
0.995340421007546 null 2

It's in step 2 that I run into issues.
I used /Bulk/Exome sequences/Population level exome OQFE variants, PLINK format - interim 450k release/ukb23149_c19_b0_v1.bed/.fam/.bim as inputs.

The first error I encountered was:
"At least one subject requested is not in Plink file." I figured it might be because some samples were not sequenced in the final WES data, so I used "--subSampleFile" to keep only the samples that are both in my pruned dataset and WES-sequenced.
That got me past this error, but I don't know if it's the right approach.
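
For reference, here is a minimal sketch of how such a sample list could be built in R. The IDs, the temporary file names, and the one-ID-per-line layout of the output are assumptions for illustration only; the real inputs would be the WES .fam file and the sample IDs used in step 1.

```r
# Toy stand-in for the WES .fam file; the IDs below are hypothetical.
fam_path <- tempfile(fileext = ".fam")
writeLines(c("1 1 0 0 1 -9",
             "2 2 0 0 2 -9",
             "4 4 0 0 1 -9"), fam_path)

# Sample IDs used in step 1 (hypothetical values).
step1_ids <- c("1", "2", "3")

# A PLINK .fam file has six whitespace-separated columns;
# column 2 holds the within-family (individual) ID.
fam <- read.table(fam_path, header = FALSE, colClasses = "character")
wes_ids <- fam$V2

# Keep only the samples present in both sets.
shared_ids <- intersect(step1_ids, wes_ids)

# Write one ID per line, no header
# (assumed layout for the file passed to --subSampleFile).
sub_path <- tempfile(fileext = ".txt")
writeLines(shared_ids, sub_path)
```

Reading the IDs as character avoids any numeric reformatting of large sample identifiers before they are written back out.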

Anyway, now I'm facing a new error: it seems to read the inputs fine, but it fails right as it starts the analysis:

I don't really know if my inputs are wrong (which I suspect might be the real issue), but I'm stuck here and can't find a workaround for now!
Thanks for your time
Antoine L.

EDIT:
After further testing, I think the issue is related to the "fastTest". I couldn't figure out why, but when I run the same command with is_fastTest=FALSE, it runs to the end without any issue (though it's much slower and thus hardly feasible on a large dataset).

Antoine Lainé · Answer 1 · Thu Aug 24 2023 17:07:24 GMT+0800 (China Standard Time)

After some digging, I found that this behavior happens when you use --subSampleFile in step 2. And I found the culprit:

https://github.com/saigegit/SAIGE/blob/5fa0a2bda54656df1def789dd8c45ac513e3c21f/R/readInGLMM.R#L285C39-L285C39

This specific line is supposed to remove the information about excluded samples, but the line above it already does two things:

modglmm$mu takes the value of modglmm$fitted.values, which was already filtered for excluded samples at line 210.
modglmm$mu is converted to a vector.

Now, with the following line, two problems arise:

modglmm$mu = modglmm$mu[includeIndex]

We try to remove data that was already removed, and doing this on a vector introduces NA values into it, which then breaks the code because the statistical tests cannot handle NA.
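
To illustrate the mechanism, here is a toy sketch in plain R. It is not the SAIGE code itself: the variable names and values are made up, and it only shows what happens when the same positional index is applied twice to a vector.

```r
# Toy fitted values for five samples (made-up numbers).
fitted_values <- c(0.1, 0.2, 0.3, 0.4, 0.5)

# Positions of the samples to keep, relative to the original five (hypothetical).
include_index <- c(1, 3, 5)

# First subsetting: three values remain (0.1, 0.3, 0.5).
mu <- fitted_values[include_index]

# Second subsetting with the same positions on the already shortened vector:
# position 3 now points at a different sample, and position 5 is out of range,
# which R silently returns as NA instead of raising an error.
mu_twice <- mu[include_index]

print(mu_twice)         # values: 0.1, 0.5, NA
print(anyNA(mu_twice))  # TRUE
```

In this toy case the second subsetting not only introduces an NA but also shifts which value belongs to which sample, so a check such as anyNA() right after the subsetting would make the problem visible early.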

Anyway, simply commenting out or removing line 285 was enough to fix everything for me!