Extracted genotypes format?
lvclark opened this issue
Somalier extracts genotypes from BAMs much faster than anything I've attempted to write. I would love to be able to use those genotypes in other analyses. (In my particular case, I need principal components to use as population structure covariates in an association analysis, where I am running the analysis on just a few genes and don't want to have to genotype the whole genome.) Could the output format of somalier extract be documented a little more thoroughly, so that someone like me could read those bytes into Python or R and convert them to numeric genotypes?
Hi, you can see the Python code that reads the somalier files here: https://github.com/brentp/somalier/blob/master/scripts/ancestry-predict.py
Remember that the files hold only minimal information and not true genotypes.
Note that the function there discards the Y-chromosome sites, but you can see the format from it.
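To illustrate the last step being asked about (going from minimal per-site information to numeric genotypes), here is a rough sketch. It assumes that, after reading a file with the linked script, you have per-site reference and alternate read counts. The function name `counts_to_genotypes`, the minimum depth, and the allele-balance cutoffs are all hypothetical choices for illustration, not values taken from somalier, and the sketch does not attempt to describe the binary layout itself.

```python
def counts_to_genotypes(ref_counts, alt_counts, min_depth=7,
                        hom_ref_max=0.02, het_min=0.2, het_max=0.8,
                        hom_alt_min=0.98):
    """Convert per-site ref/alt read counts to 0/1/2 alt-allele dosages.

    All thresholds are illustrative assumptions. Sites with too little
    depth, or with an ambiguous allele balance, are returned as None.
    """
    genotypes = []
    for ref, alt in zip(ref_counts, alt_counts):
        depth = ref + alt
        if depth < min_depth:
            genotypes.append(None)      # not enough reads to call
            continue
        balance = alt / depth           # fraction of reads supporting alt
        if balance <= hom_ref_max:
            genotypes.append(0)         # homozygous reference
        elif het_min <= balance <= het_max:
            genotypes.append(1)         # heterozygous
        elif balance >= hom_alt_min:
            genotypes.append(2)         # homozygous alternate
        else:
            genotypes.append(None)      # ambiguous allele balance
    return genotypes


# Made-up counts for five sites, purely for demonstration.
example = counts_to_genotypes([20, 10, 0, 3, 18], [0, 9, 15, 1, 2])
print(example)  # [0, 1, 2, None, None]
```

The resulting samples-by-sites matrix of dosages, with missing calls imputed or dropped, could then be passed to any standard PCA routine.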
Happy to answer any questions.
Wonderful, thank you for the quick reply!