brentp / somalier

fast sample-swap and relatedness checks on BAMs/CRAMs/VCFs/GVCFs... "like damn that is one smart wine guy"

Extracted genotypes format?

lvclark opened this issue

Somalier extracts genotypes from BAMs so much faster than anything I've attempted to write. I would love to be able to use those genotypes in other analyses. (In my particular case, I need principal components to use as population-structure covariates in an association analysis, where I am only running the analysis on a few genes and don't want to have to genotype the whole genome.) Could the format of the `somalier extract` output be documented a little more thoroughly, so that someone like me could read those bytes into Python or R and convert them to numeric genotypes?

Hi, you can see the Python code that reads the somalier files here: https://github.com/brentp/somalier/blob/master/scripts/ancestry-predict.py
Remember that the files hold only minimal information, not true genotypes.
Note that the reading function there discards the Y sites, but you can still see the format from it.
Happy to answer any questions.
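For the principal-components use case, here is a rough sketch of what could happen after the counts are loaded. It does not parse the somalier file (the linked script is the reference for that). It assumes you already have per-site reference and alternate read counts as samples-by-sites NumPy arrays. The function names, the depth cutoff, and the allele-balance thresholds are all hypothetical choices for illustration, not somalier's own values.

```python
import numpy as np


def dosages_from_counts(ref, alt, min_depth=7, het_lo=0.2, het_hi=0.8):
    """Turn ref/alt read counts (samples x sites) into 0/1/2 dosages.

    Thresholds are illustrative assumptions, not somalier's settings.
    Sites with depth below min_depth are returned as NaN (unknown).
    """
    ref = np.asarray(ref, dtype=float)
    alt = np.asarray(alt, dtype=float)
    depth = ref + alt
    dos = np.full(depth.shape, np.nan)
    with np.errstate(divide="ignore", invalid="ignore"):
        ab = alt / depth  # allele balance; NaN where depth is 0
        ok = depth >= min_depth
        dos[ok & (ab < het_lo)] = 0.0
        dos[ok & (ab >= het_lo) & (ab <= het_hi)] = 1.0
        dos[ok & (ab > het_hi)] = 2.0
    return dos


def principal_components(dos, n_components=2):
    """Sample scores on the top principal components of a dosage matrix."""
    dos = np.asarray(dos, dtype=float)
    # Drop sites that are unknown in every sample.
    dos = dos[:, ~np.all(np.isnan(dos), axis=0)]
    # Mean-impute the remaining missing values per site.
    site_mean = np.nanmean(dos, axis=0)
    filled = np.where(np.isnan(dos), site_mean, dos)
    # Center each site, then take the SVD; U * S gives the sample scores.
    centered = filled - filled.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]
```

Something like `pcs = principal_components(dosages_from_counts(ref, alt))` would then give one row of covariates per sample. Since the extracted sites are a small, fixed panel, it is worth checking that the resulting components actually separate known populations before using them.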

Wonderful, thank you for the quick reply!