molgenis / systemsgenetics

Generic Java genotype reader / writer, QTL mapping software, Strand alignment tool

Home Page: https://github.com/molgenis/systemsgenetics/wiki

Alignment of monomorphic SNPs (plink format) by Genotype harmonizer

zhengxuhao opened this issue · comments

Hi,

I am a user of Genotype Harmonizer and would like to report two issues I ran into during my work:

  1. In plink format, a monomorphic SNP is stored with alleles "0\G" if, for example, "G" is the major allele. A problem then occurs when aligning these monomorphic SNPs to a reference panel. One example:
    "5 14159 rs112363107 0\G Excluded Found variant with same ID but alleles are not comparable
    "
    This happens because these SNPs are monomorphic in our own data but not in the reference panel (which usually contains many more individuals). Genotype Harmonizer then treats them as strand problems, which they are not.
    Would it be possible to keep these monomorphic SNPs as they are instead of excluding them?

  2. I also found another small issue when performing strand alignment on shapeit2 format.
    The ".sample" files that accompany shapeit2 format are usually structured as follows:
    "
    ID_1 ID_2 missing father mother sex plink_pheno
    0 0 0 D D D B
    XXX-0963 XXX-0963 0 0 0 0 2 0
    XXX-0965 XXX-0965 0 0 0 0 1 0
    XXX-0966 XXX-0966 0 0 0 0 2 0
    "
    But after strand alignment, the output ".sample" file is changed as follows, with an extra dot between the third and fourth columns:
    "
    ID_1 ID_2 missing father mother sex plink_pheno
    0 0 0 D D D B
    XXX-0963 XXX-0963 0.0 0 0 2 0
    XXX-0965 XXX-0965 0.0 0 0 1 0
    XXX-0966 XXX-0966 0.0 0 0 2 0
    "
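To pin down exactly which rows change, here is a minimal Python sketch that compares the two ".sample" excerpts above field by field. It is not part of Genotype Harmonizer; the function name `compare_sample_files` is hypothetical, and the two strings are simply the excerpts quoted in this report.

```python
# Excerpts copied from the report above (input and output ".sample" files).
BEFORE = """\
ID_1 ID_2 missing father mother sex plink_pheno
0 0 0 D D D B
XXX-0963 XXX-0963 0 0 0 0 2 0
XXX-0965 XXX-0965 0 0 0 0 1 0
XXX-0966 XXX-0966 0 0 0 0 2 0
"""

AFTER = """\
ID_1 ID_2 missing father mother sex plink_pheno
0 0 0 D D D B
XXX-0963 XXX-0963 0.0 0 0 2 0
XXX-0965 XXX-0965 0.0 0 0 1 0
XXX-0966 XXX-0966 0.0 0 0 2 0
"""


def compare_sample_files(before_text, after_text):
    """Return (line_number, fields_before, fields_after) for each differing line.

    Lines are split on whitespace; blank lines are ignored.
    """
    before = [line.split() for line in before_text.splitlines() if line.strip()]
    after = [line.split() for line in after_text.splitlines() if line.strip()]
    report = []
    for line_no, (b, a) in enumerate(zip(before, after), start=1):
        if b != a:
            report.append((line_no, len(b), len(a)))
    return report


if __name__ == "__main__":
    for line_no, n_before, n_after in compare_sample_files(BEFORE, AFTER):
        print(f"line {line_no}: {n_before} fields before, {n_after} fields after")
```

On these excerpts the header and type rows are identical, and only the three sample rows differ in their number of fields.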

I hope these two issues can be fixed in a later version. Thank you for all your excellent work on this amazing tool.

Best regards,
Tenghao

Hello, solving issue number 1 would indeed be very useful; I have encountered it as well. In my case, I am merging two datasets from the same population, so some SNPs are fixed in one dataset but not in the other. They are needed in the combined dataset and shouldn't be excluded. This programme works well, but it would be great if it could be adjusted so that it doesn't exclude SNPs just because they are fixed in either the data or the reference panel. Many thanks!

Dear Mircea83,

If you use binary plink format and specify the alleles correctly, it should be possible to align monomorphic SNPs as well. Only for G/C and A/T SNPs will the LD-based alignment not be possible.

Regards Patrick
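
The distinction drawn in the reply above can be illustrated with a minimal Python sketch. This is not Genotype Harmonizer's actual logic; the function name `classify_alleles` and its labels are hypothetical. It only assumes that "0" is the plink code for a missing allele and that A/T and G/C pairs are their own reverse complements, so strand cannot be inferred from the alleles alone.

```python
# Watson-Crick complements for the four nucleotides.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def classify_alleles(allele1, allele2):
    """Classify an allele pair for strand-alignment purposes.

    Returns one of:
      "missing allele"    - at least one allele is the plink missing code "0"
      "strand-ambiguous"  - A/T or G/C: the pair is its own reverse complement
      "unambiguous"       - any other pair of distinct nucleotides
    """
    a1, a2 = allele1.upper(), allele2.upper()
    if a1 == "0" or a2 == "0":
        return "missing allele"
    if COMPLEMENT.get(a1) == a2:
        return "strand-ambiguous"
    return "unambiguous"


if __name__ == "__main__":
    for pair in [("0", "G"), ("A", "T"), ("G", "C"), ("A", "G")]:
        print(pair, "->", classify_alleles(*pair))
```

A monomorphic SNP written as "0\G" falls into the first group, which is why specifying both alleles in the binary plink files matters; A/T and G/C SNPs fall into the second group.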