* in snp is not allowed in MASH pipeline `rds_to_vcf`
rfeng2023 opened this issue · comments
Well, beyond a more technical solution to this, a general question for @hsun3163: when we "harmonize" deletion data, is there a way to harmonize it as, e.g.,
GC C
as opposed to:
G *
?
One immediate challenge is that the C in GC C is not readily available when all we have is G *
As discussed with @rfeng2023, I think we are going to use the standard N symbol for "any basepair". ...
Should we replace it in the rds_to_vcf process or in the input file (which may need to be fixed in merged.sumstat.vcf)?
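For illustration only, here is a minimal sketch of what the replacement could look like at the level of a single VCF line, wherever it ends up being applied. The function name `replace_star_alt` is hypothetical and not part of the pipeline; the sketch assumes tab-delimited records with ALT in the fifth column and IDs of the form chrom:pos:ref:alt, as in the records quoted in this thread.

```python
def replace_star_alt(vcf_line: str, placeholder: str = "N") -> str:
    """Replace a '*' ALT allele (and the matching ID suffix) with a placeholder.

    Hypothetical helper: assumes tab-delimited VCF data lines and
    IDs formatted as chrom:pos:ref:alt.
    """
    if vcf_line.startswith("#"):
        return vcf_line  # header lines pass through unchanged
    newline = "\n" if vcf_line.endswith("\n") else ""
    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) < 5:
        return vcf_line  # not a complete data line; leave it alone
    # ALT may hold several comma-separated alleles; touch only '*'
    alts = [placeholder if a == "*" else a for a in fields[4].split(",")]
    fields[4] = ",".join(alts)
    # Keep the ID consistent with the new ALT when it ends in ':*'
    id_parts = fields[2].split(":")
    if len(id_parts) == 4 and id_parts[3] == "*":
        id_parts[3] = placeholder
        fields[2] = ":".join(id_parts)
    return "\t".join(fields) + newline
```

Applying the same function to the input file or inside the conversion step would give identical records, so the choice between the two is mainly about where the change should be visible downstream.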
They were not introduced by me; instead, they were inherited from the raw vcf.gz file, as shown below:
hs3163@node96:/mnt/vast/hpc/csg/xqtl_workflow_testing/finalizing/output/data_preprocessing/genotype$ zcat DEJ_11898_B01_GRM_WGS_2017-05-15_21.recalibrated_variants.xqtl_protocol_data.add_chr.add_chr.leftnorm.filtered.vcf.gz | grep "*" | cut -f 1,2,3,4,5,6 | head
##FORMAT=<ID=RGQ,Number=1,Type=Integer,Description="Unconditional reference genotype confidence, encoded as a phred quality -10*log10 p(genotype call is wrong)">
##bcftools_filterCommand=filter -i 'GT="hom" | TYPE="snp" & GT="het" & (FORMAT/AD[*:1])/(FORMAT/AD[*:0] + FORMAT/AD[*:1]) >= 0.15 | TYPE="indel" & GT="het" & (FORMAT/AD[*:1])/(FORMAT/AD[*:0] + FORMAT/AD[*:1]) >= 0.2'; Date=Wed Oct 19 15:56:29 2022
chr21 9540614 chr21:9540614:G:* G * 3256.49
chr21 9550890 chr21:9550890:A:* A * 11424.9
chr21 9553296 chr21:9553296:G:* G * 3665.16
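Note that the grep above also matches header lines that happen to contain an asterisk. As a rough sketch, assuming tab-delimited records with ALT in the fifth column, the same check can be restricted to data lines. The function name `star_alt_records` is hypothetical.

```python
from typing import Iterable, Iterator


def star_alt_records(lines: Iterable[str]) -> Iterator[str]:
    """Yield VCF data lines whose ALT column contains a '*' allele.

    Hypothetical helper: header lines (starting with '#') are skipped,
    so asterisks inside header descriptions do not produce matches.
    """
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # ALT may list several comma-separated alleles
        if len(fields) >= 5 and "*" in fields[4].split(","):
            yield line.rstrip("\n")
```

The input can be any iterable of text lines, for example a file object returned by `gzip.open(path, "rt")` for the vcf.gz file.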
Changing the * to N only in the MASH output may make future comparisons between the MASH output and the output of other parts of our analysis pipeline difficult.
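One possible way to soften that concern, sketched here under the assumption that variant IDs follow the chrom:pos:ref:alt pattern seen above, is to compare variants through a key that treats * and N as the same placeholder. The name `comparison_key` is hypothetical, and the sketch also assumes that N never appears as a genuine ALT allele in these data.

```python
def comparison_key(variant_id: str, placeholders=("*", "N")) -> tuple:
    """Build a key that is identical for 'chr:pos:ref:*' and 'chr:pos:ref:N'.

    Hypothetical helper: assumes IDs of the form chrom:pos:ref:alt and
    that 'N' is only ever used as the stand-in for '*'.
    """
    chrom, pos, ref, alt = variant_id.split(":")
    # Collapse both placeholder spellings onto a single symbol
    alt_norm = "*" if alt in placeholders else alt
    return (chrom, int(pos), ref, alt_norm)
```

With such a key, records from the MASH output and from other pipeline outputs could be joined without rewriting either file.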