All D-stat values are zero
carlos-sarabia opened this issue · comments
Hi, I am trying to run a test on D-stat with a single chromosome (chr38) of dog using an output of PLINK. I am only considering autosomes.
I first ran convertf in this .par file:
genotypename: DATA/chr38.bed
snpname: DATA/chr38.bim
indivname: DATA/chr38.pedind
outputformat: EIGENSTRAT
genotypeoutname: testchr38.geno
snpoutname: testchr38.snp
indivoutname: testchr38.ind
numchrom: 38
Input pedind:
1 speciesO.ind1 0 0 1 1
2 species1.ind1 0 0 2 1
3 species1.ind2 0 0 1 1
4 species1.ind3 0 0 2 1
5 species1.ind4 0 0 2 1
6 species1.ind5 0 0 2 1
7 species1.ind6 0 0 1 1
8 species2.ind1 0 0 1 1
9 species2.ind2 0 0 1 1
...
Output was:
testchr38.geno
000000100000000100000010
111101101010010001000101
011001000110100000001100
000001001010000100011000
001001000101101100011000
000000000000000010000000
000000000000000000000001
200000000000000000000000
001101100000000000000000
...
testchr38.snp
chr38_5 38 0.000000 5 C T
chr38_6 38 0.000000 6 G A
chr38_7 38 0.000000 7 A T
chr38_8 38 0.000000 8 T C
chr38_10 38 0.000000 10 C T
chr38_38 38 0.000000 38 G C
chr38_43 38 0.000000 43 C A
chr38_242 38 0.000002 242 A G
testchr38.ind
1:speciesO.ind1 M Control
2:species1.ind1 F Control
3:species1.ind2 M Control
4:species1.ind3 F Control
5:species1.ind4 F Control
6:species1.ind5 F Control
7:species1.ind6 M Control
8:species2.ind1 M Control
9:species2.ind2 M Control
...
I am trying to see admixture from a species2 into a specific population of species1 (pop1) vs another (species1, pop2). I have speciesO as outgroup. species1.ind1 and .ind2 belong to pop1, species1.ind3 ind4 ind5 and ind6 belong to pop2.
I modified the .ind file to work with pops:
ind1 M speciesO
ind1 F species1.POP1
ind2 M species1.POP1
ind3 F species1.POP2
ind4 F species1.POP2
ind5 F species1.POP2
ind6 M species1.POP2
ind1 M species2
ind2 M species2
...
And set up a .popfile with pops (test of contribution of species.POP2 to species2 also included):
speciesO species2 species1.POP1 species1,POP2
speciesO species1.POP2 species2 species1.POP1
.par file is:
genotypename: testchr38.geno
snpname: testchr38.snp
indivname: testchr38.modified.ind
popfilename: popfile.divided.pops
numchrom: 38
when I run qpDstat -p dstat.testchr38.par:
THE INPUT PARAMETERS
##PARAMETER NAME: VALUE
genotypename: testchr38.geno
snpname: testchr38.snp
indivname: testchr38.modified.ind
popfilename: popfile.divided.groups
numchrom: 38
qpDstat version: 755
number of quadruples 3
0 speciesO 1
1 species2 1
2 species1.POP1 4
3 species1.POP2 2
jackknife block size: 0.050
snps: 27956 indivs: 8
number of blocks for jackknife: 1
nrows, ncols: 8 27956
result: speciesO species2 species1.POP1 species1.POP2 0.0000 0.000 0 0 0
result: speciesO species1.POP1 species2 species1.POP2 0.0000 0.000 0 0 0
##end of qpDstat: 0.306 seconds cpu 21.373 Mbytes in use
What could be failing?
- There are some additional species in the .bed and .bim not considered for ABBA-BABA
- I am using 38 chromosomes
- Original naming of the individuals when calling .bed and .bim in PLINK were species1.ind1, species1.ind2, species2.ind1, speciesO.ind1 ...
Thanks!
Hi, thanks for the quick answer. In fact I had modified the .ind file to have the following information:
ind1 M speciesO
ind1 F species1.POP1
ind2 M species1.POP1
ind3 F species1.POP2
ind4 F species1.POP2
ind5 F species1.POP2
ind6 M species1.POP2
ind1 M species2
ind2 M species2
...
The problem arose after working with this new .ind file.
I had a similar issue with simulated data. As I was just testing the approach, I only simulated 20e6 sites, giving genetic length of 0.2. I run qpDstat for the data and got all zeros, including the last column specifying the number of sites in the input.
After hours of searching for possible formatting errors, I looked into the source code. I don't remember details but I think that the program requires at least five jackknife blocks to work and thus genetic length (the third column in *.snp file) of at least 0.25. If the genetic length is below that, the computation is skipped and only zeros are outputted.
I'm not sure if this feature is documented somewhere. It would be nice if a warning was given when the data are too short.
Hi Nick,
My interpretation may be wrong. I simulated data with msprime and qpDstat failed until I increased the length of the simulated sequence. Below I've analysed a dataset that works (based on a 50 Mbp sequence) and a subset (first 100k SNPs) that fails. When I multiply the genetic distances by three, the subset also works. I concluded that the genetic distances make a difference (through the number of jackknifing blocks).
Regards, Ari
[lx8:test]$ head -100000 data.snp > data.snp2
[lx8:test]$ head -100000 data.geno > data.geno2
[lx8:test]$ tail -n1 data.snp*
==> data.snp <==
1:49999975 1 0.500000 49999975 G T
==> data.snp2 <==
1:18688448 1 0.186884 18688448 C T
[lx8:test]$ wc data.*
271697 271697 38309277 data.geno
100000 100000 14100000 data.geno2
140 420 1990 data.ind
271697 1630182 17116911 data.snp
100000 600000 6300000 data.snp2
743534 2602299 75828178 total
[lx8:test]$ ../AdmixTools/bin/qpDstat -p params.txt
../AdmixTools/bin/qpDstat: parameter file: params.txt
### THE INPUT PARAMETERS
##PARAMETER NAME: VALUE
indivname: data.ind
snpname: data.snp
genotypename: data.geno
popfilename: listD.txt
## qpDstat version: 970
number of quadruples 3
0 pop4 20
1 pop3 20
2 pop0 20
3 pop1 20
4 pop2 20
5 pop6 20
jackknife block size: 0.050
snps: 271697 indivs: 120
number of blocks for block jackknife: 11
nrows, ncols: 120 271697
result: pop4 pop3 pop0 pop6 -0.1492 -14.794 5427 7329 271697
result: pop4 pop3 pop1 pop6 -0.1925 -12.247 5721 8446 271697
result: pop4 pop3 pop2 pop6 -0.2105 -14.193 5571 8539 271697
##end of qpDstat: 8.058 seconds cpu 286.966 Mbytes in use
[lx8:test]$ ../AdmixTools/bin/qpDstat -p params.txt2
../AdmixTools/bin/qpDstat: parameter file: params.txt2
### THE INPUT PARAMETERS
##PARAMETER NAME: VALUE
indivname: data.ind
snpname: data.snp2
genotypename: data.geno2
popfilename: listD.txt
## qpDstat version: 970
number of quadruples 3
0 pop4 20
1 pop3 20
2 pop0 20
3 pop1 20
4 pop2 20
5 pop6 20
jackknife block size: 0.050
snps: 100000 indivs: 120
number of blocks for block jackknife: 5
nrows, ncols: 120 100000
result: pop4 pop3 pop0 pop6 0.0000 0.000 0 0 0
result: pop4 pop3 pop1 pop6 0.0000 0.000 0 0 0
result: pop4 pop3 pop2 pop6 0.0000 0.000 0 0 0
##end of qpDstat: 2.971 seconds cpu 105.652 Mbytes in use
[lx8:test]$ awk '{print $1,$2,3*$3,$4,$5,$6}' data.snp2 > data.snp3
[lx8:test]$ ../AdmixTools/bin/qpDstat -p params.txt3
../AdmixTools/bin/qpDstat: parameter file: params.txt3
### THE INPUT PARAMETERS
##PARAMETER NAME: VALUE
indivname: data.ind
snpname: data.snp3
genotypename: data.geno2
popfilename: listD.txt
## qpDstat version: 970
number of quadruples 3
0 pop4 20
1 pop3 20
2 pop0 20
3 pop1 20
4 pop2 20
5 pop6 20
jackknife block size: 0.050
snps: 100000 indivs: 120
number of blocks for block jackknife: 13
nrows, ncols: 120 100000
result: pop4 pop3 pop0 pop6 -0.1494 -9.959 1987 2684 100000
result: pop4 pop3 pop1 pop6 -0.1954 -6.903 2133 3168 100000
result: pop4 pop3 pop2 pop6 -0.2165 -9.015 2075 3220 100000
##end of qpDstat: 3.003 seconds cpu 105.652 Mbytes in use