DReichLab / AdmixTools

Tools test whether admixture occurred and more

All D-stat values are zero

carlos-sarabia opened this issue

Hi, I am trying to run a D-statistic test with qpDstat on a single dog chromosome (chr38), using PLINK output as input. I am only considering autosomes.

I first ran convertf with this .par file:
genotypename: DATA/chr38.bed
snpname: DATA/chr38.bim
indivname: DATA/chr38.pedind
outputformat: EIGENSTRAT
genotypeoutname: testchr38.geno
snpoutname: testchr38.snp
indivoutname: testchr38.ind
numchrom: 38

Input pedind:
1 speciesO.ind1 0 0 1 1
2 species1.ind1 0 0 2 1
3 species1.ind2 0 0 1 1
4 species1.ind3 0 0 2 1
5 species1.ind4 0 0 2 1
6 species1.ind5 0 0 2 1
7 species1.ind6 0 0 1 1
8 species2.ind1 0 0 1 1
9 species2.ind2 0 0 1 1
...

Output was:
testchr38.geno
000000100000000100000010
111101101010010001000101
011001000110100000001100
000001001010000100011000
001001000101101100011000
000000000000000010000000
000000000000000000000001
200000000000000000000000
001101100000000000000000
...

testchr38.snp
chr38_5 38 0.000000 5 C T
chr38_6 38 0.000000 6 G A
chr38_7 38 0.000000 7 A T
chr38_8 38 0.000000 8 T C
chr38_10 38 0.000000 10 C T
chr38_38 38 0.000000 38 G C
chr38_43 38 0.000000 43 C A
chr38_242 38 0.000002 242 A G

testchr38.ind
1:speciesO.ind1 M Control
2:species1.ind1 F Control
3:species1.ind2 M Control
4:species1.ind3 F Control
5:species1.ind4 F Control
6:species1.ind5 F Control
7:species1.ind6 M Control
8:species2.ind1 M Control
9:species2.ind2 M Control
...

I am trying to test for admixture from species2 into one population of species1 (POP1) relative to another population of species1 (POP2), with speciesO as the outgroup. species1.ind1 and species1.ind2 belong to POP1; species1.ind3, ind4, ind5 and ind6 belong to POP2.

I modified the .ind file to work with pops:
ind1 M speciesO
ind1 F species1.POP1
ind2 M species1.POP1
ind3 F species1.POP2
ind4 F species1.POP2
ind5 F species1.POP2
ind6 M species1.POP2
ind1 M species2
ind2 M species2
...
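
For anyone who prefers to relabel the .ind file by script rather than by hand, here is a minimal Python sketch. It assumes the three-column .ind layout shown above (sample ID, sex, label); the function name `relabel_ind`, the `pop_of` mapping and the `Unassigned` fallback label are hypothetical and not part of ADMIXTOOLS. It keeps the sample IDs written by convertf, so they stay unique, and changes only the third column.

```python
def relabel_ind(lines, pop_of, default="Unassigned"):
    """Return .ind lines with the third column replaced by a population label.

    lines   : iterable of 'ID SEX LABEL' strings (EIGENSTRAT .ind layout)
    pop_of  : dict mapping sample ID -> population label (hypothetical input)
    default : label given to samples that are missing from pop_of
    """
    out = []
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        sample_id, sex = fields[0], fields[1]
        out.append(f"{sample_id} {sex} {pop_of.get(sample_id, default)}")
    return out


# A few lines taken from testchr38.ind as shown in this thread
ind_lines = [
    "1:speciesO.ind1 M Control",
    "2:species1.ind1 F Control",
    "4:species1.ind3 F Control",
    "8:species2.ind1 M Control",
]

# Illustrative mapping; the last sample is left out on purpose
pop_of = {
    "1:speciesO.ind1": "speciesO",
    "2:species1.ind1": "species1.POP1",
    "4:species1.ind3": "species1.POP2",
}

relabelled = relabel_ind(ind_lines, pop_of)
print("\n".join(relabelled))
```

Samples that are not in the mapping receive the fallback label, which makes it easy to spot individuals that were not assigned to any population.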

I then set up a popfile listing the quadruples to test (a test of the contribution of species1.POP2 to species2 is also included):
speciesO species2 species1.POP1 species1.POP2
speciesO species1.POP2 species2 species1.POP1

The .par file is:
genotypename: testchr38.geno
snpname: testchr38.snp
indivname: testchr38.modified.ind
popfilename: popfile.divided.pops
numchrom: 38

When I run qpDstat -p dstat.testchr38.par, I get:

THE INPUT PARAMETERS

##PARAMETER NAME: VALUE
genotypename: testchr38.geno
snpname: testchr38.snp
indivname: testchr38.modified.ind
popfilename: popfile.divided.groups
numchrom: 38

qpDstat version: 755

number of quadruples 3
0 speciesO 1
1 species2 1
2 species1.POP1 4
3 species1.POP2 2
jackknife block size: 0.050
snps: 27956 indivs: 8
number of blocks for jackknife: 1
nrows, ncols: 8 27956
result: speciesO species2 species1.POP1 species1.POP2 0.0000 0.000 0 0 0
result: speciesO species1.POP1 species2 species1.POP2 0.0000 0.000 0 0 0
##end of qpDstat: 0.306 seconds cpu 21.373 Mbytes in use


What could be failing?

  1. There are some additional species in the .bed and .bim not considered for ABBA-BABA
  2. I am using 38 chromosomes
  3. The original names of the individuals when creating the .bed and .bim in PLINK were species1.ind1, species1.ind2, species2.ind1, speciesO.ind1 ...

Thanks!

Hi, thanks for the quick answer. In fact I had modified the .ind file to have the following information:

ind1 M speciesO
ind1 F species1.POP1
ind2 M species1.POP1
ind3 F species1.POP2
ind4 F species1.POP2
ind5 F species1.POP2
ind6 M species1.POP2
ind1 M species2
ind2 M species2
...

The problem arose after working with this new .ind file.

I had a similar issue with simulated data. As I was just testing the approach, I only simulated 20e6 sites, giving a genetic length of 0.2. I ran qpDstat on the data and got all zeros, including the last column, which gives the number of sites in the input.

After hours of searching for possible formatting errors, I looked into the source code. I don't remember the details, but I think the program requires at least five jackknife blocks to work, and thus a genetic length (the third column in the *.snp file) of at least 0.25. If the genetic length is below that, the computation is skipped and only zeros are output.

I'm not sure if this feature is documented somewhere. It would be nice if a warning was given when the data are too short.
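
In the meantime, a quick check of the genetic length in a .snp file can be done before running qpDstat. The Python sketch below is only a rough diagnostic: it assumes the six-column .snp layout shown in this thread (ID, chromosome, genetic position, physical position, two alleles) and the 0.05 block size printed in the logs. The function names are hypothetical, and the ratio it returns is not qpDstat's own block count, which may differ.

```python
def genetic_span(snp_lines):
    """Return {chromosome: (min_gpos, max_gpos)} from EIGENSTRAT .snp lines.

    The genetic position is read from the third column.
    """
    span = {}
    for line in snp_lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        chrom, gpos = fields[1], float(fields[2])
        lo, hi = span.get(chrom, (gpos, gpos))
        span[chrom] = (min(lo, gpos), max(hi, gpos))
    return span


def rough_block_ratio(span, block_size=0.05):
    """Total genetic length divided by the block size.

    This is a rough indicator only; it does not reproduce how qpDstat
    assigns SNPs to jackknife blocks.
    """
    total = sum(hi - lo for lo, hi in span.values())
    return total / block_size


# First and last lines of the testchr38.snp excerpt shown in this thread
snp_lines = [
    "chr38_5 38 0.000000 5 C T",
    "chr38_242 38 0.000002 242 A G",
]

span = genetic_span(snp_lines)
ratio = rough_block_ratio(span)
for chrom, (lo, hi) in span.items():
    print(f"chromosome {chrom}: genetic length {hi - lo:.6f}")
print(f"length / block size: {ratio:.6f}")
```

To check a whole file, pass an open file handle instead of the list, for example `genetic_span(open("testchr38.snp"))`. A ratio well below five would point to the situation described above.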

Hi Nick,
My interpretation may be wrong. I simulated data with msprime, and qpDstat failed until I increased the length of the simulated sequence. Below I've analysed a dataset that works (based on a 50 Mbp sequence) and a subset (the first 100k SNPs) that fails. When I multiply the genetic distances by three, the subset also works. I concluded that the genetic distances make a difference (through the number of jackknife blocks).

Regards, Ari

[lx8:test]$ head -100000 data.snp > data.snp2
[lx8:test]$ head -100000 data.geno > data.geno2 

[lx8:test]$ tail -n1 data.snp*
==> data.snp <==
          1:49999975     1        0.500000        49999975 G T

==> data.snp2 <==
          1:18688448     1        0.186884        18688448 C T

[lx8:test]$ wc  data.*
  271697   271697 38309277 data.geno
  100000   100000 14100000 data.geno2
     140      420     1990 data.ind
  271697  1630182 17116911 data.snp
  100000   600000  6300000 data.snp2
  743534  2602299 75828178 total

[lx8:test]$ ../AdmixTools/bin/qpDstat -p params.txt 
../AdmixTools/bin/qpDstat: parameter file: params.txt
### THE INPUT PARAMETERS
##PARAMETER NAME: VALUE
indivname: data.ind
snpname: data.snp
genotypename: data.geno
popfilename: listD.txt
## qpDstat version: 970
number of quadruples 3
  0                 pop4   20
  1                 pop3   20
  2                 pop0   20
  3                 pop1   20
  4                 pop2   20
  5                 pop6   20
jackknife block size:     0.050
snps: 271697  indivs: 120
number of blocks for block jackknife: 11
nrows, ncols: 120 271697
result:       pop4       pop3       pop0       pop6     -0.1492    -14.794    5427   7329 271697 
result:       pop4       pop3       pop1       pop6     -0.1925    -12.247    5721   8446 271697 
result:       pop4       pop3       pop2       pop6     -0.2105    -14.193    5571   8539 271697 
##end of qpDstat:        8.058 seconds cpu      286.966 Mbytes in use

[lx8:test]$ ../AdmixTools/bin/qpDstat -p params.txt2 
../AdmixTools/bin/qpDstat: parameter file: params.txt2
### THE INPUT PARAMETERS
##PARAMETER NAME: VALUE
indivname: data.ind
snpname: data.snp2
genotypename: data.geno2
popfilename: listD.txt
## qpDstat version: 970
number of quadruples 3
  0                 pop4   20
  1                 pop3   20
  2                 pop0   20
  3                 pop1   20
  4                 pop2   20
  5                 pop6   20
jackknife block size:     0.050
snps: 100000  indivs: 120
number of blocks for block jackknife: 5
nrows, ncols: 120 100000
result:       pop4       pop3       pop0       pop6      0.0000      0.000       0      0      0 
result:       pop4       pop3       pop1       pop6      0.0000      0.000       0      0      0 
result:       pop4       pop3       pop2       pop6      0.0000      0.000       0      0      0 
##end of qpDstat:        2.971 seconds cpu      105.652 Mbytes in use

[lx8:test]$ awk '{print $1,$2,3*$3,$4,$5,$6}' data.snp2 > data.snp3

[lx8:test]$ ../AdmixTools/bin/qpDstat -p params.txt3
../AdmixTools/bin/qpDstat: parameter file: params.txt3
### THE INPUT PARAMETERS
##PARAMETER NAME: VALUE
indivname: data.ind
snpname: data.snp3
genotypename: data.geno2
popfilename: listD.txt
## qpDstat version: 970
number of quadruples 3
  0                 pop4   20
  1                 pop3   20
  2                 pop0   20
  3                 pop1   20
  4                 pop2   20
  5                 pop6   20
jackknife block size:     0.050
snps: 100000  indivs: 120
number of blocks for block jackknife: 13
nrows, ncols: 120 100000
result:       pop4       pop3       pop0       pop6     -0.1494     -9.959    1987   2684 100000 
result:       pop4       pop3       pop1       pop6     -0.1954     -6.903    2133   3168 100000 
result:       pop4       pop3       pop2       pop6     -0.2165     -9.015    2075   3220 100000 
##end of qpDstat:        3.003 seconds cpu      105.652 Mbytes in use
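
To spot the failing case automatically, the `result:` lines can be parsed and checked. The Python sketch below assumes the layout seen in the logs above: four population names, the D value, the Z-score, two count columns, and the number of SNPs. Treating the two count columns as the site-pattern counts that D is computed from is an assumption, though it agrees with the numbers in the first run: (5427 - 7329) / (5427 + 7329) is about -0.149. The names `parse_result`, `d_from_counts`, `count_a` and `count_b` are hypothetical.

```python
def parse_result(line):
    """Parse one qpDstat 'result:' line into a dict (layout assumed from the logs)."""
    fields = line.split()
    if len(fields) != 10 or fields[0] != "result:":
        raise ValueError(f"not a recognised result line: {line!r}")
    return {
        "pops": tuple(fields[1:5]),
        "d": float(fields[5]),
        "z": float(fields[6]),
        "count_a": int(fields[7]),  # first count column (assumed site-pattern count)
        "count_b": int(fields[8]),  # second count column (assumed site-pattern count)
        "nsnps": int(fields[9]),
    }


def d_from_counts(result):
    """Recompute D as (a - b) / (a + b); return None when both counts are zero."""
    total = result["count_a"] + result["count_b"]
    if total == 0:
        return None
    return (result["count_a"] - result["count_b"]) / total


# One line from the working run and one from the failing run in this thread
working = parse_result(
    "result:       pop4       pop3       pop0       pop6     -0.1492    -14.794    5427   7329 271697 "
)
failing = parse_result(
    "result:       pop4       pop3       pop0       pop6      0.0000      0.000       0      0      0 "
)

for res in (working, failing):
    if res["nsnps"] == 0:
        print(res["pops"], "no SNPs used: the computation was probably skipped")
    else:
        print(res["pops"], "reported D:", res["d"], "from counts:", round(d_from_counts(res), 4))
```

A zero in the last column separates a skipped computation from a genuine D of zero, which would still report a non-zero number of SNPs.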