standard-aaron / clues

Coalescent Likelihood Under Effects of Selection - Inferring selection & allele frequency trajectories from nucleotide data

inference.py not returning the output files

vasilipankratov opened this issue

Hi Aaron!
Thank you very much for implementing CLUES with Relate. I guess the popgen world was waiting for that :).

I encountered a somewhat strange problem: when I run the following command

python inference.py  --times 500_subset_chr${chr}_pos${pos} \
--coal 500_subset.coal --popFreq 0.49 -o 500_subset_chr${chr}_pos${pos}

(this is for the LCT mutation)

I get the following on stdout:

logLR: 23.4617

MLE:
========
epoch   selection
0-1000  0.01299

Trajectory for 100 gens before present:
=============
gen     freq
0 0.46170708562006263
50 0.43311158387458726
100 0.3103181445757087
150 0.2454585447143809
200 0.21020270699684712
250 0.18844648743470482
300 0.17311540908272202
350 0.16082278456416838
400 0.14556859702825017
450 0.1311794503509021
500 0.11639614277487412
550 0.09286595138312356
600 0.08057476687506046
650 0.07211435368097045
700 0.06343349112729695
750 0.04972621487012437
800 0.03896682620968329
850 0.034583826416153186
900 0.031143549373102642
950 0.028128592723572182

Finished.

but I don't get any out files, so I don't have the posterior probability distribution.
I don't know whether this is related, but the inference.py help lists an option "--out" that is not covered in the documentation, so it's not clear what it does.
I am running Python 3.6.3 with all the dependencies installed, and all previous steps worked fine.

And if non-code-related questions are allowed, I have a few regarding the general approach

  1. I have a rather big dataset from one population (~2000 genomes). I first built the trees with Relate on the whole dataset, then subsampled 500 individuals and ran the branch-length estimation procedure on that subset (first 5 rounds of MCMC with a subset of SNPs, then 1 round on all SNPs using the .coal file from the previous step). I then use those trees with estimated branch lengths to sample trees at the focal SNP. I cannot do that on the whole dataset because it would take too long to run. Did I understand correctly that I can overcome the issue by building the trees again, but using the .coal file I have instead of the fixed Ne, and that this will result in trees suitable as CLUES input? (Sorry if this question should better go to Leo Speidel.)
  2. Is there any rule of thumb for the number of trees to sample with SampleBranchLengths.sh and for the burn-in and thin values in extract_coals.py? For a trial run I sampled 100 trees, discarded the first 10 as burn-in and kept every 5th, so 18 trees in total. How far is that from what you would actually like to have?

Thanks a lot,

Vasili

Hi Vasili,

  1. You need to specify --out. Sorry if that was unclear in the docs! I'll change them to make that clearer.

  2. I'm not sure I follow your question. Basically, I recommend that you follow the guidelines in Relate for inferring the .coal file (see 'Inferring population size'). Is this prohibitive for you? If so, perhaps you could use preexisting Ne estimates for your population of interest and format them to .coal specifications. You should then run SampleBranchLengths.sh on the selected SNP only. Does that answer your question?

  3. I'd recommend using at least ~20 trees. With ARGweaver we found 20 was sufficient to detect ongoing sweeps down to s = 0.001 (very weak). For good measure you could go up as high as you like (maybe 100? 500?) but both the runtime of SampleBranchLengths and the optimization/inference step will scale linearly with the number of MCMC samples.
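
To make the sample-count arithmetic concrete, here is a minimal Python sketch. It assumes that burn-in drops the first `burn_in` samples and that thinning then keeps every `thin`-th sample starting from the first one retained, which matches the 100 -> 18 count in the question. The function names are hypothetical helpers for illustration and are not part of CLUES or Relate; how extract_coals.py actually indexes samples should be checked against its source.

```python
def kept_samples(n_sampled, burn_in, thin):
    """Count samples left after dropping the first `burn_in`
    and keeping every `thin`-th sample after that (assumed scheme)."""
    if thin < 1 or burn_in < 0:
        raise ValueError("thin must be >= 1 and burn_in >= 0")
    return len(range(burn_in, n_sampled, thin))


def samples_needed(n_kept, burn_in, thin):
    """Smallest number of MCMC samples that leaves at least `n_kept`
    samples under the same burn-in/thinning scheme."""
    if n_kept < 1:
        raise ValueError("n_kept must be >= 1")
    return burn_in + (n_kept - 1) * thin + 1


# The trial run from the question: 100 sampled, burn-in 10, thin 5.
print(kept_samples(100, 10, 5))    # 18
# Samples required to keep at least 20 trees with the same settings.
print(samples_needed(20, 10, 5))   # 106
```

Under these assumptions the trial run falls just short of the ~20 trees suggested above. Sampling about 106 trees with the same burn-in and thinning would reach 20, and because runtime scales roughly linearly with the number of samples, that costs about 6% more than sampling 100.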

Closing this issue because the subject line has been addressed