mcveanlab / treeseq-inference

Work for the tree sequence inference paper.

Another argweaver bug

hyanwong opened this issue · comments

argweaver/bin/arg-sample --sites data/argweaver_bug.sites --popsize 5000 --recombrate 2.5e-08 --mutrate 3.76782964726e-06 --overwrite --quiet --randseed 1355090636 --iters 5000 --sample-step 5000 --output tmp/bug
arg-sample: src/argweaver/sample_thread.cpp:517: int argweaver::sample_hmm_posterior_step(const argweaver::TransMatrixSwitch*, const double*, int): Assertion `matrix->get(k, state2) != 0.0' failed.
Aborted

This only fails on holly, though. It works OK on my laptop. Some sort of rounding / maths bug that is processor or C library dependent?

It may be worth posting this (and the argweaver_bug.sites file) to the ARGweaver GitHub repo.

NB. This causes the following error in the plots.py script:

multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/lib/python3.4/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "src/plots.py", line 494, in infer_worker
    return int(row[0]), runner.run()
  File "src/plots.py", line 293, in run
    ret = self.__run_ARGweaver()
  File "src/plots.py", line 367, in __run_ARGweaver
    self.row.aw_iter_out_freq, int(self.row.aw_burnin_iters))
  File "src/plots.py", line 452, in run_argweaver
    '--output', burn_prefix])
  File "src/plots.py", line 243, in time_cmd
    " ".join(cmd), exit_status, stderr.read()))
ValueError: Error running '/home/yan/treeseq-inference/src/../argweaver/bin/arg-sample --sites data/raw__NOBACKUP__/metrics_by_mutation_rate/simulations/msprime-n10_Ne
5000.0_l5000_rho0.000000025_mu0.00000376783-gs1355090636_ms1355090636err0.1.sites --popsize 5000 --recombrate 2.5e-08 --mutrate 3.76782964726e-06 --overwrite --quiet -
-randseed 1355090636 --iters 5000 --sample-step 5000 --output data/raw__NOBACKUP__/metrics_by_mutation_rate/simulations/aweaver+msprime-n10_Ne5000.0_l5000_rho0.0000000
25_mu0.00000376783-gs1355090636_ms1355090636err0.1+ws1355090636_burn': status=134:stderrb"arg-sample: src/argweaver/sample_thread.cpp:517: int argweaver::sample_hmm_po
sterior_step(const argweaver::TransMatrixSwitch*, const double*, int): Assertion `matrix->get(k, state2) != 0.0' failed.\nCommand terminated by signal 6\n12716 2.40 96
.44\n"
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "src/plots.py", line 1486, in <module>
    main()
  File "src/plots.py", line 1480, in main
    args.func(cls, args)
  File "src/plots.py", line 1366, in run_infer
    f.infer(args.processes, args.threads, args.force)
  File "src/plots.py", line 698, in infer
    for row_id, updated in pool.imap_unordered(infer_worker, work):
  File "/usr/lib/python3.4/multiprocessing/pool.py", line 689, in next
    raise value
ValueError: Error running '/home/yan/treeseq-inference/src/../argweaver/bin/arg-sample --sites data/raw__NOBACKUP__/metrics_by_mutation_rate/simulations/msprime-n10_Ne
5000.0_l5000_rho0.000000025_mu0.00000376783-gs1355090636_ms1355090636err0.1.sites --popsize 5000 --recombrate 2.5e-08 --mutrate 3.76782964726e-06 --overwrite --quiet -
-randseed 1355090636 --iters 5000 --sample-step 5000 --output data/raw__NOBACKUP__/metrics_by_mutation_rate/simulations/aweaver+msprime-n10_Ne5000.0_l5000_rho0.0000000
25_mu0.00000376783-gs1355090636_ms1355090636err0.1+ws1355090636_burn': status=134:stderrb"arg-sample: src/argweaver/sample_thread.cpp:517: int argweaver::sample_hmm_po
sterior_step(const argweaver::TransMatrixSwitch*, const double*, int): Assertion `matrix->get(k, state2) != 0.0' failed.\nCommand terminated by signal 6\n12716 2.40 96
.44\n"

This is raised as a ValueError on line 241. But we should probably carry on, so that any such error doesn't kill the entire run. For later output, it doesn't matter if the ARGweaver run fails: the plots.py script should just omit a row if it can't find the right output files.

Now hacked around by wrapping the call in a try-except block: if the error message contains "src/argweaver/sample_thread.cpp:517", the exception is caught and logged, and the process continues. Otherwise the exception is re-raised and the process stops. This should be enough to work around this specific bug until we work out why ARGweaver is complaining.
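For reference, a minimal sketch of the kind of workaround described above. This is not the actual code in plots.py: `run_cmd` is a hypothetical stand-in for `time_cmd`, and `run_tolerating_known_bug` and `KNOWN_ARGWEAVER_BUG` are names made up for illustration. Only the matched substring and the catch, log, or re-raise behaviour come from the comment above.

```python
import logging
import subprocess

# Substring identifying the specific ARGweaver assertion failure in this issue.
KNOWN_ARGWEAVER_BUG = "src/argweaver/sample_thread.cpp:517"


def run_cmd(cmd):
    """Hypothetical stand-in for time_cmd: run cmd, raise ValueError on failure."""
    proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    if proc.returncode != 0:
        raise ValueError("Error running '{}': status={}:stderr{}".format(
            " ".join(cmd), proc.returncode, proc.stderr))
    return proc.stdout


def run_tolerating_known_bug(cmd, runner=run_cmd):
    """Return the runner's result, or None if the known assertion failure occurred."""
    try:
        return runner(cmd)
    except ValueError as e:
        if KNOWN_ARGWEAVER_BUG in str(e):
            # Known bug: log it and let the caller skip this row.
            logging.warning("Skipping run that hit the known ARGweaver bug: %s", e)
            return None
        # Anything else is unexpected, so stop the run as before.
        raise
```

Returning `None` lets the caller leave the corresponding row empty, while any other failure still stops the run.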

Reported at mdrasmus/argweaver#21, so closing.

Sounds good to me @hyanwong. Re the ARGweaver bug, a possible cause might be differences between GCC and clang, specifically with respect to the optimisations enabled by default. It might be worth hacking the makefile to set "CXX = clang++" on holly and seeing if the problem persists.

I doubt the problem is processor dependent, as all Intel processors look very much the same these days, and IEEE floating-point semantics take nearly all the nastiness out of floats.
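As a general illustration (not a diagnosis of the ARGweaver failure), the sketch below shows two ways a floating-point result can depend on how a computation is arranged, even under IEEE semantics: regrouping an addition changes the result, and a product of very small positive numbers can underflow to exactly 0.0. The function names and values are made up for illustration. Whether either effect is behind the `!= 0.0` assertion is an open question.

```python
import math


def regrouping_changes_result():
    """Floating-point addition is not associative, so evaluation order matters."""
    left = (0.1 + 0.2) + 0.3
    right = 0.1 + (0.2 + 0.3)
    return left, right


def underflow_example(p=1e-200):
    """A product of tiny positive values can underflow to 0.0; log space does not."""
    product = p * p                           # 1e-400 is below the smallest double
    log_product = math.log(p) + math.log(p)   # stays finite
    return product, log_product


left, right = regrouping_changes_result()
print(left == right)          # False

product, log_product = underflow_example()
print(product == 0.0)         # True
print(math.isfinite(log_product))  # True
```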