mcveanlab / treeseq-inference

Work for the tree sequence inference paper.

Can't export newick trees - is `simplify` working OK?

hyanwong opened this issue · comments

try switching to the test-newick-export branch, and running

plots.py -v setup num_records_by_sample_size
plots.py -v generate num_records_by_sample_size

I get

Traceback (most recent call last):
  File "/Users/yan/Documents/Research/Wellcome/treeseq-inference/src/plots.py", line 695, in <module>
    main()
  File "/Users/yan/Documents/Research/Wellcome/treeseq-inference/src/plots.py", line 692, in main
    args.func(cls, extra_args)
  File "/Users/yan/Documents/Research/Wellcome/treeseq-inference/src/plots.py", line 609, in run_generate
    f.generate()
  File "/Users/yan/Documents/Research/Wellcome/treeseq-inference/src/plots.py", line 476, in generate
    inferred_ts.write_nexus_trees(out)
  File "/Users/yan/Documents/Research/Wellcome/treeseq-inference/src/msprime_extras.py", line 29, in write_nexus_trees
    for t, (_, newick) in zip(ts.trees(), ts.newick_trees()): #TO DO: should do ts.newick_trees(zero_based_tip_numbers)
  File "/Users/yan/Documents/Research/Wellcome/treeseq-inference/src/../msprime/msprime/trees.py", line 1441, in newick_trees
    for length, tree in iterator:
_msprime.LibraryError: Newick export not supported for non binary trees.

@jeromekelleher am I being dense here, or is there some problem with the trees generated with your tsinf program? Can they normally be simplify() ed and then exported via newick_trees()?

No, this is a known limitation of the current newick trees implementation @hyanwong --- it only works for binary trees right now. We infer lots of non-binary events in tsinfer, so this is definitely going to break.

https://github.com/jeromekelleher/msprime/issues/117

I'll get on this as soon as I can.

Do you think the various libs will accept non-binary trees in Newick?

Is the RF metric well defined for non-binary trees?

I get a warning in R, but it still gives a result for the RF metric. In the docs it says "The normalized Robinson-Foulds distance is derived by dividing d(T_1, T_2) by the maximal possible distance i(T_1) + i(T_2). If both trees are unrooted and binary this value is 2n-6."

Which implies that it can cope with non-binary trees.
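
For reference, the quoted normalisation can be written out as below. This is only a restatement of the docs quote, assuming n is the number of leaves and i(T) is the number of internal edges of T (that reading of the notation is an assumption, not something stated in the thread). An unrooted binary tree on n leaves has n - 3 internal edges, which is where 2n-6 comes from; a tree with polytomies has fewer, so the denominator shrinks.

```latex
\[
d_{\mathrm{norm}}(T_1, T_2) = \frac{d(T_1, T_2)}{i(T_1) + i(T_2)},
\qquad
i(T_1) + i(T_2) = (n - 3) + (n - 3) = 2n - 6
\quad \text{if both trees are unrooted and binary.}
\]
```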

Will check for the other stats.

Slightly worried about this, I suppose. To be comparable, we probably want a metric which is the average over all possible random resolutions of the multifurcations (polytomies). Otherwise we aren't providing a fair metric when we do the comparisons. I'm not sure if the RF metric for non-binary trees has this property.

Indeed not:

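library(phytools)
library(phangorn)
# read.tree is from ape, which phytools and phangorn both depend on
dist.metric <- RF.dist
true.tree <- read.tree(text="(((Human,Chimp),Gorilla),Orang);")

# Distance from the true tree to a tree with a three-way polytomy
c(nonbinary.metric=dist.metric(true.tree, read.tree(text="((Human,Chimp,Gorilla),Orang);")))

# Distances from the true tree to each binary resolution of that polytomy
resolved <- c(
    dist.metric(true.tree, read.tree(text="(((Human,Chimp),Gorilla),Orang);")),
    dist.metric(true.tree, read.tree(text="((Human,(Chimp,Gorilla)),Orang);")),
    dist.metric(true.tree, read.tree(text="(((Human,Gorilla),Chimp),Orang);")))

resolved
c(av.binary.metric=mean(resolved))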

gives

> c(nonbinary.metric=dist.metric(true.tree, read.tree(text="((Human,Chimp,Gorilla),Orang);")))
Trees are not binary!
nonbinary.metric 
               1 
> 
> resolved <- c(
+     dist.metric(true.tree, read.tree(text="(((Human,Chimp),Gorilla),Orang);")),
+     dist.metric(true.tree, read.tree(text="((Human,(Chimp,Gorilla)),Orang);")),
+     dist.metric(true.tree, read.tree(text="(((Human,Gorilla),Chimp),Orang);")))
> 
> resolved
[1] 0 2 2
> c(av.binary.metric=mean(resolved))
av.binary.metric 
        1.333333 

NB: SPR distance measure cannot be calculated by R for trees with polytomies.
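
The numbers above can be checked by hand with a small Python sketch. It assumes the distance being reported is the unrooted Robinson-Foulds distance, counted as the number of non-trivial bipartitions present in one tree but not the other; the nested-tuple tree encoding and the function names (`leaves`, `splits`, `rf_distance`) are made up for illustration and are not part of phangorn or msprime.

```python
from itertools import chain


def leaves(tree):
    """Return the frozenset of leaf labels in a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset(chain.from_iterable(leaves(child) for child in tree))


def splits(tree):
    """Return the non-trivial bipartitions of the unrooted version of `tree`."""
    all_leaves = leaves(tree)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        below = leaves(node)
        other = all_leaves - below
        # A bipartition is non-trivial only if each side has at least two leaves.
        if len(below) >= 2 and len(other) >= 2:
            # Store the split as an unordered pair, so the two edges next to
            # the root (which describe the same split) are counted once.
            found.add(frozenset([below, other]))
        for child in node:
            visit(child)

    visit(tree)
    return found


def rf_distance(t1, t2):
    """Number of bipartitions found in exactly one of the two trees."""
    return len(splits(t1) ^ splits(t2))


true_tree = ((("Human", "Chimp"), "Gorilla"), "Orang")
polytomy = (("Human", "Chimp", "Gorilla"), "Orang")
resolutions = [
    ((("Human", "Chimp"), "Gorilla"), "Orang"),
    (("Human", ("Chimp", "Gorilla")), "Orang"),
    ((("Human", "Gorilla"), "Chimp"), "Orang"),
]

print(rf_distance(true_tree, polytomy))
resolved = [rf_distance(true_tree, r) for r in resolutions]
print(resolved, sum(resolved) / len(resolved))
```

Under that assumption the polytomy scores 1 because it simply lacks the one split the true tree has, whereas two of the three resolutions each add a wrong split as well, which is why the average over resolutions is higher.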

This is a tricky issue. I think we'll have great difficulty in persuading libraries to compute tree metrics on non binary trees... Options:

  1. Hack the output to insert zero length branches and force binary trees (ugly, but practical);
  2. Implement tree metrics in msprime to deal with non binary trees properly (clean, but time consuming).
  3. Something else?

What do you think @hyanwong?

  3. randomly resolve polytomies and take an average of the distance metric over all, or a sample of, resolved trees (advantage: can use all the standard metrics)
  4. use a metric that performs "properly" (for our purposes) when confronted with polytomies - i.e. it gives the average metric over all possible binary resolutions.

(3) is easy, and I am in the middle of implementing it. But it adds (yet another) source of variation in our metrics. Hopefully that means (in the worst case) we need to run more replicates.
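
A rough Python sketch of what option (3) could look like, not the code that went into the repo. Trees are nested tuples of leaf names, and `resolve_polytomies` and `mean_resolved_distance` are hypothetical names; the metric is passed in as any function of two trees. Joining random pairs of subtrees gives each resolution of a three-way polytomy equal probability, but for larger polytomies this sketch makes no claim to sample resolutions uniformly.

```python
import random


def resolve_polytomies(tree, rng):
    """Return a copy of `tree` in which every polytomy is randomly made binary."""
    if isinstance(tree, str):
        return tree
    children = [resolve_polytomies(child, rng) for child in tree]
    # Join two randomly chosen subtrees until only two children remain.
    while len(children) > 2:
        a = children.pop(rng.randrange(len(children)))
        b = children.pop(rng.randrange(len(children)))
        children.append((a, b))
    return tuple(children)


def mean_resolved_distance(reference, tree, metric, replicates=100, seed=1):
    """Average `metric(reference, resolved)` over random resolutions of `tree`."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(replicates):
        total += metric(reference, resolve_polytomies(tree, rng))
    return total / replicates


polytomy = (("Human", "Chimp", "Gorilla"), "Orang")
print(resolve_polytomies(polytomy, random.Random(42)))
```

The number of replicates is the extra source of variation mentioned above: more replicates bring the average closer to the mean over all resolutions, at the cost of more metric evaluations.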

(4) requires some thought - I have emailed my tree metrics contact about it.

What do you think?

I presume that if we do random resolving, it will provide a fair comparison with ARGweaver, which provides a large sample of possible (binary) trees, with their likelihoods. That's equivalent to us doing random polytomy breaking, I think.

In an ideal world I would lean towards (2) coupled with (4), as at least this produces reusable code, and other users of msprime can benefit from it. This is leaning dangerously off-topic though, so if (3) is easy then that seems like a good approach.

(3) is trivial. Let's go with that for the moment, but wait to hear back from my contact.

(2) still has the problem that you need a metric whose results are comparable to other published results. The difficulty is not (just) in implementing something in msprime vs R, it's in finding what to implement.

OK, sounds good. So; we need a newick output for msprime that'll give us non-binary trees?

I'll add one in to the source here, as it's quicker than updating the (heavily optimised) C version in the main library.

> OK, sounds good. So; we need a newick output for msprime that'll give us non-binary trees?

exactly

OK, will do ASAP.

I've added a newick_trees function to the msprime_extras module which supports nonbinary trees.
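
For readers without the repo to hand, here is a minimal sketch of how a Newick writer can handle nodes with any number of children. It is not the `msprime_extras` code: the function name `to_newick` and the plain-dict inputs (`children`, `time`, `labels`) are assumptions made for illustration, and it uses simple recursion, so very deep trees could hit Python's recursion limit.

```python
def to_newick(root, children, time, labels=None):
    """Return a Newick string for the tree below `root`.

    children: dict mapping a node to a list of its child nodes (any length)
    time:     dict mapping a node to its time; branch length = parent - child
    labels:   optional dict mapping leaf nodes to names
    """
    def build(node):
        kids = children.get(node, [])
        if not kids:
            # Leaves are written by name, or by node ID if no labels are given.
            return str(labels[node]) if labels else str(node)
        parts = []
        for child in kids:
            branch = time[node] - time[child]
            parts.append("{}:{:g}".format(build(child), branch))
        # Joining all children with commas works for 2, 3 or more of them.
        return "(" + ",".join(parts) + ")"

    return build(root) + ";"


# Node 4 has three children, so this tree is not binary.
children = {4: [0, 1, 2], 5: [4, 3]}
time = {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 1.0, 5: 2.5}
print(to_newick(5, children, time))
```

Nothing in the Newick grammar limits a node to two children, so the only real change from a binary-only writer is looping over a list of children instead of assuming a left and a right child.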

I've also tidied the module up a bit and added some test cases. It would be good if we could simplify the interface to write_nexus_trees (possibly splitting into two functions?) and write some tests to ensure it's doing what we think it's doing.

Closing this issue as the immediate problem has been resolved.