alephreish / ggtree_guide

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Setup

We’ll be using the following packages:

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)
library(castor)

## Loading required package: Rcpp

library(phangorn)

## Loading required package: ape

library(phytools)

## Loading required package: maps

library(ggtree)

## ggtree v3.2.1  For help: https://yulab-smu.top/treedata-book/
## 
## If you use ggtree in published research, please cite the most appropriate paper(s):
## 
## 1. Guangchuang Yu. Using ggtree to visualize data on tree-like structures. Current Protocols in Bioinformatics. 2020, 69:e96. doi:10.1002/cpbi.96
## 2. Guangchuang Yu, Tommy Tsan-Yuk Lam, Huachen Zhu, Yi Guan. Two methods for mapping and visualizing associated data on phylogeny using ggtree. Molecular Biology and Evolution. 2018, 35(12):3041-3043. doi:10.1093/molbev/msy194
## 3. Guangchuang Yu, David Smith, Huachen Zhu, Yi Guan, Tommy Tsan-Yuk Lam. ggtree: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated data. Methods in Ecology and Evolution. 2017, 8(1):28-36. doi:10.1111/2041-210X.12628

## 
## Attaching package: 'ggtree'

## The following object is masked from 'package:ape':
## 
##     rotate

## The following object is masked from 'package:tidyr':
## 
##     expand

library(treeio)

## treeio v1.18.1  For help: https://yulab-smu.top/treedata-book/
## 
## If you use treeio in published research, please cite:
## 
## LG Wang, TTY Lam, S Xu, Z Dai, L Zhou, T Feng, P Guo, CW Dunn, BR Jones, T Bradley, H Zhu, Y Guan, Y Jiang, G Yu. treeio: an R package for phylogenetic tree input and output with richly annotated and associated data. Molecular Biology and Evolution 2020, 37(2):599-603. doi: 10.1093/molbev/msz240

## 
## Attaching package: 'treeio'

## The following object is masked from 'package:phytools':
## 
##     read.newick

## The following object is masked from 'package:ape':
## 
##     drop.tip

All are available from conda as well, so smth like the following environment .yaml file would work:

channels:
 - conda-forge
 - bioconda
dependencies:
 - r=4.1
 - r-dplyr
 - r-tidyr
 - r-svglite
 - r-ape=5.5
 - r-phangorn=2.8.0
 - r-phytools=0.7_90
 - r-castor=1.7.2
 - bioconductor-biostrings=2.62.0
 - bioconductor-ggtree=3.2.0
 - bioconductor-treeio=1.18.0

Reading trees

Different software generates trees in different text formats. For most of them there are read.* import functions. The most popular formats are as follows:

Format Example R functions Example software
newick ape::read.tree (phylo), treeio::read.astral (treedata), treeio::read.iqtree (treedata) very widespread, e.g. iqtree (.treefile), astral
nexus ape::read.nexus (phylo), treeio::read.beast (treedata) PAUP, BEAST
nhx treeio::read.nhx (treedata) Notung and other reconciliation software

The tree data are represented in R in the objects of three main R classes:

  • phylo which is a basic format and a common denominator across multiple packages;
  • treedata that is very useful when it comes to storing annotations (it includes a slot @phylo that stores the topology) and
  • tbl_tree which is essentially a tibble containing node data in tabular format.

phylo and treedata objects are what the read functions produce (as indicated). treedata is supported by ape and ggtree and is useful for plotting trees with annotations, tbl_tree is not a strict tree format and can be modified as any other tibble as long as no rows are added or removed. Inter-conversion is mostly straightforward: there are functions as.phylo, as.treedata, as_tibble, but notice that conversion to phylo is not lossless as it preserves only topology, branch lengths and node labels. Also, table operations on tbl_tree often lead to losing the correct class and obscure errors with as.treedata, so in case this happens enforce classes with class(c("tbl_tree", "tbl_df", "tbl", "data.frame")) (see below).

As an example, we’ll take an example RAxML tree of XXXXXXX:

tree_file <- system.file("extdata/RAxML", "RAxML_bipartitions.H3", package = "treeio")
big_tree <- read.tree(tree_file)

To make a cursory look at the tree will use the ggtree function:

ggtree(big_tree)

The tree is quite big, so let’s pick every 4th tip using ape::keep.tip:

tips_to_keep <- with(big_tree, tip.label[seq(to = length(tip.label), by = 4)])
tree <- keep.tip(big_tree, tips_to_keep)
ggtree(tree) + geom_tiplab()

This looks more manageable for us (of course in real life you remove tips only if you have good reasons).

Now let’s add some metadata to the tree using its tabular representation. Now, in most of the cases these annotations would probably come from a separate annotation file, but in this case the tip label are quite verbose, so we can them for our toy example – we’ll extract the year and the “type” (it’s a mock – just the first letter of the label):

tree_data <- as_tibble(tree) %>%
    extract(label, into = c("type", "year"), regex = "(.).+_([0-9]+)", remove = F, convert = T) %>%
    mutate(isInner = node %in% parent, isRoot = node == parent) %>%
    mutate(support = ifelse(isInner | isRoot, label, NA), support = as.numeric(support)) %>%
    `class<-`(c("tbl_tree", "tbl_df", "tbl", "data.frame")) %>%
    as.treedata
ggtree(tree_data) + geom_tiplab()

Tree rooting

To display trees in rooted layouts and for analyses that rely on rooted trees, trees must be rooted. In some cases, the trees are already rooted to begin with (e.g. obtained by phylogenetic placement with an already rooted reference tree). In general case though, common phylogenetic programs produce either explicitly unrooted trees or trees that should be treated as unrooted. In such cases, the trees should be (re-)rooted using one of the following strategies:

  • outgroup rooting – OTUs a priori known not to belong to the group of interest, i.e. the ingroup, the recommended strategy
  • midpoint rooting – this procedure places the root in the geometric middle of the tree, should be used with caution and mostly only for visualization
  • minimal ancestor deviation and minimum variance rooting
  • nonreversible models (see root_digger and rootstrap)
  • molecular clock (e.g. UPGMA in the extreme unrealistic case)

One important point about tree rooting concerns inner node and edge labels. Most of the formats and classes for storing phylogenetic trees code both, iternal node and edge attributes as inner nodes labels. This becomes relevant when an edge reverses its orientation due to (re-)rooting and is particularly important for support values that are edge attributes. It should be kept in mind that different rooting programs and functions eiter simply treat all inner node labels as node attributes and thus produce incorrect output most of the time or have parameters/arguments that them to treat the labels as either node or edge attributes.

Depending on the circumstances, it is sometimes desirable to root the trees before importing them in R, e.g. with nw_reroot from newick utilities – it has a switch for treating inner node as edge attributes). For MAD rooting there are mad and madRoot, although both have issues with edge labels. For nonreversible models there are root_digger and iqtree’s implementation of rootstrap, although note that iqtree will only assign rootstrap support values and not actually re-root the input tree.

Among R packages, the following functions can correctly treat edge labels:

  • ape::root(edgelabel = T)
  • phangorn::midpoint(node.labels = "support")

Some code for future:

scaleClades <- function(p, df) {
    with(df, Reduce(function(.p, .node) {
        offs <- offspring(.p$data, .node)
        scale <- 0.5 / (nrow(offs) - 1)
        scaleClade(.p, .node, scale)
    }, node, p))
}
collapseClades <- function(p, df) {
    with(df, Reduce(function(.p, .node) {
        fill <- unlist(colors[Host[node == .node]])
        .p$data[.p$data$node == .node, "label.show"] <- label.show[node == .node]
        collapse(.p, .node, "mixed", fill = fill)
    }, node, p))
}

multi_species <- allDescendants(tree_data@phylo) %>%
    lapply(function(x) filter(tree, node %in% x)) %>%
    bind_rows(.id = "ancestor") %>%
    group_by(ancestor) %>%
    filter(n_distinct(Collapse, na.rm = T) == 1, sum(!isInternal) > 1) %>%
    ungroup %>%
    mutate(ancestor = as.numeric(ancestor)) %>%
    filter(! ancestor %in% node) %>%
    filter(!is.na(Collapse)) %>%
    group_by(ancestor, Collapse) %>%
    summarize(num_tips = sum(!isInternal), Host = first(na.omit(Host))) %>%
    mutate(label.show = sprintf("%s (%d)", Collapse, num_tips)) %>%
    rename(node = ancestor)
p <- ggtree(tree_data) +
    geom_nodepoint(aes(x = branch, subset = !is.na(UFboot) & UFboot >= 90, size = UFboot)) +
    geom_tiplab(aes(label = label.show), size = 4, align = T, linesize = 0) +
    geom_text2(aes(subset = node %in% multi_species$node, x = max(x, na.rm = T), label = label.show), nudge_x = 0.01, size = 4, hjust = 0) +
    geom_tippoint(aes(color = Host), size = 3) +
    geom_treescale(width = 0.5) +
    scale_size_continuous(limits = c(90, 100), range = c(1, 3)) +
    scale_shape_manual(values = seq(0,15)) +
    scale_color_manual(values = colors)

p <- scaleClades(p, multi_species)
p <- collapseClades(p, multi_species)

About

License:Creative Commons Attribution 4.0 International