---
title: "Getting started: Exploring tree space"
author: "Martin R. Smith"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
bibliography: ../inst/REFERENCES.bib
csl: ../inst/apa-old-doi-prefix.csl
vignette: >
  %\VignetteIndexEntry{Getting started: Exploring tree space}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

Mapping the distribution of trees in "tree space" can reveal patterns in
the set of optimal trees, and allow additional details to be obtained
from results that may appear to lack resolution [@SmithSpace].

The 'TreeSearch' GUI allows rapid visualisation of tree spaces and the
construction of cluster consensus trees, to depict the underlying structure
inherent in a set of optimal trees.

More detailed tree space analysis can also be conducted using the
['TreeDist' package](
https://ms609.github.io/TreeDist/articles/treespace.html).


# Loading trees into the GUI

This example uses the @Sun2018 dataset included with the
TreeSearch package, which contains trees generated under four optimality
criteria.
The approach will work on search results generated by tree search within the
GUI, or any set of trees loaded using the "Load trees" function.

For rapid processing, large tree sets are down-sampled when loaded: here,
48 of the 125 trees are used for analysis.
This is useful for obtaining an initial overview of results; including more
trees gives a more complete picture but results will take quadratically longer
to generate.

If loading the output of a Bayesian analysis, you may wish to modify the range
of trees sampled to exclude trees generated during the burn-in phase.


![Loading trees for analysis](gui_trees_loaded.png)

Equivalent R code can be downloaded using the "Save plot: R script" button,
and is provided here for reference:

```{r load-trees}
# Load required libraries 
library("TreeTools", quietly = TRUE) 
library("TreeDist") 
library("TreeSearch") 
  

# Load Sun et al. 2018 trees from TreeSearch package
treeFile <- system.file("datasets/Sun2018.nex", package = "TreeSearch")

# Load trees from your own file with
# treeFile <- "Path/to/my.file"
# 
# To check and set working directory, use:
# getwd() # Should match location of data / tree files 
# setwd(".") # Replace . with desired/directory to change 

# Read all trees from file
trees <- read.nexus(treeFile) 

# Down-sample to a maximum of 48 trees
trees <- trees[unique(as.integer(seq.int(1, length(trees), length.out = 48)))] 
```

# Exploring the consensus

The GUI will initially display the strict consensus of the input trees, after
gaining as much information as possible by removing any "rogue" taxa
whose position is highly variable [@SmithCons].

If trees were generated by a probabilistic sampling process (e.g. Markov Chain)
then a majority rule consensus (figure: Majority = 0.5)
may reveal additional resolution,
but note that the frequency of splits in trees found by parsimony search does
not correspond to split support -- only the strict consensus tree has any
direct interpretation in this setting.


![A majority rule consensus has no interpretation with this dataset, but its improved resolution does hint that the strict consensus hides some underlying structure](gui_consensus.png)

Equivalent R code can be downloaded using the "Save plot: R script" button
within the GUI, and is provided here for reference:

```{r exploring-consensus, fig.keep = "none"}
# Select an appropriate majority value
# When analysing parsimony results, this value should be 1.
majority <- 0.5

# Identify rogue taxa
exclude <- Rogue::QuickRogue(trees, p = majority)$taxon[-1]
exclude

# Select a rogue whose positions should be depicted
plottedRogue <- exclude[1]

# Remove other excluded taxa from tree
consTrees <- lapply(trees, DropTip, setdiff(exclude, plottedRogue))

# Colour tip labels according to their original 'instability' (Smith 2022a)
tipCols <- Rogue::ColByStability(trees)

# Our plotted rogue will not appear on the tree
tipCols <- tipCols[setdiff(consTrees[[1]]$tip.label, plottedRogue)] 


# Set up plotting area 
par(
 mar = c(0, 0, 0, 0), # Zero margins 
 cex = 0.8            # Smaller font size 
)

# Plot the reduced consensus tree, showing position of our plotted rogue 
plotted <- RoguePlot( 
 trees = consTrees, 
 tip = plottedRogue,
 p = majority, 
 edgeLength = 1, 
 tip.color = tipCols 
) 

# Calculate split concordance
concordance <- SplitFrequency(plotted$cons, trees) / length(trees)

# Annotate splits by concordance 
LabelSplits(
 tree = plotted$cons,
 labels = signif(concordance, 3),
 col = SupportColor(concordance),
 frame = "none",
 pos = 3
) 
```

# The structure of tree space

The "Tree space" display option plots trees as points, and attempts to map the
distribution of points such that the distance between mapped points corresponds
to the tree distance between each pair of trees.

The default method (Clustering Information distance) has many advantages over
the widely-used Robinson--Foulds distance [@Smith2020; @SmithSpace];
the quartet distance, whilst slower to calculate, can also produce good
low-dimensional mappings of tree space.

The number of dimensions of space can be reduced to make mappings easier to
interpret, but do check that the mapping quality gauge remains at least "good",
or perceived structure may be an artefact of distortion.

![Tree space mapping reveals three tree clusters](gui_sun_space.png)

In this dataset, trees are assigned to three distinct clusters, and display
"good" structure: clusters are marked only when their silhouette coefficient
is greater than the threshold, whose interpretation is marked above the slider
(here, 0.68 is "good").
As it happens, the clusters largely correspond to the tree reconstruction
method used:
@Fitch1971 parsimony (TNT_EW) trees form one distinct cluster;
trees under "BGS" inapplicable-corrected parsimony [@Brazeau2019] another;
and Bayesian trees a final, more diffuse cluster.

## Tree space analysis in R


More sophisticated tree space analysis can be conducted using the
['TreeDist' package](
https://ms609.github.io/TreeDist/articles/treespace.html).

Sample R code can be downloaded using the "Save plot: R script" button
within the GUI, and is provided here for reference:

## Generate distances

Different tree distances reflect different aspects of tree similarity, which
may or may not be relevant to phylogenetic questions [@Smith2020; @SmithSpace].

```{r tree-distances, fig.keep = "none"}
# Compute distances between pairs of trees
dists <- TreeDist::ClusteringInfoDistance(trees)
```

```{r quartet-distance, eval = FALSE}
# The quartet distance is a slower alternative:
library("Quartet")
dists <- as.dist(Quartet::QuartetDivergence( 
   Quartet::ManyToManyQuartetAgreement(trees), similarity = FALSE)
)
```

## Identify clustering structure

Here we compute clusters of trees using a selection of clustering methods
[@Hartigan1979; @Maechler2019; @Bien2011],
allowing the optimal clustering to be highlighted on the plot.

```{r compute-clusters, fig.keep = "none"}
# Try K-means clustering (Hartigan & Wong 1979):
kClusters <- lapply(
  2:15, # Minimum 2 clusters; maximum 15
  function (k) kmeans(dists, k)
)
kSils <- vapply(kClusters, function (kCluster) { 
 mean(cluster::silhouette(kCluster$cluster, dists)[, 3]) 
}, double(1)) 
bestK <- which.max(kSils) 
kSil <- kSils[bestK] # Best silhouette coefficient 
kCluster <- kClusters[[bestK]]$cluster # Best solution 

# Try partitioning around medoids (Maechler et al. 2019):
pamClusters <- lapply(2:15, function (k) cluster::pam(dists, k = k)) 
pamSils <- vapply(pamClusters, function (pamCluster) { 
 mean(cluster::silhouette(pamCluster)[, 3]) 
}, double(1)) 
bestPam <- which.max(pamSils) 
pamSil <- pamSils[bestPam] # Best silhouette coefficient 
pamCluster <- pamClusters[[bestPam]]$cluster # Best solution 

# Try hierarchical clustering with minimax linkage (Bien & Tibshirani 2011):
hTree <- protoclust::protoclust(dists) 
hClusters <- lapply(2:15, function (k) cutree(hTree, k = k)) 
hSils <- vapply(hClusters, function (hCluster) { 
 mean(cluster::silhouette(hCluster, dists)[, 3]) 
}, double(1)) 
bestH <- which.max(hSils) 
hSil <- hSils[bestH] # Best silhouette coefficient 
hCluster <- hClusters[[bestH]] # Best solution

# Set threshold for recognizing meaningful clustering 
# no support < 0.25 < weak < 0.5 < good < 0.7 < strong 
threshold <- 0.5

# Compare silhouette coefficients of each method 
bestMethodId <- which.max(c(threshold, pamSil, hSil, kSil)) 
bestCluster <- c("none", "pam", "hmm", "kmn")[bestMethodId] 

# Best clustering:
c("Structure below threshold",
  "Partitioning around medoids",
  "Hierarchical-minimax",
  "K-means")[bestMethodId]

# Best silhouette coefficient:
max(c(threshold, pamSil, hSil, kSil))

# Store the cluster to which each tree is optimally assigned: 
clustering <- switch(bestCluster, pam = pamCluster, hmm = hCluster, kmn = kCluster, 1) 
nClusters <- length(unique(clustering))
```

## Mapping tree space

Now we can map tree space into a few dimensions and visualise the distribution
of trees:

```{r map-space, fig.keep = "none"}
nDim <- 3 # Number of dimensions to plot

# Generate first dimensions of tree space using PCoA 
map <- cmdscale(dists, k = nDim) 

# Prepare plot layout 
nPanels <- nDim * (nDim - 1L) / 2L # Lower-left triangle 
plotSeq <- matrix(0, nDim, nDim)
plotSeq[upper.tri(plotSeq)] <- seq_len(nPanels)
plotSeq[nDim - 1, 2] <- max(plotSeq) + 1L
layout(t(plotSeq[-nDim, -1]))

# Set plot margins 
par(mar = rep(0.2, 4))

# Set up tree plotting symbols
treePch <- 0 # Square
clusterCol <- palette.colors(nClusters, "Alphabet", alpha = 0.8)
treeCols <- clusterCol[clustering]
for (i in 2:nDim) for (j in seq_len(i - 1)) { 
  
  # Set up blank plot 
  plot( 
    x = map[, j], 
    y = map[, i], 
    ann = FALSE,        # No annotations 
    axes = FALSE,       # No axes 
    frame.plot = TRUE,  # Border around plot 
    type = "n",         # Don't plot any points yet 
    asp = 1,            # Fix aspect ratio to avoid distortion 
    xlim = range(map),  # Constant X range for all dimensions 
    ylim = range(map)   # Constant Y range for all dimensions 
  ) 
   
  # Plot minimum spanning tree (Gower 1969) to depict distortion (Smith 2022)
  mst <- MSTEdges(as.matrix(dists)) 
  segments( 
    x0 = map[mst[, 1], j], 
    y0 = map[mst[, 1], i], 
    x1 = map[mst[, 2], j], 
    y1 = map[mst[, 2], i], 
    col = "#bbbbbb", # Light grey 
    lty = 1          # Solid lines 
  ) 
   
  # Add points
  points( 
    x = map[, j], 
    y = map[, i], 
    pch = treePch, 
    col = treeCols, 
    cex = 1.7, # Point size 
    lwd = 2 # Line width 
  ) 
   
  # Mark clusters 
  for (clI in seq_len(nClusters)) { 
    inCluster <- clustering == clI 
    clusterX <- map[inCluster, j] 
    clusterY <- map[inCluster, i] 
    hull <- chull(clusterX, clusterY) 
    polygon( 
      x = clusterX[hull], 
      y = clusterY[hull], 
      lty = 1, # Solid line style 
      lwd = 2, # Wider line width 
      border = clusterCol[clI] 
    ) 
  } 
}
```
# Using tree clustering to understand tree space structure

Plotting the consensus of trees within a specific cluster can give
more insight into the underlying relationships within a tree set than viewing 
a single strict consensus, particularly where the data do not decisively 
distinguish between a small number of well-defined but conflicting phylogenetic
hypotheses [@Stockham2002].

Visualising the consensus of each cluster can reveal this "hidden" agreement
between trees and allow more detailed and thoughtful analysis of the meaning
of the phylogenetic output.

As the limited screen space available can render large trees difficult to read,
it may be helpful to output the analysis using the "Save plot" options,
either as an R script to allow further analysis within R, or as a PDF for
more comfortable on-screen viewing.


![Consensus of tree clusters](gui_cluster_cons.png)

The R code below is modified from that output by the GUI, in order to conduct
rogue taxon analysis for each cluster separately.
Note that it draws on the results of the clustering analysis above.

```{r cluster-consensus, fig.keep = "none"}
# Select proportion for majority rule consensus trees
# 1 is the appropriate value when parsimony trees are included
majority <- 1

# Plot each consensus tree in turn: 
for (i in seq_len(nClusters)) {
  clusterTrees <- trees[clustering == i]
  # Identify rogue taxa for this cluster
  clusterRogues <- Rogue::QuickRogue(clusterTrees, p = majority)$taxon[-1]
  
  # Colour tree labels based on stability across cluster
  tipCols <- Rogue::ColByStability(clusterTrees)
  
  cons <- ConsensusWithout( 
    trees = clusterTrees, 
    tip = clusterRogues,
    p = majority
  ) 
  
  # Root tree 
  cons <- RootTree(cons, trees[[1]]$tip.label[1])
  
  # Set unit edge lengths for viewing
  cons$edge.length <- rep.int(1, nrow(cons$edge)) 
  
  # Rotate nodes, to display clades in order of size 
  cons <- SortTree(cons, order = names(trees[[1]]$tip.label))
  
  # Plot tree
  par(mar = c(0, 0, 0, 0), cex = 0.8) # Set up plotting area
  plot( 
    cons, 
    edge.width = 2,             # Widen lines 
    font = 3,                   # Italicize labels 
    cex = 0.83,                 # Shrink tip font size 
    edge.color = clusterCol[i], # Colour tree according to previous plot
    tip.color = tipCols[cons$tip.label] 
  ) 
  legend(
    "bottomright", 
    paste("Cluster", i), 
    pch = 15,            # Filled circle icon 
    pt.cex = 1.5,        # Increase icon size 
    col = clusterCol[i], 
    bty = "n"            # Don't plot legend in box 
  ) 
  if (length(clusterRogues)) {
    # Mark omitted taxa, if any rogues present
    legend(
      "topright",
      clusterRogues,
      col = tipCols[clusterRogues],
      text.col = tipCols[clusterRogues],
      pch = "-",      # Mark as excluded
      cex = 0.8,      # Smaller font
      text.font = 3,  # Italicize
      bty = "n"       # Don't plot legend in box 
    )
  }
}
```

## Where next?

- [Documentation home](https://ms609.github.io/TreeSearch/)

- [Guide to installation](getting-started.html)

- Search for trees using
  - [standard parsimony](tree-search.html) (corrected for inapplicable data)
  - [profile parsimony](profile.html)
  - [custom optimality criteria](custom.Rmd)

# References