--- title: "Ancestral Sequence Reconstruction" author: - name: Klaus Schliep affiliation: Graz University of Technology email: klaus.schliep@gmail.com date: "`r Sys.Date()`" bibliography: phangorn.bib output: rmarkdown::html_vignette vignette: | %\VignetteIndexEntry{Ancestral Sequence Reconstruction} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, echo=FALSE} # set global chunk options: images will be bigger knitr::opts_chunk$set(fig.width=6, fig.height=4) #, global.par=TRUE options(digits = 4) suppressPackageStartupMessages(library(phangorn)) ``` # Introduction These notes describe the ancestral sequence reconstruction using the _phangorn_ package [@Schliep2011]. _phangorn_ provides several methods to estimate ancestral character states with either Maximum Parsimony (MP) or Maximum Likelihood (ML). For more background on all the methods see e.g. [@Felsenstein2004] or [@Yang2006]. # Parsimony reconstructions To reconstruct ancestral sequences we first load some data and reconstruct a tree: ```{r parsimony} library(phangorn) fdir <- system.file("extdata/trees", package = "phangorn") primates <- read.phyDat(file.path(fdir, "primates.dna"), format = "interleaved") tree <- pratchet(primates, trace=0) |> acctran(primates) |> makeNodeLabel() parsimony(tree, primates) ``` For parsimony analysis of the edge length represent the observed number of changes. Reconstructing ancestral states therefore defines also the edge lengths of a tree. However there can exist several equally parsimonious reconstructions or states can be ambiguous and therefore edge length can differ. In _phangorn_ all the ancestral reconstructions for parsimony used to be based on the fitch algorithm (below version 3.0) and needed bifurcating trees. However trees can get pruned afterwards using the function `multi2di` from _ape_. Recently we replaced the acctran routine with a method based on the sankoff algorithm adopting the algorithm for joint reconstruction [@Pupko2000] and breaking ties at random. This has the additional benefit that it allows us to infer phylogenies with multifurcations. "MPR" reconstructs the ancestral states for each (internal) node as if the tree would be rooted in that node. However the nodes are not independent of each other. If one chooses one state for a specific node, this can restrict the choice of neighboring nodes (figures 2 and 3). There is also an option "POSTORDER" which is only a one pass algorithm, which is useful for teaching purposes. The function acctran (accelerated transformation) assigns edge length and internal nodes to the tree [@Swofford1987]. ```{r ancestral_reconstruction} anc.pars <- anc_pars(tree, primates) ``` The `plotSeqLogo` function is a wrapper around the from the _ggseqlogo_ function in the _ggseqlogo_ package [@ggseqlogo] and provides a simple way to show proportions of a nucleotides of ancestral states (see figure 1). ```{r seqLogo, fig.cap="Fig 1. Ancestral reconstruction for a node.", eval=TRUE} plotSeqLogo(anc.pars, node=getRoot(tree), 1, 20) ``` ```{r MPR, fig.cap="Fig 2. Ancestral reconstruction using MPR."} plotAnc(anc.pars, 17) title("MPR") ``` # Likelihood reconstructions _phangorn_ also offers the possibility to estimate ancestral states using ML. The advantages of ML over parsimony is that the reconstruction accounts for different edge lengths. Currently a marginal construction (see [@Yang2006][@Koshi1996]) and the joint reconstruction [@Pupko2000] is implemented. Joint reconstructions is only for models without rate variation (e.g. gamma models) or invariant sites. ```{r fit_ML} fit <- pml(tree, primates) fit <- optim.pml(fit, model="F81", control = pml.control(trace=0)) ``` We can assign the ancestral states according to the highest likelihood ("ml"): \[ P(x_r = A) = \frac{L(x_r=A)}{\sum_{k \in \{A,C,G,T\}}L(x_r=k)} \] and the highest posterior probability ("bayes") criterion: \[ P(x_r=A) = \frac{\pi_A L(x_r=A)}{\sum_{k \in \{A,C,G,T\}}\pi_k L(x_r=k)}, \] where $L(x_r)$ is the joint probability of states at the tips and the state at the root $x_r$ and $\pi_i$ are the estimated base frequencies of state $i$. Both methods agree if all states (base frequencies) have equal probabilities. ```{r ML_reconstruction} anc.ml <- anc_pml(fit) ``` The differences of the two approaches for a specific site (17) are represented in the following figures. ```{r plotML, fig.cap="Fig 4. Ancestral reconstruction the using the maximum likelihood."} plotAnc(anc.ml, 17) title("ML") ``` ```{r plotB, fig.cap="Fig 5. Ancestral reconstruction using (empirical) Bayes."} #plotAnc(anc.bayes, 17) #title("Bayes") ``` # Fitting for discrete comparative data Often have already a phylogeny and only want estimate the ancestral reconstruction for this tree. This is a common problem in phylogentic comparative methods and we can use the function *ace* in the ape [@Paradis2018], *fitDiscrete* in the geiger [@Pennell2014] or *fitMK* in the phytools [@Revell2012] package. Here we want to show how to fit these models using *optim.pml*. First we load a tree and create some data. ```{r read_geospiza_data} data("bird.orders") x <- c(rep(0, 5), rep(1, 18)) x[c(20,22,23)] <- 2 x <- factor(x) names(x) <- bird.orders$tip.label dat <- phyDat(x, "USER", levels=c(0,1,2)) ``` We than set up the *pml* object and optimize the model. Instead of optimizing the edge length we only optimize the rate. ```{r ER_model} fit <- pml(bird.orders, dat) fit_ER <- optim.pml(fit, optEdge = FALSE, optRate=TRUE, control = pml.control(trace=0)) fit_ER ``` We can also fit the symmetric (*model="SYM"*) or ordered metristic model (*model="ORDERED"*). ```{r SYM_model} fit_SYM <- optim.pml(fit, optEdge = FALSE, optRate=TRUE, model="SYM", control = pml.control(trace=0)) fit_SYM ``` We can compare the estimate with the one from *ace* from *ape*. ```{r ace} fit_ace <- ace(x, bird.orders, model="SYM", type = "d") ``` The log-likelihood values differ slightly as in phangorn the values get multiplied by the state frequencies. Thus if we add `log(1/3)` as we have three states to ace estimate the two estimates are almost identical. ```{r comparison} fit_SYM$logLik fit_ace$loglik+log(1/3) all.equal(fit_SYM$logLik, fit_ace$loglik+log(1/3)) ``` ```{r SYM_reconstruction} anc_SYM <- anc_pml(fit_SYM) plotAnc(anc_SYM) ``` More complicated models can be applied using defining the rate matrix as shown in the vignette _Markov models and transition rate matrices_. The "ARD" model is currently not available as *phangorn* only fits reversible models. # Session info ```{r sessionInfo, echo=FALSE} sessionInfo() ``` # References