--- title: "Testing RRphylo methods overfit" author: "Silvia Castiglione, Carmela Serio, Pasquale Raia" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{overfitRR} %\VignetteEngine{knitr::rmarkdown_notangle} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} if (!requireNamespace("rmarkdown", quietly = TRUE) || !rmarkdown::pandoc_available()) { warning(call. = FALSE, "Pandoc not found, the vignettes is not built") knitr::knit_exit() } knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup,echo=FALSE, message = FALSE} require(RRphylo) options(rmarkdown.html_vignette.check_title = FALSE) ``` ## Index 1. [overfitRR basics](#basics) 2. [Results](#results) a. [search.shift results](#ss) b. [search.trend results](#st) c. [search.conv results](#sc) d. [PGLS_fossil results](#pgls) 3. [Guided examples](#examples) ## overfitRR basics {#basics} Methods using a large number of parameters risk being overfit. This usually translates in poor fitting with data and trees other than the those originally used. With RRphylo methods this risk is usually very low. However, the user can assess how robust the results got by applying [`search.shift`](search.shift.html), [`search.trend`](search.trend.html), [`search.conv`](search.conv.html), or `PGLS_fossil` are by running `overfitRR`. The basic idea of `overfitRR` is using alternative tree topologies to test for both phylogenetic and sampling uncertainty at the same time. Such alternative phylogenies can be provided by the user as a `multiPhylo` object, otherwise are automatically generated within `overfitRR`. In this latter case, the original tree and data are subsampled by specifying a `s` parameter, that is the proportion of tips to be removed from the tree, and species positions are shuffled by using the function [`swapONE`](swapONE.html). `overfitRR` always takes an object generated by [`RRphylo`](RRphylo.html) and all the data used to produce it (besides necessary phenotypic data, any other argument such as covariate, predictor, and so on, passed to `RRphylo`). If no `phylo.list` is available, the arguments `s` and `swap.args` can be used to set the intensity of subsampling and phylogenetic alterations to be applied. Depending on which tool is under testing, the user supplies to the function one or more among `trend.args`, `shift.args`, `conv.args`, and `pgls.args` each of them being a list of arguments specific to the namesake function (see the [examples](#examples) below). ## Results {#results} The output of `overfitRR` is a `RRphyloList` object whose elements depend on the case under testing (see below). In some cases, removing as many tips as imposed by `s` would delete too many tips right in clades and/or states under testing. In these cases, the function maintains no less than 5 species at least in each clade/state under testing (or all species if there is less), reducing the sampling parameter `s` if necessary. Thus, the first element of the output list (`$mean.sampling`) is the mean proportion of species actually removed over the iterations. In any case, the function returns a `multiPhylo` and a `RRphyloList` object including the modified phylogenies (`$tree.list`) and the outputs of `RRphylo` performed on them (`$RR.list`), respectively. Both objects are treated as regular lists. `overfitRR` also derives the 95% confidence interval around the original phenotypic value estimated at the tree root (`$rootCI`) and the regression parameters describing the relation between the original values at internal nodes and the corresponding figure after subsampling and swapping (`$ace.regressions`). A regression slope close to one indicates a better matching between original and subsampled values, suggesting the estimation is robust to phylogenetic uncertainty and subsampling. ### search.shift results {#ss} When the robustness of `search.shift` is tested, the function returns separate results for [`clade`](search.shift.html#clade) and [`sparse`](search.shift.html#sparse) conditions (`$shift.results`). The first (**clade**) includes the proportion of simulations producing significant and positive (**p.shift+**) or significant and negative (**p.shift-**) rate shifts for each single node, and for all the clades taken as a whole (see [*Testing rate shifts pertaining to entire clades*](search.shift.html#clade) for further details). Under the `sparse` condition (**sparse**), the same figures as before are reported for each state category compared to the rest of the tree and for all possible pair of categories (see [*Testing rate shifts pertaining to phylogenetically unrelated species*](search.shift.html#sparse) for further details). ### search.trend results {#st} When testing for `search.trend` robustness, `overfitRR` returns results for both the entire tree and specific clades if indicated (`$trend.results`). Results for the entire tree (**tree**) summarize the proportion of simulations producing positive (**slope+**) or negative (**slope-**) slopes significantly higher (**p.up**) or lower (**p.down**) than BM simulations for either phenotypes or rescaled rates versus time regressions. Such evaluations is based on **p.random** only (see [*Temporal trends on the entire tree*](search.trend.html#tree),for further details). When specific clades are under testing, the same set of results as for the whole tree is returned for each node (**node**). In this case, for phenotype versus age regression through nodes, the proportion of significant and positive/negative slopes (**slope+p.up**,**slope+p.down**,**slope-p.up**,**slope-p.down**) is accompanied by the same figures for the estimated marginal mean differences (**p.emm+** and **p.emm-**). As for the temporal trend in absolute rates through node, the proportion of significant and positive/negative estimated marginal means differences (**p.emm+** and **p.emm-**) and the same figure for slope difference (**p.slope+** and **p.slope-**) are reported (see [*Temporal trends at clade level*](search.trend.html#nodes)). Finally when more than one node is tested, the `$trend.results` object also includes results for the pairwise comparison between nodes. ### search.conv results {#sc} Results for robustness of `search.conv` (`$conv.results`) include separate objects for convergence between [`clades`](search.conv.html#nodes) or between/within [`states`](search.conv.html#state). Under the first case (**clade**), the proportion of simulations producing significant instance of convergence (**p.ang.bydist**) or convergence and parallelism (**p.ang.conv**) between selected clades are returned (see [*Morphological convergence between clades*](search.conv.html#nodes) for further details). As for convergence between/within discrete categories (**state**), `overfitRR` reports the proportion of simulations producing significant instance of convergence either accounting (**p.ang.state.time**) or not accounting (**p.ang.state**) for the time intervening between the tips in the focal state [*Morphological convergence within/between categories*](search.conv.html#state) for explanations). ### PGLS_fossil results {#pgls} Results for robustness of `PGLS_fossil` (`$pgls.results`) include separate objects for the pgls performed on the original tree (`$tree`) or on the tree rescaled according to `RRphylo` rates (i.e. tree branches rescaled to the absolute branch-wise rate values while keeping the total evolutionary time constant; `$RR`). ## Guided examples {#examples} ```{r eval=FALSE} library(ape) # load the RRphylo example dataset including Ornithodirans tree and data DataOrnithodirans$treedino->treedino # phylogenetic tree DataOrnithodirans$massdino->massdino # body mass data DataOrnithodirans$statedino->statedino # locomotory type data ### Testing search.shift # perform RRphylo Ornithodirans tree and data RRphylo(tree=treedino,y=massdino)->dinoRates # perform search.shift under both "clade" and "sparse" condition search.shift(RR=dinoRates, status.type= "clade")->SSnode search.shift(RR=dinoRates, status.type= "sparse", state=statedino)->SSstate # test the robustness of search.shift results overfitRR(RR=dinoRates,y=massdino,swap.args =list(si=0.2,si2=0.2), shift.args = list(node=rownames(SSnode$single.clades),state=statedino), nsim=10) ### Testing search.trend # Extract Pterosaurs tree and data extract.clade(treedino,748)->treeptero # phylogenetic tree massdino[match(treeptero$tip.label,names(massdino))]->massptero # body mass data massptero[match(treeptero$tip.label,names(massptero))]->massptero # perform RRphylo and search.trend on Pterosaurs tree and data # by specifying a clade to be tested RRphylo(tree=treeptero,y=log(massptero))->RRptero search.trend(RR=RRptero, y=log(massptero),node=143,cov=NULL,ConfInt=FALSE)->STnode # test the robustness of search.trend results overfitRR(RR=RRptero,y=log(massptero),trend.args = list(node=143),nsim=10) ### Applying overfitRR on multiple RRphylo # load the RRphylo example dataset including Cetaceans tree and data data("DataCetaceans") DataCetaceans$treecet->treecet # phylogenetic tree DataCetaceans$masscet->masscet # logged body mass data DataCetaceans$brainmasscet->brainmasscet # logged brain mass data DataCetaceans$aceMyst->aceMyst # known phenotypic value for the most recent # common ancestor of Mysticeti # cross-reference the phylogenetic tree and body and brain mass data. Remove from # both the tree and vector of body sizes the species whose brain size is missing drop.tip(treecet,treecet$tip.label[-match(names(brainmasscet), treecet$tip.label)])->treecet1 masscet[match(treecet1$tip.label,names(masscet))]->masscet1 # peform RRphylo on the variable (body mass) to be used as additional predictor RRphylo(tree=treecet1,y=masscet1)->RRmass RRmass$aces[,1]->acemass1 # create the predictor vector: retrieve the ancestral character estimates # of body size at internal nodes from the RR object ($aces) and collate them # to the vector of species' body sizes to create c(acemass1,masscet1)->x1.mass # peform RRphylo and search.trend on the brain mass # by using the body mass as additional predictor RRphylo(tree=treecet1,y=brainmasscet,x1=x1.mass)->RRmulti search.trend(RR=RRmulti, y=brainmasscet,x1=x1.mass)->STcet # test the robustness of search.trend results overfitRR(RR=RRmulti,y=brainmasscet,trend.args = list(),x1=x1.mass,nsim=10) ### Testing PGLS_fossil # peform RRphylo on cetaceans brain mass RRphylo(tree=treecet1,y=brainmasscet)->RRbrain # perform PGLS_fossil by using the original tree PGLS_fossil(y~x,data=list(y=brainmasscet,x=masscet1),tree=treecet1)->pgls_noRR # perform PGLS_fossil rescaling the tree according to RRphylo rates PGLS_fossil(y~x,data=list(y=brainmasscet,x=masscet1),tree=RRbrain$tree,RR=RRbrain)->pgls_RR # test the robustness of PGLS_fossil results overfitRR(RR=RRbrain,y=brainmasscet, pgls.args=list(modform=y~x,data=list(y=brainmasscet,x=masscet1),tree=TRUE,RR=TRUE), nsim=10) ### Testing search.conv # load the RRphylo example dataset including Felids tree and data data("DataFelids") DataFelids$PCscoresfel->PCscoresfel # mandible shape data DataFelids$treefel->treefel # phylogenetic tree DataFelids$statefel->statefel # conical-toothed or saber-toothed condition # perform RRphylo on Felids tree and data RRphylo(tree=treefel,y=PCscoresfel)->RRfel # search for morphologicl convergence between clades (automatic mode) # and within the category search.conv(RR=RRfel, y=PCscoresfel, min.dim=5, min.dist="node9"))->SC.clade as.numeric(c(rownames(SC.clade[[1]])[1],as.numeric(as.character(SC.clade[[1]][1,1]))))->conv.nodes search.conv(tree=treefel, y=PCscoresfel, state=statefel)->SC.state # test the robustness of seach.conv results overfitRR(RR=RRfel, y=PCscoresfel,conv.args= list(node=conv.nodes,state=statefel,declust=TRUE),nsim=10) ```