Title: | Statistical Inference and Prediction of Annotations in Phylogenetic Trees |
---|---|
Description: | Implements a parsimonious evolutionary model to analyze and predict gene-functional annotations in phylogenetic trees as described in Vega Yon et al. (2021) <doi:10.1371/journal.pcbi.1007948>. Focusing on computational efficiency, 'aphylo' makes it possible to estimate pooled phylogenetic models, including thousands (hundreds) of annotations (trees) in the same run. The package also provides the tools for visualization of annotated phylogenies, calculation of posterior probabilities (prediction) and goodness-of-fit assessment featured in Vega Yon et al. (2021). |
Authors: | George Vega Yon [aut, cre] , National Cancer Institute (NCI) [fnd] (Grant Number 5P01CA196569-02), USC Biostatistics [cph] |
Maintainer: | George Vega Yon <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.3-4 |
Built: | 2024-11-05 04:43:21 UTC |
Source: | https://github.com/USCbiostats/aphylo |
Statistical Inference in Annotated Phylogenetic Trees
Uses SIFTER's 2011 definition of accuracy, where a protein is tagged as accurately predicted if the highest ranked prediction matches it.
accuracy_sifter(pred, lab, tol = 1e-10, highlight = "", ...) ## S3 method for class 'aphylo_estimates' accuracy_sifter(pred, lab, tol = 1e-10, highlight = "", ...) ## Default S3 method: accuracy_sifter(pred, lab, tol = 1e-10, highlight = "", nine_na = TRUE, ...)
accuracy_sifter(pred, lab, tol = 1e-10, highlight = "", ...) ## S3 method for class 'aphylo_estimates' accuracy_sifter(pred, lab, tol = 1e-10, highlight = "", ...) ## Default S3 method: accuracy_sifter(pred, lab, tol = 1e-10, highlight = "", nine_na = TRUE, ...)
pred |
A matrix of predictions, or an aphylo_estimates object. |
lab |
A matrix of labels (0,1,NA, or 9 if |
tol |
Numeric scalar. Predictions within |
highlight |
Pattern passed to sprintf used to highlight predicted functions that match the observed. |
... |
Further arguments passed to the method. In the case of |
nine_na |
Treat 9 as NA. |
The analysis is done at the protein level. For each protein, the function
compares the YES annotations of that proteins with the predicted by the model.
The algorithm selects the predicted annotations as those that are within
tol
of the maximum score.
This algorithm doesn't take into account NOT annotations (0s), which are excluded from the analysis.
When highlight = ""
, no highlight is done.
A data frame with Ntip()
rows and four variables. The variables are:
Gene: Label of the gene
Predicted: The assigned gene function.
Observed: The true set of gene functions.
Accuracy: The measurement of accuracy according to Engelhardt et al. (2011).
set.seed(81231) atree <- raphylo(50, psi = c(0,0), P = 3) ans <- aphylo_mcmc(atree ~ mu_d + mu_s + Pi) accuracy_sifter(ans)
set.seed(81231) atree <- raphylo(50, psi = c(0,0), P = 3) ans <- aphylo_mcmc(atree ~ mu_d + mu_s + Pi) accuracy_sifter(ans)
The generics ape::Nedge()
, ape::Nnode()
, and ape::Ntip()
can be used
directly on objects of class aphylo, aphylo_estimates, multiAphylo
Integer with the number of edges, nodes, or tips accordignly.
Other information:
aphylo-info
set.seed(12312) atree <- raphylo(50, P = 2) Nnode(atree) Ntip(atree) Nedge(atree) multitree <- rmultiAphylo(10, 50, P = 2) Nnode(multitree) Ntip(multitree) Nedge(multitree)
set.seed(12312) atree <- raphylo(50, P = 2) Nnode(atree) Ntip(atree) Nedge(atree) multitree <- rmultiAphylo(10, 50, P = 2) Nnode(multitree) Ntip(multitree) Nedge(multitree)
This implements Leave-one-out cross-validation (LOO-CV) for trees of class aphylo and multiAphylo.
aphylo_cv(...) ## S3 method for class 'formula' aphylo_cv(model, ...)
aphylo_cv(...) ## S3 method for class 'formula' aphylo_cv(model, ...)
... |
Further arguments passed to the method. |
model |
As passed to aphylo_mcmc. |
For each observation in the dataset (either a single gene if of
class aphylo, or an entire tree if of class multiAphylo), we restimate
the model removing the observation and use the parameter estimates to make
a prediction on it. The prediction is done using the function predict.aphylo_estimates
with argument loo = TRUE
.
An object of class aphylo_cv
with the following components:
pred_out
Out of sample prediction.
expected
Expected annotations
call
The call
ids
Integer vector with the ids of the leafs used in the loo process.
# It takes about two minutes to run this example set.seed(123) atrees <- rmultiAphylo(10, 10, P = 1) cv_multi <- aphylo_cv(atrees ~ mu_d + mu_s + Pi) cv_single <- aphylo_cv(atrees[[1]] ~ mu_d + mu_s + Pi)
# It takes about two minutes to run this example set.seed(123) atrees <- rmultiAphylo(10, 10, P = 1) cv_multi <- aphylo_cv(atrees ~ mu_d + mu_s + Pi) cv_single <- aphylo_cv(atrees[[1]] ~ mu_d + mu_s + Pi)
The function is a wrapper of fmcmc::MCMC()
.
APHYLO_DEFAULT_MCMC_CONTROL aphylo_mcmc( model, params, priors = uprior(), control = list(), check_informative = getOption("aphylo_informative", FALSE), reduced_pseq = getOption("aphylo_reduce_pseq", TRUE) ) APHYLO_PARAM_DEFAULT
APHYLO_DEFAULT_MCMC_CONTROL aphylo_mcmc( model, params, priors = uprior(), control = list(), check_informative = getOption("aphylo_informative", FALSE), reduced_pseq = getOption("aphylo_reduce_pseq", TRUE) ) APHYLO_PARAM_DEFAULT
model |
A model as specified in aphylo-model. |
params |
A vector of length 7 with initial parameters. In particular
|
priors |
A function to be used as prior for the model (see bprior). |
control |
A list with parameters for the optimization method (see details). |
check_informative |
Logical scalar. When |
reduced_pseq |
Logical. When |
An object of class list
of length 6.
An object of class numeric
of length 9.
APHYLO_DEFAULT_MCMC_CONTROL
lists the default values for the MCMC
estimation:
nsteps
: 1e4L
burnin
: 5e3L
thin
: 10L
nchains
: 2L
multicore
: FALSE
conv_checker
: fmcmc::convergence_auto(5e3)
For more information about the MCMC estimation process, see fmcmc::MCMC()
.
Methods base::print()
, base::summary()
, stats::coef, stats::window()
,
stats::vcov()
, stats::logLik()
, predict(),
and the various ways to query features of the trees via Ntip()
are available post estimation.
The vector APHYLO_PARAM_DEFAULT
lists the starting values for the parameters
in the model. The current defaults are:
psi0
: 0.10
psi1
: 0.05
mu_d0
: 0.90
mu_d1
: 0.50
mu_s0
: 0.10
mu_s1
: 0.05
eta0
: 1.00
eta1
: 1.00
Pi
: 0.50
An object of class aphylo_estimates.
Other parameter estimation:
aphylo_mle()
# Using the MCMC ------------------------------------------------------------ set.seed(1233) # Simulating a tree tree <- sim_tree(200) # Simulating functions atree <- raphylo( tree = tree, psi = c(.01, .03), mu_d = c(.05, .02), Pi = .5 ) # Running the MCMC set.seed(1231) ans_mcmc <- aphylo_mcmc( atree ~ mu_d + psi + eta + Pi, control = list(nsteps = 2e5, burnin=1000, thin=200) )
# Using the MCMC ------------------------------------------------------------ set.seed(1233) # Simulating a tree tree <- sim_tree(200) # Simulating functions atree <- raphylo( tree = tree, psi = c(.01, .03), mu_d = c(.05, .02), Pi = .5 ) # Running the MCMC set.seed(1231) ans_mcmc <- aphylo_mcmc( atree ~ mu_d + psi + eta + Pi, control = list(nsteps = 2e5, burnin=1000, thin=200) )
aphylo_estimates
The model fitting of annotated phylogenetic trees can be done using either
MLE via aphylo_mle()
or MCMC via aphylo_mcmc()
. This section describes
the object of class aphylo_estimates
that these functions generate and
the post estimation methods/functions that can be used.
## S3 method for class 'aphylo_estimates' print(x, ...) ## S3 method for class 'aphylo_estimates' coef(object, ...) ## S3 method for class 'aphylo_estimates' vcov(object, ...) ## S3 method for class 'aphylo_estimates' plot( x, y = NULL, which.tree = 1L, ids = list(1:Ntip(x)[which.tree]), loo = TRUE, nsamples = 1L, ncores = 1L, centiles = c(0.025, 0.5, 0.975), cl = NULL, ... )
## S3 method for class 'aphylo_estimates' print(x, ...) ## S3 method for class 'aphylo_estimates' coef(object, ...) ## S3 method for class 'aphylo_estimates' vcov(object, ...) ## S3 method for class 'aphylo_estimates' plot( x, y = NULL, which.tree = 1L, ids = list(1:Ntip(x)[which.tree]), loo = TRUE, nsamples = 1L, ncores = 1L, centiles = c(0.025, 0.5, 0.975), cl = NULL, ... )
x , object
|
Depending of the method, an object of class |
... |
Further arguments passed to the corresponding method. |
y |
Ignored. |
which.tree |
Integer scalar. Which tree to plot. |
ids , nsamples , ncores , centiles , cl
|
passed to |
loo |
Logical scalar. When |
The plot method for the object of class aphylo_estimates
plots
the original tree with the predicted annotations.
Objects of class aphylo_estimates
are a list withh the following elements:
par |
A numeric vector of length 5 with the solution. |
hist |
A numeric matrix of size |
ll |
A numeric scalar with the value of |
counts |
Integer scalar number of steps/batch performed. |
convergence |
Integer scalar. Equal to 0 if |
message |
Character scalar. See |
fun |
A function (the objective function). |
priors |
If specified, the function |
dat |
The data |
par0 |
A numeric vector of length 5 with the initial parameters. |
method |
Character scalar with the name of the method used. |
varcovar |
A matrix of size 5*5. The estimated covariance matrix. |
The plot method for aphylo_estimates
returns the selected tree
(which.tree
) with predicted annotations, also of class aphylo.
set.seed(7881) atree <- raphylo(40, P = 2) res <- aphylo_mcmc(atree ~ mu_d + mu_s + Pi) print(res) coef(res) vcov(res) plot(res)
set.seed(7881) atree <- raphylo(40, P = 2) res <- aphylo_mcmc(atree ~ mu_d + mu_s + Pi) print(res) coef(res) vcov(res) plot(res)
aphylo
object with partial annotationsCreate an aphylo
object with partial annotations
aphylo_from_data_frame(tree, annotations, types = NULL)
aphylo_from_data_frame(tree, annotations, types = NULL)
tree |
An object of class |
annotations |
A data.frame with annotations. The first column should be the gene id (see details). |
types |
A data.frame with types. Just like the annotations, the first column should be the gene id. |
Each row in the the annotations
data frame passed to this function
must have a unique row per gene, and one column per function (GO term). The id
of each gene must match the labels in the tree
object. Missing genes are
annotated with NA
(9).
In the case of types
, while tips can also be annotated with a type, which
should be either 0, duplication, or 1, speciation, only internal nodes
are required. Tip types are ignored.
An object of class aphylo.
Other Data management functions:
aphylo-class
# Generating a test dataset set.seed(1371) x <- raphylo(20) # Extracting the tree and annotations tree <- x$tree anno <- with(x, rbind(tip.annotation, node.annotation)) anno <- data.frame(id = with(tree, c(tip.label, node.label)), anno) types <- data.frame(id = tree$node.label, x$node.type) # Creating a aphylo tree without node types aphylo_from_data_frame(tree, anno) # Now including types aphylo_from_data_frame(tree, anno, types) # Dropping some data aphylo_from_data_frame(tree, anno[sample.int(nrow(anno), 10),])
# Generating a test dataset set.seed(1371) x <- raphylo(20) # Extracting the tree and annotations tree <- x$tree anno <- with(x, rbind(tip.annotation, node.annotation)) anno <- data.frame(id = with(tree, c(tip.label, node.label)), anno) types <- data.frame(id = tree$node.label, x$node.type) # Creating a aphylo tree without node types aphylo_from_data_frame(tree, anno) # Now including types aphylo_from_data_frame(tree, anno, types) # Dropping some data aphylo_from_data_frame(tree, anno[sample.int(nrow(anno), 10),])
The function is a wrapper of stats::optim()
.
aphylo_mle( model, params, method = "L-BFGS-B", priors = function(p) 1, control = list(), lower = 1e-05, upper = 1 - 1e-05, check_informative = getOption("aphylo_informative", FALSE), reduced_pseq = getOption("aphylo_reduce_pseq", TRUE) )
aphylo_mle( model, params, method = "L-BFGS-B", priors = function(p) 1, control = list(), lower = 1e-05, upper = 1 - 1e-05, check_informative = getOption("aphylo_informative", FALSE), reduced_pseq = getOption("aphylo_reduce_pseq", TRUE) )
model |
A model as specified in aphylo-model. |
params |
A vector of length 7 with initial parameters. In particular
|
method , control , lower , upper
|
Arguments passed to |
priors |
A function to be used as prior for the model (see bprior). |
check_informative |
Logical scalar. When |
reduced_pseq |
Logical. When |
The default starting parameters are described in APHYLO_PARAM_DEFAULT.
An object of class aphylo_estimates.
Other parameter estimation:
APHYLO_DEFAULT_MCMC_CONTROL
# Using simulated data ------------------------------------------------------ set.seed(19) dat <- raphylo(100) dat <- rdrop_annotations(dat, .4) # Computing Estimating the parameters ans <- aphylo_mle(dat ~ psi + mu_d + eta + Pi) ans # Plotting the path plot(ans) # Computing Estimating the parameters Using Priors for all the parameters mypriors <- function(params) { dbeta(params, c(2, 2, 2, 2, 1, 10, 2), rep(10, 7)) } ans_dbeta <- aphylo_mle(dat ~ psi + mu_d + eta + Pi, priors = mypriors) ans_dbeta
# Using simulated data ------------------------------------------------------ set.seed(19) dat <- raphylo(100) dat <- rdrop_annotations(dat, .4) # Computing Estimating the parameters ans <- aphylo_mle(dat ~ psi + mu_d + eta + Pi) ans # Plotting the path plot(ans) # Computing Estimating the parameters Using Priors for all the parameters mypriors <- function(params) { dbeta(params, c(2, 2, 2, 2, 1, 10, 2), rep(10, 7)) } ans_dbeta <- aphylo_mle(dat ~ psi + mu_d + eta + Pi, priors = mypriors) ans_dbeta
The aphylo
class tree holds both the tree structure represented as a
partially ordered phylogenetic tree, and node annotations. While annotations
are included for both leafs and inner nodes, the algorithms included in this
package only uses the leaf annotations.
new_aphylo(tree, tip.annotation, ...) ## S3 method for class 'phylo' new_aphylo( tree, tip.annotation, node.annotation = NULL, tip.type = NULL, node.type = NULL, ... )
new_aphylo(tree, tip.annotation, ...) ## S3 method for class 'phylo' new_aphylo( tree, tip.annotation, node.annotation = NULL, tip.type = NULL, node.type = NULL, ... )
tree |
An object of class phylo |
tip.annotation , node.annotation
|
Annotation data. See aphylo. |
... |
Further argmuents passed to the method. |
tip.type , node.type
|
Integer vectors with values 0,1. 0 denotes duplication node and 1 speciation node. This is used in LogLike. |
A list of class aphylo
with the following elements:
tree |
An object of class phylo. |
tip.annotation |
An integer matrix. Tip (leaf) nodes annotations. |
node.annotation |
An integer matrix (optional). Internal nodes annotations. |
offspring |
A list. List of offspring of each node. |
pseq |
Integer vector. The pruning sequence (postorder). |
reduced_pseq |
Integer vector. The reduced version of |
Ntips.annotated |
Integer. Number of tips with annotations. |
tip.type |
Binary of length |
tip.type |
Binary of length |
Other Data management functions:
aphylo_from_data_frame()
Other aphylo methods:
aphylo-methods
# A simple example ---------------------------------------------------------- data(fakeexperiment) data(faketree) ans <- new_aphylo(fakeexperiment[,2:3], tree = as.phylo(faketree)) # We can visualize it plot(ans)
# A simple example ---------------------------------------------------------- data(fakeexperiment) data(faketree) ans <- new_aphylo(fakeexperiment[,2:3], tree = as.phylo(faketree)) # We can visualize it plot(ans)
Indexing aphylo objects
## S3 method for class 'aphylo' x[i, j, drop = FALSE] ## S3 replacement method for class 'aphylo' x[i, j] <- value
## S3 method for class 'aphylo' x[i, j, drop = FALSE] ## S3 replacement method for class 'aphylo' x[i, j] <- value
x |
An object of class aphylo. |
i , j
|
Integer vector. Indices of genes or functions. |
drop |
Logical scalar. When |
value |
Integer vector. Replacing values, can be either |
The subsetting method allows selecting one or more annotations from
the aphylo object. Whenever i
is specified, then aphylo returns the corresponding
annotations.
When indexing with i
: A data frame with the annotations of the
selected genes.
When only indexing with j
(drop = FALSE
): An aphylo
object with the selected sets of
annotations.
When only indexing with j
(drop = TRUE
): A data.frame with the selected
annotations.
When indexing on both i
and j
: A data.frame with the selected genes and annotations.
set.seed(12312) atree <- raphylo(50, P = 4) atree[1:10,] atree[,2:3] atree[, 2:3, drop = TRUE] atree[1:10, 2:3]
set.seed(12312) atree <- raphylo(50, P = 4) atree[1:10,] atree[,2:3] atree[, 2:3, drop = TRUE] atree[1:10, 2:3]
aphylo
and multiAphylo
objectsInformation about annotations, in particular, number of annotations (Nann
),
number of annotated leaves (Nannotated
), number of unnanotated leaves
(Nunannotated
), and number of trees (Ntrees
).
Nann(phy) Nannotated(phy) Ntrees(phy)
Nann(phy) Nannotated(phy) Ntrees(phy)
phy |
Either an object of class aphylo, multiAphylo, or aphylo_estimates. |
If phy
is of class aphylo
, then a single scalar.
otherwise, if phy
is of class multiAphylo
Other information:
ape-methods
# Generating data for the example set.seed(223) dat <- rmultiAphylo(10, n = 5, P = 2) Nann(dat) Nannotated(dat) Ntrees(dat)
# Generating data for the example set.seed(223) dat <- rmultiAphylo(10, n = 5, P = 2) Nann(dat) Nannotated(dat) Ntrees(dat)
aphylo
objectsPlot and print methods for aphylo
objects
## S3 method for class 'aphylo' plot( x, y = NULL, prop = 0.15, node.type.col = c(dupl = "black", other = "gray"), node.type.size = c(dupl = 0, other = 0), rect.args = list(), as_ci = NULL, ... )
## S3 method for class 'aphylo' plot( x, y = NULL, prop = 0.15, node.type.col = c(dupl = "black", other = "gray"), node.type.size = c(dupl = 0, other = 0), rect.args = list(), as_ci = NULL, ... )
x |
An object of class |
y |
Ignored. |
prop |
Numeric scalar between 0 and 1. Proportion of the device that the
annotations use in |
node.type.col , node.type.size
|
Vectors of length 2. In the case of
|
rect.args |
List of arguments passed to graphics::rect. |
as_ci |
Integer vector. Internal use only. |
... |
Further arguments passed to ape::plot.phylo. |
The plot.aphylo
function is a wrapper of ape::plot.phylo.
In the case of plot.aphylo
, NULL
.
Other aphylo methods:
aphylo-class
set.seed(7172) atree <- raphylo(20) plot(atree)
set.seed(7172) atree <- raphylo(20) plot(atree)
aphylo
This function the the workhorse behind the likelihood function. It creates
arbitrary models by modifying the call to LogLike()
function according to
what the user specifies as model.
eta(..., env) psi(..., env) Pi(..., env) mu_d(..., env) mu_s(..., env) aphylo_formula(fm, params, priors, env = parent.frame())
eta(..., env) psi(..., env) Pi(..., env) mu_d(..., env) mu_s(..., env) aphylo_formula(fm, params, priors, env = parent.frame())
... |
Either 0, 1 or both. Depending on the parameter, the index of the model parameter that will be set as fixed. |
env |
Environment (not to be called by the user). |
fm |
A formula. Model of the type |
params |
Numeric vector with model parameters. |
priors |
(optional) A function. Prior for the model. |
A list with the following elements:
fun
A function. The log-likelihood function.
fixed
Logical vector.
set.seed(12) x <- raphylo(10) # Baseline model aphylo_formula(x ~ mu_d) # Mislabeling probabilities aphylo_formula(x ~ mu_d + psi) # Different probabilities for speciation and duplication node # (only works if you have both types) aphylo_formula(x ~ mu_d + mu_s + psi) # Mislabeling probabilities and etas(fixed) aphylo_formula(x ~ mu_d + psi + eta(0, 1)) # Mislabeling probabilities and Pi aphylo_formula(x ~ mu_d + psi + Pi)
set.seed(12) x <- raphylo(10) # Baseline model aphylo_formula(x ~ mu_d) # Mislabeling probabilities aphylo_formula(x ~ mu_d + psi) # Different probabilities for speciation and duplication node # (only works if you have both types) aphylo_formula(x ~ mu_d + mu_s + psi) # Mislabeling probabilities and etas(fixed) aphylo_formula(x ~ mu_d + psi + eta(0, 1)) # Mislabeling probabilities and Pi aphylo_formula(x ~ mu_d + psi + Pi)
as.phylo
functionThis function takes an edgelist and recodes (relabels) the nodes following ape's coding convention.
## S3 method for class 'matrix' as.phylo(x, edge.length = NULL, root.edge = NULL, ...) ## S3 method for class 'aphylo' as.phylo(x, ...)
## S3 method for class 'matrix' as.phylo(x, edge.length = NULL, root.edge = NULL, ...) ## S3 method for class 'aphylo' as.phylo(x, ...)
x |
Either an edgelist or an object of class aphylo. |
edge.length |
A vector with branch lengths (optional). |
root.edge |
A numeric scalar with the length for the root node (optional). |
... |
Further arguments passed to the method. |
An integer matrix of the same dimmension as edges
with the following
aditional attribute:
labels |
Named integer vector of size |
# A simple example ---------------------------------------------------------- # This tree has a coding different from ape's mytree <- matrix(c(1, 2, 1, 3, 2, 4, 2, 5), byrow = TRUE, ncol=2) mytree ans <- as.phylo(mytree) ans plot(ans)
# A simple example ---------------------------------------------------------- # This tree has a coding different from ape's mytree <- matrix(c(1, 2, 1, 3, 2, 4, 2, 5), byrow = TRUE, ncol=2) mytree ans <- as.phylo(mytree) ans plot(ans)
The AUC values are computed by approximation using the area of the polygons formed under the ROC curve.
auc(pred, labels, nc = 200L, nine_na = TRUE) ## S3 method for class 'aphylo_auc' print(x, ...) ## S3 method for class 'aphylo_auc' plot(x, y = NULL, ...)
auc(pred, labels, nc = 200L, nine_na = TRUE) ## S3 method for class 'aphylo_auc' print(x, ...) ## S3 method for class 'aphylo_auc' plot(x, y = NULL, ...)
pred |
A numeric vector with the predictions of the model. Values must range between 0 and 1. |
labels |
An integer vector with the labels (truth). Values should be either 0 or 1. |
nc |
Integer. Number of cutoffs to use for computing the rates and AUC. |
nine_na |
Logical. When |
x |
An object of class |
... |
Further arguments passed to the method. |
y |
Ignored. |
A list:
tpr
A vector of length nc
with the True Positive Rates.
tnr
A vector of length nc
with the True Negative Rates.
fpr
A vector of length nc
with the False Positive Rates.
fnr
A vector of length nc
with the False Negative Rates.
auc
A numeric value. Area Under the Curve.
cutoffs
A vector of length nc
with the cutoffs used.
set.seed(8381) x <- rdrop_annotations(raphylo(50), .3) ans <- aphylo_mcmc(x ~ mu_d + mu_s + Pi) ans_auc <- auc(predict(ans, loo = TRUE), x[,1,drop=TRUE]) print(ans_auc) plot(ans_auc)
set.seed(8381) x <- rdrop_annotations(raphylo(50), .3) ans <- aphylo_mcmc(x ~ mu_d + mu_s + Pi) ans_auc <- auc(predict(ans, loo = TRUE), x[,1,drop=TRUE]) print(ans_auc) plot(ans_auc)
This function computes the distance between .5 and the observed proportion of ones for each function in a tree.
balance_ann(phy)
balance_ann(phy)
phy |
An object of class aphylo or multiAphylo |
Functional balance is defined as follows
Where A
is the matrix of annotations.
With values ranging between 0 and 1, one been perfect balance, this is, equal number of zeros and ones in the annotations. In the case of multiple functions, as noted in the formula, the balance is the average across functions.
If phy
is an object of class phylo
, a single scalar, otherwise,
it returns a vector of length Ntrees
(phy)
.
x <- raphylo(20, P = 2) balance_ann(x) balance_ann(c(x, x))
x <- raphylo(20, P = 2) balance_ann(x) balance_ann(c(x, x))
Convenient wrappers to be used with the aphylo
estimation methods.
bprior(shape1 = 1, shape2 = 9, ...) uprior()
bprior(shape1 = 1, shape2 = 9, ...) uprior()
shape1 , shape2 , ...
|
Arguments passed to stats::dbeta |
In the case of bprior
, a wrapper of the function stats::dbeta.
uprior
returns a function function(p) 1
(the uniform prior)
bprior(1, 9) uprior()
bprior(1, 9) uprior()
pruner
Creates an external pointer to an object of class aphylo_pruner
. This is mostly
used to compute the model's likelihood function faster by reusing underlying
C++ class objects to store probability matrices and data. This is intended
for internal use only.
dist2root(ptr) get_postorder(ptr) new_aphylo_pruner(x, ...)
dist2root(ptr) get_postorder(ptr) new_aphylo_pruner(x, ...)
ptr |
An object of class |
x |
An object of class aphylo or multiAphylo. |
... |
Further arguments passed to the method |
The underlying implementation of the pruning function is based on the pruner C++ library that implements Felsenstein's tree pruning algorithm. See https://github.com/USCbiostats/pruner.
dist2root
: An integer vector with the number of steps from each
node (internal or not) to the root node.
get_postorder
: An integer vector with the postorder sequence
for pruning the tree (indexed from 0).
The function new_aphylo_pruner
returns an object of class
aphylo_pruner
or multiAphylo_pruner
, depending on the class of x
.
set.seed(1) x <- raphylo(20) pruner <- new_aphylo_pruner(x) # Computing loglike LogLike( pruner, psi = c(.10, .20), mu_d = c(.90, .80), mu_s = c(.10, .05), Pi = .05, eta = c(.90, .80) ) dist2root(pruner) get_postorder(pruner)
set.seed(1) x <- raphylo(20) pruner <- new_aphylo_pruner(x) # Computing loglike LogLike( pruner, psi = c(.10, .20), mu_d = c(.90, .80), mu_s = c(.10, .05), Pi = .05, eta = c(.90, .80) ) dist2root(pruner) get_postorder(pruner)
A fake dataset containing 2 functional state of the leaf nodes. Each function can have either 0 (unactive), 1 (active) or 9 (n/a). This dataset is inteded for testing only.
A data frame with 4 rows and 3 variables:
State of function 1.
State of function 1.
Integer, ID of the leaf.
BiostatsUSC
A fake dataset containing the parent-offspring relations between genes. This dataset is inteded for testing only.
A data frame with 6 rows and 2 variables:
Integer, ID of the offspring.
Integer, ID of the parent.
BiostatsUSC
Uses a simple algorithm to impute duplication events based on the terminal genes of the tree. An interior node is a duplication event if a specie has two or more leafs within its clade.
imputate_duplications(tree, species)
imputate_duplications(tree, species)
tree |
An object of class ape::phylo. |
species |
A character vector of length |
This function will take a vector of species and, based on that, assign duplication events throughout the interior nodes. An interior node is labeled as a duplication event if two or more of the leaves within it are from the same species.
A logical vector of length ape::Nnode(tree, internal.only = FALSE)
with TRUE
to indicate that the corresponding node is a duplication event.
The order matches that in the input tree.
# Data from PANTHER path <- system.file("tree.tree", package="aphylo") ptree <- read_panther(path) # Extracting the species sp <- gsub(".+[:]|[|].+", "" , ptree$tree$tip.label) # Imputing duplications imputate_duplications(ptree$tree, species = sp)
# Data from PANTHER path <- system.file("tree.tree", package="aphylo") ptree <- read_panther(path) # Extracting the species sp <- gsub(".+[:]|[|].+", "" , ptree$tree$tip.label) # Imputing duplications imputate_duplications(ptree$tree, species = sp)
For each node in a tree, the functions list_offspring
and list_parents
lists all its offspring and parents, respectively.
list_offspring(x) list_parents(x)
list_offspring(x) list_parents(x)
x |
An object of class |
List of length n
(total number of nodes).
# A simple example with phylo tree ------------------------------------------ set.seed(4) x <- ape::rtree(10) list_offspring(x)
# A simple example with phylo tree ------------------------------------------ set.seed(4) x <- ape::rtree(10) list_offspring(x)
This function computes the log-likelihood of the chosen parameters given
a particular dataset. The arguments annotations
, and offspring
should be as those returned by new_aphylo()
.
For complete parameter estimation see aphylo_estimates.
LogLike(tree, psi, mu_d, mu_s, eta, Pi, verb_ans = TRUE, check_dims = TRUE)
LogLike(tree, psi, mu_d, mu_s, eta, Pi, verb_ans = TRUE, check_dims = TRUE)
tree |
A phylogenetic tree of class aphylo. |
psi |
Numeric vector of length 2. Misclasification probabilities. (see |
mu_d , mu_s
|
Numeric vector of length 2. Gain/loss probabilities (see |
eta |
Numeric vector of length 2. Annotation bias probabilities (see |
Pi |
Numeric scalar. Root node probability of having the function (see |
verb_ans |
Logical scalar. When |
check_dims |
Logical scalar. When |
The parameters to estimate are described as follows:
psi
: A vector of length 2 with and
, which are the misclassification probabilities fo
and
respectively.
mu_d
, mu_s
: A vector of length 2 with and
which are the gain and loss probabilities respectively.
The subscript d denotes duplication nodes and s speciation node.
eta
: A vector of length 2 with and
which are the annotation bias probabilities.
Pi
: A numeric scalar which for which equals the probability
of the root node having the function.
A list of class phylo_LogLik
with the following elements:
S |
An integer matrix of size |
Pr |
A numeric matrix of size |
ll |
A numeric scalar with the log-likelihood value given the chosen parameters. |
Switch labels acoording to mislabeling probabilities
mislabel(atree, psi)
mislabel(atree, psi)
atree |
An object of class aphylo. |
psi |
Numeric vector of length 2. Misclasification probabilities. (see |
An object of class aphylo with modified labels.
set.seed(131) x <- raphylo(5, P=2, psi=c(0,0)) x$tip.annotation # Flipping 0s to 1s and vice versa mislabel(x, psi = c(1,1))$tip.annotation
set.seed(131) x <- raphylo(5, P=2, psi=c(0,0)) x$tip.annotation # Flipping 0s to 1s and vice versa mislabel(x, psi = c(1,1))$tip.annotation
This is equivalent to what ape::c.phylo does.
## S3 method for class 'aphylo' c(...) ## S3 method for class 'multiAphylo' print(x, ...)
## S3 method for class 'aphylo' c(...) ## S3 method for class 'multiAphylo' print(x, ...)
... |
One or several object of class |
x |
An object of class |
A list of class multiAphylo
. Each element corresponds to a single aphylo
object.
data(fakeexperiment) data(faketree) ans <- new_aphylo(fakeexperiment[,2:3], tree = as.phylo(faketree)) c(ans, ans)
data(fakeexperiment) data(faketree) ans <- new_aphylo(fakeexperiment[,2:3], tree = as.phylo(faketree)) c(ans, ans)
The PANTHER Project handles a modified version of newick tree files which,
besides of the tree structure, includes the type of node and ancestor
labels. This function is a wrapper of ape::read.tree()
.
read_panther(x, tree.reader = ape::read.tree, ...) read.panther(x, tree.reader = ape::read.tree, ...)
read_panther(x, tree.reader = ape::read.tree, ...) read.panther(x, tree.reader = ape::read.tree, ...)
x |
Character scalar. Full path to the panther file. |
tree.reader |
Function that will be used to read the tree file.
It can be either |
... |
Further arguments passed to |
A list consisting of a data.frame and a phylo
object. The
data.frame has the following columns:
branch_length |
Numeric vector. Length of the branch to its parent node. |
type |
Character vector. Can be either |
ancestor |
Character vector. Name of the ancestor. |
The nodeids can be identified using the rownames.
Other reading:
read_nhx()
,
read_pli()
path <- system.file("tree.tree", package="aphylo") read_panther(path)
path <- system.file("tree.tree", package="aphylo") read_panther(path)
Plot Log-Likelihood function of the model
plot_logLik(x, sets, ...) ## S3 method for class 'aphylo' plot_logLik(x, sets, ...) ## S3 method for class 'formula' plot_logLik(x, sets, ...) ## S3 method for class 'aphylo_estimates' plot_logLik(x, sets, ...)
plot_logLik(x, sets, ...) ## S3 method for class 'aphylo' plot_logLik(x, sets, ...) ## S3 method for class 'formula' plot_logLik(x, sets, ...) ## S3 method for class 'aphylo_estimates' plot_logLik(x, sets, ...)
x |
An object of class |
sets |
(optional) Character matrix of size |
... |
Aditional parameters to be passed to |
NULL (invisible). Generates a plot of the loglikelihood of the model.
# Loading data data(fakeexperiment) data(faketree) O <- new_aphylo(fakeexperiment[,2:3], tree = as.phylo(faketree)) # Baseline plot (all parameters but Pi) plot_logLik(O) # No psi parameter plot_logLik(O ~ mu_d + Pi + eta)
# Loading data data(fakeexperiment) data(faketree) O <- new_aphylo(fakeexperiment[,2:3], tree = as.phylo(faketree)) # Baseline plot (all parameters but Pi) plot_logLik(O) # No psi parameter plot_logLik(O ~ mu_d + Pi + eta)
Multiavariate plot (surface)
plot_multivariate( fun, params, domain, sets, nlevels = 20, args = list(), plotfun = graphics::image, plot = TRUE, postplot = function(params, res) { points(params, cex = 2, pch = 3, col = "red") }, mfrow = NULL, ... )
plot_multivariate( fun, params, domain, sets, nlevels = 20, args = list(), plotfun = graphics::image, plot = TRUE, postplot = function(params, res) { points(params, cex = 2, pch = 3, col = "red") }, mfrow = NULL, ... )
fun |
A function that receives 2 or more parameters and returns a single number. |
params |
Numeric vector with the default parameters. |
domain |
(optional) Named list with as many elements as parameters. Specifies the domain of the function. |
sets |
(optional) Character matrix of size |
nlevels |
Integer. Number of levels. |
args |
List of named arguments to be passed to |
plotfun |
Function that will be used to plot |
plot |
Logical. When |
postplot |
Function to be called after |
mfrow |
Passed to graphics::par. |
... |
Further arguments passed to |
A list of length length(sets)
, each with the following:
x,y,z vectors of coordinates.
xlab,ylab vectors with the corresponding labels.
# Example: A model with less parameters set.seed(1231) x <- raphylo(20) ans <- aphylo_mcmc( x ~ psi + mu_d + mu_s, control = list(nsteps = 1e3, burnin = 0) ) # Creating the multivariate plot (using by default image) plot_multivariate( function(...) { ans$fun(unlist(list(...)), priors = ans$priors, dat = ans$dat, verb_ans = FALSE) }, sets = matrix(c("mu_d0", "mu_d1", "psi0", "psi1"), ncol=2), params = ans$par )
# Example: A model with less parameters set.seed(1231) x <- raphylo(20) ans <- aphylo_mcmc( x ~ psi + mu_d + mu_s, control = list(nsteps = 1e3, burnin = 0) ) # Creating the multivariate plot (using by default image) plot_multivariate( function(...) { ans$fun(unlist(list(...)), priors = ans$priors, dat = ans$dat, verb_ans = FALSE) }, sets = matrix(c("mu_d0", "mu_d1", "psi0", "psi1"), ncol=2), params = ans$par )
Visualize predictions
## S3 method for class 'aphylo_prediction_score' plot( x, y = NULL, main = "Prediction Accuracy: Observed versus predicted values", main.colorkey = "Probability of Functional Annotation", which.fun = seq_len(ncol(x$expected)), include.labels = NULL, labels.col = "black", leafs_only = TRUE, ... )
## S3 method for class 'aphylo_prediction_score' plot( x, y = NULL, main = "Prediction Accuracy: Observed versus predicted values", main.colorkey = "Probability of Functional Annotation", which.fun = seq_len(ncol(x$expected)), include.labels = NULL, labels.col = "black", leafs_only = TRUE, ... )
x |
An object of class |
y |
Ignored. |
main |
Passed to |
main.colorkey |
Character scalar. Title of the colorkey (optional). |
which.fun |
Integer vector. Which function to plot. |
include.labels |
Logical scalar. When |
labels.col |
Character scalar. Color of the labels. |
leafs_only |
Logical. When |
... |
Ignored |
If include.labels = NULL
and ncol(x$expected) > 40
,
then include.labels=FALSE
by default.
NULL (invisible) Generates a plot of the predictions.
set.seed(8783) atree <- raphylo(29) ans <- aphylo_mle(atree ~ mu_d + mu_s + Pi) pred_s <- prediction_score(ans) pred_s plot(pred_s)
set.seed(8783) atree <- raphylo(29) ans <- aphylo_mle(atree ~ mu_d + mu_s + Pi) pred_s <- prediction_score(ans) pred_s plot(pred_s)
The function predict_pre_order
uses a pre-order algorithm to compute the
posterior probabilities, whereas the predict_brute_force
computes posterior
probabilities generating all possible cases.
## S3 method for class 'aphylo_estimates' predict( object, which.tree = NULL, ids = NULL, newdata = NULL, params = stats::coef(object), loo = TRUE, nsamples = 1L, centiles = c(0.025, 0.5, 0.975), cl = NULL, ... ) predict_pre_order(x, ...) ## S3 method for class 'aphylo_estimates' predict_pre_order( x, params = stats::coef(x), which.tree = 1:Ntrees(x), ids = lapply(Ntip(x)[which.tree], seq_len), loo = TRUE, nsamples = 1L, centiles = c(0.025, 0.5, 0.975), ncores = 1L, cl = NULL, ... ) ## S3 method for class 'aphylo' predict_pre_order(x, psi, mu_d, mu_s, eta, Pi, ...) predict_brute_force(atree, psi, mu_d, mu_s, Pi, force = FALSE)
## S3 method for class 'aphylo_estimates' predict( object, which.tree = NULL, ids = NULL, newdata = NULL, params = stats::coef(object), loo = TRUE, nsamples = 1L, centiles = c(0.025, 0.5, 0.975), cl = NULL, ... ) predict_pre_order(x, ...) ## S3 method for class 'aphylo_estimates' predict_pre_order( x, params = stats::coef(x), which.tree = 1:Ntrees(x), ids = lapply(Ntip(x)[which.tree], seq_len), loo = TRUE, nsamples = 1L, centiles = c(0.025, 0.5, 0.975), ncores = 1L, cl = NULL, ... ) ## S3 method for class 'aphylo' predict_pre_order(x, psi, mu_d, mu_s, eta, Pi, ...) predict_brute_force(atree, psi, mu_d, mu_s, Pi, force = FALSE)
which.tree |
Integer scalar. Which tree to include in the prediction. |
ids |
Integer vector. Ids (positions) of the nodes that need to be predicted (see details.) |
newdata |
(optional) An aphylo object. |
params |
A numeric vector with the corresponding parameters. |
loo |
Logical scalar. When |
nsamples |
Integer scalar. When greater than one, the prediction is done using a random sample from the MCMC chain. This only works if the model was fitted using MCMC, of course. |
centiles |
Used together with |
... |
Ignored. |
ncores , cl
|
Passed to |
psi |
Numeric vector of length 2. Misclasification probabilities. (see |
mu_d , mu_s
|
Numeric vector of length 2. Gain/loss probabilities (see |
eta |
Numeric vector of length 2. Annotation bias probabilities (see |
Pi |
Numeric scalar. Root node probability of having the function (see |
atree , x , object
|
Either a tree of class aphylo or an object of class aphylo_estimates |
force |
Logical scalar. When |
The function predict_brute_force
is only intended for testing. For predictions
after estimating the model, see predict.aphylo_estimates.
In the case of the parameter loo
(leave-one-out), while making tip-level
predictions, at each leaf the algorithm will drop annotations regarding that
leaf, making its prediction using all the available information except the
one include in such leaf.
The predict_brute_force
function makes the (obviously) brute force
calculation of the probabilities. It will perform
It returns a list with the following:
Pr
The conditional probabilities of observing a tree given a particular state
of the leave nodes. The size is given by (2^nnodes x 2^nleaves), each entry is
read as "The probability of observing scenario i (row) given that the leaves have
state j (colum)." The scenarios are specified in the row
matrix returned by the
function.
row
Indicates the state of each node (columns) per scenario (row).
col
Indicates the state of each leaf node (columns) per potential leaf
scenario.
In the case of the predict
method, a P
column numeric matrix
with values between (probabilities).
The ids
parameter indicates for which nodes, both internal and tips, the
predictions should be made. By default, the function will only make predictions
on the leaf nodes.
The ids follow ape
's convention, this is, 1:Ntips(x)
are the leaf nodes, Ntips(x) + 1L
is the root node, and everything else
are the interior nodes.
Although the prediction algorithm is fast, indicating only
a subset of the nodes could make a difference when loo = TRUE
and/or
nsamples > 1
(calculating a Credible/Confidence Interval.)
In the case of multiAphylo
, ids
should be passed as a list of length
Ntrees(x)
, with each element indicating the nodes. Otherwise, ids
are passed as an integer vector.
# Single tree --------------------------------------------------------------- set.seed(123) atree <- raphylo(10) # Fitting the model with MLE ans <- aphylo_mle(atree ~ psi + mu_d + mu_s + Pi) # Prediction on leaves predict(ans) # Prediction on all nodes (including root and interior) predict(ans, ids = 1:Nnode(ans, internal.only = FALSE)) # Multiple trees (multiAphylo) ---------------------------------------------- atree <- c(raphylo(10), raphylo(5)) # Fitting the model with MLE ans <- aphylo_mle(atree ~ psi + mu_d + mu_s + Pi) # Prediction on leaves predict(ans) # Predicting only interior nodes predict(ans, ids = list(11:19, 6:9))
# Single tree --------------------------------------------------------------- set.seed(123) atree <- raphylo(10) # Fitting the model with MLE ans <- aphylo_mle(atree ~ psi + mu_d + mu_s + Pi) # Prediction on leaves predict(ans) # Prediction on all nodes (including root and interior) predict(ans, ids = 1:Nnode(ans, internal.only = FALSE)) # Multiple trees (multiAphylo) ---------------------------------------------- atree <- c(raphylo(10), raphylo(5)) # Fitting the model with MLE ans <- aphylo_mle(atree ~ psi + mu_d + mu_s + Pi) # Prediction on leaves predict(ans) # Predicting only interior nodes predict(ans, ids = list(11:19, 6:9))
Calculate prediction score (quality of prediction)
prediction_score(x, expected, alpha0 = NULL, alpha1 = NULL, W = NULL, ...) ## S3 method for class 'aphylo_estimates' prediction_score( x, expected = NULL, alpha0 = NULL, alpha1 = NULL, W = NULL, loo = TRUE, ... ) ## S3 method for class 'aphylo_prediction_score' print(x, ...)
prediction_score(x, expected, alpha0 = NULL, alpha1 = NULL, W = NULL, ...) ## S3 method for class 'aphylo_estimates' prediction_score( x, expected = NULL, alpha0 = NULL, alpha1 = NULL, W = NULL, loo = TRUE, ... ) ## S3 method for class 'aphylo_prediction_score' print(x, ...)
x |
An object of class aphylo_estimates or other numeric vector-like object (see details). |
expected |
Numeric vector-like object length |
alpha0 , alpha1
|
Probability of observing a zero an a one, respectively. |
W |
A square matrix. Must have as many rows as genes in |
... |
Further arguments passed to predict.aphylo_estimates |
loo |
Logical scalar. When |
In the case of prediction_score
, ...
are passed to
predict.aphylo_estimates
.
The function will accept x
as a numeric vector, list of vectors, or matrix.
Otherwise, it will try to coerce it to a matrix. If it fails, it will throw
an error.
In the case of the method for aphylo estimates, the function takes as a reference using alpha equal to the proportion of observed tip annotations that are equal to 1, this is:
mean(x$dat$tip.annotation[x$dat$tip.annotation != 9L], na.rm = TRUE)
A list of class aphylo_prediction_score
:
obs : Observed 1 - MAE.
obs_raw : Unnormalized (raw) scores.
random_raw: Unnormalized (raw) scores.
worse_raw : Unnormalized (raw) scores.
pval : Computed p-value.
worse : Reference of worse case.
predicted : Numeric matrix with observed predictions.
expected : Integer matrix with expected annotations.
random : Random score (null).
alpha0 : The passed alpha parameters.
alpha1 : The passed alpha parameters.
auc : An object of class aphylo_auc
.
obs.ids : Indices of the ids.
leaf.ids : IDs of the leafs (if present).
tree : Of class phylo
.
# Example with prediction_score --------------------------------------------- set.seed(11552) ap <- raphylo( 50, P = 1, Pi = 0, mu_d = c(.8,.2), mu_s = c(0.1,0.1), psi = c(0,0) ) ans <- aphylo_mcmc( ap ~ mu_d + mu_s + Pi, control = list(nsteps=2e3, thin=20, burnin = 500), priors = bprior(c(9, 1, 1, 1, 5), c(1, 9, 9, 9, 5)) ) (pr <- prediction_score(ans, loo = TRUE)) plot(pr)
# Example with prediction_score --------------------------------------------- set.seed(11552) ap <- raphylo( 50, P = 1, Pi = 0, mu_d = c(.8,.2), mu_s = c(0.1,0.1), psi = c(0,0) ) ans <- aphylo_mcmc( ap ~ mu_d + mu_s + Pi, control = list(nsteps=2e3, thin=20, burnin = 500), priors = bprior(c(9, 1, 1, 1, 5), c(1, 9, 9, 9, 5)) ) (pr <- prediction_score(ans, loo = TRUE)) plot(pr)
Simulation of Annotated Phylogenetic Trees
raphylo( n = NULL, tree = NULL, edge.length = NULL, tip.type = NULL, node.type = function(n) sample.int(2, size = n, replace = TRUE, prob = c(0.2, 0.8)) - 1, P = 1L, psi = c(0.05, 0.05), mu_d = c(0.9, 0.5), mu_s = c(0.05, 0.02), eta = c(1, 1), Pi = 0.2, informative = getOption("aphylo_informative", FALSE), maxtries = 20L ) rmultiAphylo(R, ...)
raphylo( n = NULL, tree = NULL, edge.length = NULL, tip.type = NULL, node.type = function(n) sample.int(2, size = n, replace = TRUE, prob = c(0.2, 0.8)) - 1, P = 1L, psi = c(0.05, 0.05), mu_d = c(0.9, 0.5), mu_s = c(0.05, 0.02), eta = c(1, 1), Pi = 0.2, informative = getOption("aphylo_informative", FALSE), maxtries = 20L ) rmultiAphylo(R, ...)
n |
Integer scalar. Number of leafs. If not specified, then |
tree |
An object of class phylo. |
edge.length |
Passed to sim_tree. |
tip.type , node.type
|
Integer vectors with values 0,1. 0 denotes duplication node and 1 speciation node. This is used in LogLike. |
P |
Integer scalar. Number of functions to generate. |
psi |
Numeric vector of length 2. Misclasification probabilities. (see |
mu_d , mu_s
|
Numeric vector of length 2. Gain/loss probabilities (see |
eta |
Numeric vector of length 2. Annotation bias probabilities (see |
Pi |
Numeric scalar. Root node probability of having the function (see |
informative , maxtries
|
Passed to sim_fun_on_tree. |
R |
Integer, number of replicates |
... |
Further arguments passed to |
The rmultiAphylo
function is a wrapper around raphylo
.
An object of class aphylo
# A simple example ---------------------------------------------------------- set.seed(1231) ans <- raphylo(n=500)
# A simple example ---------------------------------------------------------- set.seed(1231) ans <- raphylo(n=500)
The function takes an annotated tree and randomly selects leaf nodes to set annotations as 9 (missing). The function allows specifying a proportion of annotations to drop, and also the relative probability that has dropping a 0 with respecto to a 1.
rdrop_annotations( x, pcent, prob.drop.0 = 0.5, informative = getOption("aphylo_informative", FALSE) )
rdrop_annotations( x, pcent, prob.drop.0 = 0.5, informative = getOption("aphylo_informative", FALSE) )
x |
An object of class aphylo. |
pcent |
Numeric scalar. Proportion of the annotations to remove. |
prob.drop.0 |
Numeric scalar. Probability of removing a 0, conversely,
|
informative |
Logical scalar. If |
x
with fewer annotations (more 9s).
# The following tree has roughtly the same proportion of 0s and 1s # and 0 mislabeling. set.seed(1) x <- raphylo(200, Pi=.5, mu_d=c(.5,.5), psi=c(0,0)) summary(x) # Dropping half of the annotations summary(rdrop_annotations(x, .5)) # Dropping half of the annotations, but 0 are more likely to drop summary(rdrop_annotations(x, .5, prob.drop.0 = 2/3))
# The following tree has roughtly the same proportion of 0s and 1s # and 0 mislabeling. set.seed(1) x <- raphylo(200, Pi=.5, mu_d=c(.5,.5), psi=c(0,0)) summary(x) # Dropping half of the annotations summary(rdrop_annotations(x, .5)) # Dropping half of the annotations, but 0 are more likely to drop summary(rdrop_annotations(x, .5, prob.drop.0 = 2/3))
Read New Hampshire eXtended format for trees
read_nhx(fn, txt)
read_nhx(fn, txt)
fn |
Full path to the tree file. |
txt |
If no file is specified, trees can also be passed as a character scalar (see examples). |
A list with the following elements:
tree An object of class ape
edge Edge annotations (length and other annotations)
nhx A list of annotations NHX
"NHX - New Hampshire eXtended [version 2.0]", https://en.wikipedia.org/wiki/Newick_format#New_Hampshire_X_format
Other reading:
panther-tree
,
read_pli()
# Example directly extracted from # https://sites.google.com/site/cmzmasek/home/software/forester/nhx read_nhx( txt = "(((ADH2:0.1[&&NHX:S=human], ADH1:0.11[&&NHX:S=human]):0.05[&&NHX:S=primates:D=Y:B=100], ADHY:0.1[&&NHX:S=nematode],ADHX:0.12[&&NHX:S=insect]):0.1[&&NHX:S=metazoa:D=N], (ADH4:0.09[&&NHX:S=yeast],ADH3:0.13[&&NHX:S=yeast], ADH2:0.12[&&NHX:S=yeast], ADH1:0.11[&&NHX:S=yeast]):0.1 [&&NHX:S=Fungi])[&&NHX:D=N];" )
# Example directly extracted from # https://sites.google.com/site/cmzmasek/home/software/forester/nhx read_nhx( txt = "(((ADH2:0.1[&&NHX:S=human], ADH1:0.11[&&NHX:S=human]):0.05[&&NHX:S=primates:D=Y:B=100], ADHY:0.1[&&NHX:S=nematode],ADHX:0.12[&&NHX:S=insect]):0.1[&&NHX:S=metazoa:D=N], (ADH4:0.09[&&NHX:S=yeast],ADH3:0.13[&&NHX:S=yeast], ADH2:0.12[&&NHX:S=yeast], ADH1:0.11[&&NHX:S=yeast]):0.1 [&&NHX:S=Fungi])[&&NHX:D=N];" )
Read PLI files from SIFTER
read_pli(fn, dropNAs = TRUE)
read_pli(fn, dropNAs = TRUE)
fn |
Full path to the file |
dropNAs |
Logical scalar. When |
A data table object including the following columns:
name: Used to match UniProtKB data and GOA,
number,
go: A list of the GO annotations
moc: Evidence code
fam: Name of the family
Other reading:
panther-tree
,
read_nhx()
Simulate functions on a ginven tree
sim_fun_on_tree( tree, tip.type, node.type, psi, mu_d, mu_s, eta, Pi, P = 1L, informative = getOption("aphylo_informative", FALSE), maxtries = 20L )
sim_fun_on_tree( tree, tip.type, node.type, psi, mu_d, mu_s, eta, Pi, P = 1L, informative = getOption("aphylo_informative", FALSE), maxtries = 20L )
tree |
An object of class phylo |
tip.type , node.type
|
Integer vectors with values 0,1. 0 denotes duplication node and 1 speciation node. This is used in LogLike. |
psi |
Numeric vector of length 2. Misclasification probabilities. (see |
mu_d , mu_s
|
Numeric vector of length 2. Gain/loss probabilities (see |
eta |
Numeric vector of length 2. Annotation bias probabilities (see |
Pi |
Numeric scalar. Root node probability of having the function (see |
P |
Integer scalar. Number of functions to simulate. |
informative |
Logical scalar. When |
maxtries |
Integer scalar. If |
Using the model described in the vignette peeling_phylo.html
The optiona informative
was created such that when needed the
function can be forced to simualte annotations while making sure (or at
least trying maxtries
times) that the leafs have both 0s and 9s. From what
we've learned while conducting simulation studies, using this option may
indirectly bias the data generating process.
An matrix of size length(offspring)*P
with values 9, 0 and 1
indicating "no information"
, "no function"
and "function"
.
# Example 1 ---------------------------------------------------------------- # We need to simulate a tree set.seed(1231) newtree <- sim_tree(1e3) # Preprocessing the data # Simulating ans <- sim_fun_on_tree( newtree, psi = c(.01, .05), mu_d = c(.90, .80), mu_s = c(.1, .05), Pi = .5, eta = c(1, 1) ) # Tabulating results table(ans)
# Example 1 ---------------------------------------------------------------- # We need to simulate a tree set.seed(1231) newtree <- sim_tree(1e3) # Preprocessing the data # Simulating ans <- sim_fun_on_tree( newtree, psi = c(.01, .05), mu_d = c(.90, .80), mu_s = c(.1, .05), Pi = .5, eta = c(1, 1) ) # Tabulating results table(ans)
An alternative to ape::rtree. This function was written in C++ and is
significantly faster than rtree
.
sim_tree(n, edge.length = stats::runif)
sim_tree(n, edge.length = stats::runif)
n |
Integer scalar. Number of leaf nodes. |
edge.length |
A Function. Used to set the length of the edges. |
The algorithm was implemented as follows
Initialize N = {1, ..., n}
, E
to be empty,
k = 2*n - 1
While length(N) != 1
do:
Randomly choose a pair (i, j)
from N
Add the edges E = E U {(k, i), (k, j)}
,
Redefine N = (N \ {i, j}) U {k}
Set k = k - 1
next
Use edge.length(2*n - 1)
(simulating branch lengths).
An object of class ape::phylo with the edgelist as a postorderd,
node.label
and edge.length
.
# A very simple example ---------------------------------------------------- set.seed(1223) newtree <- sim_tree(50) plot(newtree) # A performance benchmark with ape::rtree ---------------------------------- ## Not run: library(ape) microbenchmark::microbenchmark( ape = rtree(1e3), phy = sim_tree(1e3), unit = "relative" ) # This is what you would get. # Unit: relative # expr min lq mean median uq max neval # ape 14.7598 14.30809 14.30013 16.7217 14.32843 4.754106 100 # phy 1.0000 1.00000 1.00000 1.0000 1.00000 1.000000 100 ## End(Not run)
# A very simple example ---------------------------------------------------- set.seed(1223) newtree <- sim_tree(50) plot(newtree) # A performance benchmark with ape::rtree ---------------------------------- ## Not run: library(ape) microbenchmark::microbenchmark( ape = rtree(1e3), phy = sim_tree(1e3), unit = "relative" ) # This is what you would get. # Unit: relative # expr min lq mean median uq max neval # ape 14.7598 14.30809 14.30013 16.7217 14.32843 4.754106 100 # phy 1.0000 1.00000 1.00000 1.0000 1.00000 1.000000 100 ## End(Not run)
Matrix of states
states(P)
states(P)
P |
Integer scalar. Number of functions. |
A matrix of size 2^P by P with all the possible (0,1) combinations of functions.
states(3)
states(3)
Write pli files used by SIFTER
write_pli( family_id, protein_name, protein_number, go_number, moc = "EXP", file = "" )
write_pli( family_id, protein_name, protein_number, go_number, moc = "EXP", file = "" )
family_id |
Character scalar. Name of the family |
protein_name , protein_number , go_number , moc
|
Vectors of the same length |
file |
Character scalar passed to cat. |
A string with the XML file.
set.seed(882) atree <- raphylo(5) write_pli( family_id = "a family", protein_name = atree$tree$tip.label, protein_number = 1:Ntip(atree), go_number = "GO:123123123123" ) # Possible outcome: #<?xml version="1.0"?> #<Family> # <FamilyID>a family</FamilyID> # <Protein> # <ProteinName>1</ProteinName> # <ProteinNumber>1</ProteinNumber> # <GONumber>[GO:123123123123]</GONumber> # <MOC>[EXP]</MOC> # </Protein> # <Protein> # <ProteinName>2</ProteinName> # <ProteinNumber>2</ProteinNumber> # <GONumber>[GO:123123123123]</GONumber> # <MOC>[EXP]</MOC> # </Protein> # <Protein> # <ProteinName>3</ProteinName> # <ProteinNumber>3</ProteinNumber> # <GONumber>[GO:123123123123]</GONumber> # <MOC>[EXP]</MOC> # </Protein> # <Protein> # <ProteinName>4</ProteinName> # <ProteinNumber>4</ProteinNumber> # <GONumber>[GO:123123123123]</GONumber> # <MOC>[EXP]</MOC> # </Protein> # <Protein> # <ProteinName>5</ProteinName> # <ProteinNumber>5</ProteinNumber> # <GONumber>[GO:123123123123]</GONumber> # <MOC>[EXP]</MOC> # </Protein> #</Family>
set.seed(882) atree <- raphylo(5) write_pli( family_id = "a family", protein_name = atree$tree$tip.label, protein_number = 1:Ntip(atree), go_number = "GO:123123123123" ) # Possible outcome: #<?xml version="1.0"?> #<Family> # <FamilyID>a family</FamilyID> # <Protein> # <ProteinName>1</ProteinName> # <ProteinNumber>1</ProteinNumber> # <GONumber>[GO:123123123123]</GONumber> # <MOC>[EXP]</MOC> # </Protein> # <Protein> # <ProteinName>2</ProteinName> # <ProteinNumber>2</ProteinNumber> # <GONumber>[GO:123123123123]</GONumber> # <MOC>[EXP]</MOC> # </Protein> # <Protein> # <ProteinName>3</ProteinName> # <ProteinNumber>3</ProteinNumber> # <GONumber>[GO:123123123123]</GONumber> # <MOC>[EXP]</MOC> # </Protein> # <Protein> # <ProteinName>4</ProteinName> # <ProteinNumber>4</ProteinNumber> # <GONumber>[GO:123123123123]</GONumber> # <MOC>[EXP]</MOC> # </Protein> # <Protein> # <ProteinName>5</ProteinName> # <ProteinNumber>5</ProteinNumber> # <GONumber>[GO:123123123123]</GONumber> # <MOC>[EXP]</MOC> # </Protein> #</Family>