| Title: | Evolutionary Biology in R |
|---|---|
| Description: | Comparative analysis of continuous traits influencing discrete states, and utility tools to facilitate comparative analyses. Implementations of ABBA/BABA type statistics to test for introgression in genomic data. |
| Authors: | Michelle M. Jonika, Maximos Chin, Nathan Anderson, Richard H. Adams, Jeffery P. Demuth, and Heath Blackmon |
| Maintainer: | Heath Blackmon <[email protected]> |
| License: | GPL (>=2) |
| Version: | 2.2 |
| Built: | 2026-05-10 18:08:12 UTC |
| Source: | https://github.com/coleoguy/evobir |
evobiR is a collection of tools for use in evolutionary biology. Functions include comparative methods for analyzing the relationship between continuous and discrete traits on phylogenies, implementations of Patterson's D-statistic for testing introgression, tools for manipulating sequence alignments and phylogenetic trees, and utility functions for sliding window analyses, residual selection, and population genetics calculations.
| Package: | evobiR |
| Type: | Package |
| Version: | 2.2 |
| Date: | 2025-02-20 |
| License: | GPL (>=2) |
More information on evobiR is available at https://github.com/coleoguy/evobir
Michelle M. Jonika, Maximos Chin, Nathan Anderson, Richard H. Adams, Jeffery P. Demuth, and Heath Blackmon
Maintainer: Heath Blackmon <[email protected]>
Simulated SNP alignment files in FASTA format for use with the D-statistic functions.
1.fasta contains 4 sequences representing single individuals in the order P1, P2, P3, and Outgroup, for use with CalcD and WinCalcD.
3.fasta contains population-level data with multiple sequences per population (sequences from the same population share identical names), for use with CalcPopD.
Heath Blackmon
http://coleoguy.github.io/
Tests whether transitions in a binary discrete trait tend to happen at unusually high or low values of a continuous trait. The idea is simple: if evolving a horn (discrete trait = 2) requires being large (continuous trait), then the ancestors that "gave birth" to horned lineages should have been bigger than random ancestors. This function checks that by comparing real transition nodes against a null distribution built from Monte Carlo resampling.
AncCond( tree, data, mc = 1000, drop.state = NULL, mat = c(0, 2, 1, 0), pi = "equal", n.tails = 1, message = TRUE )AncCond( tree, data, mc = 1000, drop.state = NULL, mat = c(0, 2, 1, 0), pi = "equal", n.tails = 1, message = TRUE )
tree |
A phylogenetic tree of class |
data |
A data frame with exactly 3 columns:
No missing data are allowed; prune any taxa that lack data before calling this function. |
mc |
Integer. Number of Monte Carlo replicates used to build the null distribution (default 1000). |
drop.state |
|
mat |
A length-4 numeric vector describing the transition rate matrix for the discrete trait. Must be one of:
|
pi |
Root state prior probabilities. Options:
|
n.tails |
Either 1 (default, one-tailed test) or 2 (two-tailed test). Use 1 if you have an a priori hypothesis about the direction. |
message |
Logical. If |
The function works in three steps:
Reconstruct ancestral continuous values via maximum likelihood
under a Brownian motion model (geiger::anc.ML).
Map discrete trait transitions onto the tree using stochastic
character mapping (make.simmap).
Build a null distribution by randomly sampling the same number
of nodes from the pool of eligible ancestors, repeating
mc times.
A named list. For the irreversible model (sum(mat) <= 1)
the list contains:
Mean continuous trait value at nodes where trait 2 originated.
Number of 1 -> 2 transitions detected.
Numeric vector of length mc giving the null
distribution of mean values.
P-value from the Monte Carlo test.
For the reversible model (sum(mat) > 1), the list has
analogous elements for both directions (1->2 and 2->1).
Blackmon, H., Adams, R.H. (2015). evobiR: Comparative analyses and teaching tools for evolutionary biology.
## Not run: library(ape) library(phytools) # Simulate a tree and data tree <- rcoal(30) data <- data.frame( species = tree$tip.label, body_size = rnorm(30, 10, 2), has_horn = sample(1:2, 30, replace = TRUE) ) result <- AncCond(tree, data, mc = 500) ## End(Not run)## Not run: library(ape) library(phytools) # Simulate a tree and data tree <- rcoal(30) data <- data.frame( species = tree$tip.label, body_size = rnorm(30, 10, 2), has_horn = sample(1:2, 30, replace = TRUE) ) result <- AncCond(tree, data, mc = 500) ## End(Not run)
A more rigorous version of AncCond that builds its null
distribution using parametric bootstrapping instead of simple node
resampling. This means the null is generated by simulating entire
neutral evolutionary histories on the tree, which better accounts for
phylogenetic structure and mapping uncertainty.
AncCondFast( tree, data, n_iter = 100, n_maps = 100, drop.state = NULL, filter_tol = 0.2, mat = "ARD", pi = "equal", message = TRUE )AncCondFast( tree, data, n_iter = 100, n_maps = 100, drop.state = NULL, filter_tol = 0.2, mat = "ARD", pi = "equal", message = TRUE )
tree |
A phylogenetic tree of class |
data |
A data frame with exactly 3 columns:
No missing data allowed; prune any incomplete taxa first. |
n_iter |
Integer. Number of valid neutral simulations to collect for the null distribution (default 100). More iterations give a smoother null but take longer. |
n_maps |
Integer. Number of stochastic maps to average over per simulation (default 100). This integrates over mapping uncertainty so you are not relying on a single random map. |
drop.state |
|
filter_tol |
Numeric. Tolerance for accepting neutral simulations
(default 0.2, meaning 20%).
A simulated trait is kept only if its count of state-2 tips is within
|
mat |
Transition model passed to |
pi |
Root state prior for the discrete trait. Options:
|
message |
Logical. If |
When should you use this instead of AncCond?
Use this function when you want a stricter statistical test. The
parametric bootstrap simulates new discrete traits that evolve
independently of the continuous trait (i.e., the null hypothesis) and
puts each simulation through the exact same stochastic-mapping pipeline
as the real data.
What does "parametric bootstrap" mean here? Instead of shuffling real node values to build a null distribution, this function:
Fits a transition rate matrix (Q) to the observed discrete trait.
Simulates new "neutral" discrete traits on the tree using that Q
matrix (via sim.history).
For each simulated trait, runs multiple stochastic maps and calculates the mean continuous value at transition nodes.
Compares the real Grand Mean to the distribution of neutral Grand Means.
A filter_tol parameter ensures that only simulations producing
a similar number of derived-state tips as the real data are kept,
preventing biologically unreasonable null simulations.
An invisible list with:
The observed Grand Mean – the average continuous trait value at nodes where 1 -> 2 transitions occurred, averaged across all stochastic maps of the real data.
Numeric vector of Grand Means from neutral
simulations (length = n_iter or fewer if the failsafe
triggered).
Two-tailed p-value.
Percent of null values above the observed.
Percent of null values below the observed.
Total simulations attempted (including rejected ones).
Number of simulations that passed the filter.
AncCond for the original Monte Carlo resampling
version, which is faster but less rigorous.
## Not run: library(ape) library(phytools) tree <- rcoal(30) data <- data.frame( species = tree$tip.label, body_size = rnorm(30, 10, 2), has_horn = sample(1:2, 30, replace = TRUE) ) # Run parametric bootstrap with 50 null simulations, 50 maps each result <- AncCondFast(tree, data, n_iter = 50, n_maps = 50) hist(result$null_dist, main = "Null Distribution") abline(v = result$obs_stat, col = "red", lwd = 2) ## End(Not run)## Not run: library(ape) library(phytools) tree <- rcoal(30) data <- data.frame( species = tree$tip.label, body_size = rnorm(30, 10, 2), has_horn = sample(1:2, 30, replace = TRUE) ) # Run parametric bootstrap with 50 null simulations, 50 maps each result <- AncCondFast(tree, data, n_iter = 50, n_maps = 50) hist(result$null_dist, main = "Null Distribution") abline(v = result$obs_stat, col = "red", lwd = 2) ## End(Not run)
Calculates Patterson's D-statistic to test for introgression (gene flow) between species using genomic sequence data. The test looks at two patterns of shared alleles – called "ABBA" and "BABA" – that should be equally common if there is no introgression. A significant departure from D = 0 suggests gene flow between non-sister taxa.
CalcD( alignment = "alignment.fasta", sig.test = "N", ambig = "D", block.size = 1000, replicate = 1000, align.format = "fasta" )CalcD( alignment = "alignment.fasta", sig.test = "N", ambig = "D", block.size = 1000, replicate = 1000, align.format = "fasta" )
alignment |
Character string. Path to the alignment file
(default |
sig.test |
Character. Type of significance test:
|
ambig |
Character. How to handle ambiguous bases (e.g., R, Y, S):
|
block.size |
Integer. Number of contiguous sites to drop in each
jackknife replicate (default 1000). Only used when |
replicate |
Integer. Number of bootstrap or jackknife replicates (default 1000). |
align.format |
Character. Format of the alignment file
(default |
The input alignment must contain exactly 4 sequences in this order:
P1 – first ingroup population
P2 – second ingroup population (suspected introgression recipient)
P3 – third population (suspected introgression donor)
O – outgroup
D is calculated as (ABBA - BABA) / (ABBA + BABA).
D = 0 means no evidence of introgression.
D > 0 suggests gene flow between P2 and P3.
D < 0 suggests gene flow between P1 and P3.
Only biallelic sites are used. Significance is assessed by computing a Z-score (|D| / SD) from the resampled distribution, then converting to a two-tailed p-value.
The D-statistic value (numeric). Also prints summary statistics to the console including site counts and (if requested) p-values.
Durand, E.Y., Patterson, N., Reich, D., & Slatkin, M. (2011). Testing for ancient admixture between closely related populations. Molecular Biology and Evolution, 28(8), 2239-2252.
## Not run: # Basic D-statistic with no significance test CalcD("my_alignment.fasta") # With bootstrap significance test CalcD("my_alignment.fasta", sig.test = "B", replicate = 500) # With jackknife (good for whole-genome data with LD) CalcD("my_alignment.fasta", sig.test = "J", block.size = 5000) ## End(Not run)## Not run: # Basic D-statistic with no significance test CalcD("my_alignment.fasta") # With bootstrap significance test CalcD("my_alignment.fasta", sig.test = "B", replicate = 500) # With jackknife (good for whole-genome data with LD) CalcD("my_alignment.fasta", sig.test = "J", block.size = 5000) ## End(Not run)
Like CalcD, but designed for population-level data where
you have multiple individuals per population. Instead of counting
discrete ABBA/BABA patterns, it uses allele frequencies to compute
a frequency-based D-statistic (Equation 2 from Durand et al. 2011).
CalcPopD( alignment = "alignment.fasta", sig.test = "N", ambig = "D", block.size = 1000, replicate = 1000, align.format = "fasta" )CalcPopD( alignment = "alignment.fasta", sig.test = "N", ambig = "D", block.size = 1000, replicate = 1000, align.format = "fasta" )
alignment |
Character string. Path to the alignment file
(default |
sig.test |
Character. Type of significance test:
|
ambig |
Character. How to handle ambiguous bases (e.g., R, Y, S):
|
block.size |
Integer. Number of contiguous sites to drop in each
jackknife replicate (default 1000). Only used when |
replicate |
Integer. Number of bootstrap or jackknife replicates (default 1000). |
align.format |
Character. Format of the alignment file
(default |
When should you use this instead of CalcD? Use
CalcPopD when you have sequenced multiple individuals from
each population (e.g., 10 humans, 10 chimps, 10 gorillas, 10
orangutans). The frequency-based approach extracts more information
from your data because it considers how common each allele is within
each population, rather than just looking at a single representative
sequence.
The alignment must contain sequences grouped by population name. Sequences from the same population should share the same name in the FASTA header. Populations must appear in this order: P1, P2, P3, Outgroup.
For each biallelic site, the function calculates the frequency of the derived allele (B) in each population:
This frequency-based approach is more powerful than the single-sequence version when population samples are available because it uses all the information about allele frequencies rather than just presence/absence.
A list with:
The D-statistic value.
P-value (only if sig.test != "N").
Total number of sites in the alignment.
Number of biallelic sites with ABBA/BABA signal.
Number of useful sites still segregating within at least one population.
Durand, E.Y., Patterson, N., Reich, D., & Slatkin, M. (2011). Testing for ancient admixture between closely related populations. Molecular Biology and Evolution, 28(8), 2239-2252.
## Not run: # Population-level D-statistic with bootstrap result <- CalcPopD("population_alignment.fasta", sig.test = "B") result$d.stat result$pval ## End(Not run)## Not run: # Population-level D-statistic with bootstrap result <- CalcPopD("population_alignment.fasta", sig.test = "B") result$d.stat result$pval ## End(Not run)
Classifies each tree in a collection against a set of reference topologies and counts how many trees match each reference.
countTrees(collection = NULL, ref = NULL, classes = TRUE, verbose = TRUE)countTrees(collection = NULL, ref = NULL, classes = TRUE, verbose = TRUE)
collection |
Character. Path to a Newick format file containing the collection of trees to classify (e.g., posterior trees from an MCMC analysis). |
ref |
Character. Path to a Newick format file containing the reference topologies to count. Each tree in this file represents one possible branching pattern you want to test. |
classes |
Logical. If |
verbose |
Logical. If |
What is a "topology"? A topology is the branching pattern of a tree – which species are most closely related to which – ignoring branch lengths. For example, with three species (A, B, C), there are three possible topologies: ((A,B),C), ((A,C),B), and ((B,C),A). This function counts how many trees in your collection have each branching pattern.
When would you use this? After a Bayesian phylogenetic analysis (e.g., MrBayes, BEAST), you have thousands of sampled trees. This function tells you what fraction of those trees support each possible branching arrangement, helping you assess which topology has the most support in the posterior distribution.
If classes = TRUE: a list with two elements:
A numeric vector of counts – how many trees matched each reference topology.
A numeric vector giving the topology class (reference
number) for each tree in the collection. NA means the
tree didn't match any of the reference topologies you provided.
If classes = FALSE: just the count vector.
## Not run: # You have 1000 posterior trees and 3 candidate topologies. # How many trees support each arrangement? result <- countTrees("posterior_trees.nwk", "candidate_topologies.nwk") result[[1]] # e.g., c(712, 201, 87) = topology 1 has most support result[[2]] # which topology each individual tree matched ## End(Not run)## Not run: # You have 1000 posterior trees and 3 candidate topologies. # How many trees support each arrangement? result <- countTrees("posterior_trees.nwk", "candidate_topologies.nwk") result[[1]] # e.g., c(712, 201, 87) = topology 1 has most support result[[2]] # which topology each individual tree matched ## End(Not run)
Repairs branches that failed during stochastic character mapping
(from make.simmap2 or phytools::make.simmap).
fix.simmap(hists, tips, transition.matrix)fix.simmap(hists, tips, transition.matrix)
hists |
A |
tips |
A two-column data frame where:
|
transition.matrix |
A symmetric matrix describing which state transitions are possible. Non-zero entries indicate allowed transitions. This is used to find valid paths between states using shortest-path algorithms. |
What is a "failed" branch? Stochastic character mapping
works by simulating trait evolution along each branch of a tree.
For some branches, the simulation can't find a valid history that
starts and ends at the correct states – like trying to find a
path through a maze with too many dead ends. After many attempts
(controlled by rejmax in make.simmap2), the
branch is marked as "fail" instead of running forever.
This function repairs those failed branches by using graph theory (shortest path algorithm) to find the simplest valid sequence of state transitions between the start and end states, then divides the branch time equally among the required intermediate states.
For each failed branch, the function:
Identifies the start state (from the parent node) and end state (from the data or child node).
Finds the shortest valid path between these states using the transition matrix as a graph.
Divides the branch length equally among all states in the path.
Updates both the maps list and the mapped.edge
matrix.
A fixed simmap or multiSimmap object with the
failed branches repaired. The maps and mapped.edge
components are updated to reflect the inferred state transitions.
## Not run: # Run stochastic mapping with a rejection limit mapped_tree <- make.simmap2(tree, states, model = "ARD", nsim = 10, rejmax = 100000) # Fix any branches that failed fixed_tree <- fix.simmap(mapped_tree, tip_data, trans_matrix) ## End(Not run)## Not run: # Run stochastic mapping with a rejection limit mapped_tree <- make.simmap2(tree, states, model = "ARD", nsim = 10, rejmax = 100000) # Fix any branches that failed fixed_tree <- fix.simmap(mapped_tree, tip_data, trans_matrix) ## End(Not run)
Finds taxon names in your data that are close (but not exact) matches to the tip labels on your phylogenetic tree.
FuzzyMatch(tree, data, max.dist)FuzzyMatch(tree, data, max.dist)
tree |
A phylogenetic tree of class |
data |
A character vector of taxon names from your dataset (e.g., the species column of your data frame). |
max.dist |
Integer. Maximum number of character differences to
still count as a "close match." For example, |
Why do you need this? In comparative biology, you often combine a phylogenetic tree (from one source) with trait data (from another source). If a species name is spelled slightly differently in the two datasets – for example, "Homo_sapiens" in the tree but "Homo_sapien" in your data – that species will be silently dropped from your analysis with no warning. This function catches those near-misses so you can fix them before running your analysis.
The function uses Levenshtein distance (the minimum number of single-character edits – insertions, deletions, or substitutions – needed to change one name into another) to measure similarity.
A data frame with three columns:
The name as it appears in your data.
The closest matching name on the tree.
Number of character differences (Levenshtein distance) between the two names.
Returns NULL invisibly if no close matches are found.
## Not run: library(ape) tree <- read.tree(text = "((Homo_sapiens, Pan_troglodytes), Gorilla_gorilla);") my_data <- c("Homo_sapien", "Pan_troglodytes", "Gorila_gorilla") FuzzyMatch(tree, my_data, max.dist = 2) # Should flag "Homo_sapien" and "Gorila_gorilla" as close matches ## End(Not run)## Not run: library(ape) tree <- read.tree(text = "((Homo_sapiens, Pan_troglodytes), Gorilla_gorilla);") my_data <- c("Homo_sapien", "Pan_troglodytes", "Gorila_gorilla") FuzzyMatch(tree, my_data, max.dist = 2) # Should flag "Homo_sapien" and "Gorila_gorilla" as close matches ## End(Not run)
Calculates the effective population size (Ne) from the number of breeding males and females. Ne is almost always smaller than the actual census size because not all individuals contribute equally to the next generation. Unequal sex ratios reduce Ne further.
getNe(males, females, locus = "A")getNe(males, females, locus = "A")
males |
Integer. Number of breeding males in the population. |
females |
Integer. Number of breeding females in the population. |
locus |
Character. Which locus type to calculate Ne for:
|
Two formulas are available:
Autosomal (default):
X-linked:
where M = number of breeding males, F = number of breeding females.
When M = F, autosomal Ne = 2N (same as census). As the sex ratio becomes more skewed, Ne drops. X-linked Ne differs because X chromosomes spend 2/3 of their time in females and 1/3 in males.
Numeric. The effective population size.
# Equal sex ratio: 50 males, 50 females getNe(50, 50) # Returns 100 (same as census) # Skewed: 10 males, 90 females (like elephant seals) getNe(10, 90) # Returns 36 -- much less than the census of 100! # X-linked Ne with equal sex ratio getNe(50, 50, locus = "X")# Equal sex ratio: 50 males, 50 females getNe(50, 50) # Returns 100 (same as census) # Skewed: 10 males, 90 females (like elephant seals) getNe(10, 90) # Returns 36 -- much less than the census of 100! # X-linked Ne with equal sex ratio getNe(50, 50, locus = "X")
Estimates how fast a discrete character has been evolving at each tip of a phylogenetic tree. For example, if you are studying chromosome number evolution, this tells you which species have recently experienced rapid changes in chromosome number on the branch leading to the present.
GetTipRates( tree = NULL, Q = NULL, tip.states = NULL, hyper = FALSE, p.mat = NULL )GetTipRates( tree = NULL, Q = NULL, tip.states = NULL, hyper = FALSE, p.mat = NULL )
tree |
An ultrametric phylogenetic tree of class |
Q |
A transition rate matrix (square matrix) with column names matching the possible character states. This matrix tells the model how likely each type of state change is. It is used internally for ancestral state reconstruction (estimating what states the ancestors had). |
tip.states |
An integer vector of observed states at the tips,
with names matching the tree's tip labels. Values should be integers
from 1 to N where N is the number of possible states. For example,
if your trait has 3 states, each tip gets a value of 1, 2, or 3.
If |
hyper |
Logical. If |
p.mat |
A probability matrix where rows are species (with row
names matching tree tips) and columns are states. Each row should
have exactly one 1 and the rest 0s (i.e., you know each species'
state with certainty). This is an alternative way to provide tip
states instead of |
The "rate" for a tip is calculated as:
|tip_state - parent_state| / branch_length.
In plain English: how much the trait changed on the last branch of the
tree (between the species and its most recent ancestor), divided by the
length of that branch (a proxy for time). Bigger values mean faster
recent evolution; a rate of zero means no change occurred on that branch.
The function works in three steps:
Ancestral state reconstruction: Uses
asr_mk_model with the supplied transition
matrix to estimate what state each internal node (ancestor)
most likely had.
Compare tip to parent: For each species (tip), it looks at the reconstructed state of the immediately ancestral node (the parent node).
Calculate rate: The rate is the absolute difference between the tip state and its parent state, divided by the branch length connecting them.
A named numeric vector of tip rates, one value per tip. Names correspond to tree tip labels. Higher values indicate faster evolution on that terminal branch.
## Not run: library(ape) library(castor) tree <- rcoal(10) # Make a simple 3-state transition matrix Q <- matrix(c(-0.2, 0.1, 0.1, 0.1, -0.2, 0.1, 0.1, 0.1, -0.2), 3, 3) colnames(Q) <- rownames(Q) <- c("1", "2", "3") tip.states <- setNames(sample(1:3, 10, replace = TRUE), tree$tip.label) rates <- GetTipRates(tree, Q, tip.states) # Species with high rates evolved rapidly on their terminal branch ## End(Not run)## Not run: library(ape) library(castor) tree <- rcoal(10) # Make a simple 3-state transition matrix Q <- matrix(c(-0.2, 0.1, 0.1, 0.1, -0.2, 0.1, 0.1, 0.1, -0.2), 3, 3) colnames(Q) <- rownames(Q) <- c("1", "2", "3") tip.states <- setNames(sample(1:3, 10, replace = TRUE), tree$tip.label) rates <- GetTipRates(tree, Q, tip.states) # Species with high rates evolved rapidly on their terminal branch ## End(Not run)
A csv file containing measurements of horn and body size for the broad-horned flour beetle Gnatocerus cornutus. The file has three columns: individual ID, body size, and horn size. Used in the ResSel function examples for demonstrating residual selection analysis.
data <- read.csv(file = system.file("horn.beetle.csv", package = "evobiR")) head(data)data <- read.csv(file = system.file("horn.beetle.csv", package = "evobiR")) head(data)
A phylogenetic tree of class phylo containing 5 species of Hymenoptera (bees, wasps, and ants). Used in examples for the FuzzyMatch function to demonstrate fuzzy matching of taxon names between data and trees.
data(hym.tree) ape::plot.phylo(hym.tree)data(hym.tree) ape::plot.phylo(hym.tree)
An enhanced version of make.simmap that adds
configurable rejection limits and monitoring for large or complex
datasets. Stochastic character mapping simulates possible histories
of a discrete trait on a phylogeny – imagine painting each branch of
the tree with colors representing different trait states.
make.simmap2( tree, x, model = "SYM", nsim = 1, monitor = FALSE, rejmax = NULL, rejint = 1e+06, ... )make.simmap2( tree, x, model = "SYM", nsim = 1, monitor = FALSE, rejmax = NULL, rejint = 1e+06, ... )
tree |
A phylogenetic tree of class |
x |
A named vector of discrete character states at the tips, or a matrix of state probabilities (rows = tips, columns = states). |
model |
Character or matrix. Model of trait evolution:
|
nsim |
Integer. Number of stochastic maps to simulate (default 1). |
monitor |
Logical. If |
rejmax |
Integer or |
rejint |
Integer. Interval for printing rejection count
progress messages (default 1000000). Only used when
|
... |
Additional arguments passed to
|
The key enhancement is the rejmax parameter: when the Gillespie
simulation can't find a valid path on a branch after rejmax
attempts, it marks that branch as "fail" instead of running forever.
Failed branches can then be repaired with fix.simmap.
A tree of class "simmap" (if nsim = 1) or a
list of class c("multiSimmap", "multiPhylo") (if
nsim > 1). Each tree has additional components:
A list of named vectors, one per edge, giving the time spent in each state along that edge.
A matrix summarizing total time in each state per edge.
The transition rate matrix used.
The log-likelihood.
Integer vector of edge indices that failed
(hit rejmax).
fix.simmap to repair failed branches.
## Not run: library(phytools) # Basic usage (same as make.simmap) mapped <- make.simmap2(tree, states, model = "ARD", nsim = 100) # With rejection limit and monitoring mapped <- make.simmap2(tree, states, model = "ARD", nsim = 10, rejmax = 500000, monitor = TRUE) # Fix any failed branches fixed <- fix.simmap(mapped, tip_data, trans_matrix) ## End(Not run)## Not run: library(phytools) # Basic usage (same as make.simmap) mapped <- make.simmap2(tree, states, model = "ARD", nsim = 100) # With rejection limit and monitoring mapped <- make.simmap2(tree, states, model = "ARD", nsim = 10, rejmax = 500000, monitor = TRUE) # Fix any failed branches fixed <- fix.simmap(mapped, tip_data, trans_matrix) ## End(Not run)
A dataframe of sexual system and chromosome number data for mites. Contains three columns:
species names matching the tip labels of trees.mite
diploid chromosome numbers (continuous trait)
sex determination system coded as 0 (diplodiploidy) or 1 (haplodiploidy) (discrete binary trait)
Used in examples for the AncCond function.
data(mite.trait) head(mite.trait)data(mite.trait) head(mite.trait)
S3 plot method for "phyloscaled" objects produced by
scaleTreeRates. Visualizes how evolutionary rates
vary across the tree using either branch length scaling or color
coding.
## S3 method for class 'phyloscaled' plot( x, method = "multiply", palette = "RdYlGn", edge.width = 1, cex = 1, show.tip.label = TRUE, ... )## S3 method for class 'phyloscaled' plot( x, method = "multiply", palette = "RdYlGn", edge.width = 1, cex = 1, show.tip.label = TRUE, ... )
x |
A tree of class |
method |
Character. How to display rate variation:
|
palette |
Character. Color palette for the |
edge.width |
Numeric. Width of plotted edges (default 1). |
cex |
Numeric. Character expansion factor for tip labels (default 1). |
show.tip.label |
Logical. Whether to display tip labels
(default |
... |
Additional arguments passed to
|
Produces a plot. Returns NULL invisibly.
## Not run: # After running scaleTreeRates: scaled_tree <- scaleTreeRates(tree, tip.states, model = "ARD") # Method 1: scale branch lengths plot(scaled_tree, method = "multiply") # Method 2: color branches by rate plot(scaled_tree, method = "color") ## End(Not run)## Not run: # After running scaleTreeRates: scaled_tree <- scaleTreeRates(tree, tip.states, model = "ARD") # Method 1: scale branch lengths plot(scaled_tree, method = "multiply") # Method 2: color branches by rate plot(scaled_tree, method = "color") ## End(Not run)
Computes the expected proportion of chromosomal fusions that join a sex chromosome to an autosome (SA-fusions), as opposed to autosome-autosome (AA) or sex-sex (SS) fusions. Two null models are available:
pSAF(n_auto, n_x = 0, n_y = 0, n_z = 0, n_w = 0, mu = 0.5, weighted = TRUE)pSAF(n_auto, n_x = 0, n_y = 0, n_z = 0, n_w = 0, mu = 0.5, weighted = TRUE)
n_auto |
Integer >= 1. The haploid autosome count — i.e., the number
of autosome PAIRS. For example, humans have 22 autosome pairs, so
|
n_x |
Integer >= 0. Number of X chromosomes present in the
heterogametic sex (males in XY systems). For a standard XY system,
|
n_y |
Integer >= 0. Number of Y chromosomes present in the
heterogametic sex. For a standard XY system, |
n_z |
Integer >= 0. Number of Z chromosomes present in the
heterogametic sex (females in ZW systems). Analogous to |
n_w |
Integer >= 0. Number of W chromosomes present in the
heterogametic sex. Analogous to |
mu |
Numeric between 0 and 1. The proportion of fusion mutations that
originate in females, regardless of which sex is homogametic.
Default is 0.5, meaning both sexes contribute equally. Setting
|
weighted |
Logical. If |
Occurrence model (weighted = FALSE): The original null from
Anderson et al. (2020). Every pair of non-homologous chromosomes is equally
likely to fuse. The probability of each fusion class (AA, SA, SS) depends
only on how many chromosomes of each type are present in each sex, weighted
by the proportion of fusions originating in each sex (mu).
Fixation-weighted model (weighted = TRUE, default): Extends
the occurrence model by accounting for the fact that a new fusion's
probability of drifting to fixation under neutrality depends on how many
copies of that chromosome exist in the population (Kimura 1962). Autosomes
are present in 2N copies, the shared sex chromosome (X or Z) in 3N/2
copies, and the limited sex chromosome (Y or W) in only N/2 copies. Because
fewer copies means higher fixation probability, Y-autosome (or W-autosome)
fusions are disproportionately likely to reach fixation relative to their
occurrence rate. This model therefore predicts a higher baseline proportion
of SA-fusions than the occurrence model alone.
Terminology:
Shared sex chromosome: the sex chromosome present in both sexes — X in XY systems (females are XX, males have at least one X) or Z in ZW systems (males are ZZ, females have at least one Z).
Limited sex chromosome: the sex chromosome present only in the heterogametic sex — Y in XY systems or W in ZW systems.
Homogametic sex: the sex with two copies of the shared sex chromosome (e.g., XX females in XY systems, ZZ males in ZW systems).
Heterogametic sex: the sex carrying both shared and limited sex chromosomes (e.g., XY males, ZW females).
Non-homologous fusion: a fusion between two chromosomes that are NOT homologs of each other. Fusions between homologs (e.g., both copies of chromosome 3) produce unbalanced gametes and are excluded.
How the math works:
Within each sex, the probability that a random non-homologous fusion falls
into each class (AA, SA, SS) is determined by the number of chromosomes of
each type. The overall probability is the weighted average across the two
sexes (using mu).
For the fixation-weighted model, the SA class is further split into fusions involving the shared sex chromosome (XA or ZA) versus the limited sex chromosome (YA or WA). Each class is multiplied by a fixation weight proportional to 1/(population copy number), then renormalized so probabilities sum to 1.
Fixation weights (relative to autosomes):
Autosome: weight = 1 (present in 2N copies)
Shared sex chrom (X/Z): weight = 4/3 (present in 3N/2 copies)
Limited sex chrom (Y/W): weight = 4 (present in N/2 copies)
A numeric value (or vector, if n_auto is a vector) giving
the probability that a random fusion is a sex-autosome fusion under the
chosen null model.
Anderson, N.W., Hjelmen, C.E., & Blackmon, H. (2020). The probability of fusions joining sex chromosomes and autosomes. Biology Letters, 16, 20200648. doi:10.1098/rsbl.2020.0648
Schmalz, S. & Blackmon, H. A fixation-weighted null model for the proportion of sex chromosome-autosome fusions. In prep.
Kimura, M. (1962). On the probability of fixation of mutant genes in a population. Genetics, 47, 713-719.
# ---- Basic usage ---- # Standard XY system with 5 autosome pairs (Da = 10), fixation-weighted pSAF(n_auto = 5, n_x = 1, n_y = 1) # Same system, occurrence probability only (Anderson et al. 2020) pSAF(n_auto = 5, n_x = 1, n_y = 1, weighted = FALSE) # ---- Different sex chromosome systems ---- # XO system (no Y chromosome): 13 autosome pairs pSAF(n_auto = 13, n_x = 1, n_y = 0) # XXY system: 2 X chromosomes and 1 Y in the heterogametic sex pSAF(n_auto = 5, n_x = 2, n_y = 1) # XYY system: 1 X and 2 Y chromosomes in the heterogametic sex pSAF(n_auto = 5, n_x = 1, n_y = 2) # ---- ZW systems ---- # Standard ZW (e.g., birds): 10 autosome pairs pSAF(n_auto = 10, n_z = 1, n_w = 1) # ---- Vectorized over autosome count ---- # Compute across a range of autosome counts for plotting n <- 1:30 p_fix <- pSAF(n_auto = n, n_x = 1, n_y = 1) p_occ <- pSAF(n_auto = n, n_x = 1, n_y = 1, weighted = FALSE) plot(n, p_fix, type = "l", col = "red", xlab = "Haploid autosome count", ylab = "P(SA fusion)") lines(n, p_occ, col = "blue", lty = 2) legend("topright", c("Fixation-weighted", "Occurrence"), col = c("red", "blue"), lty = 1:2) # ---- Habronattus example from the paper (XXO, Da = 26) ---- pSAF(n_auto = 13, n_x = 2, n_y = 0) # fixation: 0.243 pSAF(n_auto = 13, n_x = 2, n_y = 0, weighted = FALSE) # occurrence: 0.194# ---- Basic usage ---- # Standard XY system with 5 autosome pairs (Da = 10), fixation-weighted pSAF(n_auto = 5, n_x = 1, n_y = 1) # Same system, occurrence probability only (Anderson et al. 2020) pSAF(n_auto = 5, n_x = 1, n_y = 1, weighted = FALSE) # ---- Different sex chromosome systems ---- # XO system (no Y chromosome): 13 autosome pairs pSAF(n_auto = 13, n_x = 1, n_y = 0) # XXY system: 2 X chromosomes and 1 Y in the heterogametic sex pSAF(n_auto = 5, n_x = 2, n_y = 1) # XYY system: 1 X and 2 Y chromosomes in the heterogametic sex pSAF(n_auto = 5, n_x = 1, n_y = 2) # ---- ZW systems ---- # Standard ZW (e.g., birds): 10 autosome pairs pSAF(n_auto = 10, n_z = 1, n_w = 1) # ---- Vectorized over autosome count ---- # Compute across a range of autosome counts for plotting n <- 1:30 p_fix <- pSAF(n_auto = n, n_x = 1, n_y = 1) p_occ <- pSAF(n_auto = n, n_x = 1, n_y = 1, weighted = FALSE) plot(n, p_fix, type = "l", col = "red", xlab = "Haploid autosome count", ylab = "P(SA fusion)") lines(n, p_occ, col = "blue", lty = 2) legend("topright", c("Fixation-weighted", "Occurrence"), col = c("red", "blue"), lty = 1:2) # ---- Habronattus example from the paper (XXO, Da = 26) ---- pSAF(n_auto = 13, n_x = 2, n_y = 0) # fixation: 0.243 pSAF(n_auto = 13, n_x = 2, n_y = 0, weighted = FALSE) # occurrence: 0.194
Reads a FASTA alignment and writes a new file with the sequences reordered by their starting position (i.e., where the first non-gap character appears).
ReOrderAlignment(file, newfile, ref = NULL)ReOrderAlignment(file, newfile, ref = NULL)
file |
Character. Path to the input FASTA alignment file. |
newfile |
Character. Path for the output (reordered) FASTA file. |
ref |
Optional. Reference sequence(s) to place first in the output. Can be specified as:
|
When would you use this? When you have an alignment where different sequences cover different regions – for example, amplicon sequencing data where each primer pair targets a different part of a gene, or partial genome sequences that start at different positions. Sorting by start position makes the alignment much easier to visualize and inspect in an alignment viewer.
Optionally, one or more reference sequences can be pinned to the top of the output file regardless of their start position.
Writes the reordered alignment to newfile. No value is
returned (called for its side effect of writing a file).
## Not run: # Reorder by start position ReOrderAlignment("my_alignment.fasta", "reordered.fasta") # Pin "reference_seq" to the top, then sort the rest ReOrderAlignment("my_alignment.fasta", "reordered.fasta", ref = "reference_seq") ## End(Not run)## Not run: # Reorder by start position ReOrderAlignment("my_alignment.fasta", "reordered.fasta") # Pin "reference_seq" to the top, then sort the rest ReOrderAlignment("my_alignment.fasta", "reordered.fasta", ref = "reference_seq") ## End(Not run)
Identifies individuals with the most extreme residual values from a linear regression between two traits. This is a common method in artificial selection experiments.
ResSel(data, traits, percent = 10, identifier = 1, model = "linear")ResSel(data, traits, percent = 10, identifier = 1, model = "linear")
data |
A data frame containing at least an identifier column and the two trait columns. |
traits |
A numeric vector of length 2 giving the column numbers
of the two traits. The first is the predictor (x-axis), the second
is the response (y-axis). For example, |
percent |
Numeric. The percentage of individuals to select from each tail of the residual distribution (default 10, meaning the top 10% and bottom 10%). Higher values select more individuals but with less extreme phenotypes. |
identifier |
Integer. Column number containing individual IDs (default 1). |
model |
Character. Type of regression model (default |
What are residuals? Imagine you plot body size (x-axis) against wing length (y-axis) and draw a best-fit line through the data. A "residual" is how far above or below that line each individual falls. Individuals far ABOVE the line have unusually long wings for their body size (positive residuals). Individuals far BELOW the line have unusually short wings for their body size (negative residuals).
The function selects the most extreme individuals from each end: the "high line" (furthest above the line) and the "low line" (furthest below the line) for use as breeding stock. It also produces a scatter plot showing the regression line (orange), high-line individuals (blue), and low-line individuals (red).
A list with two elements:
Identifiers of individuals in the upper tail (positive residuals – their y-trait is bigger than expected given their x-trait value).
Identifiers of individuals in the lower tail (negative residuals – their y-trait is smaller than expected given their x-trait value).
## Not run: # Simulate some data: 100 beetles with body size and wing length df <- data.frame( id = paste0("beetle_", 1:100), body_size = rnorm(100, 50, 10), wing_length = rnorm(100, 30, 5) ) # Select top and bottom 15% by residual wing length # (controlling for body size) selected <- ResSel(df, traits = c(2, 3), percent = 15) selected$`high line` # beetles with unusually LONG wings for their size selected$`low line` # beetles with unusually SHORT wings for their size ## End(Not run)## Not run: # Simulate some data: 100 beetles with body size and wing length df <- data.frame( id = paste0("beetle_", 1:100), body_size = rnorm(100, 50, 10), wing_length = rnorm(100, 30, 5) ) # Select top and bottom 15% by residual wing length # (controlling for body size) selected <- ResSel(df, traits = c(2, 3), percent = 15) selected$`high line` # beetles with unusually LONG wings for their size selected$`low line` # beetles with unusually SHORT wings for their size ## End(Not run)
Reads a collection of phylogenetic trees from a NEXUS file (typically output from a Bayesian analysis like MrBayes or BEAST), removes the burn-in, and randomly samples a specified number of post-burn-in trees.
SampleTrees(trees, burnin, final.number, format, prefix)SampleTrees(trees, burnin, final.number, format, prefix)
trees |
Character. Path to a NEXUS format file containing
phylogenetic trees (e.g., the |
burnin |
Numeric between 0 and 1. Proportion of trees to
discard as burn-in. For example, |
final.number |
Integer. Number of trees to randomly sample from the post-burn-in trees. Common choices are 100 or 1000. |
format |
Character. Output format:
|
prefix |
Character. Prefix for the output file name. The appropriate extension (.nwk or .nex) is added automatically. |
Why do you need this? Bayesian phylogenetic programs like MrBayes produce thousands (sometimes millions) of trees as they explore "tree space." The first chunk of trees are unreliable because the program was still searching for good trees – this early period is called the "burn-in" and should be discarded. After removing the burn-in, you typically don't need ALL the remaining trees (they would be too many to work with), so you randomly subsample a manageable number that still represents the full range of uncertainty in your phylogeny.
Writes the sampled trees to a file. No value is returned (called for its side effect of writing a file).
## Not run: # Your MrBayes run produced "my_analysis.nex.t" with 10,000 trees. # Remove the first 25% as burn-in, then randomly pick 100 trees: SampleTrees("my_analysis.nex.t", burnin = 0.25, final.number = 100, format = "new", prefix = "sampled_trees") # This creates "sampled_trees.nwk" with 100 trees ## End(Not run)## Not run: # Your MrBayes run produced "my_analysis.nex.t" with 10,000 trees. # Remove the first 25% as burn-in, then randomly pick 100 trees: SampleTrees("my_analysis.nex.t", burnin = 0.25, final.number = 100, format = "new", prefix = "sampled_trees") # This creates "sampled_trees.nwk" with 100 trees ## End(Not run)
Tests whether different branches of a phylogeny have different rates of discrete character evolution. The function assigns a "scalar" multiplier to each edge: scalar > 1 means faster-than-average evolution, scalar < 1 means slower, and scalar = 1 means the background rate.
scaleTreeRates( tree, tip.states, model, fixedQ = NULL, max.ratio = 2, nbins = 10, max.transition = 1, var.start = FALSE, pi = "fitzjohn" )scaleTreeRates( tree, tip.states, model, fixedQ = NULL, max.ratio = 2, nbins = 10, max.transition = 1, var.start = FALSE, pi = "fitzjohn" )
tree |
A phylogenetic tree of class |
tip.states |
A named vector of discrete character states at
the tips. Names must match |
model |
Character or matrix. The model for
|
fixedQ |
A pre-estimated Q (transition rate) matrix.
If |
max.ratio |
Numeric. Maximum rate scalar to test (default 2).
The function tests scalars from |
nbins |
Integer. Number of scalar bins above and below 1 (default 10). More bins gives finer resolution but takes longer. |
max.transition |
Integer. Maximum number of bin steps an edge's scalar can differ from its parent's scalar (default 1). This prevents rate estimates from jumping wildly between adjacent branches. |
var.start |
Logical. If |
pi |
Character. Method for estimating the root state prior.
Passed to |
The method works by trying different scalar values for each edge (via dynamic programming) and keeping whichever value maximizes the likelihood.
The algorithm proceeds edge-by-edge from root to tips. For each
edge, it tries every allowed scalar value (constrained to be within
max.transition bins of the parent's scalar), fits the Mk
model with scaled branch lengths, and picks the scalar that gives
the highest likelihood.
A tree of class c("phylo", "phyloscaled") with an
additional $scalar element: a numeric vector (one value
per edge) of rate multipliers. Plot with
plot.phyloscaled.
## Not run: library(phytools) tree <- rcoal(20) states <- setNames(sample(1:3, 20, replace = TRUE), tree$tip.label) scaled <- scaleTreeRates(tree, states, model = "ARD", nbins = 5) plot(scaled, method = "color") ## End(Not run)## Not run: library(phytools) tree <- rcoal(20) states <- setNames(sample(1:3, 20, replace = TRUE), tree$tip.label) scaled <- scaleTreeRates(tree, states, model = "ARD", nbins = 5) plot(scaled, method = "color") ## End(Not run)
Applies a function across a sliding window over a vector or matrix.
SlidingWindow(FUN, data, window, step, strict = FALSE)SlidingWindow(FUN, data, window, step, strict = FALSE)
FUN |
Function to apply within each window. Must accept a
vector (for vector data) or matrix (for matrix data) and return
a single value. Can be a function name as a string (e.g.,
|
data |
A numeric vector or matrix to analyze. |
window |
Integer. Size of the sliding window (number of elements for vectors, or number of rows/columns for matrices). |
step |
Integer. How far the window moves between calculations.
A step of 1 means windows overlap maximally; a step equal to
|
strict |
Logical. If |
What is a sliding window? Imagine looking at a long DNA sequence through a small window that you slide along one step at a time. At each position, you measure something inside the window (e.g., the average GC content). This gives you a profile showing how that measurement changes along the sequence. For example, you could calculate the mean GC content in 100-bp windows sliding every 10 bp across a genome, or compute local diversity statistics across a landscape grid.
For vector data the window slides along the length; for matrix data the window slides in both dimensions (rows and columns).
For vector input: a numeric vector of results, one per window position.
For matrix input: a matrix of results where each cell is the function applied to the corresponding window.
# Calculate mean in windows of 10, stepping by 5 x <- rnorm(100) SlidingWindow("mean", x, window = 10, step = 5) # GC content in sliding windows (simplified example) seq_data <- sample(c(0, 1), 500, replace = TRUE) gc <- SlidingWindow("mean", seq_data, window = 50, step = 10)# Calculate mean in windows of 10, stepping by 5 x <- rnorm(100) SlidingWindow("mean", x, window = 10, step = 5) # GC content in sliding windows (simplified example) seq_data <- sample(c(0, 1), 500, replace = TRUE) gc <- SlidingWindow("mean", seq_data, window = 50, step = 10)
Reads multiple sequence alignment files and either concatenates them into a single supermatrix or pads each one to include all taxa.
SuperMatrix( missing = "-", prefix = "concatenated", save = TRUE, input = "", format.in = "f", format.out = "f", concatenate = TRUE )SuperMatrix( missing = "-", prefix = "concatenated", save = TRUE, input = "", format.in = "f", format.out = "f", concatenate = TRUE )
missing |
Character. Symbol to use for missing data
(default |
prefix |
Character. Prefix for output file names
(default |
save |
Logical. If |
input |
Character. A pattern to match alignment file names in
the current directory. If |
format.in |
Character. Input alignment format (default |
format.out |
Character. Output alignment format (default |
concatenate |
Logical. If |
What is a supermatrix? Think of it like merging
spreadsheets. You have separate DNA sequence files for different
genes (e.g., gene1.fasta with 80 species, gene2.fasta with 95
species). This function pastes all the gene columns together
side-by-side into one big alignment. Species that are missing
from a particular gene get filled in with the missing data
character (default "-").
A partition file is also saved so you know which columns in the final alignment came from which gene – this is important for partitioned phylogenetic analyses where each gene gets its own substitution model.
When concatenate = TRUE: a list with two elements:
A data frame with columns: gene file name, start position, stop position.
A character matrix of the concatenated alignment.
When concatenate = FALSE: a list of padded alignment matrices,
one per input file.
## Not run: # Concatenate all .fasta files in the current directory setwd("path/to/alignments") result <- SuperMatrix(input = ".fasta") result$partitions # see which positions belong to which gene ## End(Not run)## Not run: # Concatenate all .fasta files in the current directory setwd("path/to/alignments") result <- SuperMatrix(input = ".fasta") result$partitions # see which positions belong to which gene ## End(Not run)
A multiPhylo object containing 10 phylogenetic trees from a previously published study on mite sexual system evolution. These trees can be used with the associated mite.trait dataset for comparative analyses using AncCond. Loaded as the object trees.
data(trees.mite) plot(trees[[1]])data(trees.mite) plot(trees[[1]])
A nexus-format file containing a collection of 100 simulated phylogenetic trees, each with 10 tips. Stored in the inst/ directory. Used in examples for the SampleTrees function to demonstrate tree subsampling from posterior distributions.
Calculates Patterson's D-statistic in sliding windows across a sequence alignment. This lets you see how introgression signal varies along the genome – some regions may show strong introgression while others don't.
WinCalcD( alignment = "alignment.fasta", win.size = 100, step.size = 50, boot = FALSE, replicate = 1000 )WinCalcD( alignment = "alignment.fasta", win.size = 100, step.size = 50, boot = FALSE, replicate = 1000 )
alignment |
Character. Path to the FASTA alignment file
(default |
win.size |
Integer. Width of each window in base pairs (default 100). |
step.size |
Integer. How far the window slides between
calculations (default 50). Setting |
boot |
Logical. If |
replicate |
Integer. Number of bootstrap replicates per window
(default 1000). Only used when |
Like CalcD, the alignment must have exactly 4 sequences
in the order P1, P2, P3, Outgroup.
D is calculated as (ABBA - BABA) / (ABBA + BABA) within each window. Windows with no informative sites return D = 0. When bootstrapping, sites within each window are resampled with replacement to generate a null distribution.
A data frame with one row per window and columns:
Genomic coordinates of the window (e.g., "1:100").
Number of ABBA sites in the window.
Number of BABA sites in the window.
D-statistic for the window.
Z-score (only if boot = TRUE).
P-value (only if boot = TRUE).
## Not run: # Calculate D in 500-bp windows sliding every 250 bp results <- WinCalcD("my_alignment.fasta", win.size = 500, step.size = 250) # Plot D across the genome plot(1:nrow(results), as.numeric(results$d), type = "l", xlab = "Window", ylab = "D-statistic") abline(h = 0, lty = 2, col = "red") ## End(Not run)## Not run: # Calculate D in 500-bp windows sliding every 250 bp results <- WinCalcD("my_alignment.fasta", win.size = 500, step.size = 250) # Plot D across the genome plot(1:nrow(results), as.numeric(results$d), type = "l", xlab = "Window", ylab = "D-statistic") abline(h = 0, lty = 2, col = "red") ## End(Not run)