Simulating fossils

This vignette provides information about how the package stores and outputs information about fossil sampling times via the fossils object and available options for simulating fossil recovery.

Contents

The fossils object

The fossils object contains information about fossil sampling times and their relationship to the corresponding phylo and taxonomy objects. In some simulation conditions we may know the age of fossils precisely. Alternatively, there may be some uncertainty associated with specimen ages, i.e. we only know the age to within some stratigraphic interval, which can be incorporated into the fossils object.

The object contains a dataframe with species and age information for each fossil with the following 4 fields:

  • sp the label of the corresponding species. This label matches the edge labels in the corresponding phylo object or the species labels in the corresponding taxonomy object if taxonomic information was provided

  • edge the label of the sampled node or tip in the phylogeny, i.e the node at the end of the edge along which the fossil was sampled

  • hmin the age of the fossil or the youngest bound of the time interval in which the fossil was sampled

  • hmax the oldest bound of the time interval in which the fossil was sampled. This is equal to hmin if exact sampling times are known

The object also contains a logical variable from.taxonomy indicating whether the fossil record was simulated using a taxonomy object, in which case from.taxonomy = TRUE, or from a phylo object in which case from.taxonomy = FALSE (the default).

Constant fossil recovery

The simplest model of fossil recovery is a Poisson sampling process, where fossils are sampled along lineages with constant rate ψ (function argument rate).

Fossils can be simulated using a phylo object or an existing taxonomy object generated using the function sim.taxonomy.

Simulating fossils for a tree object

If a phylo rather than a taxonomy object is passed to the function, it will assume that all speciation has occurred via symmetric (i.e. bifurcating, β = 1) speciation, i.e. every edge in the tree represents a unique species. This applies to all functions used to simulate fossils.

# set the random number generator seed to generate the same results using the same code
set.seed(1234)

# simulate a tree using TreeSim conditioned on tip number
lambda = 1
mu = 0.2
tips = 8
t = TreeSim::sim.bd.taxa(n = tips, numbsim = 1, lambda = lambda, mu = mu)[[1]]
# t is an object of class phylo
t
#> 
#> Phylogenetic tree with 8 tips and 7 internal nodes.
#> 
#> Tip labels:
#>   t8, t4, t3, t1, t6, t2, ...
#> 
#> Rooted; includes branch length(s).
# use t$edge, t$edge.length, t$root.edge to see the tree attributes

# Simulate fossils
rate = 3 # poisson sampling rate
f = sim.fossils.poisson(rate = rate, tree = t)
f
#>    sp edge       hmin       hmax
#> 1   1    1 0.58684752 0.58684752
#> 2   1    1 0.17795617 0.17795617
#> 3   1    1 0.58613520 0.58613520
#> 4   1    1 0.57453591 0.57453591
#> 5   2    2 0.14585530 0.14585530
#> 6   2    2 0.70851792 0.70851792
#> 7   2    2 0.72224865 0.72224865
#> 8   6    6 0.82173813 0.82173813
#> 9   6    6 0.35835122 0.35835122
#> 10  7    7 0.04035432 0.04035432
#> ...
#> Fossil record with 19 occurrences representing 9 species
#> Fossil record not simulated using taxonomy: all speciation events are assumed to be symmetric

Each entry in the fossils dataframe corresponds to a fossil sampling time. sp and edge are the species and edge labels, respectively, which will be the same here since we assume symmetric speciation.

hmin and hmax are the youngest and oldest sampling ages associated with each fossil, respectively. Since this function returns exact fossil sampling times hmin and hmax are equal.

# plot fossil occurrences
plot(f, tree = t)

# plot stratigraphic ranges
plot(f, tree = t, show.ranges = TRUE)

Simulating fossils for a taxonomy object

# simulate taxonomy under mixed speciation
beta = 0.5 # probability of symmetric speciation
lambda.a = 1.2 # rate of anagenetic speciation
s = sim.taxonomy(tree = t, beta = beta, lambda.a = lambda.a)

# simulate fossils
f = sim.fossils.poisson(rate = rate, taxonomy = s)
f
#>    sp edge      hmin      hmax
#> 1   1    1 1.1276605 1.1276605
#> 2   1    1 0.7431218 0.7431218
#> 3   1    1 0.1493434 0.1493434
#> 4   4   15 0.8912792 0.8912792
#> 5   4   15 0.4560055 0.4560055
#> 6   7    7 0.4670495 0.4670495
#> 7   7    7 0.4936152 0.4936152
#> 8   7    7 0.6791209 0.6791209
#> 9   7    7 0.3928259 0.3928259
#> 10 16    2 0.3277610 0.3277610
#> ...
#> Fossil record with 31 occurrences representing 12 species
#> Fossils record simulated from or assigned to an existing taxonomy

# plot the output and color fossils by taxonomy
plot(f, tree = t, taxonomy = s, show.taxonomy = TRUE)

Note that sp and edge are not all equal, since some species are assigned to multiple edges, and some edges are associated with multiple species.

Interval-dependent fossil recovery

Under the interval-dependent model, fossil recovery changes in a piece-wise manner across a set of user defined intervals. Interval ages can be specified using a fixed set of uniform length intervals, by passing the function the maximum interval age (max.age) and the number of intervals (strata), or a vector of non-uniform intervals (interval.ages).

The function tree.max can be used to assign a maximum age based on tree age. If the tree has a root.edge the function will return the origin time, otherwise the function returns the root age.

Simulating fossils for a fixed number of intervals

# define max interval age based on tree age
max.age = tree.max(t)

# define the number of intervals
strata = 4

# define a vector of sampling rates, where the first entry corresponds the youngest interval
rates = c(1, 0.1, 1, 0.1)

# simulate fossils
f = sim.fossils.intervals(tree = t, rates = rates, max.age = max.age, strata = strata)
 
# plot the output
plot(f, tree = t, show.strata = TRUE, strata = strata)

If no maximum is provided to the plotting function and strata > 1, the function tree.max is used to define max.age.

Simulating fossils for a set of non-uniform intervals

# the following will sample a random set of interval ages between 0 and max.age
times = c(0, sort(runif(3, min = 0, max = max.age)), max.age)

# define a vector of sampling rates from youngest to oldest
rates = c(1, 0.1, 1, 0.1)

# simulate fossils
f = sim.fossils.intervals(tree = t, rates = rates, interval.ages = times)
 
# plot the output and show sampling or proxy data
plot(f, tree = t, show.strata = TRUE, interval.ages = times, show.proxy = TRUE, proxy = rates)

Simulating fossils using per-interval probabilities

Fossil recovery can also be specified using a set of per-interval probabilities. At most one fossil per species will be sampled per interval using this approach.

# define max interval age based on tree age
max.age = tree.max(t)

# define the number of intervals
strata = 4

# define a vector of sampling probabilities from youngest to oldest
probabilities = c(1.0, 0.5, 0.2, 0.1)

# simulate fossils
f = sim.fossils.intervals(tree = t, probabilities = probabilities, max.age = max.age, strata = strata)
 
# plot the output
plot(f, tree = t, show.strata = TRUE, strata = strata)

Incorporating stratigraphic age uncertainty

To incorporate age uncertainty into the output using the function sim.fossils.intervals set use.exact.times = FALSE. In this case hmin and hmax will represent the youngest and oldest age associated with the fossil, respectively. When we plot the output the function will place the fossils at the mid-point of the corresponding interval. This means in some cases fossils will not appear along their corresponding edge.

# simulate fossils
f = sim.fossils.intervals(tree = t, probabilities = probabilities, max.age = max.age, strata = strata, use.exact.times = FALSE)
f
#>   sp edge      hmin      hmax
#> 1  1    1 0.0000000 0.5640402
#> 2  1    1 0.5640402 1.1280804
#> 3  2    2 0.0000000 0.5640402
#> 4  3    3 0.0000000 0.5640402
#> 5  6    6 0.0000000 0.5640402
#> 6  7    7 0.0000000 0.5640402
#> 7  8    8 0.0000000 0.5640402
#> 8 13   13 0.5640402 1.1280804
#> Fossil record with 8 occurrences representing 7 species
#> Fossil record not simulated using taxonomy: all speciation events are assumed to be symmetric

# plot the output
plot(f, tree = t, show.strata = TRUE, strata = strata)

Alternatively, if you simulate fossils using a function that only returns exact fossil sampling times, e.g. using the sim.fossils.poisson function, you can later assign fossils to different intervals using the sim.interval.ages function. You can also generate output using this function that respect the species duration times by setting use.species.ages = TRUE, i.e. the function will not return a value for hmin or hmax that are younger or older than the true species age. To use this option you also need to pass a phylo or taxonomy object to the function.

# simulate fossils
f = sim.fossils.poisson(tree = t, rate = 3)

# reassign fossils to intervals
f = sim.interval.ages(fossils = f, tree = t, max.age = max.age, strata = strata, use.species.ages = TRUE)

# plot the output
plot(f, tree = t, show.strata = TRUE, strata = strata)

Environment-dependent fossil recovery

The function sim.fossils.environment can be used to simulate fossil recovery that reflects species environmental preferences. The variables that control the distribution and abundance of species also control depositional environment, meaning changes in ecological or environmental gradient across a given area will be reflected in the patterns of deposition. A species response curve, which describes the abundance of a species relative to an environmental gradient, can therefore be used to model distributions of fossil occurrences. See the plot below for examples.

Holland (1995) described a three parameter Gaussian model for simulating environment-dependent fossil recovery. In this model each species has three parameters, species preferred depth (PD), depth tolerance (DT) and peak abundance (PA) (function arguments PD, DT and PA). The probability of fossil recovery during a given interval of time is a function of water depth, given by,

Prcollectioni(d) = PA * exp(−(d − PD)2/(2 * DT2)),

where d is current water depth, PD and DT are the mean and the standard deviation of the distribution, respectively, and PA describes the amplitude of the distribution. Here the model is described in the context of marine taxa, however the model and the function sim.fossils.environment can be applied to any environmental or depositional context, in which some measure of environmental gradient can be generated.

The following code snippet illustrate how to explore the role of the model parameters PD, DT and PA in determining species response curves.

# define a set of colors for 4 species
# color palette chosen using RColorBrewer
cols = c("#E41A1C", "#377EB8", "#984EA3", "#4DAF4A") # red, blue, purple, green

# species response curve function
response.curve = function(x, pd, dt, pa) pa * exp(-(x - pd)^2/ (2*dt^2))

# define species variables # species 1
PA = 1 # peak abundance
PD = 2 # preferred depth
DT = 1 # depth tolerance

# plot species resonse curve
curve(response.curve(x, PD, DT, PA), -2, 6, bty = "n", xlab = "relative depth", ylab = "Pr(collection)", xlim = c(-6,6), lwd = 1.5, col = cols[1])

# redefine depth tolerance and compare the output # species 2
DT = 0.5
curve(response.curve(x, PD, DT, PA), -1, 5, add = TRUE, lwd = 1.5, col = cols[2])

# redefine preferred depth and compare the output # species 3
PD = -2
curve(response.curve(x, PD, DT, PA), -5, 1, add = TRUE, lwd = 1.5, col = cols[3])

# redefine prferred depth and peak abundance # species 4
PD = -4
PA = 0.5
curve(response.curve(x, PD, DT, PA), -6, -2.2, add = TRUE, lwd = 1.5, col = cols[4])

The user is free to use a vector containing any gradient values to simulate fossil occurrences. In the following example we will use a vector of gradient values generated using the function sim.gradient, which return values for the following simple sine wave function, y = d * sin(cycles * pi * (x − 1/4)), where x is the number of intervals or strata. The function output can be used to represent changes in relative water depth, where the number of cycles may be equivalent to the number of transgression/regression events.

Interval ages can be specified in the same way as the function sim.fossils.intervals via max.age and strata or the intervals vector.

# define the interval parameters
strata = 20
times = seq(0, tree.max(t), length.out = strata+1)

# simulate gradient values
wd = sim.gradient(strata, depth = 6)

# wd is a vector representing relative water depth
wd
#>  [1] -6.0000000 -4.7348431 -1.4729129  2.4101725  5.2768425  5.9181678
#>  [7]  4.0636894  0.4954761 -3.2816889 -5.6749035 -5.6749035 -3.2816889
#> [13]  0.4954761  4.0636894  5.9181678  5.2768425  2.4101725 -1.4729129
#> [19] -4.7348431 -6.0000000

# define species trait values # species 1
PD = 1
DT = 2
PA = 1

# simulate fossils
f = sim.fossils.environment(tree = t, interval.ages = times, proxy.data = wd, PD = PD, DT = DT, PA = PA)

# plot output and include proxy data & preferred depth
plot(f, tree = t, interval.ages = times, show.strata = TRUE, show.proxy = TRUE, proxy.data = wd, show.preferred.environ = TRUE, preferred.environ = PD, fossil.col = cols[1])

# redefine species trait values # species 2
DT = 0.5

# simulate fossils
f = sim.fossils.environment(tree = t, interval.ages = times, proxy.data = wd, PD = PD, DT = DT, PA = PA)

# plot output
plot(f, tree = t, interval.ages = times, show.strata = TRUE, show.proxy = TRUE, proxy.data = wd, show.preferred.environ = TRUE, preferred.environ = PD, fossil.col = cols[2])

# redefine species trait values # species 3
PD = -2

# simulate fossils
f = sim.fossils.environment(tree = t, interval.ages = times, proxy.data = wd, PD = PD, DT = DT, PA = PA)

# plot output
plot(f, tree = t, interval.ages = times, show.strata = TRUE, show.proxy = TRUE, proxy.data = wd, show.preferred.environ = TRUE, preferred.environ = PD, fossil.col = cols[3])

# define species trait values # species 4
PD = -4
PA = 0.5

# simulate fossils
f = sim.fossils.environment(tree = t, interval.ages = times, proxy.data = wd, PD = PD, DT = DT, PA = PA)

# plot output
plot(f, tree = t, interval.ages = times, show.strata = TRUE, show.proxy = TRUE, proxy.data = wd, show.preferred.environ = TRUE, preferred.environ = PD, fossil.col = cols[4])

Lineage-dependent fossil recovery

Variation in fossil recovery parameters across different lineages or species can be generated using the function sim.trait.values. The function output is a vector of simulated trait values that can be used to specify the rate parameter in sim.fossils.poisson or the PD, DT and PA parameters in sim.fossils.environment.

Trait values can be simulated for a given phylo or taxonomy object. If the function is provided with a phylo object, trait values are simulated assuming bifurcating speciation and simulated trait values are output for each edge in the order in which they appear in the phylo object dataframe. If the tree also has a root edge (tree$root.edge), the first entry in the vector will correspond to the first entry in the vector. If the function is provided with a taxonomy object, simulated trait values are output for each species in the order in which they appear in the taxonomy object dataframe.

The autocorrelated fossil recovery model

Under the autocorrelated model, traits evolve along lineages according to a Brownian motion process, where the strength of the relationship between ancestor and descendant values is determined by the parameter ν (function argument v). New trait values are drawn from a lognormal distribution, where the mean is equal to the value of the ancestor and the variance is a function of ν and the species or edge duration. If ν is small values will be more similar between ancestor and descendants, and if ν is zero all trait values will be equal. This is model is described in Heath et al. (2014) and is equivalent to the autocorrelated relaxed clock model described in Kishino et al. (2001).

# define the initial rate at the root or origin
rate = 1 

# simulate rates under the autocorrelated trait values model (the default option)
rates = sim.trait.values(init = rate, tree = t, v = 0.01)
# simulated rates
rates
#>  [1] 1.0599176 1.0883147 1.2007113 1.1898586 1.1380816 1.2943105 1.1031386
#>  [8] 1.0791394 1.0736070 1.1441912 1.2172428 1.3427452 1.0542301 1.0758923
#> [15] 0.9792786
#> attr(,"from.taxonomy")
#> [1] FALSE

# simulate fossils
f = sim.fossils.poisson(tree = t, rate = rates)

# plot the output
plot(f, tree = t)

The independent fossil recovery model

Under the independent model a new trait value is drawn for each species from any valid user-specified distribution dist. To be valid the function just has to return a single value.

# define the initial rate at the root or origin
rate = 1

# define the distribution used to sample new rates
# in this case an exponential with mean = 1
dist = function() { rexp(1) }

# simulate trait values under the independent trait values model
rates = sim.trait.values(init = rate, tree = t, model = "independent", dist = dist)

# simulate fossils
f = sim.fossils.poisson(tree = t, rate = rates)

# plot the output
plot(f, tree = t)

The parameter change.pr is the probability that changes in trait values are coincident with speciation events. At each speciation event change occurs with probability change.pr and new values are drawn from any valid user-specified distribution dist. If change.pr = 1 a change will occur at every speciation event and the model will be the same as above.

# define the initial value at the root or origin
rate = 1

# define the distribution used to sample new rates
# in this case an exponential with a mean ~ 3
dist = function() { rexp(1, 1/4) }

# define the probability of the trait value changing at each speciation event
change.pr = 0.5

# simulate trait values under the independent trait values model
rates = sim.trait.values(init = rate, tree = t, model = "independent", dist = dist, change.pr = change.pr)

# simulate fossils
f = sim.fossils.poisson(tree = t, rate = rates)

# plot the output
plot(f, tree = t)

Environment and lineage-dependent fossil recovery

Lineage specific values generated using sim.trait.values can also be passed to the the function sim.fossils.environment to define the PD, DT, and PA parameters.

# define constant values for preferred depth and depth tolerance
PD = 2
DT = 0.5

# simulate lineage variable peak abundance values

# define the distribution used to sample new PA values
# in this case a uniform in the interval 0, 1
dist = function() { runif(1) }

# simulate trait values under the independent model
PA = sim.trait.values(init = 0.1, tree = t, model = "independent", dist = dist)

# simulate fossils
f = sim.fossils.environment(tree = t, interval.ages = times, proxy.data = wd, PD = PD, DT = DT, PA = PA)

# plot the output
plot(f, tree = t, show.strata = TRUE, interval.ages = times)

Extant species and tip sampling

The functions sim.extant.samples and sim.tip.samples can be used to simulate incomplete extant species and tip sampling, respectively. The parameter rho is the probability that an extant species or a tip will be sampled.

# simulate fossils
f = sim.fossils.poisson(tree = t, rate = rates)

# simulate extant species sampling
f2 = sim.extant.samples(fossils = f, tree = t, rho = 0.5)

# plot the output
plot(f2, tree = t, extant.col = "red")


# for tip sampling only, create a fossil object with no fossils
f = fossils()

# simulate tip sampling
f2 = sim.tip.samples(fossils = f, tree = t, rho = 0.75)

# plot the output
plot(f2, tree = t)  

See also

See the paleotree vignette to see how FossilSim objects can be converted into paleotree objects.

References

Heath, T.A., Huelsenbeck, J.P. & Stadler, T. (2014). The fossilized birth-death process for coherent calibration of divergence-time estimates. Proceedings of the National Academy of Sciences, 111, E2957–E2966.
Holland, S.M. (1995). The stratigraphic distribution of fossils. Paleobiology, 21, 92–109.
Kishino, H., Thorne, J.L. & Bruno, W.J. (2001). Performance of a divergence time estimation method under a probabilistic model of rate evolution. Molecular Biology and Evolution, 18, 352–361.