---
title: "An introduction to the phyloregion package"
author: "Barnabas H. Daru, Piyal Karunarathne & Klaus Schliep"
date: "`r format(Sys.time(), '%B %d, %Y')`"
output: rmarkdown::html_vignette
bibliography: phyloregion.bib
vignette: >
   %\VignetteIndexEntry{An introduction to the phyloregion package}
   %\VignetteEngine{knitr::rmarkdown}
   %\usepackage[utf8]{inputenc}   
---
---
nocite: | 
   @Daru2020, @Daru2017
---

## 1. Installation
`phyloregion` is available from the [Comprehensive R Archive Network](https://CRAN.R-project.org/package=phyloregion), so you can use the following line of code to install and run it:

```
install.packages("phyloregion")
```

Alternatively, you can install the development version of `phyloregion` hosted 
on GitHub. To do this, you will need to install the `devtools` package. In R, type:

```
if (!requireNamespace("remotes", quietly = TRUE)) 
    install.packages("remotes") 
remotes::install_github("darunabas/phyloregion")
```
When installed, load the package in R:
```{r}
library(phyloregion)
```

## 2. Overview and general workflow of `phyloregion`
The workflow of the `phyloregion` package demonstrates steps from preparation 
of different types of data to visualizing the results of biogeographical regionalization, together with tips on selecting the optimal method for 
achieving the best output, depending on the types of data used and research questions.

![__Figure 1.__ Simplified workflow for analysis of biogeographical regionalization using phyloregion. Distribution data is converted to a sparse community matrix. When paired with phylogenetic data, phylobuilder creates a subtree with largest overlap from a species list, thereby ensuring complete representation of missing data; phylocommunity matrix to visualization of results.](workflow.png)

## 3. Input data

### Phylogenies
In R, phylogenetic relationships among species / taxa are often represented as a phylo 
object implemented in the `ape` package[@Paradis2018]. Phylogenies (often in the Newick 
or Nexus formats) can be imported into R with the `read.tree` or `read.nexus` 
functions of the `ape` package[@Paradis2018].
```{r, fig.align="left", fig.cap="__Figure 2.__ Phylogenetic tree of the woody plants of southern Africa inferred from DNA barcodes using a maximum likelihood approach and transforming branch lengths to millions of years ago by enforcing a relaxed molecular clock and multiple calibrations [@Daru2015ddi]."}
library(ape)
library(Matrix)
library(terra)
data(africa)
sparse_comm <- africa$comm

tree <- africa$phylo
tree <- keep.tip(tree, intersect(tree$tip.label, colnames(sparse_comm)))
par(mar=c(2,2,2,2))
plot(tree, show.tip.label=FALSE)
```
### Distribution data input

The `phyloregion` package has functions for manipulating three kinds of distribution 
data: point records, vector polygons and raster layers. An overview can be easily obtained with the functions `points2comm`, `polys2comm` and `rast2comm` for point records, polygons, or raster layers, respectively. Depending on the data source, all three functions ultimately provide convenient interfaces to convert the distribution data to a community matrix at varying spatial grains and extents for downstream analyses.

We will play around with these functions in turn.

### **_Function `points2comm`_**

Here, we will generate random points in geographic space, similar to occurrence data obtained from museum records, GBIF, iDigBio, or CIESIN which typically have columns of geographic coordinates for each observation.

```{r}
s <- vect(system.file("ex/nigeria.json", package="phyloregion"))

set.seed(1)
m <- as.data.frame(spatSample(s, 1000, method = "random"),
                   geom = "XY")[-1]
names(m) <- c("lon", "lat")
species <- paste0("sp", sample(1:100))
m$taxon <- sample(species, size = nrow(m), replace = TRUE)

pt <- points2comm(dat = m, res = 0.5, lon = "lon", lat = "lat",
            species = "taxon") # This generates a list of two objects
head(pt[[1]][1:5, 1:5])
```

### **_Function `polys2comm`_**

This function converts polygons to a community matrix at varying spatial grains and extents for downstream analyses. Polygons can be derived from the IUCN Redlist spatial database (https: //www.iucnredlist.org/resources/spatial-data-download), published monographs or field guides validated by taxonomic experts. To illustrate this function, we will use the function `random_species` to generate random polygons for five random species over the landscape of Nigeria as follows:

```{r}
s <- vect(system.file("ex/nigeria.json", package="phyloregion"))
sp <- random_species(100, species=5, pol=s)
pol <- polys2comm(dat = sp)
head(pol[[1]][1:5, 1:5])
```

### **_Function `rast2comm`_**

This third function, converts raster layers (often derived from species distribution modeling, such as aquamaps[@kaschner2008aquamaps]) to a community matrix. 

```{r}
fdir <- system.file("NGAplants", package="phyloregion")
files <- file.path(fdir, dir(fdir))
ras <- rast2comm(files) 
head(ras[[1]][1:5, 1:5])
```

The object `ras` above also returns two objects: a community data frame and a vector of grid cells with the numbers of species per cell and can be plotted as 
a heatmap using `plot` function as follows:

```{r, fig.align="left", fig.cap="__Figure 3.__ Species richness of plants in Nigeria across equal area grid cells. This is to demonstrate how the function `plot` works.", out.width = '50%'}
s <- vect(system.file("ex/SR_Naija.json", package="phyloregion"))
par(mar=rep(0,4))
plot(s, "SR", border=NA, type = "continuous", 
     col = hcl.colors(20, palette = "Blue-Red 3", rev=FALSE))
```

### Community data

Community data are commonly stored in a matrix with the sites as 
rows and species / operational taxonomic units (OTUs) as columns. The 
elements of the matrix are numeric values indicating the abundance/observations 
or presence/absence (0/1) of OTUs in different sites. In practice, such a matrix can contain many zero values because species are known to generally have unimodal distributions along environmental gradients [@TerBraak2004], and storing and analyzing every single element of that matrix can be computationally challenging and expensive. 

`phyloregion` differs from other R packages (e.g. vegan [@vegan], picante [@Kembel2010]
or betapart[@Baselga2012]) in that the data are not stored in a (dense) `matrix` or `data.frame` but as a sparse matrix making use of the infrastructure provided by the Matrix package [@Matrix]. A sparse matrix is a matrix with a high proportion of zero entries[@Duff1977], of which only the non-zero entries are stored and used for downstream analysis.

A sparse matrix representation has two advantages. First the community matrix 
can be stored in a much memory efficient manner, allowing analysis of larger 
datasets. Second, for very large datasets spanning thousands of taxa and spatial scales, 
computations with a sparse matrix are often much faster.  
The `phyloregion` package contains functions to conveniently change between data 
formats. 
```{r, eval=TRUE}
library(Matrix) 
data(africa)
sparse_comm <- africa$comm
dense_comm <- as.matrix(sparse_comm) 
object.size(dense_comm)
object.size(sparse_comm)
```
Here, the data set in the dense matrix representation consumes roughly five times 
more memory than the sparse representation. 

## 4. Analysis

### Alpha diversity
We demonstrate the utility of `phyloregion` in mapping standard conservation metrics of species richness, weighted endemism (`weighted_endemism`) and threat (`map_traits`) as well as fast computations of phylodiversity measures such as phylogenetic diversity (`PD`), phylogenetic endemism (`phylo_endemism`), and evolutionary distinctiveness and global endangerment (`EDGE`). The major advantage of these functions compared to available tools is the ability to utilize sparse matrix that speeds up the analyses without exhausting computer memories, making it ideal for handling any data from small local scales to large regional and global scales.

#### Function `weighted_endemism`

Weighted endemism is species richness inversely weighted by species ranges[@crisp2001endemism],[@laffan2003assessing],[@daru2020endemism].

```{r, fig.align="left", fig.cap="__Figure 4.__ Geographic distributions of weighted endemism for woody plants of southern Africa.", out.width = '50%'}
library(terra)
data(africa)
p <- vect(system.file("ex/sa.json", package = "phyloregion"))
Endm <- weighted_endemism(africa$comm)
m <- merge(p, data.frame(grids=names(Endm), WE=Endm), by="grids")
m <- m[!is.na(m$WE),]

par(mar=rep(0,4))
plot(m, "WE", col = hcl.colors(20, "Blue-Red 3"), 
     type="continuous", border = NA)
```

#### Function `PD` – phylogenetic diversity

Phylogenetic diversity (`PD`) represents the length of evolutionary pathways that connects a given set of taxa on a rooted phylogenetic tree [@Faith1992]. This metric is often characterised in units of time (millions of years, for dated phylogenies). We will map PD for plants of southern Africa. 
```{r, fig.align="left", fig.cap="__Figure 5.__ Geographic distributions of phylogenetic diversity for woody plants of southern Africa.", out.width = '50%'}
data(africa)
comm <- africa$comm
tree <- africa$phylo
poly <- vect(system.file("ex/sa.json", package = "phyloregion"))

mypd <- PD(comm, tree)
head(mypd)

M <- merge(poly, data.frame(grids=names(mypd), pd=mypd), by="grids")
M <- M[!is.na(M$pd),]
head(M)

par(mar=rep(0,4))
plot(M, "pd", border=NA, type="continuous",
            col = hcl.colors(20, "Blue-Red 3"))
```

#### Function `phylo_endemism` – phylogenetic endemism

Phylogenetic endemism is not influenced by variations in taxonomic opinion because it measures endemism based on the relatedness of species before weighting it by their range sizes[@Rosauer2009],[@daru2020endemism].

```{r, fig.align="left", fig.cap="__Figure 6.__ Geographic distributions of phylogenetic endemism for woody plants of southern Africa.", out.width = '50%'}
library(terra)
data(africa)
comm <- africa$comm
tree <- africa$phylo
poly <- vect(system.file("ex/sa.json", package = "phyloregion"))

pe <- phylo_endemism(comm, tree)
head(pe)

mx <- merge(poly, data.frame(grids=names(pe), pe=pe), by="grids")
mx <- mx[!is.na(mx$pe),]
head(mx)

par(mar=rep(0,4))
plot(mx, "pe", border=NA, type="continuous",
            col = hcl.colors(n=20, palette = "Blue-Red 3", rev=FALSE))
```

#### Function `EDGE` – Evolutionary Distinctiveness and Global Endangerment

This function calculates EDGE by combining evolutionary distinctiveness (ED; i.e., phylogenetic isolation of a species) with global endangerment (GE) status as defined by the International Union for Conservation of Nature (IUCN).

```{r, fig.align="left", fig.cap="__Figure 7.__ Geographic distributions of evolutionary distinctiveness and global endangerment for woody plants of southern Africa.", out.width = '50%'}
data(africa)
comm <- africa$comm
threat <- africa$IUCN
tree <- africa$phylo
poly <- vect(system.file("ex/sa.json", package = "phyloregion"))

x <- EDGE(threat, tree, Redlist = "IUCN", species="Species")
head(x)

y <- map_trait(comm, x, FUN = sd, pol=poly)

par(mar=rep(0,4))
plot(y, "traits", border=NA, type="continuous",
            col = hcl.colors(n=20, palette = "Blue-Red 3", rev=FALSE))
```

### Analysis of beta diversity (phylogenetic and non-phylogenetic)

The three commonly used methods for quantifying -diversity, the variation in species composition among sites, – Simpson, Sorenson and Jaccard[@laffan2016range]. The `phyloregion`’s functions `beta_diss` and `phylobeta` compute efficiently pairwise dissimilarities matrices for large sparse community matrices and phylogenetic trees for taxonomic and phylogenetic turnover, respectively. The results are stored as distance objects for subsequent analyses.

***

### Phylogenetic beta diversity

`phyloregion` offers a fast means of computing phylogenetic beta diversity, the turnover 
of branch lengths among sites, making use of and improving on the infrastructure provided by the `betapart` package[@Baselga2012] allowing a sparse community matrix as input.  
```{r}
data(africa)
p <- vect(system.file("ex/sa.json", package = "phyloregion"))
sparse_comm <- africa$comm

tree <- africa$phylo
tree <- keep.tip(tree, intersect(tree$tip.label, colnames(sparse_comm)))
pb <- phylobeta(sparse_comm, tree)
```

```{r, message=FALSE, results='hide', warning=FALSE}
y <- phyloregion(pb[[1]], pol=p)
```

```{r}
plot_NMDS(y, cex=3)
text_NMDS(y)

par(mar=rep(0,4))
plot(y, palette="NMDS")
```

## Session Information
```{r, eval=TRUE}
sessionInfo()
```

## REFERENCES