--- title: "Loading phylogenetic data into R" author: "Martin R. Smith " date: "`r Sys.Date()`" output: rmarkdown::html_vignette bibliography: ../inst/REFERENCES.bib csl: ../inst/apa-old-doi-prefix.csl vignette: > %\VignetteIndexEntry{Loading phylogenetic data into R} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- It can be a bit fiddly to get a phylogenetic dataset into R, particularly if you are not used to working with files in the Nexus format. First off, make sure that you are comfortable [telling R where to find a file](filesystem-navigation.html). Then you are ready to load raw data: ## From an Excel spreadsheet If your data is in an Excel spreadsheet, one way to load it into R is using the '[readxl](https://readxl.tidyverse.org/)' package. First you'll have to install it: ```r install.packages("readxl") # You only need to do this once ``` Then you should prepare your Excel spreadsheet such that each row corresponds to a taxon, and each column to a character. Then you can read the data from the Excel file by telling R which sheet, rows and columns contain your data: ```r library("readxl") raw_data <- as.matrix(read_excel( filename, sheet = 1, # Loads sheet number 1 from the excel file range = "B1:AA21", # Extracts columns B to AA, rows 1 to 21 # Note that the first row is interpreted as column (character) names col_types = "text" # Read all columns as character strings )) # Read row (taxon) names from column A # Again, the first cell will be interpreted as a column name taxon_names <- unlist(read_excel(filename, sheet = 1, range = "A1:A21")) rownames(raw_data) <- taxon_names ``` ## From a Nexus file TreeTools contains an inbuilt Nexus parser: ```r raw_data <- ReadCharacters(filename) # Or, to go straight to PhyDat format: as_phydat <- ReadAsPhyDat(filename) ``` This will extract character names and codings from a dataset. It's been written to work with datasets downloaded from [MorphoBank](https://morphobank.org/), but my aim is for this function to handle most valid (and many invalid) NEXUS files. If you find a file that this function can't handle, please [let me know](https://github.com/ms609/TreeTools/issues/new?title=Unsupported+NEXUS+file&body=Please%20attach%20the%20problematic%20NEXUS%20file%20and%20describe%20the%20issue&labels=enhancement) and I'll try to fix it. In the meantime, alternative Nexus parsers are available: try ```R raw_data <- ape::read.nexus.data(filename) ``` Non-standard elements of a Nexus file might be beyond the capabilities of ape's parser. In particular, you will need to replace spaces in taxon names with an underscore, and to arrange all data into a single block starting `BEGIN DATA`. You'll need to strip out comments, character definitions and separate taxon blocks. The function `readNexus` in package `phylobase` uses the NCL library and promises to be more powerful, but I've not been able to get it to work. ## From a TNT file A TNT format dataset downloaded from [MorphoBank](https://morphobank.org) can be parsed with `ReadTntCharacters`, which might also handle other TNT-compatible files. If there's a file that's not being read correctly, please [let me know](https://github.com/ms609/TreeTools/issues/new?title=Unsupported+TNT+file&body=Please%20attach%20the%20problematic%20TNT%20file%20and%20describe%20the%20issue&labels=enhancement) and I'll try to fix it. ```r raw_data <- ReadTntCharacters(filename) # Or, to go straight to PhyDat format: my_data <- ReadTntAsPhyDat(filename) ``` # Processing raw data The next stage is to get the raw data into a format that most R packages can understand. If you've used the `ReadAsPhyDat` or `ReadTntAsPhyDat` functions, then you can skip this step -- you're already there. Otherwise, you can try ```r my_data <- PhyDat(raw_data) ``` or if that doesn't work, ```r my_data <- MatrixToPhyDat(raw_data) ``` These functions are pretty robust, but might return an error when they encounter an unexpected dataset format -- if they don't work on your dataset, please [let me know](https://github.com/ms609/TreeTools/issues/new?title=data+to+PhyDat+conversion+issue&body=Please%20describe+the+problem+here:&labels=bug). Failing that, you can enlist the help of the 'phangorn' package: ```r install.packages('phangorn') library('phangorn') my_data <- phyDat(raw_data, type = "USER", levels = c(0:9, "-")) ``` `type="USER"` tells the parser to expect morphological data. The `levels` parameter simply lists all the states that any character might take. `0:9` includes all the integer digits from 0 to 9. If you have inapplicable data in your matrix, you should list `-` as a separate level as it represents an additional state (as handled by the Morphy implementation of [@Brazeau2018]). If you have more complicated ambiguities, you may need to use a contrast matrix to decode your matrix. A contrast matrix translates the tokens used in your dataset to the character states to which they correspond: for example decoding 'A' to {01}. For more details, see the 'phangorn-specials' vignette in the phangorn package, accessible by typing '?phangorn' in the R prompt and navigating to index > package vignettes. ```{r Contrast matrix} contrast.matrix <- matrix(data = c( # 0 1 - # Each column corresponds to a character-state 1, 0, 0, # Each row corresponds to a token, here 0, denoting the # character-state set {0} 0, 1, 0, # 1 | {1} 0, 0, 1, # - | {-} 1, 1, 0, # A | {01} 1, 1, 0, # + | {01} 1, 1, 1 # ? | {01-} ), ncol = 3, # ncol corresponds to the number of columns in the matrix byrow = TRUE) dimnames(contrast.matrix) <- list( c(0, 1, "-", "A", "+", "?"), # A list of the tokens corresponding to each row # in the contrast matrix c(0, 1, "-") # A list of the character-states corresponding to the columns # in the contrast matrix ) contrast.matrix ``` If you need to use a contrast matrix, convert the data using ```r my.phyDat <- phyDat(my.data, type = "USER", contrast = contrast.matrix) ``` # What next? You might want to: - [Load a phylogenetic tree](load-trees.html) into R. - Conduct parsimony search using Brazeau, Guillerme & Smith's [-@Brazeau2018] [approach to inapplicable data]( https://ms609.github.io/TreeSearch/articles/tree-search.html), or using [Profile parsimony]( https://ms609.github.io/TreeSearch/articles/profile.html) [@Faith2001]. # References