Title: | Discovery, Access and Manipulation of 'TreeBASE' Phylogenies |
---|---|
Description: | Interface to the API for 'TreeBASE' <http://treebase.org> from 'R'. 'TreeBASE' is a repository of user-submitted phylogenetic trees (of species, populations, or genes) and the data used to create them. |
Authors: | Carl Boettiger [aut, cre], Duncan Temple Lang [aut] |
Maintainer: | Carl Boettiger <[email protected]> |
License: | CC0 |
Version: | 0.1.5 |
Built: | 2024-11-12 05:57:10 UTC |
Source: | https://github.com/ropensci/treebase |
A function to cache the phylogenies in treebase locally
cache_treebase( file = paste("treebase-", Sys.Date(), ".rda", sep = ""), pause1 = 3, pause2 = 3, attempts = 10, max_trees = Inf, only_metadata = FALSE, save = TRUE )
file |
filename for the cache, otherwise created with datestamp |
pause1 |
number of seconds to hesitate between requests |
pause2 |
number of seconds to hesitate between individual files |
attempts |
number of attempts to access a particular resource |
max_trees |
maximum number of trees to return (default is Inf) |
only_metadata |
option to only return metadata about matching trees |
save |
logical indicating whether to save a file with the results. |
it's a good idea to let this run overnight
saves a cached file of treebase
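The `attempts`, `pause1` and `pause2` arguments suggest a retry-with-delay pattern. The following is a minimal sketch of that idea in base R, not the package's actual implementation; `with_retries` and `flaky` are hypothetical names introduced only for illustration.

```r
# Hypothetical helper (not part of treebase): call `f` up to `attempts` times,
# sleeping `pause` seconds between failed tries.
with_retries <- function(f, attempts = 10, pause = 3) {
  stopifnot(attempts >= 1)
  last_error <- NULL
  for (i in seq_len(attempts)) {
    result <- tryCatch(f(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    last_error <- result
    if (i < attempts) Sys.sleep(pause)
  }
  stop("all ", attempts, " attempts failed: ", conditionMessage(last_error))
}

# Simulated resource that fails twice before succeeding
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("temporary failure")
  "tree data"
}

out <- with_retries(flaky, attempts = 5, pause = 0)
```

With real requests, a non-zero pause keeps the load on the server low, which is presumably why the defaults here are several seconds.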
## Not run:
treebase <- cache_treebase()
## End(Not run)
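As a rough illustration of how a date-stamped cache file could be written and reloaded later, the sketch below builds the filename the same way the `file` default does. It assumes the cache is stored with `save()` under the object name `treebase`, which the `.rda` extension suggests but this page does not confirm. The list contents are placeholders, not real trees.

```r
# Build the cache filename exactly as the default `file` argument does
cache_file <- paste("treebase-", Sys.Date(), ".rda", sep = "")

# Placeholder standing in for the list of trees cache_treebase() would return
treebase <- list(tree1 = "placeholder", tree2 = "placeholder")

# Assumption: the cache is written with save(); use a temp directory here
path <- file.path(tempdir(), cache_file)
save(treebase, file = path)

# Reload into a separate environment to avoid overwriting existing objects
restored <- new.env()
load(path, envir = restored)
```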
Download the metadata on treebase using the OAI-PMH interface
download_metadata( query = "", by = c("all", "until", "from"), curl = getCurlHandle() )
query |
a date in format yyyy-mm-dd |
by |
return all data "until" that date, "from" that date to current, or "all" |
curl |
if calling in series many times, call getCurlHandle() first and then pass the return value in here. Avoids repeated handshakes with server. |
query must be a date in yyyy-mm-dd format, given as a character string, e.g. download_metadata("2010-01-01", by="until"). "all" isn't a real query type, but will return all trees regardless of date.
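The date has to be quoted: an unquoted `2010-01-01` is evaluated by R as arithmetic (2010 minus 1 minus 1). The sketch below shows one way the documented contract for `query` and `by` could be checked before a request is made. `check_query` is a hypothetical helper, not a function in the package.

```r
# Hypothetical validator mirroring the documented contract of `query` and `by`
check_query <- function(query = "", by = c("all", "until", "from")) {
  by <- match.arg(by)
  if (by != "all") {
    # Reject non-character input such as an unquoted 2010-01-01
    if (!is.character(query)) stop("query must be a character string")
    parsed <- as.Date(query, format = "%Y-%m-%d")
    if (is.na(parsed)) stop("query must be a date in yyyy-mm-dd format")
    query <- format(parsed, "%Y-%m-%d")
  }
  list(query = query, by = by)
}

ok <- check_query("2010-01-01", by = "until")
unquoted <- 2010-01-01  # arithmetic, not a date
bad <- tryCatch(check_query(unquoted, by = "until"), error = function(e) e)
```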
## Not run:
Near <- search_treebase("Near", "author", max_trees=1)
metadata(Near[[1]]$S.id)
## or manually give a study id
metadata("2377")

### get all trees deposited up to a given date (use by="from" for that date forwards)
m <- download_metadata("2009-01-01", by="until")
## extract any metadata, e.g. publication date:
dates <- sapply(m, function(x) as.numeric(x$date))
hist(dates, main="TreeBase growth", xlab="Year")

### show authors with most tree submissions in that date range
authors <- sapply(m, function(x){
  index <- grep("creator", names(x))
  x[index]
})
a <- as.factor(unlist(authors))
head(summary(a))

## Show growth of TreeBASE
all <- download_metadata("", by="all")
dates <- sapply(all, function(x) as.numeric(x$date))
hist(dates, main="TreeBase growth", xlab="Year")

## make a barplot of submission volume by journal
journals <- sapply(all, function(x) x$publisher)
J <- tail(sort(table(as.factor(unlist(journals)))),5)
b <- barplot(as.numeric(J))
text(b, names(J), srt=70, pos=4, xpd=TRUE)
## End(Not run)
drop errors from the search
drop_nontrees(tr)
tr |
a list of phylogenetic trees returned by search_treebase |
primarily for the internal use of search_treebase, but may be useful
the list of phylogenetic trees returned successfully
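One plausible way to drop failed entries from a result list is to keep only the elements that carry the `phylo` class. The sketch below uses mock objects and assumes failures show up as non-`phylo` elements; the package's own logic may differ, and `make_tree` and `keep_trees` are hypothetical names.

```r
# Mock phylo-like object (a real one would come from ape or search_treebase)
make_tree <- function(tips) structure(list(tip.label = tips), class = "phylo")

# Mock search result: two trees and one failed download
results <- list(
  make_tree(c("a", "b")),
  structure("could not read file", class = "try-error"),
  make_tree(c("c", "d", "e"))
)

# Keep only the elements that inherit from class "phylo"
keep_trees <- function(tr) Filter(function(x) inherits(x, "phylo"), tr)

trees <- keep_trees(results)
```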
Search the dryad metadata archive
dryad_metadata(study.id, curl = getCurlHandle())
study.id |
the dryad identifier |
curl |
if calling in series many times, call getCurlHandle() first and then pass the return value in here. Avoids repeated handshakes with server. |
a list object containing the study metadata
## Not run:
dryad_metadata("10255/dryad.12")
## End(Not run)
Simple function to identify which trees have branch lengths
have_branchlength(trees)
trees |
a list of phylogenetic trees (ape/phylo format) |
logical vector indicating which trees have branch length data
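In the ape `phylo` format, branch lengths live in an optional `edge.length` element, so a check of this kind can be sketched as below. This illustrates the idea with mock objects and is not the package's source; `has_lengths` is a hypothetical name.

```r
# Mock phylo-like objects: one with branch lengths, one without
with_bl <- structure(
  list(tip.label = c("a", "b"), edge.length = c(0.1, 0.2)),
  class = "phylo"
)
without_bl <- structure(list(tip.label = c("c", "d")), class = "phylo")

# TRUE for each tree that has an edge.length element
has_lengths <- function(trees) {
  vapply(trees, function(tr) !is.null(tr$edge.length), logical(1))
}

flags <- has_lengths(list(with_bl, without_bl))
```

The resulting logical vector can be used to subset the list, for example `trees[flags]`.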
Contains a cache of all publication metadata that the search_metadata() function was able to pull down when run on 2012-05-12.
metadata(phylo.md = NULL, oai.md = NULL)
phylo.md |
cached phyloWS (tree) metadata, (optional) |
oai.md |
cached OAI-PMH (study) metadata (optional) |
recreate with:
search_metadata()
a data frame of all available metadata (as a data.table object). Columns are: "Study.id", "Tree.id", "kind", "type", "quality", "ntaxa", "date", "publisher", "author", "title".
## Not run:
meta <- metadata()
meta[publisher %in% c("Nature", "Science") & ntaxa > 50 & kind == "Species Tree",]
## End(Not run)
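The example above uses data.table syntax, where column names can be used bare inside `[ ]`. For readers working with a plain data frame, the same filter can be written in base R. The rows below are made-up placeholders that borrow a few of the documented column names; they are not real TreeBASE records.

```r
# Placeholder rows for illustration only (not real TreeBASE metadata)
meta <- data.frame(
  Study.id  = c("S1", "S2", "S3"),
  Tree.id   = c("T1", "T2", "T3"),
  kind      = c("Species Tree", "Gene Tree", "Species Tree"),
  ntaxa     = c(120, 80, 30),
  publisher = c("Nature", "Science", "Nature"),
  stringsAsFactors = FALSE
)

# Base R equivalent of the data.table filter: columns need the meta$ prefix
hits <- meta[
  meta$publisher %in% c("Nature", "Science") &
    meta$ntaxa > 50 &
    meta$kind == "Species Tree",
]
```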
A function to pull in the phylogeny/phylogenies matching a search query
search_treebase( input, by, returns = c("tree", "matrix"), exact_match = FALSE, max_trees = Inf, branch_lengths = FALSE, curl = getCurlHandle(), verbose = TRUE, pause1 = 0, pause2 = 0, attempts = 3, only_metadata = FALSE )
input |
a search query (character string) |
by |
the kind of search; author, taxon, subject, study, etc (see list of possible search terms, details) |
returns |
should the function return the tree or the character matrix? |
exact_match |
force exact matching for author name, taxon, etc. Otherwise does partial matching |
max_trees |
Upper bound for the number of trees returned, good for keeping possibly large initial queries fast |
branch_lengths |
logical indicating whether to return only trees that have branch lengths. |
curl |
the handle to the curl web utility for repeated calls, see the getCurlHandle() function in RCurl package for details. |
verbose |
logical indicating level of progress reporting |
pause1 |
number of seconds to hesitate between requests |
pause2 |
number of seconds to hesitate between individual files |
attempts |
number of attempts to access a particular resource |
only_metadata |
option to only return metadata about matching trees, which lists study.id, tree.id, kind (gene, species, barcode), type (single, consensus), number of taxa, and possible quality score. |
either a list of trees (multiphylo) or a list of character matrices
## Not run:
## defaults to return phylogeny
Huelsenbeck <- search_treebase("Huelsenbeck", by="author")

## can ask for character matrices:
wingless <- search_treebase("2907", by="id.matrix", returns="matrix")

## Some nexus matrices don't meet read.nexus.data's strict requirements,
## these aren't returned
H_matrices <- search_treebase("Huelsenbeck", by="author", returns="matrix")

## Use Booleans in search: and, or, not
## Note that by must identify each entry type if a Boolean is given
HR_trees <- search_treebase("Ronquist or Huelsenbeck", by=c("author", "author"))

## We'll often use max_trees in the examples so that they run quickly,
## notice the quotes for species.
dolphins <- search_treebase('"Delphinus"', by="taxon", max_trees=5)

## can do exact matches
humans <- search_treebase('"Homo sapiens"', by="taxon", exact_match=TRUE, max_trees=10)

## all trees with 5 taxa
five <- search_treebase(5, by="ntax", max_trees = 10)

## These are different: a tree id isn't a study id. We report both
studies <- search_treebase("2377", by="id.study")
tree <- search_treebase("2377", by="id.tree")
c("TreeID" = tree$Tr.id, "StudyID" = tree$S.id)

## Only results with branch lengths
## Has to grab all the trees first, then toss out ones without branch lengths
Near <- search_treebase("Near", "author", branch_lengths=TRUE)
## End(Not run)
Contains a cache of all phylogenies the cache_treebase() function was able to pull down when run on 2012-05-14.
recreate with:
cache_treebase()