Abstract
The USA National Center for Biotechnology Information (NCBI) is one of the world’s most important sources of biological information. NCBI databases like PubMed and GenBank contain millions of records describing bibliographic, genetic, genomic, and medical data. Here I present rentrez, a package which provides an R interface to 50 NCBI databases. The package is well-documented, contains an extensive suite of unit tests and has an active user base. The programmatic interface to the NCBI provided by rentrez allows researchers to query databases and download or import particular records into R sessions for subsequent analysis. The complete nature of the package, its extensive test-suite and the fact the package implements the NCBI’s usage policies all make rentrez a powerful aid to developers of new packages that perform more specific tasks.The USA National Center for Biotechnology Information (NCBI) is one of the world’s largest and most important sources of biological data. At the time of writing, the NCBI PubMed database provided information on \(27.5\) million journal articles, including \(4.6\) million full text records. The NCBI Nucleotide Database (including GenBank) had data for \(243.3\) million different sequences and dbSNP described \(997.3\) million different genetic variants. Records from all of these databases can be cross-referenced with the \(1.3\) million species in the NCBI taxonomy, and PubMed entries can be searched using a controlled vocabulary containing \(272\) thousand unique terms.
The NCBI provides access to a total of \(50\) databases through a web interface, public FTP sites and an API called Entrez Programming Utilities (eUtils, Sayers and Wheeler (2004)). R packages from the Bioconductor project (e.g., genomes, Stubben (2015); RMassBank, Stravs et al. (2013) and MeSHSim, Zhou and Shui (2015)) or available from CRAN (e.g., ape, Paradis, Claude, and Strimmer (2004); RISmed, Kovalchik (2017) and pubmed.mineR, Rani, S.Ramachandran, and Shah (2014)) take advantage of the Eutils API to perform specific tasks. Two packages, rentrez and reutils (Schöfl 2016), provide functions that cover the entire API. Here I describe rentrez, a package which provides users with a simple and consistent interface to eUtils. This paper discusses the design of the package, illustrates its use in biological research and demonstrates how the provided functions can aid the development of other packages designed to meet more specific goals.
The eUtils API provides endpoints for searching each of the databases it covers, finding cross-references among records in those databases and fetching particular records (in complete or summary form). The design of rentrez mirrors that of eUtils, with each of these endpoints represented by a core function that has arguments named to match those used in the API documentation (Table 1). The most important arguments to each R function are documented, and the help pages associated with these functions contain a reference to the relevant section of the eUtils documentation.
| NCBI endpoint | Purpose | Core function | |
|---|---|---|---|
| esearch | Locate records matching search criteria. | entrez_search |
|
| elink | Discover cross-linked records. | entrez_link |
|
| esummary | Fetch summary data on a set of records. | entrez_summary |
|
| efetch | Fetch complete records in a variety of formats. | entrez_fetch |
Typically, a user will begin by using entrez_search to
discover unique identifiers for database records matching particular
criteria. This function requires a database (argument db)
and a search term (argument term). The term
argument can take advantage of the eUtils search syntax, in which
queries can be associated with particular ‘search fields’ (enclosed in
square brackets), to limit the number of matching records. The fields
available for a given database can be retrieved with the function
entrez_db_searchable. The following call demonstrates the
use of this search syntax. This query finds scientific papers that
contain the phrase ‘R Package’ in their title while limiting the result
to include only papers published in 2017.
pubmed_search <- entrez_search(db="pubmed",
term="(R package[TITL]) AND 2017[PDAT]",
use_history=TRUE)
pubmed_search
#> Entrez search result with 62 hits (object contains 20 IDs and a web_history object)
#> Search term (as translated): R package[TITL] AND 2017[PDAT]
For the most part, the objects returned by rentrez functions
are S3 objects that inherit from "list" objects. Each of
these classes has its own print method, providing users
with a succint summary of the object’s contents. In this case,
entrez_search returns a an "esearch", object
and the print method describes the number of records matching the query.
These objects also contain either a character vector containing the
unique identifiers of records that match the query or a
"web_history" object that serves as a reference to this set
of identifiers stored on the NCBI’s servers. These
"web_history" objects are useful when dealing with large
numbers of records, as they reduce the number of identifiers that need
to be exchanged between a user and NCBI servers. Both identifiers and
"web_history" objects can be passed to the other core
functions to retrieve information about the records they represent. For
example, a call to entrez_summary returns information about
each paper identified in the search above.
pkg_paper_summs <- entrez_summary(db="pubmed", web_history=pubmed_search$web_history)
pkg_paper_summs
#> List of 62 esummary records. First record:
#>
#> $`28759592`
#> esummary result with 42 items:
#> [1] uid pubdate epubdate
#> [4] source authors lastauthor
#> [7] title sorttitle volume
#> [10] issue pages lang
#> [13] nlmuniqueid issn essn
#> [16] pubtype recordstatus pubstatus
#> [19] articleids history references
#> [22] attributes pmcrefcount fulljournalname
#> [25] elocationid doctype srccontriblist
#> [28] booktitle medium edition
#> [31] publisherlocation publishername srcdate
#> [34] reportnumber availablefromurl locationlabel
#> [37] doccontriblist docdate bookname
#> [40] chapter sortpubdate sortfirstauthor
In addition to matching each of the eUtils endpoints,
rentrez provides utility functions that facilitate common
workflows. For example, the function extract_from_esummary
allows users to extract some subset of the items contained in each of a
set of summary records. In this case, the names of the journals that
these papers appeared in can be retrieved. The resulting character
vector can then be used to identify the PubMed-indexed journals that
have published the most papers describing R packages this year.
journals <- extract_from_esummary(pkg_paper_summs, "fulljournalname")
journals_by_R_pkgs <- sort(table(journals), decreasing = TRUE)
head(journals_by_R_pkgs,3)
#> journals
#> Bioinformatics (Oxford, England) BMC bioinformatics
#> 16 9
#> Molecular ecology resources
#> 9
Records in the NCBI’s various databases are heavily cross-referenced, allowing users to identify and download data related to particular papers, organisms or genes. By providing a programmatic interface to these records rentrez allows R users to develop reproducible workflows that either download particular datasets for further analysis or load them into an R session. Here I demonstrate such a workflow, downloading DNA sequences corresponding to unique mRNA transcripts of a particular gene in a particular species.
Our aim is to retrieve the sequence of mRNA transcripts associated
with the gene that encodes Amyloid Beta Precursor Protein in humans.
This gene is identified by the gene symbol 1 APP. The
NCBI database dealing with genetic loci (rather than particular
sequences) is called ‘Gene’, so the first step to recovering the
sequence data is discovering the unique identifier associated with APP
in this database. This can be achieved with entrez_search,
using the gene symbol and species in the search term.
app_gene <- entrez_search(db="gene", term="(Homo sapiens[ORGN]) AND APP[GENE]")
app_gene
#> Entrez search result with 1 hits (object contains 1 IDs and no web_history object)
#> Search term (as translated): "Homo sapiens"[Organism] AND APP[GENE]
We now have a unique identifier for APP in the Gene database. In
order to download sequences for this gene we need to find records from
the NCBI Nucleotide database that are associated with the Gene record.
The function entrez_link can be used to find
cross-referenced records. In this case, a single call to
entrez_link can identify human APP sequences in the
nucleotide database in general and in a number of restrictive subsets of
that database.
nuc_links <- entrez_link(dbfrom="gene", id=app_gene$ids, db="nuccore")
nuc_links$links
#> elink result with information from 5 databases:
#> [1] gene_nuccore gene_nuccore_mgc gene_nuccore_pos
#> [4] gene_nuccore_refseqgene gene_nuccore_refseqrna
The RefSeq RNA subset on the Nucleotide database contains a curated
set of mRNA transcripts for different genes. Thus the unique identifiers
contained in the gene_nuccore_refseqrna element correspond
to the sequences we wish to download.The function
entrez_fetch allows users to retrieve complete records in a
variety of formats. Here the sequences are retrieved in the standard
fasta format, and returned as a character vector with a
single element.
raw_recs <- entrez_fetch(db="nuccore",
id=nuc_links$links$gene_nuccore_refseqrna,
rettype="fasta")
cat(substr(raw_recs, 1,303), "...")
#> >NM_001136131.2 Homo sapiens amyloid beta precursor protein (APP) ...
#> GTCGGATGATTCAAGCTCACGGGGACGAGCAGGAGCGCTCTCGACTTTTCTAGAGCCTCAGCGTCCTAGG
#> ACTCACCTTTCCCTGATCCTGCACCGTCCCTCTCCTGGCCCCAGACTCTCCCTCCCACTGTTCACGAAGC
#> CCAGGTACCCACTGATGGTAATGCTGGCCTGCTGGCTGAACCCCAGATTGCCATGTTCTGTGGCAGA...
Sequences retrieved in this way could be written to file to be used by other software.
cat(raw_recs, file="APP_transcripts.fasta")
Alternatively, the sequences can be analysed within R using packages
designed for sequence data. In this case, the data can be represented as
a ’DNAbin’ object using the phylogenetics package
ape.
tf <- tempfile()
cat(raw_recs, file=tf)
ape::read.dna(tf, format="fasta")
#> 10 DNA sequences in binary format stored in a list.
#>
#> Mean sequence length: 3477.9
#> Shortest sequence: 3255
#> Longest sequence: 3648
#>
#> Labels:
#> NM_001136131.2 Homo sapiens amyloid beta precursor protein (...
#> NM_001136016.3 Homo sapiens amyloid beta precursor protein (...
#> NM_001204303.1 Homo sapiens amyloid beta precursor protein (...
#> NM_001204301.1 Homo sapiens amyloid beta precursor protein (...
#> NM_001204302.1 Homo sapiens amyloid beta precursor protein (...
#> NM_201414.2 Homo sapiens amyloid beta precursor protein (APP...
#> ...
#>
#> Base composition:
#> a c g t
#> 0.276 0.223 0.258 0.244
The workflow detailed above provides a relatively simple example of
how functions provided by rentrez can be used to identify,
retrieve and analyse data from the NCBI’s databases. The package
includes an extensive vignette which documents each of the eUtils
endpoints and demonstrates a number of detailed workflows. This document
also describes how analyses can be scaled to analyse much larger
datasets than are described in this paper. The tutorial can be accessed
from within an R session by typing
vignette(topic="rentrez_tutorial").
Development of rentrez has deliberately focused on producing a "low-level" package that provides a flexible interface to the entire eUtils API. As a result the package does not provide functions for any particular analysis or return records in any of the object classes made available for biological data by other packages. Rather, it is hoped that by providing a reliable interface to the eUtils API that meets the NCBI’s terms of use rentrez will help other developers to build packages for more specific use-cases. Indeed, the package has already been used to integrate NCBI data into packages dealing with sequence analysis (genbankr, Becker and Lawrence (2017)), retrieval of phylogenetic trees (rotl, Michonneau, Brown, and Winter (2016)) and handling of full-text journal articles (fulltext, Chamberlain (2016)).
New packages that take advantage of rentrez will usually
focus on providing a simple interface to users, so researchers will not
need to be familiar with the syntax or arguments used in eUtils to
perform tasks. Packages may also provide functions to parse files
returned by entrez_fetch, either to extract particular
information from those files or represent them as R objects. Here I
present an example of a small package that performs both of these
tasks.The software repository for this manuscript (https://github.com/dwinter/rentrez_ms) includes the code
for a package called ‘tidytaxonomy’ that can be used to explore the
taxonomic diversity of various NCBI databases. This demonstrates how the
low-level code in rentrez can be used to develop specific
applications that have a simpler interface than could be achieved with
core rentrez functions alone.
The exposed functions from tidytaxonomy retrieve data
from NCBI, but do not require the user to have any knowledge of the
eUtils API. Internal functions
parse the XML formatted records returned from the NCBI Taxonomy
database and extract relevant information. The core function
tidy_taxonomy allows users to retrieve a part of the NCBI
Taxonomy database in ‘tidy data’ format (Wickham
2014).
# install.packages('devtools') # (If not already installed)
devtools::load_all("tidytaxon")
animal_orders <- tidy_taxonomy("animals",
lowest_rank="order",
higher_ranks=c("phylum", "class"))
head(animal_orders,3)
#> phylum class order
#> 1 Chordata Actinopteri Lutjaniformes
#> 2 Chordata Actinopteri Gerreiformes
#> 3 Chordata Actinopteri Priacanthiformes
Once this data is obtained, two additional functions make it easy to
include the number of records a given taxon has in a particular
database. The function taxon_children is specifically for
counting taxonomy records which are subordinate to a given taxonomic
group. The other function, taxon_records, discovers records
in any NCBI database.
animal_orders$species <- taxon_children(animal_orders$order)
animal_orders$genomes <- taxon_records(animal_orders$order, db="genome")
animal_orders$sequences <- taxon_records(animal_orders$order, db="nuccore")
animal_orders$papers <- taxon_records(animal_orders$order, db="pubmed")
head(animal_orders,3)
#> phylum class order species genomes sequences papers
#> 1 Chordata Actinopteri Lutjaniformes 279 0 9315 0
#> 2 Chordata Actinopteri Gerreiformes 62 0 901 0
#> 3 Chordata Actinopteri Priacanthiformes 34 0 602 0
The resulting data can be used to visualise the taxonomic diversity of NCBI databases (Figure 1). The Appendix to this paper includes code that takes advantage of treemap (Tennekes 2017) to produce these visualisations.
rentrez covers the complete eUtils API, is well-documented at the function and package level and includes an extensive test suite that covers internal functions as well as typical use-cases of the software. The current version of rentrez is thus considered a stable release, and it is unlikely any additional functionality will be added. The software is nevertheless still actively maintained to keep pace with CRAN and NCBI policies and to fix any bugs that arise. Software issues, including bug reports and requests for help with particular use-cases, are welcomed at the package’s software repository: http://github.com/ropensci/rentrez.
Development of rentrez has benefited greatly from being part of the rOpenSci project. I am especially grateful to Scott Chamberlain for his guidance. I am also very grateful to everyone who has provided pull-requests or filed issues including Chris Stubben, Karthik Ram, Han Guangchun, Matthew O’Meara, Reed Cartwright and Pavel Fedotov.
Code used to produce Figure 1, using
animal_orders data generated above.
# Format the total number of records for a graph title
make_title <- function(col_name, data){
n <- sum(data[,col_name])
with_commas <- formatC(n, format = "d", big.mark = ",")
paste0(col_name, " (n = ", with_commas, ")")
}
# Generate a treemap from taxonmic data.frame
# * data= tidy_taxonomy data.frame
# * size_col = name of column for tm tile-size
# * fill_col = name of column for tile-fil
# * row = plot row in 2x2 grid
# * col = plot col in 2x2 grid
# * pal = palette for fill
taxic_diversity_tm <- function(data, size_col, fill_col, row, col, pal){
treemap(data,
index=c("phylum", "class", "order"), vSize=size_col, vColor=fill_col,
palette=pal, type='categorical', position.legend="none",
title=make_title(size_col, data), border.col=c("white","white","black"),
vp = viewport(layout.pos.row = row, layout.pos.col = col)
)
}
library(treemap)
library(grid)
library(gridExtra)
# 24 phyla means some fill-colours will be re-used, ordering phylum factor by spp
# will prevent any "major" phyla from getting the same colour.
spp_per_phylum <- aggregate(species ~ phylum, FUN=sum, data=animal_orders)
phyla_ordered <- spp_per_phylum$phylum[ order(spp_per_phylum$species, decreasing=TRUE)]
animal_orders$phylum<- factor(animal_orders$phylum, levels=phyla_ordered)
pal <- rep(RColorBrewer::brewer.pal(8, name="Set1"), 3)
grid.newpage()
pushViewport(viewport(layout = grid.layout(2, 2)))
taxic_diversity_tm(animal_orders, "species", "phylum", 1,1, pal)
taxic_diversity_tm(animal_orders, "papers", "phylum", 1,2, pal)
taxic_diversity_tm(animal_orders, "sequences", "phylum", 2,1, pal)
taxic_diversity_tm(animal_orders, "genomes", "phylum", 2,2, pal)