Abstract
Species Distribution Models (SDMs) constitute an important tool to assist decision-making in environmental conservation and planning in the context of climate change. Nevertheless, SDM projections are affected by a wide range of uncertainty factors (related to training data, climate projections and SDM techniques), which limit their potential value and credibility. The new package mopa provides tools for designing comprehensive multi-factor SDM ensemble experiments, combining multiple sources of uncertainty (e.g. baseline climate, pseudo-absence realizations, SDM techniques, future projections) and allowing to assess their contribution to the overall spread of the ensemble projection. In addition, mopa is seamlessly integrated with the climate4R bundle and allows straightforward retrieval and post-processing of state-of-the-art climate datasets (including observations and climate change projections), thus facilitating the proper analysis of key uncertainty factors related to climate data.Species Distribution Models (SDMs) are statistical tools used for the generation of probabilistic predictions of the presence of biological entities in the geographical space (Antoine Guisan and Zimmermann 2000; Elith and et al 2006). SDMs operate through the establishment of an empirical link between known presence locations and the physical characteristics of their environment. A particular case is that of Climate Envelope Models (CEMs), where appropriate climatic variables are used as predictors to characterize the climatic conditions where a species can potentially live —typically in the form of bioclimatic variables (Nix 1986; Busby 1991). In the context of climate change, SDMs have become a valuable tool for the vulnerability and impact assessment community, as a means of estimating distribution shifts due to climate variations, a problem of current interest in environmental conservation studies (see e.g.: M. B. Araújo et al. 2004; Hamann and Wang 2006; Jeschke and Strayer 2008). These studies require suitable climate products to produce models at an adequate spatial resolution and varying geographical extents –up to global–, including historical climate databases (i.e. high resolution gridded observations) and future climate projections for different emission scenarios. However, the intricacy of climate data retrieval and post-processing of the existing climate products (e.g. the global and regional climate change projections available from the Earth System Grid Federation, ESGF, Taylor, Stouffer, and Meehl 2011) has resulted in a wide use of ready-to-use products without considering their limitations for a particular case study (see J. Bedia, Herrera, and Gutiérrez 2013). In this paper we fill this gap with the package mopa (Species Distribution MOdeling with Pseudo-Absences), which has been developed in the framework of the climate4R bundle for climate data access and post-processing, thus facilitating the use of state-of-the-art global and regional climate data for SDM projections.
Despite the increased use of future SDM projections as a support tool for decision-making in biological conservation, the communication of the inherent uncertainties of these products remains as an ongoing challenge (see, e.g. Miguel B. Araújo et al. 2005; Beaumont, Hughes, and Pitman 2008; Fronzek, Carter, and Luoto 2011). A common approach to tackle different sources of uncertainty is based on producing ensembles of future SDM projections that encompass a wide range of variability by considering multiple choices of each of the factors/components involved in the modeling and projection chain (Miguel B. Araújo and New 2007; Buisson et al. 2010; Bagchi et al. 2013). However, there are important sources of uncertainty that are rarely quantified, yet crucial, in order to assess the credibility of the future distributions, such as the training data (including the baseline climate) used to fit the SDMs characterizing the ecological niche (Mateo, Felicísimo, and Muñoz 2010; J. Bedia, Herrera, and Gutiérrez 2013; Baker et al. 2016), the varying extrapolation ability outside the training period/spatial extent of the different SDM techniques [known as SDM transferability in time/space; J. Bedia, Busqué, and Gutiérrez (2011); Fronzek, Carter, and Luoto (2011)], the Global/Regional Climate Model (GCM/RCM) projections and biases (Turco et al. 2013) and others (see e.g.: Falloon et al. 2014 for an overview). Moreover, the ensemble approach has also limitations, since it assumes that all SDMs are equally transferable to climate change conditions, thus posing the risk of diluting insightful model signals with noise and error from less useful or defective SDMs forming the ensemble (Thuiller et al. 2004; Peterson et al. 2011).
The package mopa here presented has been designed to facilitate the design and analysis of comprehensive multi-factor SDM ensemble experiments, exploring different uncertainty factors such as presence data sets, pseudo-absence realizations, baseline climate, modeling algorithms, and future climate projections. Moreover, mopa provides variance partition tools to assess the contribution of the different factors to the overall uncertainty/spread of the ensemble projection. We illustrate the functionality of the package with the case-study presented in Iturbide, Bedia, and Gutiérrez (2018), focusing on the impact of the pseudo-absence data in the future distribution of a specific Oak phylogenetic group in Europe resulting from an ensemble of SDM projections considering three factors: 1) different SDMs techniques, 2) different realizations of randomly generated pseudo-absence data and 3) different climate projections produced over Europe from an ensemble of RCMs. The analyses undertaken with mopa reveal the sensitivity of SDMs to the pseudo-absence samples, affecting model stability and transferability to new climate conditions, with important implications for the construction of the final ensemble projections. We use and provide publicly available data to guarantee the reproducibility of the results.
The numerous climate databases available (both baselines and future
projections) are scattered across many different repositories with
various file formats, variable naming conventions, etc. sometimes
requiring relatively complex, time-consuming data downloads and
error-prone processing steps (e.g. bias correction) prior to SDM
development. This is also a major barrier for research reproducibility
and data exchange. The climate4R bundle is a
set of R packages specifically designed to ease climate data access,
analysis and processing in a straightforward manner, tailored to the
needs of the impacts and vulnerability assessment community. Further
details and references to worked examples and tutorials can be found for
instance in Cofino et al. (2017), Joaquín Bedia et al. (2017) and Frías et al. (2018). With this regard,
mopa was developed as part of the climate4R ecosystem, so
that typical climate data operations for SDM applications and conversion
features to the data type handled by mopa are
provided. Additionally, mopa includes a user
guide with an end-to-end worked example of climate data retrieval,
transformation and SDM development:
help(package = "mopa").
The popularity of R and its excellent statistical modeling and spatial analysis support has favored the development of specific, well-established and actively maintained packages for SDM construction and analysis, such as sdm (Naimi and Araújo 2016), biomod2 (Thuiller et al. 2016), dismo (Hijmans et al. 2017) and SDMTools (Van der Wal et al. 2014), some of them also implementing pseudo-absence data generation and ensemble building utilities. For instance, both sdm and biomod2 implement methods for building ensemble projections based on model performance in the calibration phase —e.g. by discarding or weighting the obtained results—. On the contrary, mopa is oriented towards the design and analysis of multi-factor ensembles of future SDM projections (considering as potential factors the presence data sets, the pseudo-absence realizations, the baseline climate, the modeling algorithms, and the future climate projections). The analysis of the resulting ensemble allows, for instance, assessing the problem of SDM transferability, which can not be properly evaluated during model calibration.
Besides, unlike previously existing packages, mopa allows pseudo-absence data generation as an independent step prior to model fitting, thus providing a finer control to the user for the analysis of several alternative methods and specific tuning options. In addition, the novel Three-Step method for pseudo-absence data generation is implemented (TS hereafter, Senay, Worner, and Ikeda 2013; Iturbide et al. 2015), providing a convenient interface that allows a fine tuning of the technique with simple arguments. Furthermore, mopa is also seamlessly integrated with standard R packages for spatial data manipulation like raster (Hijmans 2015) and sp (Pebesma and Bivand 2005; Bivand, Pebesma, and Gomez-Rubio 2013), allowing their usage at any stage of the modeling process (e.g. for data visualization and post-processing), and also extensibility to other SDM tools available in sdm, biomod2, …, also handling the same spatial data classes.
SDM predictor variables (in this case-study a number of bioclimatic
variables, but not necessarily so) are introduced in the analysis as
collections of raster objects of the classes
rasterBrick or rasterStack, similarly as other
SDM-oriented packages. For instance, here we use a set of present and
future bioclimatic variables widely used in SDM applications based on
precipitation and temperature climatologies (Busby 1991), using the function
biovars of package dismo. To this
aim, we first exploit the climate4R
functionalities to load and post-process observed precipitation and
temperature climatologies from the E-OBS gridded observational dataset
(Haylock et al. 2008) and the simulations
of 7 Regional Climate Model (RCMs) of the project ENSEMBLES (van der Linden and Mitchell 2009 http://www.ensembles-eu.org) for the control
(20C3M, 1971-2000) and future (A1B, 2071-2100) scenarios, including the
application of bias-correction ("delta" method,
e.g. Winkler et al. 1997; Zahn and von Storch 2010).
> install.packages("mopa")
> library(mopa)
> destfile <- tempfile()
> url <- paste0("https://raw.githubusercontent.com/SantanderMetGroup/",
+ "mopa/master/data/biostack.rda")
> download.file(url, destfile)
> load(destfile, verbose = TRUE)
Several impact studies indicate that species should be modeled by
treating sub-specific groups of organisms independently (e.g.: distinct
genetic linages) due to their differing adaptive responses to changes in
their environment (Hernandez et al. 2006;
Beierkuhnlein et al. 2011; Serra-Varela et al. 2015). Although
this is not always possible, due to the rare availability of information
on the distribution of sub-specific groups for most of species,
mopa has been conceived with this idea in
mind, being able to deal with several sets of presences simultaneously.
This adds flexibility to the modeling process in order to carry out
experiments considering different sub-collections of presences, not only
for sub-specific analyses (Iturbide et al.
2015), but also to address the sensitivity of the modeled
distributions to different characteristics of the training sample (e.g. the sample size, Hernandez et al. 2006; Mateo,
Felicísimo, and Muñoz 2010). Thus, the Oak_phylo2
mopa dataset contains a named list of length
two, containing the geographical coordinates of presence localities for
two different Oak phylogenies (H01 and H11, Petit
et al. 2002). More details about the source data are provided in
the help file of the dataset.
> data(Oak_phylo2)
> help(Oak_phylo2)
> presences <- Oak_phylo2$H11
The geographic background is often defined as the spatial extent of
the area considered in the SDM calibration stage. Here, we refer to the
background as a regular, geo-referenced grid with a specific size and
resolution, in which both the environmental variables and the presence
localities are located, so its grid-points are the sampling units.
Function backgroundGrid provides a simple way of generating
a backgroud using a raster-class object as reference. It
also includes an additional argument (spatial.subset) for
spatial subsetting, set by a raster::extent object or by
one or several sets of bounding-box coordinates, providing great
flexibility and ease of use for the analysis of SDM spatial aspects. For
instance, it allows straightforward exploration of SDM geographical
transferability or performing cross-validation experiments based on
spatial folds (e.g.: Randin et al. 2006).
As a result, when the object Oak_phylo2 is passed to
backgroundGrid, two different backgrounds are created by
default, each one spatially restricted by its phylogeny distribution
(H11 and H01).
> bg <- backgroundGrid(raster = biostack$baseline$bio1)
A smaller domain than the previous one can be arbitrarily indicated by the user by providing a specific spatial extent:
> bg.subdomain <- backgroundGrid(raster = biostack$baseline$bio1,
spatial.subset = extent(c(-10, 35, 45, 65)))
Similarly, the user might be interested in a background strictly
constrained by the bounding box of the actual species localities, by
just passing to spatial.subset their coordinates:
> bg.species <- backgroundGrid(raster = biostack$baseline$bio1,
spatial.subset = presences)
Thus, the user has flexibility to perform further modifications of
the background, so it would be also possible to discard specific areas
based on expert knowledge (e.g. Serra-Varela et
al. 2015). In this case study, we will retain the full background
(bg) for further analyses.
Most of SDMs require data not only from known presences of the biological entity, but also absence data in order to model the binary response presence/absence as a function of the different environmental variables. While the sampling efforts are typically focused on recording presence localities (atlases, natural history collections, targeted samplings, …), in most cases there is no explicit information about the absence of the species. Therefore pseudo-absence generation is often required for SDM construction, by sampling the background of the study domain. Different methods have been proposed to this aim, whose choice has an important effect on the final SDM results, as highlighted in different previous studies (e.g.: Wisz and Guisan 2009; Iturbide et al. 2015). However, there is no consensus on the best sampling design for generating pseudo–absences.
Pseudo-absence sampling in mopa is
performed by the pseudoAbsence function. It implements a
wide range of methodologies described in the literature (see Iturbide et al. 2015 for an overview and
comparison of methods) for maximum user flexibility, but at the
same time its arguments have been kept as simple as possible to ease its
application (Table 1). Here, three methods are
described: random sampling, random sampling with environmental profiling
and the three-step method. Their main characteristics are next briefly
described. A more extended explanation can be found in (Iturbide et al. 2015) and reference
therein.
| Argument | Description |
|---|---|
realizations |
Number of realizations of pseudo-absence generation |
exclusion.buffer |
Minimum distance to be kept between presence data and pseudo-absence data |
prevalence |
Proportion of presences against absences |
kmeans |
Performs a k-means clustering of the background to extract the pseudo-absences instead of sampling at random |
varstack |
RasterStack of variables for computing the
k-means clustering |
The RS method is the simplest and most frequent way of generating pseudo-absences (Iturbide et al. 2015). In the next example three times more pseudo-absences than presences are generated at random, keeping a \(0.249^{\circ}\) (\(\simeq 30\) km) exclusion buffer around known presence localities. Ten pseudo-absence realizations are considered:
> pa_RS <- pseudoAbsences(xy = presences, background = bg$xy,
realizations = 10, exclusion.buffer = 0.249,
prevalence = -0.5)
As an alternative to random sampling, a stratified sampling approach
can be performed, based on homogeneous environmental conditions. To this
aim, a clustering of the environmental space is applied following Senay, Worner, and Ikeda (2013) by setting
argument kmean to TRUE:
> pa_kmeans <- pseudoAbsences(xy = presences, background = bg$xy,
exclusion.buffer = 0.249,
prevalence = -0.5,
kmeans = TRUE, varstack = biostack$baseline)
The RSEP method imposes restrictions on the environmental range of
the background to be sampled for pseudo-absences. In
mopa this is done by performing an
environmental profiling of the background (function
OCSVMprofiling) that, following (Senay, Worner, and Ikeda 2013), applies a
one-class support vector machine algorithm (OCSVM, implemented in package e1071,
Meyer et al. 2017) returning a binary (presence/absence)
classification of the background gridboxes based solely on the presence
information (bg.profiled$presence and
bg.profiled$absence in the example below). Only the
predicted absence background is then retained for pseudo-absence
generation.
> bg.profiled <- OCSVMprofiling(xy = presences, varstack = biostack$baseline,
background = bg$xy)
> pa_RSEP <- pseudoAbsences(xy = presences, background = bg.profiled$absence,
realizations = 10, exclusion.buffer = 0.249,
prevalence = -0.5)
TS is based on imposing restrictions to both the environmental range
and the spatial extent of the background from which pseudo-absences are
sampled. This method has been shown to outperform other common
approaches in terms of resulting SDM robustness (Iturbide et al. 2015). The TS method adds an
additional step to the RSEP method, consisting on the partition of the
background space (as yielded by RSEP) in multiple bands using different
radius from presence localities. In the example below, multiple distance
bands with an increasing radius of 30 km between each other are created
(argument by = 0.249, in degrees). The first one (with the
shortest radius from presence localities) is at 30 km from the closest
presence point (start = 0.249), and the largest one (the
longest radius from presences) is set by default to half the length of
the diagonal of the background bounding-box (see
Iturbide et al. 2015 for more details).
> bg.radius <- backgroundRadius(xy = presences,
background = bg.profiled$absence,
start = 0.249, by = 0.249, unit = "decimal degrees")
> pa_TS <- pseudoAbsences(xy = presences,
background = bg.radius, realizations = 10,
exclusion.buffer = 0.249, prevalence = -0.5)
A spatial representation of the results yielded by the pseudo-absence methods described is next generated (Fig. 1):
> # Generates Fig. 1
> par(mfrow = c(2, 2), mar = c(2, 2, 2, 1.2))
> # Panel 1a (Presence data)
> plot(bg$xy, pch = 18, cex = 0.4, col = "gray", asp = 1)
> points(presences, pch = 18, cex = 0.6, col = "red")
> # Panel 1b (RS method)
> plot(bg$xy, pch = 18, cex = 0.4, col = "gray", asp = 1)
> points(pa_RS$species1$PA01[[1]], pch = 18, col = "darkviolet", cex = .6)
> points(pa_kmeans$species1$PA01[[1]], pch = 18, col = "yellow", cex = .6)
> points(presences, pch = 18, cex = 0.6, col = "red")
> # Panel 1c (RSEP method)
> plot(bg.profiled$absence, pch = 18, cex = 0.4, col = "gray", asp = 1)
> points(bg.profiled$presence, pch = 18, cex = 0.4, col = "aquamarine")
> points(pa_RSEP$species1$PA01[[1]], pch = 18, cex = 0.6, col = "darkviolet")
> points(presences, pch = 18, cex = 0.6, col = "red")
> # Panel 1d (TS method)
> plot(bg.radius[[1]]$km3120, col = "gray", asp = 1, pch = 18, cex = 0.4)
> points(bg.profiled$presence, pch = 18, cex = 0.4, col = "aquamarine")
> for (i in 1:10) {
l <- (11 - i) * 10
points(bg.radius[[1]][[l]],
col = gray.colors(10, start = .9,end = 0.1)[i],
pch = 18, cex = 0.4)
}
> points(pa_TS$species1$PA01[[50]], pch = 18, cex = 0.6, col = "darkviolet")
> points(presences, pch = 18, cex = 0.6, col = "red")
pseudoAbsences. (a)
Known presence locations of the Oak phylogeny H1 (red points) and
initial background for pseudo-absence sampling (grey grid points).
(b) pseudo-absences generated using the RS method
randomly (purple points) and with k-means clustering (yellow points).
(c) Pseudo-absences generated with the RSEP method
(purple), where the turquoise area corresponds to the discarded suitable
background space as identified by the OCSVM profiling approach.
(d) TS approach. Environmentally stratified as RSEP
(c), but also spatially stratified background, the different strata
(spatial extents) identified by the different gray-scale colors.
Pseudo-absences for one of the background extents (3120 km) are depicted
as example (purple points).Thus, mopa allows for the generation of a
wide range of combinations of environmental restriction criteria (using
OCSVMprofiling) and spatial extent constraints (using
backgroundRadius, see Table 2),
providing unrivalled functionality for the development and
inter-comparison of multiple pseudo-absence setups for SDM refinement
and ensemble prediction generation.
OCSVMprofiling |
backgroundRadius |
Method |
|---|---|---|
| \(\times\) | \(\times\) | No restriction (RS method) |
| \(\checkmark\) | \(\times\) | Environmental restriction (RSEP method) |
| \(\checkmark\) | \(\checkmark\) | Environmental and spatial restriction (TS method) |
| \(\times\) | \(\checkmark\) | Spatial restriction (Particular case of RS) |
Once the pseudo-absence dataset(s) chosen by the user is(are) built,
the mopaTrain function performs SDM fitting. The function
is a wrapper for different statistical method implementations commonly
used in SDM applications (see summary in Table 3).
Moreover, mopaTrain adds extended functionality for
cross-validation for each set of presence/absence data and for each
different species contained in the presence dataset, as routinely done
in SDM applications (see e.g.: Verbyla and
Litvaitis 1989). In the next line of code, the Oak H1 phylogeny
is fitted using a generalized linear model (GLM,
A. Guisan, Edwards, and Hastie 2002) and multivariate adaptive
regression splines (MARS, Friedman 1991),
applying a 10-fold cross validation approach. Moreover, equal weighting
of presences and pseudo-absences is indicated with the argument
weighting = TRUE (see e.g.:
Barbet-Massin et al. 2012).
> trainRS <- mopaTrain(y = pa_RS, x = biostack$baseline, weighting = TRUE,
k = 10, algorithm = c("glm", "mars"))
| SDM technique | algorithm value |
*****pkg*::function |
Reference |
|---|---|---|---|
| Generalized Linear Model | "glm" |
stats::glm |
Part of R |
| Random Forest | "rf" |
ranger::ranger |
(Wright and Ziegler 2017) |
| Multivariate Adaptive Regression Splines | "mars" |
earth::earth |
(Milborrow 2017) |
| Maximum Entropy | "maxent" |
dismo::maxent |
(Hijmans et al. 2017) |
| Support Vector Machine | "svm" |
e1071::best.svm |
(Meyer et al. 2017) |
| Classification and regression tree (tree) | "cart.tree" |
tree::tree |
(Ripley 2016) |
| Classification and regression tree (rpart) | "cart.rpart" |
rpart::rpart |
(Therneau, Atkinson, and Ripley 2017) |
After the generation of TS pseudo-absences, multiple background
extents exist as a result of the different distances defined by
backgroundRadius. It has been noted that the background
extent from which pseudo-absences are sampled is an important factor
affecting not only model performance, but also its transferability and
biological meaning (Van der Wal and Shoo
2009). With this regard, (Iturbide et al.
2015) propose a selection criterion based on the response of
model performance as a function of distance radius, that is
generalizable to different SDM characteristics and spatial scales. With
this regard, the performance criterion chosen is the Area Under the ROC
Curve (AUC), one of the most widely used accuracy measures of binary
classification systems (Swets 1988).
Essentially, the method performs a non-linear regression of the AUC
obtained by each SDM extent against their background radius, considering
three possible asymptotic models (Fig. 2):
Michaelis-Menten model: \(v(x) = \dfrac{ax}{Km + x}\)
2-parameter exponential model: \(v(x) = a(1 - e^{-bx})\)
3-parameter exponential model: \(v(x) = a - be^{-cx}\)
, where \(v\) and \(x\) represent the AUC and the background
extent respectively. \(a\) is the
asymptotic AUC value achieved by the system and \(a - b\) is the intercept. \(Km\) is the Michaelis constant (i.e. the
extent at which the AUC is half of \(a\), and \(c\) is the coefficient of the point where
the curve is most pronounced. The asymptotic model that better fits the
AUC response to the different background extents is automatically
selected to extract the AUC asymptotical value. The minimum extent at
which the AUC lies above the asymptote is retained as the optimal
threshold radius, being the corresponding fitted SDM returned. The
asymptotic models are fitted internally by mopaTrain via
the nls function from package
stats always the TS method is used (this is
automatically detected by the function). Optionally, a diagram
displaying the results is also returned by setting the argument
diagrams=TRUE (Fig. 2).
> # Train TS model and generate Fig. 2
> trainTS <- mopaTrain(y = pa_TS, x = biostack$baseline, weighting = TRUE,
k = 10, algorithm = c("glm", "mars"), diagrams = TRUE)
mopaTrain SDM output.The object returned by mopaTrain is a list of several
components generated in the model calibration and evaluation process.
Several performance measures are included apart from the AUC, like the
True Skill Statistic (TSS) and Cohen’s Kappa obtained in the
cross-validation, frequently use for the assessment in SDMs (Allouche, Tsoar, and Kadmon 2006). These and
other ocmponents of the SDM fitted object can be accessed using
extractFromModel. For, instance, to extract the TSS:
> tss.RS <- extractFromModel(models = trainRS, value = "tss")
However, and for maximum user flexibility, a matrix containing the observed and predicted probability values for each calibration point is returned, allowing other types of user-tailored model performance assessments.
> ObsPred.RS <- extractFromModel(models = trainRS, value = "ObsPred")
The fitted models are stored in the "model" (or
"fold.models") component, required for subsequent model
prediction.
> models.RS <- extractFromModel(models = trainRS, value = "model")
Additionally, variable importance may be also estimated. One
straightforward possibility is to pass the fitted models to function
varImp from package caret
(Kuhn 2017).
SDM predictions are obtained by passing a new set of predictors
(e.g.: future bioclimatic variables) to the generated models. The
model component corresponds to the models fitted using all
available data for model training, while the SDM predictions for the
k-cross-validation setup are generated from the component
fold.models –instead of model–. Thus,
mopa allows handling both the cross-fitted
models for flexible model performance assessment and the global model
–fitted with all presences and pseudo-absences– for predicting
distributions, accomplished through the use of the function
mopaPredict. In the following example, models corresponding
to the RS method are projected to reference climate conditions
(biostack$baseline) and to 7 future climate projections
(biostack$future):
> ensemble.present <- mopaPredict(models = models.RS,
newClim = biostack$baseline)
> ensemble.future <- mopaPredict(models = models.RS,
newClim = biostack$future)
Projections returned by mopaPredict are structured in a
nested list. Each depth or level in the list
corresponds to a different component. These are: presence data sets
(SP), pseudo-absence realizations (PA),
modeling algorithms (SDM), baseline climate
(baseClim), and the new climate (newClim) used
to project models (e.g. future climate projections). The function used
to extract components is extractFromPrediction. In the next
example, projections corresponding to the first pseudo-absence
realization (object rcms_run1) and to the future climate
projection from the MPI RCM (object runs_rcm1) are
extracted:
> rcms_run1 <- extractFromPrediction(ensemble.future, "PA01")
> runs_rcm1 <- extractFromPrediction(ensemble.future, "MPI")
Then, the function is again applied to object runs_rcm1
to extract the SDM results for MPI and GLM. The resulting object is of
S4-class raster*, thus being straightforward to apply any
of the plotting/analysis methods for spatial objects. Here, we use
spplot from sp for output
visualization (Fig. 3).
> glm_runs_rcm1 <- extractFromPrediction(runs_rcm1, "glm")
> # Generates Fig. 3
> data(wrld)
> spplot(glm_runs_rcm1, layout = c(5, 2), at = seq(0, 1, 0.1),
col.regions = colorRampPalette(c("white", "red3")),
sp.layout= list(wrld, first = FALSE, lwd = 0.5))
glm_runs_rcm1.Thus, it is easy to explore the results by inspecting the different
components of the mopaPredict outputs. For instance, the
raster package can be particularly useful this
aim allowing for a wide variety of map algebra operations through the
function stackApply over user-defined subsets of SDM
projections.
The relative contribution of each component to the total ensemble
spread/variability is implemented in mopa
using an ANOVA approach, through the function
varianceAnalysis, following the method in (Déqué et al. 2012), also applied by (San-Martín et al. 2016). For instance, in this
example, the total variance \(V\) can
be decomposed as the summation of the variance explained by the
pseudo-absence realization \(P\), the
RCM \(R\) and the combination of both
\(PR\), so \(V = P + R + PR\). Let \(i\) be the index of the pseudo-absence
realization \((i = 1, \dots, 10)\),
\(j\) the index of the RCM \((j = 1, \dots, 7)\), and \(X_{ij}\) is the response (e.g.: the
predicted distribution for the particular realization and climate
projection). Then:
\[\label{eq:P and R} P = \frac{1}{10}\sum\limits_{i=1}^{10}(X_{i}-\bar{X})^2 \quad \text{and} \quad R = \frac{1}{7}\sum\limits_{j=1}^{7}(X_{j}- \bar{X})^2 (\#eq:P-and-R)\]
are the terms resulting from the realization alone (\(P\)), and RCM alone (\(R\)), and
\[\label{eq:PR} PR = \frac{1}{10}\sum\limits_{i=1}^{10}\frac{1}{7}\sum\limits_{j=1}^{7}(X_{ij}-X_{i}-X_{j}+\bar{X})^2 (\#eq:PR)\]
is the interaction term of the realization with the RCM (\(PR\)). The following example shows the
analysis performed for the pseudo-absence realizations
(component1 = "PA") and the climate projections
(component2 = "newClim") in GLM projections
(fixed = "glm"). In order to illustrate thoroughgoing
information on the spread in the projected potential distributions,
variance percentage maps are returned together with the maps of the mean
and standard deviation. Again, the results can be conveniently
visualized with function spplot (Figs. 4 and 5).
> var.glm <- varianceAnalysis(predictions = ensemble.future,
component1 = "PA", component2 = "newClim", fixed = c("glm"))
> # Generates Fig. 4
> spplot(var.glm$mean,
at = seq(0,1,0.1),
col.regions = colorRampPalette(c("white", "red3")),
sp.layout= list(wrld, first = FALSE, lwd = 0.5))
> # Generates Fig. 5
> spplot(var.glm$variance,
col.regions = rev(gray.colors(10, end = 1)),
at = seq(0, 100, 10),
sp.layout= list(wrld, first = FALSE, lwd = 0.5))
var.glm$mean).var.glm$var).Figures 4 and 5 depict the ensemble SDM projections and the
variance analysis results, applied to the set of projections that
correspond to the 10 pseudo-absence realization and 7 climate
projections (10 realizations x 7 RCMs). The mean suitability map and the
standard deviation are shown in Figure 4,
while Figure 5 are the variance fraction maps
(%), depicting the contribution of each component (realization, RCM and
realization & RCM) to the overall variance. For instance, the
results displayed in Figure 5 unveil that the
RCM choice (component newClim) is by far the most important
factor contributing to the ensemble spread, while pseudo-absence
realization has some impact in areas that are outside the current domain
of the Oak phylogeny H1 (e.g. Scandinavia).
Similarly, the next lines perform the same analysis, but considering MARS instead of GLM as the statistical modeling technique (Figs. 6 and 7):
> var.mars <- varianceAnalysis(predictions = ensemble.future,
component1 = "PA", component2 = "newClim", fixed = c("mars"))
> # Generates Fig. 6
> spplot(var.mars$mean,
at = seq(0,1,0.1),
col.regions = colorRampPalette(c("white", "red3")),
sp.layout= list(wrld, first = FALSE, lwd = 0.5))
> # Generates Fig. 7
> spplot(var.mars$variance,
at = seq(0, 100, 10),
col.regions = rev(gray.colors(10, end = 1)),
sp.layout= list(wrld, first = FALSE, lwd = 0.5))
var.mars$mean).var.mars$var).Unlike GLM, in the case of MARS the ensemble spread (Fig. 6) is greatly affected by the pseudo-absence
realization in a wide area of the study domain (Fig. 7), specially in peripheral regions. This is
unequivocally diagnosed after applying function
varianceSummary, which provides a summary of the results,
including a graph (Fig. 8) and allowing
the comparison of multiple results for a particular uncertainty
component. This summary is based on the spatial subsetting of the study
area, by specifying the number of subsets with argument
regions. The output boxplot (Fig. 8) shows the spatial spread of the results
(variance proportion explained by a component and the total standard
deviation) in each region.
As a result, in Figure 8, we compare
GLM and MARS (var.glm and var.mars) with
regard to the variance proportion explained by the RCM choice
(component = 2L), so that the percentage not explained by
it, is associated to the pseudo–absence realization. From this summary,
we can confirm a significantly higher sensitivity of MARS to the
pseudo–absence sample across all regions.
> # Generates Fig. 8
> varianceSummary("glm" = var.glm, "mars" = var.mars,
component = 2L, regions = c(6, 6), drawBoxplot = TRUE)
varianceSummary, where GLM
and MARS techniques (brown and blue respectively) are compared. Boxes
account for the spatial spread of the results in each region. Empty
boxes show the variance proportion explained by the RCM choice
(component newClim) and filled boxes show the overall
spread, this is, the standard deviation of the predicted probability
expressed as a percentage. Thus, empty boxes show how is the total
spread (filled boxes) distributed between components (PA
and newClim). The x-axis corresponds to the regions shown
in the map at the top.Alternatively, a SpatialPolygons object (package
sp) can be passed to argument
regions in order to focus the analysis on specific areas of
interest. For illustrative purposes, in this example we use the climatic
regions defined in the EU-funded PRUDENCE project (Christensen and Christensen 2007), which is
available at the climate4R package
visualizeR (Frías et
al. 2018):
> regiondir <- tempfile()
> download.file(paste0("https://github.com/SantanderMetGroup/",
+ "visualizeR/raw/devel/data/PRUDENCEregions.rda"), destfile = regiondir)
> load(regiondir)
> varianceSummary("glm" = var.glm, "mars" = var.mars,
component = 2L, drawBoxplot = FALSE, regions = PRUDENCEregions)
Additionally, if argument drawBoxplot is set as
FALSE, a simpler graph is obtained displaying the points of
the spatial mean. This might be useful when multiple results are being
compared in the same graph.
The much higher sensitivity of MARS to the pseudo–absence sample warns about its instability, while GLM reveals much better properties in terms of model stability and transferability. These findings are possible after ANOVA analysis thanks to the utilities included in mopa, enabling a flexible experimental setup with a simple user interface. Model transferability is thus not apparent during the SDM calibration stage and is not coupled to model performance (even with the application of the 10-fold cross validation approach), so for instance TSS among realizations was 0.82 for GLM and 0.85 for MARS, and the mean AUC, 0.91 and 0.92 respectively. The uncertainty analysis results are extremely valuable for the construction of an ensemble of SDM projections that minimizes the risk of including unuseful realizations, thus yielding more plausible results.
In the same vein, the contribution of pseudo–absences in front SDM
techniques to the overall spread is achieved by adding a new component
argument to varianceAnalysis, while the RCM projection (MPI
in this example) is kept as a fixed factor:
> MPI.var <- varianceAnalysis(ensemble.future,
component1 = "PA", component2 = "SDM", fixed = c("MPI"))
In case further uncertainty components are considered for predicting
distributions (named in mopa as
SP, baseClim and foldModels),
these could also be analyzed by keeping several fixed factors, each
corresponding to a component that is not being analyzed. This is
explained in detail in the help document of function
"varianceAnalysis".
> help(varianceAnalysis)
Finally, the ensemble forecast is built. In this particular example,
we could discard those MARS projections that we consider are the result
of bad transferability, e.g. corresponding to the pseudo-absence
realizations that resulted in unrealistic predictions. Let us consider
the simplified case where, after a more detailed analysis of the
results, we conclude that MARS projections corresponding to
pseudo–absence realization 8 along with GLM projections, are valid
forecasts, then, as shown in the next example, the definitive ensemble
is easily built with function extractFromPrediction and the
utilities of the raster package. Here we
calculate and plot the ensemble mean and standard deviation of the final
SDM ensemble projections (Fig. 9):
> marsEns <- extractFromPrediction(ensemble.future, value = "mars")
> marsEnsPA08 <- extractFromPrediction(marsEns, value = "PA08")
> glmEns <- extractFromPrediction(ensemble.future, value = "glm")
> ensemble.future.def <- stack(list(glmEns, marsEnsPA08))
> mean.ensemble <- stackApply(ensemble.future.def, fun = mean,
indices = rep(1, nlayers(ensemble.future.def)))
> sd.ensemble <- stackApply(ensemble.future.def, fun = sd,
indices = rep(1, nlayers(ensemble.future.def)))
> forecast.future <- stack(mean.ensemble, sd.ensemble)
> names(forecast.future) <- c("ensemble mean", "ensemble sd")
> # Generates Fig. 9
> spplot(forecast.future, at = seq(0,1,0.1),
col.regions = colorRampPalette(c("white", "red3")),
sp.layout= list(wrld, first = FALSE, lwd = 0.5))
Basically, this is a weighting exercise that favors GLM predictions in front of those of MARS, beyond the performance shown in the calibration phase.
The impacts of climate change on the biological systems are of current concern worldwide, and future SDMs have become a key tool for the vulnerability and impact assessment community. Thus, the utilities in package mopa can help in the SDM production chain since the early stage (climate data retrieval and post-processing) to the ultimate phase in which a final set of models is retained for ensemble generation and map production.
In this case-study, we illustrate the development of a set of SDM
projections considering multiple combinations of climate change
projections from a set of state-of-the-art RCMs, two popular statistical
modeling methods (GLM and MARS) and different pseudo-absence
realizations, enabling the identification of those members of the
ensemble yielding consistent and plausible future estimates for final
SDM building. With this regard, the ability to quantitatively assess the
individual contribution of each factor to the overall SDM spread, as
implemented in function varianceAnalysis proved crucial in
the evaluation. While previously existing R packages already provide
functionality for SDM building and their assessment during the
calibration stage, we have shown that model performance, as evaluated by
ordinary cross-validation, is not coupled to model transferabilty into
future climate, being therefore this essential feature specific of
mopa. Other characteristic aspects introduced
by the package consist of the novel methods for pseudo-absence
generation, and the ability to perform a fine-tuning of these methods
prior to model fitting. Furthermore, the inter-operability of
mopa with other SDM-related R packages enables
maximum flexibility and eases the use of R for SDM applications in the
framework of complex modeling exercises, for which multiple aspects have
a varying contribution to the overall uncertainty.
We are grateful to the one anonymous referee for her/his valuable comments. We acknowledge the ENSEMBLES project (GOCE-CT-2003-505539), supported by the European Commission’s 6th Framework Program for providing publicly the RCM simulations and observational data used in this study. We are also grateful to Rémy Petit and François Ehrenmann for providing the distribution of Oak phylogenies.