Abstract

Integrating R with Geographic Information Systems (GIS) extends R’s statistical capabilities with numerous geoprocessing and data handling tools available in a GIS. QGIS is one of the most popular open-source GIS, and it furthermore integrates other GIS programs such as the System for Automated Geoscientific Analyses (SAGA) GIS and the Geographic Resources Analysis Support System (GRASS) GIS within a single software environment. This and its QGIS Python API makes it a perfect candidate for console-based geoprocessing. By establishing an interface, the R package RQGIS makes it possible to use QGIS as a geoprocessing workhorse from within R. Compared to other packages building a bridge to GIS (e.g., rgrass7, RSAGA, RPyGeo), RQGIS offers a wider range of geoalgorithms, and is often easier to use due to various convenience functions. Finally, RQGIS supports the seamless integration of Python code using reticulate from within R for improved extendability.

Introduction

Defining a GIS as a system for the analysis, manipulation and visualization of geographical data (Longley et al. 2011), one could argue that R has become a GIS (R. Bivand, Pebesma, and Gómez-Rubio 2013). In great part this is thanks to packages that provide spatial classes and algorithms coded in and for R (despite this these packages might also link to other software outside of R). These include maptools (Roger Bivand and Lewin-Koh 2017), raster (Hijmans 2017), sp (R. Bivand, Pebesma, and Gómez-Rubio 2013) and sf (Pebesma 2017). Further packages even extend R’s GIS capabilities through advanced mapping, e.g., mapview (Appelhans et al. 2017) and mapmisc (Brown 2016), and routing, e.g., osmar (Eugster and Schlesinger 2013) and dodgr (Padgham and Peutschnig 2017), among others. Despite this, native R (in the sense of coded in and for R) lacks fundamental GIS capabilities. GIS topology and topological operations are only partially (RArcInfo, Gómez-Rubio and López-Quílez 2005) or indirectly available via rgrass7 (Roger Bivand 2017). Furthermore, R is neither a spatial database management system nor especially good at the manipulation of large data sets (Ripley, Bates, and DebRoy 2016). Hence, computationally demanding GIS operations (point cloud processing, overlay operations on ‘big’ spatial data) executed in R may be rather slow. Performance and scalability, of course, depend on the computer hardware, and cloud computing may eventually alleviate or even settle this problem. Yet, most R users most likely still work on a local machine. What is more, R is lacking a number of fundamental GIS operations such as the derivation of various terrain attributes from a digital elevation model (DEM). And the same is true for 3D data visualization and voxel processing (Tomislav Hengl et al. 2015). Finally, though interactive tasks such as digitizing of geodata have become possible within R very recently (mapedit, Appelhans and Russell 2017), extensive manual editing is better done with the help of a GIS.

Figure 1: The R-interface to geospatial software - geospatial libraries, Desktop GIS, geobrowsers as well as web mapping and the position of RQGIS (left circle; WMS: Web Mapping Service). QGIS and corresponding third-party providers (right circle, the upper three symbols correspond to (from left to right): LiDAR tools, TauDEM, Orfeo Toolbox.

Many of R’s geospatial shortcomings could potentially be addressed through R programming directly. However, R was designed from the very beginning as an interactive interface to the algorithms of other software (Chambers 2016). Hence, it is unnecessary and even counterproductive to duplicate the functionality provided by an existing dedicated software with an expert developer and user community as long as there is a way to access it from within R. Therefore, it is barely surprising that numerous R packages provide access to third-party geoprocessing tools (Figure 1), only some of which will be discussed here. rgdal (Roger Bivand, Keitt, and Rowlingson 2017) accesses the geospatial data abstraction library (GDAL/OGR) (GDAL Development Team 2017). rgeos (Roger Bivand and Rundel 2017) is an interface to geometry engine - open source (GEOS, GEOS Development Team 2017), which opens the way to GIS vector operations. However, GEOS performance is somewhat limited. Think, for instance, of the spatial union of all US American census tracts and postal code layers, and it may be quite possible that rgeos::gUnion may take a very long time. The successor of the sp package, package sf combines the functionality of sp (spatial classes), rgdal (here: import/export of spatial vector data) and rgeos (geometrical operations) in just one package. Note also that GEOS is a C API for topology operations on geometries. Consequently, it expects topologically correct data. To make sure that our geodata lives up to topological expectations in general, our best approach is probably through another third-party integration, namely R-GRASS (R. Bivand 2007; Roger Bivand 2017). Additionally, GRASS GIS comprises a large suite of vector and raster functions. Basically, the user has to set up a spatial database before being able to use GRASS’s geoprocessing utilities (Neteler and Mitasova 2008). Hence, less experienced GIS users will likely prefer faster-to-use GIS interfaces also providing extensive geoprocessing capabilities. In particular, RSAGA (Alexander Brenning, Koszinski, and Sommer 2008) integrates R with SAGA (Conrad et al. 2015) and RPyGeo (Alexander Brenning 2012a) provides an interface to ArcGIS (ESRI 2017), which is probably still the most popular GIS environment in the world with >1 million users and the greatest market share among proprietary GIS (Longley et al. 2011).

What has been missing, however, is an R interface to one of the most widely used open-source GIS, QGIS (QGIS Development Team 2017; Graser and Olaya 2015). So far, the QGIS processing toolbox provided only the opposite interface by letting the user integrate R scripts as a user-defined ‘tool’ in QGIS. This is fine for people unwilling to use R directly. However, interfacing from R to QGIS has multiple benefits to the R user community. First and foremost, native QGIS geoalgorithms are now available from within R for the first time. Moreover, it is a special feature of QGIS that it acts as an umbrella integrating various other GIS power houses under its hood. These include SAGA, GRASS, GDAL, the Orfeo Toolbox (Inglada and Christophe 2009), TauDEM (Tarboton and Mohammed 2017) and additional tools for light detection and ranging (LIDAR) data (Rapidlasso 2017). RQGIS (Muenchow and Schratz 2017) brings this incredibly powerful geoprocessing environment to the R console in just one package. This, however, does not mean that specialized packages such as RSAGA and rgrass7 (R. Bivand 2007) will become obsolete, as discussed later. RQGIS also aims to be user-friendly by automatically retrieving GIS function parameter names and corresponding default values as well as supporting R named arguments for geoalgorithm parameters through the ellipsis argument.

In general, R–GIS interfaces open the way to extremely powerful and innovative statistical geoprocessing as for example shown by Alexander Brenning (2008), T. Hengl, Heuvelink, and van Loon (2010), Muenchow, Brenning, and Richter (2012), Vanselow and Samimi (2014), A. Brenning et al. (2015) M. Mergili, Krenn, and Chu (2015), Martin Mergili and Kerschner (2015), Poggio and Gimona (2015) and Zandler, Brenning, and Samimi (2015). In this paper we will first introduce the general architecture and main features of the RQGIS package. We will then demonstrate the application of this integrated scientific programming approach with an ecological example. Subsequently, we will show how to easily complement and extend RQGIS with Python programming, especially PyQGIS (Sherman 2014). In our discussion, we will finally compare and contrast RQGIS with other approaches to R–GIS integration, and provide an outlook and motivation for future developments.

Introducing the RQGIS package

Basic concepts

The RQGIS package utilizes the QGIS Python API in order to access QGIS modules. To successfully run the QGIS Python API, RQGIS first sets up all required environment variables (Figure 2). And secondly, it establishes a tunnel to Python using reticulate (Allaire, Tang, and Geelnard 2017) - a package providing an R interface to Python. The older package rPython (Bellosta 2015) is similar to reticulate, however, it is only available for Unix-based systems which is why we had to dismiss it as an option for RQGIS. With reticulate, we set up the Python environment only once, and use the resulting tunnel to exchange functions and objects between R and Python seamlessly.

We can divide RQGIS roughly into two major components:

The Python code (python_funs.py located in inst/python of RQGIS) defines a Python class named "RQGIS" with methods to be called during the geoprocessing. Defining an own class has the additional benefit that it becomes highly unlikely that (advanced) users interacting with the QGIS Python API accidentally overwrite some of our predefined methods.
The processing.R file (found in the R folder of RQGIS) actually establishes the QGIS Python interface, and lets the user run QGIS from within R. The most important functions are (see also 2.2 for a detailed description):
1. open_app() to establish a tunnel to Python and a QGIS custom application
2. find_algorithms() to retrieve the QGIS command-line names for all available geoalgorithms
3. open_help() and get_args_man() to access help resources as well as function arguments and default values
4. run_qgis() to call QGIS geoalgorithms from within R

The most notable features of RQGIS are:

For the first time, native QGIS algorithms are available from within
Additionally, RQGIS provides access to hundreds of third party geoalgorithms including GDAL, GRASS GIS and SAGA GIS. In the future many more integrations can be expected. For instance, there is already a plugin providing access to PostGIS geoprocessing tools (clip, dissolve, distance, etc.) available in the QGIS processing toolbox (https://plugins.qgis.org/plugins/postgis_geoprocessing/).
R users can stay in their preferred programming language without having to touch Python.
Convenient access to QGIS help resources facilitates the geoprocessing work flow. While open_help accesses the QGIS online help for a specific geoalgorithm, get_args_man() retrieves function arguments and their default values.
run_qgis() also accepts "sf", "sp" and "raster" objects as arguments. Similarly, users may directly load the QGIS output into R by setting load_output to TRUE when using run_qgis().

Figure 2: Conceptual model of how RQGIS calls QGIS from within R.

Usage

Since RQGIS is an interface to various GIS software packages, the user needs to install this software beforehand. To facilitate the installation process we have written an installation guide, see http://jannes-m.github.io/RQGIS/articles/install_guide.html. Or after having installed the package, one can also access the corresponding vignette by typing:

vignette("install_guide", package = "RQGIS")

We will demonstrate the usage of RQGIS by showing how to compute the plan and tangential curvatures of a digital elevation model (DEM). The first thing to do is to make sure, that all paths are set correctly to successfully run the Python API from within R. Function set_env() facilitates this since the user only needs to specify the root path to the QGIS installation. If the root path remains unspecified, set_env() tries to be smart by checking the default QGIS installation directories. If this is unsuccessful, set_env() will try to find the QGIS installation on the computer which may be time-consuming especially on Windows machines. A much faster way is to explicitly indicate the root path. For Windows this might look like this qgis_env <- set_env(root = ’C:/OSGEO4~1’).

Subsequently, set_env() finds all required paths. Virtually all subsequent RQGIS functions require the output list of set_env(). This is why, RQGIS automatically caches the output of set_env(), and reuses it when required by another function later on. To establish a tunnel to the QGIS Python API, we run open_app(). Explicitly, the function sets all necessary paths (e.g., path to the QGIS Python binary) to successfully run QGIS, and secondly opens a QGIS custom application (i.e., outside of the QGIS GUI interface) while importing necessary Python modules.

library("RQGIS")
set_env()
open_app()

Running set_env() and open_app() is optional here since all subsequent functions dependent on their output will run them automatically in case they have not been executed before. To work in a reproducible manner, and to find out which QGIS and third-party GIS versions we are using, we execute:

## $root
## [1] "C:/OSGeo4W64"
## 
## $qgis_prefix_path
## [1] "C:/OSGeo4W64/apps/qgis"
## 
## $python_plugins
## [1] "C:/OSGeo4W64/apps/qgis/python/plugins"

info_r <- version
info_qgis <- qgis_session_info()
c(platform = info_r$platform, R = info_r$version.string, info_qgis)
## $platform
## [1] "x86_64-w64-mingw32"
## 
## $R
## [1] "R version 3.4.2 (2017-09-28)"
## 
## $qgis_version
## [1] "2.18.14"
## 
## $gdal
## [1] "2.2.2"
## 
## $grass6
## [1] "6.4.3"
## 
## $grass7
## [1] "7.2.2"
## 
## $saga
## [1] "2.3.2"

Continuing with our analysis, we need to find out the command-line name of a geoalgorithm available in QGIS that computes the curvatures from a DEM. find_algorithms() lets the user use regular expressions to search for a function which contains the search terms in its short description. Leaving the search_term-argument empty, will return all available geoalgorithms. Here, we assume that the function we are looking for contains the word ‘curvature’ in its short description. Setting name_only to TRUE gives back the name of the geoalgorithm instead of its name plus the corresponding short description.

find_algorithms(search_term = "curvature", 
                name_only = TRUE)
## [1] "grass7:r.slope.aspect"             "saga:curvatureclassification"     
## [3] "saga:slopeaspectcurvature"         "saga:upslopeanddownslopecurvature"
## [5] "grass:r.slope.aspect"

Several functions are available for our task, we will go on with function grass7:r.slope.aspect. To familiarize ourselves with the function, we can access its online help by calling (not shown):

open_help(alg = "grass7:r.slope.aspect")

Next, we would like to know how to use a specific geoalgorithm. get_usage() prints the parameters and default values for a given geoalgorithm to the console.

get_usage(alg = "grass7:r.slope.aspect")
## ALGORITHM: r.slope.aspect - Generates raster layers of slope
## aspect
## curvatures and partial derivatives from a elevation raster layer.
##  elevation <ParameterRaster>
##  format <ParameterSelection>
##  precision <ParameterSelection>
##  -a <ParameterBoolean>
##  zscale <ParameterNumber>
##  min_slope <ParameterNumber>
##  GRASS_REGION_PARAMETER <ParameterExtent>
##  GRASS_REGION_CELLSIZE_PARAMETER <ParameterNumber>
##  slope <OutputRaster>
##  aspect <OutputRaster>
##  pcurvature <OutputRaster>
##  tcurvature <OutputRaster>
##  dx <OutputRaster>
##  dy <OutputRaster>
##  dxx <OutputRaster>
##  dyy <OutputRaster>
##  dxy <OutputRaster>
## 
## 
## format(Format for reporting the slope)
##  0 - degrees
##  1 - percent
## precision(Type of output aspect and slope layer)
##  0 - FCELL
##  1 - CELL
##  2 - DCELL

get_args_man() lets us retrieve automatically a corresponding parameter-argument list. Setting the options parameter to TRUE (the default) automatically chooses the default value from a list of possible options for a parameter. This is always the first option which is in accordance with the QGIS GUI behavior. To make the user aware of the automatically chosen options, get_args_man() prints the corresponding values to the console.

params <- get_args_man(alg = "grass7:r.slope.aspect", options = TRUE)
## Choosing default values for following parameters:
## format: 0
## precision: 0
## See get_options('grass7:r.slope.aspect') for all available options.

grass7:r.slope.aspect has 17 parameters. Here, we only show the first six parameters for brevity.

head(params)
## $elevation
## [1] "None"
## 
## $format
## [1] "0"
## 
## $precision
## [1] "0"
## 
## $`-a`
## [1] "True"
## 
## $zscale
## [1] "1.0"
## 
## $min_slope
## [1] "0.0"

Next, we specify the required arguments. Of course, grass7:r.slope.aspect expects a spatial input file, here a DEM. Conveniently, run_qgis() accepts as input both a path to a spatial file stored on disk or a spatial object residing in R’s environment (specifically "raster"-, "sp"- and "sf"-objects). Here, we use a DEM that comes as an example file with the RQGIS package. Note that run_qgis() simply saves spatial input files to a temporary output location. Hence, if the file already exists on disk, it is much more efficient to indicate the path to the file instead of loading it into R and letting run_qgis() export it again. By contrast, indicating output paths is not strictly necessary. If an output path parameter equals None (QGIS default), QGIS automatically creates an output file which it saves to a temporary processing output folder. In the code chunk below, we specifically indicate curvature outputs while keeping the default, i.e. None, for the remaining output parameters slope, aspect, dx, dy, dxx, dyy and dxy (check out the GRASS function documentation for more information, i.e. run open_help("grass7:r.slope.aspect")). run_qgis() prints all output paths to the console if show_output_paths is set to TRUE. Here, we turn this behavior off for two reasons. First, we are not interested in the output paths of the seven terrain attributes we left unspecified. Secondly, we have explicitly specified the curvature output paths, i.e., we already know the corresponding output locations. We recommend to specify those output paths which are relevant to the analysis. Manual specification has the additional benefit that we can indicate a specific file format (QGIS default for raster data sets is in most cases .tif, but we might want to use e.g., .asc or SAGA’s .sdat). Additionally, loading QGIS output directly back into R (load_output-parameter) only works with output paths specified by the user.

data("dem", package = "RQGIS")
out <- run_qgis(alg = "grass7:r.slope.aspect",
                elevation = dem, 
                pcurvature = file.path(tempdir(), "pcurv.tif"),
                tcurvature = file.path(tempdir(), "tcurv.tif"),
                show_output_paths = FALSE,
                load_output = TRUE)

Note that we used R named arguments in run_qgis(), i.e., we assigned values or objects to the parameters whose names we have identified with the help of get_args_man() or get_usage(). We could replace the R named arguments also by a parameter-argument list. Remember that we have already created a parameter-argument list named params using get_args_man() (see above):

params$elevation <- dem
params$pcurvature <- file.path(tempdir(), "pcurv.tif")
params$tcurvature <- file.path(tempdir(), "tcurv.tif")
out <- run_qgis(alg = "grass7:r.slope.aspect",
                params = params,
                load_output = TRUE,
                show_output_paths = FALSE)
class(out)
## [1] "list"
names(out)
## [1] "pcurvature" "tcurvature"

However, providing R named arguments and a parameter-argument list is not possible, and run_qgis() will complain telling the user to user either one of these but not a mixture. Since we have set load_output to TRUE, run_qgis() automatically loads the QGIS output into R. In this case, the object out is a list with two "raster" objects. If we only had specified one output raster, out would have been a "RasterLayer" object. In case the output is a vector layer, run_qgis() will load it as an "sf" object. To have a look at the output, we can execute following code (not shown):

library("raster")
plot(stack(out))

Concerning the handling of parameter-argument pairs, run_qgis() uses get_args_man() through pass_args() in the background to access the default values of all missing arguments if available. If the user accidentally omits a required argument, run_qgis() will return an error message that informs about the missing argument. The help documentation of pass_args() presents a detailed list of argument checks that are run before executing run_qgis().

Experienced GRASS users may wonder if there is a need to specify the GRASS_REGION_PARAMETER. pass_args() determines this parameter automatically based on the spatial layers provided as input by the user (in our example above this is dem). However, the GRASS_REGION_PARAMETER can also be set manually (see the pass_args() documentation for details).

Ecological example: combining geocomputing and statistics

Figure 3: The study area Mount Mongón in northern Peru (Landsat image: path 9, row 67, acquisition date 09/22/2000; USGS 2016).

To show the utility of RQGIS in real-world applications, we combine QGIS functionality with R’s modeling and (geo-)statistical capabilities in an ecological study in the coastal desert of northern Peru (Figure 3). Despite the extreme aridity of the Mount Mongón region (200-1100 m asl), this area is the habitat of a distinct flora and fauna (Dillon, Nakazawa, and Leiva 2003). The unique vegetation, locally termed lomas, mainly survives due to heavy fog during the austral winter months (Muenchow, Feilhauer, et al. 2013; Muenchow, Hauenstein, et al. 2013).

Linking species richness to environmental predictors along gradients is a key topic of community ecology and biogeography (Muenchow et al. 2017), and the fundamental basis for conservation planning (Pomara et al. 2012). In our use case, we model vascular plant species richness along an altitudinal gradient as a function of topographic and remotely sensed variables by means of count regression. In the following, we will show a simplified version of an analysis by Muenchow, Feilhauer, et al. (2013).

Before running a Poisson model, we need to compute terrain attributes from a DEM. These will serve as predictors to model species richness. To account for the unimodal relationship between elevation and species richness, we use a second-order orthogonal polynomial function (Figure 4, Panel Elevation). In the original paper, we dropped the least significant variables one at a time until only significant predictors remained (elevation and its squared term, catchment slope, catchment area and the normalized difference vegetation index, NDVI). Due to space constraints and demonstration purposes, we will simply use the final model here (for details see Muenchow, Feilhauer, et al. 2013). Instead of calculating all predictors used in the original paper, we only show how to derive selected geospatial predictors using RQGIS, namely tangential and profile curvature, catchment slope and catchment area.

Figure 4: Scatterplot of all predictors used in the Poisson model against the response variable. Each dot represents a visited plot on Mount Mongón. The gray line smoother should aid visual inspection.

Terrain attributes

The numerical representation as well as the analysis of the land surface is frequently referred to as terrain analysis and terrain modeling. The corresponding surface-characterizing measures are known as terrain attributes (Pike, Evans, and Hengl 2008). Terrain attributes play an important role, for example, in pedometrics (McBratney et al. 2000), precision agriculture (Kühn et al. 2009), geomorphometry (Pike, Evans, and Hengl 2008) and ecology (Muenchow, Bräuning, et al. 2013). They are frequently related to slope stability (Montgomery and Dietrich 1994; Muenchow, Brenning, and Richter 2012). Additionally, they are proxies for variables representing water availability such as soil moisture, soil texture or moisture-holding capacity, among others (Alexander Brenning 2008; Franklin, McCullough, and Gray 2000; Muenchow, Hauenstein, et al. 2013). Especially the latter is of utmost importance regarding plant distribution in a desert environment. While GIS can easily calculate terrain attributes, R is rather limited in this respect. However, without terrain attributes, we would neither be able to model nor predict species richness appropriately.

First we would like to use SAGA to remove local depressions from the DEM, since these may be artifacts. For this, we use the [1] Fill Sinks method. Note that you also may use numbers for specifying the option, here, the fill sinks method corresponds to 1.

get_usage("saga:sinkremoval")
## ALGORITHM: Sink removal
##  DEM <ParameterRaster>
##  SINKROUTE <ParameterRaster>
##  METHOD <ParameterSelection>
##  THRESHOLD         <ParameterBoolean>
##  THRSHEIGHT <ParameterNumber>
##  _RESAMPLING <ParameterSelection>
##  DEM_PREPROC <OutputRaster>
## 
## 
## METHOD(Method)
##  0 - [0] Deepen Drainage Routes
##  1 - [1] Fill Sinks
## _RESAMPLING(Resampling method)
##  0 - Nearest Neighbour
##  1 - Bilinear Interpolation
##  2 - Bicubic Spline Interpolation
##  3 - B-Spline Interpolation
run_qgis(alg = "saga:sinkremoval",
         DEM = dem, 
         METHOD = "[1] Fill Sinks", 
         DEM_PREPROC = file.path(tempdir(), "sdem.sdat"),
         show_output_paths = FALSE)

Next, we compute the catchment area (or upslope contributing area) and its mean slope from the preprocessed DEM. These raster datasets are calculated while calculating the ‘Saga wetness index’, which is also frequently used as a predictor in ecological studies. To calculate the catchment slope instead of the local slope we set the SLOPE_TYPE argument to 1. The names of the output rasters are carea.sdat and cslope.sdat both of which will be stored in tempdir().

get_usage("saga:sagawetnessindex")
## ALGORITHM: Saga wetness index
##  DEM <ParameterRaster>
##  SUCTION <ParameterNumber>
##  AREA_TYPE <ParameterSelection>
##  SLOPE_TYPE <ParameterSelection>
##  SLOPE_MIN <ParameterNumber>
##  SLOPE_OFF <ParameterNumber>
##  SLOPE_WEIGHT <ParameterNumber>
##  _RESAMPLING <ParameterSelection>
##  AREA <OutputRaster>
##  SLOPE <OutputRaster>
##  AREA_MOD <OutputRaster>
##  TWI <OutputRaster>
## 
## 
## AREA_TYPE(Type of Area)
##  0 - [0] absolute catchment area
##  1 - [1] square root of catchment area
##  2 - [2] specific catchment area
## SLOPE_TYPE(Type of Slope)
##  0 - [0] local slope
##  1 - [1] catchment slope
## _RESAMPLING(Resampling method)
##  0 - Nearest Neighbour
##  1 - Bilinear Interpolation
##  2 - Bicubic Spline Interpolation
##  3 - B-Spline Interpolation
run_qgis(alg = "saga:sagawetnessindex",
         DEM = file.path(tempdir(), "sdem.sdat"),
         SLOPE_TYPE = 1, 
         SLOPE = file.path(tempdir(), "cslope.sdat"),
         AREA = file.path(tempdir(), "carea.sdat"),
         show_output_paths = FALSE)

To have a look at the output rasters, we can run (not shown):

library("dplyr")
library("raster")
file.path(tempdir(), c("cslope.sdat", "carea.sdat")) %>%
  raster::stack(.) %>%
  plot

We furthermore need to apply some transformations to these raster files for improved interpretability. On the one hand, we convert slope angle from radians to degrees.

library("raster")
cslope <- raster(file.path(tempdir(), "cslope.sdat"))
cslope <- cslope * 180 /pi

On the other hand, we divide the catchment area variable by one million to change the unit from m² to km². Furthermore, we transform it logarithmically since it is strongly skewed to the right.

carea <- raster(file.path(tempdir(), "carea.sdat"))
log_carea <- log(carea / 1e+06)

Apart from the catchment area and catchment slope, our final model requires the NDVI as an indicator of vegetation properties. To calculate it, we would have to perform pixel-by-pixel arithmetic operations on raster data sets representing the Landsat satellite’s spectral bands three (red) and four (near infrared). These so-called map algebra operations are available through raster and the QGIS modules saga:rastercalculator and grass:r.mapcalculator; however, the RQGIS package already provides the NDVI raster as an example dataset. Therefore, we merely have to attach the data to our workspace.

data("ndvi", package = "RQGIS")

To account for the nonlinear unimodal relationship between elevation and species richness, we need two rasters representing the second-order orthogonal polynomials of the original DEM. First, we convert the elevation unit from m to km by dividing the original DEM raster by 1000. Next, we use R’s built-in poly() function to calculate the orthogonal polynomials for each pixel in our DEM raster to avoid collinearity among the predictors. Before saving the resulting rasters and the catchment slope, the catchment area and the NDVI to a temporary output location, we apply the crop() function from the raster package to ensure that all rasters share the same extent.

data("dem", package = "RQGIS")
dem <- dem / 1000
my_poly <- poly(values(dem), degree = 2)
dem1 <- dem2 <- dem
values(dem1) <- my_poly[, 1]
values(dem2) <- my_poly[, 2]
for (i in c("dem1", "dem2", "log_carea", "cslope", "ndvi")) {
  tmp <- crop(get(i), dem) 
  writeRaster(x = tmp, 
              filename = file.path(tempdir(), paste0(i, ".asc")), 
              format = "ascii",
              prj = TRUE,
              overwrite = TRUE)  
}

After creating these raster datasets, we would like to extract their attributes to the randomly sampled points. Conveniently, RSAGA’s pick.from.ascii.grids() function accepts multiple rasters for parallel attribute extraction. Note that pick.from.ascii.grids() is a pure R function, i.e., it runs without accessing SAGA. Here, we only extract the predictor variables needed in the final model (see above). The "sf" object random_points refers to randomly sampled points we visited in the field. Unfortunately, pick.from.ascii.grids() does not accept spatial point objects as input; instead we have to provide it with a "data.frame" and indicate which columns refer to the coordinates. In the file argument we specify the raster files from which we would like to extract values to our points. The output columns in vals will be named like the input rasters.

library("dplyr")
data("random_points", package = "RQGIS")
random_points[, c("x", "y")] <- sf::st_coordinates(random_points)
raster_names <- c("dem1", "dem2", "log_carea", "cslope", "ndvi")
vals <- RSAGA::pick.from.ascii.grids(data = as.data.frame(random_points),
                                     X.name = "x", 
                                     Y.name = "y", 
                                     file = file.path(tempdir(), raster_names),
                                     varname = raster_names)
dplyr::select(vals, -geometry) %>%
  head(., 3)
##   id spri      x       y     dem1    dem2 log_carea cslope    ndvi
## 1  1    4 797179 8932755 -0.01049 0.01268  -1.21800  21.18 -0.3603
## 2  2    4 796749 8932621 -0.01019 0.01163   0.04145  13.02 -0.3488
## 3  3    3 796816 8932739 -0.01008 0.01124  -0.48148  23.70 -0.3396

Modeling species richness and predictive mapping

To model species richness, which is a count variable, we naturally opt for a Poisson regression. Overall, spatial count regression models are popular across many fields including epidemiology (Fernandez et al. 2012), demography (Chien, Guo, and Zhang 2016), criminology (Jones-Webb and Wall 2008), remote sensing (Comber et al. 2016) and ecology (Moreno-Fernández et al. 2015). Since we observed a unimodal nonlinear relationship between species richness and elevation, elevation enters the model as a second-order polynomial function.

fit <- glm(formula = spri ~ dem1 + dem2 + cslope + ndvi + log_carea,
           data = vals,
           family = "poisson")

To spatially predict species richness (Figure 5), we apply the estimated beta coefficients to the input rasters. raster’s predict function does this by accepting the fitted model and the predictor rasters as input. Finally, the crop-function ensures that the predictions are restricted to the study area.

library("sf")
library("raster")
raster_names <- c("dem1.asc", "dem2.asc", "log_carea.asc", "cslope.asc",
                  "ndvi.asc")
s <- stack(x = file.path(tempdir(), raster_names))
pred <- predict(object = s, 
                model = fit,
                fun = predict,
                type = "response")
pred <- crop(x = pred,
             y = as(random_points, "Spatial"))
# plot the output (shown in Figure 5)
plot(pred)
plot(st_geometry(random_points), add = TRUE)

Note that an alternative to raster::predict() is RSAGA’s multi.local.function() in conjunction with grid.predict() (see Alexander Brenning 2008).

`r''````{r, eval = FALSE, message = FALSE}
library("RSAGA")
multi.local.function(path = tempdir(), in.grids = c("dem1","dem2",
  "cslope", "ndvi", "carea"), out.varnames = "pred", fit = fit,
  fun = grid.predict, trafo = my_trafo, 
  control.predict = list(type = "response"),
  env = rsaga.env(path = "C:/OSGEO4~1/apps/saga"))
```

The model reached a good goodness-of-fit (explained deviance divided by null deviance) of 0.78. The most important variable in predicting species richness was elevation and its squared term. In our interpretation, variation with elevation mainly relates to differences in water availability. Humidity, and thus species richness, is greatest just below the temperature inversion (ca. 750–850 m). For a more detailed interpretation of the model and its predictors, refer to Muenchow, Feilhauer, et al. (2013).

Figure 5: Prediction map of species richness. The points represent the visited plots.

Extending RQGIS through Python and PyQGIS

In this section we would like to show examplarily how one can easily extend RQGIS through Python and especially PyQGIS. As explained in sections 2.1 and 2.2, RQGIS uses the reticulate package to establish a tunnel to the QGIS API. To find out which Python binary is in use we run the py_config() function of the reticulate package. When using Windows, please do so only after having run open_app before, since this sets all necessary paths to run the QGIS Python binary. Otherwise py_config() will be unable to find it, in which case it most likely would use another Python binary (e.g., the one shipped with Anaconda) if available. Additionally, open_app() imports various necessary Python libraries (among others osgeo, processing and qgis), and attaches our Python RQGIS class (see section 2.2).

library("reticulate")
py_config()
## python:         C:/OSGeo4W64/bin/python.exe
## libpython:      C:/OSGeo4W64/bin/python27.dll
## pythonhome:     C:\OSGeo4W64\apps\Python27
## version:        2.7.5 (default, May 15 2013, 22:44:16) [MSC v.1500 64 bit (AMD64)]
## Architecture:   64bit
## numpy:          C:\OSGeo4W64\apps\Python27\lib\site-packages\numpy
## numpy_version:  1.12.1
## 
## NOTE: Python version was forced by use_python function

We have defined the Python class RQGIS in python_funs.py which is part of ‘inst/python’. Aside from set_env() all functions found in ‘R/processing’ make extensive use of the Python RQGIS methods. Most of the Python methods have the same names as their counterparts in R. We can access them with py_run_string() of the reticulate-package. This sends a call to Python, and converts the Python into R output if desired.

py_run_string("methods = dir(RQGIS)")$methods
## [1] "__doc__"           "__init__"          "__module__"       
## [4] "check_args"        "get_args_man"      "get_options"      
## [7] "open_help"         "qgis_session_info"

Of course, we can use the Python RQGIS methods directly via reticulate which is exactly what the RQGIS-package is doing. For example, to find out what the options are for a QGIS geoalgorithm named qgis:randompointsinsidepolygonsvariable, we can run:

py_cmd <- "opts = RQGIS.get_options('qgis:randompointsinsidepolygonsvariable')"
py_run_string(py_cmd)$opts
## $STRATEGY
## [1] "Points count"   "Points density"

Or we can use PyQGIS-functionality directly. For instance, assuming we would like to find out the function parameters of qgis:randompointsinsidepolygonsvariable, we can use alghelp() from the QGIS Python processing framework (Graser and Olaya 2015 and see also https://docs.qgis.org/2.8/en/docs/user_manual/processing/console.html). Here, we only show the first 40 characters of the output.

py_cmd <- "processing.alghelp('qgis:randompointsinsidepolygonsvariable')"
py_capture_output(py_run_string(py_cmd)) %>%
  substring(., 1, 40)
## [1] "ALGORITHM: Random points inside polygons"

Users can easily extend the RQGIS class with additional methods, or they could write their own classes and methods. Also if there is a need to write Python one-liners or make use of some Python functionality, this can easily be done using RQGIS in conjunction with reticulate. Here, we will present one last example. Frequently, users have asked us if it was possible to also use the QGIS map canvas from within R. Therefore, we provide a proof-of-concept how this could be achieved (see also http://docs.qgis.org/testing/en/docs/pyqgis_developer_cookbook/canvas.html). First of all, we save random_points to a temporary output location.

data("random_points", package = "RQGIS")
file <- normalizePath(file.path(tempdir(), "points.shp"), winslash = "/", 
                      mustWork = FALSE)
sf::st_write(random_points, dsn = file)

Before running the subsequent code, make sure that you have already attached the packages RQGIS and reticulate. Additionally, you need to run open_app() first. Then we can create the map canvas, add the previously saved shapefile to it, set the extent to the extent of our shapefile, set the map canvas layer, and finally open a standalone map window (Figure 6). If this does not open a standalone window you might have to run py_run_string(’app.exec_()’) to initialize a Qt event loop which in turn renders the points in a standalone window (pers. comm. Barry Rowlingson). We emphasize that this is only a proof-of-concept, and a rather unstable solution (see section 5.4).

# create the map canvas
py_run_string("canvas = QgsMapCanvas()")
# import point shapefile
py_run_string(sprintf("layer = QgsVectorLayer('%s', 'points', 'ogr')", file))
# add imported point layer to the map canvas
py_run_string("QgsMapLayerRegistry.instance().addMapLayer(layer)")
# set the extent of the map canvas to the extent of the imported shapefile
py_run_string("canvas.setExtent(layer.extent())")
# set the map canvas layer
py_run_string("canvas.setLayerSet([QgsMapCanvasLayer(layer)])")
# open a standalone window
py_run_string("canvas.show()")
# if a standalone window has not already opened, run the next line
# py_run_string("app.exec_()")

Figure 6: A very simple example how to access the QGIS map canvas from within R.

Discussion

R/GIS-integration: Combining the best of two worlds

In our use case (see section 3) we mainly used QGIS for raster preprocessing and the generation of terrain attributes. Naturally, there are many more geospatial analysis problems that can be solved using the thousands of geoalgorithms that are accessible through RQGIS and other R–GIS packages. For instance, SAGA provides more than 600 (Conrad et al. 2015) and GRASS more than 500 functions (http://grass.osgeo.org/grass72/manuals/).

For example, apart from relatively simple DEM derivatives (slope angle and orientation), a GIS can also calculate more complex, process-oriented terrain attributes such as the relative slope position or the topographic wetness index (Conrad et al. 2015). Additionally, most GIS can easily extract stream networks (T. Hengl, Heuvelink, and van Loon 2010) and surface roughness (Grohmann 2004). Somewhat related are geospatial calculations concerned with terrain classification and landform identification (Alexander Brenning 2012b; Rocchini et al. 2013). Physically-based models (e.g., SHALSTAB or Factor of Safety, both available in SAGA GIS) may provide additional insights (Goetz, Guthrie, and Brenning 2011). Of course, GIS also provide an extensive suite of vector processing tools as required, for example, in geomarketing.

As pointed out in the beginning R has its limitations regarding GIS capabilities, but when it comes to statistical analyses, R is the uncontested champion in its field. For instance, instead of using a polynomial function in our use case (see section 3.2), we could have used a generalized additive model (GAM) with a logarithmic link function to allow for nonlinear relationships between various predictors and species richness (Figure 4). A GAM is the nonlinear extension of a generalized linear model (GLM), and uses smoothing functions to deal with nonlinearity (Hastie 2017). In ecology, coupling ordination techniques and GAMs is a fruitful approach to spatially predict ecological communities (Muenchow, Bräuning, et al. 2013).

Equally, machine learning algorithms (support vector machines, random forests, etc.) are readily available in R, although these methods may tend to overfit the training data (A. Brenning 2005). Overfitting in turn limits a model’s ability to spatially predict the response variable. Here, spatial cross-validation through the sperrorest package (Alexander Brenning, Long, and Fieguth 2012) provides an opportunity to assess spatial predictive capabilities.

Frequently, the residuals of spatial models show some form of spatial autocorrelation. This violates the assumption of independent model residuals made by most statistical models (Dormann et al. 2007; Zuur et al. 2009). Violating this assumption can lead to untrustworthy \(p\) values, biased coefficients and subsequently poor predictions (Zuur et al. 2009). Fortunately, packages such as nlme (Pinheiro, Bates, and R-core 2017) and mgcv (Wood 2017) let the user incorporate various correlation structures into a model. This is most helpful in the presence of temporal and spatial autocorrelation (e.g, Iturritxa, Mesanza, and Brenning 2015). Sometimes mixed-effect models may also account for autocorrelation since random intercepts allow for correlation within a group (e.g., Peters et al. 2014). Another way of dealing with spatial autocorrelation is e.g., to include auto-regressive correlation structures in a GLM or GAM within a Bayesian modeling approach (Zuur and Ieno 2017).

To summarize, R offers an incredibly vast suite of advanced statistical and data science methods. On the other hand, QGIS and other GIS software offer a rich suite of geoalgorithms and geocomputational power. Interfacing R with GIS simply combines the best of two worlds for automated statistical geocomputing.

RSAGA and R/GRASS-integration

Compared to the two separate packages rgrass7 and RSAGA, RQGIS has the advantage of providing a unified interface to both GRASS and SAGA GIS toolboxes. Moreover, QGIS facilitates the usage of third-party geoalgorithms by automatically converting vector (e.g., shapefiles) and raster formats (e.g., ASCII grid files) into the particular format supported by the third-party module. For instance, SAGA has its own grid format (sgrd-files) and GRASS uses its own database format. Running SAGA or GRASS functions, (R)QGIS automatically converts the input data using io_gdal() in the case of SAGA and v.in.ogr() or r.in.gdal() in the case of GRASS. Though this is extremely user-friendly especially when providing interfaces to various third-party providers, it comes at the prize of increased computing time due to the necessity of multiple format conversions during one geoalgorithm call. Equally user-friendly it the automatic setup of the GRASS environment (projection, region and mapset) through (R)QGIS, if necessary. This certainly facilitates access to GRASS, especially for less experienced GIS users. Finally, RQGIS is overall quite user-friendly due to its convenience functions open_help() and get_args_man() and through its support of R named arguments for geoalgorithm parameters (see sections 2.1 and 2.2).

However, (R)QGIS only integrates a subset of the modules available in SAGA and GRASS GIS. While this fraction is likely to grow in the near future, a full integration of all modules is improbable as it would duplicate functionality (though this of course already has happened) and interface functions that are unnecessary within the QGIS environment, such as the GRASS database functions. If the user’s intention is to use GRASS’s database management system (DBMS), the direct R–GRASS integration via the spgrass6 (R. Bivand, Pebesma, and Gómez-Rubio 2013) and rgrass7 packages (R. Bivand and Neteler 2000) would be the appropriate path. RQGIS does not provide access to this DBMS since the GRASS plugin of the QGIS processing toolbox only allows restricted access to GRASS’s DBMS functionality. The use of rgrass7 also allows the user to operate within a single GRASS session instead of calling a new one for each GRASS command as implicitly done by QGIS.

In the case of SAGA GIS, RSAGA has additional benefits. First of all, RSAGA provides numerous user-friendly wrapper functions with arguments (and meaningful default values) documented in the R help pages. RSAGA also strives to provide unified access to a range of SAGA versions while using, if possible, persistent function and argument names as well as default values. This allows for an easier migration between SAGA versions. At the moment RSAGA supports versions 2.0.4–2.2.3, but support for SAGA versions until the current version 6.1 has already been developed, and is currently being tested. By contrast, QGIS 2.14 supports SAGA 2.1.2–2.3.1. With the release of QGIS 2.18.10, this support was limited to the long term SAGA release 2.3.x. Extremely useful are furthermore RSAGA’s special geocomputing functions that allow, for example, the application of any user-defined R function to a stack of grids, either locally or within moving windows (functions multi.focal.function() and multi.local.function()). In conjunction with grid.predict(), predict methods of models fitted in R can therefore also be applied to stacks of raster files, as shown in Alexander Brenning (2008).

In the end, it will be up to the user to decide whether to use RQGIS, RSAGA, rgrass7 or some combination of these, depending on the user’s preferences, expertise and tasks at hand. In any case, we would recommend RQGIS if a user

requires mainly the more commonly used SAGA and GRASS functions,
does not want to be bothered with setting up the GRASS environment,
does not plan to use GRASS geodatabase capabilities,
does not want to worry about spatial data format conversions,
would like to use spatial objects residing in R as input-arguments, and load GIS output automatically into R,
cherishes the flexibility of seamlessly integrating QGIS, GRASS, SAGA and other third-party software (GDAL/OGR, Orfeo Toolbox, LAStools, TauDEM) within a single geoprocessing workflow in R.

Accessing ArcGIS through RPyGeo

The integration of GIS functionality into R is not limited to the mentioned open-source GIS software, but also includes the global leader in commercial GIS software (Longley et al. 2011), ArcGIS, through the RPyGeo package (Alexander Brenning 2012a). This package piggybacks on ArcGIS’s own Python interface to ArcGIS functions, or ‘tools’. RPyGeo generates Python code that calls ArcGIS. Some of the challenges currently include

latency times related to launching Python and ArcGIS before each GIS call,
passing data and files between R and ArcGIS,
error tracking based on Python and/or ArcGIS error messages,
interpretation of ArcGIS help pages from the perspective of a geoprocessing interface function in R.

While some of these limitations may be overcome in future releases of RPyGeo and ArcGIS, the growing interest in integrating R and ArcGIS is also evident from recent efforts to provide access to R from within ArcGIS (https://r-arcgis.github.io/). This so-called R–Bridge allows users to

retrieve data from ArcGIS geodatabases into R as "sp" objects, and export R data back into ArcGIS geodatabases,
run R code within ArcGIS as a user-defined ‘tool’. This is similar to QGIS’s ability to integrate R scripts as user-defined GIS modules in the Processing toolbox.

While this connectivity, in principle, goes both ways, the integration of ArcGIS into R through the R–Bridge is currently limited to data import/export, which is complementary to RPyGeo’s capability of executing ArcGIS modules from within R.

Current and future developments

There are several interesting developing directions in the R-spatial world and for RQGIS. For example, we have been asked multiple times to include the QGIS mapping widget within R for fast interactive visualization and styling of multiple layers. In section 4 we have shown that this is theoretically possible. However, mixing R and Qt events seems to cause frequently trouble (also pers. comm. Barry Rowlingson). For instance, when running QgsMapCanvas() twice or enlarging the standalone window manually (Figure 6), one runs a high risk of crashing the current R session. Therefore, Kevin Stadler, Barry Rowlingson and Julia Wagemann have opted for a different approach realized within a Google Summer of Code Project. In this project they wrote the QGIS Plugin Network API which enables the user to access the QGIS API via a HTTP interface from another language capable of making HTTP calls (such as R). Along with the sibling R package qgisremote (https://qgisapi.gitlab.io/qgisremote/index.html) this allows R users to seamlessly exchange spatial data between R and QGIS. For example, one can add vector and raster layers from within R to the QGIS map canvas, interactively edit them (such as adding new points or changing polygon vertices), and load the results back into R. Similarly, the packages mapview (build on top of leaflet, Cheng, Karambelkar, and Xie 2017) and mapedit lets the user interactively visualize and edit spatial objects on leaflet maps. The advantage of qgisremote over these two packages is that the QGIS map canvas probably is better able to handle larger quantities of spatial data compared to leaflet, and that it provides the user with the full power of a Desktop GIS for manually editing spatial data (trace, snap, etc.). Nevertheless, mapview is a great tool for interactive visualization, and mapedit is a good alternative for small and quick modifications of spatial vector data within R. Coming back to the user request to include the QGIS mapping widget into RQGIS, we can state that qgisremote already filled this gap perfectly. In summary, RQGIS and qgisremote complement one another. The first gives an R user direct access to the QGIS processing engine, and the latter hands an R user the power of a Desktop GIS graphical user interface.

Adding new functionalities to RQGIS will generally include Python programming. Since QGIS migrates from Python 2 to Python 3 by the end of 2017, it is probably best to postpone major RQGIS extensions until after the migration. To guarantee a smooth transition, and to offer the possibility to work either with QGIS 2 or QGIS 3, we have designed RQGIS in such a way that we ideally would merely have to add a Python 3 script to ‘inst/python’ to make RQGIS also work with QGIS 3. In the case of a bumpier transition than anticipated, RQGIS users may rest assured that RQGIS will work in any case with the QGIS long term release 2.18 which will be further developed and regularly updated (see https://www.qgis.org/en/site/getinvolved/development/roadmap.html#release-schedule).

Another envisaged RQGIS update includes the support for "stars" classes (see https://github.com/r-spatial/stars). stars aims to mainly extend the raster package.

Conclusions

Combining R and GIS software creates a powerful environment for advanced statistical geocomputing. RQGIS makes this also possible with QGIS—one of the most-widely used open-source GIS, which is therefore probably also very appealing to R users. Conveniently, RQGIS offers a unified interface to various desktop GIS (SAGA, GRASS, etc.) that are integrated into QGIS. The use of GIS tools is facilitated through auxiliary functions for the automatic retrieval of function arguments and their default values, the support of R named arguments in run_qgis(), the seamless exchange of spatial data types, and the quick access of the online help for any QGIS geoalgorithm.

Acknowledgments

We greatly appreciate the work of the QGIS development and the R core team. We are also greatly indebted to Dr. Rainer Krug (University of Zurich) who not only reviewed our manuscript two times but also improved it with his valuable suggestions. Of course, we thank Dr. Roger Bivand, our editor, for handling our manuscript. We are also grateful to Dr. Tomislav Hengl (SRIC – World Soil Information, Wageningen), and an anonymous reviewer for valuable feedback on a previous manuscript version. Finally, we would like to thank Barry Rowlingson (Lancaster University) for fruitful discussions on Python, QGIS and R.

RQGIS: Integrating R with QGIS for Statistical Geocomputing

Jannes Muenchow

Patrick Schratz

Alexander Brenning

2017-12-04