Abstract
The UK National River Flow Archive (NRFA) stores several types of hydrological data and metadata: daily river flow and catchment rainfall time series, as well as gauging station and catchment information. Data are served through the NRFA web services via experimental RESTful APIs. Obtaining NRFA data can be unwieldy due to complexities in handling HTTP GET requests and parsing responses in JSON and XML formats. The rnrfa package provides a set of functions to programmatically access, filter, and visualize NRFA data using simple R syntax. This paper describes the structure of the rnrfa package, including examples using the main functions gdf() and cmr() for flow and rainfall data, respectively.
Visualization examples are also provided with a shiny web
application and functions provided in the package. Although this package
is region-specific, the general framework and structure could be
applied to similar databases.
The increasing volume of environmental data available online poses non-trivial challenges for efficient storage, access and sharing of this information (Vitolo et al. 2015). An integrated and consistent use of data is achieved by extracting data directly from web services and processing them on-the-fly. This improves the flexibility of modelling applications allowing a more seamless workflow integration, and also avoids the need to store local copies that would need to be periodically updated, therefore reducing maintenance issues in the system.
In the hydrology domain, various data providers are adopting web services and Application Programming Interfaces (APIs) to allow users fast and efficient access to public datasets, such as the National River Flow Archive (NRFA) hosted by the Centre for Ecology and Hydrology in the United Kingdom. The NRFA is a primary source of information for hydrologists, modellers, researchers and practitioners operating on UK catchments. It stores several types of hydrological data and metadata: gauged daily flow and catchment mean rainfall time series as well as gauging station and catchment information. Data are typically served through the NRFA web services via a web-based graphical user interface (http://nrfa.ceh.ac.uk/) and, more recently, via experimental RESTful APIs. REST (Representational State Transfer) is an architectural style that uses the HyperText Transfer Protocol (HTTP) to perform operations such as accessing resources on the web via a Uniform Resource Identifier (URI). In simple terms, the location of an NRFA dataset on the web is a unique string of characters that follows a pattern. This string is assembled using the rules described in the API documentation and can be tested by typing the string in the address bar of a web browser.
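As an illustration, the sketch below assembles such a URI by hand in R. The base URL and query parameters follow the pattern that the package's functions print to the console; they are shown here for illustration only and may change as the experimental API evolves.

```r
# Illustrative only: assemble the URI for gauged daily flow at station 54090.
base_url <- "http://nrfaapps.ceh.ac.uk/nrfa/xml/waterml2"
uri <- paste0(base_url, "?db=nrfa_public&stn=54090&dt=gdf")
# The string can be pasted into a browser address bar, or fetched from R:
# response <- httr::GET(uri)
uri
```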
This paper describes the technical implementation of the rnrfa package (Vitolo 2016). The rnrfa package takes the complexities related to web development and data transfer away from the user, providing a set of functions to programmatically access, filter, and visualize NRFA data using simple R syntax. Although the NRFA APIs are still in their infancy and prone to further consolidation and refinement, the experimental implementation of the rnrfa package can be used to test these data services and provide useful feedback to the provider.
The package is in line with a Virtual Observatory approach (Beven, Buytaert, and Smith 2012) as it can be used as a back-end tool to link data and models in a seamless fashion. It complements R’s growing functionality in environmental web technologies (Leeper et al. 2016), amongst which are rnoaa (Chamberlain 2015; an interface to the NOAA climate data API), waterData (Ryberg and Vecchia 2014; an interface to the U.S. Geological Survey daily hydrologic data services) and RNCEP (Kemp et al. 2011; an interface to NASA NCEP weather data).
This paper first presents the NRFA archive, its web services and related APIs. We then illustrate the design and implementation of the rnrfa package, and how it can be used in synergy with existing R packages such as shiny (Chang et al. 2016), leaflet (Cheng and Xie 2015), rmarkdown (Allaire et al. 2016), DT (Xie 2015a), dplyr (Wickham and Francois 2015) and parallel to generate interactive mapping applications, dynamic reports and big data analytics experiments.
The NRFA web services allow users to view, filter and download data via a graphical user interface. This approach has a number of limitations. Firstly, time series of daily streamflow discharge and catchment rainfall can only be downloaded one at a time. Therefore, for large scale analyses, downloading datasets for hundreds of sites becomes a rather tedious task. Secondly, metadata can only be visualised (in table format) but not downloaded. Metadata analyses may require copying and pasting large amounts of information, introducing potential errors. Due to the above limitations, the NRFA is also accessible programmatically via a set of RESTful APIs. The API documentation is not in the public domain yet, therefore it must be considered experimental and subject to change.
Station metadata (called catalogue hereafter) is available in JavaScript Object Notation (JSON) format. The catalogue contains a total of 18 attributes, which are listed in Table 1. The NRFA also provides time series of Gauged Daily Flow (gdf, in \(m^3/s\)) and Catchment Mean Rainfall (cmr, in mm per month), formatted in an XML variant called WaterML2 (http://www.opengeospatial.org/standards/waterml). WaterML2 is an Open Geospatial Consortium (OGC) standard used worldwide to rigorously and unambiguously describe hydrological time series. It builds upon existing standards such as Observations & Measurements (Cox et al. 2011) for the metadata section and GML (Open Geospatial Consortium 2013) for the observed time series. It is typically defined as a “Collection” and made up of five sections:
1. The metadata section contains the document metadata (e.g., the generation date, the version, the generation system and the list of profiles utilised).
2. The temporalExtent contains a description of the time period for which there are recordings (with a time stamp of start and end date).
3. The localDictionary is a gml-based dictionary which stores the identifier (e.g., United Kingdom National River Flow Archive) and two dictionary entries: the first one describes the type of measurement (e.g., Gauged Daily Flow) with details on the variable measured (e.g., flow), units (e.g., m3/s) and frequency of measurements (e.g., daily); the second entry describes the gauging site with details on ratings and their limitations.
4. The samplingFeatureMember describes the monitoring point (e.g., vertical datum and time zone) and the station owner.
5. The observationMember contains a set of nodes whose schema is borrowed from the OGC Observations & Measurements standard. This section contains a gml-based identifier (station identification number) and additional information, such as ObservationMetadata (contact info, identification info, etc.), phenomenonTime (beginning and end of recordings), ObservationProcess (process type and reference). Finally, the sub-section result contains the measurements in a gml-based format.
The nested structure of the WaterML2 files makes parsing of long time series and related metadata relatively slow and complex. In order to improve access to NRFA’s public data and metadata, we implemented a set of functions to assemble HTTP GET requests and parse XML/JSON responses from/to the catalogue and WaterML2 services using simple R syntax.
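Internally, this boils down to a fetch-and-parse pattern along the following lines. This is a simplified sketch, not the package's actual implementation; the node names are indicative of the WaterML2 schema.

```r
library(httr)
library(xml2)

# Fetch a WaterML2 document for one station (URI pattern as printed by gdf()).
uri <- "http://nrfaapps.ceh.ac.uk/nrfa/xml/waterml2?db=nrfa_public&stn=54090&dt=gdf"
doc <- read_xml(content(GET(uri), as = "text", encoding = "UTF-8"))

# Extract time/value pairs from the 'result' section of observationMember.
# Namespace prefix and node names follow the WaterML2 standard.
points <- xml_find_all(doc, ".//wml2:point")
times  <- xml_text(xml_find_all(points, ".//wml2:time"))
values <- as.numeric(xml_text(xml_find_all(points, ".//wml2:value")))
```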
| Column number | Column name | Description |
|---|---|---|
| 1 | id | Station identification number. |
| 2 | name | Name of the station. |
| 3 | location | Area in which the station is located. |
| 4 | river | Name of the river catchment. |
| 5 | stationDescription | General station description, containing information on weirs, ratings, etc. |
| 6 | catchmentDescription | Information on topography, geology, land cover, etc. |
| 7 | hydrometricArea | UK hydrometric area identification number, the related map is based on the Surface Water Survey designed in the 1930s and available at http://www.ceh.ac.uk/data/nrfa/hydrometry/has.html. |
| 8 | operator | UK measuring authorities, the related map is available at http://www.ceh.ac.uk/data/nrfa/hydrometry/mas.html. |
| 9 | haName | Name of the hydrometric area. |
| 10 | gridReference | The Ordnance Survey grid reference number. |
| 11 | stationType | Type of station (e.g., flume, weir, etc.). |
| 12 | catchmentArea | Catchment area in \(Km^2\). |
| 13 | gdfStart | First year of monitoring. |
| 14 | gdfEnd | Last year of monitoring. |
| 15 | farText | Information on the regime (e.g., natural, regulated, etc.). |
| 16 | categories | Various tags (e.g., FEH_POOLING, FEH_QMED, HIFLOWS_INCLUDED). |
| 17 | altitude | Altitude measured in metres above Ordnance Datum or, in Northern Ireland, Malin Head. |
| 18 | sensitivity | Sensitivity index calculated as the percentage change in flow associated with a 10 mm increase in stage at the \(Q_{95}\) flow. |
The rnrfa package extends basic R functionality to interact with the NRFA. It builds on the following packages that should be installed beforehand: cowplot (Wilke 2016), plyr (Wickham 2011), httr (Wickham 2016a), xml2 (Wickham and Hester 2016), stringr (Wickham 2016b), xts (Ryan and Ulrich 2014), rjson (Couture-Beil 2014), ggmap (Kahle and Wickham 2013), ggplot2 (Wickham 2009), rgdal (R. Bivand, Keitt, and Rowlingson 2016), sp (E. J. Pebesma and Bivand 2005; R. S. Bivand, Pebesma, and Gomez-Rubio 2013) and parallel. The stable version of the package is available on the Comprehensive R Archive Network repository (CRAN; https://CRAN.R-project.org/package=rnrfa/) and can be downloaded and installed by typing the following command in the R console:
> install.packages("rnrfa")
The development version is available from a GitHub repository (https://github.com/cvitolo/rnrfa) and can be installed via devtools (Wickham and Chang 2016), using the following commands:
> install.packages("devtools")
> devtools::install_github("cvitolo/rnrfa")
The package is loaded using the following command:
> library(rnrfa)
The package is fully documented and additional sample applications are available on the dedicated web page http://cvitolo.github.io/rnrfa/. Feedback and contributions can be submitted through the GitHub issue tracking system (https://github.com/cvitolo/rnrfa/issues) and pull requests (https://github.com/cvitolo/rnrfa/pulls), respectively.
In many hydrological analyses the importance of efficient data retrieval is often underestimated, with the consequence of allocating more time to this first task than to the data processing and analysis of results. The rnrfa package provides re-usable functions, based on a consistent syntax, that attempt to simplify data retrieval and make it scalable to multiple data requests.
The full list of gauging stations is served in JSON format and can be
retrieved using the function catalogue(), called with no
inputs.
> allStations <- catalogue()
This converts the information into a data frame with one row per
station and 18 columns (Table 1 contains a
detailed description of the attributes). The reader should note that the
server response includes the Ordnance Survey (OS) grid reference, not
latitude and longitude coordinates. The catalogue()
function converts the grid reference to latitude and longitude, then
joins the coordinates to the data frame containing the list of
stations.
The conversion is handled by the osg_parse() function
which can transform OS grid references of different lengths to: a)
latitude and longitude, in the WGS84 coordinate system; b) easting and northing,
in the BNG coordinate system. This function accepts
two arguments: gridRef, a character string containing the
OS grid reference, and CoordSystem, that can be either
"WGS84" (default) or "BNG". The code below
shows how to convert an example OS grid reference,
"NC581062", to the two types of coordinates.
> # Option a: from OS grid reference to WGS84
> osg_parse(gridRef = "NC581062", CoordSystem = "WGS84")
> # Option b: from OS grid reference to BNG
> osg_parse(gridRef = "NC581062", CoordSystem = "BNG")
The catalogue() function provides 5 optional arguments
that can be used to filter metadata based on various criteria. The
argument all, for instance, is TRUE by default
and forces all the metadata to be retrieved. If all is set
to FALSE, the resulting data frame contains only the
following columns: id, name,
river, catchmentArea, lat,
lon. This can be used, for instance, to print a concise
version of the table to the screen.
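For example (a minimal sketch; output omitted):

```r
# Retrieve only the concise set of columns and inspect the first stations.
conciseMeta <- catalogue(all = FALSE)
head(conciseMeta)
```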
At the time of writing, 1539 stations are monitored within the NRFA. The full set of stations is rarely used in its entirety. Depending on the aim of the analysis, stations might need to be filtered based on a geographical bounding box, length of the recording period, thresholds, etc. Below are some examples showing how to filter stations based on one or multiple criteria.
Stations can be filtered based on a bounding box thanks to the NRFA
web service and a specific functionality of its API. A bounding box
should be defined as a list of four named elements (minimum longitude,
minimum latitude, maximum longitude and maximum latitude) and passed as
input to the catalogue() function using the argument
bbox. The following example shows how to define a bounding
box for the Plynlimon area (mid-Wales, United Kingdom), filter the
related stations and map their location using the ggmap
package. In Figure 1 the location of each station
is shown as a red dot, while the name of the station is used as a
label.
> # Define a bounding box.
> bbox <- list(lonMin = -3.76, latMin = 52.43, lonMax = -3.67, latMax = 52.48)
> # Filter stations based on bounding box.
> someStations <- catalogue(bbox)
> # Map
> library(ggmap)
> library(ggrepel)
> m <- get_map(location = as.numeric(bbox), maptype = 'terrain')
> ggmap(m) + geom_point(data = someStations, aes(x = lon, y = lat),
+                       col = "red", size = 3, alpha = 0.5) +
+   geom_text_repel(data = someStations, aes(x = lon, y = lat, label = name),
+                   size = 3, col = "red")
To calculate summary statistics, it is often useful to select only
stations with at least \(x\) number of
recording years. In the example below, we select only gauging stations
with a minimum of 100 years of recordings, using the argument
minRec. The result is a list of three stations, two of
which are located in South England and one in Wales.
> # Select stations with more than 100 years of recordings.
> s100Y <- catalogue(minRec = 100, all = FALSE)
> # Print s100Y to the screen.
> s100Y
id name river catchmentArea lat lon
636 38001 Lee at Feildes Weir Lee 1036 51.76334 0.01277874
665 39001 Thames at Kingston Thames 9948 51.41501 -0.30887638
1130 55032 Elan at Caban Dam Elan 184 52.26907 -3.57239164
It is also possible to filter stations based on a number of metadata
entries using the arguments: columnName (name of the column
to filter) and columnValue (string or numeric value to
match or compare). The function catalogue() looks for
records containing the string columnValue in the column
columnName. If columnName refers to a
character field, the search is case sensitive and can be used to filter
the stations based on the river name, catchment name, location and so
on. In the example below we filter 34 stations falling within the Wye
(Hereford) hydrometric area:
> stationsWye <- catalogue(columnName = "haName", columnValue = "Wye (Hereford)")
If columnName refers to a numeric field and
columnValue contains special characters such as \(>, \ <, \ \geq\) and \(\leq\) followed by a number, stations are
filtered using a threshold. For instance, there are 7 stations with
drainage area smaller than 1 \(Km^2\),
which can be filtered using the command below:
> stations1KM <- catalogue(columnName = "catchmentArea", columnValue = "<1")
Filtering capabilities can also be combined. In the example below we filter all the stations within the above defined bounding box that belong to the Wye (Hereford) hydrometric area and have a minimum of 50 years of recordings. The only station that satisfies all the criteria is the Wye at Cefn Brwyn.
> catalogue(bbox, columnName = "haName", columnValue = "Wye (Hereford)",
+ minRec = 50, all = FALSE)
id name river catchmentArea lat lon
6 55008 Wye at Cefn Brwyn Wye 10.6 52.43958 -3.724108
Once a certain number of stations are selected, time series of gauged
daily flow and catchment mean rainfall data can be obtained by
requesting access to the NRFA WaterML2 service using the functions
gdf() and cmr(), respectively. These functions
assemble and send data requests to the WaterML2 service, parse responses
and convert them to a time series object (of class from package
xts). They use the same syntax and require the following
arguments:
id, the station identification numbers. This can
either be a single string or a character vector.
metadata, a logical variable. If set to
FALSE (default), metadata are not parsed. If it is set to
TRUE, the result for a single station is a list of two
named elements: data (time series) and meta
(metadata).
When gdf() and cmr() are executed, the
assembled data request is printed to the screen. This is very useful if
the user wants to understand how the API works behind the scenes, but
not when incorporating the code in automated scripts. Although the NRFA
API documentation is not public yet, the patterns are simple and can be
easily extrapolated by running a few examples.
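For instance, the requests printed by gdf() and cmr() suggest a pattern along these lines. The helper below is hypothetical, shown only to illustrate the extrapolation; the parameter names belong to the experimental API and may change.

```r
# Hypothetical helper: the station id and the data type ("gdf" for flow,
# "cmr" for rainfall) appear to be the only moving parts of the URI.
nrfa_uri <- function(id, type = c("gdf", "cmr")) {
  type <- match.arg(type)
  sprintf("http://nrfaapps.ceh.ac.uk/nrfa/xml/waterml2?db=nrfa_public&stn=%s&dt=%s",
          id, type)
}
nrfa_uri("54090", "cmr")
```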
Raw flow data are typically measured in \(m^3/s\), at 15-minute intervals. Data are first quality controlled, then the daily mean is calculated and stored in the NRFA public database. These data are typically collected for the monitoring of river networks but can also be used to calibrate hydrological models and build forecasting systems. The example below shows how to get the daily flow for the Tanllwyth at Tanllwyth Flume and the assembled data request (printed to the console).
> flow <- gdf(id = "54090")
http://nrfaapps.ceh.ac.uk/nrfa/xml/waterml2?db=nrfa_public&stn=54090&dt=gdf
The result is a time series (of class “xts”). No station-specific
information is stored, because the argument metadata is set
to FALSE by default. An “xts” object can be easily
converted into a data frame object and exported to a text file (e.g.,
csv) for use in other modelling software, as demonstrated in the example
below.
> # Get gauged daily flow for station 54090.
> flow <- gdf(id = "54090")
> # Convert to csv.
> write.csv(as.data.frame(flow), "flowDF.csv", quote = FALSE)
The main forcing input in any hydrological model is rainfall. In many
cases it is important to calculate the average rainfall over a
catchment; this is achieved by using geospatial interpolation methods
or, more simply, by calculating the weighted average using a number
of weather stations within the catchment and/or in the nearby areas. The
NRFA provides pre-calculated monthly catchment mean rainfall, measured
in \(mm\), for a number of UK
catchments. As the calculation is consistent across catchments, these
datasets are a valuable resource to ensure reproducibility of
hydrological analyses. Similar to gdf(), the function
cmr() allows users to retrieve the catchment mean rainfall
data by specifying the argument id. The example below shows
that, if we set the argument metadata to TRUE,
we can use metadata to automatically populate title and labels in a
plot, as in Figure 2. The reader should note
that rain$data is an “xts” object, therefore
plot(rain$data) uses the S3 method for “xts”.
> rain <- cmr(id = "54090", metadata = TRUE)
> data <- rain$data
> meta <- rain$meta
> plot(data, main = paste(meta$variable, "-", meta$stationName),
+ xlab = "", ylab = meta$units)
Station information consists of: the station name, location in latitude and longitude coordinates, the variable measured (i.e., rainfall), units (i.e., \(mm\)), aggregation function (i.e., accumulation), time step of recording (i.e., month) and time zone.
In the NRFA, flow and rainfall are stored in \(m^3/s\) and \(mm/month\), respectively, therefore they
are not directly comparable. However, given the catchment area (from the
metadata catalogue), the flow can be easily converted into \(mm/day\) and then compared to the rainfall,
for instance by plotting them on the same time line. Although the
operations are trivial, it is a relatively lengthy procedure that can be
simplified using the function plot_rain_flow(). This
function uses the station id as input to request metadata
as well as flow and rainfall time series for the given catchment,
converts the flow from its original units to \(mm/day\) and then plots the converted flow
and rainfall on two different \(y\)-axes so that they can be visually
compared, as shown in Figure 3.
> plot_rain_flow(id = "54090")
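For reference, the conversion that plot_rain_flow() automates can be sketched manually as follows (assuming the catalogue can be filtered on the id column, as for any other column):

```r
# Convert gauged daily flow from m3/s to mm/day over the catchment.
flow <- gdf(id = "54090")
area <- catalogue(columnName = "id", columnValue = "54090")$catchmentArea  # Km2
# 1 m3/s spread over 1 Km2 for one day: 86400 m3 / 1e6 m2 = 86.4 mm
flowmm <- flow * 86.4 / area
```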
The package rnrfa is particularly useful for large scale
data acquisition. If the id argument is a vector, the
functions gdf() and cmr() can be used to
sequentially fetch time series (meta)data from multiple sites. As the
server can handle multiple requests, concurrent calls can be sent
simultaneously using the parallel package. In order to send
concurrent calls, a cluster object, created by the parallel
package, should be passed to gdf() or cmr()
using the argument cl. Below is a simple benchmark test in
which we compare the processing time for collating flow time series data
for the 9 stations in the Plynlimon area sending: a) 1 data request at
a time and b) 9 simultaneous requests. The operations are repeated 10
times. The results are averaged and summarised in Table 2, which shows that (a) takes about 18
seconds, while (b) about 8 seconds. The reader should note that the time
for retrieval does not reduce proportionally with the number of
simultaneous requests because there is a limit in the number of calls
the server can handle, which depends on the infrastructure and the
number of incoming requests from other users.
> library(microbenchmark)
> library(parallel)
> cl <- makeCluster(getOption("cl.cores", 9))
> microbenchmark(# sequential requests
+ gdf(id = someStations$id, metadata = FALSE, cl = NULL),
+ # concurrent requests
+ gdf(id = someStations$id, metadata = FALSE, cl = cl), times = 10)
> stopCluster(cl)
| Test | min | lq | mean | median | uq | max | neval |
|---|---|---|---|---|---|---|---|
| a | 17.598647 | 17.95601 | 18.419300 | 18.355630 | 19.037328 | 19.16267 | 10 |
| b | 3.564888 | 8.91512 | 8.411546 | 9.504491 | 9.666812 | 10.58291 | 10 |
The rnrfa package is an ideal building block for many scientific workflows but can also work as back-end tool for a number of web applications, from interactive mapping and dynamic reports that improve reproducibility of analysis, to the integration into more sophisticated big data analytics experiments. This can be achieved thanks to the intrinsic interoperability of the R environment. Some example applications are given in the following sections.
Here we demonstrate the generation of a dynamic mapping and reporting application to summarise stations’ metadata and map the spatial distribution of the monitoring network for each operator. The user can select the name of the operator using a drop-down menu and the dynamic document automatically renders an interactive map showing a marker for each station in the network on top of a background map based on OpenStreetMap. Users can zoom in/out and navigate to a specific area. Finally, the user can click on a marker to read the name and station identification number from a pop-up window. Figure 4 shows a screenshot of the web application. At the bottom of the page is a dynamic table that summarises the metadata associated with the selected stations in the network. The table can be filtered using an interactive search box. The textual content also updates automatically, reporting the number of stations within the selected network. The web application depends on the following packages: rmarkdown, knitr, shiny, leaflet and DT; its source code is available as gist at the following URL: https://gist.github.com/cvitolo/d5d46b5e8f3676013857.
The NRFA web site does not allow users to execute geoprocessing tasks, for instance, to intersect the list of stations with user-defined or externally sourced areas. In some cases it might be of interest to explore the distribution of stations based on high-level administrative boundaries such as regions/countries. This is useful to understand whether there are differences in the reliability of the networks that can be explained by the different management approaches. Eurostat established a hierarchy of three levels of administrative divisions within each European country, called Nomenclature of Territorial Units for Statistics (NUTS). At the first level, the UK is divided into 12 regions: Northern Ireland, Scotland, Wales and 9 English sub-regions (East Midlands, East of England, Greater London, North East, North West, South East, South West, West Midlands, Yorkshire and the Humber). Calculating, for instance, the number/density of stations by region is not possible using the NRFA web site because the stations’ metadata does not contain information on this type of administrative region and users cannot specify their own. However, these simple geoprocessing operations become relatively trivial using the rnrfa package.
The procedure consists of five steps:

1. retrieve the list of NRFA stations (using the catalogue() function);
2. load the NUTS (level 1) shapefile and reproject the polygons to the geographic coordinate system WGS84 (using the rgdal and sp packages);
3. transform the NRFA list of stations into a SpatialPointsDataFrame;
4. spatially overlay NRFA stations (points) and NUTS1 regions (polygons);
5. add a new column, containing the name of the NUTS1 regions, to the list of NRFA stations.
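The steps above can be sketched as follows. The shapefile path, layer name and attribute names refer to the Eurostat NUTS 2013 distribution and should be treated as assumptions.

```r
library(rgdal)
library(sp)

# 1. List of NRFA stations.
allStations <- catalogue()
# 2. Load the NUTS shapefile, keep level 1 and reproject to WGS84.
nuts <- readOGR(dsn = "NUTS_2013_01M_SH", layer = "NUTS_RG_01M_2013")
nuts1 <- spTransform(nuts[nuts$STAT_LEVL_ == 1, ], CRS("+init=epsg:4326"))
# 3. Promote the stations to a SpatialPointsDataFrame.
pts <- SpatialPointsDataFrame(coords = allStations[, c("lon", "lat")],
                              data = allStations,
                              proj4string = CRS("+init=epsg:4326"))
# 4.-5. Overlay points and polygons, then attach the region identifier.
allStations$nuts1 <- over(pts, nuts1)$NUTS_ID
```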
The updated list of stations is included, as a sample dataset, in the
data folder of this package, under the name stationSummary.
Table 3 summarises the number of stations per
region, the area of each region (in \(Km^2\)), and the density of stations
(number of stations/\(Km^2\)). The
metadata can now be easily summarised by NUTS1 region, for instance the
boxplot in Figure 5 shows the
distribution of years of recording. Northern Ireland seems to have the
youngest network, with recording years in the range \([16, 44]\). Only three regions have stations
with more than 100 years of recordings: East of England, London and
Wales. Scotland and Northern Ireland have the lowest density of gauging
stations, while Greater London the highest. The code to reproduce this
example is available as gist at the following URL: https://gist.github.com/cvitolo/aa3bc6f08a8394f653442e276568f9b3.
| | NUTS_ID | Region | # stations | Area (\(Km^2\)) | Density |
|---|---|---|---|---|---|
| 1 | UKC | North East (England) | 54.00 | 8601.77 | 0.006 |
| 2 | UKD | North West (England) | 137.00 | 14170.34 | 0.010 |
| 3 | UKE | Yorkshire and the Humber | 102.00 | 15418.70 | 0.007 |
| 4 | UKF | East Midlands | 101.00 | 15637.21 | 0.006 |
| 5 | UKG | West Midlands | 103.00 | 12999.97 | 0.008 |
| 6 | UKH | East of England | 149.00 | 19159.91 | 0.008 |
| 7 | UKI | Greater London | 36.00 | 1575.97 | 0.023 |
| 8 | UKJ | South East (England) | 169.00 | 19105.67 | 0.009 |
| 9 | UKK | South West (England) | 176.00 | 23912.24 | 0.007 |
| 10 | UKL | Wales | 132.00 | 20817.37 | 0.006 |
| 11 | UKM | Scotland | 324.00 | 78984.40 | 0.004 |
| 12 | UKN | Northern Ireland | 56.00 | 14175.46 | 0.004 |
In the last few years, the UK MetOffice has reported “unusual warmth and lack of rainfall during March and April, particularly over England and Wales”. Dry springs can affect water resources, because river flow below average translates, for instance, into reduced availability of drinking water. In this section we present a big data analytics experiment in which we try to understand whether there is any evidence, in the NRFA data, that springs in the UK are becoming drier, both in terms of rainfall and river flow. This type of experiment consists of retrieving all the available rainfall and flow time series and finding out, for each station, whether there is an increasing/decreasing trend.
Using the NRFA web site, the comparison of time series is only feasible for a limited number of sites. Time series should be first downloaded as text files and then compared manually. The biggest advantage of using the rnrfa package, instead, is that multiple downloads can be automated using a single line of code.
In this experiment we used a cluster of 64 cores to download and
analyse all the time series available from the NRFA stations with more
than 10 years of recordings. The time series were first downloaded, then
summarised in terms of annual averages over the spring period. Seasonal
averages can be calculated using the function
seasonal_averages(), which takes as input a time series and
a period of interest (e.g., spring) and calculates the related annual
average. Using a very simplistic approach, a linear model was fit to the
annual averages and the slope coefficient was used to estimate the
trend. Negative slopes correspond to decreasing flow/rainfall, while
positive slopes correspond to an increase of flow/rainfall over time.
Once the fitted slope is calculated for each station, the results can be
plotted using the function plot_trend(). Figures 6 and 7
show only the statistically significant trends for rainfall and flow
respectively. Each figure is divided into two plots: Plot A shows the
spatial distribution of negative trends with red dots and positive
trends with blue dots; plot B shows the variability of trends over NUTS1
regions. In the latter plot, outliers are removed by showing only values
between the 5th and 95th quantiles. From a meteorological perspective
(Figure 6), there are only positive
statistically significant trends and Scotland shows the largest. In
terms of hydrological responses (Figure 7), trends are more subtle as the
interquartile range is concentrated around zero. The most extreme
negative trends were found in Scotland and North East England.
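In outline, the per-station computation looks like this. This is a sketch: the signature of seasonal_averages() is simplified here, and the linear-model step mirrors the simplistic approach described above.

```r
# Estimate the spring trend for a single station.
flow <- gdf(id = "54090")
springAvg <- seasonal_averages(flow, season = "Spring")  # annual spring means
# Fit a linear model over the years; the slope coefficient is the trend.
trend <- coef(lm(springAvg ~ seq_along(springAvg)))[2]
```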
The entire run took about 31 minutes; the code to reproduce this example is available as gist at the following URL: https://gist.github.com/cvitolo/612eb2ae9b47fe8f11a1ed8d06e3b434. There are certainly more rigorous methodologies to estimate seasonal trends. This experiment was just an attempt to demonstrate that the rnrfa package can simplify large scale data acquisition tasks.
The cranlogs (Csardi 2015) package provides an API interface to download logs from the RStudio CRAN mirror which contains download counts from unique IP addresses and can be used as a proxy to estimate the volume of package users. By September 2016, the rnrfa package had been downloaded from CRAN 6372 times, just from this mirror, following a trend very similar to the waterData package (see Figure 8). Because the RStudio mirror is located in the US, it is expected that the download counts from UK mirrors could be even higher. We derive that this package is of interest for a large community of users, which gives us scope for future developments.
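The counts can be retrieved reproducibly with cranlogs; the date range shown below is illustrative.

```r
library(cranlogs)

# Daily download counts from the RStudio CRAN mirror for two packages.
counts <- cran_downloads(packages = c("rnrfa", "waterData"),
                         from = "2014-01-01", to = "2016-09-30")
# Total rnrfa downloads over the period.
sum(counts$count[counts$package == "rnrfa"])
```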
This article describes the rnrfa package for interacting
programmatically with the UK National River Flow Archive. It allows users to
access web resources such as the catalogue of stations’ metadata and the
WaterML2 service to retrieve gauged daily flow and (monthly) catchment
mean rainfall. The package provides functions to query the catalogue
based on various criteria (e.g., geographical bounding box, minimum
number of recording years, river catchment/hydrometric area/operators
amongst many other options), retrieve and visualise flow and rainfall
time series, convert coordinates and flow measurements, and plot basic
seasonal trends grouped on user defined regions. Some of these
capabilities are strongly linked to the particular content of the NRFA
database and are not directly transferable/applicable to other data
sources. However the gdf() and cmr() functions
could be re-used, with minimal changes, to get data/metadata from other
providers adopting the WaterML2 standard.
The package is a convenient standalone application that gives NRFA users more efficient access to the public database than the web interface, e.g., the possibility to efficiently retrieve data from multiple sites. The rnrfa package can also be used as a back-end tool for web applications. Amongst the existing R interfaces to data APIs, rnrfa follows a logic similar to waterData: sites are first identified through a catalogue, streamflow data are imported via the station identification number, then data are visualised and/or used in analyses. However, our package does not implement any function for data cleanup, because NRFA data are highly quality controlled. Users can currently take advantage of other packages such as xts to calculate aggregate variables, evd (Stephenson 2002) for the analysis of extreme events, outliers (Komsta 2011) to identify possible outliers, and sp and spacetime (E. Pebesma 2012; R. S. Bivand, Pebesma, and Gomez-Rubio 2013) for more advanced spatio-temporal processing.
In the future, we plan to implement additional processing functions (e.g., to compare gdf with flow in bankfull condition which is highly important for flood frequency estimations). Further developments are also scheduled on the NRFA side to include Web Feature Service (WFS), Sensor Observation Services (SOS) and updates to WaterML2 OGC standards. WFS layers can already be loaded and manipulated using rgdal (R. Bivand, Keitt, and Rowlingson 2016), while sos4R (Nüst, C. Stasch, and E. J. Pebesma 2011) can be used as client for SOS.
This work was carried out when Claudia Vitolo was working at the Imperial College London and supported by the Natural Environment Research Council pilot Probability, Uncertainty & Risk in the Environment (PURE) NE/I004017/1. Comments from two referees are gratefully acknowledged.
In order to run the examples in this manuscript, the following packages should also be installed: shiny, leaflet, DT, ggrepel (Slowikowski 2016), knitr (Xie 2016, 2015b, 2014) and rmarkdown.↩︎
World Geodetic System 1984, EPSG code: 4326.↩︎
British/Irish National Grid, EPSG codes: 27700/29902.↩︎
The Nomenclature of Territorial Units for Statistics (NUTS), is a standard for referencing the administrative divisions of European countries. There are three levels of NUTS and a shapefile is available from the Eurostat web site (http://ec.europa.eu/eurostat/cache/GISCO/geodatafiles/NUTS_2013_01M_SH.zip).↩︎
http://www.metoffice.gov.uk/climate/uk/interesting/2011_spring↩︎