Abstract

This article is a self-contained introduction to the R package ercv and to the methodology on which it is based through the analysis of nine examples. The methodology is simple and trustworthy for the analysis of extreme values and relates the two main existing methodologies. The package contains R functions for visualizing, fitting and validating the distribution of tails. It also provides multiple threshold tests for a generalized Pareto distribution, together with an automatic threshold selection algorithm.

Introduction and overview

Extreme value theory (EVT) is one of the most important statistical techniques for the applied sciences. A review of the available software on extreme value analysis appears in Gilleland, Ribatet, and Stephenson (2013). R software (R Core Team 2017) contains some useful packages for dealing with EVT. The R package evir (Pfaff and McNeil 2012) provides maximum likelihood estimation (MLE) at the same time for the block maxima and threshold model approaches. The R package ismev (Heffernan and Stephenson 2018) allows fitting parameters of a generalized Pareto distribution depending on covariates and offers diagnostics such as qqplots and return level plots with confidence bands. The R package poweRlaw (Gillespie 2015) enables power laws and other heavy tailed distributions to be fitted using the techniques proposed by Clauset, Shalizi, and Newman (2009). This approach had been used to describe sizes of cities and word frequency and is linked to the physics of phase transitions and to complex systems.

This paper shows that the R package ercv (del Castillo, Moriña, and Serra 2017), based on the coefficient of variation (CV), is a complement, and often an alternative, to the available software on EVT. The mathematical background is shown in Section 2, including threshold models and the relationship between power law distribution and the generalized Pareto distributions (GPD), which is the relationship between the two different approaches followed by the aforementioned R packages evir, or ismev, and poweRlaw.

Section 3 introduces the tools for the empirical residual coefficient of variation developed in the papers del Castillo, Daoudi, and Lockhart (2014), del Castillo and Serra (2015) and del Castillo and Padilla (2016). Section 3.2 also shows the exploratory data analysis of nine examples, some of them from the R packages evir and poweRlaw, with the cvplot function, see Figure 1.

Section 4 explains the Tm function in the R package ercv that provides a multiple thresholds test that truly reduces the multiple testing problem in threshold selection and provides clearly defined p-values. The function includes an estimation method of the extreme value index. An automatic threshold selection algorithm provided by the thrselect function is explained in Section 5 to determine the point above which GPD can be assumed for the tail distribution.

Section 6 shows how the methodology developed in the previous sections can be extended with the tdata function to all GPD distributions, even with no finite moments. This technique is applied to the MobyDick example and to the Danish fire insurance dataset, a highly heavy-tailed, infinite-variance model. Finally, Section 7 describes the functions of the R package ercv that allow estimation of the parameters (fitpot) and drawing of the adjustments (ccdfplot) for the peak-over-threshold method.

Mathematical Background

Extreme value theory is widely used to model exceedances in many disciplines, such as hydrology, insurance, finance, internet traffic data and environmental science. The underlying mathematical basis is now thoroughly established in Leadbetter, Lindgren, and Rootzén (1983), Embrechts, Klüppelberg, and Mikosch (1997), de Haan and Ferreira (2007), Novak (2012) and Resnick (2013). Statistical tools and methods for use with a single time series of data, or with a few series, are well developed in Coles (2001), Beirlant et al. (2006) and Markovich (2007).

Threshold models

The first fundamental theorem on EVT by Fisher and Tippett (1928) and Gnedenko (1943) characterizes the asymptotic distribution of the maximum in observed data. Classical analyses now use the generalized extreme value family of distribution functions for fitting to block maximum data provided the number of blocks is sufficiently large. Another point of view emerged in the 1970’s with the fundamental theorem by Pickands (1975) and Balkema and de Haan (1974). The Pickands-Balkema-DeHaan (PBdH) theorem, see McNeil, Frey, and Embrechts (2005, chap 7), initiated a new way of studying extreme value theory via distributions above a threshold, which use more information than the maximum data grouped into blocks.

Let $X$ be a continuous non-negative r.v. with distribution function $F(x)$. For any threshold, $t>0$, the r.v. of the conditional distribution of threshold excesses $X-t$ given $X>t$, denoted as $X_{t}=\{X-t\mid X>t\}$, is called the residual distribution of $X$ over $t$. The cumulative distribution function of $X_{t}$, $F_{t}(x),$ is given by \[\label{eq:thresh} 1-F_{t}(x)=(1-F(x+t))/(1-F(t)). (\#eq:thresh)\] The quantity $M(t)=E(X_{t})$ is called the residual mean and $V(t)=var(X_{t})$ the residual variance. The plot of sample mean excesses over increasing thresholds is a commonly used diagnostic tool in risk analysis called ME-plot (meplot function in evir R package).

The residual coefficient of variation is given by \[\label{eq:CV1} \mathrm{CV}(t)\equiv\mathrm{CV}(X_{t})=\sqrt{V(t)}/M(t), (\#eq:CV1)\] like the usual $\mathrm{CV}$, the function $\mathrm{CV}(t)$ is independent under change of scale.

The PBdH theorem characterizes the asymptotic distributions of the residual distribution over a high threshold under widely applicable regularity conditions, see Coles (2001). The result essentially says that GPD is the canonical distribution for modelling excess over high thresholds. The probability density function for a GPD$\left(\xi,\psi\right)$ is given by \[\label{eq.GPD} g(x;\xi,\psi)=\left\{ \begin{array}{cl} \psi^{-1}(1+\xi x/\psi)^{-(1+\xi)/\xi}, & \xi\neq0,\\ \psi^{-1}\exp(-x/\psi), & \xi=0, \end{array}\right. (\#eq:eq-GPD)\] where $\xi\in\mathbb{R}$ is called the extreme value index (evi) and $\psi>0$ is a scale parameter, $0\leq x\leq-\psi/\xi$ if $\xi<0$, and $x\geq0$ if $\xi\geq0$. The value of $\xi$ determines the tail type. If $\xi<0$, we say that the distribution is light tailed, if $\xi=0$ we say it is exponential tailed. If $\xi>0$ a GPD has finite moments of order $n$ if $\xi<1/n$ and it is called heavy tailed. The mean of a GPD is $\psi/(1-\xi)$ and the variance is $\psi^{2}/[(1-\xi)^{2}(1-2\xi)]$ provided $\xi<1$ and $\xi<1/2$, respectively. Then, the coefficient of variation is \[\label{eq:cvgpd} c_{\xi}=\sqrt{1/(1-2\xi)}, (\#eq:cvgpd)\] the cvevi and evicv functions of the R package ercv correspond to this function and its inverse.

The residual distribution of a $\mathrm{GPD}$ is again $\mathrm{GPD}$ with the same extreme value index $\xi$, for any threshold $t>0$, in fact \[\label{eq:invshape} \mathrm{GPD}_{t}(\xi,\psi)=\mathrm{GPD}(\xi,\psi+\xi t). (\#eq:invshape)\] Therefore, the residual $\mathrm{CV}$ for GPD is independent of the threshold and the scale parameter and is given by equation (@ref(eq:cvgpd)).

The probability density functions $\left(\ref{eq.GPD}\right)$ are monotone decreasing (L-shaped) for $\xi>-1$, covering practically all the applications. Therefore, we are mainly concerned with the subset of data that indicate this behaviour. For example, if the dataset is concentrated in the centre and decreases on either side (bell-shaped) we will study the upper and lower part (changed sign) of the distribution separately, taking the median or some other location statistic as the origin.

The power law distribution and GPD

The power law distribution is the model, introduced by Pareto, \[\label{eq:powlaw} p\left(x;\alpha,\sigma\right)=\frac{\alpha}{\sigma}\left(\frac{\sigma}{x}\right)^{\alpha+1},\quad x>\sigma (\#eq:powlaw)\] where $\alpha>0$ is the tail index and $\sigma>0$ the minimum value parameter. The model corresponds to the distribution functions $F$ with the linear relation \[\label{eq:linear} log\left[1-F\left(x\right)\right]=-\alpha\,log(x)+\alpha\,log\left(\sigma\right), (\#eq:linear)\] see also Gillespie (2015).

Note that if $X$ is a r.v. with probability density function $p\left(x;\alpha,\sigma\right)$, given by (@ref(eq:powlaw)), $Z=X-\sigma$ has probability density function \[\label{eq:relation} g(z;1/\alpha,\sigma/\alpha)=\frac{\alpha}{\sigma}\left(\frac{\sigma}{z+\sigma}\right)^{\alpha+1},\quad z>0, (\#eq:relation)\] that is, there is a one to one correspondence between power law distributions and GPD distributions with heavy tails $\left(\xi>0\right)$, where $\xi=1/\alpha$ and $\sigma=\psi/\xi$. However, the two statistical models (@ref(eq:eq-GPD)) and (@ref(eq:powlaw)), with $\xi>0$, are different since there is no unique transformation for all functions of the model (the transformation $Z=X-\sigma$ depends on the minimum value parameter $\sigma$ of the same variable $X$).

The MLE for model (@ref(eq:powlaw)) leads to the Hill estimator and Hill-plot (hill function in evir R package). The support of the distributions in (@ref(eq:powlaw)) depends on the minimum value parameter $\sigma$. Hence, the MLE has no standard regularity conditions and the minimum value parameter $\sigma$ is estimated with alternative methods, see Clauset, Shalizi, and Newman (2009) and its implementation in the poweRlaw R package by Gillespie (2015).

However, the support of the distributions in (@ref(eq:eq-GPD)), with $\xi>0$, does not depend on parameters and MLE existing for large samples provided $\xi>-1$ and is asymptotically efficient provided $\xi>-0.5$, see del Castillo and Serra (2015) and the references therein for details. The gdp function in the evir R package provides the MLE for (@ref(eq:eq-GPD)).

Note that model (@ref(eq:eq-GPD)) includes all the limit distributions (heavy or not) of the residual distribution over a high threshold and comes from a mathematical result (the PBdH theorem) and often (@ref(eq:powlaw)) comes from empirical evidence of the linear relationship (@ref(eq:linear)) and comparison with other models. Moreover, the linear relationship (@ref(eq:linear)) is also obtained from the relationship between the parameters (@ref(eq:relation)), see the ccdfplot function in Section 7.

The residual CV approach

Gupta and Kirmani (2000) show that the residual $\mathrm{CV}$ characterizes the distribution in univariate and bivariate cases, provided threre is a finite second moment ($\xi<1/2$). In the case of $\mathrm{GPD}$, the residual $\mathrm{CV}$ is constant and is a one to one transformation of the extreme value index suggesting its use to estimate this index. The residual CV can also be expressed in terms of probabilities, rather than the threshold, through the inverse of the distribution function or the quantile function defined by $Q\left(p\right)=inf\left[x:F(x)\geq p\right],$ then the CV can be drawn, for $0\leq p<1$, for the threshold $t=Q\left(p\right)$, that is to plot the function $p\rightarrow CV(Q(p))$. This representation makes it possible to draw on the same scale for the $x$ axis the residual CV of distributions with different supports.

Exploratory data analysis with `cvplot` function

In this section the cvplot function of the R package ercv is introduced as a graphical tool for use in a exploratory data analysis, through the nine examples described in Section 3.2. The cvplot function is essentially the empirical residual CV whose asymptotic distribution as a stochastic process is explained by del Castillo, Daoudi, and Lockhart (2014) and del Castillo and Padilla (2016).

The empirical residual CV and confidence intervals.

Assume that the raw data consist of a sequence of independent and identically distributed measurements $x_{1},...,x_{n}$. Extreme events are identified by defining a high threshold $t$ for which the exceedances are $\{x_{j}:x_{j}>t\}$. Hence, we first identify a threshold $t$ such that its exceedances correspond to a constant residual CV (equivalently a GPD). We denote the ordered sample $x_{(1)}\leq x_{(2)}\leq\dotsb\leq x_{(n)}$. The cvplot function provides the function $cv(t)$ of the sample coefficient of variation of the threshold excesses $(x_{j}-t)$ given by \[\label{eq:sample cv} t\rightarrow cv(t)=sd\{x_{j}-t\mid x_{j}>t\}/\text{mean}\{x_{j}-t\mid x_{j}>t\}, (\#eq:sample-cv)\] in practice $t=x_{(k)}$ are the order statistics, where, $k$ $(1\leq k\leq n)$ is the size of the sub-sample excluded. Hereinafter the graph of this function is called $\mathrm{CV}$-plot. Figure 1 shows the $\mathrm{CV}$-plots of nine examples (blue lines) that we comment on the next section.

Point-wise error limits for $cv(t)$ under $\mathrm{GPD}(\xi,\psi)$ (provided $\xi<1/4$) follow from the asymptotic distribution of the empirical residual $\mathrm{CV}$, by del Castillo and Padilla (2016), in particular for a fixed threshold $t$, the asymptotic confidence intervals in Figure 1 (solid orange lines) are obtained by \[\label{cv_asim} \sqrt{n(t)}(cv(t)-c_{\xi})\overset{d}{\rightarrow}N(0,\sigma_{\xi}^{2}), (\#eq:cv-asim)\] where $c_{\xi}$ is in @ref(eq:cvgpd), $n(t)=\sum_{j=1}^{n}1_{(x_{j}>t)}$ . For an exponential distribution $\left(\xi=0\right),$ $c_{0}=1$ and $\sigma_{0}^{2}=1$, and for a uniform distribution $\left(\xi=-1\right),$ $c_{-1}=1/\sqrt{3}$ and $\sigma_{-1}^{2}=8/45$.

By default, if $\sqrt{2}$ is in the range of $y$’s then the cvplot function draws the line $y=\sqrt{2}$ (black dotted line), which corresponds to $\xi=1/4$ (finite fourth moment). Hence, $\mathrm{CV}$-plot larger than this value for high thresholds lead to very heavy tailed distribution and we suggest to switch to transformed data through function tdata (Section 6). Alternatively, finite moments can be checked by a confidence interval for the MLE estimator of evi, or the methods in the R package RobExtremes (Ruckdeschel, Kohl, and Horbenko 2019) and the references cited therein can be used.

The $\mathrm{CV}$-plot is an alternative tool to Hill-plot an to ME-plot. It has two advantages over ME-plot: first, it depends on a scale parameter and CV-plot does not; second, linear functions are defined by two parameters and the constants by only one. So the uncertainty is reduced from three to one single parameter. On the other hand, the Hill-plot can only be used for heavy tailed distributions.

Figure 1: CV-plots stowed from left to right and top to bottom: four different types of execution time distributions of automotive applications, the frequency of words in the novel Moby Dick, Danish fire insurance data, River Nidd exceedances above value 65, Bilbao waves dataset and positive daily returns of euro/dollar exchange rates between 1999 and 2014.

Examples

The use of the cvplot function and its options is described using nine examples. The first four (iFFT, FFT, BIFP and MA) correspond to different types of execution time distributions observed for a set of representative programs for the analysis of automotive applications. Three others are in R packages: MobyDick (“moby” in R package poweRlaw), Danish and Nidd (“danish” and ’“nidd.thresh” in the R package evir). The Bilbao waves dataset (bilbao) was originally analysed by Castillo and Hadi (1997). EURUSD is the dataset of euro/dollar daily exchange rates between 1999 and 2016.

We collect samples with $n=1,000$ observations for $4$ of the $16$ benchmarks in the EEMBC AutoBench suite (Poovey 2007), which is a well-known suite for real-time systems that includes a number of programs used in embedded automotive systems. Hereinafter, these datasets will be called iFFT (idctrn), FFT (aifftr), BIFP (basefp) and MA (matrix), leaving the real names in parentheses, they correspond respectively to Inverse Fast Fourier Transform, Fast Fourier Transform, Basic Integer and Floating Point and Matrix Arithmetic, see Abella et al. (2017) and del Castillo et al. (2017). The histograms of the four datasets are bell-shaped. Hence, when searching for L-shaped distributions, we start the exploratory data analysis of the upper part of the distribution by taking the median as origin. Note also that large samples increase the precision of the estimates, provided that the fitted model is validated. The $\mathrm{CV}$-plots for these four datasets are obtained, for instance, with:

library("ercv")
data(iFFT)
cvplot(iFFT,thr=median(iFFT))

The plots in Figure 1 are stowed from left to right and top to bottom. For iFFT, the $\mathrm{CV}$-plot is inside the confidence interval of the exponential distribution ($evi=0$). Hence, it can be assumed that the CV is constant equal to $1$ (dashed orange line). For FFT, the $\mathrm{CV}$-plot is inside the confidence interval for the last $250$ observations. For BIFP, the $\mathrm{CV}$-plot looks like a constant with CV lower than 1, hence a light tailed GPD is suggested. For MA, the $\mathrm{CV}$-plot suggests a heavy tailed distribution.

The following three $\mathrm{CV}$-plots in Figure 1 are made from the MobyDick, Danish and Nidd datasets, which can be directly loaded from the R packages. The three plots are made with the default cvplot function options, but including title, for instance:

data("moby", package = "poweRlaw")
cvplot(moby,main="MobyDick")

The second row of Figure 1 shows three examples that suggest heavy tailed distributions. In the centre is MobyDick and on the right is the Danish fire insurance dataset, which is a highly heavy-tailed infinite-variance example used to illustrate the basic ideas of extreme value theory, see Embrechts, Klüppelberg, and Mikosch (1997), McNeil, Frey, and Embrechts (2005, Example 7.23) and Novak (2012, Example 9.8). Section 6 shows how to analyse these examples, with the tdata function, using the methodology developed in del Castillo and Padilla (2016).

Nidd is the dataset of high levels of the River Nidd above a threshold value of 65. Its CV-plot is always lower than $\sqrt{2}$, begins in the area of heavy tails and goes into the confidence interval of exponentially. The Bilbao waves dataset was originally analysed by Castillo and Hadi (1997). The Nidd and Bilbao datasets are two of the most commented examples of extreme values theory, which were also analysed by del Castillo and Serra (2015) from the MLE point of view.

By default, the cvplot function draws a $90$% confidence interval of CV-plot from exponential distribution ($evi=0$). The evi parameter of the function provides confidence intervals of the corresponding GPD ($evi<1/4$). The conf.level parameter allows for changing confidence levels. Both evi and conf.level may be a vector. For light tailed distributions, as is presumably the case with the wave levels, it is also advisable to draw a confidence interval from the uniform distribution ($evi=-1$). Hence, the Bilbao CV-plot in Figure 1 has confidence intervals for exponential (orange) and uniform (green) distributions.

data(bilbao)
cvplot(bilbao,evi = c(0,-1),main="Bilbao")

EURUSD is the data frame object of the euro/dollar daily exchange rates between 1999 and 2016, including the financial crisis of 2007-08, which was obtained from the R package quantmod (Ryan 2016). Various parts of the EURUSD series have been studied by several authors, see Gomes and Pestana (2007) and del Castillo and Padilla (2016). The last plot in Figure 1 shows the CV-plot of the positive log-returns of the euro/dollar daily prices, obtained from

data("EURUSD")
prices<-ts(EURUSD$EUR.USD,frequency=365,start=1999)
#plot(prices,col="blue",main="euro/dollar daily prices(1999-2016)")
return <- 100*diff(log(prices));
pos.return <- subset(return, return >0);
cvplot(pos.return,main="pos.returns EUR/USD 1999-2016")

The dynamics of the daily return can be described by a GARCH(1,1) model. One might then hope that for sufficiently high values of $t$ the subset of daily returns that are above $t$ is so well separated in time that independence can reasonably be assumed. Then, the CV-plot clearly shows that the tail of the distribution looks like an exponential.

Estimation and Model diagnostics with `Tm` function

Following the exploratory analysis, we would like to confirm or deny some of the previous observations. It is known that in order to make optimum decisions, it is necessary to quantify the uncertainty of information extracted from data. Statistics provides mechanisms to ensure a controlled probability of error, but there is always the risk of misuse for multiple testing, especially in EVT where quite small changes can be greatly magnified on extrapolation. The asymptotic distribution of the residual coefficient of variation for GPD as a random process indexed by the threshold by del Castillo and Padilla (2016) provides pointwise error limits for CV-plot, used in the last section, and a multiple thresholds test that truly reduces the multiple testing problem, hence, the $p$-values are clearly defined.

Using the building blocks given by (@ref(eq:cv-asim)) the multiple threshold test $T_{m}$ (the Tm function of the R package ercv) for a (supplementary) number of thresholds $m$ as large as necessary for practical applications is constructed from \[\label{eq:Tm} T_{m}(\xi)=n\sum_{k=0}^{m}p^{k}(cv(q_{k})-c_{\xi})^{2}, (\#eq:Tm)\] where $c_{\xi}$ is in @ref(eq:cvgpd), ${q_{k}}$ are the empirical quantiles corresponding to probabilities ${1-p^{k}}$ and probability $p$ is chosen so that $n\,p^{m}\approx omit$, where $omit$ is the smaller sample size used to calculate CV. This statistic can be used to test whether a sample is distributed as a GPD with parameter $\xi$.

The Tm function makes it possible to see whether the $75$ largest values of Nidd can be assumed to be exponentially distributed.

data("nidd.thresh",package = "evir")
Tm(nidd.thresh,evi=0, nextremes = 75)

nextremes   cvopt   evi    tms   pvalue
       75   1.000   0.000  0.981 0.310

The Tm function provides $tms=Tm(evi)/(m+1)$, which is stable on vary the number of thresholds $m$, the p-value says that it can not be rejected exponenciality (the number of simulations can be increased with nsim). Moreover, by default the Tm function assumes that the parameter $\xi$ is unknown ($evi=NA$), then the cvopt is estimated as the value $\widetilde{c}_{\xi}$ such that achieves the minimum of $T_{m}(\xi)$, and reversing @ref(eq:cvgpd) provides an estimator $\widetilde{\xi}$.

The following code shows that the assumption of constant CV (GDP) is rejected for the complete sample.

Tm(nidd.thresh)

nextremes   cvopt  evi    tms   pvalue
      154   1.225  0.167  1.214 0.030

It is rejected that Bilbao dataset is uniform distributed. However, It can not be rejected GPD as the following code shows

Tm(bilbao,evi=-1,nsim=1000)

nextremes   cvopt     evi    tms     pvalue
      179   0.577   -1.000   0.629   0.003

Tm(bilbao,nsim=1000)

nextremes   cvopt     evi    tms     pvalue
      179   0.650   -0.685   0.254   0.172

The confidence interval for the parameter estimation $evi=-0.685$ can be obtained with

cievi(nextremes=length(bilbao),evi=-0.685)

   5%     95%
-0.778 -0.549

Using a small threshold, $(0.1\%)$, the Tm function shows that the positive and negative returns of the euro/dollar between 1999 and 2016 can be assumed exponentially distributed.

Tm(pos.return,m=50,evi=0,thr=0.1,nsim=1000)

nextremes   cvopt   evi   tms    pvalue
     2207   1.000  0.000  0.392  0.780

neg.return <- -subset(return, return <0);
Tm(neg.return,m=50,evi=0,thr=0.1,nsim=1000)

nextremes   cvopt   evi   tms    pvalue
     2187   1.000  0.000  1.160  0.231

The last statement with Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz takes elapsed=4.73 (R> proc.time()).

Threshold selection algorithm (`thrselect`)

There are two different approaches to the question of threshold choice. The first approach is to regard the free choice of the threshold as an advantageous feature of the procedure. By varying the threshold, the data can be explored, and if a single estimate is needed it can be obtained by subjective choice. It may well be that such a subjective approach is in reality the most useful one.

The other, to some extend opposing, view is that there is a need for an automatic method whereby the threshold is chosen by the data. It is fairer to use the word automatic rather than objective for such a method, because there are arbitrary decisions involved in the choice of the method itself. Nevertheless, it is of course the case that conditional on the automatic method being used, the threshold is indeed objective. Automatic methods need not be used in an uncritical way; they can of course be used as a starting point for fine tuning.

The thrselectfunction in the R package ercv starts with the $T_{m}(\xi)$ calculation (@ref(eq:Tm)) where the number of thresholds $m$ must be fixed by the researcher. This determines the thresholds where the CV is calculated, ${0=q_{0}<q_{1}<\cdots<q_{m}}$, which are fixed throughout the procedure. We accept or reject the null hypothesis for the shape parameter using all the thresholds. If the hypothesis is rejected, the threshold excesses $\left(x_{j}-q_{1}\right)$ are calculated for the sub-sample $\left\{ x_{j}>q_{1}\right\}$ . The previous steps are repeated, but removing one threshold, to accept or reject the null hypothesis that the sample comes from a GPD with parameter $\xi$, see del Castillo and Padilla (2016).

If we apply the function thrselect on the Nidd dataset the code shows

DF <- thrselect(nidd.thresh,m=10, nsim=1000)

    m nextremes threshold   rcv   cvopt   evi     tms    pvalue
5   6        63     87.85   1.193 1.073   0.0656  0.408  0.102

This means that the algorithm need $5$ steps to achieve a $p$-value larger than $0.10$ and it is using in this step $m=6$ thresholds. Then, constant CV can be accepted for the last $63$ extremes over the threshold $87.85$, with the CV $cvopt=1.0728$ and the corresponding $evi=0.0656$.

The output of thrselect is in the data frame DF, the printed values are in DF$solution and DF$options provides complementary information that can be used for a more personal approach.

print(DF$options,digits=4)

    m nextremes threshold   rcv  cvopt      evi     tms    pvalue
1  10       154     65.08 1.2486 1.2249  0.166758 1.33553  0.023
2   9       123     74.38 1.4082 1.2183  0.163112 1.47158  0.012
3   8        99     77.80 1.3163 1.1634  0.130594 0.93927  0.034
4   7        79     81.40 1.2587 1.1175  0.099606 0.64548  0.064
5   6        63     87.85 1.1933 1.0728  0.065559 0.40795  0.102
6   5        50     92.82 1.1328 1.0320  0.030493 0.24415  0.217
7   4        40     99.14 1.0714 0.9945 -0.005584 0.12917  0.457
8   3        32    107.94 1.0054 0.9619 -0.040406 0.05888  0.609
9   2        26    115.93 0.9006 0.9396 -0.066323 0.04218  0.637
10  1        21    131.87 0.9473 0.9667 -0.034986 0.01755  0.597

Transformation from heavy to light tails (`tdata`)

It is possible to extend the previous methodology based on CV to all distributions, even without finite moments. For CV-plots above the straight line $y=\sqrt{2}$ , like the three examples in the second row of Figure 1, the datasets are transformed by the strictly increasing function that applies $\left(0,\infty\right)$ to $\left(0,\sigma\right)$, \[y\left(x\right)=\sigma\,x/(x+\sigma),\] where $\sigma>0$, using the tdata function in the R package ercv.

This technique is founded on the following result: if $X$ is a random variable $GPD\left(\xi,\psi\right)$ distributed and $\xi>0$, then for $\sigma=\psi/\xi$ the transformed random variable $Y=y(X)$ is $GPD\left(-\xi,\psi\right)$ distributed. Furthermore, the converse is also true, as evidenced by applying the inverse transformation $x\left(y\right)=\sigma\,y/(\sigma-y)$, see also del Castillo and Padilla (2016). The $\sigma>0$ parameter is estimated by tdata, using MLE with the internal function egpd, (see del Castillo and Serra (2015)) or may be provided by the researcher as a preliminary estimate.

The CV-plots for Danish and MobyDick transformed by tdata function are obtained with:

data("danish",package = "evir")
tdanish<- tdata(danish)
cvplot(tdanish,main="transformed Danish")
tmoby<- tdata(moby)
cvplot(tmoby,main="transformed MobyDick")

The CV-plots in Figure 2 for the transformed datasets are more stable than the original CV-plots in Figure 1 and actually look light tailed. The CV-plot of the transformed MobyDick has a sawtooth profile because the original dataset only takes positive integer values and the smaller values have a high frequency (among the 18,855 values, 1 appears 9,161 times, 2 appears 3,085, ... ). In order to use a GPD approach for this example we assume that the data correspond to positive values rounded to the nearest integer.

Figure 2: CV-plots under tdata transformation of Danish fire insurance data and frequencies of words in the novel Moby Dick.

The Tm function rejects GPD for the complete transformation of MobyDick. The same result is obtained with the transformation of the dataset on the thresholds 2 and 3. However, GPD is not rejected on threshold 4, hence the frequencies of words that appear four or more times in the novel Moby Dick (4,980 observations) can be approximated by a GPD distribution with $evi=0.982$, as the following code shows (changing the sign of evi):

t4moby<-tdata(moby,thr=4)
Tm(t4moby,m=50,nsim=1000)

nextremes   cvopt    evi     tms    pvalue
     4980   0.581   -0.982   0.198  0.293

The Danish example was studied by del Castillo and Padilla (2016). The results obtained are validated by the Tm function after the transformation tdata

Tm(tdanish,m=20,nextremes = 951,omit = 8, nsim = 1000)

nextremes   cvopt     evi     tms      pvalue
      951   0.676    -0.595   0.256    0.253

Applying the thrselect function to Danish after the transformation by tdata we obtain

DF<-thrselect(tdanish,m=30,nsim=1000)

    m  nextremes  threshold  rcv    cvopt     evi    tms     pvalue
19 12    116      1.283      0.589  0.6747   -0.598  0.265   0.11

The automatic algorithm chooses the threshold $1.283$ ($116$ extremes) with the estimate $evi=0.598$ (changing the sign of evi) really close to the previous one $evi=0.595$. The result is different from that obtained by McNeil, Frey, and Embrechts (2005) by MLE $evi=0.50$ ($109$ extremes). However the $cievi$ function shows that $evi=0.50$ can not be rejected, as shown by the confidence interval provided by the following code (changing the sign of evi again),

cievi(116,evi=-0.596)

    5%    95%
-0.714 -0.440

In the next section we will discuss these results with new features of the R package ercv.

Fitting PoT parameters and tail plots (`fitpot` `ccdfplot`)

The tools described in the previous sections provide an asymptotic model for threshold exceedances over a high quantile, the so-called peak-over-threshold (PoT) method, see McNeil, Frey, and Embrechts (2005). The PoT method is based on determining a high enough threshold from which the distribution of the observations above this value, adjusted to zero, approaches to a GPD distribution. Then, given a threshold $t$, for $x>t$ the complementary cumulative distribution function (ccdf) is estimated by \[\label{eq:pot} 1-\hat{F}\left(x\right)=\hat{p}_{t}\left(1-G\left(x-t;\hat{\xi}_{t},\hat{\psi}_{t}\right)\right) (\#eq:pot)\] where $G\left(x;\xi,\psi\right)$ is the cumulative distribution function of the GDP, whose probability density function was introduced in (@ref(eq:eq-GPD)), and $\left(\hat{\xi}_{t},\hat{\psi}_{t}\right)$ are their estimated parameters for the $n_{t}$ threshold exceedances over $t$ adjusted to zero, from a sample of size $n$ with $\hat{p}_{t}=n_{t}/n$. Alternatively, given $n_{t}$ the estimated parameter is $t$.

The ppot function is the cumulative distribution function for the PoT method. That is, given an estimate of the four parameters in (@ref(eq:pot)), $\left(\hat{\xi},\hat{\psi},\hat{t},\hat{p}\right)$, the right hand part of (@ref(eq:pot)) is provided by $1-$ppot$\left(x,\left(\hat{\xi},\hat{\psi},\hat{t},\hat{p}\right)\right)$. The qpot is the quantile function for the PoT method that assigns to each probability $p$ attained by ppot the value x for which ppot$\left(x\right)=p$, given the same vector of four parameters. The qpot function can be used in the estimation of high quantiles, that in terms of risk is expressed as the value at risk (VaR). For a small $p$, $VaR_{p}=q$ if and only if $1-F\left(q\right)=p$. Hence, if $\varepsilon<\hat{p}$, \[VaR_{\epsilon}=\hat{t}+qpot\left(\left(1-\varepsilon/\hat{p}\right),\left(\hat{\xi},\hat{\psi},\hat{t},\hat{p}\right)\right).\]

The fitpot function of the R package ercv provides an estimate of the four parameters in (@ref(eq:pot)) that allow approximating the empirical cumulative distribution function of a dataset. It is assumed that the threshold $t$, or the number of extremes, has been chosen based on the tools of the previous sections. By default fitpot uses MLE. However, since parameter $\xi$ (evi) can be estimated minimizing (@ref(eq:Tm)) by the Tm function, this value can be entered into the function fitpotand then it uses MLE by the restricted model to a single parameter. From now on this method of estimation will be called CV method.

The two methods of estimation offitpot applied to Danish explain the differences between the results obtained by us and by other researchers, which we have discussed in the previous section, as we can see with the code

fit1<-fitpot(danish,nextremes =116);fit1               #MLE

  evi       psi     threshold   prob
0.446      7.462    9.200      0.054

fit2<-fitpot(danish,evi=0.598,nextremes =116);fit2     #CV

  evi       psi     threshold   prob
0.598      6.450    9.200      0.054

Naturally, different estimation methods provide different estimates, but the question of identifying the best approach still remains. To clarify this point, we can use the ccdfplot function, which draws the empirical complementary cumulative distribution function with the approximations provided by the parameters estimated by fitpot. The ccdfplot function allows to draw several approaches at several scales. The approximation is linear in the log-log scale for datasets with heavy tails, although it is linear in log scale for datasets with exponential tails (log = y, by default). To draw the approach on natural scale the option log = has to be used.

The plots of Figure 3 have been obtained with ccdfplot function applied to Danish data with the estimates obtained by MLE (orange) and CV method (green) on logarithmic and double logarithmic scales, with

ccdfplot(danish,pars=list(fit1,fit2),main="Danish (log scale)")
ccdfplot(danish,pars=list(fit1,fit2),log="xy",main="Danish (log-log
scale)")

Figure 3: Complementary cumulative distribution function of Danish fire insurance data adjusted by MLE and CV methods. in log scale and log-log scale.

Figure 3 shows that both adjustments are reasonable. The CV method is not worse than MLE, perhaps less optimistic or more realistic. The previous PoT approach can be validated using the Clauset, Shalizi, and Newman (2009) point of view.

Based on the four parameters estimated by fitpot $\left(\hat{\xi},\hat{\psi},\hat{t},\hat{p}\right)$ for heavy tailed models (evi > 0), the linear relationship (@ref(eq:linear)) can be obtained for the dataset values over the threshold, with the new threshold $\hat{\sigma}=\hat{\psi}/\hat{\xi}$ and the probability $1$, see the following code.

fit1<-as.numeric(fit1$coeff);sg1<- fit1[2]/fit1[1];sg1
fit2<-as.numeric(fit2$coeff);sg2<- fit2[2]/fit2[1];sg2
exDanish<-danish[danish>fit1[3]]-fit1[3]  #origin to zero
exDanish1<- exDanish+sg1                  #origin to sg1
exDanish2<-exDanish+sg2                   #origin to sg2
exfit1<-c(fit1[1],fit1[2],sg1,1)
exfit2<-c(fit2[1],fit2[2],sg2,1)
ccdfplot(exDanish, pars=c(exfit1),log="xy",main="adjusted by MLE")
ccdfplot(exDanish2, pars=c(exfit2),log="xy",main="adjusted by CV")

The Figure 4 plot (a) shows the linear relationship (@ref(eq:linear)) for the 116 upper extremes of Danish adjusted by MLE. Changing the previous fit1 by fit2 the linear relationship is obtained by the CV method and is shown in plot (b). Notice that the linear relationship (@ref(eq:linear)) begins at the threshold $sg1=16.727$ for MLE and at a threshold $sg2=10.787$ for the CV method, so we can not overlay them in the same graph. The goodness of fit can now be measured by the correlation between the logarithm of the complementary empirical distribution function, $log(1-F_{n})$ and the logarithm of the data, $\log(x+sg)$, where $\left(x+sg\right)$ are the 116 upper extremes of Danish, adjusted to sigma. The results are $correlation=-0.981$ using MLE, plot (a), and $correlation=-0.990$ using CV-method, plot (b).

We can also calculate the threshold $th$ having a maximum correlation between $log(1-F_{n})$ and $log(x+th)$, obtaining $th=6.996$ and $correlation=-0.992$. Thus, the correlation on which the goodness of the CV method adjustment is based on is very close to the best that can be obtained by this procedure, which is in line with Clauset, Shalizi, and Newman (2009) and the poweRlaw R package by Gillespie (2015) (although here the estimation of evi is different). This shows that the methodology provided by the R package ercv complements and connects the contributions of evir (Pfaff and McNeil 2012) and poweRlaw by Gillespie (2015).

Figure 4: The linear relationship for the 116 upper extremes of Danish fire insurance data adjusted by MLE and CV method.

Acknowledgements

This work was supported by the Spanish Ministry of Economy and Competitiveness under Grant: Statistical modelling of environmental, technological and health risks, MTM2015-69493-R. David Moriña acknowledges financial support from the Spanish Ministry of Economy and Competitiveness, through the María de Maeztu Programme for Units of Excellence in R&D (MDM-2014-0445) and from Fundación Santander Universidades.

Fitting Tails by the Empirical Residual Coefficient of Variation: The ercv Package

Joan del Castillo

Isabel Serra

Maria Padilla

David Moriña

2019-12-27

Introduction and overview

Mathematical Background

Threshold models

The power law distribution and GPD

The residual CV approach

Exploratory data analysis with `cvplot` function

The empirical residual CV and confidence intervals.

Examples

Estimation and Model diagnostics with `Tm` function

Threshold selection algorithm (`thrselect`)

Transformation from heavy to light tails (`tdata`)

Fitting PoT parameters and tail plots (`fitpot` `ccdfplot`)

Acknowledgements

Fitting Tails by the Empirical Residual Coefficient of Variation: The ercv Package

Joan del Castillo

Isabel Serra

Maria Padilla

David Moriña

2019-12-27

Introduction and overview

Mathematical Background

Threshold models

The power law distribution and GPD

The residual CV approach

Exploratory data analysis with cvplot function

The empirical residual CV and confidence intervals.

Examples

Estimation and Model diagnostics with Tm function

Threshold selection algorithm (thrselect)

Transformation from heavy to light tails (tdata)

Fitting PoT parameters and tail plots (fitpot ccdfplot)

Acknowledgements

Exploratory data analysis with `cvplot` function

Estimation and Model diagnostics with `Tm` function

Threshold selection algorithm (`thrselect`)

Transformation from heavy to light tails (`tdata`)

Fitting PoT parameters and tail plots (`fitpot` `ccdfplot`)