Abstract
We implement in the R package carx a novel and computationally efficient quasi-likelihood method for estimating a censored autoregressive model with exogenous covariates. The proposed quasi-likelihood method reduces to maximum likelihood estimation in absence of censoring. The carx package contains many useful functions for practical data analysis with censored stochastic regression, including functions for outlier detection, model diagnostics, and prediction with censored time series data. We illustrate the capabilities of the carx package with simulations and an elaborate real data analysis.Censored data are frequently encountered in diverse fields including environmental monitoring, medicine, economics, and social sciences. Censoring may arise, for example, when a measuring device is subject to some detection limits beyond which the device cannot yield a reliable measurement. Censoring can also occur due to regulations on price change, e.g., limits on maximal intra-daily price change in a stock market.
There exists an extensive literature on regression analysis with
censored responses since the pioneering work of Buckley and James (1979). Considerable efforts
have also been spent implementing existing methods for estimating
various models with censored observations, many of which have been
implemented in R. For instance, Henningsen
(2010) introduced the censReg
package (Henningsen 2013), which covers
standard regression models with censored responses including the
standard Tobit model (Tobin 1958), maximum
likelihood estimation with cross-sectional data, and random-effects
maximum likelihood procedure for panel-data using Gauss-Hermite
quadrature. The Tobit model is also implemented in other packages with
possibly different estimation methods, including tobit() in
AER
(Kleiber and Zeileis 2008),
cenmle() in NADA
(Lee 2013), tobit() in VGAM
(Yee 2015), MCMCtobit() in MCMCpack
(Martin, Quinn, and Park 2011), etc.
While there exists an extensive literature on estimating regression models with censored responses and associated software, there are few studies with censored time series response data. More generally, the problem of stochastic regression with both the response and covariates being possibly censored is relatively under-explored. Zeger and Brookmeyer (1986) studied maximum likelihood estimation of a regression model with the errors driven by an autoregressive model of known order \(p\ge 0\) (\(AR(p)\)). Owing to censoring, the “state" vector is generally of variable dimension which can increase rapidly with increasing censoring and AR order. Thus, the maximum likelihood estimation becomes quickly numerically intractable with increasing censoring even for moderately high AR order (Wang and Chan 2017b). Zeger and Brookmeyer (1986) also briefly discussed a pseudo-likelihood approach but did not further develop it. Park, Genton, and Ghosh (2007) proposed an imputation method to estimate a censored autoregressive moving average (ARMA) process. Their method imputes each censored value by some random value simulated from their conditional distribution given the observed data and the censoring information, and treats the imputed time series as the complete data with which estimation can be done by any standard method. However, they focused on the AR(1) model and relied on simulation studies to demonstrate their method, with no derivation of theoretical properties.
In term of publicly available R packages facilitating estimation with
censored time series data, we are aware of only three such packages to
date, namely, cents
(McLeod et al. 2014), ARCensReg
(Schumacher, Lachos, and Galarza 2016),
and our carx
(Wang and Chan 2017a). The
cents package includes the
fitcar1() function, for fitting an AR(1) model in the
presence of censored and/or missing data, and the cenarma()
function which, according to the authors, implements a quasi-EM
algorithm whose M-step is carried out by the arima()
function and the E-step via the Durbin-Levinson recursions. However,
there is little documentation about these functions, rendering it hard
to understand and use the cents package. The
ARCensReg package offers similar functionality
as our carx package. But their estimation is
implemented via a stochastic approximation version of the EM (SAEM),
which is different from our approach. In addition, it seems to be
developed after the carx, as a dataset in
carx is included in
ARCensReg.
Motivated by the need for developing a computationally efficient method for estimating censored stochastic regression models, Wang and Chan (2017b) have recently introduced such a method for censored autoregressions with exogenous covariates (CARX). The basic idea of our new approach assumes that the score of the complete-data conditional log-likelihood of \(Y_t^*\) (the uncensored counterpart of \(Y_t\)) given \(Y_{t-j}^*, j=1,\ldots, p\) (and the covariates) has a closed-form expression and so does its expectation given the possibly censored time series \(Y_{t-j}, j=0,\ldots, p\), evaluated at the same set of model parameters. Setting the preceding conditional mean score to zero then provides an unbiased estimating equation for estimating the model. The proposed method reduces to maximum likelihood estimation in the absence of censoring, hence it is referred to as quasi-likelihood estimation. Furthermore, the consistency and asymptotic normality of the quasi-likelihood estimator have been established under some mild regularity conditions (Wang and Chan 2017b).
In this paper we aim to introduce the R package carx, in which quasi-likelihood estimation of a CARX model is implemented for the important special case of normal innovations. The main functionality of the package is to provide an intuitive interface with comprehensive documentation to enable the user to estimate the parameters of a CARX model. In addition, some utility functions for model summary, model diagnostics, outlier detection, and prediction with censored time series data are also included in the package.
In addition, we have also implemented a new object class for censored
time series, i.e., "cenTS". The "cenTS" class
inherits the extensible time series class "xts" in the R
package xts
(Ryan and Ulrich 2017). Some
functionalities, including plotting and summary, for the
"cenTS" class have been implemented. The
"cenTS" class is expected to be extended in future and is
hoped to be used as a standard data structure for censored time series
data.
In the following sections we first elaborate the CARX model and review the quasi-likelihood estimation method, then present the functionality and main functions of the R package carx and illustrate the package with data analyses using both simulated and real data examples. Some simulation studies assessing the empirical performance of model selection by minimizing the AIC and the accuracy of the proposed forecasting method and real data example are also reported.
In this section we briefly review the quasi-likelihood method for estimating a CARX model, and refer the reader to Wang and Chan (2017b) for details and some theoretical properties of the estimator. We first formulate the problem by specifying the model, then outline the estimation method and discuss some specific topics including model prediction, model diagnostics, and outlier detection.
Let \(\left\{ Y_t^*\right\}_{t=0}^\infty\) denote a real-valued time series of interest with \(Y_t^*\) being not observable if it falls inside a censoring region \(C_t\subset R\) which may be time-varying. The censoring region \(C_t\) is generally an interval of the form \((-\infty,l_t)\), \((u_t,\infty)\), or \((l_t,u_t)\) corresponding to left, right, and interval censoring, respectively (Huang and Rossini 1997; Park, Genton, and Ghosh 2007). (Left and/or right censoring is allowed by carx but interval censoring is not yet implemented in carx.) In practice, when an observation is censored, it is often recorded as the nearest censoring limit, as it is typically known whether it is left or right censored. The carx package assumes the censoring limits to be independent of the underlying process, and automatically treats any missing data as resulting from left censoring with their corresponding censoring limit \(l=\infty\).
In practice, \(Y_t^*\) is often found to be correlated with some vector covariate, say, \(X_t\). We assume a linear regression relationship between \(Y_t^*\) and \(X_t\), with the regression errors following an \(AR(p)\) model, where \(p\) is the AR order.
The Censored Auto-Regressive model with eXgenous variables (CARX) specifies that the uncensored response \(\{Y_t^*\}\) is an autoregressive (AR) process given by \[\begin{aligned} \label{arEqn2} \left(Y_t^*-X_t^\intercal\beta\right)-\sum_{j=1}^p \psi_j \left(Y_{t-j}^*-X_{t-j}^\intercal\beta\right) = \varepsilon_t, \end{aligned} (\#eq:arEqn2)\] and \(Y_t^*\)’s are linked to the observations as follows \[\begin{aligned} \label{yEqn} Y_t &= \left\{ \begin{array}[c]{ll} l_t, & \mbox{if $Y_t^*$ is left censored},\\ u_t, & \mbox{if $Y_t^*$ is right censored},\\ Y_t^*,& \mbox{otherwise}, \end{array}\right. \end{aligned} (\#eq:yEqn)\] where \(\beta\) is the vector of regression coefficients, \(\psi_i, i=1,2,\ldots,p\), are the AR parameters, \(\{\varepsilon_t\}\) is an independent and identically distributed (iid) process with mean 0, variance \(\sigma_\varepsilon^2\), and independent of \(\{X_t\}\).
The \(\varepsilon_t\)’s are also known as the innovations in the time series literature. Eqn. (@ref(eq:arEqn2)) is equivalent to the regression model \(Y_t^*= X_t^\intercal\beta+\eta_t\), where the regression errors \(\eta_t\) are correlated over time and follow an AR(\(p\)) process with the \(\psi\)’s being the AR coefficients. In the package, the innovations are assumed to be normal although it is shown by (Wang and Chan 2017b) that the proposed estimation method is robust to mild departure from the normality assumption.
Let \(\psi = (\psi_1,\cdots,\psi_p)^\intercal\). Throughout, \(\theta = (\beta^\intercal,\psi^\intercal,\sigma_\varepsilon)^\intercal\) denotes a generic parameter vector, while \(\theta_0\) denotes the true parameter vector. Let \(\left\{ (Y_t,X_t)\right\}_{t=1}^n\) be data generated from the CARX model with parameter \(\theta_0\). The quasi-likelihood estimation procedure is motivated by maximum likelihood estimation and leverages on (i) the availability of the closed-form expression of \(\ell_t(\theta) =\ell\left(Y_t^*|Y^*_{t-j}, j=1,\ldots, p, X_{t-k}, k=0,\ldots, p; \theta\right)\) (which holds, for instance, for the case of normal errors as implemented in carx) and (ii) \(\sum_{t=p+1}^n S_t(\theta)=0\) is an unbiased estimating equation, where \(S_t(\theta)\) is the first derivative of \(\ell_t(\theta)\) with respect to \(\theta\). Since \(Y_t^*\) are unobservable, we replace \(S_t(\theta)\) by \(E_\theta\left(S_t(\theta)|Y_{t-k}, X_{t-k}, k=0,\ldots,p\right)\) resulting in the following estimating equation: \[\begin{aligned} \label{estEqn} \sum_{t=p+1}^n E_\theta\left(S_t(\theta)|Y_{t-k}, X_{t-k}, k=0,\ldots,p\right)= 0. \end{aligned} (\#eq:estEqn)\]
The quasi-likelihood method estimates \(\theta\) by solving Eq @ref(eq:estEqn). Note that, in the absence of censoring, solving the preceding estimating equation reduces to maximum likelihood estimation, asymptotically.
The following iterative scheme for solving Eq @ref(eq:estEqn) was proposed by Wang and Chan (2017b).
Initialize the parameter estimate by some consistent estimate, denoted by \(\theta^{(0)}\).
For each \(k=1,\dots\), obtain an update of estimate \(\theta^{(k)}\) by \[\begin{aligned} \label{maxQ} \theta^{(k)} = \mbox{argmax}_{\theta} \; Q\left(\theta|\theta^{(k-1)}\right), \end{aligned} (\#eq:maxQ)\] where \[\begin{aligned} \label{eqn:avg_z} Q\left(\theta|\theta^{(k-1)}\right) &= \sum_{t=p+1}^n Q_t\left(\theta|\theta^{(k-1)}\right),\\ \end{aligned} (\#eq:eqnavg-z)\]
\[\begin{aligned} \label{qt} Q_t\left(\theta|\theta^{(k-1)}\right)&=\mbox{E}_{\theta^{(k-1)}}\left( \ell_t(\theta)|Y_{t-k}, X_{t-k}, k=0,\ldots,p \right). \end{aligned} (\#eq:qt)\]
Iterate Step (2) until \(\|\theta^{(k)}-\theta^{(k-1)}\|_2/\|\theta^{(k-1)}\|_2<\epsilon\) for some positive tolerance \(\epsilon \approx 0\). Let \(\hat{\theta}\) be the estimate obtained from the last iteration.
The optimization in Step (2) for the case of normal innovations is elaborated in Section 2.4 of Wang and Chan (2017b). The value \(Q(\hat{\theta}|\hat{\theta})\) evaluated at the convergence of the algorithm will be referred to as the maximum (quasi-)log-likelihood. In the absence of censoring, it reduces to the maximum log-likelihood, hence it will be used to replace the latter in evaluating information criteria such as the Akaike information criterion (AIC) (Konishi and Kitagawa 2008).
In the carx package, the initial value for the preceding iterative algorithm is set to the conditional least squares estimate obtained with the censored data replaced by the corresponding censoring limit, which appears to work well in simulation examples reported in Wang and Chan (2017b).
Wang and Chan (2017b) proved the consistency and asymptotic normality of the quasi-likelihood estimator under mild regularity conditions. But the asymptotic covariance matrix of the estimator involves two intractable matrices. Consequently, Wang and Chan (2017b) proposed to use parametric bootstrap for drawing inference, including estimating the asymptotic covariance matrix and constructing confidence intervals of the unknown parameters.
It is of practical interest to predict the future values \(Y_{n+h}^*\) given the observations \(\left\{ (Y_t,X_t) \right\}_{t=1}^n\), where \(h=1,2,\dots,H\) and \(H\) is some fixed upper bound, for instance, \(H=14\) for bi-weekly forecast, assuming the data are sampled daily. This is generally a non-trivial problem in the presence of censoring, and can be handled by Monte Carlo simulation for its solution. Since \(X\) is an exogenous process, we consider the simple case of the prediction problem conditioned on the given future covariate values \(\left\{ X_{t+h} \right\}_{h=1}^H\). We also assume normality of \(\varepsilon_t\) and known parameter \(\theta_0\), although the following discussion can be readily extended to non-normal innovations. Relaxation of these assumptions will be discussed at the end of this subsection. The prediction problem is equivalent to finding the conditional distribution \[\begin{aligned} \mathcal{D}_{n,h} &= \mathcal{D}\left(Y^*_{n+h}|\left\{ X_{n+i}\right\}_{i=1}^h ,\left\{ (Y_t,X_t) \right\}_{t=1}^n\right)\\ &= \mathcal{D}\left(Y^*_{n+h}|\left\{ X_{n+i} \right\}_{i=1}^h,\left\{ (Y_t,X_t) \right\}_{t=\tau}^n\right), \end{aligned}\] where \(\tau = \max\left( \left\{ 1 \right\} \cup \left\{u: 1\le u\le n-p+1, \textrm{ and none of } \left\{ Y_t \right\}_{t=u}^{u+p-1}\textrm{ is censored}\right\}\right)\), due to the autoregressive nature of the regression errors \(\eta_t = Y_t^* - X_t^\intercal\beta\) (Zeger and Brookmeyer 1986).
There are two cases. Case 1: \(\tau = n-p+1\), i.e., the most recent \(p\) \(Y_t\)’s are uncensored so that the prediction problem admits a closed-form solution which is well-known; see, e.g., Cryer and Chan (2008, chap. 9). Specifically, for any \(h=1,\dots,H\), \(\mathcal{D}_{n,h}\) is a normal distribution whose mean serves as the point predictor denoted by \(\hat{Y}^*_{n+h}\) that can be recursively computed as follows: \(\hat{Y}^*_{n+h} = X_{n+h}^\intercal{\beta} + \hat{\eta}_{n+h}\), with \(\hat{\eta}_{t} = \sum_{l=1}^p {\psi}_l \hat{\eta}_{t-l}\) for \(t>n\), and \(\hat{\eta}_{t} = Y_{t}-X_t^\intercal{\beta}\) if \(t\le n\). The prediction error, denoted by \(\epsilon_{n+h} = Y_{n+h} - \hat{Y}^*_{n+h}\), can be written as \(\epsilon_{n+h} = \varepsilon_{n+h} + \sum_{l=1}^p \psi_l \epsilon_{n+h-l} = \sum_{i=0}^h\omega_{h,i}\varepsilon_{n+h-i}\), where the coefficients \(\omega_{h,i}\) can be recursively calculated by making use of the preceding identity and the initial condition \(\omega_{h,0}=1\). The prediction variance is given by \(\mathrm{var}(\epsilon_{n+h}) = {\sigma_\varepsilon}^2\sum_{i=0}^h \omega_{h,i}^2\).
We now consider Case 2: \(\tau < n-p+1\). Then \(\mathcal{D}_{n,h}\) is a truncated multivariate normal distribution. Although the first and second moments of \(\mathcal{D}_{n,h}\) admit closed-form solutions (Tallis 1961; Genz et al. 2017), they are not useful for constructing predictive intervals as the predictive distributions are non-normal. Thus, we propose to use a sampling approach to estimate any interesting characteristic of the predictive distribution of \(Y^*_{n+h}\). First, note that the regression errors \(\left\{ \eta_t=Y_t^* - X_t^\intercal\beta \right\}_{t=\tau}^n\) are jointly normal. Let \(\eta_c\) and \(\eta_o\) be the sub-vectors of \(\eta_{\tau:n}\) such that the corresponding elements of \(Y_{\tau:n}\) are censored and observed, respectively. Then given \(\{(Y_t,X_t)\}_{t=\tau}^n\), \(\eta_c\) follows a truncated multivariate normal distribution, whose realizations can be readily simulated, and hence we can simulate \(Y^*_t=X_t^{^\intercal}\beta+\eta_t, \tau\le t\le n\). Then the realizations of \(Y^*_{n+h}, h=1,\ldots, H\) can be drawn from the multivariate normal predictive distribution stated in Case 1. Predictive intervals of \(Y^*_{n+h}\) can then be approximately constructed from a random sample from the predictive distribution of \(Y^*_{n+h}\), using the percentile method.
Note that the proposed predictive scheme is conditional on the future covariate values \(\{X_t\}_{h=1}^H\), which, in general, are non-deterministic. Extension to the case of stochastic \(\left\{ X_{t+h} \right\}_{h=1}^H\) is straightforward, provided that its stochastic generating mechanism is known, as drawing a realization from the predictive distribution of \(Y_{n+h}^*\) can be done in two steps. Step 1 consists of drawing a realization \(\left\{ x_{t+h} \right\}_{h=1}^H\), followed by drawing a future realization for \(Y^*_{n+h}\) given the data and \(\left\{ x_{t+h} \right\}_{h=1}^H\). In practice, \(\theta_0\) is unknown and it can be replaced by the quasi-likelihood estimator or a parametric bootstrap approach which can be readily implemented to incorporate parametric uncertainty in the prediction.
A main task in model diagnostics consists of checking whether or not the data are consistent with the model assumption that the innovations are independent and identically normally distributed of zero mean and constant variance. In the presence of censoring, how to define the residuals is unclear. For the simple case when \(Y_t^*\) is observed, the corresponding residual is universally defined as \(Y_t^*-\hat{Y}^*_{t|t-1}\), where \(\hat{Y}^*_{t|t-1}\) is the mean of \(\mathcal{D}_{t-1,1}\), evaluated at the parameter estimate. In the case of censoring so that some \(Y_t^*\)s are unobserved, (Wang and Chan 2017b) advocated the use of the simulated residuals (Gourieroux et al. 1987) for model diagnostics. The simulated residuals are constructed as follows. First, impute each unobserved \(Y_t^*\) by a realization from the conditional distribution \(\mathcal{D}\left(Y^*_{t}|\left\{ (Y_s,X_s) \right\}_{s=1}^t\right)\), evaluated at the parameter estimate. Then, refit the model with \(\{(Y_t^*, X_t)\}\) so obtained, via conditional maximum likelihood; the residuals from the latter model are the simulated residuals \(\hat{\varepsilon}_t\). Let the corresponding parameter estimate of \(\theta\) be \(\tilde{\theta}\). The corresponding (simulated) partial residuals for the \(X\)’s, i.e., \(X_t^{^\intercal} \tilde{\beta}+\hat{\varepsilon}_t\), can be used to assess the relationship between \(Y\) and \(X\), after adjusting for the autoregressive errors.
A simulation study reported in Wang and Chan (2017b) suggests that the asymptotic null distribution of the Ljung-Box test statistic, for checking residual autocorrelations, based on the simulated residuals is the same as that based on the uncensored data. Thus, standard diagnostic tools for residual analysis may be applicable with the simulated residuals.
Real data are often marred by outliers. An outlier in a time series may result from a perturbation inducing an unknown shift in an observation or an innovation, resulting in the so-called additive or innovative outlier, respectively ( Cryer and Chan 2008). An innovative outlier (IO) may mask as a contiguous block of additive outliers (AO). Since it is harder to detect IOs in censored time series, we focus on detecting AOs with a new method for doing so in censored time series.
As the number of outliers and their locations are generally unknown, outlier detection is carried out one by one and iteratively. The procedure begins with an outlier-free CARX model. Then we check for the presence of additive outliers by a method to be described below. If an outlier is detected at time \(t_o\), the covariate \(X\) will be augmented with the indicator variable \(I_{t_o}\) which equals 1 if \(t = t_o\), and 0 otherwise. The augmented CARX model is then fitted, with which outlier detection is repeated until no more outliers are found.
More specifically, we describe a method to detect any remaining additive outliers given the data and a CARX model. For the sake of fast computation, we consider the predictive distribution of \(Y_t\) given the information from \(t-p\) to \(t\), i.e., \(\tilde{\mathcal{D}}_{t} := \mathcal{D}\left(Y^*_{t}| X_{t},\left\{ \left(Y_{t-j},X_{t-j}\right),j=1,\dots,p \right\}\right)\). Let \(P_{\tilde{\mathcal{D}}_t}\left(E\right)\) be the probability of the event \(E\) evaluated with distribution \(\tilde{\mathcal{D}}_t\) and \(n\) the sample size, for each \(t = p+1,\cdots,n\), we calculate the following probability \(p_t\). \[\begin{aligned} \label{pt} p_t &= \left\{ \begin{array}[c]{ll} P_{\tilde{\mathcal{D}}_t}\left( Y_t > u_t\right), & \mbox{if $Y_t^*$ is right censored},\\ P_{\tilde{\mathcal{D}}_t}\left( Y_t < l_t\right), & \mbox{if $Y_t^*$ is left censored},\\ \min \{P_{\tilde{\mathcal{D}}_t}\left( Y_t > y_t\right), P_{\tilde{\mathcal{D}}_t}\left( Y_t < y_t\right)\},& \mbox{otherwise}. \end{array}\right. \end{aligned} (\#eq:pt)\]
Let \(t_o = \mbox{argmin}_{t=1,\cdots,n} p_t\), and the response at \(t_o\) is declared as an AO if \(p_{t_o} < 0.025/n\), where the Bonferroni inequality is used to limit the family error rate to not exceed 5% ( Cryer and Chan 2008); otherwise, it is deemed that there are no remaining outliers.
In this section we present the R package carx in which the estimation, prediction, and diagnostics procedures discussed in previous section are implemented, assuming the normality of \(\varepsilon_t\). For more detail, see the documentation of the package. Examples will be given in the next section.
First, let us introduce a class "cenTS" designed to
encapsulate a censored time series with its observed values as well as
the left (lower) and right (upper) censoring limits. The
"cenTS" inherits the extensible time series class
"xts" in the R package xts. A
"cenTS" object can be constructed by the following function
call.
cenTS(value, order.by, lcl = NULL, ucl= NULL, value.name = "value", ...)
Note that the value (whose name can be set in
value.name) and order.by denote the observed
values and their corresponding indices respectively, and
lcl and ucl denote the left (lower) and right
(upper) censoring limits respectively. Other time series variables to be
included as covariates in the regression can be supplied via additional
arguments.
A "cenTS" object can be inspected by the
print() and plot() methods. Any covariate time
series can be retrieved by the xreg() method.
The foremost function is the method for the S3 class
"carx", carx(), whose signature is the
following.
carx(formula, data = list(), p = 1,
prmtrX = NULL, prmtrAR = NULL, sigma = NULL,
y.na.action = c("skip","as.censored"), addMu = TRUE,
tol = 1e-4, max.iter = 500,
CI.compute = FALSE, CI.level = 0.95,
b = 1000,b.robust = FALSE,b.show.progress = FALSE,
init.method = c("biased","consistent"),
cenTS = NULL, verbose = FALSE,...)
The carx() method provides a simple-to-use interface for
the user to input a formula, a data set, and other arguments to estimate
a CARX model.
The carx() method returns a "carx" object
which stores the supplied data, the quasi-likelihood coefficient
estimates, as well as other information. It allows many optional
arguments to control the function behavior. The main arguments are
listed below:
formula is a formula representing the regression
part of the model, such as y ~ x1 + x2.
data denotes a data.frame which
includes the following:
The response variable with variable name identified by the supplied formula.
Any covariate(s) with variable name(s) identified by the supplied formula.
A vector with name ci whose components take values
from \(\{-1, 0, 1\}\), where -1 (0,1)
indicates that the corresponding element in the response variable is
left-censored (not censored, right censored).
lcl representing the vector of left (lower)
censoring limits. If not present, indicating no lower limit.
ucl representing the vector of right (upper)
censoring limits. If not present, indicating no upper limit.
p denotes the autoregressive order of the regression
errors, default = 1.
The above arguments supply the data structure including the censoring information, and specify the CARX model to be estimated. Although the function contains many optional arguments for fine-tuning the fitting algorithm and obtaining more information about the fitted model such as confidence intervals, we merely discuss the following two arguments:
prmtrX, prmtrAR, and sigma
are used to specify the initial values of the regression coefficients
\(\beta\), the autoregressive
parameters \(\Psi\), and the innovation
standard deviation \(\sigma_\varepsilon\),
respectively.
y.na.action is a string indicating how to handle
missing (NA) values in y. If it is set to
"skip" (default), cases containing a missing value will be
skipped, so that the estimating equation of future cases will be
conditional on the most recent p complete cases after the
skipped case. For "as.censored", the y value
will be treated as left-censored with the left (lower) censoring limit
replaced by positive infinity. The user may choose to use
skip if there exist few long gaps in the response. Use
"as.censored" in the presence of numerous, non-contiguous
missing values in y. Note that the presence of any missing
values in x will automatically hard-code
y.na.action to be "skip".
As "carx" is an S3 class, some generic
methods have been implemented so that the estimation function can be
easily called for practical use and more information about the model
fitting can be easily extracted.
The function print() simply returns a plain output of
the fitted model, while the summary() function provides a
more elaborate summary of the fitted model including the estimates,
their standard errors, 95% confidence limits and p-values, based on
parametric bootstrap, for each model parameter, if
CI.compute = TRUE. The model parameters can be conveniently
extracted by the function coef(), which returns all
coefficient estimates except that the error (innovation) standard
deviation is returned as the sigma component of the list
returned by carx(). logLik() returns the
maximum (quasi-)log-likelihood \(\sum_{t=p+1}^n \mbox{E}_{\hat{\theta}}\left[
\ell\left(Y_t^*|\mathcal{F}^*_t;\hat{\theta}\right)|\mathcal{G}_t
\right]\), which can be used in lieu of the intractable maximum
log-likelihood. For instance, the function AIC() computes
the AIC of the model with the maximum log-likelihood replaced by maximum
(quasi-)log-likelihood.
There are some other useful functions in the package. The method
plot() draws the time plot of the censored response time
series, superimposed with the fitted values (1-step-ahead predictors)
from the supplied CARX model. The function predict()
computes the multi-step-ahead point predictors and their associated
prediction limits, based on a given model and future values of the
covariates supplied by the user. The function fitted()
returns the fitted values by calling the predict method.
The function residuals() returns the simulated residuals of
the fitted model. The outlier detection method discussed in Section 2.5 is implemented by the method
outlier(). Model diagnostics based on the simulated
residuals are visualized by the method tsdiag(), which
consists of four subplots: the time plot of standardized simulated
residuals, the residuals versus fitted values plot, the residual
autocorrelation function plot, and the plot of the p-values of the
Ljung-Box test statistics, for testing no residual autocorrelations.
In this section we illustrate the various functions of the package
through two examples, the first one is a simulated data set and the
second a real data set. Note that an extensive simulation study about
the performance of the proposed estimation method and some model
diagnostics can be found in Wang and Chan
(2017b) which shows the robustness of the proposed estimation
method to slight departure from the normality assumption of the
innovations. We first load the carx package by the
following command.
> library(carx)
To begin, we introduce the function carxSimCenTS() for
simulating data from a CARX model, whose signature and default values of
arguments are shown below.
carxSimCenTS(nObs = 200, prmtrAR = c(-0.28,0.25),
prmtrX = c(0.2,0.4), sigma = 0.60, lcl = -1, ucl = 1, x = NULL,
seed = NULL, value.name = 'y', end.date = Sys.date())
The carxSimCenTS() function generates a simulated
"cenTS" time series of length nObs, with the
AR parameters \(\left(\psi_i,
i=1,\ldots,p\right)\) supplied through the argument
prmtrAR, the regression coefficients through
prmtrX, and innovation standard deviation through
sigma, the lower and upper censoring limits through
lcl and ucl respectively. The regressors can
be supplied via x, which, if is NULL, will be
generated as independent standard normal variables. The user can also
specify the seed of the random number generator by seed for
ensuring repeatability. As carxSimCenTS() encapsulates the
simulated data into a "cenTS" object, the construction of
which need a time/date-based index. The default treats the data as daily
observations, with the end date specified by end.date. The
user can set the name of the censored time series via
value.name but the names of the regressors are hard-coded
as X1, X2, etc. There is another function
carxSim(), which returns a data.frame
consisting of y, x, lcl,
ucl and ci. We will mainly use the
carxSimCenTS() function for simulation as it encapsulates
the data as a "cenTS" object.
We first simulate a "cenTS" series, using the
carxSimCenTS() function with essentially the default
setting, i.e., simulate interval-censored data from a regression model
with a 2-dimensional covariate comprising independent standard normal
components whose regression coefficients are \(0.2\) and \(0.4\), and AR(2) regression noise terms
with the AR coefficients being \(\psi_1=-0.28,
\psi_2=0.25\); the data are then censored unless they fall inside
the interval \((-1,1)\).
> datSim <- carxSimCenTS(seed = 0,end.date = as.Date('2015-08-01'))
A glimpse of the last few data cases of the series is instructive.
> tail(datSim)
y lcl ucl ci X1 X2
2015-07-26 1.000 -1 1 1 -0.5466 0.288
2015-07-27 -0.321 -1 1 0 -1.6887 -1.505
2015-07-28 -0.259 -1 1 0 -1.5724 1.519
2015-07-29 0.386 -1 1 0 -0.4050 0.367
2015-07-30 0.282 -1 1 0 0.3193 1.700
2015-07-31 0.181 -1 1 0 0.0404 0.644
Censoring rate: 0.205
The simulated series can be readily visualized using the plot function (see Figure 1).
> plot(datSim)
"cenTS" series. The observed responses are connected as a
solid black line, and the lower and upper censoring limits drawn as red
dotted and dashed line with censored observations marked by triangles
pointing up and down, respectively.Then the parameters can be estimated by the carx()
method, with the fitted model saved in the object named
modelSim.
> modelSim <- carx(y ~ X1 + X2 - 1,data = datSim, p = 2, CI.compute = TRUE)
Note that -1 in the formula specifies no intercept in
the regression. Information about the fitted model can be obtained
directly by typing the variable name modelSim.
> modelSim
Call:
carx.formula(formula = y ~ X1 + X2 - 1, data = datSim, p = 2,
CI.compute = T)
Coefficients:
X1 X2 AR1 AR2
0.203 0.460 -0.234 0.279
Residual (innovation) standard deviation:
[1] 0.548
Censoring rate:
[1] 0.205
Sample size:
[1] 200
Number of parameters:
[1] 5
Quasi-log-likelihood:
[1] 20.1
AIC:
[1] -30.1
Confidence interval:
2.50% 97.50%
X1 0.131 0.2766
X2 0.383 0.5446
AR1 -0.390 -0.0919
AR2 0.123 0.4066
sigma 0.483 0.6122
Variance-covariance matrix:
X1 X2 AR1 AR2 sigma
X1 1.41e-03 1.98e-05 5.86e-05 -1.49e-04 9.95e-05
X2 1.98e-05 1.73e-03 1.28e-04 9.08e-05 2.45e-04
AR1 5.86e-05 1.28e-04 5.76e-03 2.08e-03 1.32e-04
AR2 -1.49e-04 9.08e-05 2.08e-03 5.15e-03 5.88e-05
sigma 9.95e-05 2.45e-04 1.32e-04 5.88e-05 1.05e-03
N.B.: Confidence intervals and variance-covariance matrix
are based on 1000 bootstrap samples.
A summary of the fitted model can be obtained by running the
summary() function.
> summary(modelSim)
Call:
carx.formula(formula = y ~ X1 + X2 - 1, data = datSim, p = 2,
CI.compute = T)
Coefficients:
Estimate StdErr lowerCI upperCI p.value
X1 0.2025 0.0375 0.1314 0.28 <2e-16 ***
X2 0.4602 0.0416 0.3826 0.54 <2e-16 ***
AR1 -0.2336 0.0759 -0.3903 -0.09 0.002 **
AR2 0.2792 0.0717 0.1232 0.41 <2e-16 ***
sigma 0.2025 0.0324 0.4834 0.61 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
AIC:
[1] -30.1
Although it can be shown that the quasi-likelihood estimator is
asymptotically normal under some regularity conditions (Wang and Chan 2017b), the asymptotic
variance-covariance matrix is intractable so it is computed via
parametric bootstrap. The summary function prints out the coefficient
estimates and innovation standard deviation estimate, together with
their estimated (bootstrap) standard errors, and lower and upper 95%
confidence limits. Note that the bootstrap computation time increases
almost linearly with the bootstrap replicate size; the default is 1000.
More specific information can be easily obtained by invoking various
methods. For instance, logLik returns the
quasi-log-likelihood of the data, coef returns the
coefficients of the model, and the standard deviation of \(\varepsilon_t\) can be obtained by
modelSim$sigma.
> logLik(modelSim)
[1] 20.1
attr(,"class")
[1] "logLik.carx"
> coef(modelSim)
X1 X2 AR1 AR2
0.203 0.460 -0.234 0.279
> modelSim$sigma
[1] 0.548
The plot() function provides a visual check of how the
fitted values track the data, with the censoring limits superimposed on
the diagram, see Figure 2.
> plot(modelSim)
Model diagnostics are facilitated by the tsdiag()
function which is similar to the tsdiag() function in TSA
package (Chan and Ripley 2012). The
tsdiag() function generates a plot of 4 sub-figures,
namely, the time plot of the simulated residuals which is useful for
visually checking the presence of residual temporal patterns and/or
outliers, the simulated residuals versus fitted values plot which is
useful for checking the adequacy of the linear regression model
assumptiom, the residual autocorrelation function (ACF) plot that
quantifies the residual correlations, and the plot of the p-values of
the Ljung-Box tests for the presence of residual autocorrelation. The
following command generates the diagnostics plot for the model fitted to
the simulated data.
> tsdiag(modelSim)
The uppermost diagram in Figure 3 shows no apparent residual temporal patterns, which is also confirmed by the fact that none of the examined residual autocorrelations in the second sub-figure from the bottom are significant and that the bottom sub-figure shows that all p-values of the Ljung-Box test statistics based on the first \(k\) lags of residual autocorrelations are larger than 5% for all allowable \(k\le 23\). Moreover, the second sub-figure from the top shows that the linear regression assumption is justifiable and so is the constant innovation variance assumption. Hence, we can conclude that the model is correctly specified, as it should be, and it provides a good fit to the data.
In this example we utilize the package to analyze the change of total
phosphorus (P) concentrations in river water. Phosphorus is a major
nutrient in river water, of which an excessive amount can result in
environmental problem such as eutrophication. Phosphorus concentration
in many rivers in Iowa has been monitored under the ambient water
quality program conducted by the Iowa Department of Natural Resources
(Libra, Wolter, and Langel 2004). An
analysis of the change of P concentration has been reported by (Wang, Chan, and Schilling 2016). Here we
illustrate the analysis for a particular data set from an ambient site
located in the West Fork Cedar River at Finchford, with the data
available in a "cenTS" object named pts in the
carx package.
The P concentrations (in mg/l) were left-censored whenever they fell below certain time-varying detection limits, resulting in a censoring rate of \(12.6\%.\) The data were collected monthly from October 1998 to October 2013, but data collection was suspended between September 2008 to March 2009, owing to lack of funding. In the data set, there are serveral variables.
> names(pts)
[1] "logP" "lcl" "ci" "tInMonth" "logQ" "season"
The variable logP consists of the logarithmic P,
lcl the corresponding censoring limits, ci the
indicator variable of censoring, tInMonth is the time index
(in month), logQ is the corresponding logarithmic water
discharge data (Q) measured in cf/s, obtained from the U.S. Geological
Survey, and season indicates to which season the month
index belongs (see below for further details). P is generally correlated
with the water discharge (Schilling et al.
2010). We will explore the relationship between P and Q. See
Figure 4 for the time plots of P, Q,
and the historical censoring limits.
It is also conjectured that the association between the logP and logQ
may be seasonal. The variable season in pts is
constructed to denote the quarter of the month with Quarter 1 consists
of the first three months, namely, January, February, and March; Quarter
2 comprises the next three months, and so on. Figure 5 illustrates the seasonal
relationship between logP and logQ. It is of
interest whether there exists a linear trend in the logarithmic P.
Preliminary analysis (not reported here) suggests the presence of
significant autocorrelation in the regression errors. Thus, the general
model takes the following form \[\begin{aligned}
\log\left(P_t\right) = \beta_1 t+ f\left(\log\left(Q_t\right)\right) +
\eta_t,
\end{aligned}\] where \(f\) is
some linear function that may be seasonal in the intercept and/or
seasonal in the coefficient of logQ, and \(\eta_t\) follows an AR process.
logP
versus logQ. The data are labeled by different numbers (1
for Quarter 1 and so on) and colors (black, red, green, blue for Quarter
1, 2, 3 and 4, respectively). Least squares quaterly regression lines
(solid, dashed, dotted, and dash dotted) are superimposed with the same
color as the data points.Note that we need to determine whether the intercept and/or the
regression coefficient are seasonal, and whether to include in the model
a time trend, resulting in 8 combinations. Moreover, the AR order for
the regression errors has to be specified. Assuming the maximal AR order
to be \(m\), we have to select among
\(8\times m\) models, which can be done
by selecting the best model that achieves the smallest AIC. Model
selection by AIC is automated by the function carxSelect().
Here, the maximal AR order is 3.
> arOrder <- 3
The list of models, named M1 to M8, is
specified in the following code.
> s1 <- logP ~ logQ
> s2 <- logP ~ tInMonth + logQ
> s3 <- logP ~ logQ:as.factor(season)
> s4 <- logP ~ tInMonth + logQ:as.factor(season)
> s5 <- logP ~ as.factor(season) + logQ - 1
> s6 <- logP ~ tInMonth + as.factor(season) + logQ - 1
> s7 <- logP ~ as.factor(season) + logQ:as.factor(season) - 1
> s8 <- logP ~ tInMonth + as.factor(season) + logQ:as.factor(season) - 1
> fmls <- c(s1,s2,s3,s4,s5,s6,s7,s8)
> names(fmls) <- paste0('M',seq(1,8))
The model selection is performed by invoking the function
carxSelect() which has two required arguments: a list of
formulas and the maximal AR orders, plus an optional argument
detect.outlier, which by default is TRUE,
indicates whether outliers should be detected in the model. The function
carxSelect() returns a "carx" object
comprising an additional element selectionInfo which is a
list containing more information about the selection result, including
aicMat which is a matrix whose \((i,j)\)th element is the AIC of the model
represented by the \(i^{th}\) formula
in the list and AR order \(j\).
For the purpose of illustrating the prediction by the
predict() method, we use all data up to the end of 2012 for
model selection and fitting, and then check the model prediction against
the observed data in 2013.
> cs <- carxSelect(fmls, arOrder, data = pts['1998/2012'],
+ detect.outlier = TRUE, CI.compute = TRUE)
> print(round(cs$selectionInfo$aicMat,1))
AR1 AR2 AR3
M1 -41.8 -39.6 -38.2
M2 -39.7 -38.1 -36.8
M3 -51.2 -47.5 -45.1
M4 -49.4 -45.9 -43.6
M5 -53.2 -49.6 -47.1
M6 -51.5 -48.2 -45.8
M7 -53.9 -50.8 -47.6
M8 -52.8 -49.8 -46.8
A summary of the model fit is shown below.
> summary(cs)
Call:
carx.formula(formula = formula(m0), data = m0$data, p = m0$p,
CI.compute = ..1)
Coefficients:
Estimate StdErr lowerCI upperCI p.value
as.factor(season)1 -6.053239 0.633190 -7.325537 -4.9191 <2e-16 ***
as.factor(season)2 -3.455734 0.612764 -4.696565 -2.2607 <2e-16 ***
as.factor(season)3 -4.235837 0.414149 -5.028568 -3.4297 <2e-16 ***
as.factor(season)4 -4.854407 0.476015 -5.764154 -3.9324 <2e-16 ***
as.factor(season)1:logQ 0.633582 0.111002 0.436431 0.8566 <2e-16 ***
as.factor(season)2:logQ 0.248797 0.086893 0.073705 0.4247 0.006 **
as.factor(season)3:logQ 0.373538 0.068555 0.235532 0.5053 <2e-16 ***
as.factor(season)4:logQ 0.389721 0.087467 0.217672 0.5584 <2e-16 ***
AR1 0.075072 0.090671 -0.137839 0.2227 0.596
sigma 0.482897 0.028770 0.412907 0.5265 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
AIC:
[1] -53.85639
The fitted model can be visualized by calling the plot()
function, which is shown in Figure 6. The
fitted values appear to track the data well.
> plot(cs)
We examine the goodness-of-fit of the fitted model via the
tsdiag() function which generates 4 diagnostic plots in
Figure 7. These plots indicate that
the fitted model provides good fit to the data.
> tsdiag(cs)
The selected model can be interpreted as follows. The linear trend was not selected, suggesting no significant long-term change in the P concentrations. The intercept and the regression coefficient of logQ were seasonal. The regression errors appeared to be mildly auto-correlated and can be modeled as an order 1 AR process, although the AR coefficient was not significant.
Finally, we compute the prediction of logP via the
predict() method and compare the prediction with the actual
data from January to October 2013. Note the prediction makes use of
observed discharge data in 2013. Figure 8 shows the point predictors (blue dashed
line) against the actual values (black solid line) and the 95%
prediction bands (red lines), which indicates that the prediction tracks
the actual data well.
In this section, we report a simulation study on the effectiveness of
model selection by minimizing the AIC. Recall this functionality is
implemented by the carxSelect() function which outputs the
model with the smallest AIC from a set of models of various AR orders up
to some pre-specified maximum order.
We restrict the simulation study to the problem of selecting the AR
order with the same model specification, i.e, the same set of
regressors, which is conducted as follows. We simulated 1000 series by
calling carxSim() with the default setting, hence the true
AR order is equal to 2, and for each simulated series we selected the
best model among the models with the AR order from 1 to 6. Since the
uncensored data were available in the simulation, we repeated the model
selection with the uncensored observed data, for comparing with the
results using the censored data. This simulation study can be reproduced
by the following code.
> singleTestSelectAROrder <- function(iter)
+ {
+ seed <- 1375911
+ cts <- carxSim(seed = iter*seed)
+ m0 <- carxSelect(list(f1 = as.formula(y~X1+X2-1)), max.ar = 6,
+ data=cts[,c("y","X1","X2")],detect.outlier = FALSE)
+ m1 <- carxSelect(list(f1 = as.formula(y~X1+X2-1)),max.ar = 6,
+ data = cts, detect.outlier = FALSE)
+ c(m0$fitted$p, m1$fitted$p)
+ }
> nRep <- 1000
> orders <- parallel::mclapply( 1:nRep, singleTestSelectAROrder,
+ mc.cores = parallel::detectCores() - 1)
> orders <- do.call(rbind, lapply(orders, matrix, ncol = 2, byrow = TRUE))
> freqComDat = count(orders[1,])
> freqCenDat = count(orders[2,])
The selected orders are reported in Table 1, which shows that the true order can be recovered with an empirical probability of \(52.7\%\), and the results using censored and complete data are comparable.
| AR order | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| Frequency (complete data) | 17 | 626 | 139 | 97 | 69 | 52 |
| Frequency (censored data) | 37 | 527 | 160 | 99 | 101 | 76 |
In this section we report a simulation study about the empirical
performance of the model prediction procedure. A series of 210 data was
simulated using the default parameters of carxSim(), with
the first 200 data used to estimate the model, and the last 10
observations used to compare with the predicted values based on the
fitted model and the simulated future covariate values. The above
procedure was repeated 500 times. The empirical coverage rates of the
95% \(\ell\)-step ahead prediction
intervals, \(\ell=1,2,\ldots,10\), are
summarized in Table 2, which indicates a
close match between the empirical and nominal coverage rates. The
simulation exercise can be reproduced by the following code.
> nRep = 500; nObs = 200; n.ahead=10
> runSimPredCR <- function()
+ {
+ set.seed(0)
+ crMat = matrix(nrow = n.ahead, ncol = nRep)
+ for(iRep in 1:nRep)
+ {
+ sdata = carxSim(nObs = nObs + n.ahead)
+ trainingData = sdata[1:nObs,]
+ testData = sdata[-(1:nObs),]
+ mdl = carx(y ~ X1 + X2 - 1, data = trainingData, p = 2)
+ newxreg = testData[,c('X1','X2')]
+ predVal = predict(mdl, newxreg = newxreg, n.ahead = n.ahead)
+ crInd = (predVal$ci[,1] <= testData$y) & (predVal$ci[,2] >= testData$y)
+ crMat[,iRep] = crInd
+ }
+ crPred = apply(crMat,1,mean)
+ }
> runSimPredCR()
| n.ahead | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Empirical Coverage | 0.954 | 0.932 | 0.946 | 0.952 | 0.946 | 0.946 | 0.938 | 0.946 | 0.948 | 0.952 |
In summary, we have reviewed the quasi-likelihood method to estimate a censored time series regression model and introduced the carx package in which quasi-likelihood estimation is implemented, together with other useful functions for model selection, prediction, diagnostics and outlier detections. We illustrated the carx package with two major examples, and shed light on the effectiveness of model selection via minimizing AIC and the prediction accuracy.
Future work includes extending the method for more complex regression noise structure than the AR model, for instance, the more general ARIMA model, and updating the package according to the feedback from the public.