Abstract
Bivariate time-to-event data frequently arise in research areas such as clinical trials and epidemiological studies, where the occurrence of two events are correlated. In many cases, the exact event times are unknown due to censoring. The copula model is a popular approach for modeling correlated bivariate censored data, in which the two marginal distributions and the between-margin dependence are modeled separately. This article presents the R package CopulaCenR, which is designed for modeling and testing bivariate data under right or (general) interval censoring in a regression setting. It provides a variety of Archimedean copula functions including a flexible two-parameter copula and different types of regression models (parametric and semiparametric) for marginal distributions. In particular, it implements a semiparametric transformation model for the margins with proportional hazards and proportional odds models being its special cases. The numerical optimization is based on a novel two-step algorithm. For the regression parameters, three likelihood-based tests (Wald, generalized score and likelihood ratio tests) are also provided. We use two real data examples to illustrate the key functions in CopulaCenR.Bivariate data arise frequently in many research areas such as health, epidemiology, and economics. For example, bivariate time-to-event endpoints are often used in clinical trials studying bilateral diseases (e.g., eye diseases) or complex diseases (e.g., cancer and psychiatric disorders). The two events are correlated as they come from the same subject. In many situations, the two event times cannot be precisely observed, leading to bivariate censored data. Specifically, bivariate right-censored data occur when the study ends prior to the occurrence of one or both events. An example of such data comes from a clinical study assessing the treatment effect on preventing blindness in Diabetic Retinopathy patients where each patient had one eye randomized to the treatment and the other eye received no treatment (Huster, Brookmeyer, and Self 1989), and the time-to-blindness are bivariate and right-censored. We will illustrate the analysis of this study in Section 4. In another situation, bivariate interval-censored data occur when the status of both events are periodically examined at intermittent assessment times. In this case, the right censoring could also happen if the event still does not occur at the last assessment time. A special case of interval-censored data is the current status data if there is only one assessment time and the event is only known to occur or not by its assessment time. An example of bivariate interval-censored data will be demonstrated in Section 4, which came from a clinical trial studying the progression of a bilateral eye disease, Age-related Macular Degeneration (AMD), where the progression time to late-AMD are interval or right censored (AREDS Group 1999). More examples can be found in books Hougaard (2000) and J. Sun (2007).
The development of our package is motivated by researches that are interested in (1) discovering covariates that are significantly associated with the bivariate censored outcomes, and (2) characterizing the joint and conditional risks of two events. For the bivariate events, the joint and conditional risks could be clinically more important than the marginal risk (of a single event). For example, the joint 5-year progression-free probability for both eyes helps identify patients with a high risk of progressing to late-AMD. For another example, for patients having one eye already progressed, the conditional 5-year progression-free probability for the non-progressed eye (given its fellow eye already progressed) provides important information for both clinicians and the patient since patients with both eyes progressed to the late stage of the disease may lose the ability to live independently.
There are three major approaches to fit regression models for bivariate censored data. The simplest way is to fit a marginal model and estimate the variance-covariance by a robust sandwich estimator (for example, (Wei, Lin, and Weissfeld 1989)). This approach takes a working independence assumption, and thus cannot generate joint or conditional distributions. The second approach is based on frailty models (for example, (Oakes 1982)), which are essentially mixed effects models and account for the dependence between two events by a latent frailty variable. However, the covariate effects in frailty models are usually interpreted on a conditional level (by conditioning on the frailty term), which is not straightforward. The third approach is to use copula models (for example, (Clayton 1978)). Unlike the marginal or frailty approaches, the copula approach models the joint survival distribution by directly connecting the two marginal distributions through a copula function. One unique advantage of the copula is that it separately models the marginal distributions and the dependence parameter(s), allowing flexibility in marginal models and straightforward interpretation for covariate effects. Moreover, the challenge from censoring can be naturally handled through the marginal distributions within the copula function. Besides, the joint and conditional distributions can be obtained based on the copula model.
Along with these three major approaches, multiple endeavors have been
devoted to the development of software, mostly R (R Core Team
2019) packages, to build regression models for bivariate censored
data. For bivariate right-censored data, the survival
(Therneau 2018a) package can fit
parametric or semiparametric Cox (D. Cox
1972) marginal and frailty models. Also, packages such as parfm (Munda, Rotolo, and Legrand 2012) and frailtypack
(Rondeau, Marzroui, and Gonzalez 2012)
implement proportional hazards (PH) frailty models under the parametric
and semiparametric settings. Other R
packages such as coxme (Therneau 2018b) and phmm (Donohue and Xu 2019) also fit PH frailty models
for right-censored data. For bivariate interval-censored data, the survival
and frailtypack
packages provide marginal and frailty models under the parametric or
semiparametric (Cox PH) situation, respectively. The C++ program IntCens (codes
located under https://dlin.web.unc.edu/software/intcens/) implements a
class of semiparametric frailty models, including both PH and
proportional odds (PO) models.
To the best of our knowledge, there exists no R package for fitting copula-based regression models for both bivariate right-censored and interval-censored data. The existing copula packages for bivariate data handle either the non-censoring (i.e., complete data) or the right-censoring situation. In the non-censoring situation, the package copula (Hofert et al. 2018) by Yan (2007) and Kojadinovic and Yan (2010) implements multivariate copula models without covariates for complete data and obtains the maximum likelihood estimator for the copula dependence parameter. It gives useful codes for implementing regression models in bivariate complete data in the appendix of Yan (2007). It also provides copula goodness-of-fit tests for model selection purpose. The package VineCopula (Schepsmeier et al. 2018) can also model bivariate or multivariate complete data without covariates through the vine copula models (Aas et al. 2009). Packages such as CopulaRegression (Nicole Kraemer 2014) and gcmr (Masarotto and Varin 2017) can provide copula-based regression models with parametric margins for bivariate or multivariate complete data and provide maximum likelihood estimators for model parameters. The package gamCopula (Nagler and Vatter 2020) implements a generalized additive model that can take into account the effect of the predictors on the dependence structure of bivariate and vine copula models (Vatter and Chavez-Demoulin 2015). For the right-censoring situation, the Copula.surv package (Emura 2018) can estimate the Clayton copula dependence parameter in bivariate right-censored data without covariates and also perform a goodness-of-fit test for a fitted Clayton model (Emura, Lin, and Wang 2010). The Sunclarco package (Prenen et al. 2017) provides Clayton or Gumbel copula-based regression models with parametric (Weibull and piecewise constant) or Cox semiparametric margins for multivariate right-censored data (Prenen, Braekers, and Duchateau 2017). The package GJRM (Marra and Radice 2020) can fit both marginal and copula regression models in complete and right-censored data (Marra and Radice 2017, 2019; Marra et al. 2017). By far, there is no copula-based R package for bivariate interval-censored data.
We develop the CopulaCenR package, which fits copula-based regression models for both bivariate right-censored and interval-censored data. The package is available from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=CopulaCenR. The main advantage of CopulaCenR relies on the diverse choice of copula and marginal models for both bivariate right-censored and interval-censored data. Specifically, it provides a class of Archimedean copulas that correspond to a variety of dependence structures, as illustrated in Table 1. In particular, in addition to these frequently used one-parameter Archimedean copulas, a two-parameter copula function (Copula2) is also included. This Copula2 has more flexibility in modeling dependence structure, as we show in Section 2. Furthermore, CopulaCenR implements a list of parametric and semiparametric marginal regression models, as illustrated in Table 2. For parameter estimation, the package utilizes a novel two-step procedure that is computationally stable and efficient. For the inference of regression parameters, three likelihood-based tests such as Wald, generalized score and likelihood ratio tests are provided.
We will describe the major features of CopulaCenR in Section 2 and presents the model and estimation procedure in Section 3. We will demonstrate two real data examples in Section 4 using the version \(1.1.2\) of CopulaCenR. Finally, we will conclude and discuss in Section 5.
The most popular copula family for bivariate censored data is the Archimedean copula family, which has an explicit form of \[C_\eta(u,v)=\phi_{\eta}\{\phi_{\eta}^{-1}(u)+\phi_{\eta}^{-1}(v)\},\] where \(u\) and \(v\) are two uniformly distributed margins; \(\phi_{\eta}\) is the generator function, which is a continuous, strictly decreasing and convex function; \(\phi_{\eta}^{-1}\) is the inverse of \(\phi_{\eta}\). One generator function uniquely determines an Archimedean copula. The copula parameter \(\eta\) has a one-to-one correspondence with the popular dependence measure Kendall’s \(\tau\). Another property of the copula is the tail dependence (i.e., \(\tau_L\) and \(\tau_U\) for lower and upper tail dependence), which measure the dependence between two margins in the lower and upper tails. More details about Archimedean copulas can be found in Nelsen (2006).
Table 1 lists six Archimedean family copula models that are implemented in CopulaCenR. Two most frequently used Archimedean copulas are Clayton (Clayton (1978), (1978)) and Gumbel (Gumbel (1960), (1960)) models, which account for the lower or upper tail dependence between two margins using a single parameter \(\eta\). Other Archimedean copulas, such as Frank (Frank 1979), Joe (H. Joe 1993) and Ali-Mikhail-Haq (AMH) (Ali, Mikhail, and Haq 1978), are also one-parameter copulas. In addition to these five copulas, we also include a flexible two-parameter Archimedean copula model (Harry Joe and Hu 1996; H. Joe 1997), namely, Copula2 (also called the “BB1" family), which is formulated as \[\label{two-para-c} C_{\alpha,\kappa}(u,v)=[1+\{(u^{-1/\kappa}-1)^{1/\alpha} + (v^{-1/\kappa}-1)^{1/\alpha}\}^{\alpha}]^{-\kappa}, \ \alpha \in (0,1], \ \kappa \in (0,\infty). (\#eq:two-para-c)\] The two dependence parameters (\(\alpha\) and \(\kappa\)) are explicitly connected to Kendall’s \(\tau\) with \(\tau = 1- {2\alpha\kappa}/(2\kappa + 1)\), and they account for the correlation between \(u\) and \(v\) at upper and lower tails. In particular, when \(\alpha = 1\), Copula2 becomes the Clayton copula, and when \(\kappa \rightarrow \infty\), it becomes the Gumbel copula. Thus, the two-parameter copula model provides more flexibility in modeling the between-margin dependence than the one-parameter copulas such as Clayton or Gumbel (Harry Joe 2014). Figure 1 illustrates the scatter plots of bivariate event times generated from the six copula models in Table 1.
| Family | Parameter Space | Generator \(\phi_{\eta}(t), t \in [0,\infty)\) | Generator Inverse \(\phi_{\eta}^{-1}(s), s \in (0,1]\) | \(\tau_L\) | \(\tau_U\) | Kendall’s \(\tau\) | |
|---|---|---|---|---|---|---|---|
| Clayton | \(\eta > 0\) | \((1+t)^{-1/\eta}\) | \(s^{-\eta} -1\) | \(2^{-1/\eta}\) | 0 | \(\eta/(2+\eta)\) | |
| Gumbel | \(\eta \geq 1\) | \(\exp(-t^{1/\eta})\) | \((-\log s)^{\eta}\) | \(0\) | \(2-2^{1/\eta}\) | \(1 - 1/\eta\) | |
| Frank | \(\eta \geq 0\) | \(-\eta^{-1}\log \{1+e^{-t}(e^{-\eta}-1)\}\) | \(-\log \{(e^{-\eta s}-1)/(e^{-\eta}-1)\}\) | \(0\) | \(0\) | \(1+4\{D_1(\eta)-1\}/\eta\) | |
| AMH | \(\eta \in [0,1)\) | \((1-\eta)/(e^{t}-\eta)\) | \(\log[\{1+\eta(s-1)\}/s]\) | \(0\) | \(0\) | \(1-2\{(1-\eta)^2 \log (1-\eta) + \eta\}/(3\eta^2)\) | |
| Joe | \(\eta \geq 1\) | \(1-(1-e^{-t})^{1/\eta}\) | \(-\log \{1-(1-s)^{\eta}\}\) | \(0\) | \(2-2^{1/\eta}\) | \(1 - 4 \sum_{k=1}^{\infty} 1/\{k(\eta k+2)[\eta(k-1)+2]\}\) | |
| Copula2 | \(\alpha \in (0,1], \kappa > 0\) | \(\{1/(1+t^{\alpha})\}^{\kappa}\) | \((s^{-1/\kappa}-1)^{1/\alpha}\) | \(2^{-\alpha\kappa}\) | \(2-2^{\alpha}\) | \(1-2\alpha\kappa/(2\kappa+1)\) |
\(\tau_L\) and \(\tau_U\) are the lower and upper tail
dependence measures.
\(D_1(\cdot)\) is the Debye function
written as \(D_1(\eta) = \frac{1}{\eta}
\int_{0}^{\eta} \frac{t}{e^t-1}dt\).
To fit a copula-based regression model, one also needs to choose a
regression model for the margins. Table 2
lists the available marginal models in CopulaCenR.
For bivariate right-censored data, users can fit either a parametric
marginal model via the function rc_par_copula or a
semiparametric Cox PH model via the function
rc_spCox_copula (T. Sun et al.
2019). Specifically, the parametric models incorporate both the
PH (e.g., Weibull, Gompertz) and the PO (e.g., Loglogistic) models. For
bivariate interval-censored data, one can choose to fit a parametric
marginal model via the function ic_par_copula. Moreover,
the package can also fit a semiparametric transformation model via the
function ic_spTran_copula. It contains a variety of
marginal models including the PH and PO models, as we explain in Section
3.3. A novel two-step sieve
estimation procedure is implemented (Tao Sun and
Ding 2019).
| Type | Models | Survival Distributions \(S(t)\) | Corresponding R Functions |
|---|---|---|---|
| Parametric | Weibull | \(\exp \{-(t/\lambda)^k e^{Z^{\top}\beta}\}\) | rc_par_copula,
ic_par_copula |
| Gompertz | \(\exp \{-\frac{b}{a}(e^{at}-1) e^{Z^{\top}\beta}\}\) | ||
| Loglogistic | \(\{1+(t/\lambda)^{k} e^{Z^{\top}\beta} \}^{-1}\) | ||
| Semiparametric | Cox | \(\exp\{-\Lambda(t) e^{Z^{\top}\beta}\}\) | rc_spCox_copula |
| Transformation | \(\exp[-G\{\Lambda(t) e^{Z^{\top}\beta}\}]\) | ic_spTran_copula |
For the inference of the covariate effects, three types of
likelihood-based tests are implemented in CopulaCenR:
the Wald test (built within rc_par_copula,
rc_spCox_copula, ic_par_copula, and
ic_spTran_copula), the generalized score test
(score_copula) and the likelihood-ratio test
(lrt_copula).
After a copula model being fitted, fitted values (i.e., linear
predictors, survival probabilities) can be extracted by the general S3
function fitted. For new observations, the linear
predictors and survival probabilities can be obtained using the function
predict. Moreover, the user can plot three types of
distributions (joint, conditional and marginal) using the general
functions plot and lines. In particular, an
interactive 3D contour will be plotted to visualize the joint
distribution.
Besides, the package provides a bivariate event time generating
function data_sim_copula, which can generate random
bivariate event times based on a specified copula function, a marginal
distribution, and covariate values.
In summary, the key functions of CopulaCenR are
rc_par_copula: for fitting parametric regression models
to bivariate right-censored data;rc_spCox_copula: for fitting a semiparametric Cox
regression model to bivariate right-censored data;ic_par_copula: for fitting parametric regression models
to bivariate interval-censored data;ic_spTran_copula: for fitting a semiparametric
transformation model to bivariate interval-censored data;score_copula: for performing the generalized score test
on covariate effects;lrt_copula: for performing the likelihood ratio test
(LRT) on covariate effects between two nested models;tau_copula: for calculating Kendall’s \(\tau\) from copula parameter
estimates;plot, lines: S3 methods for plotting joint, conditional
and marginal distributions based on a fitted copula model;fitted, predict: S3 methods for extracting fitted
values and predicting new observations;summary, print, coef, logLik, AIC, BIC: other S3
functions for a fitted object;data_sim_copula: for generating bivariate event times
through a specified copula model and marginal distributions.We use two real data examples to illustrate the implementation of these functions in Section 4.
Let \((T_{1},T_{2})\) be the true bivariate event times, with marginal survival functions \(S_{j}(t_j)=Pr (T_j > t_j), \ j=1,2\), and joint survival function \(S(t_{1},t_{2})= Pr (T_{1} > t_{1}, T_{2} > t_{2})\). Assume there are \(n\) independent subjects in a study. When \((T_{1},T_{2})\) are subject to right-censoring, for subject \(i = 1\cdot \cdot \cdot n\), we observe \(D_i=\{(Y_{ij}, \Delta_{ij}, Z_{ij}): Y_{ij}=\min(T_{ij},C_{ij}), \Delta_{ij}=I(T_{ij}\le C_{ij}), j=1, 2\}\), where \(C_{ij}\) is the censoring time of \(T_{ij}\), \(\Delta_{ij}\) is the censoring indicator and \(Z_{ij}\) is the covariate vector. When \((T_{1},T_{2})\) are under interval-censoring, we observe \(D_i=\{(L_{ij},R_{ij},Z_{ij}), j = 1,2\}\) for subject \(i\), where \((L_{ij}, R_{ij}]\) is the time interval that \(T_{ij}\) lies in and \(Z_{ij}\) is the covariate vector.
By the Sklar’s theorem (Sklar (1959), (1959)), so long as the marginal survival functions \(S_j\) are continuous, there exists a unique function \(C_\eta\) that connects two marginal survival functions into the joint survival function: \(S(t_1,t_2) = C_\eta\{S_1(t_1),S_2(t_2)\},\ t_1, t_2 \geq 0.\) Here, the function \(C_\eta\) is called a copula and its parameter \(\eta\) measures the dependence between the two margins. A signature feature of the copula is that it allows the dependence to be modeled separately from the marginal distributions.
In this section, we present the joint likelihood functions for bivariate right-censored data and bivariate interval-censored data, respectively.
Define the density function for copula \(C_{\eta}(u,v)\) as \(c_{\eta}(u,v) = \partial^2 C_{\eta}(u,v)/\partial u \partial v\). Let \(f(t_1,t_2)=\partial^2S(t_1,t_2)/\partial t_1\partial t_2 = c_{\eta}\{S_1(t_1), S_2(t_2)\} f_1(t_1) f_2(t_2)\) denote the corresponding density function of \(S(t_1, t_2)\). Denote by \(\theta = (\beta^\top = (\beta_1^\top, \beta_2^\top), \eta, S_{01}, S_{02})^\top\) all the unknown parameters in \(S(t_{1},t_{2})\), where \(\beta_j\) is the regression coefficient vector and \(S_{0j}\) is the baseline survival function for the \(j\)th margin. Then, the joint likelihood for the observed data \(D = \{D_i\}_{i=1}^n\) can be written as \[\begin{aligned} \label{rc_joint_likelihood} \begin{split} L_{n}(\theta | D) = & \prod_{i=1}^n \, f(y_{i1},y_{i2}| Z_{i1}, Z_{i2})^{\delta_{i1}\delta_{i2}} \times\left[-\frac{\partial S(y_{i1},y_{i2} | Z_{i1}, Z_{i2})}{\partial y_{i1}}\right]^{\delta_{i1}(1-\delta_{i2})} \\ & \times \left[-\frac{\partial S(y_{i1},y_{i2} | Z_{i1}, Z_{i2})}{\partial y_{i2}}\right]^{(1-\delta_{i1})\delta_{i2}} \times S(y_{i1},y_{i2} | Z_{i1}, Z_{i2})^{(1-\delta_{i1})(1-\delta_{i2})} \\ = & \prod_{i=1}^n\left[ c_\eta\{S_1(y_{i1} | Z_{i1}),S_2(y_{i2} | Z_{i2})\} f_1(y_{i1} | Z_{i1}) f_2(y_{i2} | Z_{i2})\right]^{\delta_{i1}\delta_{i2}} \\ & \times\left[-\frac{\partial \, C_\eta\{S_1(y_{i1} | Z_{i1}),S_2(y_{i2} | Z_{i2})\}}{\partial y_{i1}}\right]^{\delta_{i1}(1-\delta_{i2})} \\ & \times\left[-\frac{\partial \, C_\eta\{S_1(y_{i1} | Z_{i1}),S_2(y_{i2} | Z_{i2})\}}{\partial y_{i2}}\right]^{(1-\delta_{i1})\delta_{i2}} \\ & \times C_\eta\{S_1(y_{i1} | Z_{i1}),S_2(y_{i2} | Z_{i2})\}^{(1-\delta_{i1})(1-\delta_{i2})}, \end{split} \end{aligned} (\#eq:rc-joint-likelihood)\] where \((\delta_{i1},\delta_{i2}) \in \{(0,0),(0,1),(1,0),(1,1)\}\).
Similarly, using the notation introduced in Section 3.1, the joint likelihood function for bivariate interval-censored data from \(n\) subjects can be written as \[\begin{aligned} \label{ic_joint_likelihood} L_{n}(\theta|D) &=& \prod_{i=1}^n Pr (L_{i1} < T_{i1} \leq R_{i1}, L_{i2} < T_{i2} \leq R_{i2} | Z_{i1}, Z_{i2}) \nonumber \\ & = & \prod_{i=1}^n \biggl[Pr (T_{i1} > L_{i1}, T_{i2} > L_{i2} | Z_{i1}, Z_{i2}) - Pr (T_{i1} > L_{i1}, T_{i2} > R_{i2} | Z_{i1}, Z_{i2}) \nonumber \\ && \quad - Pr (T_{i1} > R_{i1}, T_{i2} > L_{i2} | Z_{i1}, Z_{i2}) + Pr (T_{i1} > R_{i1}, T_{i2} > R_{i2} | Z_{i1}, Z_{i2}) \biggl] \nonumber \\ & = & \prod_{i=1}^n\biggl[C_{\eta}\{S_1(L_{i1}| Z_{i1}),S_2(L_{i2}| Z_{i2})\} - C_{\eta}\{S_1(L_{i1}| Z_{i1}),S_2(R_{i2}| Z_{i2})\} \nonumber \\ && \quad - C_{\eta}\{S_1(R_{i1}| Z_{i1}),S_2(L_{i2}| Z_{i2})\} + C_{\eta}\{S_1(R_{i1}| Z_{i1}),S_2(R_{i2}| Z_{i2})\} \biggr]. \end{aligned} (\#eq:ic-joint-likelihood)\] The right interval \(R_{ij}\) can take values in \((0, \infty]\). For a given subject \(i\), if \(R_{ij} = \infty\) (i.e., \(T_{ij}\) is right-censored), then any term involving \(R_{ij}\) becomes 0 and the joint survival function for subject \(i\) reduces to only one (if both \(R_{i1}\) and \(R_{i2}\) are \(\infty\)) or two (if one \(R_{ij}\) is \(\infty\)) terms. The special case of bivariate current status data (i.e., only one assessment time for each subject) can also fit into this framework, where for each \(T_{ij}\), either \(L_{ij} = 0\) (\(T_{ij}\) is smaller than the assessment time, which is \(R_{ij}\) in this case) or \(R_{ij} = \infty\) (\(T_{ij}\) is greater than the assessment time, which is \(L_{ij}\) in this case). Therefore, the likelihood function (@ref(eq:ic-joint-likelihood)) can handle the bivariate data under general interval-censoring.
We implement several popular parametric marginal models in CopulaCenR, as shown in Table 2. For example, the marginal Weibull survival distribution can be written as \[S_{j}(t_{j}|Z_{j}) = \exp \{-(t_{j}/\lambda_j)^{k_j} e^{Z_{j}^{\top}\beta_j}\}, \ j=1,2, \nonumber\] where \(\lambda_j\) and \(k_j\) are the scale and shape parameters of the baseline Weibull distribution, and \(\beta_j\) are the covariate effects. The model follows the PH assumption. In this case, the parameter set \(\theta\) becomes \((\beta^\top, \eta, \lambda_1, k_1, \lambda_2, k_2)^\top\). Other parametric distributions including Gompertz and Loglogistic are also implemented in the package.
More generally, we implement the semiparametric Cox PH marginal model for bivariate right-censored data. The model does not specify the marginal distribution for the baseline hazards function. Instead, the baseline hazards are treated as piecewise constants between all uncensored event times as suggested by Breslow (1972). The model is expressed as \[ S_{j}(t_{j}|Z_{j}) = \exp\{-\Lambda_{j}(t_j) e^{Z_{j}^{\top}\beta_j}\}, \ j=1,2, \] in which the Breslow baseline cumulative hazard function \(\Lambda_{j}(t)\) is given by \[\Lambda_{j}(t) = \sum_{i = 1}^{n} \frac{I(Y_{ij} \leq t) \delta_{ij}}{\sum_{k \in R_{ij}}\exp{Z_{k}^{\top}\beta_j}}, \] where \(R_{ij} = \{k: Y_k \geq Y_{ij}\}\) denotes the at-risk set at time \(Y_{ij}\).
We also consider a class of semiparametric linear transformation models for the marginal distribution of the interval-censored data. The model is expressed as: \[\label{tran_mod} S_{j}(t_j|Z_j) = \exp[-G_j\{\Lambda_{j}(t_j) e^{Z_j^{\top}\beta_j} \}], \ j = 1,2. (\#eq:tran-mod)\] \(\Lambda_{j}(\cdot)\) is an unknown and non-decreasing function of \(t\), which is not necessarily the baseline cumulative hazards function. In CopulaCenR, we approximate \(\Lambda_j\) in a sieve space constructed by Bernstein polynomials. A Bernstein basis polynomial with degree \(m\) is expressed as: \[\label{Bern} B_{k}(t,m,l,u) = {m \choose k} (\frac{t-l}{u-l})^{k} (1-\frac{t-l}{u-l})^{m-k}, \ k = 0,...,m, (\#eq:Bern)\] where \(l\) and \(u\) are the lower and upper bounds of all observed times. One big advantage of Bernstein polynomials is that they do not require the specification of interior knots, as seen from (@ref(eq:Bern)), making them easy to work with. More details can be found in Tao Sun and Ding (2019).
In model (@ref(eq:tran-mod)), \(G_j(\cdot)\) is a pre-specified strictly increasing function, such as the Box-Cox and the logarithmic transformation functions. The package uses a \(G(\cdot)\) function as specified in Zhou, Hu, and Sun (2017): \[\begin{aligned} G_j(x)=\left\{ \begin{array}{ll} \frac{(1+x)^r - 1}{r}, \ 0 < r \leq 2, \\ \\ \frac{\log\{1 + (r-2)x\}}{r - 2}, \ r > 2. \label{tranform_G} \end{array} \right. \end{aligned} (\#eq:tranform-G)\]
Note that the model (@ref(eq:tran-mod)) contains a class of survival models. For example, when \(G(x) = x\) at \(r=1\), the marginal function \(S_j(t|Z)\) becomes \(\exp\{-\Lambda_j(t) e^{Z^{\top}\beta_j}\}\), which is essentially a PH model. Likewise, when \(G_j(x) = \log(1+x)\) at \(r=3\), \(S_j(t|Z)\) becomes \(\{1+\Lambda_j(t) e^{Z^{\top}\beta_j}\}^{-1}\), which is a PO model. In practice, the value of \(r\) can be either selected according to model AIC or treated unknown and estimated together with other model parameters.
In this section, we illustrate the estimation procedure for the unknown parameter \(\theta\). For simplicity, we use the general notation \(\theta = (\beta_1^\top, \beta_2^\top, \eta, S_{01}, S_{02})^\top\) throughout this section. In principle, we can maximize the joint log-likelihood function based on formula (@ref(eq:rc-joint-likelihood)) or (@ref(eq:ic-joint-likelihood)) directly, written as \(l_{n}(\theta|D)=\log{L_n(\theta|D)}=\sum_{i=1}^{n}\log{L(\theta|D_i)}\). Due to the complex structure of the log-likelihood function, we implement a novel two-step estimation procedure, which is proven to be computationally more stable and efficient than the one-step procedure, as shown in T. Sun et al. (2019) and Tao Sun and Ding (2019). Essentially, the two-step procedure implements an extra step to obtain appropriate initial values for all the unknown parameters. The estimation procedure is described below:
Obtain initial estimates of \(\theta_n\):
\((\hat{\beta}_{jn}^{(1)}, \hat{S}_{0j}^{(1)}) = \mathop{\mathrm{\arg\max}}\limits_{(\beta_j, S_{0j})} l_{jn}(\beta_j, S_{0j})\), where \(l_{jn}\) denotes the log-likelihood for the marginal model, \(j=1, 2\);
\(\hat{\eta}_{n}^{(1)}=\mathop{\mathrm{\arg\max}}\limits_{(\eta)} l_n \{ \hat{\beta}_{n}^{(1)}=(\hat{\beta}_{1n}^{(1)},\hat{\beta}_{2n}^{(1)}), \eta, \hat{S}_{01}^{(1)},\hat{S}_{02}^{(1)} \}\), where \(\hat{\beta}_{jn}^{(1)}\) and \(\hat{S}_{0j}^{(1)}\) are the initial estimates from (a), and \(l_n\) is the joint log-likelihood.
Simultaneously maximize the joint log-likelihood to get final
estimates:
\(\hat{\theta}_n =
(\hat{\beta}_n,\hat{\eta}_n,\hat{S}_{01},\hat{S}_{02})=\mathop{\mathrm{\arg\max}}\limits_{(\beta,\eta,S_{01},
S_{02})} l_n(\beta,\eta, S_{01}, S_{02})\) with initial values
\((\hat{\beta}_{n}^{(1)},\hat{\eta}_{n}^{(1)},\hat{S}_{01}^{(1)},\hat{S}_{02}^{(1)})\)
obtained from step 1(a) and 1(b).
\[Remark 1.\] In the case of semiparametric Cox PH margins (with the Breslow baseline cumulative hazard estimator), although the maximum likelihood estimators from step 2 are consistent and asymptotically normal, the Hessian matrix cannot be directly used for estimating the variance-covariance matrix of (\(\hat{\beta}\), \(\hat{\eta}\)) (T. Sun et al. 2019). Therefore, the bootstrap procedure is implemented in the package for producing a valid variance-covariance estimator.
\[Remark 2.\] In the case of semiparametric transformation model margins (with the use of Bernstein polynomials), the two-step estimation procedure becomes a two-step “sieve” estimation procedure. Tao Sun and Ding (2019) rigorously proved the asymptotic properties of the sieve maximum likelihood estimators.
The main model-fitting functions (rc_par_copula,
rc_spCox_copula, ic_par_copula and
ic_spTran_copula) provide a built-in optimization option,
which is a wrapper to the optimization routines optim and
nlm in R.
We now separate \(\beta\) into two parts: \(\beta_g\) and \(\beta_{ng}\), where \(\beta_g\) is the parameter set of interest for hypothesis testing and \(\beta_{ng}\) denotes the rest of the regression coefficients. In certain cases, \(\beta_g\) can be the entire regression parameter \(\beta\). The package implements three likelihood-based tests including the Wald test, the generalized score test (D. R. Cox and Hinkley 1979) and the likelihood ratio test, which are asymptotically equivalent and follow the chi-squared distribution with \(df = \dim(\beta_g)\). In particular, the generalized score test is usually faster than the other two tests for large-scale testings such as the genome-wide association study (GWAS) (T. Sun et al. 2019; Tao Sun and Ding 2019). Due to the complex structure of the joint log-likelihood, instead of analytically deriving the first and second order derivatives, we use the Richardson’s extrapolation (Lindfield et al. (1989), (1989)) to approximate the score function and observed Fisher information numerically.
The package CopulaCenR
provides a user-friendly function data_sim_copula for
generating random bivariate event times based on a specified copula
model, marginal distributions and covariate values. The arguments
n, copula, and eta assign the
sample size, the copula type, and the dependence parameter value. For
marginal distributions, the argument dist can be one of the
three parametric distributions in Table 2
(i.e., Weibull, Loglogistic and Gompertz), and their distribution
parameters are given through baseline. For Weibull and
Loglogistic, the baseline parameters are \(\lambda\) (scale) and \(k\) (shape); whereas \(a\) (shape) and \(b\) (rate) for the Gompertz distribution.
In this current version, we assume that the two margins share the same
set of covariates and effects, which are assigned by
var_list and COV_beta, respectively. Lastly,
x1 and x2 input a data frame of covariate
values for the two margins, respectively. Figure 2 illustrates a scatter plot of \(500\) simulated bivariate event times from
a Clayton model with Weibull margins, as demonstrated in the code
below.
library(CopulaCenR)
set.seed(1)
dat <- data_sim_copula(n = 500, copula = "Clayton", eta = 3, dist = "Weibull",
baseline = c(0.1,2), var_list = c("var1", "var2"), COV_beta = c(0.1, 0.1),
x1 = cbind(rnorm(500, 6, 2), rbinom(500, 1, 0.5)),
x2 = cbind(rnorm(500, 6, 2), rbinom(500, 1, 0.5)))
head(dat)
id ind var1 var2 time
1 1 6.130533 1 8.062168
1 2 6.154606 1 7.472649
2 1 8.070653 1 6.317247
2 2 5.406263 1 5.904064
3 1 10.520432 1 4.195788
3 2 3.633516 1 4.771523
plot(x = dat$time[dat$ind == 1], y = dat$time[dat$ind == 2],
xlab = expression(t[1]), ylab = expression(t[2]), cex.axis = 1, cex.lab = 1.3)
The bivariate right-censored input dataset shall be a data frame including the covariates and four additional key input columns:
We use the DRS (Diabetic Retinopathy Study) data as an example. The DRS data contain bivariate right-censored time to blindness from 197 diabetic retinopathy patients. These patients were from a 50% random sample of the patients with "high-risk" diabetic retinopathy as defined by the DRS (Huster, Brookmeyer, and Self 1989). Each patient had one eye randomized to one of the two laser treatments and the other eye received no treatment. For each eye, the event of interest was the time from initiation of treatment to the time to blindness in months. Censoring was caused by death, dropout, or end of the study. The data can be loaded by
data("DRS", package = "CopulaCenR")
head(DRS)
id ind obs_time status treat age type
5 1 46.23 0 0 28 2
5 2 46.23 0 2 28 2
14 1 42.50 0 2 12 1
14 2 31.30 1 0 12 1
16 1 42.27 0 1 9 1
16 2 42.27 0 0 9 1
There are three covariates: treat is treatment with
\(0\) for no treatment, \(1\) for xenon laser treatment and \(2\) for argon laser treatment;
age is the age at diagnosis of diabetes; type
is the type of diabetes with \(1\) for
juvenile (age \(\leq 20\) at diagnosis)
and \(2\) for adult. The primary
question of the DRS study was to assess the treatment effectiveness
while accommodating the dependence between two eyes.
We now demonstrate how to fit a Clayton copula model with Weibull
margins to the DRS data using the function rc_par_copula.
We are interested in the treatment effect, as indicated in argument
var_list. The arguments copula and
m.dis specify the fitted copula model and marginal baseline
distributions. The default optimization method is BFGS
(Nash 1990). Other optimization methods
and control parameters can also be applied (see
?optim).
library(CopulaCenR)
clayton_wb <- rc_par_copula(data = DRS, var_list = "treat", copula = "Clayton",
m.dist = "Weibull", method = "BFGS")
summary(clayton_wb)
Copula: Clayton
Margin: Weibull
estimate SE stat pvalue
lambda 90.6440318 13.1887218 47.2360 6.293e-12 ***
k 0.8062766 0.0586207 189.1758 < 2.2e-16 ***
treat1 -0.5714498 0.1997080 8.1878 0.004217 **
treat2 0.0052997 0.1739106 0.0009 0.975689
eta 0.6205855 0.2610638 5.6508 0.017447 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(The Wald tests are testing whether each coefficient is 0)
Final llk: -839.7212
Convergence is completed successfully
The estimation and Wald test results suggest the xenon treatment
significantly reduced the risk of blindness compared to controls (\(p=0.004217\) for treat1). We
also compared our estimates with previous findings in Huster, Brookmeyer, and Self (1989). Due to the
differences in the model parameterization, we first transformed our
estimates into comparable forms. Specifically, the xenon treatment
effect in Huster, Brookmeyer, and Self
(1989) can be expressed as \(-k\log(\lambda) + treat1 = -4.20\) and
similarly the argon treatment effect is \(-k\log(\lambda) + treat2 = -3.63\), which
are consistent with the reported estimates \((-4.20, -3.42)\) in the Table 2 (page \(151\)) of Huster,
Brookmeyer, and Self (1989). The AIC and BIC of this model can be
obtained from the S3 methods AIC and BIC.
AIC(clayton_wb)
1689.442
BIC(clayton_wb)
1705.858
After the model is fitted, Kendall’s \(\tau\) can be estimated through the
function tau_copula.
tau_copula(eta = as.numeric(coef(clayton_wb)["eta"]), copula = "Clayton")
0.2368118
The fitted values (i.e., linear predictors and survival
probabilities) can be extracted through the function
fitted. As the model is a PH model, the linear predictors
(type is “lp”) are the estimated log proportional
hazards.
fit1 <- fitted(clayton_wb, type = "lp")
fit1[1:3, ]
id lp1 lp2
5 0.000000000 0.005299655
14 0.005299655 0.000000000
16 -0.571449835 0.000000000
When type is “survival”, the fitted outputs are marginal
(S1, S2) and joint (S12) survival
probabilities at the observed times (t1, t2).
fit2 <- fitted(clayton_wb, type = "survival")
fit2[1:3, ]
id t1 t2 S1 S2 S12
5 46.23 46.23 0.5592967 0.5575724 0.3643588
14 42.50 31.30 0.5793467 0.6542323 0.4234880
16 42.27 42.27 0.7369175 0.5823995 0.4655204
Similarly, the predict function provides predictions for
new observations with covariates. Its outputs can be either linear
predictors or survival probabilities (at specified times). The following
newdata1 example contains two subjects under different
treatments.
newdata1 <- data.frame(id = rep(1:2, each=2), ind = rep(c(1,2),2),
time = rep(40,4), treat = factor(c(0,1,0,2)))
newdata1
id ind time treat
1 1 40 0
1 2 40 1
2 1 40 0
2 2 40 2
predict(clayton_wb, newdata = newdata1, type = "lp")
id lp1 lp2
1 0 -0.571449835
2 0 0.005299655
predict(clayton_wb, newdata = newdata1, type = "survival")
id t1 t2 S1 S2 S12
1 40 40 0.5962669 0.7467754 0.4799705
2 40 40 0.5962669 0.5946309 0.4024998
The bivariate interval-censored input dataset shall be a data frame including the covariates and five key input columns:
We use the AREDS (Age-Related Eye Disease Study) data as an example. The event of interest is the AMD (Age-related Macular Degeneration) disease, which is a leading cause of blindness in the developed world (Swaroop et al. 2009). It is known as a polygenic, progressive and neurodegenerative disorder. The AREDS study is a multi-center randomized clinical trial studying the development and progression of AMD, sponsored by the National Eye Institute (AREDS Group (1999), (1999)). Due to intermittent assessment times (every 6 months up to the first 6 years and every 1 year since after), the exact time when each eye progressed to late-AMD was only known to lie in a certain interval. As a result, the outcome data are bivariate interval-censored. The package includes a subset data of 629 Caucasian participants from AREDS who had at least one eye in moderate AMD stage at baseline. The data can be loaded by
data("AREDS", package = "CopulaCenR")
head(AREDS)
id ind Left Right status SevScaleBL ENROLLAGE rs2284665
1 1 0.0 2.0 1 6 67.0 1
1 2 0.0 2.0 1 8 67.0 1
2 1 0.0 2.0 1 7 68.0 0
2 2 5.9 9.3 1 4 68.0 0
3 1 8.0 9.1 1 7 64.9 0
3 2 10.0 Inf 0 7 64.9 0
Out of these 629 subjects, 273 subjects developed late-AMD in both eyes during the study and the times to late-AMD were interval-censored; 138 subjects developed late-AMD in one eye (interval-censored) and did not develop late-AMD before the end of the study (right-censored); the rest 218 subjects were right-censored for late-AMD in both eyes.
There are three continuous covariates: SevScaleBL for
baseline AMD severity score (a value between 1 and 8 with a higher value
indicating more severe AMD), ENROLLAGE for baseline age and
rs2284665 for a genetic variant (\(0,1,2\) for \(GG,GT,TT\)) that might be associated with
AMD progression. The two clinical covariates SevScaleBL and
ENROLLAGE are well-known risk factors of AMD. Thus, our
primary interest is to find out whether the genetic variant
rs2284665 is significantly associated with AMD
progression.
We fit a two-parameter copula semiparametric transformation model for
the AREDS data through the function ic_spTran_copula. The
arguments l and u are the range of event
times, which need to be pre-specified by the user. In practice,
l and u can be set as the minimum and maximum
of observed times. The argument m corresponds to the degree
of Bernstein polynomials (as shown in formula @ref(eq:Bern)), with the
default value \(m=3\). The argument
r specifies the form of marginal transformation model (as
shown in formula @ref(eq:tranform-G)). In practice, the values of
m and r can be chosen based on the smallest
AIC for a list of fitted models with different values.
We now demonstrate how to fit a two-parameter copula semiparametric model to the AREDS data. We chose the range of event times as \(l=0\) and \(u=15\), use the default Bernstein polynomial degree as \(m=3\) and assume PO for the margins (i.e., \(r=3\)).
library(CopulaCenR)
copula2_sp <- ic_spTran_copula(data = AREDS, copula = "Copula2",
var_list = c("ENROLLAGE","rs2284665","SevScaleBL"),
l = 0, u = 15, m = 3, r = 3)
summary(copula2_sp)
Copula: Copula2
Margin: semiparametric
estimate SE stat pvalue
ENROLLAGE 0.042610 0.012271 12.057 0.0005159 ***
rs2284665 0.397712 0.091180 19.026 1.290e-05 ***
SevScaleBL 0.722681 0.053258 184.132 < 2.2e-16 ***
alpha 0.930508 0.058714 251.167 < 2.2e-16 ***
kappa 0.974037 0.226081 18.562 1.645e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(The Wald tests are testing whether each coefficient is 0)
Final llk: -2104.178
Convergence is completed successfully
From the output, the estimated odds ratio for the genetic variant \(rs2284665\) is \(\exp(0.397712) = 1.49\) with a \(p\)-value \(1.29 \times 10^{-5}\), implying it has a “harmful” effect for AMD patients by having more copies of its minor allele \(T\). The AIC and BIC values are \(4226.356\) and \(4266.353\), respectively.
AIC(copula2_sp)
4226.356
BIC(copula2_sp)
4266.353
Also, the estimated Kendall’s \(\tau\) is \(0.38\), suggesting moderate dependence in AMD progression between two eyes.
tau_copula(eta = as.numeric(coef(copula2_sp)[c("alpha","kappa")]),
copula = "Copula2")
0.3851248
Furthermore, we can test the effect of \(rs2284665\) by the generalized score test.
We first fit a null model without \(rs2284665\) and then test its effect using
the function score_copula.
copula2_sp_null <- ic_spTran_copula(data = AREDS, copula = "Copula2",
var_list = c("ENROLLAGE","SevScaleBL"),
l = 0, u = 15, m = 3, r = 3)
score_copula(object = copula2_sp_null, var_score = "rs2284665")
stat pvalue
1.943163e+01 1.042661e-05
The LRT can also be performed by applying two nested models to the
function lrt_copula.
lrt_copula(model1 = copula2_sp, model2 = copula2_sp_null)
stat pvalue
9.543119588 0.002007003
The following codes plot the 3D joint survival probabilities for the
three subjects in newdata2, which have the same
SevScaleBL = 3 in both eyes and
ENROLLAGE = 60, but vary in the genotype of
rs2284665. In the plot function, the argument
class specifies the plot type, which can be one of “joint”,
“conditional” and “marginal”. When class = "joint", it
generates a 3D interactive contour that can be manually rotated for the
desired visualization. Figure 3 is a
snapshot of 3D contours for the three subjects in
newdata2.
newdata2 <- data.frame(id = rep(1:3, each=2), ind = rep(c(1,2),3),
SevScaleBL = rep(3,6), ENROLLAGE = rep(60,6),
rs2284665 = c(0,0,1,1,2,2))
newdata2
id ind SevScaleBL ENROLLAGE rs2284665
1 1 3 60 0
1 2 3 60 0
2 1 3 60 1
2 2 3 60 1
3 1 3 60 2
3 2 3 60 2
plot(x = copula2_sp, class = "joint", newdata = newdata2)
Similarly, the conditional survival probabilities (Figure 4) can be obtained for the left eyes from the
same three subjects, given their right eyes (i.e.,
cond_margin \(= 2\)) had
progressed (to late-AMD) at year \(5\)
(i.e., cond_time \(=
5\)).
plot(x = copula2_sp, class = "conditional", newdata = newdata2,
cond_margin = 2, cond_time = 5, ylim = c(0.25,1),
ylab = "Conditional Survival Probability")
Likewise, we can also obtain the eye-level marginal survival
probabilities (i.e., plot_margin \(= 1\) for the left eyes) for the same three
subjects, as illustrated in Figure 5.
plot(x = copula2_sp, class = "marginal", newdata = newdata2,
plot_margin = 1, ylim = c(0.6,1), ylab = "Marginal Survival Probability")
This paper presents the R package CopulaCenR for implementing copula-based regression models in bivariate censored data, including both bivariate right-censored data and bivariate interval-censored data. A variety of Archimedean copulas, including a flexible two-parameter copula, are built in the package to accommodate different dependence structures. Moreover, the package can fit various parametric and semiparametric regression models for the two margins within the copula function. In particular, a general semiparametric transformation model with PH and PO models being its special cases is implemented for the margins in this package. For parameter estimation, a novel two-step procedure is adopted to guarantee stable and fast computation. For the inference of covariate effects, all three likelihood-based tests are provided. Lastly, two real data examples are given to demonstrate the key features and capabilities of this package.
One future extension of this package is to allow multivariate copula functions for handling multivariate censored events. Another important research extension is to add goodness-of-fit tests, which is critical for choosing a proper copula model. However, there are limited works in testing copula models in bivariate censored data, especially in bivariate interval-censored data under the regression setting. The current literature (e.g., (Shih 1998; Andersen et al. 2010; Emura, Lin, and Wang 2010; Wang 2010)) only focus on testing copulas in bivariate right-censored data without covariates. We are currently investigating these directions and plan to incorporate them in a future version of CopulaCenR.
The Authors are grateful to Dr. Wei Chen for providing valuable suggestions about package development and the AREDS data analysis.