Abstract
Marginal methods have been widely used for analyzing longitudinal ordinal data due to their simplicity in model assumptions, robustness in inference results, and easiness in the implementation. However, they are often inapplicable in the presence of measurement errors in the variables. Under the setup of longitudinal studies with ordinal responses and covariates subject to misclassification, Chen, Yi, and Wu (2014) developed marginal methods for misclassification adjustments using the second-order estimating equations and proposed a two-stage estimation approach when the validation subsample is available. Parameter estimation is conducted through the Newton-Raphson algorithm, and the asymptotic distribution of the estimators is established. While the methods of Chen, Yi, and Wu (2014) can successfully correct the misclassification effects, its implementation is not accessible to general users due to the lack of a software package. In this paper, we develop an R package, mgee2, to implement the marginal methods proposed by Chen, Yi, and Wu (2014). To evaluate the performance and illustrate the features of the package, we conduct numerical studies.Analysis of longitudinal ordinal data is a common research topic in health science and survey sampling. Typically, Liang and Zeger (1986) introduced the generalized estimating equations (GEE) method that gave consistent estimation with mild assumptions of the joint distribution of the repeated measurements. This method has been used widely in analyzing longitudinal binary and categorical data. The validity of the GEE method hinges on the critical condition that data are precisely observed, which is commonly infeasible and violated in practice (Grace Y. Yi 2017). Extensive discussions about covariate error (Carroll et al. 2006) and response with binary misclassification (Neuhaus 1999; Chen, Yi, and Wu 2011; Grace Y. Yi 2017) have been conducted in the literature. For example, Neuhaus (1999) investigated the bias due to errors in the response. Grace Y. Yi (2008) proposed a simulation–extrapolation (SIMEX) method to handle both dropout and covariate measurement error problems in longitudinal studies. Furthermore, in Grace Y. Yi (2017, Ch5), the impact of covariate measurement error on longitudinal data analysis was investigated, and methods of addressing covariate measurement error effects were described.
To accommodate effects induced from error-prone correlated ordinal responses and ordinal covariates simultaneously, (Chen, Yi, and Wu 2014) proposed GEE-based methods for the estimation of both mean and association parameters. The proposed methods are based on formulating unbiased second-order estimating functions and solving the resulting equations using the Newton-Raphson algorithm. The asymptotic distributions for the proposed estimators are established. While the methods of (Chen, Yi, and Wu 2014) correct for error effects due to misclassified variables, the methods cannot be used by the analysts without programming the implementation procedures. To expedite the use of the methods for problems in applications, in this paper, we develop an R package, called mgee2, to implement the methods of (Chen, Yi, and Wu 2014).
Our work offers an R package complement to available R packages for analyzing longitudinal data with misclassified observations. It is relevant to but differs from available R packages about measurement error. For example, the package SAMBA, developed by (Beesley and Mukherjee 2020), provides resources for fitting logistic regression with misclassified binary outcomes. The R package misclassGLM implements inferential procedures for generalized linear models with misclassified covariates proposed by (Dlugosz, Mammen, and Wilke 2017); (Zhang and Yi 2019) developed the package augSIMEX to implement the method proposed by (Grace Y. Yi et al. 2015) for fitting generalized linear models with mixed continuous and discrete covariates subject to mismeasurement.
When the degree of measurement error is very severe, the observed surrogate measurements are virtually useless, and hence the corresponding variables may be alternatively treated as subject to missingness. Regarding the analysis of longitudinal data with missing observations, packages kml and kml3d, developed by (Genolini et al. 2015), describe the implementation procedures of \(\textit{k}\)-means for longitudinal clustered data with missing observations. Carey (2015) developed the package gee to solve generalized estimation equations with longitudinal data missing completely at random. (Xu, Li, and Wang 2018) developed the package wgeesel for using weighted generalized estimating equations approaches to analyze longitudinal clustered data with data missing at random. (Xiong and Yi 2019) developed the package swgee for analyzing longitudinal data with missingness in the response and measurement error in covariates. Our package mgee2 differs from those packages in its ability to simultaneously handle the features of misclassification in correlated ordinal responses and ordinal covariates, which to our best knowledge, is the first software package to address this problem.
The article is organized as follows. Section 2 introduces the notations and estimation procedures proposed by (Chen, Yi, and Wu 2014). Section 3 describes the usage of the package mgee2. Section 4 illustrates the package by simulation studies and a real dataset. We finally conclude the article in Section 6.
We first review the notation and formulations of (Chen, Yi, and Wu 2014). For \(i=1,\dots, n\) and \(j=1, \dots, m_i\), let \(\texttt{Y}_{i j}\) denote an error-prone ordinal response variable for subject \(i\) at visit \(j\). Suppose that the response variable \(\texttt{Y}_{i j}\) has \((K+1)\) categories, denoted \(0, 1, \dots, K\), and that an error-prone ordinal covariate \(\mathbf X_{ij}\) has \((H + 1)\) categories, denoted \(0, 1, \dots, H\). Let \(\mathbf X_{ij}=\left(X_{ij1},\dots,X_{ijH}\right)^T\) be the misclassification-prone vector of binary variables such that \(X_{ijq}=I\)(the covariate \(\mathbf X_{ij}\) in category \(q\)) for \(q=0, 1, \dots, H\), and let \(\mathbf Z_{ij}\) be the vector of covariates that are free of measurement error, where \(I(\cdot)\) is the indicator function. Furthermore, we define \(\mathbf X_i=\left(\mathbf X_{i1}^T,\dots,\mathbf X_{im_i}^T\right)^T\) and \(\mathbf Z_i=\left(\mathbf Z_{i1}^T,\dots,\mathbf Z_{im_i}^T\right)^T\).
Let \[\label{eq:lambda_def} \lambda_{ijk}=P\left(\texttt{Y}_{i j} \geq k | \mathbf X_i, \mathbf Z_i\right) (\#eq:lambda-def)\] be the univariate cumulative probability with \(k=1,\dots,K\), and adopt the assumption \(P\left(\texttt{Y}_{i j} \geq k | \mathbf X_i, \mathbf Z_i\right)=P\left(\texttt{Y}_{i j} \geq k | \mathbf X_{ij}, \mathbf Z_{ij}\right)\) (Pepe and Anderson 1994). Consider the proportional odds models \[\operatorname{logit}\, \lambda_{i j k}=\mathbf\beta_{0 k}+\mathbf X_{i j}^T \mathbf\beta_x+\mathbf Z_{i j}^T \mathbf\beta_z,\] where \(\mathbf\beta_{0 k}\), \(\mathbf\beta_x\), and \(\mathbf\beta_z\) are regression parameters, \(k=1,\dots,K\), \(j=1,\dots,m_i\), and \(i=1,\dots,n\). Similar to Williamson, Kim, and Lipsitz (1995), we measure the association between a pair of responses for the same subject at two different visits by the global odds ratio \[\label{eq:psi_def} \psi_{i, j k, j^{\prime} k^{\prime}}=\frac{P\left(\texttt{Y}_{i j} \geq k, \texttt{Y}_{i j^{\prime}} \geq k^{\prime} | \mathbf X_i, \mathbf Z_i\right) \times P\left(\texttt{Y}_{i j}<k, \texttt{Y}_{i j^{\prime}}<k^{\prime} | \mathbf X_i, \mathbf Z_i\right)}{P\left(\texttt{Y}_{i j} \geq k, \texttt{Y}_{i j^{\prime}}<k^{\prime} | \mathbf X_i, \mathbf Z_i\right) \times P\left(\texttt{Y}_{i j}<k, \texttt{Y}_{i j} \geq k^{\prime} | \mathbf X_i, \mathbf Z_i\right)}, (\#eq:psi-def)\] where \(k, k^{\prime}=1, \dots, K\), and \(j \neq j^{\prime}\). To characterize the dependence of the global odds ratios on covariates, the log-linear models can be expressed as \[\log \, \psi_{i, j k, j^{\prime} k^{\prime}}=\phi+\phi_k+\phi_{k^{\prime}}+\phi_{k k^{\prime}}+\mathbf{u}_{i j j^{\prime}}^T \mathbf \alpha_1,\] where \(\phi\) is the global intercept, \(\phi_k\) and \(\phi_{k^{\prime}}\) correspond to the effect of category \(k\) and of category \(k^{\prime}\), respectively, \(\phi_{k k^{\prime}}\) is the interaction effect between categories \(k\) and \(k^{\prime}\) with \(\phi_{k k^{\prime}}=\phi_{k^{\prime}k}\), and \(\mathbf\alpha_1\) is a vector of parameters corresponding to pair-specific covariates, denoted \(\mathbf{u}_{i j j^{\prime}}\). The constraint \(\phi_1=\phi_{1k}=\phi_{k1}=0\) is set for the model identification for \(k=1,\dots,K\) (Williamson, Kim, and Lipsitz 1995).
Let \(\mathbf\beta=\left(\mathbf\beta_{0k}^T, \mathbf\beta_{x}^T, \mathbf\beta_{z}^T\right)^T\), \(\mathbf\alpha=\left(\phi,\phi_k,\phi_{k k^{\prime}},\mathbf\alpha_1^T\right)^T\), and \(\mathbf\theta=\left(\mathbf\beta^T,\mathbf\alpha^T\right)^T\). For \(k=1,\dots,K\), let \(\textbf{\texttt{Y}}_{ij}=\left(\texttt{Y}_{ij1},\dots,\texttt{Y}_{ijk}\right)^T\) with \(\texttt{Y}_{ijk}=I\left(\texttt{Y}_{i j}=k\right)\). Define \(\textbf{\texttt{Y}}_i=\left(\textbf{\texttt{Y}}_{i1}^T,\dots,\textbf{\texttt{Y}}_{i m_i}^T\right)^T\). For \(j< j^{\prime}\), let \(C_{i,jk,j^{\prime}k^{\prime}}=\texttt{Y}_{ijk}\texttt{Y}_{ij^{\prime}k^{\prime}}\), \(\mathbf C_{ijj^{\prime}}=\left(C_{i,j1,j^{\prime}1},C_{i,j1,j^{\prime}2},\dots,C_{i,jK,j^{\prime}K^{\prime}}\right)^T\), and \(\mathbf C_i=\left(\mathbf C_{ijj^{\prime}}^T,j<j^{\prime}\right)^T\). The univariate and bivariate marginals, \(\mathbf\mu_i=E\left(\textbf{\texttt{Y}}_i|\mathbf X_i,\mathbf Z_i\right)\) and \(\mathbf \xi_i=E\left(\mathbf C_i|\mathbf X_i,\mathbf Z_i\right)\), can be expressed in terms of the global odds ratios and univariate and bivariate cumulative probabilities; the detailed expressions are given by (Chen, Yi, and Wu 2014).
As a result, the estimating functions for the mean and association parameters \(\mathbf\beta\) and \(\mathbf\alpha\) are given by \[\label{u1} \mathbf U_{1 i}\left(\mathbf \theta; \textbf{\texttt{Y}}_i, \mathbf X_i, \mathbf Z_i\right)=\mathbf D_{1 i} \mathbf V_{1 i}^{-1}\left(\textbf{\texttt{Y}}_i-\mathbf \mu_i\right) (\#eq:u1)\] and \[\label{u2} \mathbf U_{2 i}\left(\mathbf \theta; \textbf{\texttt{Y}}_i, \mathbf X_i, \mathbf Z_i\right)=\mathbf D_{2 i} \mathbf V_{2 i}^{-1}\left(\mathbf C_i-\mathbf \xi_i\right), (\#eq:u2)\] respectively, where \(\mathbf D_{1 i}=\partial\mathbf\mu_i^T/\partial\mathbf\beta\), \(\mathbf D_{2 i}=\partial\mathbf\xi_i^T/\partial\mathbf\alpha\), and \(\mathbf V_{1i}\) and \(\mathbf V_{2i}\) are the conditional covariance matrices for \(\textbf{\texttt{Y}}_i\) and \(\mathbf C_i\), given \(\mathbf X_i\) and \(\mathbf Z_i\).
If the true measurements of the responses and covariates are available, (@ref(eq:u1)) and (@ref(eq:u2)) can be used for estimation of \(\mathbf\beta\) and \(\mathbf\alpha\). However, in applications, the \(\textbf{\texttt{Y}}_{i j}\) and the \(\mathbf X_{ij}\) may be subject to misclassification. Let \(\mathbf S_{ij}\) and \(\mathbf W_{ij}\) be surrogate measurements of \(\textbf{\texttt{Y}}_{ij}\) and \(\mathbf X_{ij}\), respectively. Let \(\tau_{ijkl}=P\left(\mathbf S_{ij}=l|\textbf{\texttt{Y}}_{i j}=k, \mathbf Z_i\right)\) be the conditional probability concerning the response for subject \(i\) at visit \(j\) where \(k,l=0,\dots,K\). Let \(\pi_{ijqr}=P\left(\mathbf W_{ij}=r|\mathbf X_{ij}=q, \mathbf Z_i\right)\) be the conditional probability concerning the covariate for subject \(i\) at visit \(j\) where \(q,r=0,\dots,H\). Consider the generalized logistic models by setting category 0 as the reference: \[\log\left(\tau_{i j k l} / \tau_{i j k 0}\right)=\mathbf L_{i j}^T \mathbf\gamma_{k l} \quad \text{for} \quad l=1, \dots, K; k=0, \dots, K\] and \[\log\left(\pi_{i j q r} / \pi_{i j q 0}\right)=\mathbf L_{i j}^{x T} \mathbf\varphi_{q r}\quad\text{for} \quad r=1, \dots, H; q=0, \dots, H,\] where \(\mathbf L_{ij}\) and \(\mathbf L_{ij}^x\) are vectors of variables related to response and covariate misclassification processes, respectively, and \(\mathbf\gamma_{k l}\) and \(\mathbf\varphi_{q r}\) are vectors of regression parameters.
Let \(\mathbf\gamma_{k}=\left(\mathbf\gamma_{k 1}^T, \dots, \mathbf\gamma_{k K}^T\right)^T\) and \(\mathbf\gamma=\left(\mathbf\gamma_{0}^T, \dots, \mathbf\gamma_{K}^T \right)^T\). Let \(\mathbf\varphi_{q}=\left(\mathbf\varphi_{q 1}^T, \dots, \mathbf\varphi_{q H}^T \right)^T\) and \(\mathbf\varphi=\left(\mathbf\varphi_{0}^T, \dots, \mathbf\varphi_{H}^T\right)^T\). Let \(\mathbf\eta=\left(\mathbf\gamma^T, \mathbf\varphi^T\right)^T\). Define the \(K\times K\) matrix \(\mathbf R_{i j}=\left(\mathbf \tau_{i j 1}-\mathbf\tau_{i j 0}, \dots, \mathbf\tau_{i j K}-\mathbf\tau_{i j 0}\right)\) and the \(H\times H\) matrix \(\mathbf G_{i j}=\left(\mathbf \pi_{i j 1}-\mathbf\pi_{i j 0}, \dots, \mathbf\pi_{i j K}-\mathbf\pi_{i j 0}\right)\), where \(\mathbf\tau_{i j k}=\left(\tau_{i j k 1}, \dots, \tau_{i j k K}\right)^T\) and \(\mathbf\pi_{i j k}=\left(\pi_{i j k 1}, \dots, \pi_{i j k K}\right)^T\). Then the unbiased surrogates for \(\textbf{\texttt{Y}}_{ij}\) and \(\mathbf X_{ij}\) are constructed, respectively, by \[\textbf{\texttt{Y}}_{i j}^{*}=\mathbf R_{i j}^{-1}\left(\mathbf S_{i j}-\mathbf\tau_{i j 0}\right)\] and \[\mathbf X_{i j}^{*}=\mathbf G_{i j}^{-1}\left(\mathbf W_{i j}-\mathbf\pi_{i j 0}\right),\] where we write \(\textbf{\texttt{Y}}_{i j}^{*}=\left(Y_{i j 1}^*, \dots, Y_{i j K}^{*}\right)^T\), \(\mathbf X_{i j}^{*}=\left(X_{i j 1}^*, \dots, X_{i j K}^{*}\right)^T\), and let \(\textbf{\texttt{Y}}_{i}^{*}=\left(\textbf{\texttt{Y}}_{i 1}^T, \dots, \textbf{\texttt{Y}}_{i m_i}^{* T}\right)^T\). Let \(\mathbf e_q\) be an \(H\)-dimensional vector whose \(r\)th element is an indicator \(I(r=q)\) for \(q=1,\dots,H\) and let \(\mathbf e_0=0\).
If \(\mathbf\eta\) is known, then \[\mathbf U_{1 i}^*(\mathbf\theta)=\sum_{q_{m_{i}}=0}^{H} \dots \sum_{q_{1}=0}^{H}\left[\mathbf U_{1 i}\left\{\theta ; \textbf{\texttt{Y}}_i^*,\left(\mathbf e_{q_{1}}^T, \dots, \mathbf e_{q_{i_{i}}}^T\right)^T, \mathbf Z_i\right\} \prod_{j=1}^{m_{i}} X_{i j q_{j}}^{*}\right]\] and \[\mathbf U_{2 i}^*(\mathbf\theta)=\sum_{q_{m_{i}}=0}^{H} \dots \sum_{q_{1}=0}^{H}\left[\mathbf U_{2 i}\left\{\theta ; \textbf{\texttt{Y}}_i^*,\left(\mathbf e_{q_{1}}^T, \dots, \mathbf e_{q_{i_{i}}}^T\right)^T, \mathbf Z_i\right\} \prod_{j=1}^{m_{i}} X_{i j q_{j}}^{*}\right]\] are unbiased estimating functions of \(\mathbf\theta\), as shown in Appendix 2 of Chen, Yi, and Wu (2014). Estimation of \(\mathbf\theta\) can then be obtained by solving \[\label{u12} \sum_{i=1}^{n}\left\{\begin{array}{c}{\mathbf U_{1 \mathrm{i}}^*(\mathbf\theta)} \\ {\mathbf U_{2 i}^{*}(\mathbf\theta)}\end{array}\right\}=\mathbf 0 (\#eq:u12)\] for \(\mathbf\theta\).
Case 1 highlights the estimation of \(\mathbf\theta\) when the parameter \(\mathbf\eta\) for the misclassification models is known or specified as a given value. In applications, \(\mathbf\eta\) is unknown and may be estimated from a validation subsample. In this case, we modify the estimation procedure based on (@ref(eq:u12)) and describe a two-stage estimation procedure. First, let \(\delta_{ij}=I\)(subject \(i\) at visit \(j\) is included in the validation subsample). Using validation data (i.e., \(\delta_{ij}=1\)), we may estimate \(\tau_{ij}\) and \(\pi_{ij}\).
Define \(\mathbf D_{\mathbf\gamma i j}=\partial \mathbf\tau_{i j}^T/ \partial \mathbf\gamma\) and \(\mathbf D_{\mathbf\varphi i j}=\partial \mathbf\pi_{i j}^T/ \partial \mathbf\varphi\), then estimating functions for \(\mathbf\gamma\) and \(\mathbf\varphi\) are given by \(\mathbf Q_{\gamma i}(\mathbf\gamma)=\sum_{j=1}^{m_{i}} \mathbf D_{\gamma i j} \mathbf V_{\gamma i j}^{-1}\left\{\mathbf S_{i j}-\mathbf\tau_{i j}\right\} \delta_{i j}\) and \(\mathbf Q_{\varphi i}(\mathbf\varphi)=\sum_{j=1}^{m_{i}} \mathbf D_{\varphi i j} \mathbf V_{\varphi i j}^{-1}\left\{\mathbf W_{i j}-\mathbf\pi_{i j}\right\} \delta_{i j}\), where \(\mathbf V_{\mathbf\gamma i j}\) and \(\mathbf V_{\mathbf\varphi i j}\) are, respectively, the conditional covariance matrix for \(\mathbf S_{ij}\) and \(\mathbf W_{ij}\), given \(\textbf{\texttt{Y}}_{ij}\) and the true covariates.
Let \(\tilde{\texttt{Y}}_{i j k}=\texttt{Y}_{ijk}\) if \(\delta_{ij}=1\) and \(\tilde{\texttt{Y}}_{i j k}=\texttt{Y}_{ijk}^*\) otherwise, then we write \(\tilde{\textbf{\texttt{Y}}}_{i j}=\left(\tilde{\texttt{Y}}_{i j 1}, \dots, \tilde{\texttt{Y}}_{i j K}\right)^T\). Let \(\tilde{X}_{i j q}=X_{ijq}\) if \(\delta_{ij}=1\) and \(\tilde{X}_{i j q}=X_{ijq}^*\) otherwise. Then the augmented estimating functions of \(\mathbf\theta\) are given by \[\label{au1} \tilde{\mathbf U}_{1 i}(\mathbf\theta, \mathbf\eta)=\sum_{q_{m_{i}}=0}^{H} \dots \sum_{q_{1}=0}^{H}\left[\mathbf U_{1 i}\left\{\theta ; \tilde{\textbf{\texttt{Y}}}_{i},\left(\mathbf e_{q_{1}}^T, \dots, \mathbf{e}_{q_{m}}^T\right)^T, \mathbf Z_{i}\right\} \prod_{j=1}^{m_{i}} \tilde{X}_{i j q_{j}}\right] (\#eq:au1)\] and \[\label{au2} \tilde{\mathbf U}_{2 i}(\mathbf\theta, \mathbf\eta)=\sum_{q_{m_{i}}=0}^{H} \dots \sum_{q_{1}=0}^{H}\left[\mathbf U_{2 i}\left\{\theta ; \tilde{\textbf{\texttt{Y}}}_{i},\left(\mathbf e_{q_{1}}^T, \dots, \mathbf{e}_{q_{m}}^T\right)^T, \mathbf Z_{i}\right\} \prod_{j=1}^{m_{i}} \tilde{X}_{i j q_{j}}\right]. (\#eq:au2)\] Consequently, estimation of \(\mathbf\eta\) and \(\mathbf\theta\) can be carried out by the two-stage procedure.
\(\bf{Stage}\) \(\bf 1.\) Solve \(\sum_{i=1}^{n}\left\{\begin{array}{c}{\mathbf Q_{\mathbf\gamma i}(\mathbf\gamma)} \\ {\mathbf Q_{\varphi i}(\mathbf\varphi)}\end{array}\right\}=\mathbf 0\) for \(\mathbf\gamma\) and \(\mathbf\varphi\) and write \(\hat{\mathbf\eta}=\left(\hat{\mathbf\gamma}^T, \hat{\mathbf\varphi}^T\right)^T\), where \(\hat{\mathbf\gamma}\) and \(\hat{\mathbf\varphi}\) are the estimators for \(\mathbf\gamma\) and \(\mathbf\varphi\), respectively.
\(\bf{Stage}\) \(\bf 2.\) Substitute \(\mathbf\eta\) with \(\hat{\mathbf\eta}\) in (@ref(eq:au1)) and (@ref(eq:au2)) and solve \(\sum_{i=1}^{n}\left\{\begin{array}{l}{\tilde{\mathbf U}_{1 i}(\mathbf\theta, \hat{\mathbf\eta})} \\ {\tilde{\mathbf U}_{2 i}(\mathbf\theta, \hat{\mathbf\eta})}\end{array}\right\}=\mathbf 0\) for \(\mathbf\theta\). Let \(\hat{\mathbf\theta}=\left(\hat{\mathbf\beta}^T, \hat{\mathbf\alpha}^T\right)^T\) denote the resulting estimator \(\mathbf\theta\).
(Chen, Yi, and Wu 2014) established the asymptotic distribution of \(\hat{\mathbf\theta}\). Let \(\tilde{\mathbf U}_i\left(\mathbf \theta, \mathbf \eta\right)=\left\{\tilde{\mathbf U}_{1 i}^T(\mathbf \theta, \mathbf \eta), \tilde{\mathbf U}_{2 i}^T(\mathbf \theta, \mathbf\eta)\right\}^T\), \(\mathbf Q_i(\mathbf\eta)=\left\{\mathbf Q_{\gamma i}^T(\mathbf\gamma), \mathbf Q_{\varphi i}^T(\mathbf\varphi)\right\}^T\), \(\mathbf\Omega_i(\mathbf\theta, \mathbf\eta)=\tilde{\mathbf U}_i(\mathbf\theta, \mathbf\eta)-E\left\{\partial \tilde{U}_i(\mathbf\theta, \mathbf\eta) / \partial \mathbf\eta^T\right\} \cdot\left[E\left\{\partial \mathbf Q_i(\mathbf\eta) / \partial \mathbf\eta^T\right\}\right]^{-1}\), and \(\tilde{\mathbf\Gamma}(\mathbf\theta, \mathbf\eta)=E\left\{\partial \tilde{\mathbf U}_i(\mathbf\theta, \mathbf\eta) / \partial \mathbf\theta^T \right\}\). Then, under regularity conditions, \(n^{1 / 2}(\hat{\mathbf \theta}-\mathbf\theta)\) is asymptotically normally distributed with mean \(\mathbf 0\) and covariance matrix \(\tilde{\mathbf\Gamma}^{-1} \tilde{\mathbf\Sigma}\left(\tilde{\mathbf\Gamma}^{-1}\right)^T\), where \(\tilde{\mathbf\Sigma}=E\left\{\mathbf\Omega_i(\mathbf\theta, \mathbf\eta) \mathbf\Omega_i^T(\mathbf\theta, \mathbf\eta)\right\}\).
We develop an R package, called mgee2, to
implement the misclassification adjustment method described in the
preceding section. This package requires support from the external
packages MASS (Venables and Ripley 2002), Matrix
(Bates and Maechler 2019), and ggplot2
(Wickham 2016). Our mgee2
package mainly contains two functions, mgee2k and
mgee2v, respectively, implementing cases 1 and 2 described
in the previous section. Specifically, mgee2k implements
the method where the misclassification parameters are given, and
mgee2v implements the misclassification method for the case
where validation data are available to estimate misclassification
probabilities. We now describe the details of these two functions.
mgee2kmgee2k implements the misclassification adjustment
method outlined in Case 1 of the previous section, where the
misclassification parameters are known. In this case, validation data
are not required, and only the observed data of the outcome and
covariates are needed for the implementation.
The function mgee2k requires the data set to be grouped
by the individual id, \(i=1,...,n\),
and each individual has \(m_i\) rows of
data each corresponding to the visit time \(j=1,...,m_i\). The column name of the
individual id is indicated by the argument id. The
misclassification matrices for the response and covariate variables are
recorded by the arguments gamMat and
varphiMat, respectively, which need to be specified by the
user.
To call mgee2k, we issue the following command,
mgee2k(formula, id, data, corstr="exchangeable", misvariable,
gamMat, varphiMat, maxit=50, tol=1e-3)
where the meaning of each argument is described as follows:
formula: a formula object which specifies the
relationship between the response and covariates for the observed
data.id: a character object which records individual id in
the data.data: a dataframe or matrix object for the observed
data set.corstr: a character object. The default value is
"exchangeable", corresponding to the structure where the association
between two paired responses is considered to be a constant. The other
option is "log-linear" which indicates the log-linear association
between two paired responses.misvariable: a character object which names the
error-prone covariate W.maxit: an integer which specifies the maximum number of
iterations. The default is 50.tol: a numeric object which indicates the tolerance
threshold. The default is 1e-3.gamMat: a matrix object which records the
misclassification parameter \(\gamma\)
for response \(\texttt{Y}\).varphiMat: a matrix object which records the
misclassification parameter \(\phi\)
for covariate X.The function mgee2k returns a list of components:
beta: the coefficients in the same order as that
specified in the formula for the response and covariates.alpha: the coefficients for paired responses global
odds ratios. The number of \(\alpha\)
coefficients corresponds to the paired responses odds ratio structure
selected in corstr. When
corstr="exchangeable", only one baseline \(\alpha\) is fitted.variance: the variance-covariance matrix of the
estimators of all parameters.convergence: a logical variable; TRUE if the model
converges.iteration: the number of iterations for the estimates
of the model parameters to converge.call: an unevaluated function call which consists of
the named function applied to the given arguments.mgee2vThe function mgee2v does not require the
misclassification parameters to be known, but requires the availability
of validation data.
Similar to mgee2k, the function mgee2v
needs the data set to be structured by individual id, \(i=1,...,n\), and visit time, \(j=1,...,m_i\). The data set should contain
the observed response and covariates, \(S\) and \(W\). To indicate whether or not a subject
is in the validation set, an indicator variable delta
should be added in the data set, and we use a column named
valid.sample.ind for this purpose. The column name of the
error-prone covariate \(W\) should also
be specified in misvariable. To call mgee2v,
we issue the command,
mgee2v(formula, id, data, corstr="exchangeable", misvariable, valid.sample.ind,
y.mcformula, x.mcformula, maxit=50, tol=1e-3)
where the arguments are described as follows:
formula: a formula object which specifies the
relationship between the response and covariates for the observed
data.id: a character object which records individual id in
the data.data: a dataframe or matrix object for the observed
data set.corstr: a character object. The default value is
"exchangeable", corresponding to the structure where the association
between two paired responses is considered to be a constant. The other
option is "log-linear" which indicates the log-linear association
between two paired responses.misvariable: a character object which names the
error-prone covariate W.valid.sample.ind: a string object which names the
indicator variable delta. When a data point belongs to the validation
set, delta = 1; otherwise 0.y.mcformula: a string object which indicates the
misclassification formula between true response \(\texttt{Y}\) and the surrogate response
S.x.mcformula: a string object which indicates the
misclassification formula between true error-prone covariate X and the
surrogate W.maxit: an integer which specifies the maximum number of
iterations. The default is 50.tol: a numeric object which indicates the tolerance
threshold. The default is 1e-3.The function mgee2v returns a list of components:
beta: the coefficients in the same order as that
specified in the formula for the response and covariates.alpha: the coefficients for paired responses global
odds ratios. The number of \(\alpha\)
coefficients corresponds to the paired responses odds ratio structure
selected in corstr. When
corstr="exchangeable", only one baseline \(\alpha\) is fitted.variance: the variance-covariance matrix of the
estimators of all parameters.convergence: a logical variable; TRUE if the model
converges.iteration: the number of iterations for the estimates
of the model parameters to converge.call: an unevaluated function call which consists of
the named function applied to the given arguments.ordGEE2In addition to developing the package mgee2 to
implement the methods of (Chen, Yi, and Wu
2014), which accommodate misclassification effects in inferential
procedures, we also implement the naive method of ignoring the feature
of misclassification and call the resulting function
ordGEE2. This function can be used together with the
preceding described mgee2k or mgee2v to
evaluate the impact of not addressing misclassification effects:
ordGEE2(formula, id, data, corstr = "exchangeable", maxit = 50, tol = 0.001)
In this section, we conduct numerical studies to demonstrate the
usage of our developed R package as well as to show supplementary
functions such as summary and plot functions in this package. We first
demonstrate all of the external functions in mgee2 through
an example with a simulated data set, known as obs1,
provided in our package.
The simulated data set, called "obs1", includes 8
columns and 3000 rows, with each patient having 3 entries of visits. The
format of this data set is as follows.
> head(obs1)
ID Y X treatment visit S W delta
1 1 2 2 1 1 2 2 1
2 1 0 0 1 2 0 0 1
3 1 <NA> <NA> 1 3 1 2 0
4 2 <NA> <NA> 1 1 1 0 0
5 2 <NA> <NA> 1 2 0 1 0
6 2 <NA> <NA> 1 3 0 0 0
> summary(obs1)
ID Y X treatment visit
Min. : 1.0 0 : 352 0 : 444 0:1500 1:1000
1st Qu.: 250.8 1 : 283 1 : 269 1:1500 2:1000
Median : 500.5 2 : 256 2 : 178 3:1000
Mean : 500.5 NA's:2109 NA's:2109
3rd Qu.: 750.2
Max. :1000.0
S W delta
0:1181 0:1460 Min. :0.000
1: 955 1: 944 1st Qu.:0.000
2: 864 2: 596 Median :0.000
Mean :0.297
3rd Qu.:1.000
Max. :1.000
Here, \(\texttt{Y}\) and \(X\) represent the true outcome and
covariate variables, both being ordinal variables, each taking 3
possible values, denoted 0, 1, and 2, whereas \(S\) and \(W\) are the observed surrogates for \(\texttt{Y}\) and \(X\), respectively, with a \(5\%\) misclassification rate.
delta is 1 when the subject is in the validation set and 0
otherwise. About \(30\%\) of subjects
are randomly chosen to be included in the validation set. We include the
subscripts \(i\) and \(j\) to \(\texttt{Y}\) and \(X\) to indicate the measurements for the
corresponding variables for subject \(i\) at time point \(j\), in considering the proportional odds
model indicated by (@ref(eq:lambda-def)), \[\label{equ:model}
\text{logit} \:
\lambda_{ijk}=\beta_{0k}+\beta_{X1}X_{ij1}+\beta_{X2}X_{ij2}+\beta_{Z1}Z_{ij1}+\beta_{Z2}Z_{ij2}+\beta_{Z3}Z_{ij3}\text{
for }k=1,2, (\#eq:equmodel)\] where \(\lambda_{ijk}\) is defined as for
(@ref(eq:lambda-def)); the treatment variable, denoted \(Z_{ij1}\), is an error-free binary
variable; we simulated 3 visits for each patient, denoted by dummy
variables \(Z_{ij2}\) and \(Z_{ij3}\), with the first visit as a
reference level; \(X_{ij1}\) and \(X_{ij2}\) represent the dummy variables to
reflect the three levels of the error-prone covariate, \(X_{ij}\); and \((\beta_{0k},\beta_{x1},\beta_{x2},\beta_{z1},\beta_{z2},\beta_{z3})^T\)
is the vector of regression coefficients.
In the case corstr = "exchangeable", the association,
defined as in (@ref(eq:psi-def)), between paired responses is assumed to
be \[\text{log} \:
\psi_{i,jk,j'k'}=\phi;\] while in the case
corstr = "log-linear", the association is assumed to be
\[\text{log} \:
\psi_{i,jk,j'k'}=\phi+\phi_2 I(k=2)+\phi_2
I(k'=2)+\phi_{22}I(k=2,k'=2),\] where \(\phi,\phi_2\), and \(\phi_{22}\) are parameters.
We now apply mgee2k and mgee2v, in contrast
to ordGEE2, to fit the data to the models, respectively.
The results are displayed as follows. In the summary tables for the R
output, we use "Y>=1" and "Y>=2" to
denote the coefficients \(\beta_{01}\)
and \(\beta_{02}\), respectively, and
let "Delta" correspond to the parameter \(\phi\) in the dependence structure.
To use function mgee2k, we need to specify the
misclassification matrices beforehand. Here, we set the
misclassification matrices the same as used in the simulation
process.
> data(obs1)
> obs1$visit <- as.factor(obs1$visit)
> obs1$treatment <- as.factor(obs1$treatment)
> obs1$S <- as.factor(obs1$S)
> obs1$W <- as.factor(obs1$W)
> ## set misclassification parameters to be known.
> varphiMat <- gamMat <- log( cbind(0.04/0.95, 0.01/0.95,
+ 0.95/0.03, 0.02/0.03,
+ 0.04/0.01, 0.95/0.01) )
> mgee2k.fit = mgee2k(formula = S~W+treatment+visit, id = "ID", data = obs1,
+ corstr = "exchangeable", misvariable = "W", gamMat = gamMat,
+ varphiMat = varphiMat)
> summary(mgee2k.fit)
Call:
mgee2k(formula = S ~ W + treatment + visit, id = "ID", data = obs1,
corstr = "exchangeable", misvariable = "W", gamMat = gamMat,
varphiMat = varphiMat)
Summary table of the estimation
Estimate Std.Err Z value Pr(>z)
Y>=1 0.70889 0.08591 8.251 2.22e-16 ***
Y>=2 -0.67521 0.08625 -7.828 4.88e-15 ***
W1 0.58667 0.08719 6.729 1.71e-11 ***
W2 0.94948 0.09745 9.743 < 2e-16 ***
treatment1 -0.70554 0.09114 -7.742 9.77e-15 ***
visit2 -0.24147 0.07735 -3.122 0.0018 **
visit3 -0.62480 0.07571 -8.253 2.22e-16 ***
Delta 1.22606 0.12231 10.024 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
To use mgee2v, a column of indicator variable should be
specified in valid.sample.ind.
> data(obs1)
> obs1$visit <- as.factor(obs1$visit)
> obs1$treatment <- as.factor(obs1$treatment)
> obs1$S <- as.factor(obs1$S)
> obs1$W <- as.factor(obs1$W)
> mgee2v.fit = mgee2v(formula = S~W+treatment+visit, id = "ID", data = obs1,
+ y.mcformula = "S~1", x.mcformula = "W~1",
+ misvariable = "W", valid.sample.ind = "delta",
+ corstr = "exchangeable")
> summary(mgee2v.fit)
Call:
mgee2v(formula = S ~ W + treatment + visit, id = "ID",
data = obs1, corstr = "exchangeable", misvariable = "W",
valid.sample.ind = "delta", y.mcformula = "S~1", x.mcformula = "W~1")
Summary table of the estimation
Estimate Std.Err Z value Pr(>z)
Y>=1 0.64876 0.08851 7.330 2.30e-13 ***
Y>=2 -0.68226 0.08703 -7.839 4.44e-15 ***
W1 0.56507 0.08140 6.942 3.88e-12 ***
W2 0.98411 0.09305 10.577 < 2e-16 ***
treatment1 -0.68153 0.09052 -7.529 5.11e-14 ***
visit2 -0.24694 0.07483 -3.300 0.000966 ***
visit3 -0.60027 0.07335 -8.184 2.22e-16 ***
Delta 1.22862 0.12160 10.103 < 2e-16 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> naigee.fit = ordGEE2(formula = S~W+treatment+visit, id = "ID",
+ data = obs1, corstr = "exchangeable")
> summary(naigee.fit)
Call:
ordGEE2(formula = S ~ W + treatment + visit, id = "ID",
data = obs1, corstr = "exchangeable")
Summary table of the estimation
Estimate Std.Err Z value Pr(>z)
Y>=1 0.73276 0.07990 9.171 < 2e-16 ***
Y>=2 -0.69330 0.08004 -8.662 < 2e-16 ***
W1 0.51237 0.07354 6.967 3.23e-12 ***
W2 0.84890 0.08582 9.892 < 2e-16 ***
treatment1 -0.65954 0.08511 -7.749 9.33e-15 ***
visit2 -0.22766 0.07241 -3.144 0.00167 **
visit3 -0.58407 0.07052 -8.282 2.22e-16 ***
Delta 1.06616 0.09846 10.828 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 '' 1
plot_model
We use the function plot_model to compare the results
obtained from the three functions:
> plot_model(naigee.fit)
> plot_model(mgee2.fit)
> plot_model(mgee2v.fit)
It is helpful to compare the odds ratios when there are multiple
covariates. We use the function plot_model to visualize the
odds ratios. The estimated odds ratios for this simulated data set
across the three methods are displayed in Figure 1, 2, and 3. The red dot gives the odds ratio of each
covariate. The horizontal blue line measures the length of each
confidence interval. The vertical axes of the graphs indicate the
descending order of the covariates. In other words, the red points from
the lowest to the highest in the graph represent the first covariate,
the second covariate, and so on. It is seen that the three methods yield
similar odds ratios.
To further compare the three methods, a simulation study is
conducted. We run 500 simulations where each data set includes 1000
subjects, with three visits for each subjects. obs1 is one
example of the simulated data. The true values of the coefficients are
reported in 1:
| \(\beta_{01}\) | \(\beta_{02}\) | \(\beta_{X1}\) | \(\beta_{X2}\) | \(\beta_{Z1}\) | \(\beta_{Z2}\) | \(\beta_{Z3}\) |
| \(\log 2\) | \(\log(1/2)\) | \(\log 2\) | \(\log 3\) | \(\log(1/2)\) | \(\log(3/4)\) | \(\log(1/2)\) |
2 reports the simulation results for the
care with a 5% misclassification rate set for both the response and
covariate variables, where Bias% records a bias in percentage, EV
represents an empirical variance, AMV stands for an average of
model-based variance, and CR records a coverage rate of 95% confidence
intervals. Simulation results show that the mgee2 and
mgee2v perform better than the naive method
ordGEE2, and they produce reasonable results.
| ordGEE2 | mgee2 | mgee2v | ||||||||||
| Bias% | EV | AMV | CR | Bias% | EV | AMV | CR | Bias% | EV | AMV | CR | |
| \(\beta_{01}\) | 3.119 | 0.007 | 0.007 | 0.942 | -0.915 | 0.008 | 0.008 | 0.944 | -2.223 | 0.008 | 0.008 | 0.951 |
| \(\beta_{02}\) | 3.226 | 0.007 | 0.007 | 0.946 | 1.562 | 0.008 | 0.008 | 0.940 | 3.556 | 0.009 | 0.008 | 0.947 |
| \(\beta_{x1}\) | -12.112 | 0.006 | 0.006 | 0.784 | 1.238 | 0.008 | 0.008 | 0.942 | -6.933 | 0.024 | 0.014 | 0.924 |
| \(\beta_{x2}\) | -9.810 | 0.007 | 0.008 | 0.754 | 1.433 | 0.009 | 0.010 | 0.964 | 3.393 | 0.016 | 0.011 | 0.941 |
| \(\beta_{z1}\) | -7.032 | 0.008 | 0.007 | 0.922 | -0.456 | 0.009 | 0.008 | 0.954 | -0.138 | 0.009 | 0.008 | 0.949 |
| \(\beta_{z2}\) | -6.311 | 0.006 | 0.005 | 0.932 | 0.071 | 0.006 | 0.006 | 0.938 | -0.364 | 0.006 | 0.006 | 0.932 |
| \(\beta_{z3}\) | -6.630 | 0.005 | 0.005 | 0.908 | 0.056 | 0.006 | 0.006 | 0.964 | 0.143 | 0.006 | 0.006 | 0.962 |
| \(\phi\) | -13.130 | 0.009 | 0.010 | 0.690 | 0.217 | 0.014 | 0.015 | 0.954 | 1.257 | 0.016 | 0.017 | 0.956 |
In addition to the preceding simulation with a misclassification rate
of 5%, we conducted another simulation with the same parameters except
that the misclassification rate is changed to be 20%, and
corstr = "log-linear". The results are reported in Table 3, which shows more noticeable differences in
implementing the three functions, ‘ordGEE2’, ‘mgee2’, and ‘mgee2v’.
| ordGEE2 | mgee2 | mgee2v | ||||||||||
| Bias% | EV | AMV | CR | Bias% | EV | AMV | CR | Bias% | EV | AMV | CR | |
| \(\beta_{01}\) | 9.589 | 0.007 | 0.007 | 0.866 | 0.748 | 0.015 | 0.015 | 0.952 | 0.872 | 0.015 | 0.016 | 0.966 |
| \(\beta_{02}\) | 11.131 | 0.006 | 0.007 | 0.842 | -0.210 | 0.014 | 0.015 | 0.958 | -0.523 | 0.015 | 0.016 | 0.958 |
| \(\beta_{x1}\) | -48.891 | 0.005 | 0.006 | 0.000 | -1.233 | 0.027 | 0.029 | 0.964 | -0.874 | 0.023 | 0.023 | 0.940 |
| \(\beta_{x2}\) | -43.506 | 0.008 | 0.008 | 0.000 | -0.400 | 0.026 | 0.027 | 0.958 | -0.399 | 0.023 | 0.023 | 0.946 |
| \(\beta_{z1}\) | -26.284 | 0.006 | 0.006 | 0.346 | 0.364 | 0.011 | 0.012 | 0.960 | 0.084 | 0.011 | 0.011 | 0.940 |
| \(\beta_{z2}\) | -24.870 | 0.006 | 0.006 | 0.832 | 1.703 | 0.012 | 0.011 | 0.938 | 1.396 | 0.010 | 0.010 | 0.948 |
| \(\beta_{z3}\) | -26.893 | 0.006 | 0.006 | 0.314 | 0.228 | 0.011 | 0.012 | 0.954 | 0.144 | 0.010 | 0.011 | 0.968 |
| \(\phi_0\) | -53.210 | 0.009 | 0.009 | 0.000 | 1.873 | 0.078 | 0.068 | 0.942 | 1.034 | 0.052 | 0.066 | 0.976 |
| \(\phi_2\) | -73.139 | 0.006 | 0.006 | 0.042 | -0.353 | 0.052 | 0.047 | 0.948 | -1.241 | 0.037 | 0.049 | 0.970 |
| \(\phi_{22}\) | -59.955 | 0.011 | 0.011 | 0.000 | 0.256 | 0.075 | 0.075 | 0.944 | 0.188 | 0.053 | 0.079 | 0.978 |
To illustrate the usage of the developed R package, we analyze a dataset arising from the Framingham Heart Study, obtained from the NIH website (https://biolincc.nhlbi.nih.gov/teaching/). Similar to (Chen, Yi, and Wu 2014), we consider those 915 male patients who completed both exams #2 and #3, and age between 31 and 65 at the entry of the study. The response variable, HBP, is a categorical variable indicating the status of the systolic blood pressure (SBP), where HBP=0 if SBP is below 140 mmHg, HBP=1 if SBP is between 140 mmHg and 159 mmHg, and HBP=2 if SBP is larger than 160 mmHg.
We are interested in understanding the relationship between HBP and covariates, including the serum cholesterol level (CHOL), age, and the current smoking status (CURSMOKE), as well as the examination status, denoted as "Exam3". CHOL is classified as three categories, with 0, 1, and 2 representing normal (less than 200 mg/dL), borderline high (200-239mg/dL), and hypercholesterolemia (greater than 240 mg/dL), respectively. Exam3 is a dummy variable, with 1 indicating observations for exam 3 and 0 for exam 2.
First, we visualize how SBP may change with age by stratifying the study subjects into different categories according to the exam time, smoking status, or CHOL. To see the trend, we display simple linear regression lines that fit scattered points of SBP against AGE for patients in each category, as shown in Figure 4. Except for the patients with CHOL value 0 and CURSMOKE value 0 at exam 2, there is generally an upward trend of SBP versus age for each category, though the degree varies. While each patient takes 2 exams, the time interval between two exams is different from patient to patient. To reflect this feature of different gap times for the study subjects, in Figure 5 we further display spaghetti plots (Hedeker and Gibbons 2006) for patients in different categories, where the two endpoints of each black line segment mark SBP and age for the corresponding study subject at exams 2 and 3 in each category, respectively. The blue curve represents the loess smooth curve in each panel to show the trend of SBP against AGE. The loess smooth function is a tool to create smooth lines for scattered plots using polynomial approximations. The code for producing Figures 4 and 5 is included in the help file of data set heart in our R package.
Next, we use the proportional odds model to examine how SBP may be quantitatively associated with the covariates. For the \(i\)th patient at the \(j\)th visit, \(X_{ij,CHOL=1}\) and \(X_{ij,CHOL=2}\) are binary indicator variables recording whether the patient’s cholesterol level is 1 and 2, respectively; \(Z_{ij,smoker}\) is a binary variable whether or not the patient is a smoker; \(Z_{ij,exam3}\) is a binary variable showing whether or not the patient is taking exam #3; and \(Z_{i,age}\) records the age of the \(i\)th patient at the entry of the study.
As defined in (@ref(eq:lambda-def)), consider the model \[\begin{aligned} \text{logit}\: \lambda_{ijk}=&\beta_{0k}+\beta_{X,CHOL=1}X_{ij,CHOL=1}+\beta_{X,CHOL=2}X_{ij,CHOL=2}\\ &+\beta_{Z,age}Z_{i,age}+\beta_{Z,smoker}Z_{ij,smoker}+\beta_{Z,exam3}Z_{ij,exam3} \label{eq:casestudy} \end{aligned} (\#eq:casestudy)\] for \(k=1,2\), where \(\beta_{0k}, \beta_{X,CHOL=1}, \beta_{X,CHOL=2}, \beta_{Z,age}, \beta_{Z,smoker},\) and \(\beta_{Z,exam3}\) are the parameters.
| ordGEE2 | mgee2k | ||||||||
| Est. | SD | p-vlue | Est. | SD | p-vlue | ||||
| \(\beta_{01}\) | -4.291 | 0.635 | <0.001 | -4.943 | 0.737 | <0.001 | |||
| \(\beta_{02}\) | -5.623 | 0.638 | <0.001 | -6.195 | 0.740 | <0.001 | |||
| \(\beta_{x,CHOL=1}\) | 0.068 | 0.133 | 0.608 | 0.117 | 0.180 | 0.515 | |||
| \(\beta_{x,CHOL=2}\) | 0.352 | 0.140 | 0.012 | 0.474 | 0.186 | 0.011 | |||
| \(\beta_{z,age}\) | 0.063 | 0.012 | <0.001 | 0.071 | 0.014 | <0.001 | |||
| \(\beta_{z,smoker}\) | -0.044 | 0.105 | 0.673 | -0.042 | 0.118 | 0.722 | |||
| \(\beta_{z,exam3}\) | 0.145 | 0.097 | 0.133 | 0.182 | 0.110 | 0.097 | |||
| \(\phi\) | 2.301 | 0.207 | <0.001 | 2.301 | 0.318 | <0.001 |
The data set used in our example is included in our package called
"heart". To demonstrate the usage of the developed package,
we perceive that the response HBP level and the covariate cholesterol
level are prone to misclassification. Since this example does not have a
validation data set, we only analyze the data using the naive method,
"ordGEE2", and the corrected method with a specified known
misclassification rate, "mgee2k", where the
misclassification rates for both the outcome and the covariate are
assumed to be 5%, and the exchangeable dependence structure is
considered. The analysis results are shown in Table 4. Overall, the naive method and the corrected method
indicate the same significant health factors, yet the magnitude of the
coefficient estimates and their standard errors are different. Higher
cholesterol levels and older ages appear to be positively correlated
with high blood pressure.
Analysis of longitudinal ordinal data is important for research in health science, epidemiological studies, and social science. Marginal analysis using generalized estimating equations has been extensively employed in applications. However, such a strategy is challenged by the presence of mismeasurement of variables. To address this challenge, (Chen, Yi, and Wu 2014) developed estimation methods for analyzing correlated ordinal responses and ordinal covariates, which are subject to misclassification.
To allow analysts to apply the useful methods of (Chen, Yi, and Wu 2014) without doing individual codes, we develop an R package mgee2 to implement the methods for general use. Our package provides three methods for estimation, including the two methods of corrections for misclassification effects, as opposed to the naive method, which disregards the feature of mismeasurement in variables. The package can be used for modeling longitudinal ordinal data with misclassified response and covariates. It provides consistent estimation results by directly inputting the data under required assumptions.
The authors thank the review team for the helpful comments on the initial version. Yi’s research was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC). Yi is Canada Research Chair in Data Science (Tier 1). Her research was undertaken, in part, thanks to funding from the Canada Research Chairs program.