Abstract

Canonical correlation analysis (CCA) has a long history as an explanatory statistical method in high-dimensional data analysis and has been successfully applied in many scientific fields such as chemometrics, pattern recognition, genomic sequence analysis, and so on. The so-called seedCCA is a newly developed R package that implements not only the standard and seeded CCA but also partial least squares. The package enables us to fit CCA to large-$p$ and small-$n$ data. The paper provides a complete guide. Also, the seeded CCA application results are compared with the regularized CCA in the existing R package. It is believed that the package, along with the paper, will contribute to high-dimensional data analysis in various science field practitioners and that the statistical methodologies in multivariate analysis become more fruitful.

Introduction

Explanatory studies are important to identify patterns and special structures in data prior to developing a specific model. When a study between two sets of a $p$-dimensional random variables $\mathbf{X}$ $(\mathbf{X}\in\mathbb{R}^{p})$ and an $r$-dimensional random variable $\mathbf{Y}$ $(\mathbf{Y}\in\mathbb{R}^{r})$, are of primary interest, one of the popular explanatory statistical methods would be canonical correlation analysis (CCA; (Hotelling 1936)). The main goal of CCA is the dimension reduction of two sets of variables by measuring an association between the two sets. For this, pairs of linear combinations of variables are constructed by maximizing the Pearson correlation. The CCA has successful application in many scientific fields such as chemometrics, pattern recognition, genomic sequence analysis, and so on.

In (Lee and Yoo 2014), it is shown that the CCA can be used as a dimension reduction tool for high-dimensional data, but also it is connected to the least square estimator. Therefore, the CCA is not only an explanatory and dimension reduction method but also can be utilized as an alternative to least square estimation.

If $\max(p,r)$ is bigger than or equal to the sample size, $n$, usual CCA application is not plausible due to no incapability of inverting sample covariance matrices. To overcome this, a regularized CCA is developed by (Leurgans, Moyeed, and Silverman 1993), whose idea was firstly suggested in (Vinod 1976). In practice, the CCA package by (González et al. 2008) can implement a version of the regularized CCA. To make the sample covariance matrices saying ${\hat{\boldsymbol{\Sigma}}}_{x}$ and ${\hat{\boldsymbol{\Sigma}}}_{y}$, invertible, in (González et al. 2008), they are replaced with

${\hat{\boldsymbol{\Sigma}}}_{x}^{\lambda_1}= {\hat{\boldsymbol{\Sigma}}}_{x}+\lambda_1 \mathbf{I}_p$ and ${\hat{\boldsymbol{\Sigma}}}_{y}^{\lambda_2}= {\hat{\boldsymbol{\Sigma}}}_{y}+\lambda_1 \mathbf{I}_r$.

The optimal values of $\lambda_1$ and $\lambda_2$ are chosen by maximizing a cross-validation score throughout the two-dimensional grid search. Although it is discussed that a relatively small grid of reasonable values for $\lambda_1$ and $\lambda_2$ can lesson intensive computing in (González et al. 2008), it is still time-consuming as observed in later sections. Additionally, fast regularized CCA and robust CCA via projection-pursuit are recently developed in (Cruz-Cano 2012) and (Alfons, Croux, and Filzmoser 2016), respectively.

Another version of CCA to handle $\max(p,r)>n$ is the so-called seeded canonical correlation analysis proposed by (Im, Gang, and Yoo 2014). Since the seeded CCA does not require any regularization procedure, which is computationally intensive, its implementation to larger data is quite fast. The seeded CCA requires two steps. In the initial step, a set of variables bigger than $n$ is initially reduced based on iterative projections. In the next step, the standard CCA is applied to two sets of variables acquired from the initial step to finalize the CCA of data. Another advantage is that the procedure of the seeded CCA has a close relation with partial least square, which is one of the popular statistical methods for large $p$-small $n$ data. Thus the seed CCA can yield the PLS estimates.

The seedCCA package is recently developed mainly to implement the seeded CCA. However, the package can fit a collection of the statistical methodologies, which are standard canonical correlation and partial least squares with uni/multi-dimensional responses, including the seeded CCA. The package is already uploaded to CRAN (https://cran.r-project.org/web/packages/seedCCA/index.html).

The main goal of the paper is to introduce and illustrate the seedCCA package. Accordingly, three real data are fitted by the standard CCA, the seeded CCA, and partial least square. Two of the three data are available in the package. One of them has been analyzed in (González et al. 2008). So, the implementation results by the seeded and regularized CCA are closely compared.

The organization of the paper is as follows. The collection of three methodologies is discussed in Section 2. The implementation of seedCCA is illustrated, and compared with CCA in Section 3. In Section 4, we summarize the work.

We will use the following notations throughout the rest of the paper. A $p$-dimensional random variable $\mathbf{X}$ will be denoted as $\mathbf{X}\in\mathbb{R}^{p}$. So, $\mathbf{X}\in\mathbb{R}^{p}$ means a random variable, although there is no specific mention. For $\mathbf{X} \in \mathbb{R}^p$ and $\mathbf{Y} \in \mathbb{R}^r$, we define that ${\rm cov}(\mathbf{X})={\boldsymbol \Sigma}_x$, ${\rm cov}({\mathbf{Y}})={\boldsymbol \Sigma}_y$, ${\rm cov}(\mathbf{X,Y})={\boldsymbol \Sigma}_{xy}$ and ${\rm cov}(\mathbf{Y,X})= {\boldsymbol\Sigma}_{yx}$. Moreover, it is assumed that ${\boldsymbol \Sigma}_x$ and ${\boldsymbol \Sigma}_y$ are positive-definite.

Collection of implemented methodologies in seedCCA

Canonical correlation analysis

Suppose the two sets of variable $\mathbf{X} \in \mathbb{R}^p$ and $\mathbf{Y} \in \mathbb{R}^r$ and consider their linear combinations of $U=\mathbf{a}^{{\rm {T}}}\mathbf{X}$ and $V=\mathbf{b}^{{\rm {T}}}\mathbf{Y}$. Then we have ${\rm var}(U)$ = $\mathbf{a}^{{\rm {T}}} {\boldsymbol\Sigma}_x \mathbf{a}$, ${\rm var}(V)=\mathbf{b}^{{\rm {T}}}{\boldsymbol\Sigma}_y \mathbf{b}$, and ${\rm cov}(U,V)=\mathbf{a}^{{\rm {T}}}{\boldsymbol\Sigma}_{xy} \mathbf{b}$, where $\mathbf{a}\in\mathbb{R}^{p \times 1}$ and $\mathbf{b}\in\mathbb{R}^{r \times 1}$. Then Pearson-correlation between $U$ and $V$ is as follows: \[\label{cor1} {\rm cor}(U,V)=\frac{\mathbf{a}^{{\rm {T}}} \boldsymbol{\Sigma}_{xy}\mathbf{b}} {\sqrt{\mathbf{a}^{{\rm {T}}}\boldsymbol{\Sigma}_x\mathbf{a}} \sqrt{\mathbf{b}^{{\rm {T}}}\boldsymbol{\Sigma}_y\mathbf{b}}}. (\#eq:cor1)\] We seek to find $\mathbf{a}$ and $\mathbf{b}$ to maximize ${\rm cor}(U,V)$ by satisfying the following criteria.

The first canonical variate pair ($U_1 = \mathbf{a}_1^{{\rm {T}}}\mathbf{X}, V_1=\mathbf{b}_1^{{\rm {T}}}\mathbf{Y}$) is obtained from maximizing (@ref(eq:cor1)).
The second canonical variate pair $(U_2 = \mathbf{a}_2^{{\rm {T}}}\mathbf{X}, V_2 = \mathbf{b}_2^{{\rm {T}}}\mathbf{Y})$ is constructed from the maximization of (@ref(eq:cor1)) with restriction that ${\rm var}(U_2) = {\rm var}(V_2)=1$ and $(U_1, V_1)$ and $(U_2, V_2)$ are uncorrelated.
At the $k$ step, the $k$th canonical variate pair $(U_k=\mathbf{a}_k^{{\rm {T}}}\mathbf{X}, V_k=\mathbf{b}_k^{{\rm {T}}}\mathbf{Y})$ is obtained from the maximization of (@ref(eq:cor1)) with restriction that ${\rm var}(U_k) = {\rm var}(V_k)=1$ and ($U_k, V_k$) are uncorrelated with the previous $(k-1)$ canonical variate pairs.
Repeat Steps 1 to 3 until $k$ becomes $q$ $(=\min(p,r))$.
Select the first $d$ pairs of $(U_k, V_k)$ to represent the relationship between $\mathbf{X}$ and $\mathbf{Y}$.

Under this criteria, the pairs $(\mathbf{a}_i, \mathbf{b}_i)$ are constructed as follows: $\mathbf{a}_i=\boldsymbol{\Sigma}_x^{-1/2}\psi_i$ and $\mathbf{b}_i=\boldsymbol{\Sigma}_y^{-1/2}\phi_i$ for $i = 1,\ldots,q$, where $(\psi_1,...,\psi_q)$ and $(\phi_1,...,\phi_q)$ are, respectively, the $q$ eigenvectors of $\boldsymbol{\Sigma}_x^{-1/2} \boldsymbol{\Sigma}_{xy} \boldsymbol{\Sigma}_y^{-1} \boldsymbol{\Sigma}_{yx} \boldsymbol{\Sigma}_y^{-1/2}$ and $\boldsymbol{\Sigma}_y^{-1/2} \boldsymbol{\Sigma}_{yx} \boldsymbol{\Sigma}_x^{-1}\boldsymbol{\Sigma}_{xy} \boldsymbol{\Sigma}_x^{-1/2}$ with the corresponding common ordered-eigenvalues of $\rho_1^{*2}\ge\cdots\ge\rho_q^{*2}\ge0$. Then, matrices of $\mathbf{M}_{x} = (\mathbf{a}_1, ... ,\mathbf{a}_d)$ and $\mathbf{M}_{y} = (\mathbf{b}_1, ... ,\mathbf{b}_d)$ are called canonical coefficient matrices for $d=1, ..., q$. Also, $\mathbf{M}_x^{\rm T}\mathbf{X}$ and $\mathbf{M}_y^{\rm T}\mathbf{Y}$ are called canonical variates. In the sample, the population quantities are replaced with their usual moment estimators. For more details regarding this standard CCA, readers may refer to (Johnson and Wichern 2007).

Seeded canonical correlation analysis

Since the standard CCA application requires the inversion of $\hat{\boldsymbol{\Sigma}}_x$ and $\hat{\boldsymbol{\Sigma}}_y$ in practice, it is not plausible for high-dimensional data with $\max(p,r)>n$. In (Im, Gang, and Yoo 2014), a seeded canonical correlation analysis approach is proposed to overcome this deficit. The seeded CCA is a two-step procedure consisting of initialized and finalized steps. In the initialized step, the original two sets of variables are reduced to $m$-dimensional pairs without loss of information on the CCA application. In the initialized step, it is essential to force $m<<n$. In the finalized step, the standard CCA is implemented to the initially-reduced pairs for the repairing and orthonormality. A more detailed discussion on the seeded CCA is as follows in the next subsections.

Development

Define a notation of $\mathcal{S}(\mathbf{M})$ as the subspace spanned by the columns of ${\mathbf{M}}\in{\mathbb{R}}^{p\times r}$ . (Lee and Yoo 2014) show the following relation: \[\label{eq2} \mathcal{S}(\mathbf{M}_x)= \mathcal{S}(\boldsymbol{\Sigma}^{-1}_{x}\boldsymbol{\Sigma}_{xy}) ~{\rm and}~ \mathcal{S}(\mathbf{M}_y)= \mathcal{S}(\boldsymbol{\Sigma}^{-1}_{y}\boldsymbol{\Sigma}_{yx}). (\#eq:eq2)\] The relation in (@ref(eq:eq2)) directly indicates that $\mathbf{M}_x$ and $\mathbf{M}_y$ form basis matrices of $\mathcal{S}(\boldsymbol{\Sigma}^{-1}_{x}\boldsymbol{\Sigma}_{xy})$ and $\mathcal{S}(\boldsymbol{\Sigma}^{-1}_{y}\boldsymbol{\Sigma}_{yx})$ and that $\mathbf{M}_x$ and $\mathbf{M}_y$ can be restored from $\boldsymbol{\Sigma}^{-1}_{x}\boldsymbol{\Sigma}_{xy}$ and $\boldsymbol{\Sigma}^{-1}_{y}\boldsymbol{\Sigma}_{yx}$.

Now, we define the following two matrices: \[\begin{aligned} \nonumber \mathbf{R}_{x,u_1} \in\mathbb{R}^{p\times r u_1} &=&({\boldsymbol{\Sigma}}_{xy}, {\boldsymbol{\Sigma}}_x{\boldsymbol{\Sigma}}_{xy}, \ldots, {\boldsymbol{\Sigma}}_x^{u_1 -1}{\boldsymbol{\Sigma}}_{xy}) ~\text{\textrm{and}} \\ \mathbf{R}_{y,u_2} \in\mathbb{R}^{r\times p u_2} &=& ({\boldsymbol{\Sigma}}_{yx}, {\boldsymbol{\Sigma}}_y{\boldsymbol{\Sigma}}_{yx}, \ldots, {\boldsymbol{\Sigma}}_y^{u_2 -1}{\boldsymbol{\Sigma}}_{yx}). \label{eq3} \end{aligned} (\#eq:eq3)\] In $\mathbf{R}_{x,u_1}$ and $\mathbf{R}_{y,u_2}$, the numbers of $u_1$ and $u_2$ are called termination indexes. They decide the number of projections of ${\boldsymbol{\Sigma}}_{xy}$ and ${\boldsymbol{\Sigma}}_{yx}$ onto ${\boldsymbol{\Sigma}}_x$ and ${\boldsymbol{\Sigma}}_y$, respectively. Also define that \[\begin{aligned} \nonumber{\mathbf{M}}_{x,u_1}^{0} \in \mathbb{R}^{p\times r} &=& \mathbf{R}_{x,u_1} (\mathbf{R}_{x,u_1}^{{\rm {T}}}{\boldsymbol{\Sigma}}_{x} \mathbf{R}_{x,u_1})^{-1}\mathbf{R}_{x,u_1}^{{\rm {T}}} {\boldsymbol{\Sigma}}_{xy} ~\text{\textrm{and}}\\ {\mathbf{M}}_{y,u_2}^{0} \in \mathbb{R}^{r\times p} &=&\mathbf{R}_{y,u_2}(\mathbf{R}_{y,u_2}^{{\rm {T}}} {\boldsymbol{\Sigma}}_{y} \mathbf{R}_{y,u_2})^{-1}\mathbf{R}_{y,u_2}^{{\rm {T}}} {\boldsymbol{\Sigma}}_{yx}. \label{eq4} \end{aligned} (\#eq:eq4)\] In (Cook, Li, and Chiaromonte 2007), it is shown that ${\mathcal{S}}({\mathbf{M}}_{x,u_1}^{0}) = {\mathcal{S}}(\boldsymbol{\Sigma}^{-1}_{x}\boldsymbol{\Sigma}_{xy})$ and ${\mathcal{S}}({\mathbf{M}}_{y,u_2}^{0}) = {\mathcal{S}}(\boldsymbol{\Sigma}^{-1}_{y} \boldsymbol{\Sigma}_{yx})$ in (@ref(eq:eq4)). Hence ${\mathbf{M}}_{x,u_1}^{0}$ and ${\mathbf{M}}_{y,u_2}^{0}$ can be used to infer $\mathbf{M}_x$ and $\mathbf{M}_y$, respectively. One clear advantage to use ${\mathbf{M}}_{x,u_1}^{0}$ and ${\mathbf{M}}_{y,u_2}^{0}$ is no need of the inversion of $\boldsymbol{\Sigma}_x$ and $\boldsymbol{\Sigma}_y$.

Practically, it is important to select proper values for the termination indexes $u_1$ and $u_2$ as they define that $\Delta_{x,u_1}={\mathbf{M}}_{x,u_1 +1}^{0}-{\mathbf{M}}_{x,u_1}^{0}$ and $\Delta_{y,u_2} ={\mathbf{M}}_{y,u_2 +1}^{0}-{\mathbf{M}}_{y,u_2}^{0}$. Finally, the following measure for increment of $u_1$ and $u_2$ is defined: $nF_{x,u_1} = n{\rm trace}({\boldsymbol{\Delta}}_{x,u_1}^{{\rm {T}}} {\boldsymbol{\Sigma}}_x{\boldsymbol{\Delta}}_{x,u_1})$ and $nF_{y,u_2} = n{\rm trace}({\boldsymbol{\Delta}}_{y,u_2}^{{\rm {T}}} {\boldsymbol{\Sigma}}_y{\boldsymbol{\Delta}}_{y,u_2})$. Then, a proper value of $u$ is set to have little changes in $nF_{x,u_1}$ and $nF_{x, u_1 +1}$ and in $nF_{y,u_2 }$ and $nF_{y, u_2 +1}$. It is not necessary that the selected $u_1$ and $u_2$ for ${\mathbf{M}}_{x,u_1}^{0}$ and ${\mathbf{M}}_{y,u_2}^{0}$ are common.

Next, the original two sets of variables of $\mathbf{X}$ and $\mathbf{Y}$ are replaced with ${\mathbf{M}}_{x,u_1}^{0~\rm T}\mathbf{X}\in{\mathbb{R}}^{r}$ and ${\mathbf{M}}_{y,u_2}^{0~{\rm T}}\mathbf{Y}\in{\mathbb{R}}^{p}$. This reduction of $\mathbf{X}$ and $\mathbf{Y}$ does not cause any loss of information on CCA in the sense that ${\mathcal{S}}({\mathbf{M}}_{x,u_1}^{0})={\mathcal{S}}(\mathbf{M}_{x})$ and ${\mathcal{S}}({\mathbf{M}}_{y,u_2}^{0})={\mathcal{S}}(\mathbf{M}_{y})$, and it is called initialized CCA. The initialized CCA has the following two cases.

case 1: Suppose that $\min(p,r)=r << n$. Then, the original $\mathbf{X}$ alone is replaced with ${\mathbf{M}}_{x,u_1}^{0~\rm T}\mathbf{X}$ and the original $\mathbf{Y}$ is kept.
case 2: If $\min(p,r)=r$ is not fairly smaller than $n$, $\boldsymbol{\Sigma}_{xy}$ and $\boldsymbol{\Sigma}_{yx}$ are replaced by their $m$ largest eigenvectors in the construction of $\mathbf{R}_{x,u_1}$, $\mathbf{R}_{y,u_2}$, ${\mathbf{M}}_{x,u_1}^{0}$ and ${\mathbf{M}}_{y,u_2}^{0}$. The following two ways to determine a proper value of $m$ is recommended among many. One is a graphical determination by a scree plot for eigenvalues of ${\boldsymbol{\Sigma}}_{xy}$. The other is the number of eigenvalues whose sum is to cover 90% or above of the total variations of ${\boldsymbol{\Sigma}}_{xy}$.

The primary goal in the initialized step is the reduction of $\mathbf{X}$ and $\mathbf{Y}$ less than $n$ without loss of information on CCA. In case 1, $\mathbf{X}$ and $\mathbf{Y}$ are reduced to $r$-dimensional variates, while they are replaced with the $m$-dimensional sets of variables in case 2. After the initialized step, $r$ and $m$ are fairly smaller than $n$.

The next step is to conduct the standard CCA for ${\mathbf{M}}_{x,u_1}^{0~{\rm T}}\mathbf{X}$ and ${\mathbf{M}}_{y,u_2}^{0~{\rm T}}\mathbf{Y}$ for the repairing and orthonormality. This CCA application is called finalized CCA. Finally, this two-step procedure for CCA is called seeded CCA.

Partial least squares

The main goal of the two CCA methods is dimension reduction based on the joint relation of $\mathbf{X}$ and $\mathbf{Y}$ rather than the conditional relation of $\mathbf{Y}|\mathbf{X}$. For simplicity, in this subsection, $\mathbf{Y}$ with $r=1$ is assumed as a response variable in a regression of $\mathbf{Y}|\mathbf{X}$.

Recall $\mathbf{R}_{x,u_1}$ in (@ref(eq:eq3)) and ${\mathbf{M}}_{x,u_1}^{0}$ in (@ref(eq:eq4)):

$\mathbf{R}_{x,u_1} = ({\boldsymbol{\Sigma}}_{xy}, {\boldsymbol{\Sigma}}_x{\boldsymbol{\Sigma}}_{xy}, {\boldsymbol{\Sigma}}_x^2{\boldsymbol{\Sigma}}_{xy}, \ldots, {\boldsymbol{\Sigma}}_x^{u_1 -1}{\boldsymbol{\Sigma}}_{xy})$ and ${\mathbf{M}}_{x,u_1}^{0} = \mathbf{R}_{x,u_1} (\mathbf{R}_{x,u_1}^{{\rm {T}}}{\boldsymbol{\Sigma}}_{x} \mathbf{R}_{x,u_1})^{-1}\mathbf{R}_{x,u_1}^{\text{\textrm{T}}} {\boldsymbol{\Sigma}}_{xy}.$

According to (Helland 1990), the population partial least square (PLS) with $u$ components on the regression of $\mathbf{Y}|\mathbf{X}$ is as follows: \[\label{helland} \boldsymbol{\beta}_{u_1,{\rm PLS}} = {\mathbf{M}}_{x,u_1}^{0}. (\#eq:helland)\] It is noted that this PLS representation in (@ref(eq:helland)) is equivalent to the canonical matrix for $\mathbf{X}$ via the seeded CCA.

Illustration of seedCCA package

Outline of seedCCA package

The methods discussed in the previous section are implemented through the main function seedCCA. Its arguments are as follows.

seedCCA(X, Y, type="seed2", ux=NULL, uy=NULL, u=10, eps=0.01, cut=0.9, d=NULL, AS=TRUE, scale=FALSE)

The main function seedCCA returns “seedCCA” class and three subclasses depending on the values of type. The values of type and its resulting subclasses are as follows.

type="cca": standard CCA ($\max(p,r)<n$ and $\min(p,r) >1$) / “finalCCA” subclass
type="cca": ordinary least squares ($\max(p,r)<n$ and $\min(p,r)=1$) / “seedols” subclass
type="seed1": seeded CCA with case1 ($\max(p,r)\geq n$) / “finalCCA” subclass
type="seed2": seeded CCA with case2 ($\max(p,r)\geq n$) / “finalCCA” subclass
type="pls": partial least squares ($p \geq n$ and $r <n$) / “seedpls” subclass

The function seedCCA prints out estimated canonical coefficient matrices for all subclasses, and additionally does canonical correlations for “finalCCA” subclasses, although it produces more outputs. For details, the readers are recommended to run ?seedCCA after loading the seedCCA package. It should be noted that the seedCCA package must be loaded before using all functions in the package.: of CCA and corpcor ((Schafer et al. 2017)).

For illustration purpose, three data sets will be considered. Pulp data is used for the standard CCA, which is available from the author’s webpage (http://home.ewha.ac.kr/~yjkstat/pulp.txt). For the seeded CCA, along with the comparison with the regularized CCA and the partial least squares, cookie and nutrimouse in seedCCA package will be illustrated.

Standard CCA: pulp data

Pulp data is measurements of properties of pulp fibers and the paper made from them. It contains two sets of variables with 62 sample sizes. The first set, $\mathbf{Y}$, is for the pulp fiber characteristics, which are arithmetic fiber length, long fiber fraction, fine fiber fraction, and zero spans tensile. The second set, $\mathbf{X}$, is regarding the paper properties such as breaking length, elastic modulus, stress at failure, and burst strength. To implement the standard CCA application, the function seedCCA with type="cca" should be used. In this case, seeCCA results in the “finalCCA” subclass. The function requires two matrix-type arguments, and it returns the following five components of cor, xcoef, ycoef, Xscores and Yscores. The first component is cor is the sample canonical correlations. The next two ones, xcoef, and ycoef, are the estimated canonical matrices for $\mathbf{X}$ and $\mathbf{Y}$. The last two components, which are Xscores and Yscores, are the estimated canonical variates for $\mathbf{X}$ and $\mathbf{Y}$. A command plot(object) constructs a plot of the cumulative correlations against the number of canonical pairs. The plot(object) will provide a 90% reference line as default, and users can change the reference line with plot(object, ref=percent).

## loading pulp data
> pulp <- read.table("http://home.ewha.ac.kr/~yjkstat/pulp.txt", header=TRUE)
> Y <- as.matrix(pulp[,1:4])
> X <- as.matrix(pulp[,5:8])

## standard CCA for X and Y
> fit.cca <- seedCCA(X, Y, type="cca")
NOTE:  The standard CCA is fitted for the two sets.

> names(fit.cca)
[1] "cor"     "xcoef"   "ycoef"   "Xscores" "Yscores"

## plotting cumulative canonical correlation
> par(mfrow=c(1, 2))
> plot(fit.cca, ref=80)
> plot(fit.cca)

## first two canonical pairs
> X.cc <- fit.cca$Xscores[,1:2]
> Y.cc <- fit.cca$Yscores[,1:2]

Figure 1: Cumulative canonical correlation plot in pulp data in Section 3.2

According to Figure 1(a) and (b), with 80% cumulative canonical correlations, two canonical pairs are enough, while three canonical pairs should be good with the default 90%.

Ordinary least squares: pulp data

If the dimension of either $\mathbf{X}$ or $\mathbf{Y}$ is equal to one, the estimated canonical coefficient matrix from the standard CCA is equivalent to that from ordinary least squares. In such case, the command seedCCA(X, Y[,1], type="cca") results in the ordinary least squares estimate, which is “seedols” subclass. The output of "seedols" has three components, which are the estimated coefficients and the two sets of variables. For example, assume that a regression study of arithmetic fiber length, which is the first column of $\mathbf{Y}$, given $\mathbf{X}$ is of specific interest. It should be noted that the order of seedCCA(X, Y[,1], type="cca") and seedCCA(Y[,1], X, type="cca") does not matter, and any of them yields the same results. Also, the commands of coef(object) and fitted(object) return the estimated coefficients and fitted values from the ordinary least squares, respectively.

## extracting arithmatic fiber from Y
> fit.ols <- seedCCA(X, Y[, 1], type="cca")
NOTE:  One of the two sets are 1-dimensional, so a linear regression via ordinary least
square is fitted.

> names(fit.ols)
[1] "coef"  "X"  "Y"

> coef(fit.ols)
> fitted(fit.ols)

Seeded CCA (case 1): cookie data

The biscuit dough data set called cookie in seedCCA comes from the experiment of analyzing the composition of biscuits by NIR spectroscopy. Two sets of variables are obtained from 72 biscuit samples. The first set of variables is wavelengths measured by spectroscopy. In the original data set, wavelengths at 700 different points from 1100 to 2798 nanometers (NM) at the steps of 2nm were measured. However, since some of the figures seemed to contain little information, wavelengths from 1380nm to 2400 at an interval of 4nm were analyzed. The second set of variables is the percentages of four ingredients: biscuits- fat, sucrose, dry flour, and water. Since the 23rd and 61st samples in the data set were believed to be outliers, they were deleted from the data set. The standard CCA is not applicable because of $p=256>n=72$, and case 1 of the seeded CCA should be fitted, considering that $n=72>>r=4$.

The basic command for this is seedCCA(X, Y, type="seed1"), which results in “finalCCA” subclass. Regardless of the order of X and Y, the lower dimensional set alone is reduced in the initial step. Therefore, seedCCA(X, Y, type="seed1") and seedCCA(Y, X, type="seed1") basically produce the same seeded CCA results. For type="seed1", the values of the options of ux, uy, u, eps, and AS affect the implementation, whose defaults are NULL, NULL, 10, 0.01, and TRUE, respectively.

Option u controls the maximum number of projections unless both ux and uy are specified. The option ux=k works only when the dimension of the first set X is bigger than that of the second set Y. Then, the maximum number of projections becomes the value given in ux=k. The option uy works in the opposite way to ux. The options of AS=TRUE and eps control automatic termination of the projections before reaching the maximum given in u, ux, or uy. The projection is terminated if the increment gets less than the value given in eps. Then, the first candidate value, which satisfies the stopping criteria, is suggested as a proper value of projections. If any of ux, uy, and u are specified not enough to guarantee the automatic stopping, a notice is provided to increase it.

After running seedCCA(X, Y, type="seed1"), a plot for the proper selection of u is automatically constructed, and a blue vertical bar in the plot is the suggested value of u.

## loading cookie data
> data(cookie)
> myseq<-seq(141, 651, by=2)
> A <- as.matrix(cookie[-c(23, 61), myseq])
> B <- as.matrix(cookie[-c(23, 61), 701:704])

## seedec CCA with case 1
> fit.seed1.ab <- seedCCA(A, B, type="seed1") ## the first set A has been initial-CCAed.
NOTE:  Seeded CCA with case 1 is fitted. The set with larger dimension is initially reduced.
The first and second sets are denoted as X and Y, respectively.

> fit.seed1.ba <- seedCCA(B, A, type="seed1") ##  the second set A has been initial-CCAed.
NOTE:  Seeded CCA with case 1 is fitted. The set with larger dimension is initially reduced.
The first and second sets are denoted as X and Y, respectively.

> names(fit.seed1.ab)
[1] "cor"  "xcoef"  "ycoef"  "proper.u"  "initialMX0"  "newX" "Y"  "Xscores"  "Yscores"

> names(fit.seed1.ba)
[1] "cor"  "xcoef"  "ycoef"  "proper.u"  "X"  "initialMY0"  "newY"  "Xscores"  "Yscores"

> fit.seed1.ab$xcoef[, 3] <- -fit.seed1.ab$xcoef[, 3] ## changing the sign
> fit.seed1.ab$xcoef[, 4] <- -fit.seed1.ab$xcoef[, 4] ## changing the sign

> all(round(fit.seed1.ab$cor, 5)== round(fit.seed1.ba$cor, 5))
[1] TRUE

> fit.seed1.ab$proper.u
[1] 3

> fit.seed1.ba$proper.u
[1] 3

> all(round(fit.seed1.ab$xcoef, 5) == round(fit.seed1.ba$ycoef, 5))
[1] TRUE

> fit.seed1.ab.ux <- seedCCA(A, B, type="seed1", ux=2)
The maximum number of iterations is reached. So, users must choose u bigger than 2.

> fit.seed1.ab.ux$proper.u
[1] 2

For fit.seed1.ab, the first set A is reduced in the initial step. The output component initialMX0 is the estimate of ${\mathbf{M}}_{x,u_1}^{0}$ and newX is ${\hat{\mathbf{M}}}_{x,u_1}^{0~{\rm T}}\mathbf{X}$. On the contrary, in case of fit.seed1.ba, the second set A is initially reduced, so initialMY0 and newY are produced. So, it is observed that the canonical correlations and suggested values of u from fit.seed1.ab and fit.seed1.ba are equal, not to mention that fit.seed1.ab$xcoef and fit.seed1.ba$ycoef are the same. The selection plot for u is reported in Figure 2, and three projections are suggested. Since ux is not given big enough in seedCCA(A, B, type="seed1", ux=2), the following warning is given:

The maximum number of iterations is reached. So, users must choose u bigger than 2.

Figure 2: Selection plot of u generated from seedCCA(A, B, type="seed1") in Section 3.4

Next, we change the values of ux, uy, AS, and eps. Since the usage of these options for type="seed1" are the same as that for type="seed2" and type="pls". To measure the computing time, the tictoc package ((Izrailev 2014)) is used with Intel(R) Core(TM)i7 2.9GHz and 12GB Ram computer.

> seedCCA(A, B, type="seed1", ux=5)$proper.u
[1] 3

> seedCCA(B, A, type="seed1", eps=0.000001)$proper.u
[1] 4

> library(tictoc)
> tic()
> seedCCA(B, A, type="seed1", u=30)$proper.u
> toc()
0.03 sec elapsed

> tic()
> seedCCA(B, A, type="seed1", u=30, AS=FALSE)$proper.u
> toc()
0.29 sec elapsed

Usage of AS should be noted. With bigger choices of u and AS=FALSE, the running time of the function will be longer.

Seeded CCA (case 2) versus Regularized CCA: nutrimouse data

The nutrimouse data was collected from a nutrition study in 40 mice $(n=40)$. One of two sets of variables was expressions of 120 genes measured in liver cells by microarray technology. The other set of variables was concentrations of 21 hepatic fatty acids(FA) measured through gas chromatography. In addition, the forty mice are cross-classified based on two factors, genotype and diet. There are two genotypes, wild-type (WT) and PPAR$\alpha$ deficient (PPAR$\alpha$) mice and five diets: corn and colza oils (50/50 REF), hydrogenated coconut oil for a saturated FA diet (COC), sunflower oil for $\omega$6 FA-rich diet (SUN), linseed oil for $\omega$3-rich diet (LIN) and corn/colza/enriched fish oils (42.5/42.5/15, FISH). The nutrimouse data is contained in the seedCCA package.

In this data, case 2 of the seeded CCA should be used because $\min(120,21)$ is relatively big compared to $n=40$. Then, case 2 of the seeded CCA requires to choose how many eigenvectors of $\hat{\boldsymbol \Sigma}_{xy}$ should be enough to replace it. This is another tuning parameter for case 2 of the seeded CCA along with the number of projections. The option cut in seedCCA controls automatic selection of the number of eigenvectors of $\hat{\boldsymbol\Sigma}_{xy}$. The option cut=$\alpha$ determines a set of the eigenvectors whose cumulative proportions of their corresponding eigenvalues is bigger than equal to $\alpha$. For the set of eigenvectors to be chosen conservatively, we set the default of cut at 0.9. Also, users can directly give the number of eigenvectors using d. Unless d is NULL, the option cut is discarded. This means that cut works only when d=NULL. If users want to use d, then a function covplot should be run first. The function covplot has the option mind, which set the number of the eigenvalues to show their cumulative percentages. Its default is NULL, and then it becomes $\min(p,r)$. The function returns the eigenvalues, the cumulative percentages, and the number of the eigenvectors to account for 60%, 70%, 80%, and 90% of the total variation along with the scree plot of the eigenvalues.

The results of the seeded and regularized CCAs are compared. Since the regularized CCA is necessary to choose proper values of the two parameters, we compare running times for the automatic searches for the regularized and seeded CCAs via the tictoc package. For seedCCA, we use the default value of cut.

> library(CCA)
> library(tictoc)

## loading nutrimouse data
> data(nutrimouse)
> X <- scale(as.matrix(nutrimouse$gene))
> Y <- scale(as.matrix(nutrimouse$lipid))

## determining the number of the eigenvectors of cov(X,Y) with cut=0.9
> tic("SdCCA")
> fit.seed2 <- seedCCA(X, Y)
> toc()
SdCCA: 0.13 sec elapsed

## finding the optimal values of lambda1 and lambda2 for RCCA
> tic("Regularized CCA")
> res.regul <- estim.regul(X, Y, plt=TRUE, grid1=seq(0.0001, 0.2, l=51), grid2=seq(0, 0.2, l=51))
> toc()
Regularized CCA 819.58 sec elapsed

## scree plot of cov(X, Y)
> names(covplot(X, Y, mind=10))
[1] "eigenvalue"  "cum.percent" "num.evecs"

> names(fit.seed2)
[1] "cor"  "xcoef"  "ycoef"  "proper.ux"  "proper.uy"  "d"  "initialMX0"  "initialMY0"
[9] "newX"  "newY"  "Xscores"    "Yscores"

> fit.seed2$d
[1] 3

Figure 3: Scree plots for the selection of sets of eigenvectors to replace $\rm{cov}(\mathbf{X}, \mathbf{Y})$ generated from covplot(X, Y, mind=10) in Section 3.5

Since type="seed2" reduces the dimensions of X and Y at the initialized CCA step, the output components of initialMX0, initialMY0, newX and newY and d are reported.

The plot generated from covplot(X, Y, mind=10) is given in Figure 3. According to Figure 3, the first two, three, and four eigenvalues account for 79.6%, 91.8%, and 95.9% of the total variation of $\hat{\boldsymbol{\Sigma}}_{xy}$, respectively. Using 90% conservative guideline, it is determined that the first three largest eigenvectors replace $\hat{\boldsymbol{\Sigma}}_{xy}$ well enough.

The selection plot of ux and uy is given in Figure 4. The figure suggests that $u_x$ and $u_y$ are equal to 7 and 6, respectively.

Figure 4: Selection plot of ux and uy generated from seedCCA(X, Y) in Section 3.5

Now we compare the parameter selection time. For the regularized CCA, it can be done with estim.regul, and users must provide a small enough range for them to reduce the computing time. The resulted optimal $\lambda_1$ and $\lambda_2$ are 0.168016 and 0.004, respectively. With Intel(R) Core(TM)i7 2.9GHz and 12GB Ram, the seeded CCA took 0.32 seconds, while 819.58 seconds, around 13.5 minutes, lapsed for the regularized CCA. This difference is really huge, so time consumed in the selection of ux, uy and, d is trivially small compared to the regularized CCA. This is a clear desirable aspect and advantage of the seeded CCA over the regularized one.

Next, we compare the first two pairs of estimated canonical variates. The results shown in Figures 5–6 are equivalent to the analysis discussed in (González et al. 2008).

## Extracting the first two pairs of canonical variates
> sx1 <- fit.seed2$Xscores[, 1]
> sx2 <- fit.seed2$Xscores[, 2]
> sy1 <- fit.seed2$Yscores[, 1]
> sy2 <- fit.seed2$Yscores[, 2]

## fitting the regularized CCA
> res.rcc <- rcc(X, Y, 0.168016, 0.004)
> RCCA.X <- X%*%res.rcc$xcoef
> RCCA.Y <- Y%*%res.rcc$ycoef
> rx1 <- RCCA.X[,1]
> rx2 <- RCCA.X[,2]
> ry1 <- RCCA.Y[,1]
> ry2 <- RCCA.Y[,2]

par(mfrow=c(1,2))
> with(plot(rx1, ry1, col=c(2,4)[genotype], pch=c(1,2)[genotype],
+ main="1st pair from RCCA", xlab="rx1", ylab="ry1"), data=nutrimouse)
> with(legend(-1.4, 1.4, legend=levels(genotype), col=c(2,4), pch=c(1,2), cex=1.5),
+ data=nutrimouse)
> with(plot(-sx1, -sy1, col=c(2,4)[genotype], pch=c(1,2)[genotype],
+ main="1st pair from seedCCA", xlab="sx1", ylab="sy1"), data=nutrimouse)
> with(legend(-1.5, 1.6, legend=levels(genotype), col=c(2,4), pch=c(1,2), cex=1.5),
+ data=nutrimouse)

> par(mfrow=c(1,2))
> with(plot(rx2, ry2, col=c(1:4,6)[diet], pch=c(15,16,17,18,20)[diet], cex=1.5,
+ main="2nd pair from RCCA", xlab="rx2", ylab="ry2"), data=nutrimouse)
> with(legend(-2.3, 1.9, legend=levels(diet), col=c(1:4,6), pch=c(15:18,20)),
+ data=nutrimouse)
> with(plot(sx2, sy2, col=c(1:4,6)[diet], pch=c(15,16,17,18,20)[diet], cex=1.5,
+ main="2nd pair from seedCCA", xlab="sx2", ylab="sy2"), data=nutrimouse)
with(legend(-2.5, 1.9, legend=levels(diet), col=c(1:4,6), pch=c(15:18,20)),
+ data=nutrimouse)

Figure 5: the first pair of canonical variates from regularized CCA and seeded CCA marked with genotype in Section 3.5

Figure 6: the second pair of canonical variates from regularized CCA and seeded CCA marked with diet in Section 3.5

According to Figures 5–6, the first pair of canonical variates from both CCAs distinguish genotype very well, but their second pairs marked with diet are quite complex. To have more insight into the results for the second pair on a diet, multivariate analysis of variance is fitted. Further pairwise comparison is done via lsmeans ((Lenth 2016)) with level 5% and $p$-values adjusted by false discovery rate (Benjamini and Hochberg 1995).

> library(lsmeans)
> fit2r <- manova(cbind(rx2, ry2)~diet, data=nutrimouse)
> fit3sd <- manova(cbind(sx2, sy2)~diet, data=nutrimouse)
> test( contrast( lsmeans(fit2r, "diet"), "pairwise"), side = "=",  adjust = "fdr")
 contrast     estimate        SE df t.ratio p.value
 coc - fish -2.3842686 0.2684019 35  -8.883  <.0001
 coc - lin  -2.1749708 0.2684019 35  -8.103  <.0001
 coc - ref  -1.4881111 0.2684019 35  -5.544  <.0001
 coc - sun  -1.6582635 0.2684019 35  -6.178  <.0001
 fish - lin  0.2092978 0.2684019 35   0.780  0.4897
 fish - ref  0.8961575 0.2684019 35   3.339  0.0040
 fish - sun  0.7260051 0.2684019 35   2.705  0.0175
 lin - ref   0.6868597 0.2684019 35   2.559  0.0214
 lin - sun   0.5167073 0.2684019 35   1.925  0.0780
 ref - sun  -0.1701524 0.2684019 35  -0.634  0.5302

Results are averaged over the levels of: rep.meas
P value adjustment: fdr method for 10 tests

> test( contrast (lsmeans(fit3sd, "diet"), "pairwise"), side = "=",  adjust = "fdr")
 contrast      estimate        SE df t.ratio p.value
 contrast      estimate        SE df t.ratio p.value
 coc - fish -2.14838660 0.3163067 35  -6.792  <.0001
 coc - lin  -1.79396325 0.3163067 35  -5.672  <.0001
 coc - ref  -1.07697196 0.3163067 35  -3.405  0.0035
 coc - sun  -1.03303440 0.3163067 35  -3.266  0.0041
 fish - lin  0.35442334 0.3163067 35   1.121  0.3001
 fish - ref  1.07141463 0.3163067 35   3.387  0.0035
 fish - sun  1.11535219 0.3163067 35   3.526  0.0035
 lin - ref   0.71699129 0.3163067 35   2.267  0.0371
 lin - sun   0.76092885 0.3163067 35   2.406  0.0308
 ref - sun   0.04393756 0.3163067 35   0.139  0.8903

Results are averaged over the levels of: rep.meas
P value adjustment: fdr method for 10 tests

For the regularized CCA, the “coc” diet is different from the others. Moreover “fish” differs from “sun”. However, the other pairwise comparisons are quite mixed. It is determined that there are no significant differences between “fish-lin”, “lin-sun”, and “ref-sun”. On the contrary, reasonable pairwise comparison results come from the seeded CCA. Like the others, the “coc” diet is different from the others. Furthermore, “fish-lin” is not significantly different, and “ref-sun” is concluded to be similar. Fish oil is known to contain $\omega 3$, and linseed oil is designed for it. Therefore, this conclusion would be reasonable. Also, the reference oil diet consists of corn and colza oil, which is known to contain $\omega 6$. Since sun-flower oil is, indeed, for $\omega 6$-rich diet, this result is also reasonable. In this regard, the seeded CCA results would be preferable to the regularized CCA.

Partial least square application with nutrimouse data

With the nutrimouse data, consider a regression of the first one, ” C14.0” in concentrations of 21 hepatic fatty acids given expressions of 120 genes measured in liver cells. In this case, partial least squares is a front-runner choice. Then, to obtain the partial least square estimator in seedCCA, one needs to implement seedCCA(X, Y, type="pls"). This results in “seedpls” subclass. An important matter in partial least squares is that the first set of the variable must be predictors. The response variable can be either univariate or multivariate. Option u is recommended to set reasonably small because the estimated coefficients are reported up to the value given in u. If scale=TRUE, the predictors are standardized to have zero sample means and the sample correlation matrix.

The estimated coefficients and fitted values by partial least square can be obtained via coef(object, u=NULL) and fitted(object, u=NULL). The default of u in both coef and fitt is NULL. In both functions, usage of u is equivalent. If u=k is specified, only the estimated coefficients and fitted values computed from k projections are reported. All of the coefficient estimates and fitted values are reported up to u, if u=NULL.

For type="pls", the automatic procedure to suggest a proper value of projections is not conducted. For the “seedpls” subclass, plot(object) suggests a proper value of projections along with other output components. If the terminating condition is not satisfied before reaching the value of u, then plot(object) provides a caution to increase the value of u.

> data(nutrimouse)
> Y <- as.matrix(nutrimouse$lipid)
> X <- as.matrix(nutrimouse$gene)
> Y1 <- as.matrix(Y[, 1])  ## univariate response
> Y12 <- as.matrix(Y[, 1:2])  ## multivariate response

## fitting partial least square and obtaining the estimated coefficient vector
> fit.pls1.10 <- seedCCA(X, Y1, u=10, type="pls")
> fit.pls1.3 <- seedCCA(X, Y1, u=3, type="pls", scale=TRUE)

> names(fit.pls1.10)
[1] "coef"  "u"     "X"     "Y"     "scale"

> names(fit.pls1.10$coef)
[1] u=1"  "u=2"  "u=3"  "u=4"  "u=5"  "u=6"  "u=7"  "u=8"  "u=9"  "u=10"

> names(fit.pls1.3$coef)
[1] "u=1" "u=2" "u=3"

> fit.pls1.3$scale
[1] TRUE

> par(mfrow=c(1,2))
> plot(fit.pls1.10)
$proper.u
[1] 6
$nFu
[1] 6.344725e+00 2.383108e+00 1.681329e+00 2.669394e+00 1.853061e+00 3.217472e-04 5.296046e-05
[8] 6.017641e-06 4.895905e-07 3.117371e-08
$u
[1] 10
$eps
[1] 0.01
> title("fit.pls1.10")

> plot(fit.pls1.3)
Caution: The terminating condition is NOT satisfied. The number of projections should be bigger than  3.
$proper.u
[1] 3
$nFu
[1] 6.344725 2.383108 1.681329
$u
[1] 3
$eps
[1] 0.01
> title("fit.pls1.3")

> names(fitted(fit.pls1.10))
[1] "u=1"  "u=2"  "u=3"  "u=4"  "u=5"  "u=6"  "u=7"  "u=8"  "u=9"  "u=10"

> fitted(fit.pls1.10, u=6)
     1      2      3      4      5      6      7      8      9     10     11     12     13     14
 0.137  0.368  0.317  0.346  0.492  1.620  0.722  0.003  0.065  1.212  0.458  0.640  0.272  0.397
    15     16     17     18     19     20     21     22     23     24     25     26     27     28
-0.103  0.426  1.448  0.287  1.264  0.517  2.803  0.914  0.043  0.028  0.234  0.598  0.875  0.434
    29     30     31     32     33     34     35     36     37     38     39     40
 0.694  0.666  2.958  2.350  0.620  0.958  0.495  2.790  0.701  0.168  0.767  0.535

> fit.pls.m <- seedCCA(X, Y12, u=5, type="pls")
> dim(fit.pls.m$coef$'u=1')
[1] 120   2

Figure 7: Selection plot of u generated from seedCCA(X, Y1, u=10, type="pls")(left) and seedCCA(X, Y1, u=3, type="pls", scale=TRUE)(right) in Section 3.6

The selection of projections for two partial least squares by seedCCA(X, Y1, u=10, type="pls") and seedCCA(X, Y1, u=3, type="pls", scale=TRUE) is given in Figure 7. According to Figure 7, the proper value of projection is suggested at 6 for fit.pls1.10 object, while the termination condition is not satisfied for fit.pls1.3 object, so a caution statement is given.

Discussion

When a study between two sets of variables, saying $(\mathbf{X}\in\mathbb{R}^{p}, \mathbf{Y}\in\mathbb{R}^{r})$, is of primary interest, canonical correlation analysis (CCA; (Hotelling 1936)) is still popularly used in explanatory studies. The CCA has successful application in many science fields such as chemometrics, pattern recognition, genomic sequence analysis, and so on.

The recently developed seedCCA package implements a collection of CCA methodologies including the standard CCA application, seeded CCA, and partial least squares. The package enables us to fit CCA to large-$p$ and small-$n$ data. The paper provides a complete guide for the package to implement all the methods, along with three real data examples. Also, the seeded CCA application results are compared with the regularized CCA in the existing CCA package.

It is believed that the package, along with the paper, will contribute to high-dimensional data analysis in various scientific field practitioners and that the statistical methodologies in multivariate analysis become more fruitful.

Acknowledgments

For the corresponding author Jae Keun Yoo, this work was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Korean Ministry of Education (NRF-2019R1F1A1050715). For Bo-Young Kim, this work was supported by the BK21 Plus Project through the National Research Foundation of Korea (NRF) funded by the Korean Ministry of Education (22A20130011003).

SEEDCCA: An Integrated R-Package for Canonical Correlation Analysis and Partial Least Squares

Bo-Young Kim

Yunju Im

Jae Keun Yoo

2021-06-07