Abstract
Multidimensional scaling (MDS), hierarchical cluster analysis (HCA), and discriminant analysis (DA) are classical techniques which deal with data made of \(n\) individuals and \(p\) variables. When the individuals are divided into \(T\) groups, the R package dad associates with each group a multivariate probability density function and then carries out these techniques on the densities, which are estimated by the data under consideration. These techniques are based on distance measures between densities: chi-square, Hellinger, Jeffreys, Jensen-Shannon, and \(L^p\) for discrete densities, Hellinger , Jeffreys, \(L^2\), and 2-Wasserstein for Gaussian densities, and \(L^2\) for numeric non-Gaussian densities estimated by the Gaussian kernel method. Practical methods help the user to give meaning to the outputs in the context of MDS and HCA and to look for an optimal prediction in the context of DA based on the one-leave-out misclassification ratio. Some functions for data management or basic statistics calculations on groups are annexed.Techniques such as multidimensional scaling, hierarchical cluster analysis, and discriminant analysis are often used for multivariate data analysis to visualize, organize observations into groups, and model class structure, respectively.
These techniques deal with data of the “individuals \(\times\) variables” type (Mardia, Kent, and Bibby 1979; Krzanowski 1988),
commonly stored in objects of class data frame. These
techniques are available in the R packages stats, MASS (Venables and Ripley 2002), ade4 (Dray, Dufour, and Chessel 2007), FactoMineR
(Lê, Josse, and Husson 2008), cluster
(Maechler et al. 2019).
In the case where the individuals are organized into occasions or groups, the analyst could be interested in taking into account this data organization by associating with each occasion a mathematical object and performing multivariate techniques on these objects. In the dad package (Boumaza et al. 2021), devoted to such data, the objects are probability density functions. These densities are either all continuous (numeric data with Lebesgue measure as reference measure) or all discrete (categorical data with counting measure as reference measure) and are subjected to the following analyses:
These three multivariate techniques constitute the core of this work. Theoretically, the dad package handles probability density functions and considers multi-group data as samples allowing their estimation. The densities could be considered as functional data processed by packages like fda (Ramsay, Graves, and Hooker 2020), fda.usc (Febrero-Bande and Fuente 2012), or fdadensity (Petersen, Hadjipantelis, and Mueller 2019), or as compositional data processed by packages like compositions (Tsagris and Athineou 2020), Compositional (van den Boogaart, Tolosana-Delgado, and Bren 2020), or robCompositions (Filzmoser, Hron, and Templ 2018). The differences or similarities with these approaches are the subjects of part of the discussion of the manuscript (Discussion and concluding remarks Section).
These three multivariate techniques are essentially based on distance indices between probability density functions. Literature abounds with such indices: as an example, the encyclopedia of distances of Deza and Deza (2013, 235–45) lists some forty. The dad package proposes to calculate ten of them among the most common by considering the case of discrete densities and that of continuous densities. The results returned by the three previous multivariate techniques depend on the distance index used. This is illustrated by simple examples in the context of HCA (Appendix A) or DA (Appendix B), and criteria of distance choice are proposed in the Practical advice Section.
Thus, for each distance index, the dad package implements:
In order to avoid unnecessary redundancies in the entry of data
characterizing the groups as these characteristics are the same for all
the individuals of the same group, we considered it useful to organize
the data in a list of data frames (object of class
folderh). Also, in order to calculate some statistics
(means, covariance matrices...) per group or distances between each pair
of groups, we considered it useful to store the individuals of each
group in one data frame and these data frames in a list (object of class
folder). Appendix C details the rationale for introducing
these object classes. The functions of the package dad
implementing the three main techniques (MDS, HCA, DA) apply to such
objects. The functions of data management or elementary calculation
(means, variance or correlation matrices, moments) applying to these
objects could have been made invisible in the package without affecting
the presentation of the main techniques. However, to facilitate the work
of the analyst interested in processing multi-group data or in
experimenting with other multivariate techniques on such data, we have
kept them visible. Some of them are presented in Appendix C and
Appendix D.
So, the presentation of the dad package will begin with a description of the data considered in the previous techniques (Multi-group data: examples and organization Section). We will then present functions for the calculation of the distance or divergence measures between discrete densities and between Gaussian densities. The special case of non-Gaussian densities estimated with the Gaussian kernel method is also considered (Distance / divergence between densities Section). Then, we will present the functions implementing the three techniques introduced above for the processing of multi-group data: multidimensional scaling (MDS of densities Section), hierarchical cluster analysis (HCA of densities Section), and discriminant analysis (DA of densities Section). Finally, we will give some practical advice (Practical advice Section) and briefly highlight some similarities with functions of other R packages (Discussion and concluding remarks Section). A summary and appendices complete this presentation.
For MDS and HCA, the data \(\mathbf{X}\) (Table 1a) of interest have three kinds of objects: occasions \(\times\) individuals \(\times\) variables. The occasions define a partition of the individuals on which the variables are measured. If \(T\) denotes the number of occasions, for each \(t\) in \(\left\{1,\ldots,T\right\}\), the rows of the table \(\mathbf{X}_t\) correspond to \(n_t\) observations \(\mathbf{x}_{t1}\,,\ldots,\,\mathbf{x}_{tn_t}\) of \(X_t\) a random vector with \(p\) components.
For DA, the data of interest are similar to the previous ones with the difference that we have two categories of occasions. The first category consisting of \(T\) occasions is partitioned into \(K\) subsets deriving from a factor \(G\) defined on occasions (Table 2). The second category consists of occasions, numbered \(T+1,\ \ldots\) for which we have data of type \(\mathbf{X}\) but not the value of \(G\).
The data of the following examples are available in the dad
package as lists of data frames or arrays, and are loaded by means of
the usual data function.
Archaeological data: the data are stored in
castles.dated, a list of two data frames whose description
is detailed in the subsection Introductory example of
Appendix C. The data frame castles.dated$stones in which
for each of \(T=68\) Alsatian castles
(occasions), \(p=4\) numerical
characteristics are measured on a batch of stones (individuals) used to
build the castle. The objective is to:
The \(68\) castles are dated from
the period 1140-1650, which is divided into six intervals, numbered 1 to
6, separated by the following cutoff points: 1175, 1210, 1245, 1280,
1350. The building periods are available in the data frame
castles.dated$periods. The first objective is to analyse
the relations between these periods and the previous visualization of
the castles. The second objective is to predict the building period of
\(67\) non-dated castles the data of
which are in the data frame castles.nondated$stones.
This archaeological example (Rudrauf and Boumaza
2001) was at the origin of the multi-group techniques presented
in this work. It illustrates well the MDS and DA techniques even if, as
we will see when processing the data, from the archeology point of view,
the results are not very satisfactory and have only an indicative
value.
Agronomic data: the data are stored in the folderh
object varietyleaves consisting of two data frames
variety and leaves, and a key
rose that is the column name common to the two data frames
which connects them. The first data frame
varietyleaves$leaves is made up of \(581\) rows (leaves of \(T = 31\) rosebushes) and 5 columns
corresponding to the number of the rosebush to which the leaf belongs
and \(p = 4\) numerical
characteristics: number of its leaflets, length of its rachis, then the
length, and width of its main leaflet. The second data frame
varietyleaves$variety is made up of \(T\) rows and two columns: the number of the
rosebush and its variety. There are \(K =
6\) varieties (Table 3). This example
will illustrate the DA technique. The objective of which is to predict
the variety of one plant from measures of several leaves of the
plant.
| Variety | Number | Number | Number of leaves |
| of plants | of leaves | per plant | |
| Canary | 6 | 65 | \(=\ \ 8+14+11+16+\ 8+\ 8\) |
| Electron | 3 | 42 | \(=\ \ 9+\ 8+25\) |
| Lili Marleen | 6 | 65 | \(=\ 15+10+\ 5+\ 7+15+13\) |
| Pussta | 6 | 200 | \(=\ 35+47+29+20+37+32\) |
| Starina | 5 | 105 | \(=\ 18+17+19+33+18\) |
| White Meillandina | 5 | 104 | \(=\ 23+19+16+22+24\) |
Sensory data: the data frame roses in which each of
\(T=10\) photographs of rosebushes
(occasions) was evaluated 3 times, by 14 assessors, for \(p=16\) numerical characteristics, giving a
table with \(16\) columns and \(42\) rows per rosebush, for a total of
\(420\) rows. In this case, an
individual is a couple (assessor, evaluation session). The objective is
to visualize the roses and then create clusters of roses that are as
similar as possible. A part of these data will illustrate the techniques
MDS and HCA.
Surveys – census data over years – : from each census conducted
in France (INSEE 2018) during \(T = 7\) different years (1968, 1975, 1982,
1990, 2010, and 2015), we extract the population of active individuals
aged 25 to 54 and \(p = 2\) categorical
variables: diploma (4 levels) and socio-professional group (6 levels).
From the initial data collected by INSEE, we build a list of \(T = 7\) arrays of dimension \((4,6)\), which is stored in
dspg object. The objective is to visualize the years and to
highlight, when they exist, the temporal evolutions of the frequencies.
The MDS technique seems suitable to achieve this objective.
Surveys – census data 2015 by department – : from the census of
the year 2015 (INSEE 2018), we consider
the same kind of data and organize them in a list of \(T = 96\) arrays, which correspond to the 96
departments of metropolitan France. This list is available in the
dspgd2015 object. As for the sensory data, the objective is
to visualize the departments and then create clusters of departments
that are as similar as possible. In order to give meaning to the
clusters, a common and advisable practice is to couple the techniques
HCA and MDS.
In the last two examples, the densities are discrete and are given
only for illustration. Their detailed presentation is in the
dad vignette mds-discrete-distributions.
The previously collected data can be organized into:
S3 class named folder (see Appendix C).To carry out discriminant analysis, the training step requires an a
priori division of the occasions into clusters, that is, a factor \(G\) with \(K\) levels defined from the occasion set
(Table 2). The predicting step is to
assign a level for each occasion whose value of the factor \(G\) is not available. The data, therefore,
consist of two data frames linked by a hierarchical relationship “1 to
N”. Each row of the data frame in Table 2
is, thus, matched to several rows of the data frame in Table 1a. The list of these two data frames is an
object of S3 class folderh (hierarchical
folder) built by the function folderh (see Appendix C).
Notice that in the presentation of the data tables, we arranged them so that the individuals (rows of Table 1a) of the same occasion are neighbors and so that the occasions (rows of Table 2) of the same class are neighbors. However, in the dad package, such a layout is obviously not necessary. Only the factors \(Group\) (Table 1a) and \(G\) (Table 2) must be given.
In the first subsection, we present the indices which operate on discrete densities. In the second subsection, we consider the indices which operate on Gaussian densities and have an analytical expression depending on the parameters of the densities: means, variances, and covariances. Therefore, these indices can easily be estimated from the parameter estimates. In the third subsection, we present an index based on the estimation of densities on \(\mathbb{R}^p\) by the Gaussian kernel method. This index offers the advantage of being easily calculable even for non-Gaussian continuous densities.
The main techniques of the dad package depend on the distance index used (Appendix A and B). A brief practical guidance on the choice of distance index is provided in Practical advice Section.
| Name | Expression |
| Symmetric chi-square | \(\sum\limits_{x}{(p_1(x)-p_2(x))^2 / (p_1(x) + p_2(x))}\) |
| Hellinger | \(\left( 2 \sum\limits_{x}{(\sqrt{p_1(x)}-\sqrt{p_2(x)}\;)^2} \right)^{\frac{1}{2}}\) |
| Jeffreys | \(\sum\limits_{x}{ (p_1(x)-p_2(x)) \ \ln (p_1(x)/p_2(x)) }\) |
| Jensen-Shannon | \(\sum\limits_{x} \left( \ p_1(x) \ \ln (2 p_1(x)/ (p_1(x)+p_2(x))) \right.\) |
| \(\left. + \ p_2(x) \ \ln (2 p_2(x)/ (p_1(x)+p_2(x)))\ \right)\) | |
| Lp | \(\left( \sum\limits_{x}{|p_1(x)-p_2(x)|^p} \right)^{\frac{1}{p}}\) |
Table 4 lists the expressions of distance indices of two discrete densities and the dad functions associated with them. The set of the states of these densities can be either the set of the levels of one categorical variable or the Cartesian product of the \(q\) sets of the levels of \(q\) categorical variables as in the following example with \(q=2\).
> x1 <- data.frame(x = factor(c("A", "A", "A", "B", "B", "B")),
+ y = factor(c("a", "a", "a", "b", "b", "b")))
> x2 <- data.frame(x = factor(c("A", "A", "A", "B", "B")),
+ y = factor(c("a", "a", "b", "a", "b")))
> p1 <- table(x1)/nrow(x1)
> p1
y
x a b
A 0.5 0.0
B 0.0 0.5
> p2 <- table(x2)/nrow(x2)
> p2
y
x a b
A 0.4 0.2
B 0.2 0.2
The \(L1\) distance and Jeffreys divergence between the densities \(p1\) and \(p2\) are computed as follows.
> ddlppar(p1, p2)
[1] 0.8
> ddjeffreyspar(p1, p2)
[1] Inf
The Jensen-Shannon index is equal to the entropy or average quantity of information of the distribution \((p_1 + p_2) / 2\), from which we subtract the sum of the entropies of \(p_1\) and \(p_2\). The other indices are based on the sum of the differences, possibly weighted, between \(p_1\) and \(p_2\) in each state \(x\), unlike the Jensen-Shannon index, which is somewhat more global. These indices are compared in the subsection Simulated discrete data of Appendix B.
Table 5 lists the parametric expressions of distance indices of multivariate densities and the dad functions associated with them. For the univariate case, the expressions are easily deduced.
| Name | Expression |
| Hellinger \(^{\textrm{(a)}}\) | \(\left(2 - 2^{\frac{p}{2}+1} \det(\Sigma V)^{\frac{1}{4}} \det(\Sigma + V)^{-\frac{1}{2}} \exp(-\frac{1}{4} \|\mu-m\|^2_{(\Sigma+V)^{-1}} )\right)^{\frac{1}{2}}\) |
| Jeffreys\(^{\textrm{(b)}}\) | \(2^{-1}\, \|\mu-m\|^2_{\Sigma^{-1}+V^{-1}} + 2^{-1}\, \textrm{tr}((\Sigma-V)(V^{-1}-\Sigma^{-1}))\) |
| L2 \(^{\textrm{(c)}}\) | \(\left( (2\pi)^{-\frac{p}{2}} \det(2\Sigma)^{-\frac{1}{2}} + (2\pi)^{-\frac{p}{2}} \det(2V)^{-\frac{1}{2}} \right.\) |
| \(\left. - 2\,(2\pi)^{-\frac{p}{2}} \det(\Sigma + V)^{-\frac{1}{2}} \exp(-\frac{1}{2} \|\mu-m\|^2_{(\Sigma + V)^{-1}} ) \right)^{\frac{1}{2}}\) | |
| L2N \(^{\textrm{(d)}}\) | \(\left(2 - 2^{\frac{p}{2}+1} \det(\Sigma V)^{\frac{1}{4}} \det(\Sigma + V)^{-\frac{1}{2}} \exp(-\frac{1}{2} \|\mu-m\|^2_{(\Sigma+V)^{-1}} )\right)^{\frac{1}{2}}\) |
| 2-Wasserstein \(^{\textrm{(e)}}\) | \(\left(\|\mu-m\|^2_{I_p} + \textrm{tr}(\Sigma + V - 2(V^{\frac{1}{2}} \Sigma V^{\frac{1}{2}})^{\frac{1}{2}})\right)^{\frac{1}{2}}\) |
For example, the Jeffreys divergence is respectively carried out with
the jeffreyspar or jeffreys functions
depending on whether the calculations are respectively based on
parameters or samples.
> m1 <- c(1,1)
> v1 <- matrix(c(4,1,1,9),ncol = 2)
> m2 <- c(0,1)
> v2 <- matrix(c(1,0,0,1),ncol = 2)
> jeffreyspar(m1,v1,m2,v2)
[1] 5.314286
> library(MASS)
> set.seed(100)
> x1 <- mvrnorm(40, m1, v1)
> x2 <- mvrnorm(30, m2, v2)
> jeffreys(x1, x2)
[1] 6.780999
All these indices are based on a combination of the difference between the means \(\mu\) and \(m\) and the dissimilarity between the covariance matrices \(\Sigma\) and \(V\). In the case of equal means, the distance index between Gaussian densities reduces to a dissimilarity between covariance matrices. If the covariance matrices are equal, all these indices reduce to an index of the distance between these means, this index being dependent on the common variance matrix, with the exception of the Wasserstein index, which uses the identity matrix. These indices are also compared in the subsection Simulated Gaussian data of Appendix B.
Except for the \(L^2\) distances, the extension of the other distance indices of Table 5 to any estimated densities still requires a lot of work. Indeed, we experimented with calculating distance indices using numerical integration methods but computation times were so long in the multidimensional case that we did not implement them in the dad package. The solution we recommend is to estimate the densities by the Gaussian kernel method and use the \(L^2\) distances.
Density estimation with the Gaussian kernel method. The probability densities \(f_t\) are estimated by the Gaussian kernel method: \[\label{densiteng} \hat{f}_t(\mathbf{z})=\frac{1}{n_t|\mathbf{h}_t|^{1/2}}\frac{1}{(2\pi)^{p/2}}\sum_{i=1}^{n_t}\exp(-\frac{1}{2}(\mathbf{z}-\mathbf{x}_{ti})^\top \mathbf{h}_t^{-1}(\mathbf{z}-\mathbf{x}_{ti}))\text{,} (\#eq:densiteng)\] where \(\mathbf{h}_t\) is the non-singular bandwidth matrix and \(|\mathbf{h}_t|\) its determinant. This matrix may be provided by the user, or calculated directly according to the AMISE criterion, with reference to the normal distribution (Wand and Jones 1995), that is: \[\label{bandwidth} \mathbf{h}_t=h_t\widehat{\mathbf{V}}_t^{1/2}\text{,} (\#eq:bandwidth)\] with: \[\label{windowh_normal_reference} h_t=\Big(\frac{4}{n_t(p+2)}\Big)^{\frac{1}{p+4}}. (\#eq:windowh-normal-reference)\]
Calculation of \(L^2\)
distances between estimated non-Gaussian densities. The
calculation of the inner product is carried out with the
l2d function using a sample per density: x1
and x2; the result derives from the estimation
(@ref(eq:densiteng)) of the densities, the bilinearity of the inner
product, and a formula of integral calculus (Wand
and Jones 1995, 101). Then, the \(L^2\)-distance is directly deduced. This
calculation is carried out with the distl2d function.
> set.seed(40)
> x1 <- c(rnorm(5, mean = 0, sd = 1), rnorm(5, mean = 1, sd = 2))
> x2 <- c(rnorm(10, mean = 2, sd = 3), rnorm(5, mean = 0, sd = 2))
> distl2d(x1, x2, method = "kern")
```
``` r
[1] 0.2562896The normalized \(L^2\) distance between densities is also possible. Although it has the disadvantage of being time-consuming, it has some similarities to Hellinger’s distance, and in the Gaussian case, the two distances have almost the same expression (Table 5).
Since the function of dad package implementing
multidimensional scaling of probability density functions is a direct
application of the function cmdscale of R, it is briefly
recalled in the case of continuous densities. The mathematical aspects
of this method have been dealt with in several works (Delicado (2011) as MDS; Boumaza (1998), Kneip and
Utikal (2001), and Yousfi et al.
(2015) as functional PCA). In this section, we will privilege the
presentation in MDS form, which offers greater flexibility in the choice
of the distance between densities, while taking inspiration from the
method of interpretation of the results of PCA developed in Boumaza, Yousfi, and Demotes-Mainard (2015) in
order to interpret the scores resulting from MDS.
Given \(T\) densities and \((\delta_{ts})_{1\leq t,s\leq T}\) the
distances/divergences between each pair of them, the MDS technique looks
for a representation of the densities by \(T\) points in a low dimensional space such
that the distances between these points are as similar as possible to
the \((\delta_{ts})\) (Cox and Cox 2001). In R, this multidimensional
positioning technique is performed by the cmdscale
function, whose main argument is the symmetric matrix of distances and
the main output is the matrix of coordinates.
If the densities are assumed to be Gaussian, we can use the Hellinger distance, the Jeffreys divergence, the 2-Wasserstein distance, or the \(L^2\) distance (Table 5). If they are not expected to be Gaussian, they are estimated using the Gaussian kernel method, and the only available distance for the moment is the \(L^2\)-distance (Calculation of \(L^2\) distances between continuous non-Gaussian densities estimated by the kernel method Section).
The dad package includes functions for all the calculations required to implement such a method and to interpret its outputs:
fmdsd function which performs multidimensional
scaling and generates scores;plot function which generates graphics representing
the densities on the factorial axes;interpret function which returns other aids to
interpretation based on the moments of the variables.fmdsd functionMDS of densities can be carried using the fmdsd
function, which applies to an object of the class folder
(Table 1b). The future or to a data frame
and a grouping variable (Table 1a). It is
built on the cmdscale function of R. In addition to the
add argument of cmdscale, the
fmdsd function has three sets of optional arguments. The
first, consisting of gaussiand, windowh, and
distance, controls the method used to estimate the
densities and their distances (Distance/divergence between
densities Section). The second consists of the arguments
data.scaled, data.centered, controls some data
transformations, and the logical argument common.variance,
which, when set to TRUE, considers that all the occasions
have the same covariance matrix. These three arguments are discussed in
Appendix E. The third set consists of optional arguments which control
the function outputs.
fmdsd outputsThe fmdsd function returns an object of S3
class fmdsd, consisting of a list of 11 elements, including
the scores, also called principal coordinates, and the moments of the
variables per occasion. The outputs are displayed with the
print function, and graphical representations on the
principal planes are generated with the plot function.
The interpretation of outputs is based on the relationships between
the principal scores and the moments of the densities, in particular
their means, variances, covariances, and correlations. These
relationships are quantified by correlation coefficients and are
represented graphically by plotting the scores against the moments.
These interpretation tools are provided by the interpret
function, which has two optional arguments: nscores
indicating the indices of the column scores to be interpreted and
moment whose default value is "mean".
The following example is treated in detail in Boumaza, Yousfi, and Demotes-Mainard (2015),
using PCA of densities. The data consist of \(T = 10\) rose bushes assessed three times,
by a jury of 14 assessors, for \(p =
3\) attributes: top-sided shape (\(Sha\)), foliage thickness (\(Den\)), and plant symmetry (\(Sym\)). Here, we present the results
obtained with the MDS technique. This presentation is limited to the
major steps in the calculation and the visualization of the results
generated by the fmdsd, print,
plot, and interpret functions.
> data("roses")
> rosesf <- as.folder(roses[,c("Sha", "Den", "Sym", "rose")], groups = "rose")
> resultmds <- fmdsd(rosesf, gaussiand = FALSE, distance = "l2")
The function fmdsd displays the barplot of the inertia
explained by the first nine principal coordinates (Figure 1).
> names(resultmds)
[1] "call" "group" "variables" "d" "inertia" "scores"
[7] "means" "variances" "correlations" "skewness" "kurtosis"
By default, the print function applied to
resultmds only displays the names of the variables, the
inertia, and the principal coordinates.
> print(resultmds)
group variable: rose
variables: Sha Den Sym
---------------------------------------------------------------
inertia
eigenvalue inertia
1 0.02977 25.3
2 0.02261 19.2
3 0.02028 17.2
4 0.01439 12.2
5 0.00980 8.3
6 0.00930 7.9
7 0.00566 4.8
8 0.00344 2.9
9 0.00262 2.2
---------------------------------------------------------------
coordinates
rose PC.1 PC.2 PC.3
A A 0.055191062 0.022167510 0.02655143
B B -0.004963751 -0.023764758 0.07033084
C C 0.019611171 -0.122241048 -0.06566866
D D -0.091777346 0.041132410 -0.04995275
E E -0.013763431 0.019828288 -0.01341398
F F 0.016470141 -0.024307858 0.07144865
G G -0.088949736 0.005722199 0.01223686
H H -0.025102407 -0.006365474 0.01382440
I I 0.068203593 0.043999532 -0.02735366
J J 0.065080705 0.043829197 -0.03800313
> plot(resultmds)
The output is shown in Figure 2.
> interpret(resultmds)
Pearson correlations between scores and moments
PC.1 PC.2 PC.3
mean.Sha -0.65 0.29 0.69
mean.Den 0.83 -0.42 0.08
mean.Sym -0.01 0.94 -0.04
Spearman correlations between scores and moments
PC.1 PC.2 PC.3
mean.Sha -0.65 -0.26 0.65
mean.Den 0.78 -0.30 0.02
mean.Sym 0.16 0.92 -0.35
The returned plots of the interpret function are not
shown. From the correlations between the principal coordinates (PC) and
the means of the variables, we deduce that:
roses data frame): inertia explained by the first
ten principal coordinates.
roses
data frame): the first three principal coordinates.
Thus, from this interpretation of the PCs, we can describe the classes of rose bushes that can be constituted in view of their proximities vs. distances visualized in Figure 2. For example, the roses of the class \(\left\{A, I, J \right\}\) have thick foliage compared to those of the class \(\left\{D, G \right\}\), the rose bush \(C\) is very asymmetrical compared to the other rose bushes, the rose bushes of the class \(\left\{B, F \right\}\) have a top sided shape.
In order to obtain the correlations between the scores and the
standard deviations, we set the optional argument moment to
"sd" as in the following example. The other possible values
of this argument include "var" (variances),
"skewness", "cor" (correlations for
multivariate densities).
> interpret(resultmds, moment = "sd")
Pearson correlations between scores and moments
PC.1 PC.2 PC.3
sd.Sha 0.67 0.21 -0.49
sd.Den -0.01 0.77 0.11
sd.Sym -0.13 -0.44 0.76
Spearman correlations between scores and moments
PC.1 PC.2 PC.3
sd.Sha 0.50 0.45 -0.58
sd.Den 0.15 0.64 -0.08
sd.Sym -0.26 -0.76 0.75
Some of the correlations between the PCs and the standard deviations of the variables seem high. Reminding that the PCs are related to means, these correlations are therefore clues of links between standard deviations and means of the variables. We, therefore, represent roses using their means and standard deviations (Figure 3). We highlight that the standard deviations/variances used to assess discordance between assessors tend to be smaller when the products subjected to evaluation on a nine-level scale were awarded marks at the ends of the scale. This result which is actually quite intuitive, obtained by the use of MDS on probability density functions, would have been difficult to demonstrate if the means and standard deviations/variances had been analyzed separately.
roses
data frame): relationships between means and standard deviations of the
variables.
As for MDS, the fhclustd function of the dad
package implementing hierarchical cluster analysis of probability
density functions is a direct application of hclust, the
corresponding function of R. So it is briefly recalled, and we put the
emphasis on the interest of coupling the implementation of MDS with HCA
in order to interpret more easily the clusters resulting from HCA.
HCA deals with the same kind of data as the MDS technique, namely: \(T\) probability density functions and \((\delta_{ts})_{1\leq t,s\leq T}\) the distances/divergences between each pair of them. Its purpose is to build a series of nested partitions that can be visualized by means of a dendrogram (Figure 4). An agglomerative building algorithm starts with a partition consisting of \(T\) clusters (one density per cluster) then “it repeats merging the closest pair of clusters according to some similarity criteria until all the data (densities in our context) are in one cluster” (Gan, Ma, and Wu 2007). The criteria are built on dissimilarities between sets of densities, which themselves derive from \((\delta_{ts})_{1\leq t,s\leq T}\) obtained as follows:
> resulthca <- fhclustd(rosesf, gaussiand = FALSE, distance = "l2")
> round(resulthca$distances, digits = 2)
A B C D E F G H I
B 0.14
C 0.19 0.18
D 0.19 0.18 0.21
E 0.14 0.14 0.18 0.14
F 0.14 0.09 0.18 0.18 0.15
G 0.19 0.15 0.20 0.16 0.17 0.17
H 0.15 0.12 0.17 0.16 0.12 0.14 0.14
I 0.12 0.15 0.18 0.18 0.14 0.15 0.18 0.15
J 0.15 0.16 0.19 0.19 0.15 0.16 0.19 0.15 0.08
The dendrogram (Figure 4 ) is obtained by the following instruction:
plot(resulthca, xlab = "Roses", sub = " ", hang = -1)
roses data frame). The y-axis indicates the distance at
which the clusters are merged: the lower this distance is, the more the
rosebushes of the clusters are similar.By cutting the dendrogram (Figure 4 ), a partition of the set of the \(T\) roses is deduced. For example, in four clusters: \(\{D,\ G\}\), \(\{C\}\), \(\{I,\ J\}\), and \(\{E,\ H,\ A,\ B,\ F\}\). This partition is quite similar to the one we could visually make on the basis of Figure 2. This is easy to understand since Figure 4 is only an approximate visualization of the dissimilarities \((\delta_{ts})_{1\leq t,s\leq T}\) that served both in MDS and in HCA.
In the Interpretation of fmdsd outputs section,
we described the method used to give meaning to the principal
coordinates based on the moments of the variables, from which we deduced
the meaning of the clusters of roses constituted using the scores of
MDS.
This illustrates the process we propose to follow in practice:
With the notations introduced at the beginning of the section Multi-group data: examples and organization, the aim of discriminant analysis of densities (Boumaza 2004) is to predict the value of \(G\) (Table 2) for the occasion \(T+1\) represented by the density \(f_{T+1}\), knowing \(n_{T+1}\) observations of the random vector \(X_{T+1}\), which are stored in \(\mathbf{X}_{T+1}\) (Table 1).
For each \(k=1,\ldots,K\), we denote by \(g_k\), the density representing the class \(k\) of \(G\). The predicted value is: \[\label{ruledistance} \widehat{k} = \arg\ \min_{1\leq k\leq K} D(f_{T+1}, g_k)\text{,} (\#eq:ruledistance)\] where \(D\) is a distance index between densities (Distance/divergence between densities section). The data corresponding to the density \(g_k\) are those corresponding to the \(T_k\) densities \(f_t\) belonging to the class \(k\) of \(G\). In Appendix F, we specify the possible procedures for calculating the densities \(g_k\) and highlight the link between DA of densities and the linear discriminant analysis (MASS) in the homoscedastic Gaussian case.
The dad package performs all the calculations required to
obtain the predicted value through the fdiscd.predict
function whose first two arguments, x and
class.var, control the input data. It includes the
arguments distance and crit, which
respectively set the distance and the densities \(g_k\) (Appendix F). It also includes the
arguments gaussiand and windowh, which control
the method of density estimation. The fdiscd.predict
function returns an object of S3 class
fdiscd.predict, which is a list consisting of prior and
predicted values corresponding to each \(f_t\), a confusion matrix, the distances
\((d_{tk})\) between the \(f_t\)’s and the \(g_k\)’s, and proximities. These are
calculated from the inverse of the distances in such a way that their
sum is 1, but they are not probabilities and are useful for a quick
comparison of the distances.
In addition, the package calculates the misclassification ratio for
the occasions for which the prior class of \(G\) is known through the
fdiscd.misclass function. This ratio is computed by using
the leave-one-out method on the \(T\)
occasions, and the lower it is, the better the prediction of the
variable \(G\) by the data \(\mathbf{X}\). This function is based on
arguments almost identical to those used by the
fdiscd.predict function. It also generates similar outputs,
grouped into an object of the S3 class
fdiscd.misclass. However, these two functions differ in the
method used to calculate the distances \((d_{tk})\): if \(f_t\) belongs to the \(k\)-th class of \(G\), the data corresponding to \(f_t\) are not included in the data used to
estimate \(g_k\). This function is
useful for empirical investigations in order to identify the optimal
values of the arguments minimizing the misclassification ratio. These
values are then used by the fdiscd.predict function for
prediction of the \(G\) class of an
occasion of unknown class.
Let us consider the archaeological data (Section Multi-group data: examples and organization, Example 1). For the sake of clarity, we combine the six classes to yield three final classes, 1140-1175, 1175-1280, and 1280-1550, numbered 1 to 3.
> data("castles.dated")
> levels(castles.dated$periods$period) <- c("1", "2", "2", "2", "3", "3")
> castlesfh <- folderh(castles.dated$periods, "castle", castles.dated$stones)
The fdiscd.misclass function is used to calculate the
misclassification ratios (global and per class).
> fdiscd.misclass(castlesfh, class.var = "period", distance = "l2")
misallocation ratio: 0.3382353
predicted.class
prior.class 1 2 3 total misalloc
1 10 3 0 13 0.231
2 5 24 8 37 0.351
3 0 7 11 18 0.389
total 15 34 19 68
---------------------------------------------------------------
castle prior.class predicted.class misclassed
1 1 1 FALSE
2 1 1 FALSE
3 1 2 TRUE
...
131 2 2 FALSE
133 3 2 TRUE
135 2 3 TRUE
136 2 1 TRUE
In order to empirically calibrate the different arguments of the
function, the previous operation is repeated, choosing other values for
the arguments. If the gaussiand argument is set to
TRUE, the smallest global misclassification ratio is 34%
(Table 6) and is obtained for
crit = 1, the default value for this argument, and for
distance = "l2" or "hellinger".
Estimating the densities by the Gaussian kernel method by setting
gaussiand = FALSE, with the bandwidth defined by
(@ref(eq:bandwidth)) and (@ref(eq:windowh-normal-reference)), increased
the global misclassification ratio to 40%. If we consider the same
proportionality coefficient \(h\) in
the formula (@ref(eq:bandwidth)), that is \[\label{hwindowh}
\mathbf{h}_t=h \,
\widehat{\mathbf{V}}_t^{1/2}, (\#eq:hwindowh)\] by setting the
argument windowh of the function
fdiscd.misclass, we empirically obtain a value of \(h\) that minimizes the misclassification
ratio. For this purpose, we calculate the misclassification ratio for
different values of windowh (Table 7). Note that an optimal empirical value is about
windowh = 0.6, with a misclassification ratio of 32%. One
of the best empirical parametrizations of the
fdiscd.predict function would be
gaussiand = FALSE, windowh = 0.6, and
crit = 1.
crit |
1 | 2 | 3 |
| Ratio | 0.34 | 0.56 | 0.40 |
windowh |
0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 |
| Ratio | 0.56 | 0.44 | 0.38 | 0.38 | 0.37 | 0.32 | 0.32 | 0.35 | 0.34 |
The misclassification ratios are disappointing. Thus, the method was used only as an indication to date the castles and its results were cross-checked with other established facts from other historical sources (Rudrauf and Boumaza 2001).
Let us consider the agronomic data (Section Multi-group data:
examples and organization, Example 2). We consider only the
continuous variables characterizing the leaves, that is, leaving out the
discrete variable “number of leaflets”. As in the first example, we
carry out discriminant analysis by applying the
fdiscd.misclass function for different parameter values.
Table 8 gives the classification
error rates for the three criteria and some values of the
proportionality parameter. One of the best combinations of parameter
values would, therefore, be crit = 3 and
windowh = 0.2. The error rate is 0.032, which is clearly
better than the result obtained with the archaeological data of the
first example. If the same method is applied to all variables, including
“number of leaflets”, the error percentage is zero for
crit = 3 and windowh = 0.3.
In these two previous examples, it should be noted that the search
for an optimum value of the parameter windowh has the
disadvantage of being carried out by trial and error and is a
time-consuming procedure.
windowh |
crit = 1 |
crit = 2 |
crit = 3 |
| 0.1 | 0.742 | 0.806 | 0.129 |
| 0.2 | 0.452 | 0.677 | 0.032 |
| 0.3 | 0.226 | 0.452 | 0.097 |
| 0.4 | 0.129 | 0.226 | 0.097 |
| 0.5 | 0.097 | 0.129 | 0.161 |
| 0.6 | 0.097 | 0.129 | 0.161 |
| 0.7 | 0.065 | 0.129 | 0.194 |
| 0.8 | 0.065 | 0.097 | 0.226 |
| 0.9 | 0.065 | 0.097 | 0.226 |
The functions folder and as.folder play a
central role in the creation of objects handled by the functions
fmdsd (MDS on continuous data), mdsdd (MDS on
discrete data), fhclustd (HCA on continuous data), and
hclustdd (HCA on discrete data). The functions
folderh and as.folderh play the same role in
DA context: fdiscd.misclass and fdiscd.predict
in the continuous case, or discdd.misclass and
discdd.predict in the discrete case.
It is advisable for the user to know how to use them, especially as their handling is quick and easy (Appendix C).
The computation times of the functions of dad depend mainly
on the computation time of the distances between groups in MDS and HCA
contexts or between groups and classes in the DA context. The longest
times are achieved when the densities are estimated by the kernel method
(Calculation of \(L^2\) distances
between continuous non-Gaussian densities estimated by the kernel
method section). When applying the function matdisl2
with the option method = "kern" on the archeological data
(Example 1 of Appendix G), the computation time is approximately five
times greater than when method = "gaussiand".
Among the main functions of dad, the most time-consuming one
is undoubtedly fdiscd.misclass when its
gaussiand argument is set to FALSE. That is,
the densities are non parametrically estimated. For example, applying
this function to archaeological data with gaussiand = TRUE
takes less than one second while it takes approximately 30-40 seconds
with gaussiand = FALSE (Example 2 of Appendix G). Thus, we
recommend doing tests on small datasets when working on non-Gaussian
continuous data.
The choice of a distance index depends above all on the modeling hypotheses: discrete or continuous data. If they are discrete dad proposes five indices (Table 4). If they are continuous, it proposes five indices in the Gaussian case (Table 5) and only one for non-Gaussian data, which is the \(L^2\) distance combined with the estimation of the densities by the Gaussian kernel method.
The distance indices of discrete or Gaussian densities are compared on particular examples in the context of HCA (Appendix A) or DA (Appendix B). It is shown that depending on the technique used and the criterion for measuring the quality of the results, a distance index can be better than another for a set of data and worse for another set. Thus, in the discrete or Gaussian cases, if there is no a priori choice of distance index, we suggest an empirical approach. As the computation times are reasonable, we perform the desired analysis by experimenting with each of the distance indices, then choose the distance index according to one of the two criteria given by dad.
For DA, it is the misclassification ratio obtained by the
one-leave-out procedure by using the function
fdiscd.misclass. We should choose the distance index which
gives the lowest misclassification ratio.
For MDS, it is the interpretation of principal coordinates in terms
of marginal distributions (discrete densities) or moments (continuous
densities) using the function interpret. We should choose
the distance index which gives the highest correlations between
principal coordinates and marginal distributions or moments. For HCA, we
combine it with the MDS method (HCA and MDS on the same
densities subsection) and, therefore, we refer to the above to
choose the distance index.
The classic statistical methods as multidimensional scaling (MDS), hierarchical clustering analysis (HCA), and discriminant analysis (DA) operate on data which are rows of a data frame. They are available in the R packages stats, MASS (Venables and Ripley 2002), ade4 (Dray, Dufour, and Chessel 2007), FactoMineR (Lê, Josse, and Husson 2008), cluster (Maechler et al. 2019). The dad package presented in the manuscript generalizes them to data that are organized into groups or occasions. The graphics produced by MDS, the clusters constituted by HCA, or the predicted class provided by DA, concern the occasions and not the individuals who constitute them.
These methods are multivariate data analysis, but they can also be seen as functional data analysis since they operate on multivariate functions estimated from multivariate data. They are theoretically similar to some functional statistical methods implemented in fda.usc, fda, and fdadensity packages. The main difference is in the type of data processed. The functional statistical methods deal with functions of only one or two variables, while the main functions of dad can deal with more.
These methods can also be seen as compositional data analysis.
Indeed, compositional dataset provides portions of the total (Aitchison 1986) and the R packages
compositions, Compositional, and
robCompositions are devoted to such data by implementing many
descriptive or graphical techniques and models. Regarding the packages
Compositional and robCompositions, their modeling
approach is far enough from our approach to be addressed in this
discussion, while the package compositions presents
similarities with our work which we detail in Appendix H. In the case of
a univariate discrete density, the differences concern the method of
interpreting the graphical outputs provided by princomp of
compositions and mdsdd of dad even if we
get fairly similar graphics (see the example of Appendix H). In the
multivariate case, dad takes into account marginal
distributions of orders 1 and 2 (or more), while in the current version
of compositions, only marginal distributions of order 1 are
taken into account.
The most recent version of the dad package implements
functions which operate on discrete data. It also extends the three
previous methods to a mixture of numerical and categorical data by
transforming the numerical data into categorical data. It is done by
dividing the range of each numerical variable into intervals using the
function cut.folder, an extension of the function
cut to several variables of a folder.
In addition, the last versions of functions implementing MDS and HCA apply to data stored in a data frame and not only to data stored in a folder.
The following development work is also planned:
When we work on multidimensional multi-group data and are interested in the groups and not in the individuals who make up these groups, we want to have statistical methods and computer tools making it possible to describe these groups. The dad package is devoted to this.
It mainly provides elaborate functions which implement multivariate techniques such as multidimensional scaling, hierarchical classification analysis, or discriminant analysis on such groups. Moreover, in order to help users in reading the outputs of these techniques, dad provides functions for interpreting the results.
It provides datasets which illustrate such data and easy-to-use functions which allow to (i) manage multi-group data by associating a data frame with each group, (ii) compute the distances between groups based on the mathematical concept of probability distribution/density, and (iii) secondarily compute elementary statistics by group as for example frequency distributions and moments.
It defines new data structures called folders (folder,
folderh, foldermtg) and provides specific
tools to manage them, such as selecting or deleting columns from a
folder, converting numeric columns of a folder
to factors. The most noticeable among them allows to easily import into
R plant architectures encoded in mtg files and, thus, have
R packages available to analyze the imported data.
The authors thank Gilles Hunault and Julie Bourbeillon, the Associate Editor, and the reviewers for their valuable comments and contributions to this work.
We compare three distance indices in the HCA context with a small, simple example originating from an exchange with an anonymous reviewer of a previous version of the manuscript by comparing the \(L1\) and \(L2\) distances and the symmetrized divergence of Kullback-Leibler. In his review, he made a severe criticism of the \(L2\) distance, which was the unique distance proposed in the first versions of the dad package: “Let \(f\) a uniform density in the interval \([0,\ 1]\), and \(g\) also uniform in \([0,\ 0.90]\). Clearly, from the point of view of \(g\), it is impossible to reach values in the interval \((0.90,\ 1]\) and so, these two densities are not neighbors.” Indeed, the \(L1\) and \(L2\) distances and the symmetrized Kullback-Leibler divergence (KL) are: \[L1(f,g) = 0.20,\ \ L2(f,g) = 0.33,\ \ KL(f,g) = \infty,\] so \(KL (f, g)\) reflects this impossibility of reaching all the values of \(f\) from \(g\), unlike \(L1\) and \(L2\), which consider them to be relatively close to each other.
However, if we add the other two densities \(v\) and \(w\), uniform on \([0,\ 0.10]\) and \([0, 0.11]\), and if our objective is to obtain either a partition of \(f\), \(g\), \(v\), and \(w\) in 2 classes, or an approximate representation on a plane of these 4 densities, the choice of \(L1\) or \(L2\) is more informative than \(KL\) (Table 9) because with the first two distances, we would easily group \(f\) and \(g\) on one side and \(v\) and \(w\) on the other side, whereas with KL all distances are infinite.
| L1 | L2 | KL | ||||||||||||
| \(f\) | \(g\) | \(v\) | \(w\) | \(f\) | \(g\) | \(v\) | \(w\) | \(f\) | \(g\) | \(v\) | \(w\) | |||
| \(f\) | 0 | 0.20 | 1.80 | 1.78 | \(f\) | 0 | 0.33 | 3.00 | 2.84 | \(f\) | 0 | \(\infty\) | \(\infty\) | \(\infty\) |
| \(g\) | 0 | 1.78 | 1.76 | \(g\) | 0 | 2.98 | 2.82 | \(g\) | 0 | \(\infty\) | \(\infty\) | |||
| \(v\) | 0 | 0.18 | \(v\) | 0 | 0.95 | \(v\) | 0 | \(\infty\) | ||||||
| \(w\) | 0 | \(w\) | 0 | \(w\) | 0 |
Our belief is that the choice of a distance between groups cannot be made in the absolute. It is better to specify a criterion for choosing the distance, which makes sense in the context of the method used, such as percentage of inertia explained in MDS ((Delicado 2011)), or in the context of the data analysed. In the Practical advice section, we suggest two other criteria.
In the two examples of DA of densities section, the data were not considered plausibly Gaussian and the distance used is the \(L^2\) distance, the only distance computable in dad which is suitable for this type of data. In this appendix, we are interested in other types of data and in the misclassification ratios according to the distance index used (Distance/divergence between densities section).
Let \(p_1={\cal P}(\lambda_1)\) and \(p_2 = {\cal P}(\lambda_2)\) be two Poisson distributions with respective parameter \(\lambda_1 = 1\) and \(\lambda_2 = 2\) and \(S_1\) and \(S_2\) respective simulated samples of size 30. We simulate a sample \(S\) of size 10 according to the distribution \(p_1\), we calculate the dissimilarities \(d_1\) and \(d_2\) between this sample and each of the samples \(S_1\) and \(S_2\), then we compare \(d_1\) and \(d_2\). We would expect \(d_1\) to be less than \(d_2\) since the samples \(S\) and \(S_1\) are from the same population. Thus, if \(d_1> d_2\) we consider that the sample \(S\) is misclassified. We repeat the previous scenario 1000 times and then calculate the misclassification ratio.
The results obtained for the same 1000 samples and each of the distance indices of Table 4 are in column (a) of Table 10. Then we proceed in the same way with 1000 samples of the distribution \(p_2\); the misclassification ratios are in column (b).
| Distance index | \(a\) | \(b\) |
| Misclassification ratio | Misclassification ratio | |
| of 1000 samples from \(p_1\) | of 1000 samples from \(p_2\) | |
| Symmetric chi-square | 0.073 | 0.183 |
| Hellinger | 0.064 | 0.188 |
| Jeffreys | 0.049 \((600^{*})\) | 0.139 \((643^{*})\) |
| Jensen-Shannon | 0.139 | 0.269 |
| Lp | 0.079 | 0.172 |
| (*) number of samples \(S\) at infinite distance from samples \(S_1\) or \(S_2\) |
We notice that for the Jeffreys divergence, the number of samples
\(S\) at infinite distances from \(S_1\) or \(S_2\) is very large, which makes it
inefficient at discriminating between samples.
We carried out several simulations as in the previous procedure to
compare the distances. We did not find any stability in the order of
distances according to the misclassification ratios.
As for the discrete case, these distance indices are compared in a simplified context of discriminant analysis using the same procedure. Let \(p_1\) and \(p_2\) be the two Gaussian distributions \(N(0,1)\) and \(N(1,4)\).
| Distance index | \(a\) | \(b\) |
| Misclassification ratio | Misclassification ratio | |
| for 1000 samples from \(p_1\) | for 1000 samples from \(p_2\) | |
| Hellinger | 0.020 | 0.051 |
| Jeffreys | 0.018 | 0.043 |
| L2 | 0.034 | 0.040 |
| L2N | 0.026 | 0.111 |
| 2-Wasserstein | 0.007 | 0.169 |
The results are given in Table 11. We notice that for the samples from \(N (0,1)\) (a), the distance giving the best rate is the Wasserstein distance. On the other hand, it is the worst for the samples from \(N (1,4)\) (b).
We carried out several simulations and the conclusion is word for word the same as that of the discrete case with the Poisson distributions.
These results lead us to suggest choosing a distance index based on the minimum misclassification ratio in the subsection Choice of distance of Practical advice Section.
The following R instructions are used to compute the misclassification ratios in DA context with samples from Poisson distributions according to the symmetric chi-square distance index of Table 10. The instructions for computing the other ratios of Table 10 and the ratios of Table 11 are given in supplementary material.
> n <- 30
> ne <- 10
> nrep <- 1000
> l1 <- 1
> l2 <- 2
> #
> nmisclass1 <- 0
> set.seed(135)
> e1 <- rpois(n, lambda = l1)
> e2 <- rpois(n, lambda = l2)
> for(index in (1:nrep))
+ { x <- rpois(ne, lambda = l1)
+ d1 <- ddchisqsym(x,e1)
+ d2 <- ddchisqsym(x,e2)
+ if(d1 > d2) nmisclass1 <- nmisclass1 + 1
+ }
> misclassratio1 <- nmisclass1 / nrep
> print(misclassratio1)
[1] 0.073
> nmisclass2 <- 0
> set.seed(135)
> e1 <- rpois(n, lambda = l1)
> e2 <- rpois(n, lambda = l2)
> for(index in (1:nrep))
+ { x <- rpois(ne, lambda = l2)
+ d1 <- ddchisqsym(x,e1)
+ d2 <- ddchisqsym(x,e2)
+ if (d1 < d2) nmisclass2 <- nmisclass2 + 1
+ }
> misclassratio2 <- nmisclass2 / nrep
> print(misclassratio2)
[1] 0.183
The dad package uses objects of class folder or
folderh. These objects are lists of data frames having
particular formats.
Let us consider the archaeological data introduced in Multi-group data: examples and organization section. The data that Jean Michel Rudrauf (Rudrauf and Boumaza 2001) submitted to us was in the form of a paper binder, each sheet of which corresponds to a castle (Figure 5).
H: height,
L: width, l: edging and b: boss.
There are 10 stones whose measurements are complete: 7 stones belong to
the staircase tower (Tour d’escalier) and 3 stones belong to a
wall located near the entrance of the castle (Pierres dans mur près
pont d’entrée). The corner stones designated by the symbol (a) in
column H, are excluded from the study.So for each castle we have:
A most natural and least redundant way for entering such data is to create:
p
columns is named by the identification number of the castle and contains
the measurements of the stones. Its number of rows corresponds to the
number of stones whose all measurements are available.From there, we choose to suggest suitable data structures for multi-group data and propose management and calculation tools adapted to these data structures. This is exemplified below.
The archaeological data are stored in the list
castles.dated of two data frames. The data frame
castles.dated$periods consists of T=68 rows
(castles) and 2 columns: castle, the castle identifier, and
period, the building period which is a factor with 6
levels. The second data frame castles.dated$stones consists
of 1262 rows (stones) and 5 columns: 4 numeric characteristics of the
stones (height, width, edging,
and boss) and 1 factor castle with 68 levels,
which gives the identifier of the castel to which belongs each stone.
The implementation of DA requires carrying out calculations not only on
all the stones of each castle but also on all the stones of all the
castles of each period. To do this, we can store the stones in a data
frame made up of 6 columns obtained by adding a column
period to the file castles.dated$stones. Thus,
all the stones of a castle have the same period value. In
order to avoid this redundancy in data entry and management, we propose
to store the two files castles.dated$stones and
castles.dated$periods as well as the key
castle which relates them, in a folderh that
is a list of two data frames related by a key. The procedure for doing
this is given in the paragraph First example: Castles/Stones of
DA of densities section.
Now consider only the castles.dated$stones file. To
implement both DA and MDS, it is necessary first to calculate and store
the vectors of means and the covariance matrices of the p=4
numeric variables for each castle, then to use the results in the
calculation of the distances between each pair of castles. The
calculations of means and covariances by castle can be carried out
directly from the five-column data frame
castles.dated$stones using, for example, the following R
functions: colMeans, var, and by.
In order to facilitate the extraction of data relating to a particular
castle and control the data entry, or to check pieces of the computer
program, we organize the data in a folder that is a list of
T four-column data frames (one data frame per castle), and
we extend the mean and var functions so that
they apply to such a data structure and return results as lists. So if
x is the name of the folder, x[[t]] is the
data frame which contains the data of the castle t,
mean(x)[[t]] is its vector of means and
var(x)[[t]] is its covariance matrix. We find that by doing
so, it is also easier to remember the names and contents of these
objects when writing computer programs.
folderSuch objects, lists of data frames which have the same column names,
are created by the folder function, from two (or more) data
frames, say x1 and x2, as follows:
folder(x1, x2).
Optional argument cols.select defines the way in which
the columns of the two data frames are selected: common columns only
(cols.select = "intersect") or all the columns
(cols.select = "union"). For the "union"
option, if the data frames do not have exactly the same column names
they are complemented by NA’s.
The functions mean.folder (or simply mean),
skewness.folder, and kurtosis.folder applied
to the object x of the class folder,
respectively return the list of the vectors of means, skewnesses, and
kurtosises of the numeric columns of the elements of x. The
functions var.folder and cor.folder return the
list of the covariance and correlation matrices. If the data frames of
x contain non-numeric columns, these functions exclude
these columns from the computation. If they contain only one numeric
column, these functions return lists of numbers. Note that
map(x, colMeans), map(x, cor), and
map(x, var) return the same values as before, except in the
presence of a factor column whose levels are numbers:
map(x, colMeans) and map(x, cor) return an
error while map(x, var) integrates the factor column in the
covariance matrices.
Hence, in folder objects, the observed variables are the
same on every occasion, unlike individuals which can be different from
one occasion to the next. The particular case where individuals are the
same on every occasion, corresponds to data defined as three-way data by
Kiers (2000) in his essay on
standardization of terminology for multiway analysis. In this particular
case, it would be better to store the data in an object of class
array.
folderhSuch objects are hierarchical lists of data frames in which two
successive data frames from the list are related by means of a key. We
complete the presentation of the introductory example by a case with
more than two data frames. For three data frames, say df1,
df2, and df3, there are two keys: the first,
say key1, describes the “1 to N” relationship between
df1 and df2, and the second, say
key2, describes the “1 to N” relationship between
df2 and df3. The arguments of the
folderh function are introduced in the following order:
df1, key1, df2,
key2, df3, and so on, if there are more than
three data frames. An example of such object is given in Appendix D.
The function as.data.frame applied to such a
hierarchical folder, say fh, whose constituent elements are
listed above, has two main arguments: key (the name of a
key of fh) and elt (the name of a data frame
of fh) with the precision that the value of
elt is located after the value of key in the
list of arguments defining fh. In the case of two adjacent
names, that is key = key1 and elt
= df2 or key = key2 and
elt = df3, as.data.frame returns
a data frame similar to any viewpoint to that returned by the
merge function. If key = key1 and
elt = df3, the data frame returned by the
as.data.frame function, say dfr, has the same
rows as df3. The columns of dfr are those of
the data frames df3 and df1, and those
corresponding to all the keys located between key1 and
df3 in the list defining fh, noticing that the
key columns are the first columns of dfr.
mtg
filesThe result of this import procedure consists of an object of class
folderh and uses an intermediate object of class
foldermtg. Let us first specify what an mtg
file is.
The topological structure of a plant is defined from its
decomposition into elementary components and the connections between
them (Godin and Caraglio 1998). In
Figure 6, the plant is composed of 2 axes
(one principal axis, A1, coming from the root and bearing
one secondary axis, A2) and each axis is composed of
internodes and peduncles: the principal axis is composed of seven
internodes I1, ..., I7 and one peduncle
F1 and the secondary axis is composed of three internodes
I8, I9 and I10. Among the
computer file types used to store the topology of a plant, we are
interested in mtg (multiscale tree graph) files which can
be opened with a spreadsheet as LibreOffice-Calc, or Excel (Pradal and Cokelaer 2010). Breaking a plant
into axes and breaking each axis into internodes create two “1 to N”
relationships which can be stored in an object of class
folderh. This hierarchical folder, say fh, is
a list of three data frames P (set of 1 plant),
A (set of 2 axes), and I (set of 10
internodes), and two keys P and A. It is the
result of the following three R instructions which successively imports
an mtg file into R, creates an object of the S3 class
foldermtg, and then creates fh.
> mtgfile <- system.file("extdata/plant2.mtg", package = "dad")
> x2 <- read.mtg(mtgfile)
> fh <- as.folderh(x2, classes = c("P", "A", "I"))
> print(fh)
$P
P Variety
v01 v01 Starina
$A
P A Length
v02 v01 v02 30
v07 v01 v07 12
$I
A I Leaflet
v03 v02 v03 3
v04 v02 v04 3
v05 v02 v05 5
v06 v02 v06 5
v08 v07 v08 5
v09 v07 v09 5
v10 v07 v10 7
v11 v02 v11 7
v12 v02 v12 5
v13 v02 v13 3
attr(,"class")
[1] "folderh"
attr(,"keys")
"P" "A"
P) into axes (A), internodes
(I), and peduncles (F).| Biological component | File code | |
/P1 |
v01 | |
\(\wedge\)/A1 |
v02 | |
\(\wedge\)/I1 |
v03 | |
\(\wedge\)<I2 |
v04 | |
\(\wedge\)<I3 |
v05 | |
\(\wedge\)<I4 |
v06 | |
+A2 |
v07 | |
\(\wedge\)/I8 |
v08 | |
\(\wedge\)<I9 |
v09 | |
\(\wedge\)<I10 |
v10 | |
\(\wedge\)<I5 |
v11 | |
\(\wedge\)<I6 |
v12 | |
\(\wedge\)<I7 |
v13 | |
\(\wedge\)<F1 |
v14 |
An object of class foldermtg is a list of data frames.
It is only an intermediary to which the only function of R which can be
applied to it is as.folderh. This, therefore, makes it
possible to retrieve a hierarchical folder which consists of data
frames, one data frame per type of biological components, on which one
can operate statistical calculations by means of R functions.
fmdsd functionWe assume that the densities associated with the occasions are
Gaussian. For particular values of the arguments
data.centered, data.scaled, and
common.variance, the outputs of the fmdsd function are
similar to those returned by other functions.
common.variance = TRUEThe Gaussian densities are assumed with the same covariance matrix
which is estimated using all data. Thus, the distances between the
densities are reduced to the differences between their mean vectors
(Table 5). The Euclidian distances between
mean vectors computed by the R function dist are equal to
that computed by the 2-Wasserstein distance between densities. In this
case, the R function cmdscale applied on the mean vectors
returns exactly the outputs of the fmdsd function.
data.centered = TRUEThe Gaussian densities are assumed with zero mean vectors.
Consequently, the distances between the densities are reduced to the
differences between their covariance matrices (Table 5). With the 2-Wasserstein distance, the
distances between the densities are reduced to the Hilbert-Schmidt
distances between the square root of the covariance matrices while the
dual STATIS method (Lavit et al. 1994)
uses Hilbert-Schmidt distances between the covariance matrices. In R,
the calculations can partly be performed with the DSTATIS
function of the multigroup package (A.
Eslami et al. 2013; Aida Eslami et al. 2020) or the
statis function of the ade4 package (Dray, Dufour, and Chessel 2007) after some
transformation. However, the outputs of these two functions are
completely different from those of the fmdsd function
(MDS of densities Section): fmdsd focuses on the
visualization of the occasions, while the other functions focus more on
the visualization of variables or individuals.
data.scaled = TRUEWith the two previous special cases, we have shown that MDS on
densities is a way to take into account globally the means, variances,
and covariances of the occasions and, therefore, a form of
generalization of separate analyses either on averages or on variances
and covariances. The optional argument data.scaled is
useful when the analyst is interested to focus only on the relationships
between variables.
In the functions fdiscd.misclass and
fdiscd.predict, for each \(k=1,\ldots,K\), the density \(g_k\) representing the class \(k\) of \(G\) is estimated using a procedure selected
by means of the argument crit. Recalling that the class
\(k\) contains \(T_k\) densities \(f_t\), the three procedures are defined as
follows.
All samples related to the \(T_k\) occasions of class \(k\) are pooled and constitute a single sample, which is then used to estimate \(g_k\).
If \(\hat{f}_t\) estimates \(f_t\), then \(g_k\) is estimated by the mean value of the \(T_k\) densities pertaining to the class \(k\): \((1/T_k) \sum \hat{f}_t\).
The mean value of the previous \(T_k\) densities is calculated by weighting each \(\hat{f}_t\) by the size of its corresponding sample: \((1/\sum n_t)\ \sum n_t\hat{f}_t\).
The last two procedures are only available if the argument
distance is set to "l2". If there is only one
occasion per class that is \(T_k=1,\ \forall
k\), the three procedures are the same. In this case, the data
are those of the training step of classic DA. However, in the prediction
step, we have to assign a group of individuals not individually but
taken as a whole.
We assume that:
We have to predict the class value of \(f_{T+1}\) using the rule
@ref(eq:ruledistance) and one of the distances from Table 5. For distance = "wasserstein",
we calculate the distances between the mean vectors \(\|m_{T+1}- m_k\|_{I_p} \ (k=1,\ldots,T)\),
and for the other distances we calculate the distances \(\|m_{T+1}- m_k\|_{V^{-1}}\
(k=1,\ldots,T)\).
In the homoscedastic Gaussian case, this makes DA of densities appear as
a form of extension of linear discriminant analysis where the group of
individuals to be predicted is summarized by the mean vector of the
group.
All the following calculations are carried out using a laptop
computer equipped with an i5 processor on the archaeological data. The
stone characteristics are stored in the data frame x.df.
Then, we create the corresponding folder x.folder.
> data(castles.dated)
> x.df <- castles.dated$stones
> x.folder <- as.folder(x.df, groups = "castle")
The computation times of the inter-group distances are of the same magnitude when the densities are supposed Gaussian and parametrically estimated. Computations with the \(L^2\) distance take about one second.
> system.time(matdistl2d(x.folder, method = "gaussiand"))
user system elapsed
1.13 0.06 1.36
When the densities are estimated by the kernel method, the computation time is multiplied by about 5.
> system.time(matdistl2d(x.folder, method = "kern"))
user system elapsed
6.38 0.14 6.53
fdiscd.misclassWe build the hierarchical folder x.fh corresponding to
the archeological data. We first apply the function
fdiscd.misclass with the option
gaussiand = TRUE then with the option
gaussiand = FALSE.
> x.fh <- folderh(castles.dated$periods, "castle", castles.dated$stones)
> system.time(fdiscd.misclass(x.fh, class.var = "period", distance = "l2", gaussiand = TRUE))
user system elapsed
0.16 0.02 0.17
> system.time(fdiscd.misclass(xfh, class.var = "period", distance = "l2", gaussiand = FALSE))
user system elapsed
38.13 0.17 38.64
princomp.acomp applied on sa.lognormals data.
(b) The two first principal coordinates of mdsdd on the
same data. The signs of the scores are chosen so that the comparison of
the two graphs is easier.
comp1 and -PC.1 is
0.98 and between comp2 and -PC.2 is 0.93.
PC.1 against the relative
amounts of Cu, Zn, and Pb.The discrete densities considered in dad are compositional
data as they are made up of positive numbers whose sums are one. It is
possible to apply to them the techniques developed in the package
compositions after an adapted data formatting work. The
opposite, i.e., applying dad techniques to compositional data,
is partly true as in the following example illustrating the functions
princomp.acomp of compositions and
mdsdd of dad on the same data.
The simulated data sa.lognormals of
compositions are stored in a matrix \(60\,\times\,3\) whose columns are the
amounts of Cu, Zn, and Pb present
in 60 samples. They are loaded as follows.
> library(compositions)
> data(SimulatedAmounts)
> print(sa.lognormals)
Cu Zn Pb
[1,] 8.8043262 35.1671810 45.895025
...
[60,] 3.9854998 6.1301909 40.579417
These initial data are transformed so that each row is of sum 1 by
means of the acomp function.
> acomp(sa.lognormals) -> x1
> print(x1)
Cu Zn Pb
[1,] 0.097971136 0.391326782 0.51070208
...
[60,] 0.078617049 0.120922730 0.80046022
attr(,"class")
[1] acomp
The rows of the object x1 of class acomp
(relative amounts of Cu, Zn, and
Pb present in 60 samples) are transformed into tables or
arrays. These tables or arrays then are organized in a list to be
subjected to mdsdd. The three levels of the unique
categorical variable are denoted dd.Cu, dd.Zn,
and dd.Pb.
> x <- as.data.frame(x1)
> nomscol <- colnames(x)
> x2 <- list()
> for(i in 1:60) {x2[[i]] <- as.table(as.numeric(x[i,]));
+ dimnames(x2[[i]]) <- list("dd" = nomscol)}
> names(x2) <- rownames(x)
princomp and mdsddBy applying the princomp.acomp (or simply
princomp) function to x1, we obtain the
visualization of the data in a biplot (Fig. 7a).
The mdsdd function (MDS of discrete densities) is
applied to x2 which are the rows of x1
considered as discrete densities. We choose to present the results for
the distance argument set to hellinger. These
results are quite similar to those obtained with other distances as
chisqsym, jeffreys, or
jensen.
r2 = mdsdd(x2, distance = "hellinger")
It provides the figure 7b.
The two figures are almost equivalent: by plotting the coordinates
comp1 and comp2 against PC.1 and
PC.2 of the two previous graphics we get the figure 8.
For the interpretation of the axes, the two packages provide quite
different tools. With compositions, the biplot makes it
possible to visualize the links between the variables Cu,
Zn, and Pb and their links with the principal
components. With dad, the interpretation is done by crossing
the scores and the initial data. The interpretation is based on the
strength of the links between the PCs and the probabilities of occurence
of each level. In the compositional data example, almost a single axis
would suffice to explain the general trend: the first principal
coordinate PC.1 explains 90% of the inertia.
> interpret(r2, nscore = 1)
Pearson correlations between scores and probability distributions of each variable
...
Spearman correlations between scores and probability distributions of each variable
PC.1
dd.Cu 0.90
dd.Zn 0.96
dd.Pb -1.00
The graphical output is shown in Figure 9. The
relative amounts are highly correlated with PC.1, showing
that a low relative amount of the dd.Pb level corresponds
to a high relative amount of the levels dd.Cu or
dd.Zn. It is the result suggested by the biplot (Fig. 7a) returned by the function
princomp.acomp of the package compositions.