Abstract
Fuzzy clustering methods discover fuzzy partitions where observations can be softly assigned to more than one cluster. The package fclust is a toolbox for fuzzy clustering in the R programming language. It not only implements the widely used fuzzy \(k\)-means (F\(k\)M) algorithm, but also many F\(k\)M variants. Fuzzy cluster similarity measures, cluster validity indices and cluster visualization tools are also offered. In the current version, all the functions are rewritten in the C++ language allowing their application in large-size problems. Moreover, new fuzzy relational clustering algorithms for partitioning qualitative/mixed data are provided together with an improved version of the so-called Gustafson-Kessel algorithm to avoid singularity in the cluster covariance matrices. Finally, it is now possible to automatically select the number of clusters by means of the available fuzzy cluster validity indices.

Standard clustering algorithms assign a set of observations to a
limited number of clusters such that observations belonging to the same
cluster are more similar to each other than to those in other groups.
Generally speaking, such algorithms usually produce a hard
partition of the observations, i.e. every observation is assigned to
one and only one cluster. This may often be too strict, leading to
unrealistic partitions. The well-known Butterfly dataset (Ruspini 1970) helps to clarify the problem. It
is available in the matrix butterfly of the package fclust,
provided that the first and the last rows of the matrix are removed.
> data("butterfly", package = "fclust")
> butterfly <- butterfly[-c(1,17),]
> rownames(butterfly) <- as.character(1:nrow(butterfly))
> plot(butterfly, type = "n", xlab = "Var. 1", ylab="Var. 2")
> text(butterfly[,1], butterfly[,2], labels = rownames(butterfly), cex = 0.7, lwd = 2)
The Butterfly data refer to 15 observations and 2 variables. Two
clusters corresponding to the left and right wings (observations n.1-n.7
and n.9-n.15, respectively) of the butterfly can be graphically depicted
without any need of clustering tools. The assignment of observation n.8
(the body of the butterfly) is a much more complex issue because it
lies halfway between the two clusters. Standard algorithms fail to
properly assign such an observation. For instance, let us consider the
(hard) \(k\)-means (\(k\)M) algorithm (Hartigan and Wong 1979), the best-known
clustering algorithm. We run the \(k\)M
algorithm (using the function kmeans) a large number of
times (nt).
> set.seed(12)
> nt <- 1000
> ca8 <- rep(NA,nt)
> lfv <- rep(NA,nt)
> for (n in 1:nt){
+ km.butterfly <- kmeans(butterfly, centers = 2, iter.max = 1000, nstart = 10)
+ lfv[n] <- km.butterfly[[5]]
+ if (km.butterfly$cluster[8] == km.butterfly$cluster[1]){
+ ca8[n] <- 1
+ }else{
+ ca8[n] <- 2
+ }
+ }
> summary(lfv)
Min. 1st Qu. Median Mean 3rd Qu. Max.
31.43 31.43 31.43 31.43 31.43 31.43
> table(ca8)
ca8
1 2
560 440
We find (details not reported) that the same two clusters are always
obtained (the one formed by observations n.1-n.7, labelled Cluster 1,
and the other one by observations n.9-n.15, labelled Cluster 2), whilst
observation n.8 is assigned to one of the two clusters
(ca8) by chance, without affecting the loss function value
(lfv), i.e. the total within sum of squares.
The difficulty in the assignment of observation n.8 depends on the fact that it shares the features of both groups. This situation frequently occurs in real-life applications. In general, there may exist observations without a clear assignment to any cluster. In such cases, it would be desirable to assign them to the clusters to a certain extent. For instance, in the butterfly problem, observation n.8 should be assigned to the two clusters to the same degree. This goal can be achieved by the fuzzy approach to clustering, where observations belong to clusters with so-called membership degrees in \([0,1]\) and, for each observation, the sum of the membership degrees must be equal to 1. Therefore, in fuzzy clustering, the observations are not strictly assigned to one and only one cluster, but can share their memberships among more than one cluster. To differentiate the fuzzy approach from the standard hard one, it may also be referred to as soft clustering.
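For instance, applying the F\(k\)M algorithm (function FKM of fclust, described in detail below) to the butterfly data is expected to produce membership degrees for observation n.8 close to \(0.5\) in both clusters. A minimal sketch (exact values may vary slightly across runs):
> library(fclust)
> fkm.butterfly <- FKM(X = butterfly, k = 2, RS = 10)
> round(fkm.butterfly$U[8, ], 2)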
The best-known fuzzy clustering algorithm is the fuzzy \(k\)-means (F\(k\)M), proposed by Bezdek (1981), which is the fuzzy counterpart of
\(k\)M. It has been implemented in
several functions in different R packages: we mention cluster
(Maechler et al. 2017), clue (Hornik 2005), e1071 (Meyer et al. 2017), skmeans
(Hornik et al. 2012), vegclust
(De Caceres, Font, and Oliva 2010), ppclust
(Cebeci et al. 2018) and fclust
(Ferraro and Giordani 2015). Among them,
fclust offers a general toolbox for partitioning data using
fuzzy clustering algorithms, computing cluster validity indices and
visualizing clustering results. The current version (version 2.1.1) of
the package has been deeply improved with respect to the previous ones.
It contains several new routines and performance improvements. As a
first improvement, all the functions have been rewritten in the C++
language using Rcpp (Eddelbuettel 2013) and RcppArmadillo
(Eddelbuettel and Sanderson 2014) to
enhance their computational efficiency. In addition, all the clustering
algorithms can automatically select the optimal number of clusters using
the cluster validity indices available in the package. All the functions
usually require data organized by quantitative variables and
observations (object data). To increase the usability of the package,
another relevant improvement is the implementation of fuzzy clustering
algorithms for relational data (Davé and Sen
2002). Relational data come from measures of dissimilarity
between observations. Two clustering methods proposed by Davé and Sen (2002) have been considered. They
do not require any assumption about the adopted measure of
dissimilarity. Thus, such algorithms can be applied to a wide range of
data. Specifically, whilst all the functions for object data require
quantitative variables, those for relational data can handle all kinds
of data (quantitative, qualitative or mixed) provided that a suitable
measure of dissimilarity/distance is selected, e.g. the Gower distance
(Gower 1971) specifying the option
metric="gower" of the function daisy in the
package cluster. Finally, new functions to compare two
partitions in a fuzzy environment have been implemented. Namely, the
fuzzy versions of the Rand index [RI; Rand
(1971)], the adjusted Rand index [ARI; Hubert and Arabie (1985)], and the Jaccard index
(Jaccard 1901; Halkidi, Batistakis, and
Vazirgiannis 2001) have been developed by Campello (2007). The standard indexes are
implemented in different packages (see, for
instance, Scrucca et al. 2016; Chung et al. 2018). The fuzzy
variants are now available in fclust.
In this paper, after reviewing the most relevant fuzzy clustering methods, we recall some of the original functions to present the new improvements and we introduce the new functions by examples. We assume that the latest version of fclust available on CRAN is already installed with
> install.packages("fclust")
and loaded into the R session using
> library(fclust)
In this section, we recall the fuzzy \(k\)-means algorithm (Bezdek 1981) and its extension suggested by Gustafson and Kessel (1979). Whilst the former detects spherical clusters, the latter allows for clusters with ellipsoidal shape. Then, a fuzzy clustering algorithm for relational data is described (Davé and Sen 2002).
The best-known and most used fuzzy clustering algorithm is the fuzzy \(k\)-means (F\(k\)M) (Bezdek 1981). The F\(k\)M algorithm aims at discovering the best fuzzy partition of \(n\) observations into \(k\) clusters by solving the following minimization problem: \[\begin{array}{l} \mathop {\min }\limits_{{\bf U},{\bf H}} J_{\mathrm{F}k\mathrm{M}} = \sum\limits_{i = 1}^n \sum\limits_{g = 1}^k u_{ig}^m \, d^2 \left( {{\bf x}_i ,{\bf h}_g } \right), \\ \mathrm{s.t.} \quad u_{ig} \in \left[ {0,1} \right], \quad \sum\limits_{g = 1}^k u_{ig} = 1, \\ \end{array} \label{modFKM}\] where \(d \left( {{\bf x}_i ,{\bf h}_g } \right)\) is the Euclidean distance. In \eqref{modFKM}, \(u_{ig}\) is the generic element of the membership degree matrix \({\bf{U}}\) of order \(n \times k\), taking values in the interval \([0,1]\) and representing the membership degree of observation \(i\) to cluster \(g\). The row-wise sums of \({\bf{U}}\) are equal to 1. The prototypes (centroids) \({\bf{h}}_g=[h_{g1}, h_{g2}, \cdots, h_{gp}]\) (\(g=1,\cdots, k\)) are stored in the matrix \({\bf H}\) of order \(k \times p\), where \(p\) is the number of variables. Finally, the parameter \(m > 1\) tunes the fuzziness of the obtained solution.
The F\(k\)M algorithm, like the standard \(k\)M, can only detect spherical clusters. This depends on the use of the Euclidean distance to measure the dissimilarity between observations, and limits its applicability when non-spherical clusters are expected. In order to overcome this drawback, Gustafson and Kessel (1979) extend the F\(k\)M algorithm by replacing the Euclidean distance with a cluster-specific Mahalanobis distance: \[d^2_M \left( {{\bf x}_i ,{\bf h}_g } \right)=\left({\bf x}_i -{\bf h}_g\right)^{'} {\bf M}_g \left({\bf x}_i -{\bf h}_g\right), \label{Mal}\] where \({\bf M}_g\) is a symmetric and positive-definite matrix of order \(p\). When \({\bf M}_g={\bf I}\), \eqref{Mal} is equivalent to the (squared) Euclidean distance. The Gustafson-Kessel F\(k\)M (briefly, GK-F\(k\)M) consists in solving the following minimization problem: \[\begin{array}{l} \mathop {\min }\limits_{{\bf U},{\bf H}, {\bf{M}}_1, \ldots, {\bf{M}}_k} J_{\text{GK-F$k$M}} = \sum\limits_{i = 1}^n \sum\limits_{g = 1}^k u_{ig}^m \, d^2_M \left( {{\bf x}_i ,{\bf h}_g } \right), \\ \mathrm{s.t.} \quad u_{ig} \in \left[ {0,1} \right], \quad \sum\limits_{g = 1}^k u_{ig} = 1, \quad |{\bf M}_g|=\rho_g. \\ \end{array} \label{modGK}\] The constraints in \eqref{modGK} are similar to those in \eqref{modFKM}, except for the new ones on \({\bf M}_g\) (\(|{\bf M}_g|=\rho_g>0\), with \(\rho_g\) fixed for each \(g\)), added to avoid the trivial solution \({\bf M}_g={\bf 0}\), which would otherwise be obtained since \(J_{\text{GK-F$k$M}}\) is linear in \({\bf M}_g\).
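In R, the cluster-specific squared Mahalanobis distance \eqref{Mal} could be coded as in the following hypothetical helper (a sketch for illustration, not part of the package):
> # squared Mahalanobis distance between observation x and prototype h,
> # given the symmetric positive-definite matrix Mg
> mahal2 <- function(x, h, Mg) as.numeric(t(x - h) %*% Mg %*% (x - h))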
For the generic \(g\)-th cluster, the iterative solution for \({\bf{M}}_g\) is \(\left[\rho_g |\textbf{V}_{g}|\right]^{\frac{1}{p}}\textbf{V}^{-1}_{g}\), where \(\textbf{V}_{g}\) is the fuzzy covariance matrix for cluster \(g\), defined as \[{\bf V}_g = \frac{\sum_{i=1}^{n} u^{m}_{ig} ({\bf x}_i - {\bf h}_g)({\bf x}_i - {\bf h}_g)^{'}}{\sum_{i=1}^{n}u^{m}_{ig}}, \quad g=1,\dots,k.\] The eigenvalues and eigenvectors of \({\bf V}_g\) describe the shape and orientation of the \(g\)-th cluster. When an eigenvalue is equal to \(0\) or when the condition number of \({\bf{V}}_g\) (i.e. the ratio between its maximum and minimum eigenvalue) is very large, the matrix is nearly singular, hence \({\bf V}_g^{-1}\) cannot be calculated. The constraint \(|{\bf M}_g|=\rho_g\) cannot overcome this drawback, since the determinant of \({\bf V}_g\) becomes \(0\). Babuska, Van der Veen, and Kaymak (2002) propose to avoid these numerical problems by constraining the condition number of \({\bf{V}}_g\) to be smaller than a predefined threshold. Since this might lead to overfitting the data, the update of \({\bf V}_g\) can also be regularized by shrinking it towards the covariance matrix of the whole dataset. See Babuska, Van der Veen, and Kaymak (2002) for more details.
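A minimal sketch of the eigenvalue clipping underlying this fix is given below, assuming a maximal condition number beta (the internal implementation in fclust may differ and, as mentioned, also regularizes \({\bf V}_g\) towards the covariance matrix of the whole dataset):
> fix.cov <- function(Vg, beta = 1e15) {
+   e <- eigen(Vg, symmetric = TRUE)
+   # inflate the smallest eigenvalues so that the condition number is at most beta
+   lambda <- pmax(e$values, max(e$values) / beta)
+   e$vectors %*% diag(lambda) %*% t(e$vectors)
+ }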
In practical applications it may occur that only relational data are available. Relational data consist of the pairwise relations (similarities or dissimilarities) between observations, stored in a square matrix, say \({\bf{D}}\), not necessarily based on object data. In fact, \(\bf{D}\) can be built either by computing the dissimilarity (or similarity) between all the pairs of observations on which a set of variables are collected (indirect relational data), or according to subjective expert knowledge, e.g. a teacher expresses her/his subjective degrees of dissimilarity for all the pairs of pupils in her/his classroom (direct relational data). In the latter case, fuzzy clustering algorithms for object data can no longer be applied. In the former case, fuzzy clustering algorithms for object data should be preferred to those for relational data due to their computational efficiency. Nonetheless, fuzzy clustering algorithms for object data usually assume quantitative variables, preventing their applicability in case of qualitative/mixed variables. In such a case, fuzzy clustering methods for relational data can be fruitfully applied, provided that a suitable dissimilarity measure for qualitative/mixed variables is used.
In the literature, there exist several proposals of fuzzy clustering algorithms for relational data. Among them, a valuable choice is represented by the class of algorithms proposed by Davé and Sen (2002), which are suitable for all kinds of dissimilarity. We assume that \({\bf{D}}\) is a dissimilarity matrix. If it contains similarities, these should be converted into dissimilarities. For this purpose, e.g. the function sim2diss of the package smacof (de Leeuw and Mair 2009) can be used. The non-Euclidean fuzzy relational data clustering algorithm (NEFRC) consists in solving the following minimization problem: \[\begin{array}{l} \mathop {\min }\limits_{{\bf U}} J_{\mathrm{NEFRC}} = \sum\limits_{g= 1}^k \frac{ \sum\limits_{i = 1}^n \sum\limits_{j=1}^n u^m_{ig} u^m_{jg} \, d({\bf x}_i ,{\bf x}_j)}{2 \sum\limits_{t=1}^n u^m_{tg}}, \\ \mathrm{s.t.} \quad u_{ig} \in \left[ {0,1} \right], \quad \sum\limits_{g = 1}^k u_{ig} = 1. \\ \end{array} \label{modNEFRC}\] Notice that the NEFRC algorithm differs from the well-known FANNY algorithm proposed by Kaufman and Rousseeuw (1990), since a general fuzzifier \(m\) is used and the algorithm is suitable for all kinds of dissimilarity.
The package also offers a robust variant of NEFRC involving the concept of a noise cluster. This is an additional cluster to which outliers are assigned with high membership degrees. It is not a cluster in a strict sense, because the outliers are not necessarily similar to each other. Its role is to keep the membership degrees of the outliers to the standard clusters low, without affecting the obtained partition. The robust version of NEFRC has been implemented in the current version of fclust and represents the relational counterpart of the F\(k\)M algorithm with noise cluster, already available in the package.
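As a quick illustration, the noise cluster can be seen at work on the full butterfly matrix, whose first and last rows are the two outlying points removed in the introductory example. A sketch (with a noise cluster, the membership degrees of an outlier to the standard clusters are low and sum to less than 1, the remainder being implicitly assigned to the noise cluster):
> data("butterfly", package = "fclust")
> fkm.noise.butterfly <- FKM.noise(X = butterfly, k = 2, RS = 10)
> round(fkm.noise.butterfly$U[c(1, nrow(butterfly)), ], 2)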
Table 1: Fuzzy clustering algorithms available in the package fclust.

| Function | Algorithm |
|---|---|
| FKM | standard F\(k\)M algorithm (Bezdek 1981) |
| FKM.ent | F\(k\)M with entropy regularization (Li and Mukaidono 1995, 1999) |
| FKM.noise | F\(k\)M with noise cluster (Davé 1991) |
| FKM.ent.noise | F\(k\)M with entropy regularization and noise cluster (Li and Mukaidono 1999; Davé 1991) |
| FKM.gk | Gustafson and Kessel extension of F\(k\)M (Gustafson and Kessel 1979) |
| FKM.gk.ent | Gustafson and Kessel extension of F\(k\)M with entropy regularization (Ferraro and Giordani 2013) |
| FKM.gk.noise | Gustafson and Kessel extension of F\(k\)M with noise cluster (Gustafson and Kessel 1979; Davé 1991) |
| FKM.gk.ent.noise | Gustafson and Kessel extension of F\(k\)M with entropy regularization and noise cluster (Ferraro and Giordani 2013; Davé 1991) |
| FKM.gkb | Gustafson, Kessel and Babuska extension of F\(k\)M (Babuska, Van der Veen, and Kaymak 2002; Gustafson and Kessel 1979) |
| FKM.gkb.ent | Gustafson, Kessel and Babuska extension of F\(k\)M with entropy regularization (Babuska, Van der Veen, and Kaymak 2002; Ferraro and Giordani 2013) |
| FKM.gkb.noise | Gustafson, Kessel and Babuska extension of F\(k\)M with noise cluster (Babuska, Van der Veen, and Kaymak 2002; Davé 1991) |
| FKM.gkb.ent.noise | Gustafson, Kessel and Babuska extension of F\(k\)M with entropy regularization and noise cluster (Babuska, Van der Veen, and Kaymak 2002; Ferraro and Giordani 2013; Davé 1991) |
| FKM.pf | F\(k\)M with polynomial fuzzifier (Winkler et al. 2009; Winkler, Klawonn, and Kruse 2011) |
| FKM.pf.noise | F\(k\)M with polynomial fuzzifier and noise cluster (Winkler et al. 2009; Winkler, Klawonn, and Kruse 2011; Davé 1991) |
| FKM.med | fuzzy \(k\)-medoids algorithm (Krishnapuram et al. 2001) |
| FKM.med.noise | fuzzy \(k\)-medoids algorithm with noise cluster (Krishnapuram et al. 2001; Davé 1991) |
| NEFRC | non-Euclidean fuzzy relational algorithm (Davé and Sen 2002) |
| NEFRC.noise | non-Euclidean fuzzy relational algorithm with noise cluster (Davé and Sen 2002; Davé 1991) |
In this section we present the main features of the package fclust, with particular emphasis on the most recent updates. The list of algorithms with the corresponding functions is reported in Table 1. Apart from some peculiarities, all the available functions in the package require the same input arguments, involving the set-up of the clustering algorithms, i.e. number of starts, convergence criterion, data standardization. The user is not requested to specify such arguments because default options are provided. Obviously, the dataset to be clustered must be given.
Differently from the previous versions, the number of clusters
k is no longer required. Of course, the user can select an
integer value of k; otherwise, the optimal number of
clusters is automatically chosen by maximizing or minimizing one of the
available fuzzy cluster validity indices (see Table 2), to be specified in the option
index (default "SIL.F"). By default the
candidate numbers of clusters are in the vector k = 2:6, unless
a different integer vector is selected by the user.
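For instance, a call like the following (where X denotes a generic data matrix) tries \(k = 2, \ldots, 8\) and retains the best solution according to the fuzzy silhouette:
> fkm.sol <- FKM(X = X, k = 2:8, index = "SIL.F")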
A fast way to apply one of the available algorithms is represented by
the function Fclust:
> Fclust(X, k, type, ent, noise, stand, distance)
In Fclust to choose a given algorithm, the options
type, ent, noise and
distance should be set. type is a character
string specifying the type of algorithm to be used. The currently
available options are "standard" (the default option for
FKM-type algorithms, provided that distance = FALSE),
"polynomial", "gk", "gkb",
"medoids". ent (default FALSE)
and noise (default FALSE) are logical values
indicating, respectively, whether the entropy regularization and the
noise cluster should be used. Moreover, distance (default
FALSE) is another logical value indicating whether the data
in X are distances/dissimilarities. When
distance = TRUE, type is constrained to be
"standard" and NEFRC-type algorithms are run. Finally,
stand is used for standardization (default: no
standardization) and k indicates the desired number of
clusters (only for this function, the default value is 2).
For instance, the researcher interested in applying the F\(k\)M algorithm with noise cluster with
\(k=3\) clusters to X can type:
> Fclust(X = X, k = 3, type = "standard", noise = TRUE)
Table 2: Fuzzy cluster validity indices available in the package fclust.

| Function | Index |
|---|---|
| PC | partition coefficient |
| MPC | modified partition coefficient |
| PE | partition entropy |
| XB | Xie and Beni index |
| SIL | (crisp) silhouette |
| SIL.F | fuzzy silhouette |
In the following we are going to present the main features and
improvements of the package by considering the standard F\(k\)M algorithm (function FKM),
the Gustafson–Kessel extension of F\(k\)M according to the Babuska, Van der Veen, and Kaymak (2002) variant
(function FKM.gkb), and the clustering algorithm for
relational data (function NEFRC).
Fuzzy \(k\)-means (function FKM)

The FKM function is applied to the NBA dataset available
in fclust. The dataset contains some statistics on 30 NBA teams
for the regular season 2017-2018 (source: https://stats.nba.com/teams/traditional/): number of
wins (W), field goals made (FGM), field goals
attempted (FGA), field goal percentage (FGP),
3 point field goals made (3PM), 3 point field goals
attempted (3PA), 3 point field goals percentage
(3PP), free throws made (FTM), free throws
attempted (FTA), free throw percentage (FTP),
offensive rebounds (OREB), defensive rebounds
(DREB), assists (AST), turnovers
(TOV), steals (STL), blocks
(BLK), blocked field goal attempts (BLKA),
personal fouls (PF), personal fouls drawn
(PFD) and points (PTS). In addition, two more
variables are available indicating the conference
(Conference) and the playoff appearance
(Playoff). Both variables are objects of class
factor with two levels.
The dataset can be loaded as follows:
> data("NBA")
A subset of variables is considered for clustering purposes. The raw values of field goals, 3 point field goals and free throws are removed (only the percentage values are considered), as well as the wins and the personal fouls.
> X.NBA <- NBA[,c(4,7,10,11,12,13,14,15,16,17,20)]
We apply the function FKM to the obtained dataset. The
fuzziness parameter is set to m = 1.2
(the default value m = 2 proved too high, producing an
extremely fuzzy partition with membership degrees not far from \(0.5\)) and the number of starts is fixed to
50 (RS = 50) to reduce the risk of local optima. The number of clusters
is automatically selected using the fuzzy silhouette index
(index = "SIL.F"). Notice that the fuzzy silhouette index
represents a fuzzy extension of the well-known silhouette (Kaufman and Rousseeuw 1990) involving the use
of the membership degree information (for further
details, refer to Campello 2007). Finally, we set
stand = 1 in order to standardize the data before running
FKM:
> fkm.NBA <- FKM(X = X.NBA, m = 1.2, RS = 50, stand = 1, index = "SIL.F")
The summary method returns the most relevant
information:
> summary(fkm.NBA)
Fuzzy clustering object of class 'fclust'
Number of objects:
30
Number of clusters:
2
Cluster sizes:
Clus 1 Clus 2
18 12
Clustering index values:
SIL.F k=2 SIL.F k=3 SIL.F k=4 SIL.F k=5 SIL.F k=6
0.2994904 0.2508281 0.2558217 0.2586680 0.2700120
Closest hard clustering partition:
Houston Rockets Toronto Raptors Golden State Warriors
2 2 2
Boston Celtics Philadelphia 76ers Cleveland Cavaliers
1 2 2
Portland Trail Blazers Indiana Pacers New Orleans Pelicans
1 2 2
Oklahoma City Thunder Utah Jazz Minnesota Timberwolves
1 2 2
San Antonio Spurs Denver Nuggets Miami Heat
1 2 1
Milwaukee Bucks Washington Wizards LA Clippers
2 2 1
Detroit Pistons Charlotte Hornets Los Angeles Lakers
1 1 1
New York Knicks Brooklyn Nets Chicago Bulls
1 1 1
Sacramento Kings Orlando Magic Atlanta Hawks
1 1 1
Dallas Mavericks Memphis Grizzlies Phoenix Suns
1 1 1
Cluster memberships:
Clus 1
[1] "Boston Celtics" "Portland Trail Blazers"
[3] "Oklahoma City Thunder" "San Antonio Spurs"
[5] "Miami Heat" "LA Clippers"
[7] "Detroit Pistons" "Charlotte Hornets"
[9] "Los Angeles Lakers" "New York Knicks"
[11] "Brooklyn Nets" "Chicago Bulls"
[13] "Sacramento Kings" "Orlando Magic"
[15] "Atlanta Hawks" "Dallas Mavericks"
[17] "Memphis Grizzlies" "Phoenix Suns"
Clus 2
[1] "Houston Rockets" "Toronto Raptors"
[3] "Golden State Warriors" "Philadelphia 76ers"
[5] "Cleveland Cavaliers" "Indiana Pacers"
[7] "New Orleans Pelicans" "Utah Jazz"
[9] "Minnesota Timberwolves" "Denver Nuggets"
[11] "Milwaukee Bucks" "Washington Wizards"
Number of objects with unclear assignment (maximal membership degree <0.5):
0
Membership degree matrix (rounded):
Clus 1 Clus 2
Houston Rockets 0.02 0.98
Toronto Raptors 0.01 0.99
Golden State Warriors 0.02 0.98
Boston Celtics 0.92 0.08
Philadelphia 76ers 0.11 0.89
Cleveland Cavaliers 0.05 0.95
Portland Trail Blazers 0.95 0.05
Indiana Pacers 0.34 0.66
New Orleans Pelicans 0.00 1.00
Oklahoma City Thunder 0.78 0.22
Utah Jazz 0.14 0.86
Minnesota Timberwolves 0.12 0.88
San Antonio Spurs 0.77 0.23
Denver Nuggets 0.03 0.97
Miami Heat 1.00 0.00
Milwaukee Bucks 0.03 0.97
Washington Wizards 0.03 0.97
LA Clippers 0.96 0.04
Detroit Pistons 1.00 0.00
Charlotte Hornets 0.98 0.02
Los Angeles Lakers 0.93 0.07
New York Knicks 0.96 0.04
Brooklyn Nets 0.99 0.01
Chicago Bulls 1.00 0.00
Sacramento Kings 0.98 0.02
Orlando Magic 1.00 0.00
Atlanta Hawks 0.98 0.02
Dallas Mavericks 0.97 0.03
Memphis Grizzlies 0.99 0.01
Phoenix Suns 0.99 0.01
Cluster summary:
Cl.size Min.memb.deg. Max.memb.deg. Av.memb.deg. N.uncl.assignm.
Clus 1 18 0.77 1 0.95 0
Clus 2 12 0.66 1 0.92 0
Euclidean distance matrix for the prototypes (rounded):
Clus 1
Clus 2 2.91
Available components:
[1] "U" "H" "F" "clus" "medoid"
[6] "value" "criterion" "iter" "k" "m"
[11] "ent" "b" "vp" "delta" "stand"
[16] "Xca" "X" "D" "call"
According to SIL.F, we select the solution with \(k = 2\) clusters. The obtained clusters can
be plotted on the plane spanned by the first two principal components.
This can be done by using the plot method associated
with an fclust object, specifying the option
pca = TRUE.
> plot(fkm.NBA, pca = TRUE)
We can see that the first component is able to distinguish the two clusters. Teams with high first component scores belong to Cluster 2 and those with low scores to Cluster 1. The first component loadings are (the script is omitted):
FGP 3PP FTP OREB DREB AST TOV STL BLK BLKA
0.455 0.305 0.278 -0.157 0.158 0.395 0.071 0.160 0.370 -0.338
PTS
0.369
Hence, it appears that the clusters are mainly distinguished in terms
of FGP, AST, BLK,
PTS, BLKA, 3PP and
FTP, i.e. the variables with the highest first component
loadings in absolute value.
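These loadings can be reproduced, up to sign, by a standard principal component analysis of the standardized data, e.g. (a sketch):
> pca.NBA <- prcomp(X.NBA, scale. = TRUE)
> round(pca.NBA$rotation[, 1], 3)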
In order to interpret the clusters, we inspect the prototypes. For
this purpose, we apply the function Hraw to express the
prototypes in the original units of measurement.
> round(Hraw(X = X.NBA, H = fkm.NBA$H), 3)
FGP 3PP FTP OREB DREB AST TOV STL BLK BLKA
Clus 1 0.451 0.358 0.759 9.830 33.757 22.483 14.252 7.446 4.581 4.996
Clus 2 0.474 0.367 0.780 9.513 33.891 24.425 14.305 8.096 5.145 4.548
PTS
Clus 1 104.362
Clus 2 109.357
We can see that Cluster 2 comprises the best teams. In fact, the
values of the prototype of Cluster 2 are better than the corresponding
ones of Cluster 1, except for a few variables such as OREB
and BLKA. To further characterize the obtained clusters, we
consider the variables Conference and Playoff.
In particular, we aim at discovering whether the two clusters can be
interpreted in terms of the geographical location and/or the playoff
appearance. From a statistical point of view, this consists in comparing
the fuzzy partition resulting from FKM with the hard ones
corresponding to the classification variables Conference or
Playoff. For this purpose, the fuzzy cluster similarity
measures available in the package are considered. Such measures,
proposed by Campello (2007), are
summarized in Table 3.
Table 3: Fuzzy cluster similarity measures available in the package fclust.

| Function | Index |
|---|---|
| RI.F | fuzzy version of the Rand index |
| ARI.F | fuzzy version of the adjusted Rand index |
| JACCARD.F | fuzzy version of the Jaccard index |
To report the values of the three measures, the function
Fclust.compare can be used. The input required by
Fclust.compare (and similarly for RI.F,
ARI.F and JACCARD.F) is a fuzzy membership
degree matrix (U) and a vector of class labels
(VC).
> Fclust.compare(VC = NBA$Playoff, U = fkm.NBA$U)
ARI.F RI.F JACCARD.F
0.3077549 0.6537339 0.4825140
> Fclust.compare(VC = NBA$Conference, U = fkm.NBA$U)
ARI.F RI.F JACCARD.F
-0.02547957 0.48701485 0.31724090
It is clear that the clusters cannot be interpreted from a
geographical point of view, whilst, to some extent, they are related to
the playoff appearance. Such a comment holds especially for Cluster 2.
In fact, 11 out of the 12 teams belonging to Cluster 2 reached the playoff
stage. The only exception is the Denver Nuggets, one
of the best teams in terms of number of wins (W), which
nevertheless did not qualify because its number of wins was not
sufficient to reach the playoff stage in the Western conference.
Gustafson-Kessel extensions of the F\(k\)M algorithm (functions FKM.gk and FKM.gkb)

The Gustafson-Kessel extension of the F\(k\)M algorithm is implemented in the
functions FKM.gk and FKM.gkb. The former
implements the GK-F\(k\)M algorithm in
the original proposal, whilst the latter, recently added to the package,
considers the computational improvement suggested by Babuska, Van der Veen, and Kaymak (2002). A
simulated dataset similar to the one in Babuska,
Van der Veen, and Kaymak (2002) is used to show the differences
between the two functions. Three clusters of different sizes
(\(100\), \(80\), and \(60\) observations) in a two-dimensional space are
generated as follows: \[y = \begin{cases}
6 - 2.0x & \text{with } x \sim U(1,3) \text{ for Cluster 1}, \\
-5 + 1.5x & \text{with } x \sim U(3.2,6) \text{ for Cluster 2}, \\
3x & \text{with } x \sim U(-1,1) \text{ for Cluster 3}.
\end{cases}\] Data can be found in fclust and loaded
with the following command:
> data(synt.data2)
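For illustration only, data with the same structure could be generated along these lines (a sketch; synt.data2 itself ships with the package, so this step is not needed):
> set.seed(1)
> x1 <- runif(100, 1, 3)
> x2 <- runif(80, 3.2, 6)
> x3 <- runif(60, -1, 1)
> synt <- rbind(cbind(x1, 6 - 2.0 * x1), cbind(x2, -5 + 1.5 * x2), cbind(x3, 3 * x3))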
Figure 3 shows the scatterplot of the simulated data. In this case the cluster covariance matrices are singular because the two variables are perfectly collinear in all the clusters.
Figure 3: Scatterplot of the synt.data2 dataset.

Numerical problems occur when the standard function FKM.gk is
employed. By setting m = 2 (the default), k = 3 and
RS = 1, we have:
> fkm.gk.synt <- FKM.gk(X = synt.data2, k = 3, RS = 1)
The following warning message appears:
Warning message:
In FKM.gk(X = synt.data2, k = 3, RS = 1) :
When k=3, at least one cluster covariance matrix seems to be singular.
Increase the number of starts RS or use FKM.gkb
Thus, we can see that the algorithm stops because at least one
cluster covariance matrix is singular. In this case, the function
returns the standard object of class fclust containing the
sub-optimal solution found at the previous iteration, i.e. the one with no
singular cluster covariance matrices. Inspecting the number of
iterations and the loss function value of such a local optimum solution,
we get:
> fkm.gk.synt$iter
Start 1
13
> fkm.gk.synt$value
Start 1
0.06044555
For comparison purposes, we run the recommended function
FKM.gkb using the same start:
> fkm.gkb.synt <- FKM.gkb(X = synt.data2, k = 3, RS = 1, seed = 123)
> fkm.gkb.synt$iter
Start 1
16
> fkm.gkb.synt$value
Start 1
1.482029e-05
The method required three more iterations to converge. The obtained solution has a lower loss function value and is not affected by singularity problems.
Fuzzy clustering for relational data (function NEFRC)

The NEFRC algorithm can be applied using the function
NEFRC. Differently from the functions for clustering
object data, it requires distances/dissimilarities as its input argument.
Consistently with the other functions, the available cluster validity indices
(except for the Xie and Beni one) can be used to select the optimal
number of clusters k. In particular, the silhouette index
(SIL) and its fuzzy extension (SIL.F) have
been rearranged for relational data. Specifically, the input of
NEFRC is employed to compute the silhouette and the fuzzy
silhouette indices. This is the default option when SIL.F
is called by NEFRC. In order to use a
distance/dissimilarity matrix for computing the fuzzy silhouette index
directly, the option distance = TRUE in SIL.F should be
set. Generally speaking, the fuzzy silhouette index can be applied to
any kind of data (quantitative, qualitative or mixed), provided that a
suitable distance/dissimilarity matrix is used as input.
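For example, given a membership degree matrix U and a dissimilarity matrix D (hypothetical objects; Xca is the argument name used throughout the package), the fuzzy silhouette could be computed directly as:
> SIL.F(Xca = as.matrix(D), U = U, distance = TRUE)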
The function NEFRC is presented by considering the
congressional voting records data (Schlimmer
1987) available in fclust. The data collect the 1984 United
States voting records of 435 U.S. House of Representatives congressmen
on 16 key votes identified by the Congressional Quarterly Almanac (CQA).
The congressmen are split into Democrats and Republicans (variable
class). The 16 key votes are objects of class
factor with three levels according to the CQA scheme:
y refers to the types of votes "voted for", "paired for"
and "announced for"; n to "voted against", "paired against"
and "announced against"; yn to "voted present", "voted
present to avoid conflict of interest" and "did not vote or otherwise
make a position known".
The dataset can be loaded as follows:
> data("houseVotes")
It contains the following variables:
> colnames(houseVotes)
[1] "class" "handicapped-infants"
[3] "water-project-cost-sharing" "adoption-of-the-budget-resolution"
[5] "physician-fee-freeze" "el-salvador-aid"
[7] "religious-groups-in-schools" "anti-satellite-test-ban"
[9] "aid-to-nicaraguan-contras" "mx-missile"
[11] "immigration" "synfuels-corporation-cutback"
[13] "education-spending" "superfund-right-to-sue"
[15] "crime" "duty-free-exports"
[17] "export-administration-act-south-africa"
Since the level yn might indicate unknown preferences
for some key votes, these values are considered as missing and,
therefore, the rows with at least one yn value are
removed:
> level.drop <- droplevels(houseVotes, exclude = "yn")
> houseVotesComplete <- level.drop[complete.cases(level.drop),]
The research interest lies in discovering whether a two-cluster
structure exists and whether a relationship between the political position and
the voting behaviour emerges. Even if the dataset is not relational,
NEFRC is the only R routine for getting a
fuzzy partition based on qualitative variables. For this purpose, the
Gower distance, implemented in the function daisy of the
package cluster, is used to generate the dissimilarity
matrix:
> X.houseVotesComplete <- houseVotesComplete[,-1]
> library(cluster)
> D.houseVotes <- daisy(x = X.houseVotesComplete, metric = "gower")
The standard algorithm for relational data is employed by running the
function NEFRC setting m = 1.5 and
k = 2 in order to assess whether the clusters are related
to the parties (class).
> nefrc.houseVotes <- NEFRC(D = D.houseVotes, k = 2, m = 1.5, index = "SIL.F")
The summary method is similar to that of
FKM and hence not reported. The two clusters can be
interpreted in terms of the parties. In fact, we get the following
cluster similarity measures:
> Fclust.compare(VC = houseVotesComplete$class, U = nefrc.houseVotes$U)
ARI.F RI.F JACCARD.F
0.4871095 0.7435544 0.5914710
> plot(nefrc.houseVotes)
Figure 4: Scatterplot produced by the plot method. Points are marked according to the obtained classification (Cluster 1: red, Cluster 2: cyan).
Moreover, we have:
> table(nefrc.houseVotes$clus[,1], houseVotesComplete$class)
democrat republican
1 19 101
2 105 7
Therefore, Cluster 1 and Cluster 2 refer to the Republicans and
Democrats, respectively. In Figure 4 the
clusters are plotted in the low-dimensional space spanned by the first
two components. Note that the plot method for relational
data is based on non-metric multidimensional scaling (Kruskal 1964), obtained by calling the function
isoMDS of the package MASS (Venables and Ripley 2002).
To further interpret the clusters, Figure 5 displays the barplots of the 16 key votes
for the two clusters (by considering the closest hard clustering
partition). We can observe that the votes are highly connected with the
congressmen's political positions. This holds for almost all the 16 key
votes, with particular reference to, e.g.,
"adoption-of-the-budget-resolution", "education-spending"
and "anti-satellite-test-ban".
In this section, a dataset with ordinal data is analyzed by using
NEFRC. The data refer to the Math Anxiety Scale Survey
administered to \(20\) students in a
statistics course (Bai et al. 2009). In
the survey, each student answers \(14\)
questions by using a Likert scale with five levels
("Strongly Disagree", "Disagree",
"Neutral", "Agree",
"Strongly Agree"). First, we load the dataset:
> library(likert)
> data("mass")
Then, we compute the dissimilarity matrix by using the Gower
distance. When applied to ordinal variables, such a distance is based on
ranks. Note that the first variable of mass is
Gender, which is not useful for clustering purposes and is
thus omitted in the computation of the dissimilarity matrix. We have:
> library(cluster)
> D.mass <- daisy(x = mass[,-1], metric = "gower")
Finally, we run the function NEFRC automatically
selecting the number of clusters by means of SIL.F:
> nefrc.mass <- NEFRC(D = D.mass, index = "SIL.F")
The fuzzy silhouette values, employed to select the number of clusters, are:
> nefrc.mass$criterion
SIL.F k=2 SIL.F k=3 SIL.F k=4 SIL.F k=5 SIL.F k=6
0.5330319 0.4623684 0.4039311 0.4428360 0.4685703
Hence, \(k=2\) clusters are
suggested. Since the default options are used, the same solution could also
be obtained via the function Fclust:
> nefrc.mass <- Fclust(X = D.mass, k = 2, noise = FALSE, distance = TRUE)
The clusters can be interpreted according to the observed qualitative
variables. For this purpose, we calculate the p-values resulting from
the \({\chi}^2\) tests by which we
study the independence between the closest hard clustering partition and
every observed variable. The p-values are stored in the vector
PV:
> PV <- rep(NA,ncol(mass))
> for (j in 1:ncol(mass)) PV[j] <- chisq.test(nefrc.mass$clus[,1], mass[,j])$p.value
At the significance level \(\alpha = 0.05\), we are interested in those variables such that the corresponding p-value is lower than \(\alpha\):
> alpha <- 0.05
> names(mass)[PV < alpha]
[1] "I find math interesting."
[2] "I get uptight during math tests."
[3] "Mind goes blank and I am unable to think clearly when doing my math test."
[4] "I worry about my ability to solve math problems."
[5] "I get a sinking feeling when I try to do math problems."
[6] "I find math challenging."
[7] "Mathematics makes me feel nervous."
[8] "Mathematics makes me feel uneasy."
[9] "Math is one of my favorite subjects."
[10] "I enjoy learning with mathematics."
[11] "Mathematics makes me feel confused."
We inspect the contingency tables (not reported here) between such a
subset of observed variables and the closest hard clustering partition.
We find that Cluster 1 is characterized by large frequencies for the
categories "Strongly Disagree" and "Disagree"
with respect to the variables "I find math interesting.",
"Math is one of my favorite subjects." and
"I enjoy learning with mathematics.", and by large frequencies
for the categories "Agree" and
"Strongly Agree" with respect to the variables
"I get uptight during math tests.",
"Mind goes blank and I am unable to think clearly when doing my math test.",
"I worry about my ability to solve math problems.",
"I get a sinking feeling when I try to do math problems.",
"I find math challenging.",
"Mathematics makes me feel nervous.",
"Mathematics makes me feel uneasy." and
"Mathematics makes me feel confused.". Of course, the
opposite comment holds for Cluster 2. Therefore, the partition
distinguishes the students liking math (assigned to Cluster 2) from
those who experience feelings of stress when faced with math (assigned
to Cluster 1).
In the previous two subsections, NEFRC was applied in
order to discover homogeneous clusters of observations on which
qualitative variables were collected. In those cases, suitable
dissimilarity matrices were built before running NEFRC. In
the current subsection, we consider the case where variables are not
available and the only information about the observations is expressed
in terms of their dissimilarities or distances. The data are stored in
the following object of class dist:
> library(smacof)
> data("FaceExp")
FaceExp contains the dissimilarities between pairs of 13
facial expressions related to particular stimuli:
> labels(FaceExp)
[1] "Grief at death of mother" "Savoring a Coke"
[3] "Very pleasant surprise" "Maternal love-baby in arms"
[5] "Physical exhaustion" "Something wrong with plane"
[7] "Anger at seeing dog beaten" "Pulling hard on seat of chair"
[9] "Unexpectedly meets old boyfriend" "Revulsion"
[11] "Extreme pain" "Knows plane will crash"
[13] "Light sleep"
The dissimilarities were calculated in a psychological experiment where a set of subjects were invited to judge how much two pictures of emotional expressions differ. Thus, all the possible pairs of emotional expressions were compared by the subjects and the dissimilarities were derived. See Abelson and Sermat (1962) for further details.
By means of NEFRC the aim is to discover whether similar
facial expressions are perceived by the subjects in connection with
similar emotions intended by the stimuli. In this case, we do not know
the number of clusters and, therefore, we determine it according to
SIL.F.
> nefrc.FaceExp <- NEFRC(D = FaceExp, index = "SIL.F")
We find that \(k=3\) should be set:
> nefrc.FaceExp$criterion
SIL.F k=2 SIL.F k=3 SIL.F k=4 SIL.F k=5 SIL.F k=6
0.5298465 0.5929045 0.5470887 0.5436513 0.4003177
The interpretation of the clusters can be done by seeking a common feature for the facial expressions, i.e. the stimuli, assigned to the same cluster. We have:
> round(nefrc.FaceExp$clus[(nefrc.FaceExp$clus[,1] == 1), 2], 2)
Savoring a Coke Very pleasant surprise Maternal love-baby in arms
0.64 0.85 0.75
Pulling hard on seat Unexpectedly meets old boyfriend
0.59 0.94
> round(nefrc.FaceExp$clus[(nefrc.FaceExp$clus[,1] == 2), 2], 2)
Grief at death of mother Physical exhaustion Revulsion
0.79 0.81 0.69
Extreme pain Light sleep
0.56 0.64
> round(nefrc.FaceExp$clus[(nefrc.FaceExp$clus[,1] == 3), 2], 2)
Something wrong with plane Anger at seeing dog beaten Knows plane will crash
0.52 0.93 0.78
Cluster 1 groups pleasant stimuli, with the only exception of
"Pulling hard on seat of chair", for which the membership
degree is however the lowest one (0.59). The facial
expressions showing pain belong to Cluster 2. "Light sleep"
is also assigned to this cluster; it follows that the subjects tend to
associate such an expression with suffering. Finally, anxiety
characterizes Cluster 3.
Conclusions

In this paper we have described the main features of the package
fclust, a toolbox for fuzzy cluster
analysis. The functions in the package offer a wide range of fuzzy
clustering algorithms, fuzzy cluster validity indices, measures of
similarity for comparing hard and fuzzy partitions and visualization
tools for fuzzy clustering results. Particular attention has been paid
to the new improvements and implementations available in the current
version of the package (version 2.1.1). First of all, the functions have
been updated by using the C++ language, with a remarkable reduction in
computation time. Furthermore, the package now includes some fuzzy
clustering algorithms for relational data, allowing the user to perform
a fuzzy clustering analysis when the variables are qualitative or mixed.
In such cases, a dissimilarity matrix can be built by using existing
R functions (e.g. dist or daisy in the
package cluster) and the available functions for relational
data (NEFRC and NEFRC.noise) can then be
applied. As far as we know, NEFRC and
NEFRC.noise represent the first available R
functions for fuzzy clustering of qualitative or mixed variables. All
the functions have been revised in such a way that the number of
clusters can be automatically selected. This might increase the
computation time, but it is crucial in order to spread the use of fuzzy
clustering methods especially for non-expert users. In this connection,
the function Fclust for running the available algorithms
using the default options and specifying the desired number of clusters
is also offered.