Abstract
The heuristic \(k\)-means algorithm,
widely used for cluster analysis, does not guarantee optimality. We
developed a dynamic programming algorithm for optimal one-dimensional
clustering. The algorithm is implemented as an R package called Ckmeans.1d.dp.
We demonstrate its advantage in optimality and runtime over the standard
iterative \(k\)-means algorithm.
Cluster analysis offers a useful way to organize and represent complex data sets. It is used routinely for data analysis in fields such as bioinformatics. The \(k\)-means problem is to partition data into \(k\) groups such that the sum of squared Euclidean distances to each group mean is minimized. However, the problem is NP-hard in a general Euclidean space, even when the number of clusters \(k\) is \(2\) (Aloise et al. 2009; Dasgupta and Freund 2009), or when the dimensionality is \(2\) (Mahajan, Nimbhorkar, and Varadarajan 2009). The standard iterative \(k\)-means algorithm (Lloyd 1982) is a widely used heuristic solution. The algorithm iteratively calculates the within-cluster sum of squared distances, modifies group membership of each point to reduce the within-cluster sum of squared distances, and computes new cluster centers until local convergence is achieved. The time complexity of this standard \(k\)-means algorithm is \(O(qknp)\), where \(q\) is the number of iterations, \(k\) is the number of clusters, \(n\) is the sample size, and \(p\) is the dimensionality (Manning, Raghavan, and Schtze 2008).
However, the disadvantages of various heuristic \(k\)-means algorithms in repeatability, optimality, and runtime may limit their usage in medical and scientific applications. For example, when patients are clustered according to diagnostic measurements on expression of marker genes, it can be critical that a patient consistently falls into a same diagnostic group to receive appropriate treatment, regardless of how many times the clustering is applied. In this case, both cluster repeatability and optimality are important. The result of heuristic \(k\)-means clustering, heavily dependent on the initial cluster centers, is neither always optimal nor repeatable. Often one restarts the procedure a number of times to mitigate the problem. However, when \(k\) is big, the number of restarts for \(k\)-means to approach an optimal solution can be prohibitively high and may lead to a substantial increase in runtime. As clustering is often a preprocessing step in data analysis or modeling, such inadequacy can pose a nuisance for further steps.
In response to this challenge, our objective is to develop a practical and convenient clustering algorithm in R to guarantee optimality in a one-dimensional (1-D) space. The motivation originated from discrete system modeling such as Boolean networks (Kauffman 1969, 1993) and generalized logical networks (Mingzhou Song et al. 2009), where continuous values must be converted to discrete ones before modeling. Our algorithm follows a comparable dynamic programming strategy used in a 1-D quantization problem to preserve probability distributions (Mingzhou Song, Haralick, and Boissinot 2010), the segmented least squares problem and the knapsack problem (Kleinberg and Tardos 2006). The objective function in the \(k\)-means problem is the sum of squares of within-cluster distances. We present an exact dynamic programming solution with a runtime of \(O(n^2 k)\) to the 1-D \(k\)-means problem.
We implemented the algorithm in the R package Ckmeans.1d.dp (Mingzhou(Joe) Song and Wang 2011) and evaluated its performance by simulation studies. We demonstrate how the standard \(k\)-means algorithm may return unstable clustering results, while our method guarantees optimality. Our method is implemented in C++ to take advantage of the practical speedup of a compiled program. This C++ implementation is further wrapped in an R package so that it can be conveniently called in R.
We first define the \(k\)-means problem. Let \(x_1, \ldots ,x_n\) be an input array of \(n\) numbers sorted in non-descending order. The problem of 1-D \(k\)-means clustering is defined as assigning elements of the input 1-D array into \(k\) clusters so that the sum of squares of within-cluster distances from each element to its corresponding cluster mean is minimized. We refer to this sum as within-cluster sum of squares, or withinss for short.
We introduce a dynamic programming algorithm to guarantee optimality of clustering in 1-D. We define a sub-problem as finding the minimum withinss of clustering \(x_1, \ldots, x_i\) into \(m\) clusters. We record the corresponding minimum withinss in entry \(D[i,m]\) of an \(n+1\) by \(k+1\) matrix \(D\). Thus \(D[n,k]\) is the minimum withinss value to the original problem. Let \(j\) be the index of the smallest number in cluster \(m\) in an optimal solution to \(D[i,m]\). It is evident that \(D[j-1,m-1]\) must be the optimal withinss for the first \(j-1\) points in \(m-1\) clusters, for otherwise one would have a better solution to \(D[i,m]\). This establishes the optimal substructure for dynamic programming and leads to the recurrence equation \[D[i,m] = \min_{m\leq j\leq i} \left\{ D[j-1,m-1] + d(x_j, \ldots ,x_i) \right\}, \\ 1 \le i \le n, 1 \le m \le k\] where \(d(x_j, \ldots ,x_i )\) is the sum of squared distances from \(x_j, \ldots ,x_i\) to their mean. The matrix is initialized as \(D[i,m] = 0\), when \(m=0\) or \(i=0\).
Using the above recurrence, we can obtain \(D[n,k]\), the minimum withinss achievable by a clustering of the given data. In the meanwhile, \(D[n,m]\) gives the minimum withinss if all \(n\) numbers are clustered into \(m\) groups. By definition each entry requires \(O(n^2)\) time to compute using the given recurrence if \(d(x_j, \ldots ,x_i)\) is computed in linear time, resulting in \(O(n^3 k)\) total runtime. However, \(d(x_j, \ldots, x_i)\) can be computed in constant time in the recurrence. This is possible because \(d(x_j, \ldots ,x_i )\) can be computed progressively based on \(d(x_{j+1},...,x_i)\) in constant time. Using a general index from \(1\) to \(i\), we iteratively compute \[\begin{aligned} d(x_1, \ldots ,x_i) &= d(x_1, \ldots ,x_{i-1}) + \frac{i-1}{i}(x_i-\mu_{i-1})^2 \\ \mu_{i} &= \frac{x_i + (i-1)\mu_{i-1}}{i} \end{aligned}\] where \(\mu_{i}\) is the mean of the first \(i\) elements. This iterative computation of \(d(x_j, \ldots ,x_i)\) reduces the overall runtime of the dynamic programming algorithm to \(O(n^2 k)\).
To find a clustering of data with the minimum withinss of \(D[n,k]\), we define an auxiliary \(n\) by \(k\) matrix \(B\) to record the index of the smallest number in cluster \(m\): \[\begin{gathered} B[i,m] = \underset{m \leq j \leq i}{\operatorname{argmin}}\left\{ D[j-1,m-1] + d(x_j, \ldots ,x_i) \right\}, \\ 1 \le i \le n, 1 \le m \le k \end{gathered}\] Then we backtrack from \(B[n,k]\) to obtain the starting and ending indices for all clusters and generate an optimal solution to the \(k\)-means problem. The backtrack is done in \(O(k)\) time.
The space complexity of the dynamic programming is \(O(n k)\) because we use an \((n+1)\times (k+1)\) matrix \(D\) to record minimum withinss and an \(n\times k\) matrix \(B\) for backtracking.
We implemented this dynamic programming algorithm and created an R
package Ckmeans.1d.dp.
To take advantage of the runtime speed up by a compiled language over R,
an interpreted language, we developed the dynamic programming solution
in C++. We compiled the C++ code to a binary dynamic library, which can
be called within R through function Ckmeans.1d.dp()
provided in the package.
We simulated data sets containing clusters using 1-D Gaussian mixture models. In a Gaussian mixture model, the number of components is the true number of clusters. The mean of each Gaussian component is randomly sampled from the uniform distribution from \(-1\) to \(1\) and the standard deviation of a component is uniformly distributed from \(0\) to \(0.2\).
We first compared the optimality of Ckmeans.1d.dp() and
the standard \(k\)-means method on 1-D
data. We used the Hartigan and Wong (1979)
implementation of \(k\)-means, as
provided by the kmeans() function in R. The core of the
kmeans() function is also implemented in C/C++. As
Ckmeans.1d.dp() guarantees optimality of clustering, we
define the relative difference from the kmeans() clustering
result to the optimal value produced by Ckmeans.1d.dp() as
\[\frac{withinss (\texttt{k-means}) -
withinss (\texttt{Ckmeans.1d.dp})}{withinss
(\texttt{Ckmeans.1d.dp})}\] which measures deviation of the \(k\)-means result from the optimal
value.
The data sets were randomly generated from Gaussian mixture models
with \(2\) to \(50\) components. We ran
Ckmeans.1d.dp() and kmeans() on input data
given the correct number of clusters \(k\). The relative difference in
withinss from the kmeans() to the optimal value
produced by Ckmeans.1d.dp() is shown as a function of \(k\) in Figure 1.
Although one hopes to get better clusters by restarting \(k\)-means several times, the simulation
suggests that it still cannot guarantee optimality when there are many
clusters in the data. This clearly demonstrates the advantage of
Ckmeans.1d.dp() in optimality.
We then compared the empirical runtime of
Ckmeans.1d.dp() with the standard \(k\)-means method on 1-D data. Using 1-D
Gaussian mixture models, we generated five input arrays of size \(100\), \(1,000\), \(10,000\), \(100,000\), and \(1,000,000\), respectively. The Gaussian
mixture models had \(2\) to \(50\) components.
We ran both programs on each input array ten times to obtain average empirical runtimes in three different settings. All simulations were run on a desktop computer with an Intel Core i7 860 processor and 8GB memory, running OpenSUSE 11.3 and R 2.11.1.
In the first setting (Figure 2),
runtime is obtained as a function of input data size from \(100\) to \(1,000,000\) for both
Ckmeans.1d.dp() and kmeans() without
constraints on optimality. The number of components in the Gaussian
mixture models is fixed to \(2\).
According to Figure 2, as the
input size increases, the runtime of Ckmeans.1d.dp()
increases faster than the runtime of kmeans(). However, the
result returned by kmeans() may not be optimal (the restart
of kmeans() was set to 1). Without any constraints on
optimality, kmeans() demonstrates an advantage in runtime
over Ckmeans.1d.dp() of runtime quadratic in \(n\).
In the second setting (Figure 3), runtime is obtained as a
function of number of clusters for Ckmeans.1d.dp() and
kmeans() without constraints on optimality. All input data
sets are of the same size \(10,000\)
and were generated from Gaussian mixture models with \(2\) to \(25\) components.
In Figure 3, the
runtime of Ckmeans.1d.dp() increases linearly with the
number of clusters \(k\), as does the
runtime of kmeans(). kmeans(), without any
constraints on optimality, still shows an advantage in absolute runtime
over Ckmeans.1d.dp().
In the third setting (Figure 4), we plot the runtime of
Ckmeans.1d.dp() and kmeans() as a function of
the number of clusters, obtained from exactly the same input data in the
second setting, but with a constraint on the optimality of \(k\)-means such that its withinss
has a relative difference less than \(10^{-6}\) from the optimal value.
Since Ckmeans.1d.dp() returns an optimal result in only
one run, its runtime is the same as setting 2. On the other hand,
kmeans() was restarted a number of times in order to
approach an optimal solution within the given tolerance. The number of
restarts increases substantially to produce a nearly optimal solution as
the number of clusters \(k\) increases.
As shown in Figure 4, the
runtime of Ckmeans.1d.dp() increases linearly with \(k\), but the runtime of
kmeans() appears to increase exponentially with \(k\). The runtime of kmeans()
almost always far exceeds Ckmeans.1d.dp() when \(k\) is above \(15\). This suggests an advantage of
Ckmeans.1d.dp() over kmeans() in both runtime
and optimality when \(k\) is big.
The R source code for the simulation studies is available from http://www.cs.nmsu.edu/~joemsong/R-Journal/Ckmeans-Simulation.zip.
The Ckmeans.1d.dp
package implements the dynamic programming algorithm for 1-D clustering
that we described above. In the package, the
Ckmeans.1d.dp() function performs the clustering on a 1-D
input vector using a given number of clusters.
The following example illustrates how to use the package. Figure 5 visualizes the input data and the cluster
result obtained by Ckmeans.1d.dp() in this example.
# a one-dimensional example
# with a two-component Gaussian mixture model
x <- rnorm(50, mean = 1, sd = 0.3)
x <- append(x, rnorm(50, sd = 0.3) )
result <- Ckmeans.1d.dp(x, 2)
plot(x, col = result$cluster)
abline(h = result$centers, col = 1:2, pch = 8,
cex = 2)
The main purpose of our work is to develop an exact solution to 1-D clustering in a practical amount of time, as an alternative to heuristic \(k\)-means algorithms. We thus developed the Ckmeans.1d.dp package using dynamic programming to guarantee clustering optimality in \(O(n^2 k)\) time. The algorithm is implemented in C++ with an interface to the R language. It is provided as a package to R and has been tested under recent operating system versions of Windows, Linux and MacOS. By simulation studies, we demonstrated its advantage over the standard \(k\)-means algorithm when optimality is required.
There are two limitations of our Ckmeans.1d.dp method. It only applies to 1-D data and its quadratic runtime in the sample size may not be acceptable in all applications.
However, where applicable, the impact of our method is its optimality and repeatability. As our simulation studies show, when the number of clusters is big, our method not only guarantees optimality but also has a fast runtime. In this situation, our method is preferred. We applied our Ckmeans.1d.dp to quantize biological data in the DREAM Challenges (Prill 2010) for qualitative dynamic modeling using generalized logical networks (Mingzhou Song et al. 2009).
It is of great interest if the 1-D dynamic programming strategy can be extended to multiple dimensional spaces so that clustering can be done in polynomial time to \(k\). However, this seems to us to remain open indefinitely.
We gratefully acknowledge the financial support to the reported work by the grant number HRD-0420407 from the U.S. National Science Foundation.