
Stat Comput (2011) 21:295–308

DOI 10.1007/s11222-010-9169-0

Density-based Silhouette diagnostics for clustering methods


Giovanna Menardi

Received: 23 February 2009 / Accepted: 6 January 2010 / Published online: 4 February 2010
© Springer Science+Business Media, LLC 2010

Abstract  Silhouette information evaluates the quality of the partition detected by a clustering technique. Since it is based on a measure of distance between the clustered observations, its standard formulation is not adequate when a density-based clustering technique is used. In this work we propose a suitable modification of the Silhouette information aimed at evaluating the quality of clusters in a density-based framework. It is based on the estimation of the data posterior probabilities of belonging to the clusters and may be used to measure our confidence about data allocation to the clusters as well as to choose the best partition among different ones.

Keywords  Cluster analysis · Density estimation · Diagnostics · Silhouette information

1 Introduction

Cluster analysis refers to a widespread class of methods for exploring data with the aim of finding groups of objects that are similar to each other but different from the objects in other groups. These goals of similarity within groups and dissimilarity between groups are typically achieved according to a traditional approach based on some measures of distance (dissimilarity) or, alternatively, by evaluating the density underlying the data.

Unlike supervised techniques of data mining, when dealing with a clustering problem there is no prior information about the existence of interesting partitions of the data in groups. Therefore, it is not known to what extent the quality of the clustering is due to the real structure of the data or to the performance of the clustering technique adopted. Hence, a tool for evaluating the quality of the results can be useful to assess the ability of the clustering procedure in finding partitions or to choose the best partition.

Several methods have addressed this issue in the past, many of them based on some measure of distance between objects and clusters. One exploratory tool hinging on this idea is the Silhouette information (Rousseeuw 1987). However, it can be argued that the diagnostics used for evaluating the goodness of a partition should be consistent with the clustering method adopted to produce that partition and, in particular, that distance-based diagnostics are inadequate to assess the groups identified by using a technique based on density estimation.

In this work we propose a suitable modification of the Silhouette index aimed at appraising the clustering quality in a density-based framework.

The rest of the paper is organized as follows: in Sect. 2 a review of clustering methods based on density measures is presented. Section 3 discusses some diagnostics for evaluating the quality of clusters and focuses on Silhouette information specifically. Section 4 introduces the proposed modification. Its diagnostic ability is evaluated on some simulated and real data in Sect. 5. Some concluding remarks are reported in Sect. 6.

G. Menardi (✉)
Department of Economics and Statistics, University of Trieste,
P.le Europa, 1, Trieste, Italy
e-mail: [email protected]

2 Density-based clustering methods

Among the many techniques for cluster analysis, a quite recent approach expresses the concept of intra-group similarity and inter-group dissimilarity in terms of the density of data. This approach goes back to Hartigan's (1975, p. 205) definition of clusters, which may be thought of as regions of
high density separated from other such regions by regions of low density, but it has been developed only recently with current computational advances.

This idea has an explicitly inferential motivation: the observed data X = (x_1, ..., x_i, ..., x_n)', x_i ∈ ℝ^d, are, in fact, supposed to be a sample of independent and identically distributed realizations of a d-dimensional random vector with an unknown probability density function f. Density estimation allows for the detection of high density regions (empirical clusters) which approximate the population clusters.

Two main classes of density-based methods arise in the statistical literature about cluster analysis. The model-based approach (see, for a review, McLachlan and Peel 1998; Fraley and Raftery 2002) rests on the idea that each cluster U_m, m = 1, ..., M, corresponds to a subpopulation f_m typically belonging to some parametric family. The overall population f is then modeled as a finite mixture of these subpopulations:

    f(x) = \sum_{m=1}^{M} \pi_m f_m(x),

where f_m(·) is the density of the mth component of the mixture, corresponding to the group U_m and usually depending on some parameter vector θ_m, and π_m is the mixing proportion of that component. Estimation of θ_m is usually carried out by the expectation-maximization (EM) algorithm, which determines the maximum likelihood estimate of the mixture model parameters. The maximization step of EM allocates the data to the clusters according to the Bayes rule for classification, namely the observation x is classified to the cluster U_{m_0} if

    \hat{\pi}_{m_0} \hat{f}_{m_0}(x) > \hat{\pi}_m \hat{f}_m(x),  m ≠ m_0.

A second class of clustering methods, which is more directly related to the Hartigan definition, links the clusters with the high density level sets. Basically, any section of the density f underlying the data, at level k, induces a partition of the sample space into two sets, one having density less than k, one having density greater than k. The clusters correspond to the maximum connected regions of the latter set. As k varies, these clusters may be represented according to a hierarchical structure in the form of a tree. Density estimation is usually performed by a nonparametric method and allows for the detection of high density regions. However, the associated connected regions are not, in general, explicitly defined.

The clustering methods of this class mainly differ from each other in the way of detecting such regions. For instance, in Cuevas et al. (2001), the identification of the connected sets is pursued by a technique based on the smoothed bootstrap data resampling. Essentially, the population clusters are approximated by the union of closed balls centered at the sampled data with estimated density above a fixed threshold k. Such a constant depends on some user-defined related parameters. Stuetzle (2003) takes advantage of a link between the minimum spanning tree and the nearest neighbor density estimate of the data, which allows for an easy detection of the connected components of the level sets. Azzalini and Torelli (2007) approximate the groups with the polyhedrons formed by applying a Delaunay triangulation on the data with estimated density greater than k. The remaining data, with lower density, are assigned to the clusters by following a logic typical of supervised classification. A notable advantage of the procedure is that the number of clusters is automatically selected.

It is worthwhile noting that the two classes of methods differ not only because of their approach to density estimation (parametric and, respectively, nonparametric): the definition of a cluster is also conceptually different in the two approaches. While the nonparametric methods associate the clusters to the regions around the modes of the probability distribution of the data, clusters in the model-based approach correspond to the components of a mixture of distributions. Since the number of modes in a mixture of distributions does not necessarily match the number of components, the difference between the two approaches emerges quite clearly.

The idea of defining groups as regions associated with the high density connected components is also the cornerstone on which several methods, proposed by the machine learning community, rest. However, it would be more appropriate to place these methods halfway between the density-based and the distance-based clustering procedures. In fact the data lie on a metric space and the density of the data is usually expressed as a function of the distance between the objects. DBSCAN (Ester et al. 1996) is one of the most representative procedures of this class. Here the concepts of both density and connectivity are based on the local distribution of the data's nearest neighbors. GDBSCAN (Sander et al. 1998) is a suitable generalization of DBSCAN which allows for clustering point objects as well as spatially extended objects, according to their spatial and non-spatial attributes. An attempt to overcome some arbitrariness in the choice of the input parameters inherent to these methods is given in Ankerst et al. (1999).

3 Evaluating the quality of clusters

Several recent and earlier methods have addressed the issue of evaluating the partition produced by a clustering method, in order to give a measure of its quality, choose the best partition among different ones and select the optimal number of clusters.
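The model-based route of Sect. 2 — fit the mixture by EM, then allocate each observation by the Bayes rule — can be sketched with scikit-learn's GaussianMixture. This is an illustrative sketch on toy data; the data, seed and parameter values are ours, not the paper's:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy data: two Gaussian groups standing in for U_1, U_2.
X = np.vstack([rng.normal(-2.0, 1.0, size=(100, 2)),
               rng.normal(2.0, 1.0, size=(100, 2))])

# EM fit of the finite mixture f(x) = sum_m pi_m f_m(x).
gm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Bayes rule: allocate x to the component with the largest
# pi_m * f_m(x), i.e. the largest posterior probability.
tau = gm.predict_proba(X)       # posterior tau_m(x_i), one column per group
labels = tau.argmax(axis=1)     # coincides with gm.predict(X)
```

The posterior matrix `tau` is exactly the quantity the density-based Silhouette of Sect. 4 builds on.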
Classical distance-based indexes have been proposed by Dunn (1974) as well as by Davies and Bouldin (1979). Both indexes compare, for each cluster, a measure of compactness, given by an average distance between the objects in the cluster and the cluster centroid, with a measure of separation from the other clusters (distance between the centroids). Hubert and Schultz (1976) provide a general means for measuring the association between two proximity matrices. More recent distance-based cluster validity indexes have been proposed by Xie and Beni (1991), Bezdek and Pal (1995), and Maulik and Bandyopadhyay (2002).

Other methods are based on probabilistic schemes, such as likelihood ratio tests (Duda and Hart 1973) or information-based criteria (Cutler and Windham 1994). Bayesian inference provides an alternative to likelihood ratio tests for the number of groups in a model-based clustering, both for normal mixtures and other types of distributions (Binder 1978, 1981; Banfield and Raftery 1993; Bensmail et al. 1997).

A traditional distance-based exploratory method aimed at appraising the quality of clusters is the so-called Silhouette information (Rousseeuw 1987). The idea on which Silhouette information hinges arises from the comparison of a measure of closeness of each observation to the cluster where it has been allocated and a measure of separation from the closest alternative cluster.

Let X = (x_1, ..., x_n)' be the matrix of the observations, x_i ∈ ℝ^d, and U_1, ..., U_M a partition of X produced by a clustering procedure. For each x_i one computes d(x_i; U_m), the average distance between x_i and the elements of X belonging to U_m, m = 1, ..., M. Moreover, let us suppose that the clustering procedure has assigned x_i to the cluster U_{m_0} and that U_{m_1} is the cluster which minimizes the average distance d(x_i; U_m), m ≠ m_0.

The Silhouette information for the element x_i is:

    s_i = \frac{d(x_i, U_{m_1}) - d(x_i, U_{m_0})}{\max\{d(x_i, U_{m_1}), d(x_i, U_{m_0})\}}.    (1)

Observations with a large s_i (near 1) are supposed to be well clustered, a small s_i (near 0) means that the observation lies between two clusters, and observations with a negative s_i are probably placed in the wrong cluster. The clustering structure can be displayed after splitting the observations between the groups, sorted according to Silhouette information.

An average s provides a global measure of quality of clusters and allows for the comparison of partitions (the partition with maximum average Silhouette is taken as the optimal one). See Kaufman and Rousseeuw (1990) for details.

4 Cluster validation in a density-based framework

4.1 A density-based Silhouette information

Density-based clustering methods include natural information about the degree of confidence we give to the cluster membership of the observations. High density data points are given maximum confidence because they lie just around the modes of the density function. In contrast, a lower confidence is given to the data points which lie on the tails or at the valleys of the density function. This feature may be developed in order to produce a measure for evaluating the quality of the partition detected by the clustering procedure. In particular, we propose an adaptation of the Silhouette information suitable for density-based clustering procedures. The main difference between the existing Silhouette information and the new tool is that the former is based on the distance between clusters, whereas the latter is built in a density-based framework.

Recalling the notation introduced in Sect. 3, since x_i ∈ X is drawn from a probability density function f, one can evaluate the posterior probability that it belongs to group U_m, m = 1, ..., M, as:

    \tau_m(x_i) = \frac{\pi_m f_m(x_i)}{\sum_{m=1}^{M} \pi_m f_m(x_i)},    (2)

where π_m is a prior probability of U_m and f_m is the density of group U_m at x_i.

The density-based Silhouette information (dbs) of x_i is then defined as follows:

    dbs_i = \frac{\log(\tau_{m_0}(x_i) / \tau_{m_1}(x_i))}{\max_{j=1,\dots,n} |\log(\tau_{m_0}(x_j) / \tau_{m_1}(x_j))|},    (3)

where m_0 is such that x_i has been classified to U_{m_0} and m_1 is the group index for which τ_m is maximum, m ≠ m_0.

The normalization factor in (3) does not correspond exactly to its counterpart in (1) because the maximum is taken with respect to the observations (instead of the two competing groups). This is a discretionary choice, but some preliminary analysis has shown better results using the formulation above (see the next section for further details).

The density-based Silhouette information of x_i is, therefore, proportional to the log ratio between the posterior probability that it belongs to the group to which it has been allocated and the maximum posterior probability that it belongs to another group. Large values of dbs are evidence of a well clustered data point, while small values of dbs mean a low confidence in the classification. Negative values of dbs are possible and occur when τ̂_{m_0}(x_i) < τ̂_{m_1}(x_i), that is, when the observation x_i allocated to the cluster U_{m_0} has a higher posterior probability of belonging to a different cluster. Hence,
a negative value of dbs is usually evidence of an incorrect allocation of the observation.

After evaluating the dbs index for all the observations, they are partitioned into the clusters, sorted in a decreasing order with respect to dbs and displayed on a bar graph (density-based Silhouette plot). A location index could then be calculated to obtain summarizing information about the quality of the clusters.

4.2 Computation of dbs

The practical evaluation of the described diagnostic index requires the specification of both the density of the data and the prior probabilities of the groups. Since the distribution underlying the data is not known, the empirical dbs is used:

    \widehat{dbs}_i = \frac{\log(\hat{\tau}_{m_0}(x_i) / \hat{\tau}_{m_1}(x_i))}{\max_{j=1,\dots,n} |\log(\hat{\tau}_{m_0}(x_j) / \hat{\tau}_{m_1}(x_j))|},    (4)

where

    \hat{\tau}_m(x_i) = \frac{\pi_m \hat{f}_m(x_i)}{\sum_{m=1}^{M} \pi_m \hat{f}_m(x_i)},  m = 1, ..., M,    (5)

and f̂_m(x_i) is a density estimate at x_i obtained, after clustering the data, by using only the data points in U_m.

The illustrated procedure is not linked to any specific technique of density estimation and, provided that a density estimator with good properties is used, parametric models as well as nonparametric ones can be chosen. Among the many possible choices, in the subsequent examples a kernel estimator with Gaussian kernel and diagonal smoothing matrix h = (h_1, ..., h_d) has been used. In order to reduce the computational effort, the diagonal smoothing parameters have been selected as asymptotically optimal for estimating a normal density function. This approach usually tends to induce oversmoothing when applied to non-normal data. However, we are confident that our choice is well advised because the f_m are at least unimodal by definition.

From a computational point of view, the use of only the data points allocated to U_m to estimate the f_m has the effect of pushing the clusters apart, with two main consequences. First, most x_i values will have τ_{m_0}(x_i) close to one. Hence, normalizing the density-based Silhouette by using the exact counterpart of (1) would result in most dbs values close to one. Instead, better performance derives from the use of (3), because the difference between the dbs_i emerges more clearly. A further consequence is that the observations lying at the valley of the density underlying the data have a higher chance of getting a negative value of the dbs. This effect turns out to be desirable because the confidence we give to the cluster membership decreases as we move away from one mode of the density to another mode.

With regard to the specification of π_m, the choice depends on the prior knowledge about the composition of the clusters, and a lack of information would imply the choice of a uniform distribution of the π_m over the groups. However, information derived from the detected partition can also be used. In a model-based clustering, for example, the mixing proportions would seem to be a natural choice. When using the AT procedure (Azzalini and Torelli 2007) a first detection of some cluster cores is performed, and the prior probabilities can be chosen as proportional to the cardinalities of the cluster cores.

It is worth noting that a further connection between the original Silhouette information and its density-based version exists. If f_m(·) is chosen as proportional to exp(−λ d(·, θ_m)), where d(·, ·) is a distance and θ_m, λ are some location and scale parameters characterizing the groups (i.e. a distance-based model is used for estimating the cluster densities), it is easy to show that the numerator of (3) becomes:

    \log\left(\frac{\pi_{m_0}}{\pi_{m_1}}\right) + \lambda \left( d(x_i, \theta_{m_1}) - d(x_i, \theta_{m_0}) \right).

Hence, if it is reasonable to assume a uniform distribution of the π_m over the groups, the dbs relates to the original distance-based version of the Silhouette index even more closely than in the general case, being different only in the way that the distances are averaged. The former computes the distance between x_i and the average points of U_{m_0}, U_{m_1}, while the latter measures the average distances between x_i and the elements of U_{m_0}, U_{m_1}.

5 Validation of density-based Silhouette information

The ability of the density-based Silhouette information in evaluating the quality of a clustering and in helping to choose the best partition between different ones is assessed on some real and simulated data sets.

Since one considers a cluster analysis which produces meaningful groups as successful, some clustering procedures have been applied on data with a known clustering structure, with the purpose of reconstructing the original groups. Then, the detected clustering has been evaluated by using the density-based Silhouette information. The analysis has been carried out by applying the AT procedure based on nonparametric density estimation (Azzalini and Torelli 2007) and the MCLUST model-based clustering method (Fraley and Raftery 2006).

In order to understand if the density-based Silhouette information can be used also for choosing between different partitions, distinct clustering structures have been produced by each procedure for the considered data sets. This has been possible by suitably varying the input parameters of the clustering procedures. More specifically, MCLUST has been run
by setting, in turn, a different number of components (i.e. groups) in the mixture of densities. The AT procedure, instead, automatically determines the number of groups by counting the modes of the density estimate. Hence, partitions having a different number of groups have been obtained by alternatively scaling the bandwidth matrix which governs the smoothing of the density estimate, so that different modal structures emerged.

The sensitivity of the dbs in detecting misclassified observations has been measured through a ROC analysis. Moreover, the behaviour of the mean and median dbs has been analysed when the number of clusters varies, in order to evaluate the opportunity of using a location index as a summarizing dbs for measuring the global quality of the clustering.

Finally, we have also computed the (distance-based) Silhouette information on the clusters returned by the classical Ward clustering method and applied a ROC analysis to compare the diagnostic ability of the original and the density-based Silhouette.

We present here the results from the use of four data sets. The first data set was originally presented by Forina et al. (1983) and subsequently analyzed by various authors to illustrate classification and clustering techniques. See, for instance, Stuetzle (2003). The data represent eight chemical measurements on n = 572 specimens of olive oil produced in various areas of Italy: Centre-North, South, and Sardinia. The clustering algorithms have been applied to reconstruct the geographical origin of the oils. Some preliminary analysis has been conducted on this set of data, following Azzalini and Torelli (2007). Since the resulting data matrix had a dimensionality too large to be handled by the AT procedure, all the clustering methods have been applied on the first five principal components. To see the distribution of the clusters, we have displayed the first two principal components in Fig. 1. Despite the possibility that the reduced dimension could lead to misleading considerations, the figure suggests that the clustering methods have difficulty clustering the Sardinia group.

Fig. 1  Distribution of the clusters in the first two principal components of the olive oil data. Symbols indicate the geographic areas: triangles for the South, rhombi for the Center-North, squares for Sardinia

The second example data set describes 13 chemical characteristics of 178 wines grown in the same region in Italy but derived from three different cultivars (groups). Also this data set was introduced by Forina et al. (1986) and widely used for testing new classifiers or clustering methods (see, for example, Dy and Brodley 2004). All the clustering methods were applied on the first three principal components. Here, the clustering structure is more evident than in the previous example, even when looking at the first two principal dimensions only (Fig. 2).

Fig. 2  Distribution of the clusters in the first two principal components of the wine data. Symbols indicate the different wine cultivars

The choice of using PCA to reduce the data dimension might, admittedly, be risky, because there is no guarantee that the clustering structure is preserved in the reduced space (for further details see, for example, Chang 1983). However, the issue is beyond the scope of this paper and it is not addressed here. In fact, it should be stressed that the purpose of the analysis is not to appraise the quality of the partition generated by the clustering procedures but to test the ability of dbs in understanding such quality.

As a first synthetic example, a sample of n = 100 observations has been generated from the mixture of bivariate Gaussian distributions \frac{1}{2} N(\mu_1, \Sigma) + \frac{1}{2} N(\mu_2, \Sigma), the two components having the same covariance matrix.¹

The second artificial data set contains n = 100 realizations from a bivariate standard Normal distribution. It does not present any clustering structure and has been used to test the ability of the density-based Silhouette in recognizing the absence of groups.

¹ \mu_1 = (1/2, -1), \mu_2 = (-1/2, 1), \Sigma = \begin{pmatrix} 0.4 & 1 \\ 1 & 3 \end{pmatrix}.
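The first simulated set can be reproduced along these lines. A sketch, with Σ as reconstructed from footnote 1 and an arbitrary seed of our choosing:

```python
import numpy as np

rng = np.random.default_rng(1)
mu1, mu2 = np.array([0.5, -1.0]), np.array([-0.5, 1.0])
Sigma = np.array([[0.4, 1.0],
                  [1.0, 3.0]])
# A covariance matrix must be positive definite; the eigenvalues confirm it.
assert np.all(np.linalg.eigvalsh(Sigma) > 0)

n = 100
# Mixture 1/2 N(mu1, Sigma) + 1/2 N(mu2, Sigma): draw a component
# label with probability 1/2, then sample from that component.
comp = rng.integers(0, 2, size=n)
draws1 = rng.multivariate_normal(mu1, Sigma, size=n)
draws2 = rng.multivariate_normal(mu2, Sigma, size=n)
X = np.where(comp[:, None] == 0, draws1, draws2)
```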
Fig. 3  dbs plots of three partitions produced by the AT method on the first five principal components of the olive oil data. The plots refer to different user-selected parameters of the clustering methods, leading to a distinct number of clusters. On each plot different blocks correspond to different clusters. The black crosses (normally not present in the plot) identify the misclassified units and help us in evaluating the diagnostic power of dbs. In the first panel the largest group mainly corresponds to the olive oils from Sardinia and from the South of Italy. In the top right plot, the groups are associated to the Centre-North, Sardinia and Southern areas (from the top to the bottom). The three largest groups in the dbs plot on the bottom may be labeled as the South of Italy, the Centre-North and Sardinia. The bottom right panel displays the ROC curves corresponding to the three dbs plots: the solid, dashed and dotted lines refer to the partitions in two, three and four groups, respectively
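The classical distance-based Silhouette of (1), used in Sect. 5 as a benchmark on the Ward clusters, can be computed with scikit-learn. This is a sketch on toy data standing in for the real data sets; names and parameter values are illustrative:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_samples

rng = np.random.default_rng(2)
# Toy data: two well-separated groups.
X = np.vstack([rng.normal(-3.0, 1.0, size=(50, 2)),
               rng.normal(3.0, 1.0, size=(50, 2))])

# Classical Ward hierarchical clustering, as in the distance-based benchmark.
labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X)

# s_i of (1): (d(x_i, U_m1) - d(x_i, U_m0)) / max of the two averages.
s = silhouette_samples(X, labels)
```

Observations near a cluster core get s close to 1, while points between the groups get s near 0 or below, mirroring the interpretation given in Sect. 3.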

In Figs. 3 to 10 some results are reported. On each fig- the remaining groups. Again, the misclassified labels have a
ure the dbs plots corresponding to partitions of the data in small value of the dbs. The same behaviour emerges when
two, three and four groups are displayed. In the bar plots we four groups are formed by MCLUST. The additional group
have highlighted the misclassified observations to better un- includes some data belonging to the Southern and Centre-
derstand the diagnostic ability of the dbs. The bottom right Northern regions. Instead, when four groups are returned by
panel of each figure compares the diagnostic abilities of the AT, the additional group is formed by some Centre-Northern
three partitions in terms of ROC curve, as will be explained and Sardinian oils, while some Southern olive oils are incor-
in the next paragraphs. rectly labeled as coming from the Centre-North area. The
Concerning the olive oil data, when AT and MCLUST most notable feature of the dbs plots corresponding to four
are forced to return two groups (top left panel of Figs. 3 groups is that, when using both the AT and the MCLUST
and 4) the Sardinia cluster is entirely assigned to the South- method, a very small median dbs in the fourth group is evi-
ern area and to the Centre-North respectively. Additionally, dence of a wrong number of clusters.
AT exchanges a few labels from these two areas. However, With regard to the wine data set, when AT and MCLUST
misclassified data lie on the margins of the groups, corre- are forced to partition the data into the actual number of
sponding to the valley of the density estimate, thus hav- groups (top right panel of Figs. 5 and 6) the overall misclas-
ing the smallest dbs. The tripartition of the olive oil data sification error is very low. The median dbs of each group
leads to good performance of both the clustering methods is relatively large, still taking values close to zero when data
(especially MCLUST) which correctly detect the olive oils are incorrectly labeled. Clustering the data into two groups
from the Sardinia group, but misclassifies a few values from mainly corresponds to partitioning one of the original clus-
Stat Comput (2011) 21:295–308 301

Fig. 4 Cf. Fig. 3. In this example the first five principal components of the olive oil data have been clustered by the MCLUST method. The dbs
behaviour is optimum when three groups are found because the minimum dbs corresponds to the only one misclassified point

Fig. 5 Cf. Fig. 3. In this example the first three principal components cluster’s median dbs when four groups are found, suggest we should
of the wine data have been clustered by the AT method. Many negative choose the top right clustering, corresponding to the actual structure of
values of the dbs in the bipartition and the close to zero value of one groups
302 Stat Comput (2011) 21:295–308

Fig. 6 Cf. Fig. 3. In this example the first three principal components are detected the small size of one group is evidence of a spurious clus-
of the wine data have been clustered by MCLUST. Good diagnostic ter, even if the correspondent median dbs is not minimum
abilities are shown by the dbs in all the partitions. When four groups

ters into the two detected ones. More specifically, MCLUST Unlike the Silhouette plot which returns the mean s for each
splits the data of the largest cultivar while AT splits the data cluster, the median of the dbs values is used as a measure
of the smallest one. In both cases the incorrectly labelled of a cluster’s quality. Indeed, the mean is not the best repre-
data take a small or even a negative value of the dbs in- sentative of the overall cluster accuracy, due to the lack of
dex. Good diagnostic abilities are shown by the dbs also robustness. See, for example, the bottom left panel of Figs. 3
when the data are partitioned into four groups. When us- and 10: the smallest clusters take a very close to zero (or
ing MCLUST the data assigned to the fourth group do not even negative) value of the median dbs, which is consis-
take the smallest values of the dbs but the small size of the tent with their spuriousness. Instead, the mean dbs would
group is evidence of a spurious cluster. Moreover, values of be much higher. Further interesting features emerge from
the dbs close to zero still correspond to misclassified data in the observation of the density-based Silhouette on the data
the other groups. The fourth cluster detected by the AT pro- which do not exhibit any grouping structure (see Figs. 9 and
cedure is generated by bipartitioning the third cultivar of the 10). When the clustering methods are coerced to partition
data set (corresponding to the smallest group) and it has the the data, a very small or negative dbs is given to the small-
smallest median dbs. est groups, meaning a low confidence in the classification.
The dbs behaviour on partitions of the simulated data sets Instead, a large group is formed, having the maximum dbs
(Figs. 7 to 10) shows the same tendency as the real data sets. median. This behaviour, especially visible in the partitions
The misclassified observations have a negative or a small returned by the AT method, is evidence of the absence of
value of the index and, on the other hand, negative and small groups and reassures us about the diagnostic ability of the
values of the dbs correspond to incorrect labels. Some false density-based Silhouette information even when partition-
negatives (misclassified data with large dbs) occur, but their ing data into homogeneous groups does not make sense.
presence is quite rare (see the ROC analysis below). These considerations can be confirmed through a ROC
Moreover, the analysis suggests that clusters with a large (Receiver Operating Characteristic) analysis (see, for exam-
median dbs correspond to a small misclassification error. ple, Fawcett 2006). From a statistical point of view, ROC
Stat Comput (2011) 21:295–308 303

Fig. 7 Cf. Fig. 3. The first simulated data set has been clustered by AT. When the procedure is forced to return more than two groups, the additional
clusters present several negative values of the dbs and a remarkably smaller median dbs than the actual groups

Fig. 8 Cf. Fig. 3. The first simulated data set has been clustered by MCLUST. When the procedure is forced to return a wrong number of clusters,
the additional clusters have a small median dbs with respect to the actual groups
304 Stat Comput (2011) 21:295–308

Fig. 9 Cf. Fig. 3. The second simulated data set has been clustered by AT. The smallest groups have a very low median dbs, being evidence of the absence of groups. Steep ROC curves show good diagnostic abilities of the dbs across the three partitions

analysis has been increasingly used as a tool to evaluate discriminative effects among different diagnostic methods. The ROC curve is a graphical plot of the true sensitivity vs. 1-specificity for a binary classifier as its discrimination threshold varies. Equivalently, it plots the fraction of true positives vs. the fraction of false positives as the classification threshold varies. The best possible diagnostic method would yield a point in the upper left corner, or coordinate (0, 1), of the ROC space, representing 100% sensitivity (all true positives are found) and 100% specificity (no false positives are found). A completely random guess would give a point along the diagonal line from the bottom left to the top right corner. A ROC curve for evaluating the diagnostic ability of the dbs is constructed by plotting, for each observed value m of the density-based Silhouette, the proportion of all the misclassified data with dbs ≤ m (the y-coordinate) versus the proportion of all the data with dbs ≤ m (the x-coordinate). For computational convenience, the results which follow refer to ROC curves built when m varies across a range of one hundred quantiles of the dbs, instead of across all the observed values.

The bottom right panel of Figs. 3 to 10 compares the ROC curves of the dbs across a range of partitions with two, three and four clusters. The analysis confirms the previous considerations because the ROC curves of the proposed method lie above the bisector of the ROC space. A remarkable situation occurs in Figs. 4, 8 and 10, where the ROC curves referring to the actual number of clusters jump almost immediately to one, meaning that minimum values of the dbs correspond to incorrect labels and no misclassified observation has a large dbs (no false negatives are found). Moreover, the dbs looks consistent across partitions with a distinct number of groups, the ROC curves being steep in almost all the considered examples. However, quite low ROC curves are returned when MCLUST is forced to return fewer clusters than the actual structure of groups (see the real data examples), suggesting a possible bias of the density-based Silhouette with respect to the number of clusters. The issue of possible forms of bias affecting clustering validation techniques has been discussed by Handl et al. (2005), who show that the correct partitioning may not score well under the Silhouette information when distinct partitions are considered.

Some further investigations have been conducted with the twofold aim of catching possible forms of bias and considering the use of a summarizing measure of the global quality. In particular, the behaviour of the overall mean and median dbs has been evaluated when the number of clusters varies. The analysis has shown that there do not seem to be rea-
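The construction of the dbs ROC curve just described can be sketched in a few lines. The function below is an illustrative reimplementation, not the paper's code, and its names are hypothetical; it assumes a vector of dbs values and a boolean indicator of misclassification (known here because the examples use labelled data), and it uses the standard-library `statistics.quantiles` for the one-hundred-quantile grid.

```python
import statistics

def dbs_roc_points(dbs, misclassified, n_thresholds=100):
    """ROC-style points for the diagnostic ability of the dbs.

    For each threshold m on a grid of dbs quantiles, the y-coordinate
    is the fraction of misclassified observations with dbs <= m and
    the x-coordinate is the fraction of all observations with dbs <= m.
    Assumes at least one observation is misclassified.
    """
    n = len(dbs)
    n_bad = sum(misclassified)
    points = [(0.0, 0.0)]
    # n_thresholds - 1 cut points of the empirical dbs distribution
    for m in statistics.quantiles(dbs, n=n_thresholds):
        x = sum(1 for v in dbs if v <= m) / n
        y = sum(1 for v, bad in zip(dbs, misclassified) if bad and v <= m) / n_bad
        points.append((x, y))
    points.append((1.0, 1.0))
    return points
```

A steep curve means that the smallest dbs values pick out the misclassified observations first, which is the behaviour reported for the actual-number-of-clusters partitions.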

Fig. 10 Cf. Fig. 3. The second simulated data set has been clustered by MCLUST. The minor groups have a very small median dbs, being evidence of the absence of groups. Steep ROC curves mean good diagnostic abilities of the dbs across the three partitions, especially the first

sons for preferring the median to the mean as a measure of global accuracy (unlike the evaluation of the quality of a single cluster, there is no need to use a robust location index after the dbs normalization). In fact, since the mean is affected by negative and small values of the dbs, it slightly outperforms the dbs median. A slight tendency of the dbs to favour model-based partitions with two clusters has emerged, but in the majority of the considered situations the partition with the largest mean and median dbs is the one which most overlaps with the actual clustering structure.²

² Further details may be found in the supplementary material.

It should be stressed that, unlike supervised classification problems, in a clustering framework the observations are not associated with a true label class, and the solution provided by the application of a clustering method represents just one possible partition of the data into groups of similar observations. While keeping these considerations in mind, the conducted empirical analysis may give us some suggestions about how to use the dbs information in practice:

– among distinct partitions, choose the one with the highest median (or mean) dbs;
– clusters having a median dbs close to zero are likely to be spurious groups;
– negative values of dbs are interpreted as misclassified observations, i.e. observations being more similar to the objects belonging to alternative groups;
– in general, observations getting a dbs value close to zero lie on the margins of the groups, corresponding to the valleys of the density estimate. We are not able to decide about the correct or incorrect allocation of these observations, but their dbs value denotes a poor homogeneity with respect to the other observations belonging to the same group.

Our analysis has also included the computation of the original Silhouette information on the clusters returned by the distance-based Ward clustering method. The evaluation has been conducted by cutting the dendrogram produced by the Ward method to output a given number of clusters, set equal to the actual number of groups. Pointedly, an exception has been made for the standard normal data, where two clusters have been considered instead of one, in accordance with the density-based clustering methods. Figure 11 shows the results. From the considered examples we see that the Silhouette information is quite sensitive in recognizing misclas-
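The first two practical guidelines (pick the partition with the highest median dbs; treat clusters with a near-zero median dbs as likely spurious) translate directly into code. The helpers below are a hypothetical sketch, not part of the paper; they only assume a vector of dbs values and, for the second helper, the cluster labels.

```python
import statistics

def best_partition(dbs_by_partition):
    """Among candidate partitions (name -> list of dbs values),
    return the one whose overall median dbs is highest."""
    return max(dbs_by_partition,
               key=lambda name: statistics.median(dbs_by_partition[name]))

def cluster_medians(dbs, labels):
    """Median dbs per cluster; values close to zero flag groups
    that are likely to be spurious."""
    groups = {}
    for value, label in zip(dbs, labels):
        groups.setdefault(label, []).append(value)
    return {label: statistics.median(vals) for label, vals in groups.items()}
```

The per-cluster medians mirror the boxplot summaries shown in the figures, where small groups produced by a coerced partition stand out by their low median.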

Fig. 11 Silhouette plot of the Ward method on the olive oil (top left panel), wine data (top right panel), simulated two groups (bottom left panel) and normal data (bottom right panel). A poor performance emerges from the use of s on the olive oil data and normal data partitions
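Rousseeuw's distance-based Silhouette s, used for the comparison in Fig. 11, can be computed for any fixed partition. Below is a plain-Python sketch of the standard definition s(i) = (b_i − a_i)/max(a_i, b_i), where a_i is the mean distance of observation i to its own cluster and b_i the mean distance to the nearest other cluster; the Ward clustering step that produces the partition is omitted, and the quadratic pairwise loop is kept for clarity rather than speed.

```python
import math

def silhouette(points, labels):
    """Rousseeuw's Silhouette for a fixed partition.
    `points` is a list of coordinate tuples, `labels` a parallel list."""
    clusters = sorted(set(labels))
    s = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        # a_i: mean distance to the other members of i's own cluster
        own = [math.dist(p, q) for j, (q, l) in enumerate(zip(points, labels))
               if l == lab and j != i]
        if not own:              # singleton cluster: s(i) conventionally 0
            s.append(0.0)
            continue
        a = sum(own) / len(own)
        # b_i: smallest mean distance to any other cluster
        b = min(sum(math.dist(p, q) for q, l in zip(points, labels)
                    if l == other) / labels.count(other)
                for other in clusters if other != lab)
        s.append((b - a) / max(a, b))
    return s
```

As in the distance-based plots of Fig. 11, values near one indicate well-clustered observations and negative values indicate likely misclassifications.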

sified observations. However, it is generally outperformed by its density-based version. The comparison between the ROC curves of s and dbs (Fig. 12) shows that the dbs produces uniformly steeper curves than the Silhouette information in three of the four considered data sets. Moreover, the ROC curves show even more clearly that, when the data are not partitionable, the density-based Silhouette shows good performance while the original Silhouette does not recognize the absence of groups (both the density-based and the distance-based Silhouette have been computed on the two-group clustering of the standard normal data).

The ROC analysis suggests some further reflections on the dbs behaviour when it is used in a model-based framework compared with a nonparametric framework. Unlike parametric clustering, the latter approach basically associates the clusters with the bumps of the density estimate, thus resulting in groups which are apparently more separated than groups corresponding to the components of a mixture model. It might follow that an optimistic evaluation of the dbs is due to the large log ratios between the conditional probabilities involved in (3). However, the normalization factor largely reduces this risk. In fact, the presented analysis does not provide elements to think that the dbs is biased toward favouring parametric or nonparametric methods.

6 Concluding remarks

In this work a diagnostic tool aimed at evaluating the quality of the partition generated by a clustering procedure has been presented. The method is similar to the Silhouette information but, unlike the Rousseeuw index, it is developed for appraising the quality of a density-based clustering method. It is based on the estimation of the posterior probabilities that the observations belong to the detected groups.

The idea of using the Bayes formula for evaluating the allocation of data points to the clusters is, indeed, not completely new. Fraley and Raftery (2002), for instance, assess the confidence of each clustered observation by estimating the posterior probability that it does not belong to the group where it has been allocated. This measure highly agrees (in a reverse sense) with the dbs both when only two clusters are detected and when groups are well separated. Indeed, while the uncertainty index of Fraley and Raftery basically evaluates the cluster compactness or homogeneity, the density-based Silhouette compares a measure of compactness with a measure of separation.

The density-based Silhouette is proposed as a completely general technique, linked neither to a specific density estimator nor to a clustering method. An application to real and simulated data has shown the ability of the proposed method in recognizing well clustered data, (likely) misclassified or
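Both indices under comparison are simple functions of the estimated posterior probabilities. The sketch below is an illustrative reconstruction based on the verbal description in the text, not the authors' code: the uncertainty of Fraley and Raftery (2002) is one minus the largest posterior, while the dbs is built from the log ratio between the posterior of the assigned cluster and the largest posterior among the remaining clusters, rescaled by the normalization factor mentioned above (the maximum absolute log ratio over the sample); see the paper's (3) for the exact formulation.

```python
import math

def fr_uncertainty(posterior):
    """Fraley-Raftery uncertainty: one minus the largest posterior
    probability of cluster membership for each observation.
    `posterior` is a list of rows, one per observation, summing to one."""
    return [1.0 - max(row) for row in posterior]

def density_based_silhouette(posterior, labels):
    """Sketch of the dbs: normalized log ratio between the posterior
    of the assigned cluster and the largest posterior among the other
    clusters, so that the index lies in [-1, 1]."""
    raw = []
    for row, lab in zip(posterior, labels):
        tau0 = row[lab]                                    # assigned cluster
        tau1 = max(t for k, t in enumerate(row) if k != lab)  # best alternative
        raw.append(math.log(tau0 / tau1))
    norm = max(abs(r) for r in raw) or 1.0   # normalization factor
    return [r / norm for r in raw]
```

An observation assigned against the evidence of the posteriors gets a negative dbs and, correspondingly, a large uncertainty, which illustrates the reverse agreement between the two measures noted in the text.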
less confident observations, and in determining the best partition among several groupings. A ROC analysis has further highlighted good levels of sensitivity and specificity and a generally better diagnostic ability than the original Silhouette.

Fig. 12 From the top: ROC curves for olive oil, wine, simulated two groups and normal data, respectively. The curves correspond to the best partitions detected by each clustering method (displayed in Figs. 3 to 10 and 11). The solid lines refer to the dbs returned by applying AT, the dashed lines correspond to MCLUST, the dotted line is the ROC curve applied on the Silhouette information of the Ward partition

Acknowledgements The author wishes to gratefully acknowledge the Associate Editor and the reviewers for their useful remarks. A special thanks to professor Nicola Torelli for his comments and to professor Adelchi Azzalini for his suggestions that greatly improved the presentation of this paper.

Supplementary information: Enlarged figures and some supplementary material are available at http://www2.units.it/~nirdses/sito_inglese/working papers/files for wp/wp125.pdf.

References

Ankerst, M., Breunig, M.M., Kriegel, H.-P., Sander, J.: OPTICS: ordering points to identify the clustering structure. In: Proc. ACM SIGMOD Int. Conf. on Manag. Data (SIGMOD '99), pp. 49–60 (1999)
Azzalini, A., Torelli, N.: Clustering via nonparametric density estimation. Stat. Comput. 17, 71–80 (2007)
Banfield, J.D., Raftery, A.E.: Model-based Gaussian and non-Gaussian clustering. Biometrics 49, 803–821 (1993)
Bensmail, H., Celeux, G., Raftery, A.E., Robert, C.P.: Inference in model-based cluster analysis. Stat. Comput. 7, 1–10 (1997)
Bezdek, J.C., Pal, N.R.: On cluster validity for the fuzzy c-means model. IEEE Trans. Fuzzy Syst. 3, 190–193 (1995)
Binder, D.A.: Bayesian cluster analysis. Biometrika 65, 31–38 (1978)
Binder, D.A.: Approximations to Bayesian clustering rules. Biometrika 68, 275–285 (1981)
Chang, W.C.: On using principal components before separating a mixture of two multivariate normal distributions. Appl. Stat. 32, 267–275 (1983)
Cuevas, A., Febrero, M., Fraiman, R.: Cluster analysis: a further approach based on density estimation. Comput. Stat. Data Anal. 36, 441–459 (2001)
Cutler, A., Windham, M.P.: Information-based validity functionals for mixture analysis. In: Bozdogan, H. (ed.) Proc. 1st US/Japan Conf. Front. Stat. Model. Kluwer Academic, Norwell (1994)
Davies, D., Bouldin, D.: A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intell. 1, 224–227 (1979)
Duda, R.O., Hart, P.E.: Pattern Classification and Scene Analysis. Wiley, New York (1973)
Dunn, J.: Well separated clusters and optimal fuzzy partitions. J. Cybern. 57, 3–32 (1974)
Dy, J.G., Brodley, C.: Feature selection for unsupervised learning. J. Mach. Learn. Res. 5, 845–889 (2004)
Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proc. 2nd Int. Conf. Knowl. Discov. Data Min. (KDD-96). AAAI Press, Menlo Park (1996)
Fawcett, T.: An introduction to ROC analysis. Pattern Recognit. Lett. 27, 861–874 (2006)
Forina, M., Armanino, C., Lanteri, S., Tiscornia, E.: Classification of olive oils from their fatty acid composition. In: Martens, M., Russwurm, H.J. (eds.) Food Research and Data Analysis, pp. 189–214. Appl. Sci., London (1983)
Forina, M., Armanino, C., Castino, M., Ubigli, M.: Multivariate data analysis as a discriminating method of the origin of wines. Vitis 25, 189–201 (1986)
Fraley, C., Raftery, A.E.: Model-based clustering, discriminant analysis and density estimation. J. Am. Stat. Assoc. 97, 611–631 (2002)
Fraley, C., Raftery, A.E.: MCLUST version 3 for R: normal mixture modeling and model-based clustering. Tech. Rep. 504, Univ. of Washington, Dep. of Stat. (2006)
Handl, J., Knowles, J., Kell, D.B.: Computational cluster validation in post-genomic data analysis. Bioinformatics 21(15), 3201–3212 (2005)
Hartigan, J.A.: Clustering Algorithms. Wiley, New York (1975)
Hubert, L.J., Schultz, J.W.: Quadratic assignment as a general data analysis strategy. Br. J. Math. Stat. Psychol. 29, 190–241 (1976)
Kaufman, L., Rousseeuw, P.J.: Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York (1990)
Maulik, U., Bandyopadhyay, S.: Performance evaluation of some clustering algorithms and validity indices. IEEE Trans. Pattern Anal. Mach. Intell. 24, 1650–1654 (2002)
McLachlan, G.J., Peel, D.: Robust cluster analysis via mixtures of multivariate t-distributions, pp. 658–666. Springer, Berlin (1998)
Rousseeuw, P.J.: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987)
Sander, J., Ester, M., Kriegel, H.P., Xu, X.: Density-based clustering in spatial databases: the algorithm GDBSCAN and its applications. Data Min. Knowl. Discov. 2, 169–194 (1998)
Stuetzle, W.: Estimating the cluster tree of a density by analyzing the minimal spanning tree of a sample. J. Classif. 20, 25–47 (2003)
Xie, X.L., Beni, G.: A validity measure for fuzzy clustering. IEEE Trans. Pattern Anal. Mach. Intell. 13, 841–847 (1991)
