
International Journal of Machine Learning and Computing, Vol. 8, No. 1, February 2018

Recursive Hierarchical Clustering Algorithm

Pavani Y. De Silva, Chiran N. Fernando, Damith D. Wijethunge, and Subha D. Fernando

Abstract—The ultimate objective of data mining is to extract information from large datasets and to utilize the extracted information in the decision-making process. Clustering is the most generic approach among the unsupervised algorithms in data mining; it groups data so that objects with similar statistical properties fall into the same cluster. Hierarchical, partition, grid and spectral algorithms are such clustering algorithms under the unsupervised approach. Many of these approaches produce clusters either according to a predefined value or according to their own internal criteria, or they produce hierarchies and let the user determine the preferred number of clusters. Selecting an appropriate number of clusters for a given problem is a crucial factor that determines the success of the approach. This paper proposes a novel recursive hierarchical clustering algorithm which combines the core concepts of hierarchical clustering and decision tree fundamentals to find, autonomously, the optimal number of clusters that suits the given problem.

Index Terms—Decision tree, gain ratio, gini-gain, gini-index, hierarchical clustering.

Manuscript received October 23, 2017; revised January 25, 2018. This work was supported in part by the Faculty of Information Technology, University of Moratuwa. The authors are with the University of Moratuwa, Sri Lanka. doi: 10.18178/ijmlc.2018.8.1.654

I. INTRODUCTION

Clustering is a foremost unsupervised learning technique that comes under data mining. Clustering organizes elements into groups according to their statistical similarities. The fundamentals of clustering require that the components of a cluster be highly similar to each other, that components of different clusters be highly different from each other, and that these similarity and dissimilarity measures have a distinct and practical meaning [1], [2]. These similarity and dissimilarity measures are computed using Euclidean distance, cosine similarity or Jaccard similarity [3], depending on the datatype the problem involves. Cosine similarity measures the distance between two vectors, Jaccard distance measures the distance between two sets, and Euclidean distance measures the distance between two points [4].

Applications of clustering vary from grouping numerical data points to analyzing large corpora of unlabeled data. For example, clustering is useful for fundamental marketing tasks such as identifying groups of users with similar behavior from a large corpus of data containing past buying patterns and consumer behaviors. It is used in biology to create taxonomies of living beings. Information retrieval uses clustering algorithms to group the search results of a query based on an aspect of the query. Cluster analysis is used to find patterns in the climate of the atmosphere and ocean [5]. Furthermore, clustering can also be used to find groups of genes that are likely to respond to the same treatment, or to detect communities in huge groups of people.

However, many of these approaches expect the number of clusters to be predetermined; others determine the number of clusters themselves, based on the deviations among the clusters. Some clustering algorithms expose the distance or radius between clusters, enabling the number of clusters to be chosen by the user. Additionally, decision-tree-like approaches that use entropy or gini indexes split the data into hierarchies, letting the user decide the level at which to prune the tree.

This paper proposes an efficient clustering technique which addresses the pitfalls and limitations of existing clustering algorithms, mainly in determining the number of clusters to be produced. The proposed recursive hierarchical approach takes the gini-index from decision trees and enables finding the optimal number of clusters with a suitable radius and gini-index.

II. ANALYSIS OF EXISTING CLUSTERING ALGORITHMS

The existing key clustering algorithms have been reviewed with the purpose of highlighting their key performance characteristics, strengths and limitations.

A. K Means Clustering

K means clustering is one of the simplest unsupervised clustering algorithms. In K means clustering the number of clusters into which the dataset is to be divided has to be known in advance. In this approach a fixed set of centroids is defined initially by partitioning the data set into k clusters arbitrarily. An iterative procedure then assigns data points to clusters by calculating the similarity between the data points and the cluster centroids, based on Euclidean distance or a similar measure, until the clusters of two consecutive iterations are identical and no longer change. K means clustering is fast, robust and easy to understand [3]. To determine the optimal k value for a dataset, an experiment comparing the average distance to the centroid for increasing k values has been conducted. As per the findings, once the number of clusters exceeds the optimal number for the dataset, the average distance to the centroid decreases only by a very small amount as k increases further [3].

The weaknesses associated with k means clustering include the following. With small sample data sets it is difficult to cluster data accurately. The algorithm requires an a priori specification of the number of clusters, obtained for instance with another algorithm such as Self-Organizing Maps. The resulting clusters are circular or spherical in shape because Euclidean distance is used. For high-dimensional data there is no knowledge of which variable contributes most to the clustering process. The K means algorithm fails for categorical data. Because of its exclusive assignment, if there are data that overlap heavily, k-means cannot resolve the clusters distinctly [5]. Each iteration of the K means algorithm has complexity O(kn), but the number of iterations can be very large, so Bradley-Fayyad-Reina (BFR) has been proposed as an alternative to overcome this issue [3]. The storage requirement of the k means algorithm is O((m+K)n), where m represents the number of data points and n is the dimension of the attributes.
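As a concrete illustration of the iterative assignment-and-update loop described above, the following minimal NumPy sketch implements a standard Lloyd-style k means iteration; the function and variable names (kmeans, X, k, max_iter) are illustrative and not taken from the paper.

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    # Arbitrary initialization: pick k distinct points as the initial centroids.
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: recompute centroids; stop when they no longer move.
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

Each pass of the loop performs the O(kn) distance computation noted above; the loop stops as soon as two consecutive iterations produce the same centroids.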
B. Self-Organizing Maps (SOM)

Self-Organizing Maps are based on competitive learning methods. A SOM is an unsupervised learning technique which non-linearly projects multi-dimensional data onto a two-dimensional plot, a process known as vector quantization. In SOM no human intervention is needed for the learning process, and little needs to be known about the input data. SOM measures the similarity among data points in terms of statistical measures and creates a network in such a way that the topological relationships within the data set are preserved [7].

Fig. 1. Network architecture of self-organizing maps.

Kohonen SOMs are a type of neural network whose architecture is depicted in Fig. 1. Each computational layer node is connected to each input layer node. Nodes in the computational layer are connected only to represent adjacency; there are no weighted connections between the nodes in that layer. Each node has a specific topological position and a vector of weights of the same dimension as the input vector. SOM is based mainly on three steps: competition, cooperation and synaptic adaptation. The competition step finds the winning neuron for an input: the winning neuron is the best matching unit, determined by the minimum Euclidean distance or the highest inner product between the weight vector and the input vector [8].

The cooperation step then finds the neighboring neurons affected by the winning neuron. It relies on two facts: the search space, or neighborhood function, should shrink as the number of iterations increases, and the effect of the winning neuron on a neighboring neuron should decrease as the distance between the winning neuron and that neighbor increases. Finally, synaptic adaptation adjusts the weights of the neighboring neurons according to the effect of the winning neuron [9].

The main advantages of Self-Organizing Maps include the ability to project multivariate data onto a two-dimensional plot while providing an intelligible, useful summary of the data [7]. SOM is simple, easy to understand, and can evaluate its own quality. The disadvantages associated with SOM include the possibility of distortions, because high-dimensional data cannot always be represented faithfully in a two-dimensional plot. To overcome the distortion problem, the training rate and the neighborhood radius have been reduced slowly in many studies; consequently, for successful memory map development SOM needs a high number of training iterations [7]. Getting the right data is another problem associated with this algorithm, because every attribute or dimension of every data point needs to have a value, and missing data is one of the major problems found in datasets. SOM clusters data so that data points are surrounded by similar data points, but similar points are not always adjacent or close in this method [10].
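The three steps above map directly onto a short training loop. The following NumPy sketch is a minimal illustration of competition (finding the best matching unit), cooperation (a shrinking Gaussian neighborhood) and synaptic adaptation (the weight update); the grid size, learning rate and decay schedule are illustrative choices, not values given in the paper.

import numpy as np

def train_som(X, rows=10, cols=10, n_iter=1000, lr0=0.5, seed=0):
    rng = np.random.default_rng(seed)
    dim = X.shape[1]
    weights = rng.random((rows, cols, dim))
    # Grid coordinates of every node, used by the neighborhood function.
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)
    radius0 = max(rows, cols) / 2.0
    for t in range(n_iter):
        x = X[rng.integers(len(X))]
        # Competition: the best matching unit has the minimum Euclidean distance.
        dists = np.linalg.norm(weights - x, axis=2)
        bmu = np.unravel_index(dists.argmin(), dists.shape)
        # Cooperation: neighborhood radius and learning rate shrink with the iteration count.
        frac = t / n_iter
        radius = radius0 * (1.0 - frac) + 1e-9
        lr = lr0 * (1.0 - frac)
        grid_dist2 = ((grid - np.array(bmu)) ** 2).sum(axis=-1)
        influence = np.exp(-grid_dist2 / (2.0 * radius ** 2))
        # Synaptic adaptation: pull neighboring weights towards the input.
        weights += lr * influence[..., None] * (x - weights)
    return weights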
the distance between two clusters as the shortest distance
Kohenan SOMs are a type of neural network. Network between two points in each cluster. While the distance
architecture of SOM is depicted in Fig. 1. Each between two clusters is defined as the longest distance
computational layer node is connected to each input layer between two points in each cluster in complete linkage. And
node. Nodes in the computational layer are connected only to the distance between two clusters is defined as the average
represent the adjacency and there is no weighted connection distance between each point in one cluster to every point in
between the nodes in that layer. Each node has a specific the other cluster in average linkage [12]. Single linkage
topological position and a vector of weights as of the same suffers from chaining because only two close points are
dimension as the weight vector. SOM is based on mainly needed to merge two clusters and the cluster may spread due
three steps: competition, cooperation and synaptic adaption. to data points far away in the cluster. Complete linkage
Competition process is about finding the winning neuron for avoids chaining but is subjected to crowding. Clusters can be
an input data. Winning neuron is the best matching unit and is closer to other clusters than to the clusters to which distance
calculated using the minimum Euclidean distance or the is calculated [13].
highest inner product of the weight vector and the input Agglomerative hierarchical clustering
vector [8].
Bottom up hierarchical clustering approach is referred to as
Then it finds the neighboring neurons affected by the
agglomerative clustering approach. This method treats the
winning neuron. It concentrates on the facts that search space
one data point as a single cluster and merges the clusters
or the neighboring function should decrease when the number
considering the similarity between individual clusters until a
of iterations increase and the effect of the winning neuron to
single cluster containing all the data points is constructed.
the neighboring neuron should decrease when the distance
Considering a set of 𝑁 data points, a distance matrix of
between the winning neuron and the neighboring neurons
increases. Finally, Synaptic adaption is adjusting the weights 𝑁 ∗ 𝑁 points is created and the basic algorithm of
agglomerative hierarchical clustering can be described as
of the neighboring neurons based on the effect to the winning
follows:
neuron [9].
a) Initialize with N clusters, where a single data point is
Main advantage of Self Organizing Maps includes its represented by one cluster

2
International Journal of Machine Learning and Computing, Vol. 8, No. 1, February 2018

b) Determine the pair of clusters with the least distance that a tree considering the underlying rules used primarily for
is the closest pair of clusters and unify them into a single classification and prediction. There exist numerous decision
cluster, so that it results with one cluster less than the tree methods namely ID3, CART, C4.5, PUBLIC, CN2,
original number of clusters. SPRINT etc. [20]. This paper analyses only ID3, C4.5 and
c) Calculate pairwise distances (similarities) between the CART decision tree development methods in detail as they
clusters at current that is the newly formed cluster and are the key approaches under Decision Trees.
the priory available clusters. In a top down recursive approach decision trees analyses the
d) Repeat steps 𝑏 and 𝑐 until all data samples are merged attributes in the internal nodes of the tree and predicts the
into a single cluster of size 𝑁 [14-16]. downward attributes based on the attributes of the node and
Complexity of agglomerative clustering algorithm is concludes the classification process using the leaf nodes. It
𝑂(𝑛3 ) in general case. Therefor this makes it less appropriate breaks down the data set into smaller subsets while
for larger datasets due to its high time consumption and high incrementally building a decision tree. Decision trees can
processing time [17]. handle both categorical and numerical data and uses either
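For reference, SciPy ships an implementation of exactly this bottom-up procedure; the short sketch below builds the merge tree and then cuts it into a fixed number of clusters (the sample data, the choice of complete linkage and the cut at three clusters are arbitrary illustrations).

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(20, 2))
# Each row of Z records one merge of step b): the two clusters joined and their distance.
Z = linkage(X, method="complete", metric="euclidean")
# Cut the dendrogram so that at most three clusters remain.
labels = fcluster(Z, t=3, criterion="maxclust")

Plotting scipy.cluster.hierarchy.dendrogram(Z) draws the kind of tree that the recursive algorithm of Section III analyses.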
Divisive hierarchical clustering

This is a top-down approach to constructing a hierarchical tree of data points. The entire data sample set is initially considered as one cluster, and clusters are divided using a second, flat clustering approach. This method is more complex than the bottom-up approach because a second method is needed to split the clusters; most of the time the k means clustering approach is used as the flat clustering algorithm. The divisive approach often produces more accurate results than bottom-up approaches [18]. The algorithm of divisive hierarchical clustering is as follows:
a) Initiate the process with one cluster containing all the samples.
b) Select the cluster with the widest diameter as the largest cluster.
c) Detect the data point in the cluster found in step b) with the minimum average similarity to the other elements of that cluster.
d) The data sample found in c) is the first element to be added to the fragment group.
e) Detect the element in the original group which has the highest average similarity with the fragment group.
f) If the average similarity of the element detected in e) with the fragment group is greater than its average similarity with the original group, assign the data sample to the fragment group and go to step e); otherwise do nothing.
g) Repeat steps b) through f) until each data point is separated into an individual cluster [19].
The complexity of the divisive hierarchical clustering algorithm is O(2^n) [17].
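The splitting rule above grows a fragment group by average similarity. As a simpler top-down illustration only, the sketch below uses the common bisecting variant, repeatedly splitting the widest cluster in two with k means, so it shows the top-down idea rather than the authors' exact rule; the helper name bisecting_clusters and the spread criterion are assumptions.

import numpy as np
from sklearn.cluster import KMeans

def bisecting_clusters(X, n_clusters=4, seed=0):
    clusters = [np.arange(len(X))]            # start with one cluster holding every sample
    while len(clusters) < n_clusters:
        # Only clusters with at least two points can be split further.
        candidates = [i for i, idx in enumerate(clusters) if len(idx) >= 2]
        if not candidates:
            break
        # Pick the "widest" cluster, here measured by its total squared deviation.
        spreads = {i: ((X[clusters[i]] - X[clusters[i]].mean(axis=0)) ** 2).sum() for i in candidates}
        widest = clusters.pop(max(spreads, key=spreads.get))
        # Split it into two flat clusters and put the halves back.
        halves = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(X[widest])
        clusters.append(widest[halves == 0])
        clusters.append(widest[halves == 1])
    return clusters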
Hierarchical clustering algorithms do not have any problems with choosing initial points or getting stuck in local minima. They are, however, expensive in storage and computational complexity. Since all merges are final and the user has no control over the process, the algorithms may produce erroneous results on noisy and high-dimensional data. The problems identified above can be suppressed to some extent by initially clustering the data with partitive clustering algorithms [6]. In hierarchical clustering there is no built-in way to detect the optimal number of clusters; the number of clusters is identified by cutting the dendrogram at a similarity level, so that the similarity among the clusters at that point is the y-axis value at the cut. Furthermore, hierarchical clustering algorithms are concerned only with the inter-cluster and intra-cluster variations of the data.
D. Decision Tree

A decision tree is a classification algorithm which constructs a tree from the underlying rules and is used primarily for classification and prediction. There exist numerous decision tree methods, namely ID3, CART, C4.5, PUBLIC, CN2, SPRINT, etc. [20]. This paper analyses only the ID3, C4.5 and CART methods in detail, as they are the key approaches among decision trees. In a top-down recursive approach, a decision tree analyses the attributes at the internal nodes of the tree, predicts the downward attributes based on the attributes of the node, and concludes the classification process at the leaf nodes. It breaks the data set down into smaller subsets while incrementally building the tree. Decision trees can handle both categorical and numerical data and use either a depth-first greedy approach or a breadth-first search to find the suitable split [21].
Iterative Dichotomiser 3 (ID3) algorithm

The ID3 algorithm uses entropy and information gain to build the decision tree. Entropy measures the homogeneity of a sample dataset. To build the tree, the entropy is first calculated using one attribute factor, then using two attribute factors, and so on. The information gain is the decrease in entropy after the dataset is split on an attribute, and constructing a decision tree amounts to finding the attribute that returns the highest information gain. The attribute factor with the highest information gain is used to split the data set at the first level, and the algorithm continues iteratively until a pure node is reached. The algorithm can be described as follows:
a) Calculate the entropy of the target split of the data using (1),

E(S) = − Σ_{i=1}^{c} P_i log2 P_i    (1)

where i ranges over the target classes of the split, P_i is the probability of target class i and E(S) is the entropy of the target split.
b) Split the dataset on the different attributes and calculate the entropy of each branch. The entropies of the branches are added proportionally to obtain the entropy of the split.
c) The resulting entropy is subtracted from the previous entropy to obtain the information gain, calculated using (2),

Gain(T, X) = Entropy(T) − Entropy(T, X)    (2)

where T and X are attributes.
d) Choose the attribute with the highest information gain as the decision node, then divide the dataset by its branches and repeat the same process on every branch.
e) Splits with zero entropy need not be divided further, while splits with entropy higher than zero are divided further down to leaf nodes.
The computational complexity of the ID3 algorithm is a linear function of the product of the number of examples, the number of characteristics and the number of nodes [22]. ID3 does not achieve high accuracy on noisy or highly detailed datasets, so preprocessing of the data is important before training. The main drawback of this algorithm is that the 'Gain' measurement usually favors attributes with many unique values [21], and the algorithm is highly sensitive to noise.
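Equations (1) and (2) translate directly into a few lines of code. The sketch below computes the entropy of a labelled sample and the information gain of splitting on one categorical attribute; the toy arrays and the function names entropy and information_gain are illustrative.

import numpy as np

def entropy(labels):
    # E(S) = -sum_i P_i log2 P_i over the class proportions, eq. (1).
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(attribute, labels):
    # Gain(T, X) = Entropy(T) - Entropy(T, X), eq. (2): total entropy minus the
    # proportionally weighted entropy of the branches created by attribute X.
    weighted = 0.0
    for value in np.unique(attribute):
        mask = attribute == value
        weighted += mask.mean() * entropy(labels[mask])
    return entropy(labels) - weighted

outlook = np.array(["sunny", "sunny", "rain", "overcast", "rain"])
play = np.array(["no", "no", "yes", "yes", "yes"])
print(information_gain(outlook, play))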
C4.5 algorithm

This algorithm is the successor of the ID3 algorithm and uses the gain ratio as the splitting criterion instead of the information gain [23]: in C4.5 the information gain is normalized by the split information [24]. The gain ratio of an attribute is given in (3), and the split information is defined in (4), where p_j is the probability of the jth split and v is the total number of splits of the target dataset,

Gain ratio(A) = Gain(A) / Split Info(A)    (3)

Split Info(A) = − Σ_{j=1}^{v} p_j log2(p_j)    (4)

The advantages of C4.5 include handling both discrete and continuous attributes: it defines a threshold for a continuous attribute and splits the list into the values above the threshold and the values below or equal to it. The algorithm can deal with datasets that contain patterns and unknown values, and it omits missing values when calculating gain and entropy [21].
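Building on the entropy helper above, the following sketch adds equations (3) and (4); again the function names and the toy data are illustrative, and the helper is repeated so the block stands alone.

import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def split_info(attribute):
    # Split Info(A) = -sum_j p_j log2 p_j over the branch proportions, eq. (4).
    _, counts = np.unique(attribute, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def gain_ratio(attribute, labels):
    # Gain ratio(A) = Gain(A) / Split Info(A), eq. (3).
    gain = entropy(labels) - sum(
        (attribute == v).mean() * entropy(labels[attribute == v])
        for v in np.unique(attribute))
    si = split_info(attribute)
    return gain / si if si > 0 else 0.0

outlook = np.array(["sunny", "sunny", "rain", "overcast", "rain"])
play = np.array(["no", "no", "yes", "yes", "yes"])
print(gain_ratio(outlook, play))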
Classification and Regression Trees (CART algorithm)

The CART algorithm deals with both categorical and continuous data when developing a decision tree, and it also handles missing values in the dataset. It uses the Gini index to select the best split. In contrast to ID3 and C4.5, it splits the data set into binary branches [24]. The Gini index is a measure of the impurity of a data sample set and is defined in (5), where P_j is the probability of the jth class of the dataset,

Gini Index = 1 − Σ_j P_j²    (5)

CART trees are grown to maximum size without a stopping criterion, and cost-complexity pruning is then used to prune the tree back split by split [25]. The CART algorithm yields higher accuracy than the other two algorithms [26]; refer to Table I.

Decision trees decrease the cost of predicting the class of data with the addition of each data point, and they can be used for both categorical and numerical data.

TABLE I: COMPARISON OF THE ACCURACY OF DECISION TREE ALGORITHMS [26]

Algorithm   Correctly classified instances   Incorrectly classified instances   Unclassified instances
ID3         50%                              47.5%                              2.50%
C4.5        54.17%                           45.83%                             0%
CART        55.83%                           44.17%                             0%

However, when dealing with categorical data with multiple levels, the information gain is biased in favor of the attributes with the most levels, and decision trees become more complex when dealing with linked outcomes.
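The sketch below writes equation (5) as a function and, for comparison, fits a gini-based, binary-split tree with scikit-learn, whose DecisionTreeClassifier follows a CART-style procedure; the Iris data is used only because it reappears in Section IV.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

def gini_index(labels):
    # Gini = 1 - sum_j P_j^2 over the class proportions, eq. (5).
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - (p ** 2).sum())

iris = load_iris()
print(gini_index(iris.target))                      # impurity of the unsplit dataset
tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(iris.data, iris.target)
print(tree.get_depth(), tree.score(iris.data, iris.target))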
E. Detecting the Optimal Number of Clusters

Elbow method

The elbow method uses the underlying goal of any partitive clustering algorithm, namely to minimize the within-cluster variation.
a) Run the clustering algorithm (e.g., k-means) for different values of k, for instance by varying k from 1 to 10 clusters.
b) For each k, calculate the total within-cluster variation using (6),

Total within-cluster variation = Σ_k W(C_k)    (6)

where C_k is the kth cluster and W(C_k) is its within-cluster variation.
c) The within-cluster variation curve is then plotted against the number of clusters k.
d) The location of a bend (knee) in the plot is generally considered an indicator of the appropriate number of clusters, as shown in Fig. 2.

Fig. 2. Graphical representation of the results of the Elbow method.
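A minimal sketch of steps a) and b), using the k-means inertia (the within-cluster sum of squares) as W(C_k); the data and the k range are arbitrary.

import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))
within = []
for k in range(1, 11):                                   # step a): vary k from 1 to 10
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    within.append(km.inertia_)                           # step b): total within-cluster variation
# Steps c) and d): plot `within` against k and read off the bend, e.g. with
# matplotlib: plt.plot(range(1, 11), within, marker="o").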
Average Silhouette Method

This method measures the quality of a clustering: it quantifies how well each object lies within its cluster, and a high average silhouette width indicates a better clustering. The method plots the average silhouette width for different numbers of clusters and identifies the optimal number of clusters as the one which maximizes the average silhouette width; refer to Fig. 3.

Fig. 3. Graphical representation of the results of the Average Silhouette method.
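A matching sketch of the silhouette criterion, scanning the same k range and keeping the k with the largest average silhouette width; dataset and range are again arbitrary.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.default_rng(0).normal(size=(300, 2))
scores = {}
for k in range(2, 11):                                   # the silhouette needs at least two clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)              # average silhouette width for this k
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])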
III. RECURSIVE HIERARCHICAL CLUSTERING ALGORITHM

In normal agglomerative and divisive hierarchical clustering algorithms, the distance between data points is used for merging them into clusters. In agglomerative hierarchical clustering, a dendrogram is created by merging two individual data points at a time, considering the pairwise distances of the data points or clusters existing at that moment; merging continues until one cluster is formed. Detecting the best number of clusters is then done using experimentation and graphical methods.

In our algorithm, a method is proposed to improve the accuracy of clustering by combining core concepts of decision trees and hierarchical clustering. A local optimum for the number of clusters is detected using a novel hierarchical clustering algorithm.

The proposed recursive hierarchical clustering algorithm initially considers each data point as an individual data point, calculates the pairwise distances and merges the data points with the minimum pairwise distance. Each merged cluster is considered as an individual data point in the next iteration. Distance is calculated using one of the minimum, complete and average linkage methods, where minimum linkage uses the closest distance between two clusters, complete linkage uses the maximum distance between two clusters and average linkage uses the average distance between two clusters. The tree is developed by iteratively merging data points or clusters until a single cluster is formed. (Divisive clustering, by contrast, considers the primary data set as one cluster and divides it iteratively until each single data point forms its own cluster.)

The pairwise distance between data points or data clusters is calculated using the Euclidean distance in (7),

D_ij² = Σ_{v=1}^{n} (X_vi − X_vj)²    (7)

where D_ij is the Euclidean distance between the two data points, X_vi and X_vj are the two data points and v indexes the dimensions. The resulting dendrogram has the form shown in Fig. 4: the x axis of the dendrogram represents the individual data points, while the y axis represents the similarity among the data points, that is, the pairwise distances among the data points or data clusters.

Fig. 4. High level representation of the dendrogram of the recursive hierarchical clustering algorithm.

In the recursive hierarchical clustering algorithm we combine the hierarchical clustering approach and the decision tree to detect the optimal number of clusters. The completed dendrogram is used to calculate the Gini index (Gini impurity) of each split at each level: the dendrogram is treated as a decision tree and the gini index of each split is calculated recursively. The gini index of a split is a measure of impurity. From the gini indexes of the splits, the gini index of a level is calculated; the maximum gini index among the splits of the same level is chosen as the gini index of that level, as in (10). The gini index of a leaf node is calculated using (8),

Gini Index = 1 − Σ_j P_j²    (8)

where j is a class and P_j is the probability of class j (the proportion of samples with class j). The gini index of an internal split is calculated as in (9),

Gini index_split(N) = (S_left / S) · gini(N_left) + (S_right / S) · gini(N_right)    (9)

where S_left is the number of samples in the left node, S_right is the number of samples in the right node, S is the total number of samples, N_left is the left child and N_right is the right child.

Gini_level = Max(Gini index of the splits of the same level)    (10)

The gini difference between two consecutive levels is calculated from the gini indexes of the two levels. The gini difference is a measure of information gain, or of the drop in impurity; we choose the two levels with the highest gini difference, that is, the highest drop, calculated using (11),

Gini_diff = Gini_level(i) − Gini_level(i+1)    (11)

To choose between the upper level and the lower level of the highest gini drop we use the gain ratio. Even though a lower gini index reveals a higher purity of a split, there is a tradeoff between the number of data points in a cluster and the number of clusters: at the leaf nodes the purity of the splits is highest, but each cluster contains only individual elements. The gain ratio is a modification of information gain that reduces this bias; it takes the number and size of branches into account when choosing a split and thus overcomes the tradeoff of the gini index by taking intrinsic information into account. Therefore we calculate the gain ratio for each level using (12) and (13),

Gain ratio(A) = Gini difference(A) / Split Info(A)    (12)

Split Info_A(D) = − Σ_{j=1}^{v} (|D_j| / |D|) · log2(|D_j| / |D|)    (13)

where |D_j| is the number of elements of a class and |D| is the total number of elements. The higher the gain ratio, the better the split. Therefore we choose, from the upper and lower levels of the maximum gini difference, the level with the higher gain ratio. Using the recursive hierarchical clustering algorithm, we thereby select the optimal number of clusters for a dataset.
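The per-split quantities in (8), (9), (12) and (13) can be written down directly, as in the sketch below. The paper does not spell out where the class proportions P_j come from; here they are read from an array of labels attached to the samples in each node (for instance the known species in the Iris experiment of Section IV), which should be treated as an assumption, and (13) is evaluated over the two branch sizes in the C4.5 manner.

import numpy as np

def gini(labels):
    # Eq. (8): 1 - sum_j P_j^2 over the class proportions in one node.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - (p ** 2).sum())

def gini_split(left_labels, right_labels):
    # Eq. (9): size-weighted gini of the two children of a split.
    s_left, s_right = len(left_labels), len(right_labels)
    s = s_left + s_right
    return (s_left / s) * gini(left_labels) + (s_right / s) * gini(right_labels)

def split_info(left_labels, right_labels):
    # Eq. (13), computed here over the two branch sizes of the split.
    sizes = np.array([len(left_labels), len(right_labels)], dtype=float)
    p = sizes / sizes.sum()
    return float(-(p * np.log2(p)).sum())

def gain_ratio(gini_difference, left_labels, right_labels):
    # Eq. (12): the gini difference (the gain) normalized by the split information.
    si = split_info(left_labels, right_labels)
    return gini_difference / si if si > 0 else 0.0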
Recursive Hierarchical clustering algorithm

Input: a data set of 'd' data points.

Until d == 1:
    Calculate the pairwise distances among the data points or clusters using the Euclidean distance.
    Merge the two data points or data clusters with the least pairwise distance.
End Until
Compose the dendrogram of the hierarchical tree.
For each split:
    Calculate the gini index:
        If the split is a leaf node, use Gini Index = 1 − Σ_j P_j²  (8)
        Else, use Gini index_split(N) = (S_left / S) · gini(N_left) + (S_right / S) · gini(N_right)  (9)
For each level of the dendrogram:
    Calculate the gini index of the level: Gini_level = Max(Gini index of the splits of the same level)  (10)
    Calculate the gain ratio of the level using (12) and (13).
Calculate the gini differences between consecutive levels using (11).
Select the two levels which give the highest gini difference.
Best split = the level, of these two, with the higher gain ratio.
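As a rough end-to-end illustration of this loop (not the authors' implementation), the sketch below builds the dendrogram with SciPy, reads each "level" as the flat clustering obtained with a given number of clusters, scores levels with the gini of (8) and (10) against class labels y (the same assumption flagged before the previous sketch), and returns the two candidate levels around the largest impurity drop; the final choice between them would then use the gain ratio of (12), as in the pseudocode.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def level_gini(cluster_ids, y):
    # Gini of a level (eq. 10): the maximum over its clusters of eq. (8).
    ginis = []
    for c in np.unique(cluster_ids):
        _, counts = np.unique(y[cluster_ids == c], return_counts=True)
        p = counts / counts.sum()
        ginis.append(1.0 - float((p ** 2).sum()))
    return max(ginis)

def candidate_levels(X, y, linkage_method="average"):
    Z = linkage(X, method=linkage_method)                 # the merge loop of the pseudocode
    n = len(X)
    # Level k = the flat clustering obtained by cutting the dendrogram into k clusters.
    g = {k: level_gini(fcluster(Z, t=k, criterion="maxclust"), y) for k in range(1, n + 1)}
    # Eq. (11): impurity drop between consecutive levels; keep the largest drop.
    diffs = {k: g[k] - g[k + 1] for k in range(1, n)}
    k_upper = max(diffs, key=diffs.get)
    return k_upper, k_upper + 1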

IV. EXPERIMENTS AND RESULTS

Experiments have been conducted on clustering many datasets using multiple clustering techniques, and the results have been cross validated against the results of the recursive hierarchical clustering algorithm. The clustering results for the Fisher's Iris dataset with K means clustering, hierarchical clustering and Self-Organizing Maps are summarized and discussed here.

In K means clustering we have to predefine the number of clusters as an input to the algorithm. This is identified as a pitfall of K means clustering, because the probable distribution of clusters has to be identified with another clustering approach, such as SOM, to obtain better information gain.

Fig. 5. Flow diagram of the recursive hierarchical clustering algorithm.

K means clustering with differing numbers of clusters was tested: the distributions of the Fisher's Iris data over two, three and four clusters were compared with the results obtained using the recursive hierarchical clustering algorithm, and are graphically depicted in Fig. 6. Recursive hierarchical clustering detects the optimal cluster number for the Fisher's Iris dataset as 3, and the distribution of samples among its clusters is almost identical to that of K means clustering with 3 clusters.

Fig. 6. Results of cluster sample variation using multiple clustering techniques.

The optimal number of clusters of a dataset differs with the choice of linkage method in the recursive hierarchical clustering algorithm. The Fisher's Iris dataset was used to conduct experiments varying the linkage method, and the results are summarized in Fig. 7.

Fig. 7. Variation of the optimal clusters with linkage variation.

SOM uses statistical measures such as the mean and variance for the competitive learning process, and Euclidean distance is the basis for calculating them; therefore the data become meaningless when they are encoded into a binary format.

The K means algorithm also clusters data based on the Euclidean distance. There is no explicit control over the algorithm; the only control is the ability to define the number of clusters explicitly. Yet even though we define the number of clusters explicitly, the optimal number of clusters should be derived by considering the intra-cluster variation among the data samples of each cluster and the inter-cluster variation between the clusters.

Decision trees focus on the gini index and the information gain when the tree is developed. A decision tree considers the number of samples contained within the clusters, but it does not consider the variance of the data of a single branch, that is, the spread of the data. Therefore decision trees give control over the number of samples within a cluster but not over the spread or variance of the data in the clusters.

Hierarchical clustering is concerned with the variance of the clusters: the sigma value of the dendrogram gives the variance within the clusters. But this algorithm does not consider the information gain of the data.

The recursive hierarchical clustering algorithm is concerned with all of these factors: information gain, gini index and the variance or spread of the data in the clusters. The algorithm finds clusters with similar variances or spreads of data. Each cut along the y axis of the dendrogram detects clusters of a given sigma (the y-axis value at the cut) variation; we then choose the level with the highest information gain to detect the optimal number of clusters. The algorithm thus combines the features of hierarchical clustering and the decision tree, and it can be used to cluster both labeled data and real data.
REFERENCES
[1] F. Long, H. Zhang, and D. D. Feng, "Fundamentals of content-based image retrieval," Multimedia Information Retrieval and Management, Springer Berlin Heidelberg, 2003, pp. 1-26.
[2] C. H. C. Leung and Y. Li, Semantic Image Retrieval, 2015.
[3] A. Alzu'bi, A. Amira, and N. Ramzan, "Semantic content-based image retrieval: A comprehensive study," Journal of Visual Communication and Image Representation, vol. 32, pp. 20-54, 2015.
[4] C. H. Lampert, H. Nickisch, and S. Harmeling, "Learning to detect unseen object classes by between-class attribute transfer," CVPR, 2009.
[5] Attribute Datasets. (April 10, 2017). [Online]. Available: https://www.ecse.rpi.edu/homepages/cvrl/database/AttributeDataset.htm
[6] J. Deng et al., "ImageNet: A large-scale hierarchical image database," in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248-255.
[7] M. R. Anderberg, "Cluster analysis for application," Academic Press, 1973.
[8] P. N. Tan, "Cluster analysis: Basic concepts and algorithms," Introduction to Data Mining, pp. 48-559, 2006.
[9] O. Maimon and L. Rokach, "Data mining and knowledge discovery handbook," Springer Science Business Media Inc., pp. 321-352, 2005.
[10] A. Huang, "Similarity measures for text document clustering," NZCSRSC 2008, Christchurch, New Zealand, April 2008.
[11] L. V. Bijuraj, "Clustering and its applications," in Proc. National Conference on New Horizons in IT – NCNHIT, 2013.
[12] D. M. Blei. (September 5, 2007). Clustering and the k-means algorithm. [Online]. Available: http://www.cs.princeton.edu/courses/archive/spr08/cos424/slides/clustering-1.pdf
[13] P. Stefanovic and O. Kurasova, "Visual analysis of self-organizing maps," Nonlinear Analysis: Modelling and Control, vol. 16, no. 4, pp. 488-504, 2011.
[14] S. Krüger. Self-Organizing Maps. [Online]. Available: http://www.iikt.ovgu.de/iesk_media/Downloads/ks/computational_neuroscience/vorlesung/comp_neuro8-p-2090.pdf
[15] T. Kohonen, Self-Organizing Maps, 3rd ed., Springer Ser. Inf. Sci., Springer-Verlag, Berlin, 2001.
[16] A. K. Mann and N. Kaur, "Survey paper on clustering techniques," International Journal of Science, Engineering and Technology Research, vol. 2, issue 4, April 2013.
[17] R. Tibshirani. (September 14, 2009). Distances between clustering, hierarchical clustering. [Online]. Available: http://www.stat.cmu.edu/~cshalizi/350/lectures/08/lecture-08.pdf
[18] S. Sayad. Hierarchical clustering. [Online]. Available: http://www.saedsayad.com/clustering_hierarchical.htm
[19] R. Tibshirani. (January 29, 2013). Clustering 2: Hierarchical clustering. [Online]. Available: http://www.stat.cmu.edu/~ryantibs/datamining/lectures/05-clus2-marked.pdf
[20] M. S. Yang, "A survey of hierarchical clustering," Mathl. Comput. Modelling, vol. 18, no. 11, pp. 1-16, 1993.
[21] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: A review," ACM Computing Surveys, vol. 31, no. 3, September 1999.
[22] Hierarchical Clustering Algorithms. [Online]. Available: http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/hierarchical.html
[23] K. Sasirekha and P. Baby, "Agglomerative hierarchical clustering algorithm — a review," International Journal of Scientific and Research Publications, vol. 3, issue 3, March 2013.
[24] M. Roux, "A comparative study of divisive hierarchical clustering algorithms," 2015.
[25] N. Rajalingam and K. Ranjini, "Hierarchical clustering algorithm — a comparative study," International Journal of Computer Applications, vol. 19, no. 3, pp. 0975-8887, April 2001.
[26] T. M. Lakshmi, A. Martin, R. M. Begum, and V. P. Venkatesan, "An analysis on performance of decision tree algorithms using student's qualitative data," I. J. Modern Education and Computer Science, vol. 5, pp. 18-27, June 2013.

Pavani Y. De Silva was born in Colombo, Sri Lanka in October 1992. She received her BSc. (Hons.) degree in information technology with first class honors from the University of Moratuwa, Sri Lanka in 2017. She works as a software engineer at IFS R & D International (Pvt) Ltd. Her current research interests include machine learning, data mining, big data analytics and data science.

Chiran N. Fernando was born in Sri Lanka in November 1991. He graduated from the University of Moratuwa, Sri Lanka with a BSc. (Hons.) degree in information technology in 2017. He has worked at Virtusa Polaris Pvt. Ltd. as a software engineer intern and currently works as a software engineer at WSO2 Lanka (Pvt) Ltd. His current research interests are machine intelligence, deep learning and artificial neural networks.

Damith D. Wijethunge was born in September 1991 in Kandy, Sri Lanka. He earned his bachelor's honours degree in the field of information technology from the University of Moratuwa in 2017. He has worked at Virtusa Polaris Pvt Ltd and SquareMobile Pvt Ltd as a software engineer intern, and is currently working at DirectFN Ltd, Sri Lanka as a software engineer. He is interested in image processing, deep learning, artificial neural networks, big data analytics and data mining.

Subha D. Fernando graduated from the University of Kelaniya, Sri Lanka with a BSc. Special (Hons.) in statistics and computer science in 2004, and from Nagaoka University of Technology with a Master of Engineering (M.Eng.) in management information systems science in 2010. She completed her PhD in computational intelligence at Nagaoka University of Technology in 2013. She is the head of the Department of Computational Mathematics, Faculty of Information Technology, University of Moratuwa, Sri Lanka, and the president of the Sri Lanka Association for Artificial Intelligence. Her current research interests are machine learning, deep learning, artificial neural networks, multi-agent systems and intelligent informatics.