Recursive Hierarchical Clustering Algorithm
or spherical in shape due to calculating the Euclidean distance. There is no knowledge of which variable contributes most strongly to the clustering process in the case of high dimensional data. The K-means algorithm fails for categorical data. It uses exclusive assignment, i.e. if there exists data which are highly overlapping then K-means will not be able to resolve the clusters distinctly [5]. Each iteration of the K-means algorithm has a complexity of O(kn), but the number of iterations is very large; therefore Bradley-Fayyad-Reina (BFR) has been proposed as an alternative to overcome this issue [3]. The storage requirement of the K-means algorithm is O((m+K)n), where m represents the number of data points and n is the dimension of the attributes.
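As an illustration of the overlapping-cluster limitation noted above, the following minimal sketch (assuming scikit-learn and NumPy are available; the dataset and parameters are illustrative, not from the paper) runs K-means on two heavily overlapping Gaussian blobs:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Two heavily overlapping Gaussian blobs (illustrative data, not from the paper).
X, y_true = make_blobs(n_samples=500, centers=[[0, 0], [1, 0]],
                       cluster_std=1.5, random_state=0)

# Exclusive (hard) assignment: every point belongs to exactly one cluster,
# so the overlap region is cut by a boundary rather than resolved distinctly.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_

# Fraction of points whose hard assignment agrees with the generating blob
# (up to a possible label permutation).
agreement = max(np.mean(labels == y_true), np.mean(labels != y_true))
print(f"agreement with generating blobs: {agreement:.2f}")
```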
B. Self-Organizing Maps (SOM)
Self-Organizing Maps are based on competitive learning methods. SOM is an unsupervised learning technique which non-linearly projects multi-dimensional data onto a two-dimensional plot, a process known as vector quantization. In SOM, no human intervention is needed for the learning process and little needs to be known about the input data. SOM measures the similarity among data points in terms of statistical measures, and it creates a network in such a way that the topological relationships within the data set are preserved [7].

Fig. 1. Network architecture of self-organizing maps.

Kohonen SOMs are a type of neural network. The network architecture of SOM is depicted in Fig. 1. Each computational layer node is connected to each input layer node. Nodes in the computational layer are connected only to represent adjacency, and there is no weighted connection between the nodes in that layer. Each node has a specific topological position and a vector of weights of the same dimension as the input vector. SOM is based on three main steps: competition, cooperation and synaptic adaptation. The competition process finds the winning neuron for an input data point. The winning neuron is the best matching unit and is determined using the minimum Euclidean distance or the highest inner product of the weight vector and the input vector [8].
The algorithm then finds the neighboring neurons affected by the winning neuron. It relies on the facts that the search space, or neighborhood function, should decrease as the number of iterations increases, and that the effect of the winning neuron on a neighboring neuron should decrease as the distance between the winning neuron and that neighbor increases. Finally, synaptic adaptation adjusts the weights of the neighboring neurons based on the effect of the winning neuron [9].
The main advantages of Self-Organizing Maps include the ability to project multivariate data into a two-dimensional plot while providing an intelligible, useful summary of the data [7]. SOM is easy to understand and is simple, and it can evaluate its own quality. The disadvantages associated with SOM include the possibility of distortions, because high dimensional data cannot always be represented in a two-dimensional plot. To overcome the distortion problem, the training rate and neighborhood radius have been reduced slowly in many studies; therefore, for successful map development, SOM needs a high number of training iterations [7]. Getting the right data is also a problem for this algorithm, because each attribute or dimension of every data point needs to have a value, and missing data is one of the major problems found in datasets. SOM clusters data so that data points are surrounded by similar data points, but similar points are not always together or close in this method [10].
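A minimal NumPy sketch of the three steps described above (competition, cooperation, synaptic adaptation) is given below. It is an illustrative toy implementation, not the authors' code; the grid size, learning-rate and radius schedules are assumptions.

```python
import numpy as np

def train_som(data, grid=(10, 10), n_iter=1000, lr0=0.5, radius0=5.0, seed=0):
    """Toy SOM trainer: competition, cooperation, synaptic adaptation."""
    rng = np.random.default_rng(seed)
    h, w = grid
    dim = data.shape[1]
    weights = rng.random((h, w, dim))   # weight vectors of the same dimension as the input
    coords = np.stack(np.meshgrid(np.arange(h), np.arange(w), indexing="ij"), axis=-1)

    for t in range(n_iter):
        x = data[rng.integers(len(data))]
        # Competition: winning neuron = best matching unit (minimum Euclidean distance).
        dists = np.linalg.norm(weights - x, axis=-1)
        bmu = np.unravel_index(np.argmin(dists), (h, w))
        # Cooperation: learning rate and neighborhood radius shrink with the iterations.
        lr = lr0 * np.exp(-t / n_iter)
        radius = radius0 * np.exp(-t / n_iter)
        grid_dist2 = np.sum((coords - np.array(bmu)) ** 2, axis=-1)
        influence = np.exp(-grid_dist2 / (2.0 * radius ** 2))
        # Synaptic adaptation: pull neighboring weights toward the input,
        # with an effect that decays with distance from the winning neuron.
        weights += lr * influence[..., None] * (x - weights)
    return weights

# Example usage on random 3-dimensional data (illustrative only).
som = train_som(np.random.default_rng(1).random((200, 3)))
```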
C. Hierarchical Clustering
Hierarchical clustering is a clustering approach which constructs a tree of data points by considering the similarity between data points, whereas other clustering algorithms are flat clustering algorithms. Hierarchical clustering returns a set of clusters which is more informative than a flat clustering structure. The number of clusters need not be specified in advance and these algorithms are deterministic. Hierarchical clustering iteratively merges clusters into a single cluster, or divides them down to individual data nodes. A dendrogram is the diagrammatic representation of the arrangement of clusters produced during the hierarchical clustering process. To stop at a predefined number of clusters, the dendrogram has to be cut at a chosen similarity measure [11]. There are two types of hierarchical clustering methods. The distance between clusters or data points is calculated using the Euclidean distance or the Manhattan distance, and the distance between clusters can be calculated using single linkage, average linkage or complete linkage. Single linkage defines the distance between two clusters as the shortest distance between two points in each cluster; in complete linkage the distance between two clusters is defined as the longest distance between two points in each cluster; and in average linkage it is defined as the average distance between each point in one cluster and every point in the other cluster [12]. Single linkage suffers from chaining, because only two close points are needed to merge two clusters, and a cluster may spread due to data points far away within it. Complete linkage avoids chaining but is subject to crowding: a cluster can be closer to other clusters than to the cluster to which the distance is calculated [13].
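The three linkage criteria described above can be compared directly with SciPy; the snippet below is a minimal sketch (the data are illustrative) showing how single, complete and average linkage produce different merge distances on the same points:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

# Illustrative 2-D points (not from the paper).
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [9.0, 0.0]])

# Condensed pairwise Euclidean distance matrix.
D = pdist(X, metric="euclidean")

# Same data, three linkage criteria; the merge heights (third column of Z) differ.
for method in ("single", "complete", "average"):
    Z = linkage(D, method=method)
    print(method, Z[:, 2].round(2))
```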
Agglomerative hierarchical clustering
The bottom-up hierarchical clustering approach is referred to as agglomerative clustering. This method treats each data point as a single cluster and merges the clusters, considering the similarity between individual clusters, until a single cluster containing all the data points is constructed. Considering a set of N data points, an N x N distance matrix is created and the basic algorithm of agglomerative hierarchical clustering can be described as follows:
a) Initialize with N clusters, where a single data point is represented by one cluster.
b) Determine the pair of clusters with the least distance, that is, the closest pair of clusters, and unify them into a single cluster, so that the result has one cluster less than the original number of clusters.
c) Calculate the pairwise distances (similarities) between the current clusters, that is, between the newly formed cluster and the previously available clusters.
d) Repeat steps b and c until all data samples are merged into a single cluster of size N [14]-[16].
The complexity of the agglomerative clustering algorithm is O(n^3) in the general case. This makes it less appropriate for larger datasets due to its high time consumption and processing cost [17].
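A from-scratch sketch of steps a)-d), assuming average linkage and the Euclidean distance (an illustrative implementation of the generic procedure, not the authors' code), makes the cubic behaviour visible: each of the n-1 merges rescans all remaining cluster pairs.

```python
import numpy as np

def naive_agglomerative(X):
    """Steps a)-d): start with N singleton clusters, repeatedly merge the
    closest pair (average linkage, Euclidean distance) until one cluster remains."""
    clusters = [[i] for i in range(len(X))]          # a) N singleton clusters
    merges = []
    while len(clusters) > 1:
        best = (np.inf, None, None)
        for i in range(len(clusters)):               # b) find the closest pair
            for j in range(i + 1, len(clusters)):
                d = np.mean([np.linalg.norm(X[p] - X[q])
                             for p in clusters[i] for q in clusters[j]])
                if d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        clusters[i] = clusters[i] + clusters[j]      # c) merged cluster replaces the pair
        del clusters[j]
    return merges                                    # d) repeated until a single cluster

# Example usage on illustrative points.
history = naive_agglomerative(np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.]]))
print(history)
```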
Divisive hierarchical clustering
This is a top-down approach to constructing a hierarchical tree of data points. The entire data sample set is initially considered as one cluster, and the clusters are then divided using a second, flat clustering approach. This method is more complex than the bottom-up approach because a second method is needed to split the clusters; in most cases K-means is used as the flat clustering algorithm. Most of the time the divisive approach produces more accurate results than bottom-up approaches [18].
The algorithm of divisive hierarchical clustering is as follows.
a) Initiate the process with one cluster containing all the samples.
b) Select the cluster with the widest diameter as the largest cluster.
c) Detect the data point in the cluster found in step b) with the minimum average similarity to the other elements in that cluster.
d) The data sample found in c) is the first element to be added to the fragment group.
e) Detect the element in the original group which reports the highest average similarity with the fragment group.
f) If the average similarity of the element detected in e) with the fragment group is greater than its average similarity with the original group, then assign the data sample to the fragment group and go to step e); otherwise do nothing.
g) Repeat steps b)-f) until each data point is separated into an individual cluster [19].
The complexity of the divisive hierarchical clustering algorithm is O(2^n) [17].
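The splinter step described in b)-f) can be sketched as follows. This is an illustrative DIANA-style single-split routine with assumed average-dissimilarity definitions, not the authors' implementation:

```python
import numpy as np

def split_cluster(X, members):
    """One divisive split (steps b-f): peel a 'fragment' group off a cluster
    whose points are, on average, closer to the fragment than to what remains."""
    members = list(members)
    D = np.linalg.norm(X[members][:, None, :] - X[members][None, :, :], axis=-1)

    # c)-d) seed the fragment group with the point having the lowest average
    # similarity (largest average distance) to the rest of the cluster.
    seed = int(np.argmax(D.sum(axis=1) / (len(members) - 1)))
    fragment = [seed]
    original = [i for i in range(len(members)) if i != seed]

    while len(original) > 1:
        # e) element of the original group closest (on average) to the fragment group.
        cand = min(original, key=lambda i: D[i, fragment].mean())
        d_frag = D[cand, fragment].mean()
        d_orig = D[cand, [j for j in original if j != cand]].mean()
        if d_frag < d_orig:                          # f) move it and repeat from e)
            original.remove(cand)
            fragment.append(cand)
        else:
            break
    return [members[i] for i in original], [members[i] for i in fragment]

# Example usage: split an illustrative cluster of 6 points.
X = np.array([[0., 0.], [0.2, 0.1], [0.1, 0.3], [4., 4.], [4.1, 4.2], [3.9, 4.1]])
print(split_cluster(X, range(len(X))))
```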
Hierarchical clustering algorithms have no problems with choosing initial points or getting stuck in local minima. However, these algorithms are expensive in storage and computational complexity. Since all merges are final and we have no control over the algorithm, it might produce erroneous results on noisy and high dimensional data. The problems identified above can be suppressed to some extent by initially clustering the data using partitive clustering algorithms [6]. In hierarchical clustering there is no direct way to detect the optimal number of clusters. The number of clusters can be identified by cutting the dendrogram at a similarity measure, so that the similarity among the clusters at that point is the y-axis value at the cut. Furthermore, hierarchical clustering algorithms are concerned only with the inter-cluster and intra-cluster variations when clustering data.

D. Decision Tree
A decision tree is a classification algorithm which constructs a tree considering the underlying rules and is used primarily for classification and prediction. There exist numerous decision tree methods, namely ID3, CART, C4.5, PUBLIC, CN2, SPRINT etc. [20]. This paper analyses only the ID3, C4.5 and CART decision tree development methods in detail, as they are the key approaches under decision trees.
In a top-down recursive approach, a decision tree analyses the attributes in the internal nodes of the tree, predicts the downward attributes based on the attributes of the node, and concludes the classification process using the leaf nodes. It breaks down the data set into smaller subsets while incrementally building a decision tree. Decision trees can handle both categorical and numerical data and use either a depth-first greedy approach or a breadth-first search to find the suitable cluster [21].

Iterative Dichotomiser 3 algorithm (ID3)
The ID3 algorithm uses entropy and information gain to build the decision tree. Entropy measures the homogeneity of a sample dataset. To build a decision tree, the entropy is first calculated using one attribute factor; then the entropy is calculated again using two attribute factors, and so on. The information gain is based on the decrease in entropy after a dataset is split on an attribute. Constructing a decision tree is all about finding the attribute that returns the highest information gain; the attribute factor with the highest information gain is used to split the data set at the first level. The algorithm continues iteratively until a pure node is obtained. The algorithm can be explained as follows:
a) Calculate the entropy of the target split of the data using (1).

$E(S) = -\sum_{i=1}^{c} P_i \log_2 P_i$    (1)

where i is the target class of the split of data, $P_i$ is the probability of the target class and E(S) is the entropy of the target split.
b) Split the dataset on different attributes and calculate the entropy for each branch. The calculated entropy of each branch is added proportionally to get the entropy of the split.
c) The resulting entropy is subtracted from the previous entropy to get the information gain, calculated using (2).

$\text{Gain}(T, X) = \text{Entropy}(T) - \text{Entropy}(T, X)$    (2)

where T and X are attributes.
d) Choose the attribute with the highest information gain as the decision node, then divide the dataset by its branches and repeat the same process on every branch.
e) Splits with zero entropy need not be divided further, while splits with entropy higher than zero need to be divided further into leaf nodes.
The computational complexity of the ID3 algorithm is a linear function of the product of the number of examples, the characteristic number and the node number [22]. The ID3 algorithm does not reveal high accuracy rates for noisy or highly detailed datasets; therefore preprocessing of the data is important before training the ID3 algorithm. The main drawback of this algorithm is that it usually favors attributes with unique values when using the 'Gain' measurement [21]. The algorithm is also highly sensitive to noise.
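A compact sketch of the entropy and information-gain computations in steps a)-c), evaluated for one candidate attribute on an assumed toy categorical dataset (not from the paper), is shown below:

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Eq. (1): E(S) = -sum_i P_i log2 P_i over the target classes."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(attribute, labels):
    """Eq. (2): Gain(T, X) = Entropy(T) - Entropy(T, X), where Entropy(T, X)
    is the proportionally weighted entropy of the branches of attribute X."""
    total = entropy(labels)
    branch_entropy = 0.0
    for value in set(attribute):
        subset = [l for a, l in zip(attribute, labels) if a == value]
        branch_entropy += len(subset) / len(labels) * entropy(subset)
    return total - branch_entropy

# Illustrative toy data: an 'outlook' attribute against a play/no-play target.
outlook = ["sunny", "sunny", "overcast", "rain", "rain", "overcast"]
play    = ["no",    "no",    "yes",      "yes",  "no",   "yes"]
print(information_gain(outlook, play))
```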
Classification 4.5 algorithm (C4.5)
This algorithm is the successor of the ID3 algorithm and uses the 'gain ratio' as the splitting criterion instead of 'information gain' [23]. Information gain is normalized using 'split information' in the C4.5 algorithm [24], so C4.5 uses the gain ratio to choose the best splitting attribute instead of information gain. Refer to (3) for the gain ratio of an attribute. Split information is defined in (4), where $p_j$ is the probability of the jth split and v is the total number of splits of a target dataset.

$\text{Gain ratio}(A) = \frac{\text{Gain}(A)}{\text{Split Info}(A)}$    (3)

$\text{Split Info}(A) = -\sum_{j=1}^{v} p_j \log_2(p_j)$    (4)

[…]

b) For each k, calculate the total within-cluster variation using (6).

$\sum_{k=1}^{K} W(C_k)$    (6)

where $C_k$ is the kth cluster and $W(C_k)$ is the within-cluster variation.
c) The curve of the within-cluster variation is then plotted against the number of clusters k.
d) The location of a bend (knee) in the plot is generally considered as an indicator of the appropriate number of clusters, as shown in Fig. 2.
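The within-cluster-variation curve in steps b)-d) can be reproduced with a short sketch (assuming scikit-learn; KMeans' inertia_ is used as the total within-cluster variation, and the data are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative data with three underlying groups.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# b) total within-cluster variation for each k (sum over clusters of W(C_k)).
ks = range(1, 9)
wcv = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

# c)-d) plot wcv against k and look for the bend (knee); printed here instead of plotted.
for k, w in zip(ks, wcv):
    print(k, round(w, 1))
```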
[…] points continues until one cluster is formed. Detecting the best number of clusters is done using experimentation and graphical methods.
In our algorithm, a method has been proposed to improve the accuracy of clustering images by combining core concepts of decision trees and hierarchical clustering. A local optimum for the number of clusters is detected using a novel hierarchical clustering algorithm.
The proposed recursive hierarchical clustering algorithm considers each data point as an individual cluster, calculates the pairwise distances and merges the data points with the minimum pairwise distance. Each merged cluster is considered as an individual data point in the next iteration. The distance is calculated using one of the minimum, complete and average linkage methods, where minimum linkage calculates the closest distance between two clusters, complete linkage calculates the maximum distance between two clusters and average linkage calculates the average distance between two clusters. The tree is developed by iteratively merging data points or clusters until a single cluster is formed. Divisive clustering, in contrast, considers the primary data set as one cluster and divides the data iteratively until every single data point is considered as one cluster.
Refer to (7) for the pairwise distance among data points or data clusters, calculated using the Euclidean distance.

$D_{ij}^2 = \sum_{v=1}^{n} (X_{vi} - X_{vj})^2$    (7)

where $D_{ij}$ is the Euclidean distance between the two data points, $X_{vi}$ and $X_{vj}$ are the two data points, and v is the dimension. The resulting dendrogram will be as in Fig. 4. The x-axis of the dendrogram represents the individual data points, while the y-axis represents the similarity among the data points, that is, the pairwise distances among the data points or data clusters.
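The pairwise-distance computation in (7) and the resulting dendrogram can be sketched with SciPy as follows (an illustrative reconstruction of the merging step using SciPy's standard routines, not the authors' own implementation):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist, squareform

# Illustrative data points (rows are points, columns are dimensions).
X = np.array([[1.0, 2.0], [1.2, 1.9], [5.0, 5.0], [5.2, 4.8], [9.0, 1.0]])

# Eq. (7): squared Euclidean pairwise distances D_ij^2 between all points.
D_squared = squareform(pdist(X, metric="sqeuclidean"))

# Iteratively merge the closest points/clusters until a single cluster remains;
# 'single' corresponds to the minimum-distance (closest pair) linkage option.
Z = linkage(pdist(X), method="single")

# The dendrogram's leaves are the individual points (x-axis) and the merge
# heights are the pairwise distances (y-axis), as described above.
tree = dendrogram(Z, no_plot=True)
print(tree["ivl"])   # leaf order of the individual data points
```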
The Gini index is calculated as in (8).

$\text{Gini index} = 1 - \sum_{j} P_j^2$    (8)

where j is the class and $P_j$ is the probability of the class (the proportion of samples with class j). The Gini index of a split is calculated as in (9).

$\text{Gini}_{\text{split}}(N) = \frac{S_{\text{left}}}{S}\,\text{gini}(N_{\text{left}}) + \frac{S_{\text{right}}}{S}\,\text{gini}(N_{\text{right}})$    (9)

where $S_{\text{left}}$ is the number of samples in the left node, $S_{\text{right}}$ is the number of samples in the right node, S is the total number of samples, $N_{\text{left}}$ is the left child and $N_{\text{right}}$ is the right child.

$\text{Gini}_{\text{level}} = \max(\text{Gini}_{\text{split}} \text{ of the same level})$    (10)

The Gini difference between two consecutive levels is calculated using the Gini indexes of the two levels. The Gini difference is a measure of information gain, or the drop of impurity. We choose the two levels with the highest Gini difference, or highest drop, calculated using (11).

$\text{Gini}_{\text{diff}} = \text{Gini}_{\text{level}}(i) - \text{Gini}_{\text{level}}(i+1)$    (11)

To choose between the upper level and the lower level of the highest Gini drop we use the "Gain Index" (gain ratio). Even though a lower Gini index reveals a higher purity of a split, the number of data points in a cluster and the number of clusters form a tradeoff: at the leaf nodes the purity of the split is higher, but each cluster contains only individual elements. The gain ratio is a modification of information gain that reduces its bias; it takes the number and size of branches into account when choosing a split, and it overcomes the tradeoff of the Gini index by taking intrinsic information into account. Therefore we calculate the gain ratio for each level using (12) and (13).

$\text{Gain ratio}(A) = \frac{\text{Gain}(A)}{\text{Split Info}(A)}$    (12)

where the gain of a level A is taken as the Gini difference calculated in (11), and the split information is given by (13).

$\text{Split Info}_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2 \frac{|D_j|}{|D|}$    (13)
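A minimal sketch of (8)-(12) for selecting a cut level of the dendrogram is given below. It assumes labelled samples, a list of candidate levels (each level given as its binary splits), and toy data; it illustrates the idea rather than reproducing the authors' implementation:

```python
import numpy as np

def gini(labels):
    """Eq. (8): 1 - sum_j P_j^2 over the class proportions in one cluster."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - float((p ** 2).sum())

def gini_of_split(left_labels, right_labels):
    """Eq. (9): sample-weighted Gini of a binary split into left/right children."""
    s_left, s_right = len(left_labels), len(right_labels)
    s = s_left + s_right
    return s_left / s * gini(left_labels) + s_right / s * gini(right_labels)

def split_info(cluster_sizes):
    """Eq. (13): intrinsic information of a partition with the given cluster sizes."""
    p = np.array(cluster_sizes, dtype=float) / sum(cluster_sizes)
    return float(-(p * np.log2(p)).sum())

# Illustrative input: per-level list of binary splits (left labels, right labels)
# produced while cutting a dendrogram at two consecutive levels.
levels = [
    [(np.array([0, 0, 1, 1]), np.array([1, 1, 1]))],                             # level i
    [(np.array([0, 0]), np.array([1, 1])), (np.array([1]), np.array([1, 1]))],   # level i+1
]

# Eq. (10): the Gini of a level is the maximum Gini over its splits.
gini_levels = [max(gini_of_split(l, r) for l, r in level) for level in levels]

# Eq. (11)-(12): Gini difference between consecutive levels, normalized by split info.
gini_diff = gini_levels[0] - gini_levels[1]
sizes = [len(l) + len(r) for l, r in levels[1]]
print("gini difference:", gini_diff, "gain ratio:", gini_diff / split_info(sizes))
```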
The Gini index of each level is calculated as in (10), and the gain ratio of a split as in (12) and (13). Gini differences are calculated between consecutive levels, and the two levels which result in the highest Gini difference are selected. The best split is the split with the higher gain ratio and the highest Gini difference.

[…] The distribution of the Fisher's Iris data set within two, three and four clusters is compared with the results obtained using the recursive hierarchical clustering algorithm and graphically depicted in Fig. 6. Recursive hierarchical clustering detects the optimal cluster number as 3 for the Fisher's Iris dataset. The distribution of samples among clusters using the recursive hierarchical clustering algorithm is almost identical to that of the K-means clustering algorithm with 3 clusters.
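The comparison described above can be reproduced in outline with scikit-learn; the sketch below is an assumption about the experimental setup rather than the authors' code, and a standard agglomerative clustering cut at three clusters stands in for the recursive hierarchical clustering algorithm:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans, AgglomerativeClustering

X = load_iris().data

# K-means with 3 clusters, as in the comparison above.
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# A standard hierarchical (agglomerative) clustering cut at 3 clusters,
# standing in here for the recursive hierarchical clustering algorithm.
hc_labels = AgglomerativeClustering(n_clusters=3, linkage="average").fit_predict(X)

print("k-means cluster sizes:     ", np.bincount(km_labels))
print("hierarchical cluster sizes:", np.bincount(hc_labels))
```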
[…] to cluster both labeled data and real data.

REFERENCES
[1] F. Long, H. Zhang, and D. D. Feng, "Fundamentals of content-based image retrieval," Multimedia Information Retrieval and Management, Springer Berlin Heidelberg, 2003, pp. 1-26.
[2] C. H. C. Leung and Y. Li, Semantic Image Retrieval, 2015.
[3] A. Alzu'bi, A. Amira, and N. Ramzan, "Semantic content-based image retrieval: A comprehensive study," Journal of Visual Communication and Image Representation, vol. 32, pp. 20-54, 2015.
[4] C. H. Lampert, H. Nickisch, and S. Harmeling, "Learning to detect unseen object classes by between-class attribute transfer," CVPR, 2009.
[5] Attribute Datasets. (April 10, 2017). [Online]. Available: https://www.ecse.rpi.edu/homepages/cvrl/database/AttributeDataset.htm
[6] J. Deng et al., "ImageNet: A large-scale hierarchical image database," in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248-255.
[7] M. R. Anderberg, Cluster Analysis for Applications, Academic Press, 1973.
[8] P. N. Tan, "Cluster analysis: Basic concepts and algorithms," Introduction to Data Mining, pp. 48-559, 2006.
[9] O. Maimon and L. Rokach, Data Mining and Knowledge Discovery Handbook, Springer Science+Business Media Inc., pp. 321-352, 2005.
[10] A. Huang, "Similarity measures for text document clustering," NZCSRSC 2008, Christchurch, New Zealand, April 2008.
[11] L. V. Bijuraj, "Clustering and its applications," in Proc. National Conference on New Horizons in IT - NCNHIT, 2013.
[12] D. M. Blei. (September 5, 2007). Clustering and the k-means algorithm. [Online]. Available: http://www.cs.princeton.edu/courses/archive/spr08/cos424/slides/clustering-1.pdf
[13] P. Stefanovic and O. Kurasova, "Visual analysis of self-organizing maps," Nonlinear Analysis: Modelling and Control, vol. 16, no. 4, pp. 488-504, 2011.
[14] S. Krüger. Self-Organizing Maps. [Online]. Available: http://www.iikt.ovgu.de/iesk_media/Downloads/ks/computational_neuroscience/vorlesung/comp_neuro8-p-2090.pdf
[15] T. Kohonen, Self-Organizing Maps, 3rd ed., Springer Series in Information Sciences, Springer-Verlag, Berlin, 2001.
[16] A. K. Mann and N. Kaur, "Survey paper on clustering techniques," International Journal of Science, Engineering and Technology Research, vol. 2, issue 4, April 2013.
[17] R. Tibshirani. (September 14, 2009). Distances between clustering, hierarchical clustering. [Online]. Available: http://www.stat.cmu.edu/~cshalizi/350/lectures/08/lecture-08.pdf
[18] S. Sayad. Hierarchical clustering. [Online]. Available: http://www.saedsayad.com/clustering_hierarchical.htm
[19] R. Tibshirani. (January 29, 2013). Clustering 2: Hierarchical clustering. [Online]. Available: http://www.stat.cmu.edu/~ryantibs/datamining/lectures/05-clus2-marked.pdf
[20] M. S. Yang, "A survey of hierarchical clustering," Mathl. Comput. Modelling, vol. 18, no. 11, pp. 1-16, 1993.
[21] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: A review," ACM Computing Surveys, vol. 31, no. 3, September 1999.
[22] Hierarchical Clustering Algorithms. [Online]. Available: http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/hierarchical.html
[23] K. Sasirekha and P. Baby, "Agglomerative hierarchical clustering algorithm — A review," International Journal of Scientific and Research Publications, vol. 3, issue 3, March 2013.
[24] M. Roux, "A comparative study of divisive hierarchical clustering algorithms," 2015.
[25] N. Rajalingam and K. Ranjini, "Hierarchical clustering algorithm — A comparative study," International Journal of Computer Applications, vol. 19, no. 3, April 2011.
[26] T. M. Lakshmi, A. Martin, R. M. Begum, and V. P. Venkatesan, "An analysis on performance of decision tree algorithms using student's qualitative data," I. J. Modern Education and Computer Science, vol. 5, pp. 18-27, June 2013.

Pavani Y. De Silva was born in Colombo, Sri Lanka in October 1992. She received her BSc. (Hons.) degree in information technology with first class honors from the University of Moratuwa, Sri Lanka in 2017. She works as a software engineer at IFS R & D International (Pvt) Ltd. Her current research interests include machine learning, data mining, big data analytics and data science.

Chiran N. Fernando was born in Sri Lanka in November 1991. He graduated from the University of Moratuwa, Sri Lanka with a BSc. (Hons.) degree in information technology in 2017. He has worked at Virtusa Polaris Pvt. Ltd as a software engineer intern. He currently works as a software engineer at WSO2 Lanka (Pvt) Ltd. His current research interests are machine intelligence, deep learning and artificial neural networks.

Damith D. Wijethunge was born in September 1991 in Kandy, Sri Lanka. He earned his bachelor's honours degree in the field of information technology from the University of Moratuwa in 2017. He has worked at Virtusa Polaris Pvt Ltd and SquareMobile Pvt Ltd as a software engineer intern. Currently he is working at DirectFN Ltd, Sri Lanka as a software engineer. He is interested in image processing, deep learning, artificial neural networks, big data analytics and data mining.

Subha D. Fernando graduated from the University of Kelaniya, Sri Lanka with a BSc. Special (Hons.) in statistics and computer science in 2004 and from Nagaoka University of Technology with a Master of Engineering (M.Eng.) in management information systems science in 2010. She completed her PhD degree in computational intelligence at Nagaoka University of Technology in 2013. She is the head of the Department of Computational Mathematics, Faculty of Information Technology, University of Moratuwa, Sri Lanka. She is the president of the Sri Lanka Association for Artificial Intelligence. Her current research interests are machine learning, deep learning, artificial neural networks, multi-agent systems and intelligent informatics.