UNIT 4 K-Means Clustering
Unsupervised Learning-Clustering
Clustering in Machine Learning
Clustering, or cluster analysis, is a machine learning technique that groups an unlabelled dataset.
It can be defined as "a way of grouping the data points into different clusters consisting of similar data points; the objects with possible similarities remain in a group that has little or no similarity with another group."
It does this by finding similar patterns in the unlabelled dataset, such as shape, size, colour or behaviour, and divides the data points according to the presence or absence of those patterns.
It is an unsupervised learning method; hence no supervision is provided to the algorithm, and it works with unlabelled data.
The clustering technique is commonly used for statistical data analysis.
The main clustering methods are:
1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering
Partitioning Clustering
It is a type of clustering that divides the data into non-hierarchical groups.
It is also known as the centroid-based method. The most common
example of partitioning clustering is the K-Means Clustering algorithm.
In this type, the dataset is divided into a set of K groups, where K defines the number of pre-defined groups.
Each cluster centre is placed so that the distance between the data points and their own cluster centroid is minimal compared with the distance to any other cluster centroid.
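A minimal sketch of the partitioning (centroid-based) idea using scikit-learn's KMeans; the synthetic blob data and k=3 are illustrative assumptions, not part of these notes:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)   # unlabelled 2-D points
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(kmeans.labels_[:10])        # cluster index assigned to each point
print(kmeans.cluster_centers_)    # the three learned centroids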
Density-Based Clustering
The density-based clustering method connects highly dense areas into clusters, and arbitrarily shaped clusters are formed as long as the dense regions can be connected.
The algorithm does this by identifying regions of high density in the dataset and connecting them into clusters.
The dense areas in data space are divided from each other by sparser
areas.
These algorithms can face difficulty in clustering the data points if the
dataset has varying densities and high dimensions.
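A minimal sketch of density-based clustering with scikit-learn's DBSCAN; the two-moons data is illustrative, and eps and min_samples are assumed values that normally need tuning per dataset:

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)   # arbitrarily shaped dense regions
db = DBSCAN(eps=0.3, min_samples=5).fit(X)

print(set(db.labels_))   # cluster ids; -1 marks points left in sparse (noise) regions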
Distribution Model-Based Clustering
In the distribution model-based clustering method, the data is divided based on the probability that a data point belongs to a particular distribution.
The grouping is done by assuming that the data follows certain distributions, most commonly the Gaussian distribution.
An example of this type is the Expectation-Maximization (EM) clustering algorithm, which uses Gaussian Mixture Models (GMM).
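A minimal sketch of distribution model-based clustering with scikit-learn's GaussianMixture, which is fitted by Expectation-Maximization; the two Gaussian blobs below are illustrative assumptions:

import numpy as np
from sklearn.mixture import GaussianMixture

X = np.vstack([np.random.normal(0, 1, (100, 2)),
               np.random.normal(5, 1, (100, 2))])    # two assumed Gaussian blobs

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.means_)                 # estimated centres of the two distributions
print(gmm.predict_proba(X[:5]))   # probability of each point belonging to each component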
Hierarchical Clustering
Hierarchical clustering can be used as an alternative to partitioning clustering, as there is no requirement to pre-specify the number of clusters to be created.
In this technique, the dataset is divided into clusters to create a tree-like
structure, which is also called a dendrogram.
Any desired number of clusters can then be obtained by cutting the tree at the correct level. The most common example of this method is the Agglomerative Hierarchical algorithm.
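A minimal sketch of agglomerative hierarchical clustering, assuming SciPy is available: linkage builds the tree-like merge structure (the dendrogram) and fcluster cuts it at a chosen level; the data is randomly generated for illustration:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 2)                          # illustrative unlabelled points
Z = linkage(X, method='ward')                      # bottom-up (agglomerative) merge tree
labels = fcluster(Z, t=3, criterion='maxclust')    # cut the tree into 3 clusters
print(labels)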
Fuzzy Clustering
Fuzzy clustering is a type of soft method in which a data object may
belong to more than one group or cluster.
Each data point has a set of membership coefficients, which indicate its degree of membership in each cluster.
The Fuzzy C-means algorithm is an example of this type of clustering; it is sometimes also known as the Fuzzy K-means algorithm.
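A minimal NumPy sketch of the Fuzzy C-means idea follows; the random data, the number of clusters c=2 and the fuzzifier m=2 are illustrative assumptions. Every point receives a membership coefficient for every cluster rather than a single hard label:

import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, iters=100):
    n = X.shape[0]
    u = np.random.dirichlet(np.ones(c), size=n)           # memberships, each row sums to 1
    for _ in range(iters):
        um = u ** m
        centers = um.T @ X / um.sum(axis=0)[:, None]       # membership-weighted cluster centres
        d = np.linalg.norm(X[:, None, :] - centers[None], axis=2) + 1e-10
        u = 1.0 / (d ** (2 / (m - 1)))                      # closer centres get larger membership
        u = u / u.sum(axis=1, keepdims=True)                # renormalise memberships per point
    return centers, u

X = np.random.rand(50, 2)
centers, u = fuzzy_c_means(X)
print(u[:3])   # degree of membership of the first 3 points in each cluster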
Clustering Algorithms
The choice of clustering algorithm depends on the kind of data we are using. For example, some algorithms require the number of clusters in the given dataset to be guessed in advance, whereas others work by finding the minimum distance between the observations of the dataset.
Hence each cluster contains data points with some commonalities and is kept apart from the other clusters.
The below diagram explains the working of the K-means Clustering Algorithm:
How does the K-Means Algorithm Work?
The working of the K-Means algorithm is explained in the below steps:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points or centroids (they can be points other than those in the input dataset).
Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat the third step, i.e., reassign each data point to the new closest centroid of its cluster.
Step-6: If any reassignment occurs, go back to Step-4; otherwise, finish.
Step-7: The model is ready.
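Putting the steps together, here is a minimal NumPy sketch of the loop (the random sample data is an illustrative assumption, not part of the notes):

import numpy as np

def k_means(X, k=2, iters=100):
    rng = np.random.default_rng(0)
    centroids = X[rng.choice(len(X), k, replace=False)]      # Steps 1-2: pick K starting centroids
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centroids[None], axis=2)
        labels = d.argmin(axis=1)                             # Step 3: assign to nearest centroid
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):             # Step 6: no reassignment, so finish
            break
        centroids = new_centroids                             # Steps 4-5: move centroids and repeat
    return centroids, labels

X = np.random.rand(100, 2)
centroids, labels = k_means(X, k=2)
print(centroids)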
Suppose we have two variables M1 and M2. The x-y axis scatter plot of these
two variables is given below:
Let us take the number of clusters as K=2, i.e., we will try to group this dataset into two different clusters.
We need to choose K random points or centroids to form the clusters.
These points can either be points from the dataset or any other points.
Here, we select the two points shown below as the K points; they are not part of our dataset. Consider the below image:
Now we will assign each data point of the scatter plot to its closest K-point or centroid. We compute this with the familiar mathematics for calculating the distance between two points (e.g., the Euclidean distance), as in the sketch below.
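As a concrete numeric sketch of this assignment step (the (M1, M2) values and the two starting centroids below are illustrative assumptions, not the actual figures used in the notes):

import numpy as np

X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0],
              [5.0, 7.0], [3.5, 5.0], [4.5, 5.0]])    # assumed (M1, M2) values
centroids = np.array([[1.0, 3.0], [4.0, 2.0]])        # two assumed K-points, not in the dataset

d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)  # distance of each point to each centroid
labels = d.argmin(axis=1)   # 0 = nearer the first centroid, 1 = nearer the second
print(labels)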
So, we will draw a median between both the centroids. Consider the
below image:
From the above image, the points on the left side of the line are closer to the K1 (blue) centroid, and the points to the right of the line are closer to the yellow centroid.
Let us colour them blue and yellow for clear visualization.
As we need to find the closest cluster, we repeat the process by choosing new centroids.
To choose the new centroids, we compute the centre of gravity of the points in each cluster and obtain the new centroids, as illustrated in the sketch below:
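A tiny sketch of this recomputation step; the points assigned to each cluster are assumed values for illustration only:

import numpy as np

# points currently assigned to each cluster (assumed values)
cluster_blue   = np.array([[1.0, 1.0], [1.5, 2.0]])
cluster_yellow = np.array([[3.0, 4.0], [5.0, 7.0], [3.5, 5.0], [4.5, 5.0]])

new_centroids = np.array([cluster_blue.mean(axis=0), cluster_yellow.mean(axis=0)])
print(new_centroids)   # each centre of gravity becomes a centroid for the next pass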
Next, we will reassign each data point to its new closest centroid. For this, we will repeat the same process of finding a median line.
The median will be as in the below image:
From the above image, we can see that one yellow point is on the left side of the line and two blue points are to the right of the line, so these three points will be assigned to the new centroids.
As reassignment has taken place, we go back to Step-4, which is finding new centroids or K-points.
After repeating the process of computing new centroids and redrawing the median line, we can see in the above image that there are no dissimilar data points on either side of the line, which means our model is formed. Consider the below image:
As our model is ready, we can now remove the assumed centroids, and the two final clusters will be as shown in the below image:
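The whole manual procedure above is what library implementations automate. As a closing sketch, scikit-learn's KMeans can produce such a two-cluster result on a small assumed (M1, M2) dataset:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0],
              [5.0, 7.0], [3.5, 5.0], [4.5, 5.0]])    # assumed (M1, M2) values

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)            # final cluster of each point
print(kmeans.cluster_centers_)   # the two final centroids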