UNIT-4

Unsupervised Learning: Clustering
Clustering in Machine Learning
 Clustering, or cluster analysis, is a machine learning technique
that groups an unlabelled dataset.
 It can be defined as "a way of grouping the data points into
different clusters consisting of similar data points. The objects with
possible similarities remain in a group that has few or no
similarities with another group."
 It does this by finding similar patterns in the unlabelled
dataset, such as shape, size, colour, or behaviour, and divides the
data according to the presence or absence of those patterns.
 It is an unsupervised learning method; no supervision is
provided to the algorithm, and it deals with an unlabelled
dataset.
 The clustering technique is commonly used for statistical data
analysis.

Note: Clustering is somewhat similar to classification, but the
difference is the type of dataset used. In classification, we work
with a labelled dataset, whereas in clustering, we work with an
unlabelled dataset.

Types of Clustering Methods


 Clustering methods are broadly divided into Hard clustering
(each data point belongs to only one group) and Soft clustering (a data
point can belong to more than one group).
 The main clustering methods are listed below:

1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering
Partitioning Clustering
 It is a type of clustering that divides the data into non-hierarchical groups.
It is also known as the centroid-based method. The most common
example of partitioning clustering is the K-Means clustering algorithm.
 In this type, the dataset is divided into a set of k groups, where K
defines the number of pre-defined groups.
 The cluster centres are chosen so that each data point is closer to its
own cluster's centroid than to any other cluster's centroid.

Density-Based Clustering
 The density-based clustering method connects highly dense areas into
clusters, so arbitrarily shaped clusters are formed as long as the dense
regions can be connected.
 The algorithm does this by identifying dense regions in the dataset and
connecting the areas of high density into clusters.
 The dense areas in the data space are separated from each other by
sparser areas.
 These algorithms can have difficulty clustering the data points if the
dataset has varying densities or high dimensionality.
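As an illustrative sketch of this idea, the example below (assuming scikit-learn and NumPy are available; the blob data is made up) runs DBSCAN on two dense groups plus one isolated point:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus one far-away point (made-up data).
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.1, size=(20, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.1, size=(20, 2))
outlier = np.array([[10.0, -10.0]])
X = np.vstack([blob_a, blob_b, outlier])

# eps is the neighbourhood radius; min_samples is the number of points
# needed inside that radius to count as a dense region.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(sorted(set(labels.tolist())))  # the isolated point gets label -1 (noise)
```

Note that the number of clusters is never specified; it falls out of the density structure, and the far-away point is reported as noise rather than forced into a cluster.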
Distribution Model-Based Clustering
 In the distribution model-based clustering method, the data is divided
based on the probability that a data point belongs to a particular
distribution.
 The grouping is done by assuming certain distributions, most commonly
the Gaussian distribution.
 An example of this type is the Expectation-Maximization (EM) clustering
algorithm, which uses Gaussian Mixture Models (GMM).
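A minimal sketch of this, assuming scikit-learn is available: fit a GaussianMixture (which runs EM internally) on synthetic one-dimensional data drawn from two Gaussians, and read off both the hard labels and the per-distribution probabilities:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Draw points from two known Gaussians and let EM recover the mixture.
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(0.0, 0.5, size=(100, 1)),
    rng.normal(6.0, 0.5, size=(100, 1)),
])

gmm = GaussianMixture(n_components=2, random_state=1).fit(X)
labels = gmm.predict(X)        # hard assignment per point
probs = gmm.predict_proba(X)   # probability of each component, per point
print(np.sort(gmm.means_.ravel()))  # close to the true means 0 and 6
```

The `probs` matrix is what distinguishes this method: each point receives a probability under every assumed distribution, not just a single cluster label.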
Hierarchical Clustering
 Hierarchical clustering can be used as an alternative to partitioning
clustering, as there is no requirement to pre-specify the number of
clusters to be created.
 In this technique, the dataset is divided into clusters to create a tree-like
structure, which is called a dendrogram.
 Any number of clusters can then be obtained by cutting the tree at the
appropriate level. The most common example of this method is the
Agglomerative Hierarchical algorithm.
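A small sketch using SciPy's hierarchical-clustering routines (assuming SciPy is available; the six sample points are made up): `linkage` builds the merge tree, i.e. the dendrogram, and `fcluster` cuts it at the level that yields two clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Six made-up points forming two obvious groups.
X = np.array([[0, 0], [0, 1], [1, 0],
              [10, 10], [10, 11], [11, 10]], dtype=float)

# linkage builds the full merge tree (the dendrogram) bottom-up;
# fcluster "cuts" the tree so that at most 2 clusters remain.
Z = linkage(X, method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # the first three points share one label, the last three the other
```

Changing `t` in `fcluster` selects a different cut level, which is exactly the "choose the number of clusters after the fact" property described above.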

Fuzzy Clustering
 Fuzzy clustering is a type of soft clustering in which a data object may
belong to more than one group or cluster.
 Each data point has a set of membership coefficients that indicate its
degree of membership in each cluster.
 The Fuzzy C-means algorithm is the best-known example of this type of
clustering; it is sometimes also called the Fuzzy k-means algorithm.
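A minimal NumPy sketch of the fuzzy c-means idea (the function and data below are illustrative, not a production implementation): every point keeps one membership coefficient per cluster, and each point's coefficients sum to 1.

```python
import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, n_iter=50, seed=0):
    """Each row of U holds one point's membership coefficients (summing to 1)."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)            # normalise memberships
    for _ in range(n_iter):
        W = U ** m                               # fuzzified weights
        centers = (W.T @ X) / W.sum(axis=0)[:, None]   # weighted cluster means
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        # Standard membership update: u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1))
        U = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1))).sum(axis=2)
    return U, centers

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.2, (15, 2)), rng.normal(4.0, 0.2, (15, 2))])
U, centers = fuzzy_c_means(X)
```

The fuzzifier `m` controls how soft the memberships are: as `m` approaches 1, the coefficients harden towards 0/1 and the method approaches ordinary k-means.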
Clustering Algorithms
The choice of clustering algorithm depends on the kind of data we are using.
Some algorithms require the number of clusters to be specified in advance,
whereas others form clusters from the distances between the observations in
the dataset.

K-Means algorithm: The k-means algorithm is one of the most popular
clustering algorithms. It partitions the dataset by dividing the samples into
clusters of roughly equal variance. The number of clusters must be specified
in advance for this algorithm.
DBSCAN Algorithm: DBSCAN stands for Density-Based Spatial Clustering of
Applications with Noise. It is an example of a density-based model, like
mean-shift, but with some notable advantages. In this algorithm, areas of
high density are separated by areas of low density, so the clusters it finds
can have any arbitrary shape.

K-Means Clustering Algorithms

 K-Means clustering is an unsupervised learning algorithm used to
solve clustering problems in machine learning: it groups an unlabelled
dataset into different clusters.
 Here K defines the number of pre-defined clusters that need to be created
in the process; if K=2, there will be two clusters, for K=3 there
will be three clusters, and so on.
 It is an iterative algorithm that divides the unlabelled dataset into k
different clusters in such a way that each data point belongs to only one
group of points with similar properties.
 It allows us to cluster the data into different groups and is a convenient
way to discover the categories of groups in an unlabelled dataset on its
own, without the need for any training.
 It is a centroid-based algorithm, where each cluster is associated with a
centroid. The main aim of the algorithm is to minimize the sum of
distances between the data points and their corresponding cluster centroids.
 The algorithm takes the unlabelled dataset as input, divides it into
k clusters, and repeats the process until it finds the best clusters.
The value of k must be predetermined.

The k-means clustering algorithm mainly performs two tasks:

 Determines the best positions for the K centre points, or centroids,
through an iterative process.
 Assigns each data point to its closest centroid. The data points near a
particular centroid form a cluster.

Hence each cluster contains data points with some commonalities and is
separated from the other clusters.
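The two tasks can be sketched in a few lines of NumPy; the points and initial centroids below are made up for illustration:

```python
import numpy as np

# Four made-up points and two assumed starting centroids.
X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])

# Task 2: assign each data point to its closest centroid.
dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
assignment = dists.argmin(axis=1)
print(assignment)  # [0 0 1 1]

# Task 1 (one iterative step): move each centroid to the mean of its points.
new_centroids = np.array([X[assignment == k].mean(axis=0) for k in range(2)])
print(new_centroids)  # [[1.25 1.5 ] [8.5  8.75]]
```

Repeating these two steps until the centroids stop moving is the whole algorithm.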

The diagram below explains the working of the K-means clustering algorithm:
How does the K-Means Algorithm Work?
The working of the K-Means algorithm is explained in the steps below:

Step-1: Select the number K to decide the number of clusters.

Step-2: Select K random points as the initial centroids. (These need not be
points from the input dataset.)

Step-3: Assign each data point to its closest centroid, which forms the
predefined K clusters.

Step-4: Compute the mean of each cluster and place a new centroid there.

Step-5: Repeat the third step, i.e., reassign each data point to the new
closest centroid.

Step-6: If any reassignment occurred, go to step-4; otherwise, FINISH.

Step-7: The model is ready.
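The steps above can be sketched as a small from-scratch implementation in NumPy (illustrative only; for simplicity it does not handle the edge case of a cluster becoming empty):

```python
import numpy as np

def k_means(X, k=2, seed=0):
    rng = np.random.default_rng(seed)
    # Step-1/2: choose K and pick K random data points as initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    while True:
        # Step-3/5: assign each point to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step-4: place each new centroid at the mean of its cluster.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step-6: stop when no centroid moves (no reassignment possible).
        if np.allclose(new_centroids, centroids):
            return labels, centroids          # Step-7: the model is ready
        centroids = new_centroids

# Made-up data: two groups of 25 points around (0, 0) and (5, 5).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.3, (25, 2)), rng.normal(5.0, 0.3, (25, 2))])
labels, centers = k_means(X, k=2)
```

On this data the loop recovers one centroid near each group, regardless of which points were picked in Step-2.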

Let's understand the above steps by considering the visual plots:

Suppose we have two variables M1 and M2. The x-y axis scatter plot of these
two variables is given below:
 Let us take the number of clusters K=2, so we will try to group the
dataset into two different clusters.
 We need to choose K random points or centroids to form the clusters.
These points can be points from the dataset or any other points.
 Here we select the two points below as the K centroids; they are not
part of our dataset. Consider the image below:

 Now we assign each data point in the scatter plot to its closest
centroid. We compute this by calculating the distance between each
point and each centroid.
 Then we draw a median line (the perpendicular bisector) between the
two centroids. Consider the image below:
 In the image above, points on the left side of the line are nearer to K1,
the blue centroid, and points to the right of the line are closer to the
yellow centroid.
 Let us colour them blue and yellow for clear visualization.
 To refine the clusters, we repeat the process with new centroids.
 To choose the new centroids, we compute the centre of gravity of each
cluster and place the new centroids there, as shown below:

 Next, we reassign each data point to its new closest centroid. For this,
we repeat the process of drawing a median line, as in the image below:
 In the image above, one yellow point is on the left side of the line,
and two blue points are to the right of it, so these three points will be
reassigned to the other centroid.
 As reassignment has taken place, we go back to step-4, which is
finding new centroids or K-points.

 We repeat the process by finding the centre of gravity of each cluster,
so the new centroids will be as shown in the image below:
With the new centroids, we again draw the median line and reassign the
data points. The result looks like this:

We can see in the above image that no data points lie on the wrong side
of the line, which means our model has converged. Consider the image below:
As our model is ready, we can now remove the assumed centroids, and the
two final clusters will be as shown in the image below:
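For comparison, the whole walkthrough above corresponds to a single call to scikit-learn's KMeans (assuming scikit-learn is available; the two-variable data below stands in for M1 and M2):

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up two-variable data (M1, M2): two groups of 30 points each.
rng = np.random.default_rng(7)
X = np.vstack([rng.normal(2.0, 0.5, (30, 2)), rng.normal(8.0, 0.5, (30, 2))])

# n_clusters plays the role of K; n_init repeats the whole procedure from
# several random initialisations and keeps the best result.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # the two final centroids
print(km.labels_[:5])        # cluster index of the first five points
```

`fit` runs exactly the assign/recompute loop from the steps above until no reassignment occurs.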
