UNIT 4 K-Means Clustering
Unsupervised Learning-Clustering
Clustering in Machine Learning
Clustering, or cluster analysis, is a machine learning technique that groups an unlabelled dataset.
It can be defined as "a way of grouping the data points into different clusters consisting of similar data points; the objects with possible similarities remain in a group that has little or no similarity with another group."
It does this by finding similar patterns in the unlabelled dataset, such as shape, size, colour or behaviour, and divides the data points according to the presence or absence of those patterns.
It is an unsupervised learning method; hence no supervision is provided to the algorithm, and it works with unlabelled data.
The clustering technique is commonly used for statistical data analysis.
The main clustering methods are:
1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering
Partitioning Clustering
It is a type of clustering that divides the data into non-hierarchical groups.
It is also known as the centroid-based method. The most common
example of partitioning clustering is the K-Means Clustering algorithm.
In this type, the dataset is divided into a set of K groups, where K defines the number of pre-defined groups.
Each cluster centre is placed so that the distance between the data points and their own cluster centroid is minimal compared with the distance to any other cluster centroid.
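A minimal sketch of the partitioning (centroid-based) idea using scikit-learn's KMeans; the synthetic blob data and k=3 are illustrative assumptions, not part of these notes:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)   # unlabelled 2-D points
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(kmeans.labels_[:10])        # cluster index assigned to each point
print(kmeans.cluster_centers_)    # the three learned centroids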
Density-Based Clustering
The density-based clustering method connects highly dense areas into clusters, and arbitrarily shaped clusters are formed as long as the dense regions can be connected.
The algorithm does this by identifying regions of high density in the dataset and connecting them into clusters.
The dense areas in data space are divided from each other by sparser
areas.
These algorithms can face difficulty in clustering the data points if the
dataset has varying densities and high dimensions.
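A minimal sketch of density-based clustering with scikit-learn's DBSCAN; the two-moons data is illustrative, and eps and min_samples are assumed values that normally need tuning per dataset:

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)   # arbitrarily shaped dense regions
db = DBSCAN(eps=0.3, min_samples=5).fit(X)

print(set(db.labels_))   # cluster ids; -1 marks points left in sparse (noise) regions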
Distribution Model-Based Clustering
In the distribution model-based clustering method, the data is divided based on the probability that a data point belongs to a particular distribution.
The grouping is done by assuming that the data follows certain distributions, most commonly the Gaussian distribution.
An example of this type is the Expectation-Maximization (EM) clustering algorithm, which uses Gaussian Mixture Models (GMM).
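A minimal sketch of distribution model-based clustering with scikit-learn's GaussianMixture, which is fitted by Expectation-Maximization; the two Gaussian blobs below are illustrative assumptions:

import numpy as np
from sklearn.mixture import GaussianMixture

X = np.vstack([np.random.normal(0, 1, (100, 2)),
               np.random.normal(5, 1, (100, 2))])    # two assumed Gaussian blobs

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.means_)                 # estimated centres of the two distributions
print(gmm.predict_proba(X[:5]))   # probability of each point belonging to each component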
Hierarchical Clustering
Hierarchical clustering can be used as an alternative to partitioning clustering, as there is no requirement to pre-specify the number of clusters to be created.
In this technique, the dataset is divided into clusters to create a tree-like
structure, which is also called a dendrogram.
Any desired number of clusters can then be obtained by cutting the tree at the correct level. The most common example of this method is the Agglomerative Hierarchical algorithm.
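A minimal sketch of agglomerative hierarchical clustering, assuming SciPy is available: linkage builds the tree-like merge structure (the dendrogram) and fcluster cuts it at a chosen level; the data is randomly generated for illustration:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 2)                          # illustrative unlabelled points
Z = linkage(X, method='ward')                      # bottom-up (agglomerative) merge tree
labels = fcluster(Z, t=3, criterion='maxclust')    # cut the tree into 3 clusters
print(labels)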
Fuzzy Clustering
Fuzzy clustering is a type of soft method in which a data object may
belong to more than one group or cluster.
Each data point has a set of membership coefficients, which indicate its degree of membership in each cluster.
The Fuzzy C-means algorithm is an example of this type of clustering; it is sometimes also known as the Fuzzy K-means algorithm.
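A minimal NumPy sketch of the Fuzzy C-means idea follows; the random data, the number of clusters c=2 and the fuzzifier m=2 are illustrative assumptions. Every point receives a membership coefficient for every cluster rather than a single hard label:

import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, iters=100):
    n = X.shape[0]
    u = np.random.dirichlet(np.ones(c), size=n)           # memberships, each row sums to 1
    for _ in range(iters):
        um = u ** m
        centers = um.T @ X / um.sum(axis=0)[:, None]       # membership-weighted cluster centres
        d = np.linalg.norm(X[:, None, :] - centers[None], axis=2) + 1e-10
        u = 1.0 / (d ** (2 / (m - 1)))                      # closer centres get larger membership
        u = u / u.sum(axis=1, keepdims=True)                # renormalise memberships per point
    return centers, u

X = np.random.rand(50, 2)
centers, u = fuzzy_c_means(X)
print(u[:3])   # degree of membership of the first 3 points in each cluster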
Clustering Algorithms
The choice of clustering algorithm depends on the kind of data we are using. For example, some algorithms require the number of clusters in the given dataset to be guessed in advance, whereas others work by finding the minimum distance between the observations of the dataset.
Hence each cluster contains data points with some commonalities and is kept apart from the other clusters.
The below diagram explains the working of the K-means Clustering Algorithm:
How does the K-Means Algorithm Work?
The working of the K-Means algorithm is explained in the below steps:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points or centroids (they can be points other than those in the input dataset).
Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat the third step, i.e., reassign each data point to the new closest centroid of its cluster.
Step-6: If any reassignment occurs, go back to Step-4; otherwise, finish.
Step-7: The model is ready.
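Putting the steps together, here is a minimal NumPy sketch of the loop (the random sample data is an illustrative assumption, not part of the notes):

import numpy as np

def k_means(X, k=2, iters=100):
    rng = np.random.default_rng(0)
    centroids = X[rng.choice(len(X), k, replace=False)]      # Steps 1-2: pick K starting centroids
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centroids[None], axis=2)
        labels = d.argmin(axis=1)                             # Step 3: assign to nearest centroid
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):             # Step 6: no reassignment, so finish
            break
        centroids = new_centroids                             # Steps 4-5: move centroids and repeat
    return centroids, labels

X = np.random.rand(100, 2)
centroids, labels = k_means(X, k=2)
print(centroids)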
Suppose we have two variables M1 and M2. The x-y axis scatter plot of these
two variables is given below:
Let us take the number of clusters as K=2, i.e., we will try to group this dataset into two different clusters.
We need to choose K random points or centroids to form the clusters.
These points can either be points from the dataset or any other points.
Here, we select the two points shown below as the K points; they are not part of our dataset. Consider the below image:
Now we will assign each data point of the scatter plot to its closest K-point or centroid. We compute this with the familiar mathematics for calculating the distance between two points (e.g., the Euclidean distance), as in the sketch below.
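As a concrete numeric sketch of this assignment step (the (M1, M2) values and the two starting centroids below are illustrative assumptions, not the actual figures used in the notes):

import numpy as np

X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0],
              [5.0, 7.0], [3.5, 5.0], [4.5, 5.0]])    # assumed (M1, M2) values
centroids = np.array([[1.0, 3.0], [4.0, 2.0]])        # two assumed K-points, not in the dataset

d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)  # distance of each point to each centroid
labels = d.argmin(axis=1)   # 0 = nearer the first centroid, 1 = nearer the second
print(labels)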
So, we will draw a median between both the centroids. Consider the
below image:
From the above image, the points on the left side of the line are closer to the K1 (blue) centroid, and the points to the right of the line are closer to the yellow centroid.
Let us colour them blue and yellow for clear visualization.
As we need to find the closest cluster, we repeat the process by choosing new centroids.
To choose the new centroids, we compute the centre of gravity of the points in each cluster and obtain the new centroids, as illustrated in the sketch below:
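A tiny sketch of this recomputation step; the points assigned to each cluster are assumed values for illustration only:

import numpy as np

# points currently assigned to each cluster (assumed values)
cluster_blue   = np.array([[1.0, 1.0], [1.5, 2.0]])
cluster_yellow = np.array([[3.0, 4.0], [5.0, 7.0], [3.5, 5.0], [4.5, 5.0]])

new_centroids = np.array([cluster_blue.mean(axis=0), cluster_yellow.mean(axis=0)])
print(new_centroids)   # each centre of gravity becomes a centroid for the next pass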
Next, we will reassign each data point to its new closest centroid. For this, we will repeat the same process of finding a median line.
The median will be as in the below image:
From the above image, we can see that one yellow point is on the left side of the line and two blue points are to the right of the line, so these three points will be assigned to the new centroids.
As reassignment has taken place, we go back to Step-4, which is finding new centroids or K-points.
After repeating the process of computing new centroids and redrawing the median line, we can see in the above image that there are no dissimilar data points on either side of the line, which means our model is formed. Consider the below image:
As our model is ready, we can now remove the assumed centroids, and the two final clusters will be as shown in the below image:
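The whole manual procedure above is what library implementations automate. As a closing sketch, scikit-learn's KMeans can produce such a two-cluster result on a small assumed (M1, M2) dataset:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0],
              [5.0, 7.0], [3.5, 5.0], [4.5, 5.0]])    # assumed (M1, M2) values

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)            # final cluster of each point
print(kmeans.cluster_centers_)   # the two final centroids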