
UNIT IV

Cluster Analysis – Partitioning Methods: k-Means and k-Medoids – Hierarchical Methods: Agglomerative and Divisive – Model-Based Clustering Methods: Fuzzy Clusters and Expectation-Maximization Algorithm

Cluster analysis, also known as clustering, is a method of data mining that groups similar
data points together.

Clustering is a branch of unsupervised learning: the training data consists of a given set of inputs without any target values. The unlabelled data is divided into groups such that similar data instances are assigned to the same cluster while dissimilar data instances are assigned to different clusters.

Clustering has various uses in market segmentation, outlier detection, and network analysis,
to name a few.

There are many different algorithms used for cluster analysis, such as k-means,
hierarchical clustering, and density-based clustering. The choice of algorithm will
depend on the specific requirements of the analysis and the nature of the data being
analyzed.

The given data is divided into different groups by combining similar objects into a group. Each such group is a cluster: a collection of similar data points grouped together.
For example, consider a dataset of vehicles containing information about different vehicles such as cars, buses, and bicycles. As this is unsupervised learning, there are no class labels such as Car or Bike for the vehicles; all the data is mixed together and not in a structured form.
Our task is to convert the unlabelled data into labelled data, and this can be done using clusters.
The main idea of cluster analysis is to arrange all the data points into clusters, such as a cars cluster containing all the cars and a bikes cluster containing all the bikes.
Simply put, it is the partitioning of similar objects, applied to unlabelled data.
Clustering Methods:
The clustering methods can be classified into the following categories:
 Partitioning Methods: k-Means, k-Medoids
 Hierarchical Methods: Agglomerative, Divisive
 Density-based Methods
 Grid-based Methods
 Model-based Methods: Expectation-Maximization, Fuzzy Clusters
 Constraint-based Methods
Partitioning methods in data mining are a popular family of clustering algorithms that partition a dataset into K distinct clusters. These algorithms aim to group similar data points together while maximizing the differences between the clusters. The most widely used partitioning method is the K-means algorithm, which assigns data points to clusters and iteratively refines the cluster centroids until convergence. Other popular partitioning methods in data mining include K-medoids. The choice of algorithm depends on the specific clustering problem and the characteristics of the dataset.

The most widely used partitioning method is the K-means algorithm. Other popular partitioning methods include K-medoids, which is similar to K-means but uses medoids instead of centroids as cluster representatives.

Partitioning methods offer several benefits, including speed, scalability, and simplicity.

They are relatively easy to implement and can handle large datasets. Partitioning methods are also effective in identifying natural clusters within data and can be used for various applications, such as customer segmentation, image segmentation, and anomaly detection.

K-Means (A Centroid-Based Technique)


K-means is the most popular algorithm in partitioning methods for clustering. It partitions a
dataset into K clusters, where K is a user-defined parameter. Let’s understand the K-Means
algorithm in more detail.
How does K-Means Work?

The K-Means algorithm begins by assigning each data point to an initial cluster. It then iteratively refines the cluster centroids until convergence. The refinement process involves calculating the mean of the data points assigned to each cluster and updating the coordinates of the cluster centroids accordingly.

The algorithm continues to iterate until convergence, meaning the cluster assignments no
longer change. K-means clustering aims to minimize the sum of squared distances between
each data point and its assigned cluster centroid.
K-means is widely used in various applications, such as customer segmentation, image segmentation, and anomaly detection, due to its simplicity and efficiency in handling large datasets. For example, the K-Means algorithm can group a set of data points into two well-separated clusters.
It partitions the given data set into k predefined distinct clusters.

It partitions the data set such that-

Each data point belongs to the cluster with the nearest mean.

Data points belonging to the same cluster have a high degree of similarity.
Data points belonging to different clusters have a high degree of dissimilarity.
K-Means Clustering Algorithm-
K-Means Clustering Algorithm involves the following steps-

Step-01:
 Choose the number of clusters K.

Step-02:
 Randomly select any K data points as cluster centers.
 Select cluster centers in such a way that they are as far as possible from each other.
Step-03:
 Calculate the distance between each data point and each cluster center.
 The distance may be calculated either by using a given distance function or by using the Euclidean distance formula.
Step-04:
 Assign each data point to some cluster.
 A data point is assigned to that cluster whose center is nearest to that data point.
Step-05:
 Re-compute the center of each newly formed cluster.
 The center of a cluster is computed by taking the mean of all the data points contained in that cluster.
Step-06:
Keep repeating the procedure from Step-03 to Step-05 until any of the following stopping criteria is met (a short code sketch of these steps is given below)-
 Centers of newly formed clusters do not change
 Data points remain in the same cluster
 The maximum number of iterations is reached
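The steps above can be turned into a minimal, self-contained Python sketch. This is an illustrative implementation, not part of the original notes: it uses Euclidean distance and NumPy, and the function and variable names are assumptions.

# Minimal k-means sketch following Step-01 to Step-06 above (illustrative only).
import numpy as np

def k_means(points, initial_centers, max_iters=100):
    points = np.asarray(points, dtype=float)
    centers = np.asarray(initial_centers, dtype=float)   # Step-01/02: K chosen centers
    k = len(centers)
    for _ in range(max_iters):                            # Step-06: repeat until convergence
        # Step-03/04: assign each point to its nearest center (Euclidean distance).
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step-05: re-compute each center as the mean of its assigned points.
        new_centers = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):             # stopping: centers unchanged
            break
        centers = new_centers
    return centers, labels

pts = [(2, 10), (2, 5), (8, 4), (5, 8), (7, 5), (6, 4), (1, 2), (4, 9)]
print(k_means(pts, initial_centers=[(2, 10), (5, 8), (1, 2)]))

Note that this sketch uses Euclidean distance, whereas the practice problem below uses the given Manhattan distance function, so the intermediate assignments can differ slightly from the hand-worked solution.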

Advantages-
K-Means Clustering Algorithm offers the following advantages-
It is relatively efficient, with time complexity O(nkt), where-
 n = number of instances
 k = number of clusters
 t = number of iterations
Disadvantages-
K-Means Clustering Algorithm has the following disadvantages-
 It requires the number of clusters (k) to be specified in advance.
 It cannot handle noisy data and outliers well.
 It is not suitable for identifying clusters with non-convex shapes.

PRACTICE PROBLEM BASED ON K-MEANS CLUSTERING ALGORITHM
Problem-01:
Cluster the following eight points (with (x, y) representing locations) into three clusters:

A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9)

Initial cluster centers are: A1(2, 10), A4(5, 8) and A7(1, 2).
The distance function between two points a = (x1, y1) and b = (x2, y2) is defined as-

Ρ(a, b) = |x2 – x1| + |y2 – y1|


Use K-Means Algorithm to find the three cluster centers after the second iteration.

Iteration-01:
 We calculate the distance of each point from each of the centers of the three clusters.
 The distance is calculated by using the given distance function.

The following illustration shows the calculation of the distance between point A1(2, 10) and each of the centers of the three clusters-
Calculating Distance Between A1(2, 10) and C1(2, 10)-
Ρ(A1, C1)

= |x2 – x1| + |y2 – y1|

= |2 – 2| + |10 – 10|

=0
Calculating Distance Between A1(2, 10) and C2(5, 8)-
Ρ(A1, C2)

= |x2 – x1| + |y2 – y1|


= |5 – 2| + |8 – 10|

=3+2
=5
Calculating Distance Between A1(2, 10) and C3(1, 2)-
Ρ(A1, C3)

= |x2 – x1| + |y2 – y1|

= |1 – 2| + |2 – 10|
=1+8

=9

In a similar manner, we calculate the distances of the other points from each of the centers of the three clusters.

Next,
 We draw a table showing all the results.
 Using the table, we decide which point belongs to which cluster.
 The given point belongs to that cluster whose center is nearest to it.

Given Point    Distance from center (2, 10) of Cluster-01    Distance from center (5, 8) of Cluster-02    Distance from center (1, 2) of Cluster-03    Point belongs to Cluster

A1(2, 10) 0 5 9 C1

A2(2, 5) 5 6 4 C3

A3(8, 4) 12 7 9 C2

A4(5, 8) 5 0 10 C2

A5(7, 5) 10 5 9 C2

A6(6, 4) 10 5 7 C2
A7(1, 2) 9 10 0 C3

A8(4, 9) 3 2 10 C2

From here, the new clusters are-

Cluster-01:
First cluster contains the point A1(2, 10).

Cluster-02:
Second cluster contains points-

 A3(8, 4)
 A4(5, 8)
 A5(7, 5)
 A6(6, 4)
 A8(4, 9)

Cluster-03:
Third cluster contains points-
 A2(2, 5)
 A7(1, 2)

Now,
 We re-compute the new cluster centers.
 The new cluster center is computed by taking the mean of all the points contained in that cluster.

For Cluster-01:
We have only one point A1(2, 10) in Cluster-01.

 So, cluster center remains the same.

For Cluster-02:
Center of Cluster-02

= ((8 + 5 + 7 + 6 + 4)/5, (4 + 8 + 5 + 4 + 9)/5)

= (6, 6)

For Cluster-03:
Center of Cluster-03

= ((2 + 1)/2, (5 + 2)/2)


= (1.5, 3.5)
This completes Iteration-01.
Iteration-02:
 We calculate the distance of each point from each of the centers of the three clusters.
 The distance is calculated by using the given distance function.

The following illustration shows the calculation of the distance between point A1(2, 10) and each of the centers of the three clusters-
Calculating Distance Between A1(2, 10) and C1(2, 10)-
Ρ(A1, C1)

= |x2 – x1| + |y2 – y1|

= |2 – 2| + |10 – 10|

=0

Calculating Distance Between A1(2, 10) and C2(6, 6)-


Ρ(A1, C2)
= |x2 – x1| + |y2 – y1|

= |6 – 2| + |6 – 10|
=4+4

=8

Calculating Distance Between A1(2, 10) and C3(1.5, 3.5)-


Ρ(A1, C3)

= |x2 – x1| + |y2 – y1|

= |1.5 – 2| + |3.5 – 10|


= 0.5 + 6.5
=7

In a similar manner, we calculate the distances of the other points from each of the centers of the three clusters.
Next,

 We draw a table showing all the results.


 Using the table, we decide which point belongs to which cluster.
 The given point belongs to that cluster whose center is nearest to it.

Given Point    Distance from center (2, 10) of Cluster-01    Distance from center (6, 6) of Cluster-02    Distance from center (1.5, 3.5) of Cluster-03    Point belongs to Cluster
A1(2, 10) 0 8 7 C1

A2(2, 5) 5 5 2 C3

A3(8, 4) 12 4 7 C2

A4(5, 8) 5 3 8 C2

A5(7, 5) 10 2 7 C2

A6(6, 4) 10 2 5 C2

A7(1, 2) 9 9 2 C3

A8(4, 9) 3 5 8 C1

From here, the new clusters are-


Cluster-01:
First cluster contains points-

 A1(2, 10)
 A8(4, 9)

Cluster-02:
Second cluster contains points-

 A3(8, 4)
 A4(5, 8)
 A5(7, 5)
 A6(6, 4)

Cluster-03:
Third cluster contains points-
 A2(2, 5)
 A7(1, 2)
Now,

 We re-compute the new cluster centers.

 The new cluster center is computed by taking the mean of all the points contained in that cluster.

For Cluster-01:
Center of Cluster-01

= ((2 + 4)/2, (10 + 9)/2)

= (3, 9.5)
For Cluster-02:
Center of Cluster-02

= ((8 + 5 + 7 + 6)/4, (4 + 8 + 5 + 4)/4)


= (6.5, 5.25)

For Cluster-03:
Center of Cluster-03

= ((2 + 1)/2, (5 + 2)/2)

= (1.5, 3.5)

This completes Iteration-02.

After the second iteration, the centers of the three clusters are-

 C1(3, 9.5)
 C2(6.5, 5.25)
 C3(1.5, 3.5)
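The two iterations worked out above can be checked with a short script. This is an illustrative verification sketch using the Manhattan distance function given in the problem; the helper names are assumptions.

# Verify the worked k-means example above using the given Manhattan distance.
points = {"A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "A4": (5, 8),
          "A5": (7, 5), "A6": (6, 4), "A7": (1, 2), "A8": (4, 9)}

def manhattan(a, b):
    # Distance function given in the problem: |x2 - x1| + |y2 - y1|
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

centers = [(2, 10), (5, 8), (1, 2)]              # initial centers C1, C2, C3
for it in range(2):                              # two iterations, as asked in the problem
    clusters = [[] for _ in centers]
    for p in points.values():
        nearest = min(range(len(centers)), key=lambda i: manhattan(p, centers[i]))
        clusters[nearest].append(p)
    # Re-compute each center as the mean of its points (no cluster is empty for this data).
    centers = [tuple(sum(coord) / len(pts) for coord in zip(*pts)) for pts in clusters]
    print("Iteration", it + 1, "centers:", centers)
# Iteration 2 prints (3.0, 9.5), (6.5, 5.25), (1.5, 3.5), matching the result above.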

In clustering, the machine learns the attributes and trends by itself without any provided input-output mapping. The clustering algorithm extracts patterns and inferences from the data objects and then forms discrete clusters from them accordingly.
K-medoids is an unsupervised method for clustering unlabelled data. It is an improved version of the K-Means algorithm, mainly designed to deal with its sensitivity to outlier data. Compared to other partitioning algorithms, the algorithm is simple, fast, and easy to implement.

K-Medoids:
Medoid: A Medoid is a point in the cluster from which the sum of distances to
other data points is minimal.

(or)

A Medoid is a point in the cluster from which dissimilarities with all the other points
in the clusters are minimal.

Instead of the centroids used as reference points in the K-Means algorithm, the K-Medoids algorithm takes a medoid as its reference point.

There are three types of algorithms for K-Medoids Clustering:

1. PAM (Partitioning Around Medoids)
2. CLARA (Clustering Large Applications)
3. CLARANS (Clustering Large Applications based on Randomized Search)

PAM is the most powerful of the three algorithms but has the disadvantage of high time complexity.

Algorithm:
Given the value of k and unlabelled data:

1. Choose k number of random points from the data and assign these k points
to k number of clusters. These are the initial medoids.
2. For all the remaining data points, calculate the distance from each medoid
and assign it to the cluster with the nearest medoid.
3. Calculate the total cost (Sum of all the distances from all the data points to the
medoids)
4. Select a random non-medoid point as the new medoid and swap it with a previous medoid. Repeat steps 2 and 3.
5. If the total cost with the new medoid is less than that with the previous medoid, make the new medoid permanent and repeat step 4.
6. If the total cost with the new medoid is greater than the cost with the previous medoid, undo the swap and repeat step 4.
7. The repetitions continue until the medoids no longer change the assignment of data points (a code sketch of this procedure is given below).
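Here is a minimal PAM-style k-medoids sketch in Python that follows the steps above. It is an illustrative assumption, not a reference implementation: the greedy swap loop, the Manhattan distance, and the function names are choices made for this sketch.

# Minimal PAM-style k-medoids sketch (illustrative only).
import random

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def total_cost(points, medoids):
    # Step 3: sum of distances from every point to its nearest medoid.
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

def k_medoids(points, k, seed=0):
    rng = random.Random(seed)
    medoids = rng.sample(points, k)                 # step 1: random initial medoids
    best = total_cost(points, medoids)
    improved = True
    while improved:                                 # step 7: repeat until no improving swap
        improved = False
        for i in range(k):                          # step 4: try swapping each medoid...
            for candidate in points:                # ...with each non-medoid point
                if candidate in medoids:
                    continue
                trial = medoids[:i] + [candidate] + medoids[i + 1:]
                cost = total_cost(points, trial)
                if cost < best:                     # step 5: keep a cheaper swap
                    medoids, best, improved = trial, cost, True
    # Step 2: assign every point to its nearest medoid.
    clusters = {m: [] for m in medoids}
    for p in points:
        clusters[min(medoids, key=lambda m: manhattan(p, m))].append(p)
    return medoids, clusters, best

data = [(5, 4), (7, 7), (1, 3), (8, 6), (4, 9)]     # data set used in the example below
print(k_medoids(data, k=2))

On the small data set of the example below, this search ends with a total cost of 12, matching the hand-worked result; note that for this data the pair (1, 3) and (7, 7) ties with (5, 4) and (7, 7) at the same cost, so the medoids reported can depend on the random start.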

Here is an example to make the theory clear:

Data set:

Index  x  y
0      5  4
1      7  7
2      1  3
3      8  6
4      4  9

(Scatter plot of the five points omitted.)

If k is given as 2, we need to break the data points down into 2 clusters.

1. Initial medoids: M1(1, 3) and M2(4, 9)

2. Calculation of distances

Manhattan Distance: |x1 - x2| + |y1 - y2|

Index  x  y  From M1(1, 3)  From M2(4, 9)

0 5 4 5 6

1 7 7 10 5

2 1 3 - -

3 8 6 10 7
4 4 9 - -

Cluster 1: 0

Cluster 2: 1, 3

1. Calculation of total cost:


(5) + (5 + 7) = 17
2. Random medoid: (5, 4)

M1(5, 4) and M2(4, 9):

Index  x  y  From M1(5, 4)  From M2(4, 9)

0 5 4 - -

1 7 7 5 5

2 1 3 5 9

3 8 6 5 7

4 4 9 - -

Cluster 1: 2, 3

Cluster 2: 1

1. Calculation of total cost:


(5 + 5) + 5 = 15
Less than the previous cost
New medoid: (5, 4).
2. Random medoid: (7, 7)
M1(5, 4) and M2(7, 7)

Index  x  y  From M1(5, 4)  From M2(7, 7)

0 5 4 - -

1 7 7 - -

2 1 3 5 10

3 8 6 5 2

4 4 9 6 5

Cluster 1: 2

Cluster 2: 3, 4

1. Calculation of total cost:


(5) + (2 + 5) = 12
Less than the previous cost
New medoid: (7, 7).
2. Random medoid: (8, 6)

M1(7, 7) and M2(8, 6)

Index  x  y  From M1(7, 7)  From M2(8, 6)

0 5 4 5 5

1 7 7 - -
2 1 3 10 10

3 8 6 - -

4 4 9 5 7

Cluster 1: 4

Cluster 2: 0, 2

1. Calculation of total cost:


(5) + (5 + 10) = 20
Greater than the previous cost
UNDO
Hence, the final medoids: M1(5, 4) and M2(7, 7)
Cluster 1: 2
Cluster 2: 3, 4
Total cost: 12
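The four cost evaluations above can be checked with a short, self-contained snippet (the variable names are illustrative):

# Check the hand-computed total costs for each medoid pair tried above.
def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

data = [(5, 4), (7, 7), (1, 3), (8, 6), (4, 9)]
trials = [((1, 3), (4, 9)), ((5, 4), (4, 9)), ((5, 4), (7, 7)), ((7, 7), (8, 6))]
for m1, m2 in trials:
    cost = sum(min(manhattan(p, m1), manhattan(p, m2)) for p in data)
    print(m1, m2, cost)
# Prints 17, 15, 12, 20: the swap to (8, 6) raises the cost, so it is undone
# and the final medoids remain (5, 4) and (7, 7) with total cost 12.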
Limitation of PAM:
Time complexity: O(k × (n - k)²) per iteration

Possible medoid/non-medoid swap combinations: k × (n - k)

Cost of evaluating each swap: (n - k)

Total cost per iteration: k × (n - k)²

Hence, PAM is suitable and recommended to be used for small data sets.

CLARA:

It is an extension of PAM that supports medoid clustering for large data sets. This algorithm selects samples from the data set, applies PAM to each sample, and outputs the best clustering found among these samples. This makes it more efficient than PAM on large data. We should ensure that the selected samples aren't biased, as they affect the clustering of the whole data.

CLARANS:
This algorithm selects a sample of neighbors to examine instead of selecting samples from the data set. In every step, it examines the neighbors of the current node. The time complexity of this algorithm is O(n²), and it is considered the most efficient of the three medoid-based algorithms.

Advantages of using K-Medoids:


1. Deals with noise and outlier data effectively
2. Easily implementable and simple to understand
3. Faster compared to other partitioning algorithms

Disadvantages:
1. Not suitable for Clustering arbitrarily shaped groups of data points.
2. As the initial medoids are chosen randomly, the results might vary based on
the choice in different runs.

K-Means and K-Medoids:

Common to both methods:

 Both are partitioning clustering methods.
 Both are unsupervised, iterative algorithms.
 Both have to deal with unlabelled data.
 Both group n objects into k clusters based on similar traits, where k is pre-defined.
 Inputs: unlabelled data and the value of k.

Differences:

 Metric of similarity: K-Means typically uses the Euclidean distance, while K-Medoids typically uses the Manhattan distance.
 In K-Means, clustering is done based on distance from centroids; in K-Medoids, clustering is done based on distance from medoids.
 A centroid can be a data point or some other point in the cluster, whereas a medoid is always a data point in the cluster.
 K-Means can't cope with outlier data; K-Medoids can manage outlier data too.
 In K-Means, outlier sensitivity can sometimes turn out to be useful; K-Medoids has a tendency to ignore meaningful clusters in outlier data.

Useful Outlier Clusters:

Suppose a data set with data on people's incomes is being clustered to analyze and understand individuals' purchasing and investing behavior within each cluster.

Here the outlier data points are people with very high incomes, i.e. billionaires. All such people tend to purchase and invest more, so a separate cluster for billionaires would be useful in this scenario.

K-Medoids, however, merges this data into the upper-class cluster, which loses the meaningful outlier data in the clustering; this is one of the disadvantages of K-Medoids in special situations.

Hierarchical clustering

Hierarchical clustering is separating data into groups based on some measure of similarity,
finding a way to measure how they’re alike and different, and further narrowing down the
data.

Hierarchical Clustering creates clusters in a hierarchical tree-like structure (also called a dendrogram). This means that subsets of similar data are created in a tree-like structure in which the root node corresponds to the entire data set, and branches are created from the root node to form several clusters.


Every kind of clustering has its own purpose and numerous use cases.
Customer Segmentation

In customer segmentation, clustering can help answer the questions:

 Which people belong together?

 How do we group them together?

Social Network Analysis

User personas are a good use of clustering for social networking analysis. We can look for
similarities between people and group them accordingly.

City Planning

Clustering is popular in the realm of city planning. Planners need to check that an industrial zone isn't near a residential area, or that a commercial zone hasn't somehow wound up in the middle of an industrial zone.

An Example of Hierarchical Clustering

Let's consider that we have a set of cars and we want to group similar ones together. Look at
the image shown below:
For starters, we have four cars that we can put into two clusters of car types: sedan and SUV.
Next, we'll bunch the sedans and the SUVs together. For the last step, we can group
everything into one cluster and finish when we’re left with only one cluster.

Types of Hierarchical Clustering

Hierarchical clustering is divided into:

 Agglomerative
 Divisive



Divisive clustering is known as the top-down approach. We take a large cluster and start
dividing it into two, three, four, or more clusters.

Agglomerative Clustering

Agglomerative Hierarchical Clustering is popularly known as the bottom-up approach, wherein each data point or observation is initially treated as its own cluster. Pairs of clusters are then combined until all clusters are merged into one big cluster that contains all the data. Consider it as bringing things together.

In short:

Agglomerative clustering: start with each data point in its own cluster, then aggregate clusters as the distance between them decreases. Whereas,

Divisive clustering: start with all the data points combined in a single cluster, then divide them as the distance between them increases.
Agglomerative Clustering

Agglomerative clustering is a bottom-up approach. It starts by treating each individual data point as a single cluster; clusters are then merged continuously based on similarity until one big cluster containing all objects is formed. It is good at identifying small clusters.

The steps for agglomerative clustering are as follows:

 Compute the proximity matrix using a distance metric.


 Use a linkage function to group objects into a hierarchical cluster tree based on the
computed distance matrix from the above step.
 Data points with close proximity are merged together to form a cluster.
 Repeat steps 2 and 3 until a single cluster remains.

A pictorial representation of the above steps (figure omitted) would proceed as follows:

The data points 1, 2, ..., 6 are initially assigned to individual clusters.

After calculating the proximity matrix, based on the similarity the points 2,3 and 4,5 are
merged together to form clusters.

Again, the proximity matrix is computed and clusters with points 4,5 and 6 are merged
together.

And again, the proximity matrix is computed, then the clusters with points 4,5,6 and 2,3 are
merged together to form a cluster.

As a final step, the remaining clusters are merged together to form a single cluster.
Agglomerative clustering linkage algorithm (Cluster Distance Measure)

This technique is used for combining two clusters. Note that it is the distance between clusters that is measured, not the distance between individual observations.

Calculation of Distance Between Two Clusters


The distance between clusters in agglomerative clustering can be calculated using three
approaches namely single linkage, complete linkage, and average linkage.

 In the single linkage approach, we take the distance between the nearest points in two
clusters as the distance between the clusters.
 In the complete linkage approach, we take the distance between the farthest points in
two clusters as the distance between the clusters.
 In the average linkage approach, we take the average distance between each pair of
points in two given clusters as the distance between the clusters. You can also take the
distance between the centroids of the clusters as their distance from each other.
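To make the three linkage definitions concrete, here is a small Python sketch; the two example clusters and their point values are invented purely for illustration.

# Single, complete, and average linkage between two small example clusters.
from itertools import product
import math

cluster_a = [(1.0, 1.0), (1.5, 2.0)]
cluster_b = [(4.0, 4.0), (5.0, 5.0), (4.5, 3.5)]

pair_dists = [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]

single_linkage   = min(pair_dists)                    # nearest pair of points
complete_linkage = max(pair_dists)                    # farthest pair of points
average_linkage  = sum(pair_dists) / len(pair_dists)  # average over all pairs

print(single_linkage, complete_linkage, average_linkage)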
Step 2: Calculate the distance matrix using the Euclidean method.

Step 3: Look for the least distance and merge those points into a cluster.

(Distance matrix figure omitted.)

We see that the points P3, P4 have the least distance, 0.30232, so we first merge those into a cluster.

Step 4: Re-compute the distance matrix after forming a cluster.

Update the distance from the cluster (P3,P4) to P1:

dist((P3,P4), P1) = Min(dist(P3,P1), dist(P4,P1)) = Min(0.59304, 0.46098) = 0.46098

Update the distance from the cluster (P3,P4) to P2:

dist((P3,P4), P2) = Min(dist(P3,P2), dist(P4,P2)) = Min(0.77369, 0.61612) = 0.61612

Update the distance from the cluster (P3,P4) to P5:

dist((P3,P4), P5) = Min(dist(P3,P5), dist(P4,P5)) = Min(0.45222, 0.35847) = 0.35847

(Updated distance matrix omitted.)

Repeat steps 3 and 4 until we are left with one single cluster.

After re-computing the distance matrix, we again look for the least distance in order to form a cluster.

We see that the points P2, P5 have the least distance, 0.32388, so we group those into a cluster and recompute the distance matrix.

Update the distance from the cluster (P2,P5) to P1:

dist((P2,P5), P1) = Min(dist(P2,P1), dist(P5,P1)) = Min(1.04139, 0.81841) = 0.81841

Update the distance from the cluster (P2,P5) to (P3,P4):

dist((P2,P5), (P3,P4)) = Min(dist(P2,(P3,P4)), dist(P5,(P3,P4))) = Min(0.61612, 0.35847) = 0.35847

After recomputing the distance matrix, we again look for the least distance.

The cluster (P2,P5) has the least distance to the cluster (P3,P4), 0.35847, so we cluster them together.

Update the distance from the cluster (P3,P4,P2,P5) to P1:

dist((P3,P4,P2,P5), P1) = Min(dist((P3,P4),P1), dist((P2,P5),P1)) = Min(0.46098, 0.81841) = 0.46098
With this, we are done with obtaining a single cluster.

Theoretically, below are the clustering steps:

 P3, P4 points have the least distance and are merged

 P2, P5 points have the least distance and are merged

 The clusters (P3, P4), (P2, P5) are clustered

 The cluster (P3, P4, P2, P5) is merged with the datapoint P1

The length of the vertical lines in the dendrogram shows the distance. For example, the
distance between the points P2, P5 is 0.32388.
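The merge order in this walkthrough can be reproduced with SciPy's hierarchical clustering routines, using the ten pairwise distances quoted above assembled into a condensed distance matrix (the ordering of the points P1..P5 here is an assumption for illustration):

# Reproduce the single-linkage merge order from the pairwise distances above.
import numpy as np
from scipy.cluster.hierarchy import linkage

# Condensed distance matrix in the pair order (P1,P2), (P1,P3), (P1,P4), (P1,P5),
# (P2,P3), (P2,P4), (P2,P5), (P3,P4), (P3,P5), (P4,P5).
condensed = np.array([1.04139, 0.59304, 0.46098, 0.81841,
                      0.77369, 0.61612, 0.32388,
                      0.30232, 0.45222,
                      0.35847])

Z = linkage(condensed, method="single")   # single-linkage agglomerative clustering
print(Z)
# Each row of Z records one merge: (P3,P4) at 0.30232, then (P2,P5) at 0.32388,
# then those two clusters at 0.35847, and finally P1 joins at 0.46098.
# scipy.cluster.hierarchy.dendrogram(Z) would draw the corresponding tree.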

What is dendrogram?

A dendrogram is a tree-structured graph used, for example in heat maps, to visualize the result of a hierarchical clustering calculation. The result of a clustering is presented as either the distance or the similarity between the clustered rows or columns, depending on the selected distance measure.

EXAMPLE 2

REFER
https://codinginfinite.com/agglomerative-clustering-numerical-example-advantages-and-
disadvantages/#:~:text=To%20solve%20a%20numerical%20example,each%20point%20in%
20the%20dataset.

Both of these approaches are as shown below:

Agglomerative Clustering using Single Linkage

As we all know, Hierarchical Agglomerative clustering starts with treating each observation

as an individual cluster, and then iteratively merges clusters until all the data points are

merged into a single cluster. Dendrograms are used to represent hierarchical clustering

results.
What is the Distance Measure?

Distance measure determines the similarity between two elements and it influences the shape
of the clusters.

Some of the ways we can calculate distance measures include:

 Euclidean distance measure

 Squared Euclidean distance measure

 Manhattan distance measure

 Cosine distance measure

Euclidean Distance Measure

The Euclidean distance is the most widely used distance measure when the variables are
continuous (either interval or ratio scale).

The Euclidean distance between two points calculates the length of a segment connecting the
two points. It is the most evident way of representing the distance between two points.

The Pythagorean Theorem can be used to calculate the distance between two points. If the points are (x1, y1) and (x2, y2) in 2-dimensional space, then the Euclidean distance between them is:

d = √((x2 - x1)² + (y2 - y1)²)

When there are more than two dimensions, we sum the squared differences along each of the dimensions and then take the square root of that sum to get the actual distance between the points.

Squared Euclidean Distance Measurement

This is identical to the Euclidean measurement method, except we don't take the square root at the end. The formula is:

d² = (x2 - x1)² + (y2 - y1)²

Because dropping the square root does not change which of two distances is smaller and which is larger, the squared Euclidean distance can be used whenever only comparisons are needed, and removing the square root makes the computation faster.

Manhattan Distance Measurement

Euclidean distance may not be suitable when measuring the distance between different locations. If we wanted to measure the distance between two retail stores in a city, the Manhattan distance would be more suitable than the Euclidean distance.

The Manhattan distance is the distance between two points in a grid based on a strictly horizontal and vertical path; it is the simple sum of the horizontal and vertical components:

d = |x1 - x2| + |y1 - y2|

In a nutshell, the Manhattan distance is the distance if you had to travel along the coordinate axes only.
Minkowski Distance

The Minkowski distance between two points X = (x1, ..., xn) and Y = (y1, ..., yn) is defined as:

D(X, Y) = ( Σ |xi - yi|^p )^(1/p)

When p = 1, the Minkowski distance is equivalent to the Manhattan distance, and when p = 2, it is equivalent to the Euclidean distance.

With p = 1 the measure reduces to a simple sum of horizontal and vertical components, i.e. the distance between two points measured along axes at right angles. This is different from the Euclidean case because you are not looking at the direct line between the points, and in certain cases the individual axis-wise distances will give you a better result.

Most of the time, you'll go with the squared Euclidean method because it's faster. But when using the Manhattan distance, you measure the X difference and the Y difference and take the absolute value of each.

Cosine Distance Measure

The cosine similarity measures the angle between two vectors A and B:

cos(θ) = (A · B) / (||A|| ||B||)

and the cosine distance is 1 - cos(θ). As the two vectors separate, the cosine distance becomes greater. This method often behaves similarly to the Euclidean distance measure, and you can expect to get similar results with both of them.

Note that the Manhattan measurement method can produce a very different result. You can end up with bias if your data is very skewed or if both sets of values have a dramatic size difference.
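The distance measures above can be summarized in a few lines of Python; the two example points are invented for illustration.

# Common distance measures between two points (example values are illustrative).
import math

a = [2.0, 10.0]
b = [5.0, 8.0]

euclidean   = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
sq_euclid   = sum((x - y) ** 2 for x, y in zip(a, b))                # no square root
manhattan   = sum(abs(x - y) for x, y in zip(a, b))
minkowski_3 = sum(abs(x - y) ** 3 for x, y in zip(a, b)) ** (1 / 3)  # Minkowski with p = 3

dot   = sum(x * y for x, y in zip(a, b))
cos_d = 1 - dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

print(euclidean, sq_euclid, manhattan, minkowski_3, cos_d)
# For these two points: Euclidean ≈ 3.606, squared Euclidean = 13, Manhattan = 5.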
What is Divisive Clustering?

Divisive Clustering

Divisive Hierarchical Clustering is also termed a top-down clustering approach. In this technique, the entire data set is initially assigned to a single cluster. The cluster is then split repeatedly until there is one cluster for each data point or observation.

The divisive clustering approach begins with the whole set composed of all the data points and divides it into smaller clusters. This can be done using a monothetic divisive method.

But what is a monothetic divisive method?

Let's try to understand it by using the example from the agglomerative clustering section
above. We consider a space with six points in it as we did before.

We name the six points A, B, C, D, E, and F.

Here, we consider the possible splits of these points into two clusters, and for each split we can compute the cluster sum of squares.

Next, we select the split with the largest sum of squares. Let's assume that the sum of squared distances is largest for the split of ABCDEF into ABC and DEF. We split ABC out, and we're left with DEF on the other side. We again find the sums of squared distances and split each group further into clusters.

You can see the hierarchical dendrogram coming down as we start splitting everything apart. It continues to divide until every data point has its own node, or until we reach K clusters (if we have set a K value).

It is based on the core idea that similar objects lie near each other in the data space while dissimilar ones lie far away. It uses distance functions to find nearby data points and group them together as clusters. There are two major types of approaches in hierarchical clustering: agglomerative and divisive.

Divisive Clustering

Divisive clustering works in just the opposite way to agglomerative clustering. It starts by considering all the data points as one big cluster and then keeps splitting them into smaller clusters continuously until every data point is in its own cluster. Thus, it is good at identifying large clusters. It follows a top-down approach and can be more efficient than agglomerative clustering, but due to the complexity of its implementation it does not have a predefined implementation in most of the major machine learning frameworks.
STEPS IN DIVISIVE CLUSTERING

Consider all the data points as a single cluster.

Split into clusters using any flat-clustering method, say K-Means.

Choose the best cluster among the clusters to split further: choose the one that has the largest Sum of Squared Error (SSE).

Repeat steps 2 and 3 until each data point forms its own cluster, or until the desired number of clusters is reached (a code sketch of this procedure is given below).
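These steps can be sketched as a bisecting-k-means-style procedure. This is an illustrative sketch only: the use of scikit-learn's KMeans for the flat split and the SSE-based selection are assumptions consistent with the steps above, not a standard divisive-clustering routine.

# Divisive clustering sketch (bisecting k-means style) -- illustrative only.
import numpy as np
from sklearn.cluster import KMeans

def sse(points):
    # Sum of squared distances of the points to their mean.
    center = points.mean(axis=0)
    return float(((points - center) ** 2).sum())

def divisive(points, n_clusters):
    clusters = [np.asarray(points, dtype=float)]        # step 1: one single cluster
    while len(clusters) < n_clusters:
        # Step 3: pick the cluster with the largest SSE to split next.
        worst = max(range(len(clusters)), key=lambda i: sse(clusters[i]))
        target = clusters.pop(worst)
        # Step 2: split it into two using a flat-clustering method (k-means).
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(target)
        clusters.extend([target[labels == 0], target[labels == 1]])
    return clusters

data = [(2, 10), (2, 5), (8, 4), (5, 8), (7, 5), (6, 4), (1, 2), (4, 9)]
for c in divisive(data, 3):
    print(c.tolist())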

In a pictorial view of these steps (figure omitted):

The data points 1, 2, ..., 6 are first assigned to one large cluster.

After calculating the proximity matrix, the points are split into separate clusters based on their dissimilarity.

The proximity matrix is computed again, repeatedly, until each point is assigned to an individual cluster.

Limits of Hierarchical Clustering

Hierarchical clustering isn’t a fix-all; it does have some limits. Among them:

It has high time and space computational complexity. Computing the proximity matrix takes O(N²) time, and since the search-and-merge step is repeated on the order of N times, the total time complexity is O(N³).

There is no objective function for hierarchical clustering.

Due to high time complexity, it cannot be used for large datasets.

It is sensitive to noise and outliers since we use distance metrics.

It has difficulty handling large clusters.


Model-based clustering

Model-based clustering refers to a class of clustering algorithms that assumes certain


probability models for the data within clusters. These methods attempt to identify
clusters by fitting statistical models to the data, often assuming that the data within
each cluster follows a particular distribution (such as Gaussian distributions or other
probability models).

Unlike some other clustering methods that solely rely on distance measures or
centroids for cluster assignment, model-based clustering approaches assume a
probabilistic model for the data and use statistical inference techniques to estimate
model parameters. They typically involve estimating parameters such as means,
variances, and covariances for the assumed distributions to describe the clusters
within the data.
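As a brief example of the model-based approach, the sketch below fits a Gaussian mixture model with scikit-learn; the model estimates a mean and covariance for each cluster using the EM algorithm described in the next section. The sample data here is invented for illustration.

# Model-based clustering with a Gaussian mixture model (illustrative sketch).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two made-up Gaussian blobs standing in for real data.
data = np.vstack([rng.normal(loc=(0, 0), scale=1.0, size=(50, 2)),
                  rng.normal(loc=(6, 6), scale=1.0, size=(50, 2))])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(data)

print("Estimated means:\n", gmm.means_)              # one mean vector per cluster
print("Estimated covariances:\n", gmm.covariances_)  # one covariance matrix per cluster
print("Soft memberships of the first point:", gmm.predict_proba(data[:1]))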

 FUZZY CLUSTERING
 EXPECTATION-MAXIMIZATION ALGORITHM

EXPECTATION-MAXIMIZATION ALGORITHM

In the real world applications of machine learning, it is very common that there are many
relevant features available for learning but only a small subset of them are observable.

The Expectation-Maximization (EM) algorithm can be used for latent variables (variables that are not directly observable and are actually inferred from the values of the other observed variables).

It is the base for many unsupervised clustering algorithms in the field of machine learning.

The primary goal of the EM algorithm is to use the available observed data of the dataset to
estimate the missing data of the latent variables and then use that data to update the values
of the parameters in the M-step.

What is Convergence in the EM algorithm?


Convergence means that the estimates stop changing appreciably. For example, if two successive estimates of a probability differ by only a very small amount, they are said to have converged. In other words, whenever the values of the given variables essentially match from one iteration to the next, it is called convergence.
Steps in the EM Algorithm
The EM algorithm is completed mainly in 4 steps: the Initialization Step, the Expectation Step, the Maximization Step, and the Convergence Step. These steps are explained as follows:

o 1st Step: The very first step is to initialize the parameter values. Further, the system is
provided with incomplete observed data with the assumption that data is obtained
from a specific model.

o 2nd Step: This step is known as Expectation or E-Step, which is used to estimate or
guess the values of the missing or incomplete data using the observed data. Further,
E-step primarily updates the variables.
o 3rd Step: This step is known as Maximization or M-step, where we use complete data
obtained from the 2nd step to update the parameter values. Further, M-step primarily
updates the hypothesis.
o 4th step: The last step is to check if the values of latent variables are converging or
not. If it gets "yes", then stop the process; else, repeat the process from step 2 until
the convergence occurs.

Imagine the classic example of two biased coins, A and B, whose probabilities of landing heads are θA and θB. Several sequences of flips are recorded, but the labels telling us which coin produced which sequence are gone, so the data from the two coins is all mixed together and cannot be separated directly. This is where Expectation-Maximisation comes into play. The steps involved in the EM algorithm are:
1. Guess random initial estimates of θA and θB between 0 and 1.

2. Use the likelihood function with the current parameter values to see how probable each observed sequence is under each coin.
3. Use these likelihoods to generate a weighting indicating the probability that each sequence was produced using θA or θB. This is called the Expectation step.
4. Add up the total weighted counts of heads and tails across all sequences (call these counts H′ and T′) for both parameter estimates. Produce new estimates for θA and θB using the maximum-likelihood formula H′ / (H′ + T′) (the Maximisation step).
5. Repeat steps 2-4 until each parameter estimate has converged, or a set number of iterations has been reached. The total weight for each sequence should be normalised to 1. (A small implementation sketch is given below.)
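Here is a compact implementation of the two-coin EM procedure described above. It is an illustrative sketch: the flip sequences and the starting values of θA and θB are assumptions borrowed from the standard textbook version of this example, not data from these notes.

# EM for the two-coin example (illustrative sketch; the flip counts are assumed).
import numpy as np
from scipy.stats import binom

# Each row is one recorded sequence: (number of heads, number of tails).
sequences = np.array([(5, 5), (9, 1), (8, 2), (4, 6), (7, 3)])
heads, tails = sequences[:, 0], sequences[:, 1]

theta_a, theta_b = 0.6, 0.5                 # step 1: initial guesses
for _ in range(10):
    # E-step: likelihood of each sequence under each coin, turned into weights.
    like_a = binom.pmf(heads, heads + tails, theta_a)
    like_b = binom.pmf(heads, heads + tails, theta_b)
    w_a = like_a / (like_a + like_b)        # probability each sequence came from coin A
    w_b = 1.0 - w_a
    # M-step: weighted heads/tails counts give new maximum-likelihood estimates H'/(H'+T').
    theta_a = (w_a * heads).sum() / (w_a * (heads + tails)).sum()
    theta_b = (w_b * heads).sum() / (w_b * (heads + tails)).sum()

print(round(theta_a, 2), round(theta_b, 2))
# For this assumed data the estimates stabilize at roughly 0.80 and 0.52.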
In the worked illustration, the estimates stabilize after about ten iterations, at roughly an 80% chance of getting heads and a 20% chance of getting tails for one of the coins. The process is stopped because stable values for coin A and coin B have been achieved.

Advantages of EM algorithm
o It is very easy to implement the two basic steps of the EM algorithm, the E-step and the M-step, in various machine learning problems.
o It is mostly guaranteed that the likelihood will increase after each iteration.
o It often generates a closed-form solution for the M-step.

Disadvantages of EM algorithm
o The convergence of the EM algorithm can be very slow.
o It may converge only to a local optimum.
o It takes both forward and backward probabilities into consideration, in contrast to numerical optimization, which takes only forward probabilities into account.
Clustering is an unsupervised machine learning technique that divides the given data into different clusters based on the distances (similarity) between the data points.
The unsupervised k-means clustering algorithm gives the membership of any point in a particular cluster as either 0 or 1, i.e., either true or false. Fuzzy logic, in contrast, gives fuzzy membership values for a data point in each of the clusters. In fuzzy c-means clustering, we find the centroids of the clusters and then calculate the distance of each data point from the given centroids, repeating until the clusters formed become constant.
Fuzzy Clustering

Fuzzy Clustering is a type of clustering algorithm in machine learning that allows a data point to
belong to more than one cluster with different degrees of membership. Unlike traditional clustering
algorithms, such as k-means or hierarchical clustering, which assign each data point to a single
cluster, fuzzy clustering assigns a membership degree between 0 and 1 for each data point for each
cluster.

Suppose the given data points are {(1, 3), (2, 5), (4, 8), (7, 9)}.

Step 1: Randomly initialize the membership of the data points in the desired number of clusters.
Let's assume there are 2 clusters into which the data is to be divided. Initially, each data point lies in both clusters with some membership value, which can be assumed arbitrarily in the initial state.

The table below shows the data points along with their membership (γ) in each of the two clusters; these are the initial values used in the calculations that follow.

Point            (1, 3)  (2, 5)  (4, 8)  (7, 9)
γ in Cluster 1    0.8     0.7     0.2     0.1
γ in Cluster 2    0.2     0.3     0.8     0.9
Step 2: Find out the centroids.
The formula for finding the centroid (V) is:

Vij = ( Σk (µik)^m · xkj ) / ( Σk (µik)^m )

where µik is the fuzzy membership value of data point k in cluster i, m is the fuzziness parameter (generally taken as 2), and xkj is the j-th coordinate of data point k. Here,

V11 = (0.8² × 1 + 0.7² × 2 + 0.2² × 4 + 0.1² × 7) / (0.8² + 0.7² + 0.2² + 0.1²) = 1.568
V12 = (0.8² × 3 + 0.7² × 5 + 0.2² × 8 + 0.1² × 9) / (0.8² + 0.7² + 0.2² + 0.1²) = 4.051
V21 = (0.2² × 1 + 0.3² × 2 + 0.8² × 4 + 0.9² × 7) / (0.2² + 0.3² + 0.8² + 0.9²) = 5.35
V22 = (0.2² × 3 + 0.3² × 5 + 0.8² × 8 + 0.9² × 9) / (0.2² + 0.3² + 0.8² + 0.9²) = 8.215

Centroids are: (1.568, 4.051) and (5.35, 8.215)
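The centroid computation in Step 2 can be checked with a short snippet (memberships taken from the table above; variable names are illustrative):

# Verify the fuzzy c-means centroids for Step 2.
points = [(1, 3), (2, 5), (4, 8), (7, 9)]
membership = [[0.8, 0.7, 0.2, 0.1],   # memberships in cluster 1
              [0.2, 0.3, 0.8, 0.9]]   # memberships in cluster 2
m = 2                                  # fuzziness parameter

for mu in membership:
    weights = [u ** m for u in mu]
    vx = sum(w * p[0] for w, p in zip(weights, points)) / sum(weights)
    vy = sum(w * p[1] for w, p in zip(weights, points)) / sum(weights)
    print(round(vx, 3), round(vy, 3))
# Prints 1.568 4.051 and 5.348 8.215, i.e. the centroids (1.568, 4.051) and (5.35, 8.215).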


Step 3: Find the distance of each data point from each of the centroids.

Step 4: Update the membership values of each data point using these distances.

Step 5: Repeat steps 2-4 until constant values are obtained for the membership values, or until the difference is less than the tolerance value (a small value up to which the difference in values between two consecutive updates is accepted).

Step 6: Defuzzify the obtained membership values.
