Machine Learning Chapter 3

Clustering is an unsupervised machine learning technique that groups similar data points into clusters without requiring labeled data. Various algorithms such as K-Means, Hierarchical, and DBSCAN are used for clustering, each with unique applications like customer segmentation, anomaly detection, and image processing. The choice of algorithm depends on the data characteristics and specific use cases.

Clustering & Its Use Cases

What is Clustering?
Clustering is an unsupervised machine learning technique that groups similar data points into clusters
based on their similarities. Unlike classification, clustering does not require labeled data and is used to
identify patterns within datasets. Each cluster represents a group of data points that are more similar to
each other than to those in other clusters.

Types of Clustering Algorithms

K-Means Clustering
Divides the data into ‘K’ clusters using centroids.

Hierarchical Clustering
Forms a hierarchy of clusters using a tree-like structure.

DBSCAN (Density-Based Spatial Clustering)


Groups data points based on density, useful for discovering arbitrary-shaped clusters.

Gaussian Mixture Model (GMM)


Models the data as a mixture of Gaussian distributions, so each point receives a probability of belonging to each cluster rather than a single hard label.
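
K-Means and Hierarchical Clustering are implemented in detail later in this chapter, so the following is a minimal scikit-learn sketch of the other two algorithms on a small synthetic dataset; the parameter values (eps, min_samples, n_components) are illustrative rather than tuned.

import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN
from sklearn.mixture import GaussianMixture

# Non-spherical sample data (two interleaving half-moons)
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# DBSCAN: density-based, no need to specify the number of clusters
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print("DBSCAN clusters found:", len(set(db_labels) - {-1}))  # label -1 marks noise points

# Gaussian Mixture Model: soft (probabilistic) cluster assignments
gmm = GaussianMixture(n_components=2, random_state=42).fit(X)
probs = gmm.predict_proba(X)          # membership probability per cluster
print("First point's cluster probabilities:", probs[0])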

Use Cases of Clustering

Customer Segmentation
Businesses use clustering to segment customers based on purchasing behavior, demographics, or
interests.
Example: E-commerce platforms categorize customers to provide personalized recommendations.

Anomaly Detection
Clustering helps detect outliers or fraudulent transactions.
Example: Banks and credit card companies use clustering to flag suspicious activities.

Image Segmentation
Clustering is used in image processing to separate objects in images.
Example: Medical imaging to identify tumors in MRI scans.

Document Clustering
Organizing documents based on topics using clustering techniques.
Example: News aggregation websites group articles with similar content.

Social Network Analysis


Clustering helps identify communities in social networks.
Example: Facebook & LinkedIn recommend friends or connections based on clustering.

Biological Data Analysis


Used in genetics to group similar DNA sequences or cell types.
Example: Cancer research for detecting similar genetic expressions.
Conclusion
Clustering is a powerful technique in machine learning that helps uncover patterns in unlabeled
datasets. From customer segmentation to medical imaging, clustering plays a vital role in various
industries. The choice of clustering algorithm depends on the nature of the data and the specific
problem at hand.

K-Means Clustering

Introduction
K-Means is one of the most popular unsupervised learning algorithms used for clustering. It partitions
data into K clusters, where each data point belongs to the cluster with the nearest mean (centroid). The
algorithm iteratively refines the clusters until the centroids stabilize.

How K-Means Clustering Works?


1. Select the number of clusters (K): Choose the number of clusters based on domain knowledge or use
methods like the Elbow Method.

2. Initialize Centroids: Randomly select K data points as initial cluster centroids.

3. Assign Points to Clusters: Each data point is assigned to the nearest centroid.

4. Recalculate Centroids: Compute the new mean of data points in each cluster.

5. Repeat Steps 3 & 4 until centroids no longer change significantly.
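
The following is a minimal NumPy sketch of steps 2-5; the function name and parameters are illustrative, and edge cases such as empty clusters are not handled.

import numpy as np

def kmeans(X, k, n_iters=100, seed=42):
    """Minimal K-Means: random init, assign to nearest centroid, recompute means."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]   # Step 2: random initial centroids
    for _ in range(n_iters):
        # Step 3: assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points
        # (empty clusters are not handled in this sketch)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop when the centroids no longer change significantly
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Example usage on random 2-D data
X = np.random.default_rng(0).normal(size=(200, 2))
labels, centroids = kmeans(X, k=3)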

Use Cases of K-Means Clustering

Customer Segmentation
Used in e-commerce and marketing to categorize customers based on purchasing behavior.

Image Compression
Reduces image size by clustering similar colors together (color quantization); see the sketch after this list.

Anomaly Detection
Detects fraudulent transactions or network intrusions by identifying outliers.

Document Clustering
Groups similar articles or documents based on topic.

Biological Data Analysis


Clusters genes or disease types in medical research.
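
As an illustration of the image-compression use case, here is a short color-quantization sketch using scikit-learn's bundled sample image; the choice of 16 colors and the size of the pixel subsample are arbitrary.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_sample_image

# Load a bundled sample image and flatten it into a list of RGB pixels
image = load_sample_image("china.jpg") / 255.0        # shape (height, width, 3)
pixels = image.reshape(-1, 3)

# Fit K-Means on a random subsample of pixels for speed, then quantize all pixels
sample = pixels[np.random.default_rng(0).choice(len(pixels), 10_000, replace=False)]
kmeans = KMeans(n_clusters=16, n_init=10, random_state=42).fit(sample)
quantized = kmeans.cluster_centers_[kmeans.predict(pixels)].reshape(image.shape)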

Python Implementation of K-Means Clustering


import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
# Generate sample data
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)

# Apply K-Means clustering


k = 4 # Number of clusters
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
y_kmeans = kmeans.fit_predict(X)

# Plot results
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, cmap='viridis', marker='o', edgecolors='k', alpha=0.75)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='red', marker='X',
label='Centroids')
plt.title('K-Means Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()

Explanation of the Code


1. Generate Data: We create a synthetic dataset with 4 clusters using `make_blobs`.
2. Apply K-Means: We set `K=4` and fit the K-Means model to the data.
3. Cluster Assignment: The model assigns each point to a cluster.
4. Plot the Clusters: The clusters are visualized along with their centroids.

How Does the K-Means Algorithm Work?


The K-Means algorithm is an unsupervised machine learning algorithm used for clustering data points
into groups based on similarity. It works by partitioning a dataset into K clusters, where each data point
belongs to the nearest cluster center (centroid).

1. Steps in the K-Means Algorithm

Step 1: Choose the Number of Clusters (K)


Decide the number of clusters (K) you want to divide the data into. The choice of K is crucial and can be
determined using techniques like the Elbow Method or Silhouette Score.

Step 2: Initialize Cluster Centroids


Randomly select K points from the dataset as the initial cluster centers (centroids). These centroids act
as the starting points for defining the clusters.

Step 3: Assign Data Points to the Nearest Centroid


Each data point is assigned to the nearest centroid using the Euclidean distance formula:
d = √((x₂ − x₁)² + (y₂ − y₁)²)
This forms K clusters where each point belongs to the closest centroid.
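
As a small worked example (with made-up coordinates), the distance from one point to two candidate centroids can be computed directly:

import numpy as np

point = np.array([2.0, 3.0])
centroids = np.array([[1.0, 1.0],   # centroid 0
                      [5.0, 4.0]])  # centroid 1

# Euclidean distance from the point to each centroid
distances = np.linalg.norm(centroids - point, axis=1)
nearest = distances.argmin()
print(distances)                      # approximately [2.24, 3.16]
print("Assigned to centroid", nearest)  # centroid 0 is closer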

Step 4: Compute New Centroids


For each cluster, calculate the mean of all points in that cluster. The new centroid is the average position
of all points in the cluster.
Step 5: Repeat Steps 3 & 4 Until Convergence
The assignments and centroid updates repeat until the centroids stop changing or the changes are
minimal. The algorithm converges when points no longer switch clusters.

2. Example of K-Means Clustering


Imagine we have a dataset of customers and want to cluster them based on spending patterns:
1. Select K = 3 (for three customer groups: low, medium, and high spenders).
2. Randomly pick 3 centroids from the dataset.
3. Assign each customer to the nearest centroid based on spending behavior.
4. Compute new centroids based on the average spending of each cluster.
5. Repeat until clusters stabilize.
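
A minimal sketch of this scenario with hypothetical spending figures might look like this:

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical monthly spending amounts (in dollars) for nine customers
spending = np.array([[120], [150], [130],      # low spenders
                     [480], [520], [500],      # medium spenders
                     [950], [1020], [980]])    # high spenders

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(spending)
print(kmeans.labels_)            # cluster index for each customer
print(kmeans.cluster_centers_)   # average spending of each group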

3. Choosing the Optimal K (Elbow Method)


The Elbow Method helps find the best K by plotting the Sum of Squared Errors (SSE) for different values
of K. SSE decreases as K increases, but after a certain point, the reduction becomes insignificant (elbow
point). The K at this elbow point is chosen as the optimal number of clusters.
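
A short sketch of the Elbow Method using scikit-learn's inertia_ attribute (the SSE) on the same kind of synthetic dataset used earlier:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)

# Fit K-Means for a range of K values and record the SSE (inertia_)
sse = []
k_values = range(1, 11)
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    sse.append(km.inertia_)   # sum of squared distances to the closest centroid

plt.plot(k_values, sse, marker='o')
plt.title('Elbow Method')
plt.xlabel('Number of clusters (K)')
plt.ylabel('SSE (inertia)')
plt.show()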

4. Applications of K-Means
• Customer Segmentation (E-commerce, banking)
• Image Segmentation (Grouping similar pixels in images)
• Anomaly Detection (Detecting fraud in transactions)
• Genetics (Clustering genes with similar expressions)

C-Means Clustering (Fuzzy C-Means Algorithm)


C-Means Clustering, specifically Fuzzy C-Means (FCM), is an unsupervised machine learning algorithm
used for clustering data points into groups. Unlike K-Means, where each data point belongs strictly to
one cluster, FCM allows a point to belong to multiple clusters with different degrees of membership.

1. How Does C-Means Clustering Work?

Each data point receives a membership value between 0 and 1 for every cluster, indicating how strongly it belongs to that cluster (the memberships for a point sum to 1). The algorithm alternates between two steps: cluster centers are recomputed as membership-weighted means of all points, and memberships are then updated from each point's distance to the centers. A fuzzifier parameter m (commonly set to 2) controls how soft the assignments are. The process repeats until the memberships stabilize.

2. Difference Between K-Means and Fuzzy C-Means
Feature              K-Means Clustering                        Fuzzy C-Means Clustering
Membership           Hard assignment (0 or 1)                  Soft assignment (values between 0 and 1)
Cluster Assignment   Each point belongs to one cluster only    Each point belongs to multiple clusters
Flexibility          Rigid clustering                          More flexible clustering
Best for             Well-separated clusters                   Overlapping clusters
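
Scikit-learn does not ship a Fuzzy C-Means implementation, so the following is a minimal NumPy sketch of the standard update rules described above (membership-weighted centroids, then memberships from distances); the function name, the fuzzifier m = 2, and the fixed iteration count are illustrative choices.

import numpy as np

def fuzzy_c_means(X, c=3, m=2.0, n_iters=100, seed=42):
    """Minimal Fuzzy C-Means sketch: soft memberships instead of hard labels."""
    rng = np.random.default_rng(seed)
    # Initialize random memberships so that each row sums to 1
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iters):
        # Update centroids as membership-weighted means (fuzzifier m controls softness)
        W = U ** m
        centroids = (W.T @ X) / W.sum(axis=0)[:, None]
        # Update memberships from distances to the centroids
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2) + 1e-10
        U = 1.0 / (d ** (2 / (m - 1)))
        U /= U.sum(axis=1, keepdims=True)
    return U, centroids

# Example: each row of U gives one point's degree of membership in each cluster
X = np.random.default_rng(0).normal(size=(200, 2))
U, centroids = fuzzy_c_means(X, c=3)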

3. Applications of Fuzzy C-Means Clustering


• Image Segmentation (Medical imaging, object detection)
• Pattern Recognition (Speech recognition, handwriting recognition)
• Anomaly Detection (Fraud detection, cybersecurity)
• Data Compression (Reducing data complexity in large datasets)

Hierarchical Clustering

Introduction
Hierarchical Clustering is an unsupervised machine learning algorithm used to group data
into clusters. Unlike K-Means, it does not require specifying the number of clusters in
advance. Instead, it creates a hierarchy of clusters that can be visualized using a
dendrogram.

Types of Hierarchical Clustering

Agglomerative Clustering (Bottom-Up Approach)


Each data point starts as its own cluster. Clusters are merged step by step until only one
remains. This is the most commonly used approach.

Divisive Clustering (Top-Down Approach)


Starts with a single large cluster and splits recursively into smaller clusters.

How Agglomerative Hierarchical Clustering Works?


1. Assign Each Data Point as an Individual Cluster.
2. Compute Distance Between Clusters using metrics like Euclidean, Manhattan, or Cosine
distance.

3. Merge the Closest Clusters.

4. Repeat Steps 2 & 3 Until One Cluster Remains.

5. Use a Dendrogram to Decide the Optimal Number of Clusters.

Use Cases of Hierarchical Clustering

Customer Segmentation
Grouping customers based on purchasing behavior or demographics.

Bioinformatics
Identifying genetic similarity or DNA sequence analysis.

Document Clustering
Organizing articles or research papers into categories.

Anomaly Detection
Detecting fraud in financial transactions or network intrusions.

Python Implementation of Hierarchical Clustering


import numpy as np
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as sch
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Generate sample data


X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)

# Create a Dendrogram to visualize cluster hierarchy


plt.figure(figsize=(10, 5))
sch.dendrogram(sch.linkage(X, method='ward'))
plt.title('Dendrogram for Hierarchical Clustering')
plt.xlabel('Data Points')
plt.ylabel('Euclidean Distance')
plt.show()

# Apply Agglomerative Clustering


hc = AgglomerativeClustering(n_clusters=4, linkage='ward')  # Ward linkage uses Euclidean distance
y_hc = hc.fit_predict(X)

# Plot Clusters
plt.scatter(X[:, 0], X[:, 1], c=y_hc, cmap='viridis', marker='o', edgecolors='k', alpha=0.75)
plt.title('Hierarchical Clustering Result')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()

Explanation of the Code


1. Generate Data: We create a synthetic dataset with 4 clusters using `make_blobs`.
2. Create a Dendrogram: We use `scipy.cluster.hierarchy.dendrogram` to visualize the
hierarchy.
3. Apply Hierarchical Clustering: We use `AgglomerativeClustering` from `sklearn` with
`n_clusters=4`.
4. Plot the Clusters: The final clusters are visualized in a scatter plot.

How Hierarchical Clustering Works?

Introduction
Hierarchical Clustering is a bottom-up (agglomerative) or top-down (divisive) clustering
algorithm that groups similar data points into a hierarchy of clusters. It is commonly
visualized using a dendrogram, which shows the merging or splitting of clusters at different
levels.

Step-by-Step Working of Agglomerative Hierarchical Clustering

1. Assign Each Data Point as an Individual Cluster


Initially, each data point is treated as its own cluster. If there are N data points, we start
with N clusters.

2. Compute the Distance Between Clusters


Calculate the distance between every pair of clusters using methods like:
- Euclidean Distance (default)
- Manhattan Distance
- Cosine Similarity
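
For example, SciPy's linkage function accepts these metrics directly (the method and metric values shown are illustrative; Ward linkage, used later in this chapter, requires Euclidean distance):

import scipy.cluster.hierarchy as sch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=42)

# Pairwise cluster distances can be computed with different metrics and linkages
Z_euclidean = sch.linkage(X, method='average', metric='euclidean')
Z_manhattan = sch.linkage(X, method='average', metric='cityblock')  # Manhattan distance
Z_cosine    = sch.linkage(X, method='average', metric='cosine')
Z_ward      = sch.linkage(X, method='ward')   # Ward requires Euclidean distance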

3. Merge the Two Closest Clusters


Find the two clusters that have the smallest distance and merge them. This reduces the
number of clusters from N to N-1.

4. Repeat Until One Cluster Remains


The process continues iteratively, merging the closest clusters at each step. A dendrogram
visually represents these merging steps.
5. Use a Dendrogram to Decide the Optimal Number of Clusters
The dendrogram helps in selecting the best number of clusters by setting a threshold for
cutting the tree. The larger vertical gaps in the dendrogram suggest natural cluster
divisions.
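
A minimal sketch of cutting the tree at a threshold with scipy.cluster.hierarchy.fcluster; the distance threshold of 10 is an arbitrary illustrative value:

import scipy.cluster.hierarchy as sch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)

# Build the linkage matrix (the same one used to draw the dendrogram)
Z = sch.linkage(X, method='ward')

# Cut the tree at a chosen distance threshold; merges above the threshold are undone
labels = sch.fcluster(Z, t=10, criterion='distance')
print("Number of clusters at threshold 10:", len(set(labels)))

# Alternatively, ask directly for a fixed number of clusters
labels_4 = sch.fcluster(Z, t=4, criterion='maxclust')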

Advantages of Hierarchical Clustering


✅ Does not require specifying the number of clusters in advance.

✅ Produces a hierarchy of clusters (useful for detailed analysis).

✅ Can be visualized using a dendrogram.

Limitations of Hierarchical Clustering


❌ Computationally expensive for large datasets (O(n² log n) complexity).

❌ Sensitive to noisy data and outliers.
