Machine Learning Chapter 3
What is Clustering?
Clustering is an unsupervised machine learning technique that groups data points into clusters
based on their similarity. Unlike classification, clustering does not require labeled data; it is used to
discover patterns within datasets. Each cluster represents a group of data points that are more similar to
each other than to those in other clusters.
K-Means Clustering
Divides the data into ‘K’ clusters using centroids.
Hierarchical Clustering
Forms a hierarchy of clusters using a tree-like structure.
Customer Segmentation
Businesses use clustering to segment customers based on purchasing behavior, demographics, or
interests.
Example: E-commerce platforms categorize customers to provide personalized recommendations.
Anomaly Detection
Clustering helps detect outliers or fraudulent transactions.
Example: Banks and credit card companies use clustering to flag suspicious activities.
Image Segmentation
Clustering is used in image processing to separate objects in images.
Example: Medical imaging to identify tumors in MRI scans.
Document Clustering
Organizing documents based on topics using clustering techniques.
Example: News aggregation websites group articles with similar content.
K-Means Clustering
Introduction
K-Means is one of the most popular unsupervised learning algorithms used for clustering. It partitions
data into K clusters, where each data point belongs to the cluster with the nearest mean (centroid). The
algorithm iteratively refines the clusters until the centroids stabilize.
The algorithm proceeds as follows (a minimal sketch of this loop appears after the steps):
1. Choose K: Decide how many clusters to form.
2. Initialize Centroids: Pick K initial centroids, for example K randomly chosen data points.
3. Assign Points to Clusters: Each data point is assigned to the nearest centroid.
4. Recalculate Centroids: Compute the new mean of the data points in each cluster.
5. Repeat: Reassign points and recompute centroids until the centroids stop moving (convergence).
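To make the assign/recalculate loop concrete, here is a minimal NumPy sketch of the algorithm. The names (kmeans_sketch, X, k) are illustrative choices, not from the notes, and empty clusters are not handled for brevity.

import numpy as np

def kmeans_sketch(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: initialize centroids by picking k distinct data points at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 3: assign each point to the nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

In practice, scikit-learn's KMeans (used in the example later in this chapter) adds smarter initialization (k-means++) and multiple restarts on top of this basic loop.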
Customer Segmentation
Used in e-commerce and marketing to categorize customers based on purchasing behavior.
Image Compression
Reduces image size by clustering similar colors together (color quantization); a short sketch appears after these applications.
Anomaly Detection
Detects fraudulent transactions or network intrusions by identifying outliers.
Document Clustering
Groups similar articles or documents based on topic.
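As a concrete illustration of the image-compression use above, the sketch below clusters an image's pixel colors with K-Means and rebuilds the image from the centroid colors. The file name 'photo.jpg' and the choice of 16 colors are placeholders, not from the notes.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# 'photo.jpg' is a placeholder path; any RGB image will do
image = plt.imread('photo.jpg') / 255.0      # shape (height, width, 3), values in [0, 1]
pixels = image.reshape(-1, 3)                # one row per pixel: (R, G, B)

# Cluster the pixel colors into 16 groups and replace each pixel by its centroid color
kmeans = KMeans(n_clusters=16, n_init=10, random_state=0).fit(pixels)
quantized = kmeans.cluster_centers_[kmeans.labels_].reshape(image.shape)

plt.imshow(quantized)
plt.title('Image quantized to 16 colors')
plt.axis('off')
plt.show()

Raising n_clusters keeps more of the original colors at the cost of less compression.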
# Generate sample data and fit K-Means (the synthetic make_blobs data are assumed here;
# the original dataset used in the notes is not shown)
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
y_kmeans = kmeans.fit_predict(X)

# Plot results
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, cmap='viridis', marker='o', edgecolors='k', alpha=0.75)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='red', marker='X', label='Centroids')
plt.title('K-Means Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()
4. Applications of K-Means
• Customer Segmentation (E-commerce, banking)
• Image Segmentation (Grouping similar pixels in images)
• Anomaly Detection (Detecting fraud in transactions; a short sketch follows this list)
• Genetics (Clustering genes with similar expressions)
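For the anomaly-detection use listed above, one simple approach is to fit K-Means and flag points that lie unusually far from their assigned centroid. The sketch below uses assumed synthetic data, and the 95th-percentile threshold is an illustrative choice.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Distance from each point to the centroid of its own cluster
distances = np.linalg.norm(X - kmeans.cluster_centers_[kmeans.labels_], axis=1)

# Flag the top 5% of distances as potential anomalies (threshold is illustrative)
threshold = np.percentile(distances, 95)
anomalies = X[distances > threshold]
print(f"Flagged {len(anomalies)} potential anomalies out of {len(X)} points")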
Hierarchical Clustering
Introduction
Hierarchical Clustering is an unsupervised machine learning algorithm used to group data
into clusters. Unlike K-Means, it does not require specifying the number of clusters in
advance. Instead, it creates a hierarchy of clusters that can be visualized using a
dendrogram.
Customer Segmentation
Grouping customers based on purchasing behavior or demographics.
Bioinformatics
Identifying genetic similarity or DNA sequence analysis.
Document Clustering
Organizing articles or research papers into categories.
Anomaly Detection
Detecting fraud in financial transactions or network intrusions.
# Fit agglomerative hierarchical clustering (synthetic make_blobs data are assumed here;
# the original dataset used in the notes is not shown)
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
y_hc = AgglomerativeClustering(n_clusters=3, linkage='ward').fit_predict(X)

# Plot Clusters
plt.scatter(X[:, 0], X[:, 1], c=y_hc, cmap='viridis', marker='o', edgecolors='k', alpha=0.75)
plt.title('Hierarchical Clustering Result')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
Hierarchical clustering can proceed bottom-up (agglomerative), where each point starts in its own
cluster and the closest clusters are merged step by step, or top-down (divisive), where one large
cluster is repeatedly split. The result is commonly visualized with a dendrogram, which shows the
merges or splits of clusters at different levels.
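To show what such a dendrogram looks like in code, here is a minimal sketch using SciPy's linkage and dendrogram functions on assumed synthetic data.

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import make_blobs

# Synthetic data are assumed here; a small sample keeps the dendrogram readable
X, _ = make_blobs(n_samples=30, centers=3, random_state=42)

# Ward linkage merges the pair of clusters giving the smallest increase in within-cluster variance
Z = linkage(X, method='ward')
dendrogram(Z)
plt.title('Dendrogram (Ward linkage)')
plt.xlabel('Sample index')
plt.ylabel('Merge distance')
plt.show()

Cutting the dendrogram at a chosen height yields a flat clustering, which is how the number of clusters can be decided after the fact.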