
WELCOME

to the presentation of "Group-2"

PRESENTATION TOPIC: Clustering

COURSE CODE: STAT-309
COURSE TITLE: Data Mining
COURSE TEACHER: AHSANUL HAQUE, Lecturer, Department of Statistics, University of Barishal
Our Team

SMINA AHMED
FARZANA TABASU SHORMY
SHARMISTHA BISWAS SWARNA
ESHITA AKTER MIM
MONIRUL ISLAM RONI
Clustering

Grouping a particular set of objects based on their characteristics, aggregating them according to their similarities.
Huge Dataset

Find common attributes: all data in the same group have similar attributes.

Clustering

Examine the data to form clusters.

Entities in the real world are very complex:

• Products sold on an e-commerce site
• Users of a social media platform
• Readers of an online newspaper
Defining Characteristics Using Numbers

E-commerce products:
• Ratings
• Review sentiment (1 - positive, 0 - negative)
• Category (1 - electronics, 2 - fashion, ...)
• Dimensions (size, height, weight)
• Color

Social media users:
• Score posts, comments, likes, shares
• Score every post by topic (music lovers, sports lovers)
• Activity score (100 - most active, 0 - not active at all)
• Number of connections
• % profile complete
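Once each entity is described by numbers, it becomes a feature vector, and similarity between two entities can be measured with a distance function. A minimal sketch (the products and their feature values below are hypothetical, chosen only to illustrate the idea):

```python
import math

# Hypothetical feature vectors for two e-commerce products:
# [rating, review sentiment (1 positive / 0 negative), category code, weight in kg]
product_a = [4.5, 1, 1, 0.3]  # e.g. an electronics item
product_b = [3.8, 0, 2, 0.5]  # e.g. a fashion item

def euclidean(p, q):
    """Euclidean distance between two feature vectors; smaller means more similar."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(round(euclidean(product_a, product_b), 3))  # → 1.591
```

This distance is what clustering algorithms use to decide which entities belong together.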
REAL LIFE EXAMPLES

Sports Science: professional basketball teams may collect the following information about players:
 Points per game
 Assists per game
 Steals per game

Health Insurance: an actuary may collect the following information about households:
 Total number of doctor visits per year
 Total household size
 Total number of chronic conditions per household
 Average age of household members

Email Marketing: a business may collect the following information about consumers:
 Percentage of emails opened
 Number of clicks per email
 Time spent viewing email
Basic Features:
• The number of clusters is not known.
• There may not be any a priori knowledge concerning the clusters.
• Cluster results are dynamic.
HIERARCHICAL AGGLOMERATIVE CLUSTERING (HAC)

HAC can be represented using three techniques:
• Single: nearest distance or single linkage.
• Complete: farthest distance or complete linkage.
• Average: average distance or average linkage.
Linkage Method: Merits and Demerits

Single
 Merits: can separate non-elliptical shapes as long as the gap between two clusters is not small.
 Demerits: cannot separate the clusters properly if there is noise between clusters.

Complete
 Merits: does well in separating clusters if there is noise between clusters.
 Demerits: biased towards equal variance clusters.

Average
 Merits: balances compactness and connectivity.
 Demerits: computationally intensive.
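To make the three linkage criteria concrete, here is a minimal sketch of agglomerative clustering on 1-D points: start with every point in its own cluster and repeatedly merge the closest pair of clusters under the chosen linkage. The data values are made up for illustration:

```python
def hac(points, linkage, n_clusters):
    """Agglomerative clustering sketch on 1-D points with a chosen linkage."""
    clusters = [[p] for p in points]  # start: every point is its own cluster
    dist = {
        "single":   lambda a, b: min(abs(x - y) for x in a for y in b),
        "complete": lambda a, b: max(abs(x - y) for x in a for y in b),
        "average":  lambda a, b: sum(abs(x - y) for x in a for y in b) / (len(a) * len(b)),
    }[linkage]
    while len(clusters) > n_clusters:
        # find the closest pair of clusters under the linkage and merge them
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] += clusters.pop(j)
    return clusters

print(hac([1, 2, 9, 10, 25], "single", 3))  # → [[1, 2], [9, 10], [25]]
```

In practice, libraries such as SciPy (`scipy.cluster.hierarchy.linkage`) implement these linkage methods far more efficiently.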
K-Means Clustering

• K-Means clustering is an unsupervised iterative clustering technique.
• It partitions the given data set into k predefined distinct clusters.
• A cluster is defined as a collection of data points exhibiting certain similarities.
IT PARTITIONS THE DATA SET SUCH THAT:

• Each data point belongs to the cluster with the nearest mean.
• Data points belonging to one cluster have a high degree of similarity.
• Data points belonging to different clusters have a high degree of dissimilarity.
K-Means Clustering Algorithm involves the following steps:

Step-01:
Choose the number of clusters K.

Step-02:
Randomly select any K data points as cluster centers.
 Select cluster centers in such a way that they are as far as possible from each other.

Step-03:
Calculate the distance between each data point and each cluster center.
The distance may be calculated either by using a given distance function or by using the Euclidean distance formula.

Step-04:
Assign each data point to some cluster.
A data point is assigned to the cluster whose center is nearest to that data point.

Step-05:
Re-compute the centers of the newly formed clusters.
The center of a cluster is computed by taking the mean of all the data points contained in that cluster.

Step-06:
Keep repeating the procedure from Step-03 to Step-05 until any of the following stopping criteria is met:
 Centers of newly formed clusters do not change
 Data points remain in the same cluster
 The maximum number of iterations is reached
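The steps above can be sketched in a few lines of Python. This is a minimal illustration on 1-D data with made-up values, not a production implementation:

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Minimal k-means on 1-D data, following Steps 1-6 above (a sketch)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)                 # Step-02: random initial centers
    for _ in range(max_iter):                       # Step-06: bound on iterations
        clusters = [[] for _ in range(k)]
        for p in points:                            # Steps 03-04: assign each point
            i = min(range(k), key=lambda c: abs(p - centers[c]))
            clusters[i].append(p)
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]  # Step-05: recompute means
        if new_centers == centers:                  # stop: centers unchanged
            break
        centers = new_centers
    return centers, clusters

centers, clusters = kmeans([1.0, 2.0, 9.0, 10.0], k=2)
print(sorted(centers))  # → [1.5, 9.5]
```

On this toy data the algorithm converges to the two obvious group means regardless of which points are picked as initial centers; on harder data, k-means can converge to a local optimum, which is why it is often run several times with different seeds.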
Advantages of k-means:
 Relatively simple to implement.
 Scales to large data sets.
 Guarantees convergence.
 Can warm-start the positions of centroids.
 Easily adapts to new examples.
 Generalizes to clusters of different shapes and sizes, such as elliptical clusters.

Disadvantages:
• It requires specifying the number of clusters (k) in advance.
• It cannot handle noisy data and outliers.
• It is not suitable for identifying clusters with non-convex shapes.
