Getting to Know Your Data
◼ Data Objects and Attribute Types
◼ Basic Statistical Descriptions of Data
◼ Data Visualization
◼ Measuring Data Similarity and
Dissimilarity
◼ Summary
Types of Data Sets
◼ Record
  ◼ Relational records
  ◼ Data matrix, e.g., numerical matrix, crosstabs
  ◼ Document data: text documents represented as term-frequency vectors, e.g.:

        term:        team  coach  play  ball  score  game  win  lost  timeout  season
        Document 1     3     0      5     0     2      6     0    2      0        2
        Document 2     0     7      0     2     1      0     0    3      0        0
        Document 3     0     1      0     0     1      2     2    0      3        0

  ◼ Transaction data, e.g.:

        TID   Items
        1     Bread, Coke, Milk
        2     Beer, Bread
        3     Beer, Coke, Diaper, Milk
        4     Beer, Bread, Diaper, Milk
        5     Coke, Diaper, Milk

◼ Graph and network
  ◼ World Wide Web
  ◼ Social or information networks
  ◼ Molecular structures
◼ Ordered
  ◼ Video data: sequence of images
  ◼ Temporal data: time-series
  ◼ Sequential data: transaction sequences
  ◼ Genetic sequence data
◼ Spatial, image and multimedia
  ◼ Spatial data: maps
  ◼ Image data
  ◼ Video data
Important Characteristics of Structured Data
◼ Dimensionality
  ◼ Curse of dimensionality
◼ Sparsity
  ◼ Only presence counts
◼ Resolution
  ◼ Patterns depend on the scale
◼ Distribution
  ◼ Centrality and dispersion
Data Objects
◼ Data sets are made up of data objects.
◼ A data object represents an entity.
◼ Examples:
◼ sales database: customers, store items,
sales
◼ medical database: patients, treatments
◼ university database: students, professors,
courses
◼ Also called samples, examples, instances, data points, objects, or tuples.
◼ Data objects are described by attributes.
Attributes
◼ Attribute (or dimension, feature, variable): a data field representing a characteristic or feature of a data object.
  ◼ E.g., customer_ID, name, address
◼ Types:
◼ Nominal
◼ Binary
◼ Numeric: quantitative
◼ Interval-scaled
◼ Ratio-scaled
Attribute Types
◼ Nominal: categories, states, or “names of things”
◼ Hair_color = {auburn, black, blond, brown, grey, red, white}
◼ marital status, occupation, ID numbers, zip codes
◼ Binary
◼ Nominal attribute with only 2 states (0 and 1)
◼ Symmetric binary: both outcomes equally important
◼ e.g., gender
◼ Asymmetric binary: outcomes not equally important.
◼ e.g., medical test (positive vs. negative)
◼ Convention: assign 1 to most important outcome
(e.g., HIV positive)
◼ Ordinal
  ◼ Values have a meaningful order (ranking), but the magnitude between successive values is not known
  ◼ Size = {small, medium, large}, grades, army rankings
Numeric Attribute Types
◼ Quantity (integer or real-valued)
◼ Interval
  ◼ Measured on a scale of equal-sized units
  ◼ Values have order
  ◼ E.g., temperature in C° or F°, calendar dates
  ◼ No true zero-point
◼ Ratio
  ◼ Inherent zero-point
  ◼ We can speak of values as being an order of magnitude larger than the unit of measurement (10 K is twice as high as 5 K)
Discrete vs. Continuous Attributes
◼ Discrete Attribute
  ◼ Has only a finite or countably infinite set of values
  ◼ E.g., zip codes, profession, or the set of words in a collection of documents
◼ Sometimes, represented as integer variables
◼ Note: Binary attributes are a special case of
discrete attributes
◼ Continuous Attribute
◼ Has real numbers as attribute values
◼ E.g., temperature, height, or weight
  ◼ Practically, real values can only be measured and represented using a finite number of digits
Getting to Know Your Data
◼ Data Objects and Attribute Types
◼ Basic Statistical Descriptions of Data
◼ Data Visualization
◼ Measuring Data Similarity and
Dissimilarity
◼ Summary
Basic Statistical Descriptions of Data
◼ Motivation
  ◼ To better understand the data: central tendency, variation, and spread
◼ Data dispersion characteristics
  ◼ Median, max, min, quantiles, outliers, variance, etc.
◼ Numerical dimensions correspond to sorted intervals
  ◼ Data dispersion: analyzed with multiple granularities of precision
  ◼ Boxplot or quantile analysis on sorted intervals
Measuring the Central Tendency
◼ Mean (algebraic measure) (sample vs. population), where n is the sample size and N is the population size:

      x̄ = (1/n) · Σ_{i=1..n} x_i          μ = (Σ x) / N

◼ Weighted arithmetic mean:

      x̄ = Σ_{i=1..n} w_i x_i / Σ_{i=1..n} w_i

◼ Trimmed mean: chopping extreme values before averaging
◼ Median:
  ◼ Middle value if there is an odd number of values; average of the middle two values otherwise
  ◼ Estimated by interpolation (for grouped data)
◼ Mode
  ◼ Value that occurs most frequently in the data
  ◼ Unimodal, bimodal, trimodal
  ◼ Empirical formula: mean − mode ≈ 3 × (mean − median)
(These measures are illustrated in the Python sketch below.)
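As a concrete illustration, here is a minimal Python sketch of the central-tendency measures above, using only the standard library; the data values and the helper name trimmed_mean are illustrative assumptions, not from the slides.

```python
# Central tendency: mean, weighted mean, trimmed mean, median, mode
from statistics import mean, median, multimode

data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]   # illustrative values
weights = [1.0] * len(data)                                 # illustrative weights

print(mean(data))                        # arithmetic mean: (1/n) * sum(x_i) -> 58.0

# Weighted mean: sum(w_i * x_i) / sum(w_i)
print(sum(w * x for w, x in zip(weights, data)) / sum(weights))

def trimmed_mean(values, k=1):
    """Drop the k smallest and k largest values, then average the rest."""
    return mean(sorted(values)[k:len(values) - k])

print(trimmed_mean(data, k=1))           # 55.6
print(median(data))                      # average of the two middle values -> 54.0
print(multimode(data))                   # most frequent value(s) -> [52, 70] (bimodal)
```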
Symmetric vs. Skewed Data
◼ Median, mean, and mode of symmetric, positively skewed, and negatively skewed data
  (Figure: distribution curves for symmetric, positively skewed, and negatively skewed data.)
Measuring the Dispersion of Data
◼ Quartiles, outliers and boxplots
◼ Quartiles: Q1 (25th percentile), Q3 (75th percentile)
◼ Inter-quartile range: IQR = Q3 – Q1
◼ Five number summary: min, Q1, median, Q3, max
◼ Boxplot: ends of the box are the quartiles; median is
marked; add whiskers, and plot outliers individually
  ◼ Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
◼ Variance and standard deviation (sample: s, population: σ)
  ◼ Variance (algebraic, scalable computation): s² = (1/(n−1)) · Σ (x_i − x̄)²,   σ² = (1/N) · Σ (x_i − μ)²
  ◼ Standard deviation s (or σ) is the square root of the variance s² (or σ²)
(See the sketch after this list.)
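The following minimal Python sketch (assumed data, not from the slides) computes the five-number summary, IQR, the 1.5 × IQR outlier rule, and the sample and population variance/standard deviation described above.

```python
# Dispersion: quartiles, five-number summary, IQR-based outliers, variance, std dev
from statistics import quantiles, variance, pvariance, stdev, pstdev

data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]    # illustrative values

q1, med, q3 = quantiles(data, n=4)       # quartiles (default 'exclusive' method)
iqr = q3 - q1
print("five-number summary:", (min(data), q1, med, q3, max(data)), "IQR =", iqr)

# Outlier rule from the slide: more than 1.5 * IQR below Q1 or above Q3
print("outliers:", [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr])

print("sample:    ", variance(data), stdev(data))    # s^2 and s
print("population:", pvariance(data), pstdev(data))  # sigma^2 and sigma
```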
Chapter 2: Getting to Know Your Data
◼ Data Objects and Attribute Types
◼ Basic Statistical Descriptions of Data
◼ Measuring Data Similarity and
Dissimilarity
◼ Summary
Similarity and Dissimilarity
◼ Similarity
◼ Numerical measure of how alike two data
objects are
◼ Value is higher when objects are more alike
◼ Often falls in the range [0,1]
◼ Dissimilarity (e.g., distance)
◼ Numerical measure of how different two data
objects are
◼ Lower when objects are more alike
◼ Minimum dissimilarity is often 0
◼ Upper limit varies
◼ Proximity refers to a similarity or dissimilarity measure
Data Matrix and Dissimilarity Matrix
◼ Data matrix
  ◼ n data points with p dimensions
  ◼ Two modes (rows and columns represent different entities: objects and attributes)

        [ x_11  ...  x_1f  ...  x_1p ]
        [  ...  ...   ...  ...   ... ]
        [ x_i1  ...  x_if  ...  x_ip ]
        [  ...  ...   ...  ...   ... ]
        [ x_n1  ...  x_nf  ...  x_np ]

◼ Dissimilarity matrix
  ◼ n data points, but registers only the distance between each pair
  ◼ A triangular matrix

        [   0                              ]
        [ d(2,1)    0                      ]
        [ d(3,1)  d(3,2)    0              ]
        [   :       :       :              ]
        [ d(n,1)  d(n,2)   ...   ...   0   ]
Proximity Measure for Nominal Attributes
◼ Can take 2 or more states, e.g., red, yellow, blue, green (generalization of a binary attribute)
◼ Method 1: Simple matching
  ◼ m: # of matches, p: total # of variables

      d(i, j) = (p − m) / p

◼ Method 2: Use a large number of binary attributes
  ◼ Creating a new binary attribute for each of the M nominal states
(Simple matching is illustrated in the sketch below.)
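A minimal Python sketch of Method 1 (simple matching) follows; the two example objects are made up for illustration.

```python
# Simple matching dissimilarity for nominal attributes: d(i, j) = (p - m) / p
def nominal_dissimilarity(obj_i, obj_j):
    """m = number of matching attributes, p = total number of attributes."""
    p = len(obj_i)
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)
    return (p - m) / p

obj_i = ["black", "single", "10001"]     # hair color, marital status, zip code
obj_j = ["black", "married", "10001"]
print(nominal_dissimilarity(obj_i, obj_j))   # 1 mismatch out of 3 -> 0.333...
```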
Proximity Measure for Binary Attributes
◼ A contingency table for binary data (counts over the p binary attributes of objects i and j):

                        Object j
                        1      0
        Object i   1    q      r
                   0    s      t

◼ Distance measure for symmetric binary variables:

      d(i, j) = (r + s) / (q + r + s + t)

◼ Distance measure for asymmetric binary variables:

      d(i, j) = (r + s) / (q + r + s)

◼ Jaccard coefficient (similarity measure for asymmetric binary variables):

      sim_Jaccard(i, j) = q / (q + r + s)

◼ Note: the Jaccard coefficient is the same as "coherence"
Dissimilarity between Binary Variables
◼ Example:

      Name   Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
      Jack   M       Y      N      P       N       N       N
      Mary   F       Y      N      P       N       P       N
      Jim    M       Y      P      N       N       N       N

◼ Gender is a symmetric attribute
◼ The remaining attributes are asymmetric binary
◼ Let the values Y and P be 1, and the value N be 0

      d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 0.33
      d(Jack, Jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
      d(Jim, Mary)  = (1 + 2) / (1 + 1 + 2) = 0.75

(These values are reproduced in the sketch below.)
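The dissimilarities above can be reproduced with a short Python sketch (not from the slides); Y/P are coded as 1, N as 0, and the symmetric Gender attribute is excluded, as on the slide.

```python
# Asymmetric binary dissimilarity: d(i, j) = (r + s) / (q + r + s); 0/0 matches ignored
def asym_binary_d(x, y):
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)   # 1/1 matches
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    return (r + s) / (q + r + s)

#       Fever Cough Test-1 Test-2 Test-3 Test-4
jack = [1,    0,    1,     0,     0,     0]
mary = [1,    0,    1,     0,     1,     0]
jim  = [1,    1,    0,     0,     0,     0]

print(round(asym_binary_d(jack, mary), 2))   # 0.33
print(round(asym_binary_d(jack, jim), 2))    # 0.67
print(round(asym_binary_d(jim, mary), 2))    # 0.75
```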
Standardizing Numeric Data
◼ Z-score:

      z = (x − μ) / σ

  ◼ x: raw score to be standardized, μ: mean of the population, σ: standard deviation of the population
  ◼ The distance between the raw score and the population mean, in units of the standard deviation
  ◼ Negative when the raw score is below the mean, positive when above
◼ An alternative way: calculate the mean absolute deviation

      s_f = (1/n) (|x_1f − m_f| + |x_2f − m_f| + ... + |x_nf − m_f|),
      where m_f = (1/n) (x_1f + x_2f + ... + x_nf)

  ◼ Standardized measure (z-score):

      z_if = (x_if − m_f) / s_f

◼ Using mean absolute deviation is more robust than using standard deviation
(Both standardizations are shown in the sketch below.)
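Here is a minimal Python sketch of both standardizations, on assumed illustrative values (note how the outlier inflates the standard deviation more than the mean absolute deviation).

```python
# Z-score vs. mean-absolute-deviation standardization
from statistics import mean, pstdev

x = [20.0, 30.0, 40.0, 50.0, 200.0]          # illustrative values with an outlier

mu, sigma = mean(x), pstdev(x)
z_classic = [(v - mu) / sigma for v in x]     # z = (x - mu) / sigma

m_f = mean(x)                                 # m_f = (1/n) * sum(x_if)
s_f = mean(abs(v - m_f) for v in x)           # mean absolute deviation
z_robust = [(v - m_f) / s_f for v in x]       # z_if = (x_if - m_f) / s_f

print([round(z, 2) for z in z_classic])
print([round(z, 2) for z in z_robust])
```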
Example: Data Matrix and Dissimilarity Matrix

Data Matrix

      point  attribute1  attribute2
      x1     1           2
      x2     3           5
      x3     2           0
      x4     4           5

Dissimilarity Matrix (with Euclidean distance)

             x1     x2     x3     x4
      x1     0
      x2     3.61   0
      x3     2.24   5.1    0
      x4     4.24   1      5.39   0

(Note that d(x3, x1) = sqrt((2−1)² + (0−2)²) = 2.24; the matrix is reproduced in the sketch below.)
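The matrix above can be checked with a few lines of Python (a sketch, not part of the slides), using the standard-library Euclidean distance.

```python
# Lower-triangular Euclidean dissimilarity matrix for x1..x4
from math import dist                     # Euclidean distance (Python 3.8+)

points = [("x1", (1, 2)), ("x2", (3, 5)), ("x3", (2, 0)), ("x4", (4, 5))]

for i, (name, p) in enumerate(points):
    row = [round(dist(p, q), 2) for _, q in points[: i + 1]]
    print(name, row)
# x1 [0.0]
# x2 [3.61, 0.0]
# x3 [2.24, 5.1, 0.0]
# x4 [4.24, 1.0, 5.39, 0.0]
```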
Distance on Numeric Data: Minkowski Distance
◼ Minkowski distance: a popular distance measure

      d(i, j) = (|x_i1 − x_j1|^h + |x_i2 − x_j2|^h + ... + |x_ip − x_jp|^h)^(1/h)

  where i = (x_i1, x_i2, …, x_ip) and j = (x_j1, x_j2, …, x_jp) are two p-dimensional data objects, and h is the order (the distance so defined is also called the L-h norm)
◼ Properties
  ◼ d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (positive definiteness)
  ◼ d(i, j) = d(j, i) (symmetry)
  ◼ d(i, j) ≤ d(i, k) + d(k, j) (triangle inequality)
Special Cases of Minkowski Distance
◼ h = 1: Manhattan (city block, L1 norm) distance
  ◼ E.g., the Hamming distance: the number of bits that are different between two binary vectors

      d(i, j) = |x_i1 − x_j1| + |x_i2 − x_j2| + ... + |x_ip − x_jp|

◼ h = 2: Euclidean (L2 norm) distance

      d(i, j) = sqrt(|x_i1 − x_j1|² + |x_i2 − x_j2|² + ... + |x_ip − x_jp|²)

◼ h → ∞: "supremum" (Lmax norm, L∞ norm) distance
  ◼ This is the maximum difference between any component (attribute) of the vectors:

      d(i, j) = max_f |x_if − x_jf|
Example: Minkowski Distance

Dissimilarity Matrices

      point  attribute1  attribute2
      x1     1           2
      x2     3           5
      x3     2           0
      x4     4           5

Manhattan (L1)

      L1   x1   x2   x3   x4
      x1   0
      x2   5    0
      x3   3    6    0
      x4   6    1    7    0

Euclidean (L2)

      L2   x1     x2     x3     x4
      x1   0
      x2   3.61   0
      x3   2.24   5.1    0
      x4   4.24   1      5.39   0

Supremum (L∞)

      L∞   x1   x2   x3   x4
      x1   0
      x2   3    0
      x3   2    5    0
      x4   3    1    5    0

(All three matrices are reproduced in the sketch below.)
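All three matrices follow from one generic Minkowski function; the sketch below (illustrative, not from the slides) treats h → ∞ as the maximum component difference.

```python
# Minkowski distance for h = 1 (Manhattan), h = 2 (Euclidean), h = inf (supremum)
from math import inf

def minkowski(x, y, h):
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if h == inf:
        return max(diffs)                 # L-infinity norm
    return sum(d ** h for d in diffs) ** (1.0 / h)

points = [("x1", (1, 2)), ("x2", (3, 5)), ("x3", (2, 0)), ("x4", (4, 5))]

for h in (1, 2, inf):
    print("h =", h)
    for i, (name, p) in enumerate(points):
        print(" ", name, [round(minkowski(p, q, h), 2) for _, q in points[: i + 1]])
```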
Ordinal Variables
◼ An ordinal variable can be discrete or continuous
◼ Order is important, e.g., rank
◼ Can be treated like interval-scaled
  ◼ Replace x_if by its rank r_if ∈ {1, ..., M_f}
  ◼ Map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by

      z_if = (r_if − 1) / (M_f − 1)

  ◼ Compute the dissimilarity using methods for interval-scaled variables
(See the sketch below.)
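A minimal Python sketch of this rank-and-rescale mapping (illustrative function and data, not from the slides):

```python
# Map ordinal values to [0, 1]: r_if in {1, ..., M_f}, z_if = (r_if - 1) / (M_f - 1)
def ordinal_to_unit_interval(values, ordered_states):
    rank = {state: r for r, state in enumerate(ordered_states, start=1)}
    m_f = len(ordered_states)
    return [(rank[v] - 1) / (m_f - 1) for v in values]

sizes = ["small", "large", "medium", "small", "large"]
print(ordinal_to_unit_interval(sizes, ["small", "medium", "large"]))
# [0.0, 1.0, 0.5, 0.0, 1.0] -> now usable with interval-scaled (e.g., Euclidean) distance
```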
Attributes of Mixed Type
◼ A database may contain all attribute types
  ◼ Nominal, symmetric binary, asymmetric binary, numeric, ordinal
◼ One may use a weighted formula to combine their effects:

      d(i, j) = Σ_{f=1..p} δ_ij^(f) d_ij^(f)  /  Σ_{f=1..p} δ_ij^(f)

  ◼ f is binary or nominal: d_ij^(f) = 0 if x_if = x_jf, otherwise d_ij^(f) = 1
  ◼ f is numeric: use the normalized distance
  ◼ f is ordinal
    ◼ Compute ranks r_if and z_if = (r_if − 1) / (M_f − 1)
    ◼ Treat z_if as interval-scaled
(A combined sketch follows below.)
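A minimal Python sketch of the weighted mixed-type formula above; the attribute kinds, ranges, and example objects are assumptions chosen for illustration.

```python
# Mixed-type dissimilarity: d(i, j) = sum_f delta_f * d_f / sum_f delta_f
def mixed_dissimilarity(obj_i, obj_j, kinds, value_range=None, m_f=None):
    num = den = 0.0
    for f, kind in enumerate(kinds):
        a, b = obj_i[f], obj_j[f]
        if a is None or b is None:                      # missing value: delta = 0
            continue
        if kind == "asymmetric" and a == 0 and b == 0:  # 0/0 match: delta = 0
            continue
        if kind in ("nominal", "symmetric", "asymmetric"):
            d_f = 0.0 if a == b else 1.0
        elif kind == "numeric":                         # normalized |a - b| / range_f
            d_f = abs(a - b) / value_range[f]
        elif kind == "ordinal":                         # ranks mapped onto [0, 1]
            d_f = abs((a - 1) / (m_f[f] - 1) - (b - 1) / (m_f[f] - 1))
        num += d_f
        den += 1.0
    return num / den

kinds = ["nominal", "asymmetric", "numeric", "ordinal"]
value_range = {2: 100.0}                  # attribute 2 spans 100 units
m_f = {3: 3}                              # attribute 3 has M_f = 3 ranked states
i = ["red",  1, 45.0, 3]                  # illustrative objects (ordinal given as rank)
j = ["blue", 0, 30.0, 1]
print(round(mixed_dissimilarity(i, j, kinds, value_range, m_f), 2))  # (1+1+0.15+1)/4
```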
Cosine Similarity
◼ A document can be represented by thousands of
attributes, each recording the frequency of a particular
word (such as keywords) or phrase in the document.
◼ Other vector objects: gene features in micro-arrays, …
◼ Applications: information retrieval, biologic taxonomy,
gene feature mapping, ...
◼ Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency vectors), then

      cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||),

  where • indicates the vector dot product and ||d|| is the length of vector d
Example: Cosine Similarity
◼ cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||),
  where • indicates the vector dot product and ||d|| is the length of vector d
◼ Ex: Find the similarity between documents 1 and 2.

      d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
      d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
      d1 • d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
      ||d1|| = (5*5 + 0*0 + 3*3 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = 42^0.5 ≈ 6.481
      ||d2|| = (3*3 + 0*0 + 2*2 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 1*1)^0.5 = 17^0.5 ≈ 4.12
      cos(d1, d2) = 25 / (6.481 × 4.12) ≈ 0.94

(The computation is repeated in the sketch below.)
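The example computation can be reproduced with a few lines of Python (a sketch, not part of the slides):

```python
# Cosine similarity between the two term-frequency vectors above
from math import sqrt

d1 = [5, 0, 3, 0, 2, 0, 0, 2, 0, 0]
d2 = [3, 0, 2, 0, 1, 1, 0, 1, 0, 1]

dot = sum(a * b for a, b in zip(d1, d2))       # 25
len1 = sqrt(sum(a * a for a in d1))            # sqrt(42) ~ 6.48
len2 = sqrt(sum(b * b for b in d2))            # sqrt(17) ~ 4.12
print(round(dot / (len1 * len2), 2))           # cos(d1, d2) ~ 0.94
```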
Chapter 2: Getting to Know Your Data
◼ Data Objects and Attribute Types
◼ Basic Statistical Descriptions of Data
◼ Data Visualization
◼ Measuring Data Similarity and
Dissimilarity
◼ Summary
Summary
◼ Data attribute types: nominal, binary, ordinal, interval-scaled, ratio-scaled
◼ Many types of data sets, e.g., numerical, text, graph, Web, image
◼ Gain insight into the data by:
  ◼ Basic statistical data description: central tendency, dispersion, graphical displays
  ◼ Data visualization: map data onto graphical primitives
  ◼ Measuring data similarity
◼ The above steps are the beginning of data preprocessing
◼ Many methods have been developed, but this is still an active area of research