Getting to Know Your Data
◼ Data Objects and Attribute Types
◼ Basic Statistical Descriptions of Data
◼ Data Visualization
◼ Measuring Data Similarity and
Dissimilarity
◼ Summary
Types of Data Sets
◼ Record
  ◼ Relational records
  ◼ Data matrix, e.g., numerical matrix, crosstabs
  ◼ Document data: text documents represented as term-frequency vectors, e.g.:

        term:        team  coach  play  ball  score  game  win  lost  timeout  season
        Document 1     3     0      5     0     2      6     0    2      0        2
        Document 2     0     7      0     2     1      0     0    3      0        0
        Document 3     0     1      0     0     1      2     2    0      3        0

  ◼ Transaction data, e.g.:

        TID   Items
        1     Bread, Coke, Milk
        2     Beer, Bread
        3     Beer, Coke, Diaper, Milk
        4     Beer, Bread, Diaper, Milk
        5     Coke, Diaper, Milk

◼ Graph and network
  ◼ World Wide Web
  ◼ Social or information networks
  ◼ Molecular structures
◼ Ordered
  ◼ Video data: sequence of images
  ◼ Temporal data: time-series
  ◼ Sequential data: transaction sequences
  ◼ Genetic sequence data
◼ Spatial, image and multimedia
  ◼ Spatial data: maps
  ◼ Image data
  ◼ Video data
Important Characteristics of Structured Data
◼ Dimensionality
  ◼ Curse of dimensionality
◼ Sparsity
  ◼ Only presence counts
◼ Resolution
  ◼ Patterns depend on the scale
◼ Distribution
  ◼ Centrality and dispersion
Data Objects
◼ Data sets are made up of data objects.
◼ A data object represents an entity.
◼ Examples:
◼ sales database: customers, store items,
sales
◼ medical database: patients, treatments
◼ university database: students, professors,
courses
◼ Also called samples, examples, instances, data points, objects, or tuples.
◼ Data objects are described by attributes.
Attributes
◼ Attribute (or dimension, feature, variable): a data field representing a characteristic or feature of a data object.
  ◼ E.g., customer_ID, name, address
◼ Types:
◼ Nominal
◼ Binary
◼ Numeric: quantitative
◼ Interval-scaled
◼ Ratio-scaled
Attribute Types
◼ Nominal: categories, states, or “names of things”
◼ Hair_color = {auburn, black, blond, brown, grey, red, white}
◼ marital status, occupation, ID numbers, zip codes
◼ Binary
◼ Nominal attribute with only 2 states (0 and 1)
◼ Symmetric binary: both outcomes equally important
◼ e.g., gender
◼ Asymmetric binary: outcomes not equally important.
◼ e.g., medical test (positive vs. negative)
◼ Convention: assign 1 to most important outcome
(e.g., HIV positive)
◼ Ordinal
  ◼ Values have a meaningful order (ranking), but the magnitude between successive values is not known
  ◼ Size = {small, medium, large}, grades, army rankings
Numeric Attribute Types
◼ Quantity (integer or real-valued)
◼ Interval
  ◼ Measured on a scale of equal-sized units
  ◼ Values have order
  ◼ E.g., temperature in C° or F°, calendar dates
  ◼ No true zero-point
◼ Ratio
  ◼ Inherent zero-point
  ◼ We can speak of values as being an order of magnitude larger than the unit of measurement (10 K is twice as high as 5 K)
Discrete vs. Continuous Attributes
◼ Discrete Attribute
  ◼ Has only a finite or countably infinite set of values
  ◼ E.g., zip codes, profession, or the set of words in a collection of documents
◼ Sometimes, represented as integer variables
◼ Note: Binary attributes are a special case of
discrete attributes
◼ Continuous Attribute
◼ Has real numbers as attribute values
◼ E.g., temperature, height, or weight
  ◼ Practically, real values can only be measured and represented using a finite number of digits
Getting to Know Your Data
◼ Data Objects and Attribute Types
◼ Basic Statistical Descriptions of Data
◼ Data Visualization
◼ Measuring Data Similarity and
Dissimilarity
◼ Summary
Basic Statistical Descriptions of Data
◼ Motivation
  ◼ To better understand the data: central tendency, variation, and spread
◼ Data dispersion characteristics
  ◼ Median, max, min, quantiles, outliers, variance, etc.
◼ Numerical dimensions correspond to sorted intervals
  ◼ Data dispersion: analyzed with multiple granularities of precision
  ◼ Boxplot or quantile analysis on sorted intervals
Measuring the Central Tendency
◼ Mean (algebraic measure) (sample vs. population), where n is the sample size and N is the population size:

      x̄ = (1/n) · Σ_{i=1..n} x_i          μ = (Σ x) / N

◼ Weighted arithmetic mean:

      x̄ = Σ_{i=1..n} w_i x_i / Σ_{i=1..n} w_i

◼ Trimmed mean: chopping extreme values before averaging
◼ Median:
  ◼ Middle value if there is an odd number of values; average of the middle two values otherwise
  ◼ Estimated by interpolation (for grouped data)
◼ Mode
  ◼ Value that occurs most frequently in the data
  ◼ Unimodal, bimodal, trimodal
  ◼ Empirical formula: mean − mode ≈ 3 × (mean − median)
(These measures are illustrated in the Python sketch below.)
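As a concrete illustration, here is a minimal Python sketch of the central-tendency measures above, using only the standard library; the data values and the helper name trimmed_mean are illustrative assumptions, not from the slides.

```python
# Central tendency: mean, weighted mean, trimmed mean, median, mode
from statistics import mean, median, multimode

data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]   # illustrative values
weights = [1.0] * len(data)                                 # illustrative weights

print(mean(data))                        # arithmetic mean: (1/n) * sum(x_i) -> 58.0

# Weighted mean: sum(w_i * x_i) / sum(w_i)
print(sum(w * x for w, x in zip(weights, data)) / sum(weights))

def trimmed_mean(values, k=1):
    """Drop the k smallest and k largest values, then average the rest."""
    return mean(sorted(values)[k:len(values) - k])

print(trimmed_mean(data, k=1))           # 55.6
print(median(data))                      # average of the two middle values -> 54.0
print(multimode(data))                   # most frequent value(s) -> [52, 70] (bimodal)
```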
Symmetric vs. Skewed Data
◼ Median, mean, and mode of symmetric, positively skewed, and negatively skewed data
  (Figure: distribution curves for symmetric, positively skewed, and negatively skewed data.)
Measuring the Dispersion of Data
◼ Quartiles, outliers and boxplots
◼ Quartiles: Q1 (25th percentile), Q3 (75th percentile)
◼ Inter-quartile range: IQR = Q3 – Q1
◼ Five number summary: min, Q1, median, Q3, max
◼ Boxplot: ends of the box are the quartiles; median is
marked; add whiskers, and plot outliers individually
  ◼ Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
◼ Variance and standard deviation (sample: s, population: σ)
  ◼ Variance (algebraic, scalable computation): s² = (1/(n−1)) · Σ (x_i − x̄)²,   σ² = (1/N) · Σ (x_i − μ)²
  ◼ Standard deviation s (or σ) is the square root of the variance s² (or σ²)
(See the sketch after this list.)
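The following minimal Python sketch (assumed data, not from the slides) computes the five-number summary, IQR, the 1.5 × IQR outlier rule, and the sample and population variance/standard deviation described above.

```python
# Dispersion: quartiles, five-number summary, IQR-based outliers, variance, std dev
from statistics import quantiles, variance, pvariance, stdev, pstdev

data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]    # illustrative values

q1, med, q3 = quantiles(data, n=4)       # quartiles (default 'exclusive' method)
iqr = q3 - q1
print("five-number summary:", (min(data), q1, med, q3, max(data)), "IQR =", iqr)

# Outlier rule from the slide: more than 1.5 * IQR below Q1 or above Q3
print("outliers:", [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr])

print("sample:    ", variance(data), stdev(data))    # s^2 and s
print("population:", pvariance(data), pstdev(data))  # sigma^2 and sigma
```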
Chapter 2: Getting to Know Your Data
◼ Data Objects and Attribute Types
◼ Basic Statistical Descriptions of Data
◼ Measuring Data Similarity and
Dissimilarity
◼ Summary
Similarity and Dissimilarity
◼ Similarity
◼ Numerical measure of how alike two data
objects are
◼ Value is higher when objects are more alike
◼ Often falls in the range [0,1]
◼ Dissimilarity (e.g., distance)
◼ Numerical measure of how different two data
objects are
◼ Lower when objects are more alike
◼ Minimum dissimilarity is often 0
◼ Upper limit varies
◼ Proximity refers to a similarity or dissimilarity measure
Data Matrix and Dissimilarity Matrix
◼ Data matrix
  ◼ n data points with p dimensions
  ◼ Two modes (rows and columns represent different entities: objects and attributes)

        [ x_11  ...  x_1f  ...  x_1p ]
        [  ...  ...   ...  ...   ... ]
        [ x_i1  ...  x_if  ...  x_ip ]
        [  ...  ...   ...  ...   ... ]
        [ x_n1  ...  x_nf  ...  x_np ]

◼ Dissimilarity matrix
  ◼ n data points, but registers only the distance between each pair
  ◼ A triangular matrix

        [   0                              ]
        [ d(2,1)    0                      ]
        [ d(3,1)  d(3,2)    0              ]
        [   :       :       :              ]
        [ d(n,1)  d(n,2)   ...   ...   0   ]
Proximity Measure for Nominal Attributes
◼ Can take 2 or more states, e.g., red, yellow, blue, green (generalization of a binary attribute)
◼ Method 1: Simple matching
  ◼ m: # of matches, p: total # of variables

      d(i, j) = (p − m) / p

◼ Method 2: Use a large number of binary attributes
  ◼ Creating a new binary attribute for each of the M nominal states
(Simple matching is illustrated in the sketch below.)
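A minimal Python sketch of Method 1 (simple matching) follows; the two example objects are made up for illustration.

```python
# Simple matching dissimilarity for nominal attributes: d(i, j) = (p - m) / p
def nominal_dissimilarity(obj_i, obj_j):
    """m = number of matching attributes, p = total number of attributes."""
    p = len(obj_i)
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)
    return (p - m) / p

obj_i = ["black", "single", "10001"]     # hair color, marital status, zip code
obj_j = ["black", "married", "10001"]
print(nominal_dissimilarity(obj_i, obj_j))   # 1 mismatch out of 3 -> 0.333...
```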
Proximity Measure for Binary Attributes
◼ A contingency table for binary data (counts over the p binary attributes of objects i and j):

                        Object j
                        1      0
        Object i   1    q      r
                   0    s      t

◼ Distance measure for symmetric binary variables:

      d(i, j) = (r + s) / (q + r + s + t)

◼ Distance measure for asymmetric binary variables:

      d(i, j) = (r + s) / (q + r + s)

◼ Jaccard coefficient (similarity measure for asymmetric binary variables):

      sim_Jaccard(i, j) = q / (q + r + s)

◼ Note: the Jaccard coefficient is the same as "coherence"
Dissimilarity between Binary Variables
◼ Example:

      Name   Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
      Jack   M       Y      N      P       N       N       N
      Mary   F       Y      N      P       N       P       N
      Jim    M       Y      P      N       N       N       N

◼ Gender is a symmetric attribute
◼ The remaining attributes are asymmetric binary
◼ Let the values Y and P be 1, and the value N be 0

      d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 0.33
      d(Jack, Jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
      d(Jim, Mary)  = (1 + 2) / (1 + 1 + 2) = 0.75

(These values are reproduced in the sketch below.)
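The dissimilarities above can be reproduced with a short Python sketch (not from the slides); Y/P are coded as 1, N as 0, and the symmetric Gender attribute is excluded, as on the slide.

```python
# Asymmetric binary dissimilarity: d(i, j) = (r + s) / (q + r + s); 0/0 matches ignored
def asym_binary_d(x, y):
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)   # 1/1 matches
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    return (r + s) / (q + r + s)

#       Fever Cough Test-1 Test-2 Test-3 Test-4
jack = [1,    0,    1,     0,     0,     0]
mary = [1,    0,    1,     0,     1,     0]
jim  = [1,    1,    0,     0,     0,     0]

print(round(asym_binary_d(jack, mary), 2))   # 0.33
print(round(asym_binary_d(jack, jim), 2))    # 0.67
print(round(asym_binary_d(jim, mary), 2))    # 0.75
```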
Standardizing Numeric Data
◼ Z-score:

      z = (x − μ) / σ

  ◼ x: raw score to be standardized, μ: mean of the population, σ: standard deviation of the population
  ◼ The distance between the raw score and the population mean, in units of the standard deviation
  ◼ Negative when the raw score is below the mean, positive when above
◼ An alternative way: calculate the mean absolute deviation

      s_f = (1/n) (|x_1f − m_f| + |x_2f − m_f| + ... + |x_nf − m_f|),
      where m_f = (1/n) (x_1f + x_2f + ... + x_nf)

  ◼ Standardized measure (z-score):

      z_if = (x_if − m_f) / s_f

◼ Using mean absolute deviation is more robust than using standard deviation
(Both standardizations are shown in the sketch below.)
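Here is a minimal Python sketch of both standardizations, on assumed illustrative values (note how the outlier inflates the standard deviation more than the mean absolute deviation).

```python
# Z-score vs. mean-absolute-deviation standardization
from statistics import mean, pstdev

x = [20.0, 30.0, 40.0, 50.0, 200.0]          # illustrative values with an outlier

mu, sigma = mean(x), pstdev(x)
z_classic = [(v - mu) / sigma for v in x]     # z = (x - mu) / sigma

m_f = mean(x)                                 # m_f = (1/n) * sum(x_if)
s_f = mean(abs(v - m_f) for v in x)           # mean absolute deviation
z_robust = [(v - m_f) / s_f for v in x]       # z_if = (x_if - m_f) / s_f

print([round(z, 2) for z in z_classic])
print([round(z, 2) for z in z_robust])
```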
Example: Data Matrix and Dissimilarity Matrix

Data Matrix

      point  attribute1  attribute2
      x1     1           2
      x2     3           5
      x3     2           0
      x4     4           5

Dissimilarity Matrix (with Euclidean distance)

             x1     x2     x3     x4
      x1     0
      x2     3.61   0
      x3     2.24   5.1    0
      x4     4.24   1      5.39   0

(Note that d(x3, x1) = sqrt((2−1)² + (0−2)²) = 2.24; the matrix is reproduced in the sketch below.)
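The matrix above can be checked with a few lines of Python (a sketch, not part of the slides), using the standard-library Euclidean distance.

```python
# Lower-triangular Euclidean dissimilarity matrix for x1..x4
from math import dist                     # Euclidean distance (Python 3.8+)

points = [("x1", (1, 2)), ("x2", (3, 5)), ("x3", (2, 0)), ("x4", (4, 5))]

for i, (name, p) in enumerate(points):
    row = [round(dist(p, q), 2) for _, q in points[: i + 1]]
    print(name, row)
# x1 [0.0]
# x2 [3.61, 0.0]
# x3 [2.24, 5.1, 0.0]
# x4 [4.24, 1.0, 5.39, 0.0]
```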
Distance on Numeric Data: Minkowski Distance
◼ Minkowski distance: a popular distance measure

      d(i, j) = (|x_i1 − x_j1|^h + |x_i2 − x_j2|^h + ... + |x_ip − x_jp|^h)^(1/h)

  where i = (x_i1, x_i2, …, x_ip) and j = (x_j1, x_j2, …, x_jp) are two p-dimensional data objects, and h is the order (the distance so defined is also called the L-h norm)
◼ Properties
  ◼ d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (positive definiteness)
  ◼ d(i, j) = d(j, i) (symmetry)
  ◼ d(i, j) ≤ d(i, k) + d(k, j) (triangle inequality)
Special Cases of Minkowski Distance
◼ h = 1: Manhattan (city block, L1 norm) distance
  ◼ E.g., the Hamming distance: the number of bits that are different between two binary vectors

      d(i, j) = |x_i1 − x_j1| + |x_i2 − x_j2| + ... + |x_ip − x_jp|

◼ h = 2: Euclidean (L2 norm) distance

      d(i, j) = sqrt(|x_i1 − x_j1|² + |x_i2 − x_j2|² + ... + |x_ip − x_jp|²)

◼ h → ∞: "supremum" (Lmax norm, L∞ norm) distance
  ◼ This is the maximum difference between any component (attribute) of the vectors:

      d(i, j) = max_f |x_if − x_jf|
Example: Minkowski Distance

Dissimilarity Matrices

      point  attribute1  attribute2
      x1     1           2
      x2     3           5
      x3     2           0
      x4     4           5

Manhattan (L1)

      L1   x1   x2   x3   x4
      x1   0
      x2   5    0
      x3   3    6    0
      x4   6    1    7    0

Euclidean (L2)

      L2   x1     x2     x3     x4
      x1   0
      x2   3.61   0
      x3   2.24   5.1    0
      x4   4.24   1      5.39   0

Supremum (L∞)

      L∞   x1   x2   x3   x4
      x1   0
      x2   3    0
      x3   2    5    0
      x4   3    1    5    0

(All three matrices are reproduced in the sketch below.)
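All three matrices follow from one generic Minkowski function; the sketch below (illustrative, not from the slides) treats h → ∞ as the maximum component difference.

```python
# Minkowski distance for h = 1 (Manhattan), h = 2 (Euclidean), h = inf (supremum)
from math import inf

def minkowski(x, y, h):
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if h == inf:
        return max(diffs)                 # L-infinity norm
    return sum(d ** h for d in diffs) ** (1.0 / h)

points = [("x1", (1, 2)), ("x2", (3, 5)), ("x3", (2, 0)), ("x4", (4, 5))]

for h in (1, 2, inf):
    print("h =", h)
    for i, (name, p) in enumerate(points):
        print(" ", name, [round(minkowski(p, q, h), 2) for _, q in points[: i + 1]])
```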
Ordinal Variables
◼ An ordinal variable can be discrete or continuous
◼ Order is important, e.g., rank
◼ Can be treated like interval-scaled
  ◼ Replace x_if by its rank r_if ∈ {1, ..., M_f}
  ◼ Map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by

      z_if = (r_if − 1) / (M_f − 1)

  ◼ Compute the dissimilarity using methods for interval-scaled variables
(See the sketch below.)
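A minimal Python sketch of this rank-and-rescale mapping (illustrative function and data, not from the slides):

```python
# Map ordinal values to [0, 1]: r_if in {1, ..., M_f}, z_if = (r_if - 1) / (M_f - 1)
def ordinal_to_unit_interval(values, ordered_states):
    rank = {state: r for r, state in enumerate(ordered_states, start=1)}
    m_f = len(ordered_states)
    return [(rank[v] - 1) / (m_f - 1) for v in values]

sizes = ["small", "large", "medium", "small", "large"]
print(ordinal_to_unit_interval(sizes, ["small", "medium", "large"]))
# [0.0, 1.0, 0.5, 0.0, 1.0] -> now usable with interval-scaled (e.g., Euclidean) distance
```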
Attributes of Mixed Type
◼ A database may contain all attribute types
  ◼ Nominal, symmetric binary, asymmetric binary, numeric, ordinal
◼ One may use a weighted formula to combine their effects:

      d(i, j) = Σ_{f=1..p} δ_ij^(f) d_ij^(f)  /  Σ_{f=1..p} δ_ij^(f)

  ◼ f is binary or nominal: d_ij^(f) = 0 if x_if = x_jf, otherwise d_ij^(f) = 1
  ◼ f is numeric: use the normalized distance
  ◼ f is ordinal
    ◼ Compute ranks r_if and z_if = (r_if − 1) / (M_f − 1)
    ◼ Treat z_if as interval-scaled
(A combined sketch follows below.)
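A minimal Python sketch of the weighted mixed-type formula above; the attribute kinds, ranges, and example objects are assumptions chosen for illustration.

```python
# Mixed-type dissimilarity: d(i, j) = sum_f delta_f * d_f / sum_f delta_f
def mixed_dissimilarity(obj_i, obj_j, kinds, value_range=None, m_f=None):
    num = den = 0.0
    for f, kind in enumerate(kinds):
        a, b = obj_i[f], obj_j[f]
        if a is None or b is None:                      # missing value: delta = 0
            continue
        if kind == "asymmetric" and a == 0 and b == 0:  # 0/0 match: delta = 0
            continue
        if kind in ("nominal", "symmetric", "asymmetric"):
            d_f = 0.0 if a == b else 1.0
        elif kind == "numeric":                         # normalized |a - b| / range_f
            d_f = abs(a - b) / value_range[f]
        elif kind == "ordinal":                         # ranks mapped onto [0, 1]
            d_f = abs((a - 1) / (m_f[f] - 1) - (b - 1) / (m_f[f] - 1))
        num += d_f
        den += 1.0
    return num / den

kinds = ["nominal", "asymmetric", "numeric", "ordinal"]
value_range = {2: 100.0}                  # attribute 2 spans 100 units
m_f = {3: 3}                              # attribute 3 has M_f = 3 ranked states
i = ["red",  1, 45.0, 3]                  # illustrative objects (ordinal given as rank)
j = ["blue", 0, 30.0, 1]
print(round(mixed_dissimilarity(i, j, kinds, value_range, m_f), 2))  # (1+1+0.15+1)/4
```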
Cosine Similarity
◼ A document can be represented by thousands of
attributes, each recording the frequency of a particular
word (such as keywords) or phrase in the document.
◼ Other vector objects: gene features in micro-arrays, …
◼ Applications: information retrieval, biologic taxonomy,
gene feature mapping, ...
◼ Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency vectors), then

      cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||),

  where • indicates the vector dot product and ||d|| is the length of vector d
Example: Cosine Similarity
◼ cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||),
  where • indicates the vector dot product and ||d|| is the length of vector d
◼ Ex: Find the similarity between documents 1 and 2.

      d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
      d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
      d1 • d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
      ||d1|| = (5*5 + 0*0 + 3*3 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = 42^0.5 ≈ 6.481
      ||d2|| = (3*3 + 0*0 + 2*2 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 1*1)^0.5 = 17^0.5 ≈ 4.12
      cos(d1, d2) = 25 / (6.481 × 4.12) ≈ 0.94

(The computation is repeated in the sketch below.)
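The example computation can be reproduced with a few lines of Python (a sketch, not part of the slides):

```python
# Cosine similarity between the two term-frequency vectors above
from math import sqrt

d1 = [5, 0, 3, 0, 2, 0, 0, 2, 0, 0]
d2 = [3, 0, 2, 0, 1, 1, 0, 1, 0, 1]

dot = sum(a * b for a, b in zip(d1, d2))       # 25
len1 = sqrt(sum(a * a for a in d1))            # sqrt(42) ~ 6.48
len2 = sqrt(sum(b * b for b in d2))            # sqrt(17) ~ 4.12
print(round(dot / (len1 * len2), 2))           # cos(d1, d2) ~ 0.94
```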
Chapter 2: Getting to Know Your Data
◼ Data Objects and Attribute Types
◼ Basic Statistical Descriptions of Data
◼ Data Visualization
◼ Measuring Data Similarity and
Dissimilarity
◼ Summary
Summary
◼ Data attribute types: nominal, binary, ordinal, interval-scaled, ratio-scaled
◼ Many types of data sets, e.g., numerical, text, graph, Web, image
◼ Gain insight into the data by:
  ◼ Basic statistical data description: central tendency, dispersion, graphical displays
  ◼ Data visualization: map data onto graphical primitives
  ◼ Measuring data similarity
◼ The above steps are the beginning of data preprocessing
◼ Many methods have been developed, but this is still an active area of research