MEASURES OF SIMILARITY AND DISSIMILARITY
A similarity measure between two objects is a numerical measure of the degree to which the two objects are alike.
A dissimilarity measure between two objects is a numerical measure of the degree to which the two objects are different.
TYPES OF ATTRIBUTES
There are several types of attributes:
Binary: Examples: true/false
Nominal: Examples: ID numbers, eye color, zip codes
Ordinal: Examples: rankings (e.g., taste of potato chips on a scale from 1 to 10), grades, height in {tall, medium, short}
Interval: Examples: calendar dates, temperatures in Celsius or Fahrenheit.
Ratio: Examples: temperature in Kelvin, length, time, counts
PROXIMITY MEASURE FOR BINARY ATTRIBUTES
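A contingency table for binary data (rows: object i, columns: object j):

                 Object j
                 1        0        sum
Object i   1     q        r        q + r
           0     s        t        s + t
           sum   q + s    r + t    p

Here q is the number of attributes that equal 1 for both objects, r the number that equal 1 for object i and 0 for object j, s the number that equal 0 for object i and 1 for object j, t the number that equal 0 for both, and p = q + r + s + t.
Distance measure for symmetric binary variables:
$$d(i, j) = \frac{r + s}{q + r + s + t}$$
Distance measure for asymmetric binary variables:
$$d(i, j) = \frac{r + s}{q + r + s}$$
Jaccard coefficient (similarity measure for asymmetric binary variables):
$$sim_{Jaccard}(i, j) = \frac{q}{q + r + s}$$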
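The measures above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names are hypothetical, and the counts follow the usual convention that q counts attributes equal to 1 in both objects, r those equal to 1 only in object i, s those equal to 1 only in object j, and t those equal to 0 in both.

```python
def binary_counts(a, b):
    """Return the contingency counts (q, r, s, t) for two equal-length 0/1 sequences."""
    if len(a) != len(b):
        raise ValueError("objects must have the same number of attributes")
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    t = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    return q, r, s, t


def symmetric_distance(a, b):
    # Both states carry equal weight, so 0/0 matches count in the denominator.
    q, r, s, t = binary_counts(a, b)
    return (r + s) / (q + r + s + t)


def asymmetric_distance(a, b):
    # 0/0 matches are ignored; raises ZeroDivisionError if neither object has a 1.
    q, r, s, _ = binary_counts(a, b)
    return (r + s) / (q + r + s)


def jaccard_similarity(a, b):
    q, r, s, _ = binary_counts(a, b)
    return q / (q + r + s)


i = [1, 0, 1, 0, 0, 0]
j = [1, 0, 1, 0, 1, 0]
print(binary_counts(i, j))        # (2, 0, 1, 3)
print(symmetric_distance(i, j))
print(asymmetric_distance(i, j))
print(jaccard_similarity(i, j))
```

Note that the Jaccard similarity and the asymmetric distance always sum to 1, since they share the denominator q + r + s.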
DISSIMILARITY BETWEEN BINARY VARIABLES
Example
Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack   M        Y       N       P        N        N        N
Mary   F        Y       N       P        N        P        N
Jim    M        Y       P       N        N        N        N
Gender is a symmetric attribute
The remaining attributes are asymmetric binary
Let the values Y and P be 1, and the value N be 0
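$$d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33$$
$$d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67$$
$$d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75$$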
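The three values in this example can be reproduced with a short script. It is a sketch under the example's own assumptions: Y and P are encoded as 1, N as 0, and only the six asymmetric attributes (Fever, Cough, Test-1 to Test-4) enter the distance. The names `records`, `encoded`, and `asymmetric_distance` are illustrative.

```python
# Fever, Cough, Test-1 .. Test-4 for each patient; Gender is symmetric and left out.
records = {
    "Jack": "YNPNNN",
    "Mary": "YNPNPN",
    "Jim": "YPNNNN",
}

# Encode Y and P as 1, N as 0.
encoded = {
    name: [1 if c in "YP" else 0 for c in values]
    for name, values in records.items()
}


def asymmetric_distance(a, b):
    both_one = sum(x & y for x, y in zip(a, b))    # attributes that are 1 in both objects
    mismatches = sum(x ^ y for x, y in zip(a, b))  # attributes where the objects differ
    return mismatches / (both_one + mismatches)    # 0/0 matches are ignored


for a, b in [("Jack", "Mary"), ("Jack", "Jim"), ("Jim", "Mary")]:
    print(f"d({a}, {b}) = {asymmetric_distance(encoded[a], encoded[b]):.2f}")
# d(Jack, Mary) = 0.33
# d(Jack, Jim) = 0.67
# d(Jim, Mary) = 0.75
```

Under this measure Jack and Mary are the most similar pair, and Jim and Mary the least similar.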
EXAMPLE: DATA MATRIX AND DISSIMILARITY MATRIX
Data Matrix
point   attribute1   attribute2
x1      1            2
x2      3            5
x3      2            0
x4      4            5
[Figure: the four points x1 to x4 plotted in the attribute1-attribute2 plane]
Dissimilarity Matrix (with Euclidean Distance)
      x1     x2     x3     x4
x1    0
x2    3.61   0
x3    2.24   5.1    0
x4    4.24   1      5.39   0
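As a check on a few entries of the matrix, the Euclidean distance can be worked out by hand from the data matrix. The pairs below are chosen only for illustration.

```latex
\begin{aligned}
d(x_2, x_1) &= \sqrt{(3 - 1)^2 + (5 - 2)^2} = \sqrt{13} \approx 3.61 \\
d(x_4, x_2) &= \sqrt{(4 - 3)^2 + (5 - 5)^2} = \sqrt{1} = 1 \\
d(x_4, x_3) &= \sqrt{(4 - 2)^2 + (5 - 0)^2} = \sqrt{29} \approx 5.39
\end{aligned}
```

Only the lower triangle is listed because the distance is symmetric, and the diagonal is 0 because every point has distance 0 to itself.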
DISTANCE ON NUMERIC DATA: MINKOWSKI DISTANCE
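Minkowski distance: a popular distance measure
$$d(i, j) = \left( |x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h \right)^{1/h}$$
where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two p-dimensional data objects, and h is the order (the distance so defined is also called the $L_h$ norm)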
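A minimal Python sketch of the Minkowski distance of order h follows: sum the h-th powers of the absolute coordinate differences, then take the h-th root. The function name `minkowski` and the restriction to h >= 1 are assumptions of this sketch (for h < 1 the triangle inequality no longer holds).

```python
def minkowski(x, y, h):
    """Minkowski distance of order h between two equal-length numeric sequences."""
    if len(x) != len(y):
        raise ValueError("objects must have the same dimension")
    if h < 1:
        # Assumption of this sketch: only orders h >= 1 are accepted.
        raise ValueError("order h must be at least 1")
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1.0 / h)


i = (1, 2)
j = (3, 5)
print(minkowski(i, j, 1))  # 5.0
print(minkowski(i, j, 2))  # square root of 13, about 3.61
```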
Properties
d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
d(i, j) = d(j, i) (Symmetry)
d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)
A distance that satisfies these properties is a metric
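The three properties can be checked by brute force on a small set of points. The sketch below is only a finite check on the given points, not a proof, and the helper name `satisfies_metric_properties` is hypothetical. It also shows that the squared Euclidean distance, used here purely as a counterexample, violates the triangle inequality and so is not a metric.

```python
import itertools
import math


def satisfies_metric_properties(pts, d, tol=1e-12):
    """Check the three metric properties on a finite list of distinct points."""
    for i in pts:
        if d(i, i) != 0:
            return False
    for i, j in itertools.permutations(pts, 2):
        # Positive for distinct points, and symmetric.
        if d(i, j) <= 0 or abs(d(i, j) - d(j, i)) > tol:
            return False
    for i, j, k in itertools.permutations(pts, 3):
        # Triangle inequality, with a small tolerance for floating-point error.
        if d(i, j) > d(i, k) + d(k, j) + tol:
            return False
    return True


def squared_euclidean(a, b):
    return math.dist(a, b) ** 2


points = [(1, 2), (3, 5), (2, 0), (4, 5)]
collinear = [(0, 0), (1, 0), (2, 0)]
print(satisfies_metric_properties(points, math.dist))             # True
print(satisfies_metric_properties(collinear, squared_euclidean))  # False
```

For the collinear points the squared distance from (0, 0) to (2, 0) is 4, which exceeds the sum 1 + 1 obtained by going through (1, 0).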
SPECIAL CASES OF MINKOWSKI DISTANCE
h = 1: Manhattan (city block, $L_1$ norm) distance
E.g., the Hamming distance: the number of bits that are different between two binary vectors
$$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$$
h = 2: Euclidean ($L_2$ norm) distance
$$d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$$
h → ∞: "supremum" ($L_{max}$ norm, $L_\infty$ norm) distance
This is the maximum difference between any component (attribute) of the vectors
$$d(i, j) = \lim_{h \to \infty} \left( \sum_{f=1}^{p} |x_{if} - x_{jf}|^h \right)^{1/h} = \max_{f} |x_{if} - x_{jf}|$$
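Two of the claims above are easy to check numerically: on binary vectors the Manhattan distance equals the Hamming distance, and the Minkowski distance approaches the supremum distance as h grows. The following is a small sketch with illustrative function names and made-up example vectors.

```python
def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def supremum(x, y):
    return max(abs(a - b) for a, b in zip(x, y))


def hamming(u, v):
    # Number of positions at which the two vectors differ.
    return sum(a != b for a, b in zip(u, v))


def minkowski(x, y, h):
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1.0 / h)


# On 0/1 vectors each differing bit contributes exactly 1 to the Manhattan sum.
u = [1, 0, 1, 1, 0]
v = [1, 1, 0, 1, 0]
print(hamming(u, v), manhattan(u, v))  # 2 2

# As h grows, the largest coordinate difference dominates the sum.
x, y = (1, 2), (3, 5)
for h in (1, 2, 10, 100):
    print(h, minkowski(x, y, h))
print("supremum:", supremum(x, y))  # supremum: 3
```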
EXAMPLE: MINKOWSKI DISTANCE
Data Matrix
point   attribute 1   attribute 2
x1      1             2
x2      3             5
x3      2             0
x4      4             5
[Figure: the four points x1 to x4 plotted in the attribute 1-attribute 2 plane]
Dissimilarity Matrices
Manhattan (L1)
L1    x1     x2     x3     x4
x1    0
x2    5      0
x3    3      6      0
x4    6      1      7      0
Euclidean (L2)
L2    x1     x2     x3     x4
x1    0
x2    3.61   0
x3    2.24   5.1    0
x4    4.24   1      5.39   0
Supremum (L∞)
L∞    x1     x2     x3     x4
x1    0
x2    3      0
x3    2      5      0
x4    3      1      5      0
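The three matrices can be regenerated from the four points with a short script. This is a sketch: the helper `dissimilarity_matrix` is a hypothetical name, and values are rounded to two decimals as in the tables.

```python
import math

points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}

metrics = {
    "Manhattan (L1)": lambda x, y: sum(abs(a - b) for a, b in zip(x, y)),
    "Euclidean (L2)": lambda x, y: math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y))),
    "Supremum": lambda x, y: max(abs(a - b) for a, b in zip(x, y)),
}


def dissimilarity_matrix(points, dist):
    """Lower triangle (diagonal included) as a dict keyed by (row, column)."""
    names = list(points)
    return {
        (a, b): round(dist(points[a], points[b]), 2)
        for idx, a in enumerate(names)
        for b in names[: idx + 1]
    }


matrices = {label: dissimilarity_matrix(points, dist) for label, dist in metrics.items()}

for label, matrix in matrices.items():
    print(label)
    names = list(points)
    for idx, a in enumerate(names):
        row = [str(matrix[(a, b)]) for b in names[: idx + 1]]
        print(" ", a, " ".join(row))
```

Only the lower triangle is stored because all three distances are symmetric.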
ORDINAL VARIABLES
An ordinal variable can be discrete or continuous
Order is important, e.g., rank
Can be treated like interval-scaled variables:
replace each $x_{if}$ by its rank $r_{if} \in \{1, \ldots, M_f\}$
map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$
compute the dissimilarity using methods for interval-scaled variables
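The ordinal procedure can be sketched as follows: replace each value by its rank r in {1, ..., M}, rescale with z = (r - 1) / (M - 1), and then apply an interval-scaled distance to the z values. The function name, the height data, and the choice of absolute difference as the distance are assumptions made for illustration.

```python
def ordinal_to_unit_interval(values, ordered_levels):
    """Map ordinal values onto [0, 1] using z = (rank - 1) / (M - 1)."""
    m = len(ordered_levels)
    if m < 2:
        raise ValueError("need at least two ordered levels")
    rank = {level: r for r, level in enumerate(ordered_levels, start=1)}
    return [(rank[v] - 1) / (m - 1) for v in values]


# Hypothetical ordinal attribute with M = 3 ordered states.
levels = ["short", "medium", "tall"]
heights = ["short", "tall", "medium", "tall"]

z = ordinal_to_unit_interval(heights, levels)
print(z)  # [0.0, 1.0, 0.5, 1.0]

# With one attribute, an interval-scaled distance reduces to the absolute difference.
print(abs(z[0] - z[2]))  # 0.5
```

Rescaling to [0, 1] keeps an attribute with many levels from outweighing one with few levels when several ordinal attributes are combined.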