University of Sadat City
Faculty of Computers and Artificial Intelligence (FCAI)
IS Department
DATA MINING (IS)
LECTURE 3
Chapter 3
Prepared By:
Dr. Heba Askr
First Term 2023-2024
Data Mining:
Concepts and Techniques
(3rd ed.)
— Chapter 3 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
2
Chapter 3: Data Preprocessing
◼ Data Preprocessing: An Overview
◼ Data Quality
◼ Major Tasks in Data Preprocessing
◼ Data Cleaning
◼ Data Integration
◼ Data Reduction
◼ Data Transformation and Data Discretization
◼ Summary
3
Data Quality: Why Preprocess the Data?
◼ Measures for data quality: A multidimensional view
◼ Accuracy: correct or wrong, accurate or not
◼ Completeness: not recorded, unavailable, …
◼ Consistency: some modified but some not, dangling, …
◼ Timeliness: timely update?
◼ Believability: how much are the data trusted to be correct?
◼ Interpretability: how easily the data can be
understood?
4
Major Tasks in Data Preprocessing
◼ Data cleaning
◼ Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
◼ Data integration
◼ Integration of multiple databases, data cubes, or files
◼ Data reduction
◼ Dimensionality reduction
◼ Numerosity reduction
◼ Data compression
◼ Data transformation and data discretization
◼ Normalization
◼ Concept hierarchy generation
5
Chapter 3: Data Preprocessing
◼ Data Preprocessing: An Overview
◼ Data Quality
◼ Major Tasks in Data Preprocessing
◼ Data Cleaning
◼ Data Integration
◼ Data Reduction
◼ Data Transformation and Data Discretization
◼ Summary
6
Data Cleaning
◼ Data in the Real World Is Dirty: Lots of potentially incorrect data,
e.g., faulty instruments, human or computer error, transmission errors
◼ incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
◼ e.g., Occupation=“ ” (missing data)
◼ noisy: containing noise, errors, or outliers
◼ e.g., Salary=“−10” (an error)
◼ inconsistent: containing discrepancies in codes or names, e.g.,
◼ Age=“42”, Birthday=“03/07/2010”
◼ Was rating “1, 2, 3”, now rating “A, B, C”
◼ discrepancy between duplicate records
◼ Intentional (e.g., disguised missing data)
◼ Jan. 1 as everyone’s birthday?
7
Incomplete (Missing) Data
◼ Data is not always available
◼ E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
◼ Missing data may be due to
◼ equipment malfunction
◼ inconsistent with other recorded data and thus deleted
◼ data not entered due to misunderstanding
◼ certain data may not be considered important at the
time of entry
◼ failure to register history or changes of the data
◼ Missing data may need to be inferred
8
How to Handle Missing Data?
◼ Ignore the tuple: usually done when class label is
missing (when doing classification)—not effective when the
% of missing values per attribute varies considerably
◼ Fill in the missing value manually: tedious +
infeasible?
◼ Fill it in automatically with
◼ a global constant : e.g., “unknown”, a new class?!
◼ the attribute mean
◼ the attribute mean for all samples belonging to the
same class: smarter
◼ the most probable value: inference-based methods such as a Bayesian formula or a decision tree
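The automatic fill-in strategies above can be illustrated with a small pandas sketch; the DataFrame, its columns, and the values are hypothetical and only meant to show the idea.

```python
import pandas as pd

# Hypothetical customer table with missing income values
df = pd.DataFrame({
    "income": [45000.0, None, 52000.0, None, 61000.0, 38000.0],
    "class":  ["A", "A", "B", "B", "A", "B"],
})

# Global constant (for nominal attributes one could use a label such as "unknown")
df["income_const"] = df["income"].fillna(-1)

# Attribute mean over all samples
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Attribute mean within the same class -- usually the smarter choice
df["income_class_mean"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean")
)

print(df)
```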
9
Noisy Data
◼ Noise: random error or variance in a measured variable
◼ Incorrect attribute values may be due to
◼ faulty data collection instruments
◼ data entry problems
◼ data transmission problems
◼ technology limitation
◼ inconsistency in naming convention
◼ Other data problems which require data cleaning
◼ duplicate records
◼ incomplete data
◼ inconsistent data
10
How to Handle Noisy Data?
◼ Binning
◼ first sort data and partition into (equal-frequency) bins
◼ then one can smooth by bin means, smooth by bin
median, smooth by bin boundaries, etc.
◼ Regression
◼ smooth by fitting the data into regression functions
◼ Clustering
◼ detect and remove outliers
◼ Combined computer and human inspection
◼ detect suspicious values and have them checked by humans
(e.g., deal with possible outliers)
11
Data Cleaning as a Process
◼ Data discrepancy detection
◼ Use metadata (e.g., domain, range, dependency, distribution)
◼ Check field overloading
◼ Check uniqueness rule, consecutive rule and null rule
◼ Use commercial tools
◼ Data scrubbing: use simple domain knowledge (e.g., postal
code, spell-check) to detect errors and make corrections
◼ Data auditing: analyze the data to discover rules and
relationships and to detect violators (e.g., use correlation and clustering
to find outliers)
◼ Data migration and integration
◼ Data migration tools: allow transformations to be specified
◼ ETL (Extraction/Transformation/Loading) tools: allow users to
specify transformations through a graphical user interface
◼ Integration of the two processes
◼ Iterative and interactive (e.g., Potter's Wheel)
12
Chapter 3: Data Preprocessing
◼ Data Preprocessing: An Overview
◼ Data Quality
◼ Major Tasks in Data Preprocessing
◼ Data Cleaning
◼ Data Integration
◼ Data Reduction
◼ Data Transformation and Data Discretization
◼ Summary
13
Data Integration
◼ Data integration:
◼ Combines data from multiple sources into a coherent store
◼ Schema integration: e.g., A.cust-id ≡ B.cust-#
◼ Integrate metadata from different sources
◼ Entity identification problem:
◼ Identify real world entities from multiple data sources, e.g., Bill
Clinton = William Clinton
◼ Detecting and resolving data value conflicts
◼ For the same real world entity, attribute values from different
sources are different
◼ Possible reasons: different representations, different scales, e.g.,
metric vs. British units
14
Handling Redundancy in Data Integration
◼ Redundant data often occur when integrating
multiple databases
◼ Object identification: The same attribute or object
may have different names in different databases
◼ Derivable data: One attribute may be a “derived”
attribute in another table, e.g., annual revenue
◼ Redundant attributes may be able to be detected by
correlation analysis and covariance analysis
◼ Careful integration of the data from multiple sources may
help reduce/avoid redundancies and inconsistencies and
improve mining speed and quality
15
Correlation Analysis (Nominal Data)
◼ χ² (chi-square) test:

   χ² = Σ (Observed − Expected)² / Expected
◼ The larger the Χ2 value, the more likely the variables are
related
◼ The cells that contribute the most to the Χ2 value are
those whose actual count is very different from the
expected count
◼ Correlation does not imply causality
◼ # of hospitals and # of car thefts in a city are correlated
◼ Both are causally linked to the third variable: population
16
Chi-Square Calculation: An Example
                           Play chess   Not play chess   Sum (row)
Like science fiction        250 (90)      200 (360)         450
Not like science fiction     50 (210)    1000 (840)        1050
Sum (col.)                  300          1200              1500
◼ χ² (chi-square) calculation (numbers in parentheses are
expected counts calculated based on the data distribution
in the two categories):

   χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840 = 507.93
◼ It shows that like_science_fiction and play_chess are
correlated in the group
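The χ² value above can be reproduced with a short Python sketch (plain lists, no external libraries); this is only an illustration of the calculation, not code from the textbook.

```python
# Observed counts from the 2x2 contingency table above
observed = [[250, 200],    # like science fiction:     plays chess / does not
            [50, 1000]]    # not like science fiction: plays chess / does not

row_totals = [sum(row) for row in observed]          # [450, 1050]
col_totals = [sum(col) for col in zip(*observed)]    # [300, 1200]
grand_total = sum(row_totals)                        # 1500

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi2 += (obs - expected) ** 2 / expected

print(round(chi2, 2))   # 507.94 (the slide's 507.93 truncates the last digit)
```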
17
Correlation Analysis (Numeric Data)
◼ Correlation coefficient (also called Pearson’s product
moment coefficient)
   r_A,B = Σⁿᵢ₌₁ (aᵢ − Ā)(bᵢ − B̄) / ((n − 1) σ_A σ_B) = (Σⁿᵢ₌₁ aᵢbᵢ − n·Ā·B̄) / ((n − 1) σ_A σ_B)

where n is the number of tuples, Ā and B̄ are the respective
means of A and B, σ_A and σ_B are the respective standard deviations
of A and B, and Σ aᵢbᵢ is the sum of the AB cross-product.
◼ If r_A,B > 0, A and B are positively correlated (A's values
increase as B's do); the higher the value, the stronger the correlation.
◼ r_A,B = 0: independent; r_A,B < 0: negatively correlated
18
Visually Evaluating Correlation
Scatter plots showing the similarity from –1 to 1.
19
Correlation (viewed as linear relationship)
◼ Correlation measures the linear relationship
between objects
◼ To compute correlation, we standardize data
objects, A and B, and then take their dot product
   a′_k = (a_k − mean(A)) / std(A)
   b′_k = (b_k − mean(B)) / std(B)
   correlation(A, B) = A′ · B′
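A minimal NumPy sketch of this standardize-then-dot-product view, using made-up values for A and B; note that with the sample standard deviation the dot product still has to be divided by (n − 1) to give Pearson's r.

```python
import numpy as np

# Made-up data objects A and B
A = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
B = np.array([5.0, 8.0, 10.0, 11.0, 14.0])

# Standardize each object (ddof=1 gives the sample standard deviation used in the (n - 1) formula)
A_std = (A - A.mean()) / A.std(ddof=1)
B_std = (B - B.mean()) / B.std(ddof=1)

# The dot product of the standardized objects, scaled by 1/(n - 1), is Pearson's r
n = len(A)
r = A_std.dot(B_std) / (n - 1)
print(r)                              # matches np.corrcoef(A, B)[0, 1]
```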
20
Covariance (Numeric Data)
◼ Covariance is similar to correlation:

   Cov(A, B) = E[(A − Ā)(B − B̄)] = Σⁿᵢ₌₁ (aᵢ − Ā)(bᵢ − B̄) / n

   Correlation coefficient:  r_A,B = Cov(A, B) / (σ_A σ_B)

where n is the number of tuples, Ā and B̄ are the respective means or
expected values of A and B, and σ_A and σ_B are the respective standard
deviations of A and B.
◼ Positive covariance: If CovA,B > 0, then A and B both tend to be larger
than their expected values.
◼ Negative covariance: If CovA,B < 0 then if A is larger than its expected
value, B is likely to be smaller than its expected value.
◼ Independence: If A and B are independent, then Cov_A,B = 0, but the converse is not true:
◼ Some pairs of random variables may have a covariance of 0 but are not
independent. Only under some additional assumptions (e.g., the data follow
multivariate normal distributions) does a covariance of 0 imply independence.
21
Co-Variance: An Example
◼ It can be simplified in computation as Cov(A, B) = E(A·B) − Ā·B̄
◼ Suppose two stocks A and B have the following values in one week:
(2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
◼ Question: If the stocks are affected by the same industry trends, will
their prices rise or fall together?
◼ E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
◼ E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6
◼ Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4
◼ Thus, A and B rise together since Cov(A, B) > 0.
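The arithmetic of this example can be checked with a few lines of NumPy; the code is only a sketch of the simplified formula above.

```python
import numpy as np

A = np.array([2, 3, 5, 4, 6], dtype=float)     # stock A prices
B = np.array([5, 8, 10, 11, 14], dtype=float)  # stock B prices

# Simplified formula: Cov(A, B) = E(A*B) - E(A)*E(B)
cov = (A * B).mean() - A.mean() * B.mean()
print(cov)   # 4.0 -> positive, so A and B tend to rise together

# np.cov uses an (n-1) denominator by default; bias=True matches the 1/n convention above
print(np.cov(A, B, bias=True)[0, 1])   # also 4.0
```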
Chapter 3: Data Preprocessing
◼ Data Preprocessing: An Overview
◼ Data Quality
◼ Major Tasks in Data Preprocessing
◼ Data Cleaning
◼ Data Integration
◼ Data Reduction
◼ Data Transformation and Data Discretization
◼ Summary
23
Data Reduction Strategies
◼ Data reduction: Obtain a reduced representation of the data set that
is much smaller in volume but yet produces the same (or almost the
same) analytical results
◼ Why data reduction? — A database/data warehouse may store
terabytes of data. Complex data analysis may take a very long time to
run on the complete data set.
◼ Data reduction strategies
◼ Dimensionality reduction, e.g., remove unimportant attributes
◼ Wavelet transforms
◼ Principal Components Analysis (PCA)
◼ Feature subset selection, feature creation
◼ Numerosity reduction (some simply call it: Data Reduction)
◼ Regression and Log-Linear Models
◼ Histograms, clustering, sampling
◼ Data cube aggregation
◼ Data compression
24
Data Reduction 1: Dimensionality Reduction
◼ Curse of dimensionality
◼ When dimensionality increases, data becomes increasingly sparse
◼ Density and distance between points, which are critical to clustering and outlier
analysis, become less meaningful
◼ The possible combinations of subspaces will grow exponentially
◼ Dimensionality reduction
◼ Avoid the curse of dimensionality
◼ Help eliminate irrelevant features and reduce noise
◼ Reduce time and space required in data mining
◼ Allow easier visualization
◼ Dimensionality reduction techniques
◼ Wavelet transforms
◼ Principal Component Analysis
◼ Supervised and nonlinear techniques (e.g., feature selection)
25
Mapping Data to a New Space
◼ Fourier transform
◼ Wavelet transform
[Figure: two sine waves; the same two sine waves with added noise; their frequency-domain representation]
26
What Is Wavelet Transform?
◼ Decomposes a signal into
different frequency subbands
◼ Applicable to n-
dimensional signals
◼ Data are transformed to
preserve relative distance
between objects at different
levels of resolution
◼ Allow natural clusters to
become more distinguishable
◼ Used for image compression
27
Principal Component Analysis (PCA)
◼ Find a projection that captures the largest amount of variation in data
◼ The original data are projected onto a much smaller space, resulting
in dimensionality reduction. We find the eigenvectors of the
covariance matrix, and these eigenvectors define the new space
[Figure: data points in the (x1, x2) plane with the principal axes found by PCA as the new coordinate system]
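A minimal NumPy sketch of PCA via the eigenvectors of the covariance matrix; the data matrix X below is made up for illustration (rows = tuples, columns = attributes).

```python
import numpy as np

# Hypothetical data: 6 tuples, 3 attributes
X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.9],
              [2.2, 2.9, 0.4],
              [1.9, 2.2, 0.8],
              [3.1, 3.0, 0.2],
              [2.3, 2.7, 0.6]])

# 1. Center the data
Xc = X - X.mean(axis=0)

# 2. Eigen-decomposition of the covariance matrix (eigh: the matrix is symmetric)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# 3. Sort components by decreasing variance and keep the top k
order = np.argsort(eigvals)[::-1]
k = 2
components = eigvecs[:, order[:k]]

# 4. Project the data onto the reduced space
X_reduced = Xc @ components
print(X_reduced.shape)   # (6, 2)
```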
32
Attribute Subset Selection
◼ Another way to reduce dimensionality of data
◼ Redundant attributes
◼ Duplicate much or all of the information contained in
one or more other attributes
◼ E.g., purchase price of a product and the amount of
sales tax paid
◼ Irrelevant attributes
◼ Contain no information that is useful for the data
mining task at hand
◼ E.g., students' ID is often irrelevant to the task of
predicting students' GPA
34
Data Reduction 2: Numerosity Reduction
◼ Reduce data volume by choosing alternative, smaller
forms of data representation
◼ Parametric methods (e.g., regression)
◼ Assume the data fits some model, estimate model
parameters, store only the parameters, and discard
the data (except possible outliers)
◼ Ex.: Log-linear models—obtain the value at a point in m-
D space as the product of appropriate marginal
subspaces
◼ Non-parametric methods
◼ Do not assume models
◼ Major families: histograms, clustering, sampling, …
37
Parametric Data Reduction: Regression
and Log-Linear Models
◼ Linear regression
◼ Data modeled to fit a straight line
◼ Often uses the least-square method to fit the line
◼ Multiple regression
◼ Allows a response variable Y to be modeled as a
linear function of multidimensional feature vector
◼ Log-linear model
◼ Approximates discrete multidimensional probability
distributions
38
Regression Analysis
◼ Regression analysis: A collective name for techniques for the modeling and analysis
of numerical data consisting of values of a dependent variable (also called
response variable or measurement) and of one or more independent variables (aka.
explanatory variables or predictors)
◼ The parameters are estimated so as to give a "best fit" of the data
◼ Most commonly the best fit is evaluated by using the least squares method, but
other criteria have also been used
◼ Used for prediction (including forecasting of time-series data), inference,
hypothesis testing, and modeling of causal relationships
[Figure: observed points (X1, Y1) with the fitted line y = x + 1; Y1′ is the value the line predicts at X1]
39
Regression Analysis and Log-Linear Models
◼ Linear regression: Y = w X + b
◼ Two regression coefficients, w and b, specify the line and are to be
estimated by using the data at hand
◼ Using the least-squares criterion on the known values of Y1, Y2, …,
X1, X2, …
◼ Multiple regression: Y = b0 + b1 X1 + b2 X2
◼ Many nonlinear functions can be transformed into the above
◼ Log-linear models:
◼ Approximate discrete multidimensional probability distributions
◼ Estimate the probability of each point (tuple) in a multi-dimensional
space for a set of discretized attributes, based on a smaller subset
of dimensional combinations
◼ Useful for dimensionality reduction and data smoothing
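A small NumPy sketch of fitting Y = wX + b by least squares; the observations are hypothetical, and np.polyfit is shown only as a cross-check of the closed-form estimates.

```python
import numpy as np

# Hypothetical observations
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([1.9, 4.1, 6.2, 7.8, 10.1])

# Closed-form least-squares estimates for Y = w*X + b
w = ((X - X.mean()) * (Y - Y.mean())).sum() / ((X - X.mean()) ** 2).sum()
b = Y.mean() - w * X.mean()
print(w, b)

# Cross-check: np.polyfit fits the same degree-1 polynomial
w2, b2 = np.polyfit(X, Y, 1)
print(w2, b2)
```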
40
Histogram Analysis
◼ Divide data into buckets and store the average (sum) for each bucket
◼ Partitioning rules:
◼ Equal-width: equal bucket range
◼ Equal-frequency (or equal-depth)
[Figure: histogram with value buckets from 10,000 to 100,000 on the x-axis and counts from 0 to 40 on the y-axis]
41
Clustering
◼ Partition data set into clusters based on similarity, and
store cluster representation (e.g., centroid and diameter)
only
◼ Can be very effective if data is clustered but not if data
is “smeared”
◼ Can have hierarchical clustering and be stored in multi-
dimensional index tree structures
◼ There are many choices of clustering definitions and
clustering algorithms
◼ Cluster analysis will be studied in depth in Chapter 10
42
Sampling
◼ Sampling: obtaining a small sample s to represent the
whole data set N
◼ Allow a mining algorithm to run in complexity that is
potentially sub-linear to the size of the data
◼ Key principle: Choose a representative subset of the data
◼ Simple random sampling may have very poor
performance in the presence of skew
◼ Develop adaptive sampling methods, e.g., stratified
sampling
◼ Note: Sampling may not reduce database I/Os (page at a
time)
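A pandas sketch contrasting simple random sampling with stratified sampling on skewed data; the 'segment' column and the sampling sizes are made up for illustration.

```python
import pandas as pd

# Hypothetical data set with a skewed 'segment' attribute
df = pd.DataFrame({
    "value": range(1000),
    "segment": ["rare"] * 50 + ["common"] * 950,
})

# Simple random sample without replacement: may under-represent the rare stratum
srs = df.sample(n=100, random_state=0)

# Stratified sample: draw 10% from each stratum so skewed groups stay represented
stratified = df.groupby("segment").sample(frac=0.1, random_state=0)

print(srs["segment"].value_counts())
print(stratified["segment"].value_counts())
```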
43
Chapter 3: Data Preprocessing
◼ Data Preprocessing: An Overview
◼ Data Quality
◼ Major Tasks in Data Preprocessing
◼ Data Cleaning
◼ Data Integration
◼ Data Reduction
◼ Data Transformation and Data Discretization
◼ Summary
50
Data Transformation
◼ A function that maps the entire set of values of a given attribute to a
new set of replacement values s.t. each old value can be identified
with one of the new values
◼ Methods
◼ Smoothing: Remove noise from data
◼ Attribute/feature construction
◼ New attributes constructed from the given ones
◼ Aggregation: Summarization, data cube construction
◼ Normalization: Scaled to fall within a smaller, specified range
◼ min-max normalization
◼ z-score normalization
◼ normalization by decimal scaling
◼ Discretization: Concept hierarchy climbing
51
Normalization
◼ Min-max normalization: to [new_min_A, new_max_A]

   v′ = (v − min_A) / (max_A − min_A) × (new_max_A − new_min_A) + new_min_A

◼ Ex. Let income range from $12,000 to $98,000 be normalized to [0.0, 1.0].
Then $73,600 is mapped to (73,600 − 12,000)/(98,000 − 12,000) × (1.0 − 0) + 0 = 0.716
◼ Z-score normalization (μ: mean, σ: standard deviation):

   v′ = (v − μ_A) / σ_A

◼ Ex. Let μ = 54,000, σ = 16,000. Then (73,600 − 54,000)/16,000 = 1.225
◼ Normalization by decimal scaling:

   v′ = v / 10^j, where j is the smallest integer such that Max(|v′|) < 1
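The three normalization methods can be written out as a short NumPy sketch that reproduces the income example above; the decimal-scaling step assumes the values are at least 1.

```python
import numpy as np

income = np.array([73600.0])           # value(s) of the attribute to normalize
min_a, max_a = 12000.0, 98000.0        # observed minimum / maximum
mu, sigma = 54000.0, 16000.0           # mean / standard deviation

# Min-max normalization to [new_min, new_max] = [0.0, 1.0]
v_minmax = (income - min_a) / (max_a - min_a) * (1.0 - 0.0) + 0.0
print(v_minmax)        # [0.716...]

# Z-score normalization
v_zscore = (income - mu) / sigma
print(v_zscore)        # [1.225]

# Decimal scaling: divide by 10^j, j = smallest integer with max(|v'|) < 1
j = int(np.ceil(np.log10(np.abs(income).max() + 1)))
v_decimal = income / 10 ** j
print(j, v_decimal)    # 5 [0.736]
```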
52
Discretization
◼ Three types of attributes
◼ Nominal—values from an unordered set, e.g., color, profession
◼ Ordinal—values from an ordered set, e.g., military or academic
rank
◼ Numeric—quantitative values, e.g., integers or real numbers
◼ Discretization: Divide the range of a continuous attribute into intervals
◼ Interval labels can then be used to replace actual data values
◼ Reduce data size by discretization
◼ Supervised vs. unsupervised
◼ Split (top-down) vs. merge (bottom-up)
◼ Discretization can be performed recursively on an attribute
◼ Prepare for further analysis, e.g., classification
53
Data Discretization Methods
◼ Typical methods: All the methods can be applied recursively
◼ Binning
◼ Top-down split, unsupervised
◼ Histogram analysis
◼ Top-down split, unsupervised
◼ Clustering analysis (unsupervised, top-down split or
bottom-up merge)
◼ Decision-tree analysis (supervised, top-down split)
◼ Correlation (e.g., χ²) analysis (unsupervised, bottom-up
merge)
54
Simple Discretization: Binning
◼ Equal-width (distance) partitioning
◼ Divides the range into N intervals of equal size: uniform grid
◼ if A and B are the lowest and highest values of the attribute, the
width of intervals will be W = (B − A)/N
◼ The most straightforward, but outliers may dominate presentation
◼ Skewed data is not handled well
◼ Equal-depth (frequency) partitioning
◼ Divides the range into N intervals, each containing approximately
same number of samples
◼ Good data scaling
◼ Managing categorical attributes can be tricky
55
Binning Methods for Data Smoothing
❑ Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26,
28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
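A plain-Python sketch that reproduces the bins and both smoothing variants above, assuming the data are already sorted and divide evenly into the bins (bin means are rounded to integers as on the slide).

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]   # already sorted

# Equal-frequency (equi-depth) partitioning into 3 bins of 4 values each
n_bins = 3
size = len(prices) // n_bins
bins = [prices[i * size:(i + 1) * size] for i in range(n_bins)]

# Smoothing by bin means (rounded to integers)
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: replace each value by the closer of min/max of its bin
by_boundaries = [
    [b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b]
    for b in bins
]

print(bins)           # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(by_means)       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_boundaries)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```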
56
Concept Hierarchy Generation
for Nominal Data
◼ Specification of a partial/total ordering of attributes
explicitly at the schema level by users or experts
◼ street < city < state < country
◼ Specification of a hierarchy for a set of values by explicit
data grouping
◼ {Urbana, Champaign, Chicago} < Illinois
◼ Specification of only a partial set of attributes
◼ E.g., only street < city, not others
◼ Automatic generation of hierarchies (or attribute levels) by
the analysis of the number of distinct values
◼ E.g., for a set of attributes: {street, city, state, country}
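A small pandas sketch of the distinct-value heuristic just described: the attribute with the most distinct values is placed at the lowest level of the hierarchy. The location data are made up for illustration.

```python
import pandas as pd

# Hypothetical location data
df = pd.DataFrame({
    "street":  ["Main St", "Oak Ave", "Elm St", "Pine Rd", "Lake Dr", "Bay St"],
    "city":    ["Urbana", "Urbana", "Chicago", "Vancouver", "Vancouver", "Toronto"],
    "state":   ["Illinois", "Illinois", "Illinois", "British Columbia", "British Columbia", "Ontario"],
    "country": ["USA", "USA", "USA", "Canada", "Canada", "Canada"],
})

# Heuristic: more distinct values -> lower level in the concept hierarchy
distinct_counts = df.nunique().sort_values()          # ascending: country, state, city, street
hierarchy = " < ".join(reversed(distinct_counts.index.tolist()))
print(distinct_counts.to_dict())
print(hierarchy)   # street < city < state < country (lowest level first)
```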
60
Chapter 3: Data Preprocessing
◼ Data Preprocessing: An Overview
◼ Data Quality
◼ Major Tasks in Data Preprocessing
◼ Data Cleaning
◼ Data Integration
◼ Data Reduction
◼ Data Transformation and Data Discretization
◼ Summary
62
Summary
◼ Data quality: accuracy, completeness, consistency, timeliness,
believability, interpretability
◼ Data cleaning: e.g. missing/noisy values, outliers
◼ Data integration from multiple sources:
◼ Entity identification problem
◼ Remove redundancies
◼ Detect inconsistencies
◼ Data reduction
◼ Dimensionality reduction
◼ Numerosity reduction
◼ Data compression
◼ Data transformation and data discretization
◼ Normalization
◼ Concept hierarchy generation
63
References
◼ D. P. Ballou and G. K. Tayi. Enhancing data quality in data warehouse environments. Comm. of
ACM, 42:73-78, 1999
◼ A. Bruce, D. Donoho, and H.-Y. Gao. Wavelet analysis. IEEE Spectrum, Oct 1996
◼ T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley, 2003
◼ J. Devore and R. Peck. Statistics: The Exploration and Analysis of Data. Duxbury Press, 1997.
◼ H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C.-A. Saita. Declarative data cleaning:
Language, model, and algorithms. VLDB'01
◼ M. Hua and J. Pei. Cleaning disguised missing data: A heuristic approach. KDD'07
◼ H. V. Jagadish, et al., Special Issue on Data Reduction Techniques. Bulletin of the Technical
Committee on Data Engineering, 20(4), Dec. 1997
◼ H. Liu and H. Motoda (eds.). Feature Extraction, Construction, and Selection: A Data Mining
Perspective. Kluwer Academic, 1998
◼ J. E. Olson. Data Quality: The Accuracy Dimension. Morgan Kaufmann, 2003
◼ D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999
◼ V. Raman and J. Hellerstein. Potter's Wheel: An Interactive Framework for Data Cleaning and
Transformation. VLDB'01
◼ T. Redman. Data Quality: The Field Guide. Digital Press (Elsevier), 2001
◼ R. Wang, V. Storey, and C. Firth. A framework for analysis of data quality research. IEEE Trans.
Knowledge and Data Engineering, 7:623-640, 1995
64