DATA AND
PREPROCESSING
WHAT IS DATA?
Collection of data objects and their attributes
An attribute is a property or characteristic of an object
– Examples: eye color of a person, temperature, etc.
– Attribute is also known as variable, field, characteristic, or feature
A collection of attributes describes an object
– Object is also known as record, point, case, sample, entity, or instance
TYPES OF ATTRIBUTES
There are different types of attributes
– Nominal
Examples: ID numbers, eye color
– Ordinal
Examples: rankings (e.g., taste of potato chips on a
scale from 1-10), grades, height in {tall, medium,
short}
– Interval
Examples: calendar dates
– Ratio
Examples: temperature in Kelvin, length, time,
counts
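A rough sketch of why the attribute type matters: the operations that are meaningful differ by type. All values and the SIZE_ORDER mapping below are made up for illustration.

```python
from datetime import date

# Nominal: only equality / inequality is meaningful.
eye_colors = ["brown", "blue", "brown", "green"]
same_color = eye_colors[0] == eye_colors[2]

# Ordinal: order is meaningful, so values can be ranked.
# SIZE_ORDER is a hypothetical encoding of {short, medium, tall}.
SIZE_ORDER = {"short": 0, "medium": 1, "tall": 2}
heights = ["tall", "short", "medium"]
ranked = sorted(heights, key=SIZE_ORDER.get)

# Interval: differences are meaningful, ratios are not
# (a calendar date has no true zero point).
days_between = (date(2024, 3, 1) - date(2024, 2, 1)).days

# Ratio: both differences and ratios are meaningful.
length_a, length_b = 10.0, 5.0
ratio = length_a / length_b  # "a is twice as long as b" makes sense
```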
DISCRETE AND CONTINUOUS ATTRIBUTES
Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a
collection of documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete
attributes
Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and
represented using a finite number of digits.
– Continuous attributes are typically represented as
floating-point variables.
TYPES OF DATA SETS
Record
– Data Matrix
– Document Data
– Transaction Data
Graph
– World Wide Web
– Molecular Structures
Ordered
– Temporal Data
– Sequential Data
– Genetic Sequence
Data
RECORD DATA
Data that consists of a collection of records,
each of which consists of a fixed set of
attributes
DATA MATRIX
If data objects have the same fixed set of
numeric attributes, then the data objects can
be thought of as points in a multi-dimensional
space, where each dimension represents a
distinct attribute
Such a data set can be represented by an m by n
matrix, where there are m rows, one for each
object, and n columns, one for each attribute
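A minimal sketch of the m by n layout, using plain Python lists. The attribute names and values are hypothetical and chosen only to make the rows and columns visible.

```python
# Hypothetical data: m = 3 objects, n = 2 numeric attributes.
attributes = ["thickness", "load"]
data_matrix = [
    [1.2, 15.0],  # object 0
    [2.7, 12.5],  # object 1
    [1.1, 16.2],  # object 2
]

m = len(data_matrix)     # number of rows, one per object
n = len(data_matrix[0])  # number of columns, one per attribute

# Each row is a point in n-dimensional space; each column holds one attribute.
load_column = [row[attributes.index("load")] for row in data_matrix]
```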
DOCUMENT DATA
Each document becomes a 'term' vector,
– each term is a component (attribute) of the vector,
– the value of each component is the number of
times the corresponding term occurs in the
document.
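The term-vector idea can be sketched with the standard library as follows. The two example documents are invented, and splitting on whitespace is a simplifying assumption (real text usually needs proper tokenization).

```python
from collections import Counter

documents = [
    "data mining finds patterns in data",
    "graph mining finds patterns in graphs",
]

# Vocabulary: one vector component per distinct term, in sorted order.
vocabulary = sorted({term for doc in documents for term in doc.split()})

def term_vector(doc):
    """Count how often each vocabulary term occurs in doc."""
    counts = Counter(doc.split())
    return [counts[term] for term in vocabulary]

vectors = [term_vector(doc) for doc in documents]
```

Here vocabulary is ['data', 'finds', 'graph', 'graphs', 'in', 'mining', 'patterns'], so the first document maps to [2, 1, 0, 0, 1, 1, 1].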
TRANSACTION DATA
A special type of record data, where
– each record (transaction) involves a set of items.
– For example, consider a grocery store. The set of
products purchased by a customer during one
shopping trip constitutes a transaction, while the
individual products that were purchased are the
items.
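A small sketch of the grocery-store example, with each transaction stored as a set of items. The transaction IDs, the items, and the helper name count_transactions_with are hypothetical.

```python
# Hypothetical transactions: each one is the set of items bought in one trip.
transactions = {
    1: {"bread", "coke", "milk"},
    2: {"beer", "bread"},
    3: {"beer", "coke", "diaper", "milk"},
}

def count_transactions_with(item):
    """Number of transactions that contain the given item."""
    return sum(item in items for items in transactions.values())

milk_count = count_transactions_with("milk")
```

Unlike a data matrix row, the records do not all have the same length: each transaction lists only the items it contains.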
GRAPH DATA
Examples: Generic graph and HTML Links
<a href="papers/papers.html#bbbb"> Data Mining </a>
<li> <a href="papers/papers.html#aaaa"> Graph Partitioning </a>
<li> <a href="papers/papers.html#aaaa"> Parallel Solution of Sparse Linear System of Equations </a>
<li> <a href="papers/papers.html#ffff"> N-Body Computation and Dense Linear System Solvers </a>
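One common way to hold link data as a graph is an adjacency list: pages are nodes and hyperlinks are directed edges. The sketch below assumes a made-up set of pages; none of the page names come from the HTML fragment above.

```python
# Hypothetical link structure: page -> pages it links to.
links = {
    "index.html": ["papers.html", "people.html"],
    "papers.html": ["index.html"],
    "people.html": ["index.html", "papers.html"],
}

def in_degree(page):
    """Number of pages that link to the given page."""
    return sum(page in targets for targets in links.values())

def out_degree(page):
    """Number of links leaving the given page."""
    return len(links[page])
```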
CHEMICAL DATA
Benzene Molecule:
C6H6
ORDERED DATA
Sequences of transactions
– Each element of the sequence is a set of items/events
ORDERED DATA
Genomic sequence
data
ORDERED DATA
Spatio-Temporal
Data
Average Monthly
Temperature of
land and ocean
Trajectories of
Moving Objects
Spatial Data: refers to the location-related aspects of data
Application: healthcare, environmental studies, geography
Temporal Data: refers to the time-related aspects of data, e.g., hours, days, years
Application: weather forecasting, e-commerce, education
DATA QUALITY
What kinds of data quality problems?
How can we detect problems with the
data?
What can we do about these problems?
Examples of data quality
problems:
– Noise and outliers
– Missing values
– Duplicate data
NOISE
Noise refers to modification of original values
– Examples: distortion of a person’s voice when talking
on a poor phone connection, and “snow” on a television screen
Figure panels: Two Sine Waves; Two Sine Waves + Noise
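The two-panel figure can be approximated with a short sketch. The frequencies, the sampling step, and the additive Gaussian noise model (standard deviation 0.3) are all assumptions made for illustration; the slide does not specify them.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is repeatable

def two_sine_waves(t):
    """Sum of two sine waves; the frequencies are arbitrary choices."""
    return math.sin(2 * math.pi * 1.0 * t) + math.sin(2 * math.pi * 3.0 * t)

times = [i / 100 for i in range(100)]
clean = [two_sine_waves(t) for t in times]

# Additive Gaussian noise is one assumed noise model among many.
noisy = [value + random.gauss(0.0, 0.3) for value in clean]

# Largest distortion that the noise introduced at any sample.
max_distortion = max(abs(a - b) for a, b in zip(clean, noisy))
```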
OUTLIERS
Outliers are data objects with characteristics that
are considerably different than most of the other
data objects in the data set
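One simple heuristic for flagging such objects in a single numeric attribute is the z-score: how many standard deviations a value lies from the mean. The data and the threshold of 2.0 below are illustrative assumptions, not a rule given in these slides.

```python
from statistics import mean, pstdev

# Hypothetical measurements; one value is far from the rest.
values = [10.1, 9.8, 10.3, 10.0, 9.9, 25.0, 10.2]

center = mean(values)
spread = pstdev(values)  # population standard deviation

def z_score(x):
    """Distance of x from the mean, in standard deviations."""
    return (x - center) / spread

THRESHOLD = 2.0  # arbitrary cut-off chosen for this sketch
outliers = [x for x in values if abs(z_score(x)) > THRESHOLD]
```

Note that an extreme value also inflates the mean and standard deviation used to judge it, so this heuristic can miss outliers in small or heavily contaminated samples.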
DEVIATION/ANOMALY DETECTION
Outliers are useful when we need to detect
significant deviations from normal behavior
Applications:
– Credit Card Fraud Detection
– Network Intrusion Detection
MISSING VALUES
Reasons for missing values
– Information is not collected
(e.g., people decline to give their age and weight)
– Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)
Handling missing values
– Eliminate Data Objects
– Estimate Missing Values
– Ignore the Missing Value During Analysis
– Replace with all possible values (weighted by
their probabilities)
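The first three handling strategies can be sketched as follows. The records are hypothetical, None marks a missing value, and mean imputation is just one simple way to estimate a value; the probability-weighted replacement is not shown.

```python
from statistics import mean

# Hypothetical records; None marks a missing value.
records = [
    {"age": 34, "weight": 70.0},
    {"age": None, "weight": 82.5},
    {"age": 29, "weight": None},
    {"age": 41, "weight": 64.0},
]

# 1. Eliminate data objects that have any missing value.
complete = [r for r in records if None not in r.values()]

# 2. Estimate missing values, here with the mean of the observed values.
def impute_mean(rows, key):
    """Return copies of rows with missing `key` filled by the observed mean."""
    observed = [r[key] for r in rows if r[key] is not None]
    fill = mean(observed)
    return [{**r, key: fill if r[key] is None else r[key]} for r in rows]

imputed = impute_mean(impute_mean(records, "age"), "weight")

# 3. Ignore the missing value during analysis: use only observed values.
mean_age = mean(r["age"] for r in records if r["age"] is not None)
```

Elimination keeps only two of the four records here, which shows why it can be costly when many objects have at least one missing value.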
DUPLICATE DATA
Data set may include data objects that are
duplicates, or almost duplicates of one another
– Major issue when merging data from heterogeneous
sources
Examples:
– Same person with multiple email addresses
Data cleaning
– Process of dealing with duplicate data issues
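A minimal sketch of the "same person with multiple email addresses" case. The records are invented, and treating a normalized name plus birth date as a person's identity is an assumption of this sketch; real record linkage usually needs fuzzier matching.

```python
# Hypothetical records merged from two sources; the same person
# appears with different email addresses and name formatting.
records = [
    {"name": "Ana Silva", "birth_date": "1990-04-12", "email": "ana@example.com"},
    {"name": "ana  silva", "birth_date": "1990-04-12", "email": "a.silva@example.org"},
    {"name": "Ben Ortiz", "birth_date": "1985-11-02", "email": "ben@example.com"},
]

def person_key(record):
    """Normalize the name (case, whitespace) and pair it with the birth date."""
    name = " ".join(record["name"].lower().split())
    return (name, record["birth_date"])

# Group records that share a key, collecting every email seen for that person.
merged = {}
for record in records:
    entry = merged.setdefault(person_key(record), {"emails": []})
    entry["emails"].append(record["email"])
```

After the loop, merged holds two people, and the entry for ("ana silva", "1990-04-12") carries both email addresses.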