By Rajeev Kumar
Agenda
Data Mining Definition
Data Mining Comparisons
Data Evolution
KDD Process
Data Mining Process
Data Mining Techniques
Applications
Case Study: DM Application in GIS
Pattern Recognition
Definition
DATA MINING is defined as “the nontrivial extraction of implicit,
previously unknown, and potentially useful information from data”
and “the science of extracting useful information from large data
sets or databases”.
It is also said to be the search for the relationships and global
patterns that exist in large databases but are hidden among vast
amounts of data.
The patterns must be
valid: hold on new data with some certainty
novel: non-obvious to the system
useful: it should be possible to act on the pattern
understandable: humans should be able to interpret the pattern
Why Data Mining?
Competitive Advantage!
“The secret of success is to know something that
nobody else knows.”
-Aristotle Onassis
Human analysis skills are inadequate
Volume and dimensionality of the data
High data growth rate
Availability of:
• Data
• Storage
• Computational power
• Off-the-shelf software
• Expertise
Why Data Mining?
Credit ratings/targeted marketing:
Given a database of 100,000 names, which persons are the least likely to default on
their credit cards?
Identify likely responders to sales promotions
Fraud detection
Which types of transactions are likely to be fraudulent, given the demographics and
transactional history of a particular customer?
Customer relationship management:
Which of my customers are likely to be the most loyal, and which are most likely to
leave for a competitor?
Data Mining helps extract such information
Data Mining works with Warehouse Data
Data Warehousing provides the
Enterprise with a memory
Data Mining provides the
Enterprise with intelligence
Evolution of Data Analysis
Data Collection (1960s)
  Business question: "What was my total revenue in the last five years?"
  Enabling technologies: computers, tapes, disks
  Product providers: IBM, CDC
  Characteristics: retrospective, static data delivery
Data Access (1980s)
  Business question: "What were unit sales in New England last March?"
  Enabling technologies: relational databases (RDBMS), Structured Query Language (SQL), ODBC
  Product providers: Oracle, Sybase, Informix, IBM, Microsoft
  Characteristics: retrospective, dynamic data delivery at record level
Data Warehousing & Decision Support (1990s)
  Business question: "What were unit sales in New England last March? Drill down to Boston."
  Enabling technologies: on-line analytic processing (OLAP), multidimensional databases, data warehouses
  Product providers: SPSS, Comshare, Arbor, Cognos, Microstrategy, NCR
  Characteristics: retrospective, dynamic data delivery at multiple levels
Data Mining (Emerging Today)
  Business question: "What's likely to happen to Boston unit sales next month? Why?"
  Enabling technologies: advanced algorithms, multiprocessor computers, massive databases
  Product providers: SPSS/Clementine, Lockheed, IBM, SGI, SAS, NCR, Oracle, numerous startups
  Characteristics: prospective, proactive information delivery
The Knowledge Discovery (KDD) Process
Problem formulation
Data collection
subset data: sampling might hurt if the data is highly skewed
feature selection: e.g. principal component analysis
Pre-processing: cleaning
name/address cleaning, resolving different meanings (annual vs. yearly), duplicate
removal, supplying missing values (see the sketch after this list)
Transformation:
map complex objects e.g. time series data to features e.g. frequency
Choosing mining task and mining method
Result evaluation and Visualization
Knowledge discovery is an iterative process
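To make the cleaning and feature-selection steps concrete, here is a minimal Python sketch; the toy table, its column names, and the choice of two components are invented for illustration:

```python
# Minimal sketch of two KDD pre-processing steps: supplying missing
# values / removing duplicates, then PCA-based feature reduction.
import pandas as pd
from sklearn.decomposition import PCA

df = pd.DataFrame({
    "income":  [52000, 48000, None, 61000, 48000],
    "age":     [34, 29, 41, 38, 29],
    "balance": [1200, 300, 950, None, 300],
})

# Cleaning: fill missing values with column means, drop duplicate records
df = df.fillna(df.mean()).drop_duplicates()

# Transformation / feature selection: project onto 2 principal components
pca = PCA(n_components=2)
features = pca.fit_transform(df)
print(features.shape)                  # (number of records, 2)
print(pca.explained_variance_ratio_)   # variance captured per component
```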
The Data Mining Process
Why Should There be a Standard Process?
The data mining process must be reliable and repeatable by people with
little data mining background.
Process Standardization
CRISP-DM:
CRoss Industry Standard Process for Data Mining
Initiative launched in September 1996
• CRISP-DM provides a uniform framework for
– guidelines
– experience documentation
• CRISP-DM is flexible to account for differences
– Different business/agency problems
– Different data
The Data Mining Process
Phases in the DM Process:
Phases
Business Understanding
understanding project objectives and data
mining problem identification
Data Understanding
capture and understand data for quality
issues
Data Preparation
data cleaning, merging data and deriving
attributes
Modeling
select the data mining technique and build
the model
Evaluation
evaluate results and approve the model
Deployment
put the models into practice, monitoring and
maintenance
DM Techniques
Based on two models
Verification Model
-user makes hypothesis
-tests hypothesis to verify its validity
Discovery Model
-automatically discovering important information hidden in
the data
-data is sifted in search of frequently occurring patterns,
trends and generalizations
-no guidance from user
DM Techniques
1. Discovery of Association Rules
2. Clustering
3. Discovery of Classification Rules
4. Frequent Episodes
5. Deviation Detection
6. Neural Networks
7. Genetic Algorithms
8. Rough Set Techniques
9. Support Vector Machines
Association Rules
The purchase of one product when another product is purchased
represents an association rule (AR)
Used mainly in retail stores to
-Assist in marketing
-Shelf management
-Inventory control
Support means how often X and Y occur together, as a percentage of the
total transactions
Confidence means how much a particular item is dependent on the other
Given a set T of groups of items
Example: sets of items purchased
Goal: find all rules on itemsets of the form a --> b such that
support of a and b > user threshold s
conditional probability (confidence) of b given a > user threshold c
Example: milk --> bread
Purchase of product A --> service B
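A brute-force sketch of support and confidence on a toy transaction set (the transactions and the thresholds s and c are invented for illustration):

```python
# Hedged sketch: compute support and confidence for a candidate rule
# a --> b over a tiny transaction set, then test the user thresholds.
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "eggs"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(a, b):
    """Estimated conditional probability of b given a."""
    return support(a | b) / support(a)

a, b = {"milk"}, {"bread"}
s, c = 0.4, 0.6                          # user thresholds
rule_holds = support(a | b) > s and confidence(a, b) > c
print(support(a | b), confidence(a, b), rule_holds)   # 0.5 0.666... True
```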
Clustering
Unsupervised learning, used when old data with class labels is not
available, e.g. when introducing a new product.
Group/cluster existing customers based on their payment-history time
series so that similar customers fall in the same cluster.
Key requirement: Need a good measure of similarity
between instances.
Identify micro-markets and develop policies for each
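A hedged sketch of this idea: cluster a few customers by an invented payment-history vector using k-means, with Euclidean distance standing in for the similarity measure; k = 2 and the data are arbitrary choices:

```python
# Illustrative sketch: cluster customers by a short payment-history
# vector (days of payment delay per month) using k-means.
import numpy as np
from sklearn.cluster import KMeans

payments = np.array([
    [0, 2, 1, 0],      # prompt payers
    [1, 0, 0, 2],
    [30, 45, 60, 40],  # chronically late payers
    [25, 50, 55, 35],
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(payments)
print(km.labels_)      # one cluster id per customer, e.g. [0 0 1 1]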
Clustering- Applications
Customer segmentation e.g. for targeted marketing
Collaborative filtering:
Given a database of user preferences, predict the preferences of a new user
Example: predict what new movies you will like based on
your past preferences
others with similar past preferences
their preferences for the new movies
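A minimal sketch of that prediction, assuming an invented ratings matrix: the new user's rating for an unseen movie is a similarity-weighted average over users with similar past preferences:

```python
# User-based collaborative-filtering sketch (ratings are made up).
import numpy as np

# rows = existing users, cols = movies; the last column is the new movie
ratings = np.array([
    [5, 4, 1, 5.0],
    [4, 5, 2, 4.0],
    [1, 2, 5, 1.0],
])
new_user = np.array([5, 5, 1])   # ratings on the first three movies only

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

sims = np.array([cosine(new_user, r[:3]) for r in ratings])
# similarity-weighted average of the others' ratings for the new movie
pred = sims @ ratings[:, 3] / sims.sum()
print(round(pred, 2))   # weighted toward the similar users' high ratings
```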
Decision trees
Tree where internal nodes are simple decision rules on
one or more attributes and leaf nodes are predicted
class labels.
[Example tree: root test "Is company software?" branches to "Product
development" and "On-site job", with leaf labels Good / Bad / Bad / Good.]
Decision tree
Widely used learning method
Easy to interpret: can be re-represented as if-then-else rules
Does not require any prior knowledge of the data distribution, and works
well on noisy data.
Has been applied to:
classify medical patients by disease,
equipment malfunctions by cause,
loan applicants by likelihood of payment.
· Pros
+ Reasonable training time
+ Fast application
+ Easy to interpret
+ Easy to implement
+ Can handle a large number of features
· Cons
- Cannot handle complicated relationships between features
- Simple decision boundaries
- Problems with lots of missing data
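For concreteness, a small scikit-learn sketch on an invented loan table; export_text prints the fitted tree as if-then rules, matching the interpretability point above:

```python
# Hedged sketch: a decision tree on a toy loan dataset
# (features and labels are invented for the example).
from sklearn.tree import DecisionTreeClassifier, export_text

# features: [income in $1000s, years employed]; label: 1 = likely to repay
X = [[60, 5], [25, 1], [80, 10], [30, 2], [55, 7], [20, 0]]
y = [1, 0, 1, 0, 1, 0]

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["income_k", "years_employed"]))
print(tree.predict([[45, 3]]))     # classify a new applicant
```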
Neural networks
Involves developing mathematical structures with the ability to
learn
Best at identifying patterns or trends in data
Well suited for prediction or forecasting needs
Useful for learning complex data like handwriting, speech and
image recognition
· Pros
+ Can learn more complicated class boundaries
+ Fast application
+ Can handle a large number of features
· Cons
- Slow training time
- Hard to interpret
- Hard to implement: trial and error for choosing the number of nodes
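A tiny illustrative sketch: a small multilayer perceptron learning XOR, a boundary no single linear rule can draw; the layer size, solver and iteration cap are arbitrary choices:

```python
# Sketch: an MLP learning a non-linear class boundary (XOR).
from sklearn.neural_network import MLPClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]                     # XOR: not linearly separable

net = MLPClassifier(hidden_layer_sizes=(8,), solver="lbfgs",
                    random_state=0, max_iter=2000).fit(X, y)
print(net.predict(X))                # ideally [0 1 1 0]
```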
Data Mining Applications
Some examples of “successes":
1. Decision trees constructed from bank-loan histories to produce
algorithms to decide whether to grant a loan.
2. Patterns of traveler behavior mined to manage the sale of
discounted seats on planes, rooms in hotels, etc.
3. “Bread and Butter." Observation that customers who buy bread
are more likely to buy butter than average, allowed supermarkets
to place bread and butter nearby, knowing many customers would
walk between them.
4. Skycat and Sloan Sky Survey: clustering sky objects by their
radiation levels in different bands allowed astronomers to
distinguish between galaxies, nearby stars, and many other kinds
of celestial objects.
5. Comparison of the genotypes of people with and without a condition
allowed the discovery of a set of genes that together account for
many cases of diabetes. This sort of mining has become much
more important as the human genome has been fully decoded.
Issues and Challenges in DM
Limited Information
Noise or missing Data
User interaction
Uncertainty
Size, updates and irrelevant fields
Other Applications
Banking: loan/credit card approval
predict good customers based on old customers
Customer relationship management:
identify those who are likely to leave for a competitor.
Targeted marketing:
identify likely responders to promotions
Fraud detection: telecommunications, financial transactions
from an online stream of events, identify fraudulent events
Manufacturing and production:
automatically adjust knobs when process parameters change
Medicine: disease outcome, effectiveness of treatments
analyze patient disease history: find relationship between
diseases
Molecular/Pharmaceutical: identify new drugs
Scientific data analysis:
identify new galaxies by searching for subclusters
Web site/store design and promotion:
find affinity of visitor to pages and modify layout
Introduction
In USA, traffic crashes cause one death
every 13 minutes and an injury every 15
seconds
If current US crash rate remains
unchanged
* One child out of every 84 born today will die
violently in a motor vehicle crash
* Six out of every 10 children will be injured in a
highway crash over a lifetime
The economic impact to the U.S. is
roughly $150 billion
Deaths and injuries on Florida’s highways
cost society $14 billion annually, ranking
Florida in the top five nationally (FDOT,
2003)
In addition, insecurity on the roads contributes substantially to the economic costs
associated with traffic accidents
Over the last two decades, increasingly large amounts of
transportation data have been stored electronically
This volume is expected to continue to grow considerably in the future
A tremendous amount of this data pertains to transportation safety
Despite this wealth of data, we have been unable to fully capitalize on
its value
This is because the information implicit in the data is not easily
discernible without the use of proper data analysis techniques
Advanced modeling and analysis of such data are needed to
understand relationships and dependencies
Two major challenges face transportation engineers today:
How to extract the most relevant information from the vast amount of
available traffic data
How to use the wealth of data to enhance our understanding of traffic
behavior
Data mining is an emerging field that promotes the progress of data
analysis
Why Data Mining?
When dealing with large and complex data sets, such as
transportation data, data mining techniques are useful for knowledge
discovery and for identifying the relevant variables that contribute
most to a better understanding of safety patterns, problems and causes.
Data mining application in transportation is still in its infancy, but
has high potential for growth
Applications of Data Mining in Transportation:
Improve Traffic Signal Timing Plans
Measure Freeway Performance
Improve Aviation Safety
Improve delivery and quality of traffic safety information
Developed Analytical Methodology
Step 1 - GIS: identify relevant freeway features at each accident location and
integrate them with the crash database
Step 2 - Preliminary Statistical Analysis: better understanding of the data
Step 3 - Data Mining Clustering Techniques: identify clusters of common
accidents, and conditions under which accidents are more likely to cause
death or injury
Step 4 - Data Mining Profiling Techniques: profile accidents in terms of
accident and freeway characteristics
Step 5 - Visualization Techniques: better understanding & presentation of
results
Clustering Analysis:
Clustering = Unsupervised classification
Place objects into clusters suggested by the data, not defined a priori
Studies have shown that clustering methods are an important tool when
analyzing traffic accidents as these methods are able to identify groups of
road users, vehicles and road segments which would be suitable targets
for countermeasures
Demographic clustering is a distribution-based data mining technique
that provides fast and natural clustering of very large databases
The Data Mining Process:
1. Data Capture
2. Data Merging
3. Data Description
4. Statistical Analysis
5. Data Pre-Processing
Data Transformation
Data Cleaning
6. Building Data-Mining Models
Data Capture
The 1999 crash data for MDC Freeways was utilized
Crash data:
Rich source of accident related information
Contains 39 attributes that describe the accident: Roadway section,
date, time, accident type, driver age, lighting condition, traffic control,
and road condition
But
This does not provide information about the physical characteristics of
the roadway at the accident location, which is necessary to develop
appropriate countermeasures.
Data Merging
The Roadway Characteristics Inventory (RCI) database contains various
physical and administrative features related to the roadway network
Speed limit, local name, number of lanes, shoulder type, median type, and
median width were extracted from the RCI database using a GIS software
package (ArcView)
Avenue (ArcView's scripting language) was used to write a special script to
merge the crash data and roadway attribute tables
Spatial reference and analysis were performed to identify the roadway
features at each accident location
[Figure: crash and roadway characteristic data merging; median width and
shoulder type are identified at each accident location.]
Data Description
Merged dataset contains 5,870 records (one for each accident)
The 45 attributes describing each accident can be divided into seven
dimensions:
Accident Information (e.g. number of injuries, number of fatalities,
accident type, point of impact)
Road Information (e.g. road condition, road surface, road side)
Traffic Information (e.g. accident lane number)
Driver Information (e.g. driver age)
Geographical Information (e.g. roadway section, area type)
Environmental Conditions Information (e.g. lighting condition, weather
condition, date and time of the accident)
Roadway Feature Information (e.g. number of lanes, speed limit, local
name, median width)
Statistical Analysis
The database:
Large
Noisy and/or missing data
No clearly defined expectations about the kind of clusters to be
discovered
Preliminary statistical analysis was performed
Basic statistics, such as maximum, minimum, mean, variance, and
frequencies were calculated for numeric fields
Frequencies were calculated for continuous numeric fields, and for
categorical and discrete numeric fields
These statistics helped in providing a better understanding of the data and
in speeding up the problem identification process
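The kind of summary this step produces can be sketched with pandas; the stand-in crash table and its column names are invented:

```python
# Sketch of preliminary statistics on a stand-in crash table
# (all column names and values here are invented).
import pandas as pd

crashes = pd.DataFrame({
    "driver_age": [23, 31, 45, 31, 19],
    "lane": [1, 2, 1, 3, 1],
    "accident_type": ["rear-end", "angle", "rear-end", "sideswipe", "rear-end"],
})

print(crashes["driver_age"].describe())         # min, max, mean, std, ...
print(crashes["accident_type"].value_counts())  # categorical frequencies
print(crashes["lane"].value_counts())           # discrete numeric frequencies
```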
Data Pre-Processing
Data transformation
Datasets:
Most variables are categorical
Discrete numeric
Continuous
Data Cleaning
Irregularities were tracked, listed and corrected or removed from the
dataset
Building Data-Mining Models
This process is iterative
Several input variables and model parameters are explored
The 45 variables were narrowed down to the most important ones,
resulting in the set of variables used to build the data-mining
models
Clustering- Input Variables:
Area type, accident type, main contributing cause, site location,
highway name, driver age, accident lane, traffic control, and time of
the accident
Profiling -Input Variables:
Other discrete and categorical variables and some interesting
continuous variables were input as supplementary variables
Variables used to profile the clusters but not to define them
Building Data-Mining Models
Model 1
The freeway accidents dataset involving one or more injuries
Modal values for each cluster (local name; type of accident; accident time;
site location; accident lane; contributing cause; driver 1 age; type of control):
Cluster 4 (36.15%): Palmetto Expressway; Rear-End; 4-7 PM; Not at Interchange; Lane 1; Careless Driving Act; 25-35; No Control
Cluster 6 (11.60%): Palmetto Expressway; Hit Concrete Barrier Wall; 9 PM-7 AM; Not at Interchange; Median; Careless Driving Act; 25-35; No Control
Cluster 5 (11.11%): Palmetto Expressway; Rear-End; 4-7 PM; Exit Ramp; Ramp; Careless Driving Act; 15-25; No Control
Cluster 9 (8.54%): I-95; Sideswipe; 1-4 PM; Not at Interchange; Lane 3; Improper Lane Change; 35-45; No Control
Cluster 3 (7.95%): I-95; Rear-End; 4-7 PM; Not at Interchange; Lane 2; No Improper Driving Act; 25-35; No Control
Cluster 8 (7.95%): I-95; Angle; 9 PM-7 AM; Not at Interchange; Side of Road; Careless Driving Act; 25-35; No Control
Cluster 1 (5.90%): I-95; Rear-End; 1-4 PM; Bridge; Ramp; Careless Driving Act; 25-35; No Control
Cluster 2 (5.76%): I-95; Angle; 1-4 PM; Not at Interchange; Side of Road; Careless Driving Act; 25-35; No Control
Cluster 7 (5.03%): Palmetto Expressway; Angle; 9 PM-7 AM; At Intersection; Ramp; No Improper Driving Act; 25-35; No Control
Model 2
The freeway accidents dataset that involved one or more fatalities
[Figure: detailed visualization of one of the clusters, involving 26.67% of the
population.]
The analytical methodology developed in this study:
Provides insight into the roadway and driver characteristics that
contribute to severe accidents on the Miami-Dade County
Freeway System
Provides information for transportation planners to improve
planning, operating, and managing the freeway system
Could help interested agencies effectively allocate resources to
improve safety measures in those areas with high accident
frequency
Pattern Recognition
Definition:
“The assignment of a physical object or event to one of several
pre-specified categories” –Duda and Hart
“The science that concerns the description or classification
(recognition) of measurements” –Schalkoff
Pattern Recognition is concerned with answering the question
“What is this?” –Morse
It is the study of how machines can:
• observe the environment
• learn to distinguish patterns of interest from their
background
• make sound and reasonable decisions about the
categories of the patterns.
A pattern is an object, process or event that can be given a name.
A pattern class (or category) is a set of patterns sharing common attributes and usually
originating from the same source.
During recognition (or classification) given objects are assigned to prescribed classes.
A classifier is a machine which performs classification.
APPLICATIONS:
Computer Vision
Speech Recognition
Automated Target Recognition
Optical Character Recognition
Seismic Analysis
Man and Machine Diagnostics
Fingerprint Identification
Image Preprocessing / Segmentation
Industrial Inspection
Financial Forecast
Medical Diagnosis
ECG Signal Analysis
Emerging PR Applications
Speech recognition
  Input: speech waveforms
  Output: spoken words, speaker identity
Non-destructive testing
  Input: ultrasound, eddy current, acoustic emission waveforms
  Output: presence/absence of flaw, type of flaw
Detection and diagnosis of disease
  Input: EKG, EEG waveforms
  Output: types of cardiac conditions, classes of brain conditions
Natural resource identification
  Input: multispectral images
  Output: terrain forms, vegetation cover
Aerial reconnaissance
  Input: visual, infrared, radar images
  Output: tanks, airfields
Character recognition (page readers, zip code, license plate)
  Input: optically scanned image
  Output: alphanumeric characters
Emerging PR Applications (cont’d)
Identification and counting of cells
  Input: slides of blood samples, micro-sections of tissues
  Output: type of cells
Inspection (PC boards, IC masks, textiles)
  Input: scanned image (visible, infrared)
  Output: acceptable/unacceptable
Manufacturing
  Input: 3-D images (structured light, laser, stereo)
  Output: identified objects, pose, assembly
Web search
  Input: key words specified by a user
  Output: text relevant to the user
Fingerprint identification
  Input: image from fingerprint sensors
  Output: owner of the fingerprint, fingerprint classes
Online handwriting retrieval
  Input: query word written by a user
  Output: occurrences of the word in the database
Key Objectives:
Process the sensed data to eliminate noise
Given a sensed pattern, choose the best-fitting model for it and
then assign it to the class associated with that model.
Classification v/s Clustering
• Classification (known categories): supervised classification (recognition)
• Clustering (creation of new categories): unsupervised classification
[Figure: the same point set separated into Category "A" and Category "B" by
classification, versus grouped into new categories by clustering.]
A basic pattern classification system contains (wired together in the sketch
after this list):
A sensor
A preprocessing mechanism
A feature extraction mechanism (manual or automated)
A classification algorithm
A set of examples (training set) already classified or described
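A hedged sketch of these components as a scikit-learn pipeline, using the bundled digits images as the already-classified example set; the scaler, PCA and k-NN stages are illustrative stand-ins for preprocessing, feature extraction and classification:

```python
# Sketch: preprocessing -> feature extraction -> classification,
# trained on a labelled example set (the sklearn digits images).
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

digits = load_digits()                      # the training set of examples
clf = make_pipeline(StandardScaler(),       # preprocessing
                    PCA(n_components=20),   # feature extraction
                    KNeighborsClassifier()) # classification algorithm
clf.fit(digits.data[:-10], digits.target[:-10])
print(clf.predict(digits.data[-10:]))       # classify held-out patterns
print(digits.target[-10:])                  # true labels for comparison
```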
Main PR Approaches:
Template matching
The pattern to be recognized is matched against a stored
template while taking into account all allowable pose
(translation and rotation) and scale changes.
Statistical pattern recognition
Focuses on the statistical properties of the patterns (i.e.,
probability densities).
Structural Pattern Recognition
Describe complicated objects in terms of simple primitives and
structural relationships.
Syntactic pattern recognition
Decisions consist of logical rules or grammars.
Artificial Neural Networks
Inspired by biological neural network models.
Template Matching
[Figure: a stored template matched against an input scene.]
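A minimal template-matching sketch on invented grayscale patches: slide the template over the scene and report the offset with the highest normalized correlation (a real system would also search over rotation and scale, as noted above):

```python
# Template matching by normalized cross-correlation (toy arrays).
import numpy as np

scene = np.array([
    [0, 0, 0, 0, 0],
    [0, 9, 8, 0, 0],
    [0, 8, 9, 0, 0],
    [0, 0, 0, 0, 0],
], dtype=float)
template = np.array([[9, 8], [8, 9]], dtype=float)

def ncc(a, b):
    """Normalized cross-correlation of two equal-shaped patches."""
    a, b = a - a.mean(), b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return (a * b).sum() / denom if denom else 0.0

h, w = template.shape
scores = {(i, j): ncc(scene[i:i + h, j:j + w], template)
          for i in range(scene.shape[0] - h + 1)
          for j in range(scene.shape[1] - w + 1)}
print(max(scores, key=scores.get))   # best-matching offset, here (1, 1)
```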
Statistical Pattern Recognition
• Patterns represented in a feature space
• Statistical model for pattern generation in feature space
• Basically adopted for numeric data.
Structural Pattern Recognition
Describe complicated objects in terms of simple primitives and
structural relationships.
Decision-making when features are non-numeric or structural
[Figure: a scene described as a parse tree of primitives: the scene splits into
Object and Background, each decomposed further into sub-patterns.]
Artificial Neural Networks
Massive parallelism is essential for complex pattern recognition
tasks (e.g., speech and image recognition)
Humans take only a few hundred milliseconds for most cognitive tasks,
which suggests parallel computation
A typical Pattern Recognition System:
[Flow: INPUT -> SENSING -> SEGMENTATION -> FEATURE EXTRACTION ->
CLASSIFICATION -> POST-PROCESSING -> OUTPUT]
Sensing:
The sensor converts images, sounds or other physical inputs into signal
data. The input to the PR system comes from a transducer such as a
camera or a microphone, so the difficulty of the problem depends on the
limitations and characteristics of the transducer: its bandwidth,
resolution, sensitivity, distortion, signal-to-noise ratio, etc.
Segmentation:
Isolates sensed objects from the background or from other data.
• Feature extraction:
Measures object properties that are useful for classification; the aim is to
extract discriminative features that are "good" for classification.
Good features:
Objects from the same class have similar feature values.
Objects from different classes have different feature values.
[Figure: feature spaces contrasting "good" features (classes well separated)
with "bad" features (classes overlapping).]
Classification:
Takes in the feature vector from the feature extractor and assigns the object to a
category.
The variability of feature values for objects in the same category may be due
to the complexity of the data or to noise
How can the classifier best cope with the variability and what is the best
performance possible?
Is it possible to extract all the values of a particular feature of a particular input?
What should the classifier do in case some of the feature data is missing?
Classification consists of determining to which decision region a feature
vector x belongs.
Borders between decision regions are called decision boundaries.
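A toy sketch of decision regions, assuming a nearest-centroid rule on invented 2-D data: each region is the set of feature vectors closest to one class centroid, and the decision boundary lies where two centroids are equidistant:

```python
# Nearest-centroid classifier: regions and boundary on toy 2-D data.
import numpy as np

centroids = {"A": np.array([0.0, 0.0]), "B": np.array([4.0, 4.0])}

def classify(x):
    """Assign x to the region of the nearest class centroid."""
    return min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))

print(classify(np.array([1.0, 0.5])))   # 'A'
print(classify(np.array([3.5, 4.2])))   # 'B'
print(classify(np.array([2.0, 2.0])))   # on the boundary; ties go to 'A'
```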
Post-processing
A post-processor takes care of other considerations like the effects of
context and the costs of errors to decide on the appropriate action.
Poses a lot of challenges, such as:
How do we deal with the classification error rate?
How do we incorporate knowledge about costs and use the
minimum-cost model without affecting our classification decision?
In case of multiple classifiers, how does the “super” classifier pool the
evidence to arrive at the best decision? Is the majority always right?
How would a classifier know when to base the decision on a minority
opinion?
The Design Cycle
Data collection
Feature Choice
Model Choice
Training
Evaluation
Computational Complexity
The pattern recognition design cycle
Data collection
Probably the most time-intensive component of a PR project.
Can account for a large part of the cost of developing a PR system.
How many examples are enough?
Is the data set adequately large to yield accurate results?
Feature choice
Critical design step.
Requires basic prior knowledge
Does prior knowledge always yield relevant information?
Model choice
How do we determine when a hypothetical model differs from the true
model underlying the existing patterns?
Determining which model is best suited for the problem at hand.
Training
Using the data to determine the classifier – training the classifier.
Supervised, unsupervised and reinforcement learning.
Evaluation
How well does the trained model do?
Evaluation is necessary to measure the performance of the system and
to identify needed improvements in its components.
Overfitting vs. generalization:
the need to arrive at the best complexity for a model (see the sketch below).
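A minimal illustration of this trade-off, assuming scikit-learn's bundled breast-cancer data and decision trees as the model family: training accuracy keeps climbing with depth while held-out accuracy stalls or drops, the classic overfitting signal:

```python
# Sketch: training vs. test accuracy as model complexity grows.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

for depth in (1, 3, 10, None):      # increasing model complexity
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(Xtr, ytr)
    print(depth, round(tree.score(Xtr, ytr), 3), round(tree.score(Xte, yte), 3))
```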
Conclusion
The number, magnitude and complexity of the subproblems in PR are
overwhelming.
Though it is still an emerging field, various applications are already
running successfully.
There still remain several fascinating unsolved problems providing
immense opportunities for progress.