By Rajeev Kumar
Agenda
Data Mining Definition
Data Mining Comparisons
Data Evolution
KDD Process
Data Mining Process
Data Mining Techniques
Applications
Case Study: DM Application in GIS
Pattern Recognition
Definition
DATA MINING is defined as “the nontrivial extraction of implicit,
previously unknown, and potentially useful information from data”
and “the science of extracting useful information from large data
sets or databases”.
It is also said to be the search for the relationships and global
patterns that exist in large databases but are hidden among vast
amounts of data.
The patterns must be
valid: hold on new data with some certainty
novel: non-obvious to the system
useful: it should be possible to act on the pattern
understandable: humans should be able to interpret the pattern
Why Data Mining?
Competitive Advantage!
“The secret of success is to know something that
nobody else knows.”
-Aristotle Onassis
Human analysis skills are inadequate
Volume and dimensionality of the data
High data growth rate
Availability of:
• Data
• Storage
• Computational power
• Off-the-shelf software
• Expertise
Why Data Mining?
Credit ratings/targeted marketing:
Given a database of 100,000 names, which persons are the least likely to default on
their credit cards?
Identify likely responders to sales promotions
Fraud detection
Which types of transactions are likely to be fraudulent, given the demographics and
transactional history of a particular customer?
Customer relationship management:
Which of my customers are likely to be the most loyal, and which are most likely to
leave for a competitor?
Data Mining helps extract such information
Data Mining works with Warehouse Data
Data Warehousing provides the
Enterprise with a memory
Data Mining provides the
Enterprise with intelligence
Evolution of Data Analysis
Data Collection (1960s)
  Business question: "What was my total revenue in the last five years?"
  Enabling technologies: computers, tapes, disks
  Product providers: IBM, CDC
  Characteristics: retrospective, static data delivery
Data Access (1980s)
  Business question: "What were unit sales in New England last March?"
  Enabling technologies: relational databases (RDBMS), Structured Query Language (SQL), ODBC
  Product providers: Oracle, Sybase, Informix, IBM, Microsoft
  Characteristics: retrospective, dynamic data delivery at record level
Data Warehousing & Decision Support (1990s)
  Business question: "What were unit sales in New England last March? Drill down to Boston."
  Enabling technologies: on-line analytic processing (OLAP), multidimensional databases, data warehouses
  Product providers: SPSS, Comshare, Arbor, Cognos, Microstrategy, NCR
  Characteristics: retrospective, dynamic data delivery at multiple levels
Data Mining (Emerging Today)
  Business question: "What's likely to happen to Boston unit sales next month? Why?"
  Enabling technologies: advanced algorithms, multiprocessor computers, massive databases
  Product providers: SPSS/Clementine, Lockheed, IBM, SGI, SAS, NCR, Oracle, numerous startups
  Characteristics: prospective, proactive information delivery
The Knowledge Discovery (KDD) Process
Problem formulation
Data collection
subset data: sampling might hurt if the data is highly skewed
feature selection: e.g. principal component analysis
Pre-processing: cleaning
name/address cleaning, resolving different meanings (annual vs. yearly), duplicate
removal, supplying missing values (see the sketch after this list)
Transformation:
map complex objects e.g. time series data to features e.g. frequency
Choosing mining task and mining method
Result evaluation and Visualization
Knowledge discovery is an iterative process
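To make the cleaning and feature-selection steps concrete, here is a minimal Python sketch; the toy table, its column names, and the choice of two components are invented for illustration:

```python
# Minimal sketch of two KDD pre-processing steps: supplying missing
# values / removing duplicates, then PCA-based feature reduction.
import pandas as pd
from sklearn.decomposition import PCA

df = pd.DataFrame({
    "income":  [52000, 48000, None, 61000, 48000],
    "age":     [34, 29, 41, 38, 29],
    "balance": [1200, 300, 950, None, 300],
})

# Cleaning: fill missing values with column means, drop duplicate records
df = df.fillna(df.mean()).drop_duplicates()

# Transformation / feature selection: project onto 2 principal components
pca = PCA(n_components=2)
features = pca.fit_transform(df)
print(features.shape)                  # (number of records, 2)
print(pca.explained_variance_ratio_)   # variance captured per component
```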
The Data Mining Process
Why Should There be a Standard Process?
The data mining process must be reliable and repeatable by people with
little data mining background.
Process Standardization
CRISP-DM:
CRoss Industry Standard Process for Data Mining
Initiative launched in September 1996
• CRISP-DM provides a uniform framework for
– guidelines
– experience documentation
• CRISP-DM is flexible to account for differences
– Different business/agency problems
– Different data
The Data Mining Process
Phases in the DM Process:
Phases
Business Understanding
understanding project objectives and data
mining problem identification
Data Understanding
capture and understand data for quality
issues
Data Preparation
data cleaning, merging data and deriving
attributes
Modeling
select the data mining technique and build
the model
Evaluation
evaluate results and approve the model
Deployment
put the models into practice, monitoring and
maintenance
DM Techniques
Based on two models
Verification Model
-user makes hypothesis
-tests hypothesis to verify its validity
Discovery Model
-automatically discovering important information hidden in
the data
-data is sifted in search of frequently occurring patterns,
trends and generalizations
-no guidance from user
DM Techniques
1. Discovery of Association Rules
2. Clustering
3. Discovery of Classification Rules
4. Frequent Episodes
5. Deviation Detection
6. Neural Networks
7. Genetic Algorithms
8. Rough Set Techniques
9. Support Vector Machines
Association Rules
The purchase of one product when another product is purchased
represents an association rule (AR)
Used mainly in retail stores to
-Assist in marketing
-Shelf management
-Inventory control
Support means how often X and Y occur together, as a percentage of the
total transactions
Confidence means how much a particular item is dependent on the other
Given a set T of groups of items
Example: sets of items purchased
Goal: find all rules on itemsets of the form a --> b such that
support of a and b > user threshold s
conditional probability (confidence) of b given a > user threshold c
Example: milk --> bread
Purchase of product A --> service B
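A brute-force sketch of support and confidence on a toy transaction set (the transactions and the thresholds s and c are invented for illustration):

```python
# Hedged sketch: compute support and confidence for a candidate rule
# a --> b over a tiny transaction set, then test the user thresholds.
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "eggs"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(a, b):
    """Estimated conditional probability of b given a."""
    return support(a | b) / support(a)

a, b = {"milk"}, {"bread"}
s, c = 0.4, 0.6                          # user thresholds
rule_holds = support(a | b) > s and confidence(a, b) > c
print(support(a | b), confidence(a, b), rule_holds)   # 0.5 0.666... True
```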
Clustering
Unsupervised learning, used when old data with class labels is not
available, e.g. when introducing a new product.
Group/cluster existing customers based on their payment-history time
series so that similar customers fall in the same cluster.
Key requirement: Need a good measure of similarity
between instances.
Identify micro-markets and develop policies for each
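A hedged sketch of this idea: cluster a few customers by an invented payment-history vector using k-means, with Euclidean distance standing in for the similarity measure; k = 2 and the data are arbitrary choices:

```python
# Illustrative sketch: cluster customers by a short payment-history
# vector (days of payment delay per month) using k-means.
import numpy as np
from sklearn.cluster import KMeans

payments = np.array([
    [0, 2, 1, 0],      # prompt payers
    [1, 0, 0, 2],
    [30, 45, 60, 40],  # chronically late payers
    [25, 50, 55, 35],
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(payments)
print(km.labels_)      # one cluster id per customer, e.g. [0 0 1 1]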
Clustering- Applications
Customer segmentation e.g. for targeted marketing
Collaborative filtering:
Given a database of user preferences, predict the preferences of a new user
Example: predict what new movies you will like based on
your past preferences
others with similar past preferences
their preferences for the new movies
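A minimal sketch of that prediction, assuming an invented ratings matrix: the new user's rating for an unseen movie is a similarity-weighted average over users with similar past preferences:

```python
# User-based collaborative-filtering sketch (ratings are made up).
import numpy as np

# rows = existing users, cols = movies; the last column is the new movie
ratings = np.array([
    [5, 4, 1, 5.0],
    [4, 5, 2, 4.0],
    [1, 2, 5, 1.0],
])
new_user = np.array([5, 5, 1])   # ratings on the first three movies only

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

sims = np.array([cosine(new_user, r[:3]) for r in ratings])
# similarity-weighted average of the others' ratings for the new movie
pred = sims @ ratings[:, 3] / sims.sum()
print(round(pred, 2))   # weighted toward the similar users' high ratings
```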
Decision trees
Tree where internal nodes are simple decision rules on
one or more attributes and leaf nodes are predicted
class labels.
[Example tree: root test "Is company software?" branches to "Product
development" and "On-site job", with leaf labels Good / Bad / Bad / Good.]
Decision tree
Widely used learning method
Easy to interpret: can be re-represented as if-then-else rules
Does not require any prior knowledge of the data distribution, and works
well on noisy data.
Has been applied to:
classify medical patients by disease,
equipment malfunctions by cause,
loan applicants by likelihood of payment.
· Pros
+ Reasonable training time
+ Fast application
+ Easy to interpret
+ Easy to implement
+ Can handle a large number of features
· Cons
- Cannot handle complicated relationships between features
- Simple decision boundaries
- Problems with lots of missing data
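For concreteness, a small scikit-learn sketch on an invented loan table; export_text prints the fitted tree as if-then rules, matching the interpretability point above:

```python
# Hedged sketch: a decision tree on a toy loan dataset
# (features and labels are invented for the example).
from sklearn.tree import DecisionTreeClassifier, export_text

# features: [income in $1000s, years employed]; label: 1 = likely to repay
X = [[60, 5], [25, 1], [80, 10], [30, 2], [55, 7], [20, 0]]
y = [1, 0, 1, 0, 1, 0]

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["income_k", "years_employed"]))
print(tree.predict([[45, 3]]))     # classify a new applicant
```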
Neural networks
Involves developing mathematical structures with the ability to
learn
Best at identifying patterns or trends in data
Well suited for prediction or forecasting needs
Useful for learning complex data like handwriting, speech and
image recognition
· Pros
+ Can learn more complicated class boundaries
+ Fast application
+ Can handle a large number of features
· Cons
- Slow training time
- Hard to interpret
- Hard to implement: trial and error for choosing the number of nodes
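A tiny illustrative sketch: a small multilayer perceptron learning XOR, a boundary no single linear rule can draw; the layer size, solver and iteration cap are arbitrary choices:

```python
# Sketch: an MLP learning a non-linear class boundary (XOR).
from sklearn.neural_network import MLPClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]                     # XOR: not linearly separable

net = MLPClassifier(hidden_layer_sizes=(8,), solver="lbfgs",
                    random_state=0, max_iter=2000).fit(X, y)
print(net.predict(X))                # ideally [0 1 1 0]
```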
Data Mining Applications
Some examples of “successes":
1. Decision trees constructed from bank-loan histories to produce
algorithms to decide whether to grant a loan.
2. Patterns of traveler behavior mined to manage the sale of
discounted seats on planes, rooms in hotels, etc.
3. “Bread and Butter." Observation that customers who buy bread
are more likely to buy butter than average, allowed supermarkets
to place bread and butter nearby, knowing many customers would
walk between them.
4. Skycat and Sloan Sky Survey: clustering sky objects by their
radiation levels in different bands allowed astronomers to
distinguish between galaxies, nearby stars, and many other kinds
of celestial objects.
5. Comparison of the genotypes of people with and without a condition
allowed the discovery of a set of genes that together account for
many cases of diabetes. This sort of mining has become much
more important as the human genome has been fully decoded.
Issues and Challenges in DM
Limited Information
Noise or missing Data
User interaction
Uncertainty
Size, updates and irrelevant fields
Other Applications
Banking: loan/credit card approval
predict good customers based on old customers
Customer relationship management:
identify those who are likely to leave for a competitor.
Targeted marketing:
identify likely responders to promotions
Fraud detection: telecommunications, financial transactions
from an online stream of events, identify fraudulent events
Manufacturing and production:
automatically adjust knobs when process parameters change
Medicine: disease outcome, effectiveness of treatments
analyze patient disease history: find relationship between
diseases
Molecular/Pharmaceutical: identify new drugs
Scientific data analysis:
identify new galaxies by searching for subclusters
Web site/store design and promotion:
find affinity of visitor to pages and modify layout
Introduction
In USA, traffic crashes cause one death
every 13 minutes and an injury every 15
seconds
If current US crash rate remains
unchanged
* One child out of every 84 born today will die
violently in a motor vehicle crash
* Six out of every 10 children will be injured in a
highway crash over a lifetime
The economic impact to the U.S. is
roughly $150 billion
Deaths and injuries on Florida’s highways
cost society $14 billion annually, ranking
Florida in the top five nationally (FDOT,
2003)
In addition, insecurity on the roads contributes substantially to the economic costs
associated with traffic accidents
Over the last two decades, increasingly large amounts of
transportation data have been stored electronically
This volume is expected to continue to grow considerably in the future
A tremendous amount of this data pertains to transportation safety
Despite this wealth of data, we have been unable to fully capitalize on
its value
This is because the information implicit in the data is not easily
discernible without the use of proper data analysis techniques
Advanced modeling and analysis of such data are needed to
understand relationships and dependencies
Two major challenges face transportation engineers today:
How to extract the most relevant information from the vast amount of
available traffic data
How to use the wealth of data to enhance our understanding of traffic
behavior
Data mining is an emerging field that promotes the progress of data
analysis
Why Data Mining?
When dealing with large and complex data sets, such as
transportation data, data mining techniques are useful for knowledge
discovery and for identifying the relevant variables that contribute
most to a better understanding of safety patterns, problems and causes.
Data mining application in transportation is still in its infancy, but
has high potential for growth
Applications of Data Mining in Transportation:
Improve Traffic Signal Timing Plans
Measure Freeway Performance
Improve Aviation Safety
Improve delivery and quality of traffic safety information
Developed Analytical Methodology
Step 1 - GIS: identify relevant freeway features at each accident location and
integrate them with the crash database
Step 2 - Preliminary Statistical Analysis: better understanding of the data
Step 3 - Data Mining Clustering Techniques: identify clusters of common
accidents, and conditions under which accidents are more likely to cause
death or injury
Step 4 - Data Mining Profiling Techniques: profile accidents in terms of
accident and freeway characteristics
Step 5 - Visualization Techniques: better understanding & presentation of
results
Clustering Analysis:
Clustering = Unsupervised classification
Place objects into clusters suggested by the data, not defined a priori
Studies have shown that clustering methods are an important tool when
analyzing traffic accidents as these methods are able to identify groups of
road users, vehicles and road segments which would be suitable targets
for countermeasures
Demographic clustering is a distribution-based data mining technique
that provides fast and natural clustering of very large databases
The Data Mining Process:
1. Data Capture
2. Data Merging
3. Data Description
4. Statistical Analysis
5. Data Pre-Processing
Data Transformation
Data Cleaning
6. Building Data-Mining Models
Data Capture
The 1999 crash data for MDC Freeways was utilized
Crash data:
Rich source of accident related information
Contains 39 attributes that describe the accident: Roadway section,
date, time, accident type, driver age, lighting condition, traffic control,
and road condition
But
This does not provide information about the physical characteristics of
the roadway at the accident location, which is necessary to develop
appropriate countermeasures.
Data Merging
The Roadway Characteristics Inventory (RCI) database contains various
physical and administrative features related to the roadway network
Speed limit, local name, number of lanes, shoulder type, median type, and
median width were extracted from the RCI database using a GIS software
package (ArcView)
Avenue (ArcView's scripting language) was used to write a special script to
merge the crash data and roadway attribute tables
Spatial reference and analysis were performed to identify the roadway
features at each accident location
[Figure: crash and roadway characteristic data merging; median width and
shoulder type are identified at each accident location.]
Data Description
Merged dataset contains 5,870 records (one for each accident)
The 45 attributes describing each accident can be divided into seven
dimensions:
Accident Information (e.g. number of injuries, number of fatalities,
accident type, point of impact)
Road Information (e.g. road condition, road surface, road side)
Traffic Information (e.g. accident lane number)
Driver Information (e.g. driver age)
Geographical Information (e.g. roadway section, area type)
Environmental Conditions Information (e.g. lighting condition, weather
condition, date and time of the accident)
Roadway Feature Information (e.g. number of lanes, speed limit, local
name, median width)
Statistical Analysis
The database:
Large
Noisy and/or missing data
No clearly defined expectations about the kind of clusters to be
discovered
Preliminary statistical analysis was performed
Basic statistics, such as maximum, minimum, mean, variance, and
frequencies were calculated for numeric fields
Frequencies were calculated for continuous numeric fields, and for
categorical and discrete numeric fields
These statistics helped in providing a better understanding of the data and
in speeding up the problem identification process
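The kind of summary this step produces can be sketched with pandas; the stand-in crash table and its column names are invented:

```python
# Sketch of preliminary statistics on a stand-in crash table
# (all column names and values here are invented).
import pandas as pd

crashes = pd.DataFrame({
    "driver_age": [23, 31, 45, 31, 19],
    "lane": [1, 2, 1, 3, 1],
    "accident_type": ["rear-end", "angle", "rear-end", "sideswipe", "rear-end"],
})

print(crashes["driver_age"].describe())         # min, max, mean, std, ...
print(crashes["accident_type"].value_counts())  # categorical frequencies
print(crashes["lane"].value_counts())           # discrete numeric frequencies
```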
Data Pre-Processing
Data transformation
Datasets:
Most variables are categorical
Discrete numeric
Continuous
Data Cleaning
Irregularities were tracked, listed and corrected or removed from the
dataset
Building Data-Mining Models
This process is iterative
Several input variables and model parameters are explored
The 45 variables were narrowed down to the most important ones,
resulting in the set of variables used to build the data-mining
models
Clustering- Input Variables:
Area type, accident type, main contributing cause, site location,
highway name, driver age, accident lane, traffic control, and time of
the accident
Profiling -Input Variables:
Other discrete and categorical variables and some interesting
continuous variables were input as supplementary variables
Variables used to profile the clusters but not to define them
Building Data-Mining Models
Model 1
The freeway accidents dataset involving one or more injuries
Modal values for each cluster (local name; type of accident; accident time;
site location; accident lane; contributing cause; driver 1 age; type of control):
Cluster 4 (36.15%): Palmetto Expressway; Rear-End; 4-7 PM; Not at Interchange; Lane 1; Careless Driving Act; 25-35; No Control
Cluster 6 (11.60%): Palmetto Expressway; Hit Concrete Barrier Wall; 9 PM-7 AM; Not at Interchange; Median; Careless Driving Act; 25-35; No Control
Cluster 5 (11.11%): Palmetto Expressway; Rear-End; 4-7 PM; Exit Ramp; Ramp; Careless Driving Act; 15-25; No Control
Cluster 9 (8.54%): I-95; Sideswipe; 1-4 PM; Not at Interchange; Lane 3; Improper Lane Change; 35-45; No Control
Cluster 3 (7.95%): I-95; Rear-End; 4-7 PM; Not at Interchange; Lane 2; No Improper Driving Act; 25-35; No Control
Cluster 8 (7.95%): I-95; Angle; 9 PM-7 AM; Not at Interchange; Side of Road; Careless Driving Act; 25-35; No Control
Cluster 1 (5.90%): I-95; Rear-End; 1-4 PM; Bridge; Ramp; Careless Driving Act; 25-35; No Control
Cluster 2 (5.76%): I-95; Angle; 1-4 PM; Not at Interchange; Side of Road; Careless Driving Act; 25-35; No Control
Cluster 7 (5.03%): Palmetto Expressway; Angle; 9 PM-7 AM; At Intersection; Ramp; No Improper Driving Act; 25-35; No Control
Model 2
The freeway accidents dataset that involved one or more fatalities
[Figure: detailed visualization of one of the clusters, involving 26.67% of the
population.]
The analytical methodology developed in this study:
Provides insight into the roadway and driver characteristics that
contribute to severe accidents on the Miami-Dade County
Freeway System
Provides information for transportation planners to improve
planning, operating, and managing the freeway system
Could help interested agencies effectively allocate resources to
improve safety measures in those areas with high accident
frequency
Pattern Recognition
Definition:
“The assignment of a physical object or event to one of several
pre-specified categories” –Duda and Hart
“The science that concerns the description or classification
(recognition) of measurements” –Schalkoff
Pattern Recognition is concerned with answering the question
“What is this?” –Morse
It is the study of how machines can:
• observe the environment
• learn to distinguish patterns of interest from their
background
• make sound and reasonable decisions about the
categories of the patterns.
A pattern is an object, process or event that can be given a name.
A pattern class (or category) is a set of patterns sharing common attributes and usually
originating from the same source.
During recognition (or classification) given objects are assigned to prescribed classes.
A classifier is a machine which performs classification.
APPLICATIONS:
Computer Vision
Speech Recognition
Automated Target Recognition
Optical Character Recognition
Seismic Analysis
Man and Machine Diagnostics
Fingerprint Identification
Image Preprocessing / Segmentation
Industrial Inspection
Financial Forecast
Medical Diagnosis
ECG Signal Analysis
Emerging PR Applications
Speech recognition
  Input: speech waveforms
  Output: spoken words, speaker identity
Non-destructive testing
  Input: ultrasound, eddy current, acoustic emission waveforms
  Output: presence/absence of flaw, type of flaw
Detection and diagnosis of disease
  Input: EKG, EEG waveforms
  Output: types of cardiac conditions, classes of brain conditions
Natural resource identification
  Input: multispectral images
  Output: terrain forms, vegetation cover
Aerial reconnaissance
  Input: visual, infrared, radar images
  Output: tanks, airfields
Character recognition (page readers, zip code, license plate)
  Input: optically scanned image
  Output: alphanumeric characters
Emerging PR Applications (cont’d)
Identification and counting of cells
  Input: slides of blood samples, micro-sections of tissues
  Output: type of cells
Inspection (PC boards, IC masks, textiles)
  Input: scanned image (visible, infrared)
  Output: acceptable/unacceptable
Manufacturing
  Input: 3-D images (structured light, laser, stereo)
  Output: identified objects, pose, assembly
Web search
  Input: key words specified by a user
  Output: text relevant to the user
Fingerprint identification
  Input: image from fingerprint sensors
  Output: owner of the fingerprint, fingerprint classes
Online handwriting retrieval
  Input: query word written by a user
  Output: occurrences of the word in the database
Key Objectives:
Process the sensed data to eliminate noise
Given a sensed pattern, choose the best-fitting model for it and
then assign it to the class associated with that model.
Classification v/s Clustering
• Classification (known categories): supervised classification (recognition)
• Clustering (creation of new categories): unsupervised classification
[Figure: the same point set separated into Category "A" and Category "B" by
classification, versus grouped into new categories by clustering.]
A basic pattern classification system contains (wired together in the sketch
after this list):
A sensor
A preprocessing mechanism
A feature extraction mechanism (manual or automated)
A classification algorithm
A set of examples (training set) already classified or described
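A hedged sketch of these components as a scikit-learn pipeline, using the bundled digits images as the already-classified example set; the scaler, PCA and k-NN stages are illustrative stand-ins for preprocessing, feature extraction and classification:

```python
# Sketch: preprocessing -> feature extraction -> classification,
# trained on a labelled example set (the sklearn digits images).
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

digits = load_digits()                      # the training set of examples
clf = make_pipeline(StandardScaler(),       # preprocessing
                    PCA(n_components=20),   # feature extraction
                    KNeighborsClassifier()) # classification algorithm
clf.fit(digits.data[:-10], digits.target[:-10])
print(clf.predict(digits.data[-10:]))       # classify held-out patterns
print(digits.target[-10:])                  # true labels for comparison
```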
Main PR Approaches:
Template matching
The pattern to be recognized is matched against a stored
template while taking into account all allowable pose
(translation and rotation) and scale changes.
Statistical pattern recognition
Focuses on the statistical properties of the patterns (i.e.,
probability densities).
Structural Pattern Recognition
Describe complicated objects in terms of simple primitives and
structural relationships.
Syntactic pattern recognition
Decisions consist of logical rules or grammars.
Artificial Neural Networks
Inspired by biological neural network models.
Template Matching
[Figure: a stored template matched against an input scene.]
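A minimal template-matching sketch on invented grayscale patches: slide the template over the scene and report the offset with the highest normalized correlation (a real system would also search over rotation and scale, as noted above):

```python
# Template matching by normalized cross-correlation (toy arrays).
import numpy as np

scene = np.array([
    [0, 0, 0, 0, 0],
    [0, 9, 8, 0, 0],
    [0, 8, 9, 0, 0],
    [0, 0, 0, 0, 0],
], dtype=float)
template = np.array([[9, 8], [8, 9]], dtype=float)

def ncc(a, b):
    """Normalized cross-correlation of two equal-shaped patches."""
    a, b = a - a.mean(), b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return (a * b).sum() / denom if denom else 0.0

h, w = template.shape
scores = {(i, j): ncc(scene[i:i + h, j:j + w], template)
          for i in range(scene.shape[0] - h + 1)
          for j in range(scene.shape[1] - w + 1)}
print(max(scores, key=scores.get))   # best-matching offset, here (1, 1)
```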
Statistical Pattern Recognition
• Patterns represented in a feature space
• Statistical model for pattern generation in feature space
• Basically adopted for numeric data.
Structural Pattern Recognition
Describe complicated objects in terms of simple primitives and
structural relationships.
Decision-making when features are non-numeric or structural
[Figure: a scene described as a parse tree of primitives: the scene splits into
Object and Background, each decomposed further into sub-patterns.]
Artificial Neural Networks
Massive parallelism is essential for complex pattern recognition
tasks (e.g., speech and image recognition)
Humans take only a few hundred milliseconds for most cognitive tasks,
which suggests parallel computation
A typical Pattern Recognition System:
[Flow: INPUT -> SENSING -> SEGMENTATION -> FEATURE EXTRACTION ->
CLASSIFICATION -> POST-PROCESSING -> OUTPUT]
Sensing:
The sensor converts images, sounds or other physical inputs into signal
data. The input to the PR system comes from a transducer such as a
camera or a microphone, so the difficulty of the problem depends on the
limitations and characteristics of the transducer: its bandwidth,
resolution, sensitivity, distortion, signal-to-noise ratio, etc.
Segmentation:
Isolates sensed objects from the background or from other data.
• Feature extraction:
Measures object properties that are useful for classification; the aim is to
extract discriminative features that are "good" for classification.
Good features:
Objects from the same class have similar feature values.
Objects from different classes have different feature values.
[Figure: feature spaces contrasting "good" features (classes well separated)
with "bad" features (classes overlapping).]
Classification:
Takes in the feature vector from the feature extractor and assigns the object to a
category.
The variability of feature values for objects in the same category may be due
to the complexity of the data or to noise
How can the classifier best cope with the variability and what is the best
performance possible?
Is it possible to extract all the values of a particular feature of a particular input?
What should the classifier do in case some of the feature data is missing?
Classification consists of determining to which decision region a feature
vector x belongs.
Borders between decision regions are called decision boundaries.
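A toy sketch of decision regions, assuming a nearest-centroid rule on invented 2-D data: each region is the set of feature vectors closest to one class centroid, and the decision boundary lies where two centroids are equidistant:

```python
# Nearest-centroid classifier: regions and boundary on toy 2-D data.
import numpy as np

centroids = {"A": np.array([0.0, 0.0]), "B": np.array([4.0, 4.0])}

def classify(x):
    """Assign x to the region of the nearest class centroid."""
    return min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))

print(classify(np.array([1.0, 0.5])))   # 'A'
print(classify(np.array([3.5, 4.2])))   # 'B'
print(classify(np.array([2.0, 2.0])))   # on the boundary; ties go to 'A'
```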
Post-processing
A post-processor takes care of other considerations like the effects of
context and the costs of errors to decide on the appropriate action.
Poses a lot of challenges, such as:
How do we deal with the classification error rate?
How do we incorporate knowledge about costs and use the
minimum-cost model without affecting our classification decision?
In case of multiple classifiers, how does the “super” classifier pool the
evidence to arrive at the best decision? Is the majority always right?
How would a classifier know when to base the decision on a minority
opinion?
The Design Cycle
Data collection
Feature Choice
Model Choice
Training
Evaluation
Computational Complexity
The pattern recognition design cycle
Data collection
Probably the most time-intensive component of a PR project.
Can account for a large part of the cost of developing a PR system.
How many examples are enough?
Is the data set adequately large to yield accurate results?
Feature choice
Critical design step.
Requires basic prior knowledge
Does prior knowledge always yield relevant information?
Model choice
How do we determine when a hypothetical model differs from the true
model underlying the existing patterns?
Determining which model is best suited for the problem at hand.
Training
Using the data to determine the classifier – training the classifier.
Supervised, unsupervised and reinforcement learning.
Evaluation
How well does the trained model do?
Evaluation is necessary to measure the performance of the system and
to identify needed improvements in its components.
Overfitting vs. generalization:
the need to arrive at the best complexity for a model (see the sketch below).
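A minimal illustration of this trade-off, assuming scikit-learn's bundled breast-cancer data and decision trees as the model family: training accuracy keeps climbing with depth while held-out accuracy stalls or drops, the classic overfitting signal:

```python
# Sketch: training vs. test accuracy as model complexity grows.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

for depth in (1, 3, 10, None):      # increasing model complexity
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(Xtr, ytr)
    print(depth, round(tree.score(Xtr, ytr), 3), round(tree.score(Xte, yte), 3))
```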
Conclusion
The number, magnitude and complexity of the subproblems in PR are
overwhelming.
Though it is still an emerging field, various applications are already
running successfully.
There still remain several fascinating unsolved problems providing
immense opportunities for progress.