lecture1

Uploaded by

maseattonima

Data Mining Methods: Introduction

Prof. Dr. Christina Andersson

Contents

1 Basic Concepts

2 Data Issues in Data Mining

What is data mining?

A process, consisting of certain steps, to discover and analyze unknown patterns in huge amounts of data.
Based on analytical methods.

Data mining is not ...

... a black box!

"If you’ve got terabytes of data, and you’re relying on data mining to find interesting things in there for you, you’ve lost before you’ve even begun"
Herb Edelstein

Definitions of data mining (1)

"Knowledge discovery in databases is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data".
(Fayyad, Piatetsky-Shapiro, Smyth, 1996)

Definitions of data mining (2)

"An information extraction activity whose goal is to discover hidden facts contained in databases. Using a combination of machine learning, statistical analysis, modeling techniques and database technology, data mining finds patterns and subtle relationships in data and infers rules that allow the prediction of future results. Typical applications include market segmentation, customer profiling, fraud detection, evaluation of retail promotions, and credit risk analysis."
(http://twocrows.com/data-mining/dm-glossary/)

Definitions of data mining (3)

"Statistics at scale and speed".
(Darryl Pregibon)

Concepts occurring in most definitions

previously unknown patterns
huge amounts of data
prediction
potentially useful new information - increase of profit?
increase understanding
analytical methods
model-based

Broad and narrow definitions

Broad definition:
Traditional statistical methods included
Narrow definition:
Focusing on automated/heuristic methods
Knowledge discovery in databases:
A step in the KDD process

What is typical for data mining?

Multidisciplinarity ⇒ many synonyms
People cannot distinguish between data mining and OLAP
Objective definition really important
Team work essential
Required knowledge of statistics underestimated

Multidisciplinarity

Confusing terminology
Case = row = observation
Target = response variable = output = dependent variable

Multidisciplinarity

Data preparation
Use of databases and data warehouses
Statistics
Optimization
High-performance computing
Machine learning and artificial intelligence
Visualisation

Some of many application branches

Banks
Insurance companies
Credit card companies
Retail companies
Bioinformatics
Telecommunication
Governments (census, taxes)
Health systems
...

Some of many application areas

Risk assessment (ratings, credit scoring)
Marketing, CRM
Fraud, crime and terror detection
Churn prevention
Customer profiling, market segmentation
...

Example: Banking

Data:
Customer records
Transactional data
External data
Questions:
To which customers should we offer credit?
Estimation of risk of default?
Is the behavior of the customer changing over time?
To which customers should we present a special offer?
...

Example: Insurance companies

Data:
Customer records
Transactional data
External data (e.g. doctors)
Questions:
Is the behavior of the customer changing over time?
To which customers should we present a special offer?
Fraud detection
...

Example: Telecommunication

Data:
Customer records (personal information, products bought, billings)
Transactional data (concerning each phone call)
Other data (network load, breakdowns, ...)
Questions:
Which customer group is more profitable than other groups?
Is the behavior of the customer changing over time?
To which customers should we present a special offer?
Fraud detection
Churn identification and prevention
...

Example: Retail companies

Data:
Customer records (pay-back cards)
Questions:
Which products are bought together?
Which customer group is more profitable than other groups?
Is the behavior of the customer changing over time?
To which customers should we present a special offer?
Fraud detection
...

Knowledge discovery in databases

Define business problem
Build data mining database
Explore data
Prepare data for modeling
Build model
Evaluate model
Deploy model and results
(Two Crows Corporation)

The data mining process

SEMMA methodology (SAS)
Sample
Explore
Modify
Model
Assess

Examples of data mining methods

Multiple linear regression
Logistic regression
Decision trees
Neural networks
Ensemble decision trees
Support vector machines
Discriminant analysis
Cluster analysis
Association analysis

How to know when to use which method(s)?

Basic guidance:
Predictive modeling (supervised)
Classification
(Point) Estimation
Pattern discovery (unsupervised)

Supervised vs. unsupervised learning

Supervised:
Step 1: Develop a prediction model using data with known values of the target (response) variable during the development.
Step 2: Use the model to predict (score) new cases with unknown values of the target variable.
Unsupervised:
No target variable at all.
Example:
Cluster analysis to partition data into unknown classes. Even the number of classes can be unknown.

Supervised vs. unsupervised learning

Supervised:
Multiple linear regression
Logistic regression
Decision trees
Neural networks
Ensemble decision trees
Support vector machines
Discriminant analysis
Unsupervised:
Cluster analysis
Association analysis

Supervised learning: The data

Data to develop the model:
Input variables (predictors, explanatory variables, independent variables)
Target variable (response variable, dependent variable)

Data to use the model:
Input variables (predictors, explanatory variables, independent variables)

Multiple linear regression

$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon$
where

$Y$ is the dependent variable
$x_1, x_2, \dots, x_k$ are predictors (independent variables)
$\varepsilon$ is the random error
$\beta_0, \beta_1, \dots, \beta_k$ are unknown regression coefficients
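
The unknown coefficients are usually estimated by least squares. The following is a minimal sketch, assuming a small noise-free toy data set (so the random error is zero) and solving the normal equations in plain Python. The function names `solve`, `fit_linear_regression` and `predict` are illustrative and do not come from the lecture.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    Assumes the system is non-singular (no zero pivot occurs).
    """
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_linear_regression(rows, y):
    """Least-squares estimates of beta_0 .. beta_k via the normal equations."""
    X = [[1.0] + list(r) for r in rows]  # leading 1 carries the intercept beta_0
    p = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(p)] for i in range(p)]
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(p)]
    return solve(xtx, xty)


def predict(beta, row):
    """Score a new case: beta_0 + beta_1 x_1 + ... + beta_k x_k."""
    return beta[0] + sum(b * x for b, x in zip(beta[1:], row))


# Toy data generated from Y = 1 + 2 x1 + 3 x2 without noise (an assumption).
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 3)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]

beta = fit_linear_regression(rows, y)   # step 1: develop the model
new_score = predict(beta, (4, 5))       # step 2: score a new case
```

With real data the error term is not zero, so the estimates only approximate the underlying coefficients.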

Logistic regression

Regression with a binary response variable
The probabilities of the different values of the response variable are estimated
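
As a rough illustration of how such probabilities are obtained, the sketch below passes a linear combination of the inputs through the logistic function. The coefficients are hypothetical values chosen for the example; in practice they would be estimated from data, typically by maximum likelihood.

```python
import math


def logistic(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(beta, row):
    """Estimated probabilities of both values of a binary response."""
    z = beta[0] + sum(b * x for b, x in zip(beta[1:], row))
    p1 = logistic(z)
    return {0: 1.0 - p1, 1: p1}


beta = [-3.0, 0.5]                    # hypothetical intercept and slope
probs = predict_proba(beta, [6.0])    # linear predictor is -3 + 0.5 * 6 = 0
```

A linear predictor of zero corresponds to a probability of one half for each outcome; larger values shift probability towards the outcome coded 1.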

Decision trees

In a decision tree each node corresponds to a split of an input variable
Example:
"if the weather is hot, then go swimming, otherwise go skiing"
Based on the principle of recursive partitioning of the space of input variables
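
The swimming/skiing rule can be written as a tiny hand-built tree. This is only a sketch of how a finished tree is applied to a case: the variable name `temperature_c` and the threshold of 25 degrees are assumptions made for the example, and no tree is learned from data here.

```python
# Each internal node splits one input variable; each leaf holds a decision.
tree = {
    "variable": "temperature_c",       # hypothetical input variable
    "threshold": 25.0,                 # hypothetical split point for "hot"
    "left": {"leaf": "go skiing"},     # temperature_c <= threshold
    "right": {"leaf": "go swimming"},  # temperature_c > threshold
}


def predict(node, case):
    """Follow the splits from the root until a leaf is reached."""
    while "leaf" not in node:
        branch = "right" if case[node["variable"]] > node["threshold"] else "left"
        node = node[branch]
    return node["leaf"]


hot_day = predict(tree, {"temperature_c": 30.0})
cold_day = predict(tree, {"temperature_c": 5.0})
```

Recursive partitioning would grow such a structure automatically by repeatedly choosing a variable and threshold for each node and then splitting the resulting subsets again.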

Ensemble decision trees

Multiple models may produce better results when working together than in isolation (like multiple experts)
Build multiple decision trees and combine them into one single model
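
One common way to combine trees is a majority vote, sketched below with three one-split trees ("stumps"). Majority voting is an assumption here, since the slide does not say how the trees are combined, and the variables and thresholds are invented for illustration.

```python
from collections import Counter


def make_stump(variable, threshold):
    """A one-split tree: predicts class 1 if the variable exceeds the threshold."""
    def stump(case):
        return 1 if case[variable] > threshold else 0
    return stump


def ensemble_predict(trees, case):
    """Combine the individual predictions by majority vote."""
    votes = Counter(tree(case) for tree in trees)
    return votes.most_common(1)[0][0]


# Three hypothetical stumps; an odd number of trees avoids tied votes.
trees = [
    make_stump("income", 30),
    make_stump("age", 40),
    make_stump("income", 50),
]

case = {"income": 45, "age": 52}
individual = [tree(case) for tree in trees]   # the single trees disagree
combined = ensemble_predict(trees, case)      # the ensemble gives one answer
```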

Neural networks

A data structure simulating the behavior of neurons in a biological brain
Composed of interconnected units in different layers
Messages are passed along the connections from one unit to another
Messages can change based on the weight of the connection and the value in the node
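
The message passing can be sketched as a forward pass through a small network with two inputs, two hidden units and one output unit. All weights and biases are hypothetical numbers, and the sigmoid activation is an assumption; training the weights is not shown.

```python
import math


def sigmoid(z):
    """Squash a unit's weighted input into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs, weights, biases):
    """Each unit sums its weighted incoming messages, adds a bias, and activates."""
    return [
        sigmoid(sum(w * x for w, x in zip(unit_weights, inputs)) + bias)
        for unit_weights, bias in zip(weights, biases)
    ]


def forward(inputs, params):
    """Pass the values layer by layer from the inputs to the output."""
    values = inputs
    for weights, biases in params:
        values = layer(values, weights, biases)
    return values


# Hypothetical network: 2 inputs -> 2 hidden units -> 1 output unit.
params = [
    ([[0.5, -0.4], [0.3, 0.8]], [0.1, -0.2]),  # hidden layer weights and biases
    ([[1.2, -0.7]], [0.05]),                   # output layer weights and bias
]

output = forward([1.0, 2.0], params)
```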

Support vector machines

A modeling method for classification (and estimation) of linear and nonlinear data
Linear modeling is simpler, but it is not always possible to find a line separating the positive and negative classes
If we transform the data to higher dimensions it eventually becomes possible to determine a linear separating hyperplane
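
The effect of moving to a higher dimension can be seen with one-dimensional toy data. The sketch below does not train a support vector machine; it only checks, under an invented data set, that classes which no single threshold on x can separate become separable after adding the feature x squared.

```python
# Toy data: the positive class sits in the middle, the negative class outside.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
labels = [-1, -1, 1, 1, 1, -1, -1]


def separable_by_threshold(values, labels):
    """True if one threshold puts all positives on one side and all negatives on the other."""
    pos = [v for v, l in zip(values, labels) if l == 1]
    neg = [v for v, l in zip(values, labels) if l == -1]
    return max(pos) < min(neg) or max(neg) < min(pos)


# In the original one-dimensional space no separating point exists.
separable_original = separable_by_threshold(xs, labels)

# Map each x to (x, x^2). Along the new x^2 axis the classes split cleanly,
# so a horizontal line in the (x, x^2) plane is a linear separator.
squared = [x * x for x in xs]
separable_transformed = separable_by_threshold(squared, labels)
```

Here any line with x squared between 1 and 4 separates the two classes in the transformed plane.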

Discriminant analysis

Use continuous input variables to predict the classification of a categorical response variable
Supervised learning, i.e. classification of the observations is known during the development of the model (step 1 of supervised learning)

Cluster analysis

Groups observations into previously unknown clusters
Observations in the same cluster should be as similar as possible
Observations belonging to different clusters should be as dissimilar as possible
Unsupervised learning, i.e. clusters not known in advance
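
A minimal sketch of one widely used clustering algorithm, k-means, on one-dimensional toy data. The slide does not name a specific algorithm, so k-means, the fixed starting centers and the data values are all assumptions made for illustration.

```python
def kmeans_1d(values, centers, iterations=10):
    """Alternate between assigning values to the nearest center and re-averaging."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous center.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters


values = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
centers, clusters = kmeans_1d(values, centers=[0.0, 5.0])
```

No target variable is used: the two groups emerge only from the distances between the observations. The number of clusters still has to be chosen by the analyst in this algorithm.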

Association analysis

Identify items that often occur together in a given set of data
Market basket analysis
Unsupervised learning
Example:
"if a customer buys nappies, the chance is 60% that he/she also buys beer"
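
The 60% in such a rule is usually called the confidence of the rule. The sketch below computes support and confidence on ten invented baskets, constructed so that the nappies/beer rule has a confidence of 60%.

```python
def support(transactions, items):
    """Share of all transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Share of transactions with the antecedent that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Invented market baskets: 5 contain nappies, 3 of those also contain beer.
transactions = [
    {"nappies", "beer", "milk"},
    {"nappies", "beer"},
    {"nappies", "beer", "bread"},
    {"nappies", "milk"},
    {"nappies", "bread"},
    {"beer", "crisps"},
    {"milk", "bread"},
    {"bread"},
    {"milk"},
    {"crisps"},
]

rule_support = support(transactions, {"nappies", "beer"})
rule_confidence = confidence(transactions, {"nappies"}, {"beer"})
```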

Typical features of data in data mining

Operational (not research)
Passive (not actively controlled)
Huge amount of data
Dirty data
Dynamic

Data size

"We are drowning in data, but starving for knowledge."
J. Han
Terabytes of data
Billions of records
Thousands of potential input variables
Data usually collected for transaction processing, not for data mining purposes

Data sources

Observational "studies"
Surveys
Experiments

Data sources

Relational databases (often transactional data, normalized into many tables with connecting keys, updated frequently)
Data warehouses (decision support data, cleaned, aggregated, historical data)
Internet (click-stream data, log files, HTML, XML, e-mails)
Files

Recall: Types and scales of data

Variables can be:
Quantitative
Qualitative

Information scale:
Nominal
Ordinal
Interval
Ratio

Recall: Quantitative variables

Discrete
Continuous

Ordering of data

Important or not?

Requirement: One variable monotonically increasing or decreasing
Ex.: Time series, geographic locations, ...

Data organization

Data aggregation
Derived inputs (Ex.: car insurance and date)

Meta-data

Data about data
Describing the data
Ex.:
Structure of the data (types, names, formats, roles)
Information about pre-processing of the data (derivations, transformations, imputations)
Quality of the data
Data source

Data warehousing

Database separated from the company’s operational database for the purpose of analyzing the data
Should be a solid platform of consolidated, historical data
Offers a longer time horizon than operational systems
Constructed by cleaning, standardizing and integrating multiple heterogeneous data sources

Problems: Target variable

Looking for the target variable (Ex.: Fraud)
Oversampling required
No modeling of rejected customers (What would have happened?)
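
When the target event is rare, as with fraud, the rare cases can be given more weight in the development data. The sketch below shows one simple reading of oversampling, namely drawing the rare cases again with replacement until both classes are equally frequent. The field name `fraud`, the 0/1 coding and the balancing to equal shares are assumptions made for the example.

```python
import random


def oversample(cases, target, seed=0):
    """Duplicate randomly chosen rare cases until both classes are equally large.

    Assumes the target is coded 0/1, that class 1 is the rare class,
    and that at least one rare case exists.
    """
    rng = random.Random(seed)  # fixed seed keeps the example reproducible
    rare = [c for c in cases if c[target] == 1]
    common = [c for c in cases if c[target] == 0]
    extra = rng.choices(rare, k=len(common) - len(rare))
    return common + rare + extra


# Invented data: 2 fraud cases among 10 transactions.
cases = [{"amount": 10 * i, "fraud": 0} for i in range(8)]
cases += [{"amount": 900, "fraud": 1}, {"amount": 1200, "fraud": 1}]

balanced = oversample(cases, "fraud")
```

Probabilities predicted by a model built on such a sample refer to the oversampled data and generally need to be adjusted back to the original event rate before use.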

Problems: Dirty data

Sources of errors
Errors during data entry
Misinterpretation
Out-of-date data
Missing values (Coding ok? MCAR (missing completely at random)? Complete case analysis performed?)

Personal information especially prone to errors (data entry errors, incorrect answers, not collected for data mining ...)
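
A complete case analysis keeps only the records without any missing value. The sketch below assumes that missing values are coded as `None` and uses invented records; it also reports how much data is lost, which is worth checking before relying on this approach.

```python
def complete_cases(rows):
    """Keep only the rows in which no value is missing (coded as None)."""
    return [r for r in rows if all(v is not None for v in r.values())]


# Invented customer records with two incomplete rows.
rows = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 48000},
    {"age": 51, "income": None},
    {"age": 29, "income": 39000},
]

kept = complete_cases(rows)
share_dropped = 1 - len(kept) / len(rows)
```

Dropping incomplete rows only leaves the remaining data representative if the values are missing completely at random, which is the question the MCAR check addresses.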

Problems: Dirty data

Different kinds of dirty data
Incomplete data
Missing attributes, missing attribute values, aggregated data, ...

Inconsistent data
Coding, impossible values, out-of-range values, ...
Noisy data
Data with errors, outliers, random fluctuations

Garbage in, garbage out!

Pre-processing the data extremely important!

The curse of dimensionality

Dimension:
Number of input variables (degrees of freedom)
Strong increase of data needed as dimension increases
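
A back-of-the-envelope calculation makes the growth concrete. The sketch assumes each input variable is cut into 10 intervals and that about 5 cases per resulting cell are wanted; both numbers are arbitrary choices for illustration.

```python
def cells(dimension, bins_per_variable=10):
    """Number of cells when every input variable is cut into equal bins."""
    return bins_per_variable ** dimension


def cases_needed(dimension, cases_per_cell=5, bins_per_variable=10):
    """Cases required to keep the same density of data in every cell."""
    return cases_per_cell * cells(dimension, bins_per_variable)


# The requirement grows exponentially with the number of input variables.
table = {d: cases_needed(d) for d in (1, 2, 3, 6)}
```

Under these assumptions one variable needs 50 cases, three variables need 5,000, and six variables already need 5,000,000.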

Input variable reduction

Remove
redundancy
irrelevancy
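
Redundancy can be detected, for example, through very high correlation between input variables. The sketch below keeps a variable only if it is not almost perfectly correlated with one that has already been kept. The 0.95 threshold, the variable names and the data are assumptions, and irrelevancy (no relation to the target) is not covered by this check.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equally long lists (neither may be constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def drop_redundant(columns, threshold=0.95):
    """Keep a variable only if it is not highly correlated with a kept one."""
    kept = {}
    for name, values in columns.items():
        if all(abs(pearson(values, other)) < threshold for other in kept.values()):
            kept[name] = values
    return list(kept)


# Invented inputs: height is recorded twice, in centimetres and in inches.
columns = {
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [41.0, 23.0, 35.0, 52.0, 30.0],
}

selected = drop_redundant(columns)
```

Which of two redundant variables survives depends on the order in which they are examined.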
