Unit 3 DW
UNIT-3
Data Mining: Overview, Motivation, Definition & Functionalities, Data
Pre-processing, Forms of Data Pre-processing, Data Cleaning: Missing Values, Noisy
Data (Binning, Clustering, Regression, Computer and Human Inspection),
Inconsistent Data, Data Integration and Transformation. Data Reduction: Data
Cube Aggregation, Dimensionality Reduction, Data Compression, Numerosity
Reduction, Discretization and Concept Hierarchy Generation, Decision Tree
Data Mining
Data mining is the process of extracting information from huge sets of data to
identify patterns, trends, and useful relationships that allow a business to take
data-driven decisions. It involves sorting through large data sets to identify
patterns and relationships that can help solve business problems through data
analysis. Data mining techniques and tools enable enterprises to predict future
trends and make more informed business decisions. Formally, data mining is the
process of extracting and discovering patterns in large data sets using methods at
the intersection of machine learning, statistics, and database systems.
Types of Data Mining
Data mining can be performed on the following types of data:
1. Relational Database: A relational database is a collection of multiple data sets
formally organized into tables, records, and columns, from which data can be
accessed in various ways without having to reorganize the tables.
2. Data Repositories: A data repository generally refers to a destination designated
for data storage.
3. Object-Relational Database: A combination of an object-oriented database
model and a relational database model is called an object-relational model. It
supports classes, objects, inheritance, etc.
4. Transactional Database: A transactional database refers to a database
management system (DBMS) that can undo a database transaction if it is not
performed appropriately.
Issues in Data Mining
1. Mining Methodology and User Interaction Issues:
Data mining query languages and ad hoc data mining − The data mining query
language should be integrated with a data warehouse query language and
optimized for efficient and flexible data mining.
Presentation and visualization of data mining results − Once the patterns
are discovered, they need to be expressed in high-level languages and visual
representations. These representations should be easily understandable.
Handling noisy or incomplete data − The data cleaning methods are
required to handle the noise and incomplete objects while mining the data
regularities.
Pattern evaluation − The patterns discovered may not be interesting if they
represent common knowledge or lack novelty, so interestingness measures are
needed to evaluate them.
2. Performance Issues:
There can be performance-related issues such as follows −
Efficiency and scalability of data mining algorithms − In order to
effectively extract information from huge amounts of data in databases, data
mining algorithms must be efficient and scalable.
Parallel, distributed, and incremental mining algorithms − The factors
such as huge size of databases, wide distribution of data, and complexity of data
mining methods motivate the development of parallel and distributed data
mining algorithms.
Data Mining Functionalities
Data mining functionalities are used to specify the kind of patterns to be found in
data mining tasks. Data mining tasks can be classified into two categories:
descriptive and predictive.
Descriptive mining tasks characterize the general properties of the data in the
database.
Predictive mining tasks perform inference on the current data in order to make
predictions.
Concept/Class Description: Characterization and Discrimination
Data can be associated with classes or concepts. For example, in an electronics
store, classes of items for sale include computers and printers, and concepts of
customers include big spenders and budget spenders.
Data characterization
Data characterization is a summarization of the general characteristics or features
of a target class of data.
Data discrimination
Data discrimination is a comparison of the general features of target class data
objects with the general features of objects from one or a set of contrasting classes.
Mining Frequent Patterns, Associations, and Correlations
Frequent patterns are patterns that occur frequently in data. There are many kinds
of frequent patterns, including itemsets, subsequences, and substructures.
Association analysis
Suppose, as a marketing manager, you would like to determine which items are
frequently purchased together within the same transactions.
buys(X, “computer”) ⇒ buys(X, “software”) [support = 1%, confidence = 50%]
Where X is a variable representing a customer. Confidence=50% means that if a
customer buys a computer, there is a 50% chance that she will buy software as
well.
Support=1% means that 1% of all of the transactions under analysis showed that
computer and software were purchased together.
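The support and confidence of such a rule can be computed directly by counting transactions. Below is a minimal Python sketch over a small hypothetical transaction list (the items and transactions are illustrative, not taken from the example above):

# Minimal sketch: support and confidence for the rule
# buys(X, "computer") => buys(X, "software") over hypothetical transactions.
transactions = [
    {"computer", "software", "printer"},
    {"computer", "mouse"},
    {"software"},
    {"computer", "software"},
    {"printer"},
]

n = len(transactions)
both = sum(1 for t in transactions if {"computer", "software"} <= t)
computer = sum(1 for t in transactions if "computer" in t)

support = both / n            # fraction of all transactions containing both items
confidence = both / computer  # fraction of computer buyers who also buy software

print(f"support = {support:.2f}, confidence = {confidence:.2f}")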
Classification of Data Mining Systems:
There is a large variety of data mining systems available. Data mining systems
may integrate techniques from the following −
Spatial Data Analysis
Information Retrieval
Pattern Recognition
Image Analysis
Signal Processing
Computer Graphics
Web Technology
Business
Bioinformatics
DATA PREPROCESSING
Data preprocessing is the process of transforming raw data into an understandable
format. It is also an important step in data mining as we cannot work with raw
data. The quality of the data should be checked before applying machine learning
or data mining algorithms.
Data cleaning:
Data cleaning is the process of removing incorrect, incomplete, and inaccurate
data from the dataset, and it also replaces missing values. Some techniques used
in data cleaning are described below.
Smoothing by bin boundaries: The minimum and maximum values of each bin are
taken as the bin boundaries, and each value in the bin is replaced by the closest
boundary value.
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Partition using equal frequency approach:
- Bin 1 : 4, 8, 9, 15
- Bin 2 : 21, 21, 24, 25
- Bin 3 : 26, 28, 29, 34
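As a rough illustration, the following Python sketch reproduces the equal-frequency bins above and then applies smoothing by bin boundaries as just described (the bin depth of 4 is taken from the example):

# Sketch: equal-frequency (equal-depth) binning of the price data above,
# followed by smoothing by bin boundaries.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted
depth = 4  # number of values per bin

bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]
print(bins)  # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]

# Smoothing by bin boundaries: replace each value with the closer of the
# bin's minimum or maximum value.
smoothed = [
    [b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b]
    for b in bins
]
print(smoothed)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]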
Data integration: Data Integration is a data preprocessing technique that
combines data from multiple heterogeneous data sources into a coherent data
store and provides a unified view of the data. These sources may include multiple
data cubes, databases, or flat files.
The data integration approaches are formally defined as a triple <G, S, M> where,
G stands for the global schema,
S stands for the heterogeneous source schemas,
M stands for the mapping between queries of the source and global schemas.
There are two major approaches for data integration – the “tight coupling
approach” and the “loose coupling approach”.
Tight Coupling:
Here, a data warehouse is treated as an information retrieval component.
In this coupling, data is combined from different sources into a single physical
location through the process of ETL – Extraction, Transformation, and
Loading.
Loose Coupling:
Here, an interface is provided that takes the query from the user, transforms it
in a way the source database can understand, and then sends the query directly
to the source databases to obtain the result.
And the data only remains in the actual source databases.
Issues in Data Integration:
There are three issues to consider during data integration: Schema Integration,
Redundancy Detection, and resolution of data value conflicts. These are
explained in brief below.
1. Schema Integration:
Integrate metadata from different sources.
Matching equivalent real-world entities from multiple data sources is referred
to as the entity identification problem.
2. Redundancy:
An attribute may be redundant if it can be derived or obtained from another
attribute or set of attributes.
Inconsistencies in attributes can also cause redundancies in the resulting data
set.
Some redundancies can be detected by correlation analysis (a short sketch is given after this list).
3. Detection and resolution of data value conflicts:
This is the third critical issue in data integration.
Attribute values from different sources may differ for the same real-world
entity.
An attribute in one system may be recorded at a lower level of abstraction than
the “same” attribute in another.
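The following is a small hypothetical sketch (using NumPy) of how correlation analysis can flag a redundant numeric attribute; the attribute names and values are made up for illustration:

import numpy as np

# Sketch: detecting a redundant numeric attribute with correlation analysis.
# A correlation coefficient close to +1 or -1 suggests that one attribute can
# largely be derived from the other.
annual_salary = np.array([30000, 45000, 52000, 61000, 75000, 90000])
monthly_salary = annual_salary / 12.0          # clearly derivable attribute
years_experience = np.array([1, 3, 4, 7, 9, 14])

r_redundant = np.corrcoef(annual_salary, monthly_salary)[0, 1]
r_related = np.corrcoef(annual_salary, years_experience)[0, 1]

print(f"annual vs monthly salary:    r = {r_redundant:.2f}")  # ~1.00, redundant
print(f"annual salary vs experience: r = {r_related:.2f}")    # high but informative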
Data reduction: This process helps in the reduction of the volume of the data,
which makes the analysis easier yet produces the same or almost the same result.
The reduction also helps to reduce storage space. Some of the techniques used in
data reduction are:
Data Cube Aggregation:
Aggregation operations are applied to the data to construct a data cube; for
example, quarterly sales can be aggregated into yearly totals, giving a smaller
data set that still answers the analysis question (see the sketch below).
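As a rough sketch of such aggregation, the following hypothetical pandas example rolls quarterly sales up to yearly totals, collapsing the quarter dimension (the column names and values are illustrative):

import pandas as pd

# Sketch: data cube aggregation, rolling quarterly sales up to yearly totals.
sales = pd.DataFrame({
    "year":    [2022, 2022, 2022, 2022, 2023, 2023, 2023, 2023],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "amount":  [200, 350, 300, 450, 250, 400, 380, 500],
})

# Aggregating away the 'quarter' dimension reduces 8 rows to 2.
yearly = sales.groupby("year", as_index=False)["amount"].sum()
print(yearly)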
Attribute Subset Selection:
Only the highly relevant attributes should be used; the rest can be discarded.
For performing attribute selection, one can use the level of significance and the
p-value of each attribute: an attribute whose p-value is greater than the
significance level can be discarded.
Dimensionality reduction: This reduces the size of the data by using encoding
mechanisms. It can be lossy or lossless: if the original data can be retrieved
after reconstruction from the compressed data, the reduction is called lossless;
otherwise it is called lossy. Two effective methods of dimensionality reduction
are wavelet transforms and PCA (Principal Component Analysis).
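A minimal sketch of PCA-based dimensionality reduction, assuming scikit-learn is available and using synthetic data purely for illustration:

import numpy as np
from sklearn.decomposition import PCA

# Sketch: project 5-dimensional data onto its 2 principal components.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # synthetic data, 100 rows x 5 attributes

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)       # shape (100, 2)

print(X_reduced.shape)
print(pca.explained_variance_ratio_)   # variance retained by each component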
Numerosity Reduction: In this method, the data is replaced by a smaller
representation of the data, so that the volume is reduced while the essential
information in the data is preserved.
Data compression: Encoding the data in a reduced, compressed form is called
data compression. Compression can be lossless or lossy. When there is no loss
of information during compression it is called lossless compression, whereas
lossy compression reduces the data by discarding only unnecessary information.
Data Transformation: The change made in the format or the structure of the data
is called data transformation. This step can be simple or complex based on the
requirements. Some methods used in data transformation are:
Smoothing: With the help of algorithms we can remove noise from the dataset,
which helps in identifying the important features of the dataset; by smoothing,
even small changes that help in prediction can be detected.
Aggregation: In this method, the data is stored and presented in the form of a
summary. Data from multiple sources is integrated and summarized for data
analysis. This is an important step, since the accuracy of the results depends on
the quantity and quality of the data: when the quality and the quantity of the
data are good, the results are more relevant.
Discretization: The continuous data here is split into intervals. Discretization
reduces the data size. For example, rather than specifying the class time, we can set
an interval like (3 pm-5 pm, 6 pm-8 pm).
Normalization: It is the method of scaling the data so that it can be represented in
a smaller range.
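Two common ways of doing this are min-max normalization and z-score normalization; the following short sketch (with made-up values) illustrates both:

import numpy as np

# Sketch: two common normalization methods applied to a small sample.
values = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max normalization: rescales values into the range [0, 1].
min_max = (values - values.min()) / (values.max() - values.min())

# Z-score normalization: centres values at 0 with unit standard deviation.
z_score = (values - values.mean()) / values.std()

print(min_max)
print(z_score)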
Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to
help the mining process.
Data Discretization
Dividing the range of a continuous attribute into intervals.
Data discretization converts a large number of data values into a smaller
number of intervals, so that data evaluation and data management become much easier.
Interval labels can then be used to replace actual data values.
Reduce the number of values for a given continuous attribute.
Some classification algorithms only accept categorical attributes.
This leads to a concise, easy-to-use, knowledge-level representation of
mining results.
Discretization techniques can be categorized based on the direction in which
they proceed, as follows:
1. Top-down Discretization -
The process starts by finding one or a few points (called split points or
cut points) to split the entire attribute range, and then repeats this
recursively on the resulting intervals.
2. Bottom-up Discretization -
Starts by considering all of the continuous values as potential split-points.
Removes some by merging neighborhood values to form intervals, and
then recursively applies this process to the resulting intervals.
Concept Hierarchies
Discretization can be performed rapidly on an attribute to provide a
hierarchical partitioning of the attribute values, known as a Concept
Hierarchy.
Concept hierarchies can be used to reduce the data by collecting and
replacing low-level concepts (such as numeric values for the attribute age)
with higher-level concepts (such as youth, middle-aged, or senior).
In the multidimensional model, data are organized into multiple dimensions,
and each dimension contains multiple levels of abstraction defined by
concept hierarchies.
This organization provides users with the flexibility to view data from
different perspectives.
Data mining on a reduced data set means fewer input and output operations
and is more efficient than mining on a larger data set.
Because of these benefits, discretization techniques and concept hierarchies
are typically applied before data mining, rather than during mining.
1] Binning
Binning is a top-down splitting technique based on a specified number of
bins; the values can be replaced by the bin means or medians, and the
process can be applied recursively to generate concept hierarchies. It does
not use class information, so it is unsupervised.
2] Histogram Analysis
It is an unsupervised discretization technique because histogram analysis
does not use class information.
Histograms partition the values for an attribute into disjoint ranges called
buckets.
It is also further classified into
o Equal-width histogram
o Equal frequency histogram
The histogram analysis algorithm can be applied recursively to each
partition to automatically generate a multilevel concept hierarchy, with the
procedure terminating once a pre-specified number of concept levels has
been reached.
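A small sketch of equal-width discretization (the unsupervised, histogram-style partitioning described above), reusing the price values from the earlier binning example:

import numpy as np

# Sketch: equal-width discretization. The value range is split into buckets of
# equal width and each value is mapped to its bucket index.
prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
n_buckets = 3

edges = np.linspace(prices.min(), prices.max(), n_buckets + 1)  # [4., 14., 24., 34.]
labels = np.digitize(prices, edges[1:-1], right=False)          # bucket index per value

for bucket in range(n_buckets):
    print(f"bucket {bucket}: {prices[labels == bucket]}")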
3] Cluster Analysis
Cluster analysis is a popular data discretization method.
A clustering algorithm can be applied to discretize a numerical attribute of A
by partitioning the values of A into clusters or groups.
Clustering considers the distribution of A, as well as the closeness of data
points, and therefore can produce high-quality discretization results.
Each initial cluster or partition may be further decomposed into several
subclusters, forming a lower level of the hierarchy.
4] Entropy-Based Discretization
Entropy-based discretization is a supervised, top-down splitting technique.
It explores class distribution information in its calculation and determination
of split points.
Let D consist of data instances defined by a set of attributes and a class-label
attribute.
The class-label attribute provides the class information per instance.
The interval boundaries (split points) it defines can help to improve
classification accuracy.
The entropy and information gain measures are used for decision tree
induction.
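The following sketch illustrates the idea on a tiny hypothetical labelled sample: each candidate split point is scored by the information gain it yields, and the best one is chosen.

import math
from collections import Counter

# Sketch: entropy-based discretization. Pick the split point on a numeric
# attribute that gives the highest information gain. Data is hypothetical.
data = [(23, "no"), (25, "no"), (30, "yes"), (35, "yes"), (40, "yes"), (45, "no")]

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

labels = [y for _, y in data]
base = entropy(labels)

best_gain, best_split = -1.0, None
values = sorted({x for x, _ in data})
for lo, hi in zip(values, values[1:]):
    split = (lo + hi) / 2                          # candidate midpoint
    left = [y for x, y in data if x <= split]
    right = [y for x, y in data if x > split]
    expected = (len(left) * entropy(left) + len(right) * entropy(right)) / len(data)
    gain = base - expected
    if gain > best_gain:
        best_gain, best_split = gain, split

print(f"best split point = {best_split}, information gain = {best_gain:.3f}")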
5] Interval Merge by χ2 Analysis
It is a bottom-up method.
Find the best neighboring intervals and merge them to form larger intervals
recursively.
The method is supervised in that it uses class information.
ChiMerge treats intervals as discrete categories.
The basic notion is that for accurate discretization, the relative class
frequencies should be fairly consistent within an interval.
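A rough sketch of the chi-square statistic that ChiMerge uses to compare the class distributions of two adjacent intervals (the class counts below are hypothetical); a small value indicates similar distributions, so the intervals are good candidates for merging:

# Sketch: chi-square statistic for two adjacent intervals in ChiMerge.
def chi_square(interval_a, interval_b):
    """interval_a / interval_b: class-frequency counts, e.g. [count_c1, count_c2]."""
    rows = [interval_a, interval_b]
    row_totals = [sum(r) for r in rows]
    col_totals = [interval_a[j] + interval_b[j] for j in range(len(interval_a))]
    total = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(rows):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    return chi2

# Similar class distributions -> small chi-square -> good candidates for merging.
print(chi_square([10, 2], [9, 3]))
# Very different class distributions -> large chi-square -> keep separate.
print(chi_square([10, 2], [1, 11]))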
Decision Tree
Decision Tree is a supervised learning method used in data mining for
classification and regression tasks. It is a tree-like structure that helps us in
decision making. The decision tree creates classification or regression models as
a tree structure. It separates a data set into smaller and smaller subsets while the
decision tree is incrementally developed. The final tree consists of decision nodes
and leaf nodes. A decision node has at least two branches. The leaf nodes show a
classification or decision, and no further splits are made on leaf nodes. The
uppermost decision node in a tree, which corresponds to the best predictor, is
called the root node. Decision trees can deal with both categorical and numerical data.
A decision tree is a structure that includes a root node, branches, and leaf nodes.
Each internal node denotes a test on an attribute, each branch denotes the outcome
of a test, and each leaf node holds a class label. The topmost node in the tree is the
root node.
The following decision tree is for the concept buy_computer that indicates whether
a customer at a company is likely to buy a computer or not. Each internal node
represents a test on an attribute. Each leaf node represents a class.
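A minimal sketch of building such a tree with scikit-learn; the buy_computer-style data below is hypothetical and already numerically encoded:

from sklearn.tree import DecisionTreeClassifier, export_text

# Sketch: a small decision tree for a buy_computer-style task.
# Encoded features: age (0 = youth, 1 = middle-aged, 2 = senior),
# student (0/1), income (0 = low, 1 = medium, 2 = high).
X = [
    [0, 0, 2], [0, 0, 2], [1, 0, 2], [2, 0, 1], [2, 1, 0],
    [2, 1, 0], [1, 1, 0], [0, 0, 1], [0, 1, 0], [2, 1, 1],
]
y = ["no", "no", "yes", "yes", "yes", "no", "yes", "no", "yes", "yes"]

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["age", "student", "income"]))

# Predict for a new customer: a student in the youth age group with medium income.
print(tree.predict([[0, 1, 1]]))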
Advantages of using decision trees:
Missing values in the data do not influence the process of building a decision
tree to any considerable extent.
Compared to other algorithms, decision trees require less effort for data
preparation during pre-processing.
Knowledge Discovery in Databases (KDD)
KDD refers to the overall process of discovering useful knowledge from large
volumes of data stored in databases. KDD is an iterative process where evaluation
measures can be enhanced, mining can be refined, and new data can be integrated
and transformed in order to get different and more appropriate results.
Preprocessing of databases consists of Data Cleaning and Data Integration.
4. Data Transformation: Data transformation is a two-step process:
Data Mapping: Assigning elements from source base to destination to
capture transformations.
Code generation: Creation of the actual transformation program.
5. Data Mining: Data mining is defined as the application of intelligent techniques
to extract potentially useful patterns.
Transforms task relevant data into patterns.
Decides purpose of model using classification or characterization.
6. Pattern Evaluation: Pattern Evaluation is defined as identifying the truly
interesting patterns representing knowledge, based on given interestingness measures.
Find interestingness score of each pattern.
Uses summarization and Visualization to make data understandable by
user.
7. Knowledge representation: Knowledge representation is defined as technique
which utilizes visualization tools to represent data mining results.
Generate reports.
Generate tables.
Generate discriminant rules, classification rules, characterization rules, etc.
Data Mining Architecture
a. Data Sources: Databases, data warehouses, and the World Wide Web (WWW)
are the actual sources of data. Sometimes, data may reside even in plain text files
or spreadsheets. The World Wide Web, or the Internet, is another big source of data.
b. Database or Data Warehouse Server: The database server contains the actual
data that is ready to be processed. The server handles retrieving the relevant data
based on the data mining request of the user.
c. Data Mining Engine: The data mining engine is the core component of a data
mining system. It consists of a number of modules used to perform data mining
tasks, including association, classification, characterization, clustering,
prediction, etc.
d. Pattern Evaluation Module: It interacts with the data mining engine to focus
the search towards interesting patterns.
e. Graphical User Interface: The graphical user interface module communicates
between the user and the data mining system.
f. Knowledge Base: The knowledge base is helpful in the whole data mining
process. It might be useful for guiding the search or evaluating the
interestingness of the result patterns.
Apriori Algorithm
The Apriori algorithm was proposed by R. Agrawal and R. Srikant in 1994 for
finding frequent itemsets in a dataset for Boolean association rules. The algorithm
is named Apriori because it uses prior knowledge of frequent itemset properties.
It applies an iterative, level-wise search in which frequent k-itemsets are used to
find (k+1)-itemsets.
Step 1: K = 1
(I) Create a table containing the support count of each item present in the dataset,
called C1 (the candidate set).
(II) Compare each candidate item’s support count with the minimum support count
(here min_support = 2); if the support count of a candidate item is less than
min_support, remove that item. This gives the frequent itemset L1, i.e. C1 after
removing all items with support_count < 2.
Step 2: For K = 2, generate candidate 2-itemsets (pairs of items) from L1 and
prune them by min_support in the same way.
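Since the transaction table from the notes is not reproduced here, the following sketch runs the first two Apriori passes (C1, L1, C2, L2) on a small hypothetical transaction set with min_support = 2:

from itertools import combinations
from collections import Counter

# Sketch of the first two Apriori passes on a hypothetical dataset.
min_support = 2
transactions = [
    {"I1", "I2", "I5"},
    {"I2", "I4"},
    {"I2", "I3"},
    {"I1", "I2", "I4"},
    {"I1", "I3"},
]

# Step 1 (k = 1): candidate 1-itemsets C1 and frequent 1-itemsets L1.
c1 = Counter(item for t in transactions for item in t)
l1 = {frozenset([item]) for item, count in c1.items() if count >= min_support}

# Step 2 (k = 2): join L1 with itself to form C2, then prune by support.
c2 = {a | b for a, b in combinations(l1, 2)}
support = {cand: sum(1 for t in transactions if cand <= t) for cand in c2}
l2 = {cand for cand, count in support.items() if count >= min_support}

print("L1:", sorted(sorted(s) for s in l1))
print("L2:", sorted(sorted(s) for s in l2))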