
KCA012: Data Warehousing & Data Mining

UNIT-3
Data Mining: Overview, Motivation, Definition & Functionalities, Data
Processing, Form of Data Pre-processing, Data Cleaning: Missing Values, Noisy
Data, (Binning, Clustering, Regression, Computer and Human inspection),
Inconsistent Data, Data Integration and Transformation. Data Reduction:-Data
Cube Aggregation, Dimensionality reduction, Data Compression, Numerosity
Reduction, Discretization and Concept hierarchy generation, Decision Tree

Data Mining
Data mining is the process of extracting information from huge data sets to identify
patterns, trends, and relationships that allow a business to make data-driven
decisions. It involves methods at the intersection of machine learning, statistics,
and database systems, and its techniques and tools enable enterprises to predict
future trends and make more informed business decisions.

Key features of Data Mining


 Automatic discovery of patterns
 Prediction of likely outcomes
 Creation of actionable information
 Focus on large data sets and databases

Types of Data Mining
Data mining can be performed on the following types of data:
1. Relational Database: A relational database is a collection of multiple data sets
formally organized into tables, records, and columns, from which data can be
accessed in various ways without having to reorganize the database tables.
2. Data Repositories: A data repository generally refers to a destination designated
for data storage.
3. Object-Relational Database: A combination of an object-oriented database
model and relational database model is called an object-relational model. It
supports Classes, Objects, Inheritance, etc.
4. Transactional Database: A transactional database refers to a database
management system (DBMS) that has the ability to undo (roll back) a database
transaction if it is not performed appropriately.

Advantages of Data Mining


 It helps gather reliable information
 Helps businesses make operational adjustments
 Helps to make informed decisions
 It helps detect risks and fraud
 Helps to analyse very large quantities of data quickly
 Helps to understand behaviours, trends and discover hidden patterns

Disadvantages of Data Mining


 Data Mining tools are complex and require training to use
 Rising privacy concerns
 Data mining requires large databases
 Expensive
Major Issues in Data Mining:
1. Mining Methodology and User Interaction Issues:
These include the following kinds of issues −
 Mining different kinds of knowledge in databases − Different users may
be interested in different kinds of knowledge.
 Interactive mining of knowledge at multiple levels of abstraction − It
allows users to focus the search for patterns from different angles.
 Incorporation of background knowledge − To guide discovery process and
to express the discovered patterns, the background knowledge can be used.
 Data mining query languages and ad hoc data mining − A data mining query
language that allows the user to describe ad hoc mining tasks should be
integrated with a data warehouse query language and optimized for efficient
and flexible data mining.
 Presentation and visualization of data mining results − Once the patterns
are discovered it needs to be expressed in high level languages, and visual
representations. These representations should be easily understandable.
 Handling noisy or incomplete data − The data cleaning methods are
required to handle the noise and incomplete objects while mining the data
regularities.
 Pattern evaluation − The patterns discovered should be interesting; patterns
that merely represent common knowledge or lack novelty are not interesting.

2. Performance Issues:
There can be performance-related issues such as follows −
 Efficiency and scalability of data mining algorithms − In order to
effectively extract information from the huge amounts of data in databases, data
mining algorithms must be efficient and scalable.
 Parallel, distributed, and incremental mining algorithms − The factors
such as huge size of databases, wide distribution of data, and complexity of data
mining methods motivate the development of parallel and distributed data
mining algorithms.

3. Diverse Data Types Issues:


 Handling of relational and complex types of data
 Mining information from heterogeneous databases and global information
systems

DATA MINING FUNCTIONALITIES

Data mining functionalities are used to specify the kind of patterns to be found in
data mining tasks. Data mining tasks can be classified into two categories:
descriptive and predictive.
Descriptive mining tasks characterize the general properties of the data in the
database.
Predictive mining tasks perform inference on the current data in order to make
predictions.
Concept/Class Description: Characterization and Discrimination
Data can be associated with classes or concepts. For example, in the Electronics
store, classes of items for sale include computers and printers, and concepts of
customers include big Spenders and budget Spenders.
Data characterization
Data characterization is a summarization of the general characteristics or features
of a target class of data.
Data discrimination
Data discrimination is a comparison of the general features of target class data
objects with the general features of objects from one or a set of contrasting classes.
Mining Frequent Patterns, Associations, and Correlations
Frequent patterns are patterns that occur frequently in data. There are many kinds
of frequent patterns, including itemsets, subsequences, and substructures.
Association analysis
Suppose, as a marketing manager, you would like to determine which items are
frequently purchased together within the same transactions.
buys(X, “computer”) ⇒ buys(X, “software”) [support = 1%, confidence = 50%]
Where X is a variable representing a customer. Confidence=50% means that if a
customer buys a computer, there is a 50% chance that she will buy software as
well.
Support=1% means that 1% of all of the transactions under analysis showed that
computer and software were purchased together.
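To make the support and confidence figures concrete, here is a minimal Python sketch (the transaction list is a made-up toy example, not data from the text) that computes both measures for the rule buys(X, “computer”) ⇒ buys(X, “software”):

# Toy transaction list (hypothetical) for computing support and confidence.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"printer", "software"},
    {"computer", "software", "printer"},
]

n = len(transactions)
both = sum(1 for t in transactions if {"computer", "software"} <= t)
with_computer = sum(1 for t in transactions if "computer" in t)

support = both / n                  # fraction of all transactions containing both items
confidence = both / with_computer   # of the computer buyers, fraction who also buy software
print(f"support = {support:.2f}, confidence = {confidence:.2f}")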
Classification:
There is a large variety of data mining systems available. Data mining systems
may integrate techniques from the following −
 Spatial Data Analysis
 Information Retrieval
 Pattern Recognition
 Image Analysis
 Signal Processing
 Computer Graphics
 Web Technology
 Business
 Bioinformatics

DATA PREPROCESSING
Data preprocessing is the process of transforming raw data into an understandable
format. It is also an important step in data mining as we cannot work with raw
data. The quality of the data should be checked before applying machine learning
or data mining algorithms.

Why is Data preprocessing important?


Preprocessing of data mainly serves to ensure data quality. Quality can be
checked against the following criteria:
 Accuracy: Whether the data entered is correct.
 Completeness: Whether all required data is recorded and available.
 Consistency: Whether copies of the same data kept in different places match.
 Timeliness: Whether the data is kept correctly up to date.
 Believability: Whether the data can be trusted.
 Interpretability: How easily the data can be understood.

Major Tasks in Data Preprocessing:


1. Data cleaning
2. Data integration
3. Data reduction
4. Data transformation

Data cleaning:
Data cleaning is the process of removing incorrect, incomplete, and inaccurate data
from the dataset; it also replaces missing values. Some common data cleaning
techniques are described below.

Handling missing values:


 Standard values like “Not Available” or “NA” can be used to replace the missing
values.
 Missing values can also be filled in manually, but this is not recommended when
the dataset is big.
 The attribute’s mean value can be used to replace a missing value when the data
is normally distributed, whereas in the case of a non-normal distribution the
attribute’s median value can be used.
 When using regression or decision tree algorithms, the missing value can be
replaced by the most probable value.
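As a rough illustration of mean/median imputation (the income column below is a hypothetical example, not from the text), a pandas-based sketch might look like this:

import pandas as pd

# Hypothetical dataset with missing values in the "income" column.
df = pd.DataFrame({"income": [45000, None, 52000, 61000, None, 58000]})

# Mean imputation: reasonable when the attribute is roughly normally distributed.
df["income_mean_filled"] = df["income"].fillna(df["income"].mean())

# Median imputation: more robust when the distribution is skewed (non-normal).
df["income_median_filled"] = df["income"].fillna(df["income"].median())

print(df)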
Noisy Data: Noisy data generally means data containing random error or unnecessary
data points. Here are some of the methods to handle noisy data.
 Binning: This method is used to smooth noisy data. First the data is sorted,
then the sorted values are distributed into bins. There are three ways of
smoothing the data in a bin. Smoothing by bin means: the values in the bin are
replaced by the mean value of the bin. Smoothing by bin median: the values in
the bin are replaced by the median value of the bin. Smoothing by bin
boundaries: the minimum and maximum values of the bin are taken as the bin
boundaries and each value in the bin is replaced by the closest boundary value.
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25,
26, 28, 29, 34
Partition using equal frequency approach:
- Bin 1 : 4, 8, 9, 15
- Bin 2 : 21, 21, 24, 25
- Bin 3 : 26, 28, 29, 34

Smoothing by bin means: (4+8+9+15)/4 = 9, (21+21+24+25)/4 ≈ 23, (26+28+29+34)/4 ≈ 29


- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29

Smoothing by bin boundaries:


- Bin 1: 4, 4, 4, 15 (minimum value = 4, maximum value = 15)
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34 (minimum value = 26, maximum value = 34)
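The same worked example can be reproduced with a short Python sketch (a minimal illustration of equal-frequency binning with the two smoothing variants above):

# Equal-frequency binning of the sorted price values, then smoothing.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = [prices[i:i + 4] for i in range(0, len(prices), 4)]  # 3 bins of 4 values

# Smoothing by bin means: every value is replaced by its bin's (rounded) mean.
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: each value snaps to the nearer of the bin's min/max.
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]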

 Regression: Data can be made smooth by fitting it to a regression function.
The regression used may be linear (having one independent variable) or multiple
(having multiple independent variables). The simplest form of regression is linear
regression, which uses the formula of a straight line (y = b + wx).
 Clustering: This is used for finding the outliers and also in grouping the data.
Clustering is generally used in unsupervised learning.
 Combined computer and human inspection: Outliers may be identified through
a combination of computer and human inspection. In one application, for example,
an information – theoretic measure was used to help identify outlier patterns in a
handwritten character database for classification. The measure’s value reflected the
“surprise” content of the predicted character label with respect to the known label.
 Inconsistent Data: There may be inconsistencies in the data recorded for some
transactions. Some data inconsistencies may be corrected manually using external
references. For example, errors made at data entry may be corrected by performing
a paper trace. This may be coupled with routines designed to help correct the
inconsistent use of codes. Knowledge engineering tools may also be used to detect
the violation of known data constraints. For example, known functional
dependencies between attributes can be used to find values contradicting the
functional constraints.

Data integration: Data Integration is a data preprocessing technique that
combines data from multiple heterogeneous data sources into a coherent data
store and provides a unified view of the data. These sources may include multiple
data cubes, databases, or flat files.
Data integration is formally defined as a triple <G, S, M>, where
G stands for the global schema,
S stands for the heterogeneous source schemas,
M stands for the mapping between queries over the source and global schemas.
There are two major approaches for data integration: the “tight coupling approach”
and the “loose coupling approach”.
Tight Coupling:
 Here, a data warehouse is treated as an information retrieval component.
 In this coupling, data is combined from different sources into a single physical
location through the process of ETL – Extraction, Transformation, and
Loading.
Loose Coupling:
 Here, an interface is provided that takes the query from the user, transforms it
in a way the source database can understand, and then sends the query directly
to the source databases to obtain the result.
 And the data only remains in the actual source databases.
Issues in Data Integration:
There are three issues to consider during data integration: Schema Integration,
Redundancy Detection, and resolution of data value conflicts. These are
explained in brief below.
1. Schema Integration:
 Integrate metadata from different sources.
 Matching up real-world entities from multiple data sources is referred to as the
entity identification problem.
2. Redundancy:
 An attribute may be redundant if it can be derived or obtained from another
attribute or set of attributes.
 Inconsistencies in attributes can also cause redundancies in the resulting data
set.
 Some redundancies can be detected by correlation analysis (a small sketch
follows this list).
3. Detection and resolution of data value conflicts:
 This is the third critical issue in data integration.
 Attribute values from different sources may differ for the same real-world
entity.

 An attribute in one system may be recorded at a lower level of abstraction than
the “same” attribute in another.
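As referenced above, a minimal correlation-analysis sketch for redundancy detection might look like the following (the two income attributes are hypothetical; a correlation coefficient near ±1 suggests one attribute can be derived from the other):

import numpy as np

# Hypothetical attributes from two sources: monthly_income is (almost) derivable
# from annual_income, so keeping both would be redundant.
annual_income = np.array([40, 55, 60, 72, 80, 95], dtype=float)        # in thousands
monthly_income = annual_income / 12 + np.array([0.1, -0.2, 0.0, 0.1, -0.1, 0.2])

r = np.corrcoef(annual_income, monthly_income)[0, 1]
print(f"Pearson correlation r = {r:.3f}")  # r close to +1 flags a likely redundancy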

Data reduction: This process helps reduce the volume of the data, which makes
analysis easier while producing the same or almost the same result. The reduction
also helps to reduce storage space. Some of the techniques used for data
reduction are:
 Data Cube Aggregation:
Aggregation operation is applied to data for the construction of the data cube.
 Attribute Subset Selection:
Only the highly relevant attributes should be used; the rest can be discarded. For
performing attribute selection, one can use the level of significance and the
p-value of each attribute: an attribute whose p-value is greater than the
significance level can be discarded.
 Dimensionality reduction: This reduces the size of the data by encoding
mechanisms. It can be lossy or lossless: if the original data can be retrieved
after reconstruction from the compressed data, the reduction is called lossless,
otherwise it is lossy. Two effective methods of dimensionality reduction are
wavelet transforms and PCA (Principal Component Analysis); a short PCA sketch
follows this list.
 Numerosity Reduction: In this method, the representation of the data is made
smaller by reducing the volume. There will not be any loss of data in this
reduction.
 Data compression: Encoding the data into a reduced (compressed) form is called
data compression. This compression can be lossless or lossy. When there is no
loss of information during compression it is called lossless compression,
whereas lossy compression removes information, but only information that is
unnecessary.
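As mentioned under dimensionality reduction, here is a minimal PCA sketch using scikit-learn (the 5-dimensional random data is purely illustrative):

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical 5-dimensional data reduced to 2 principal components.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (100, 2) -- fewer dimensions retained
print(pca.explained_variance_ratio_)  # fraction of variance retained per component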

Data Transformation: The change made in the format or the structure of the data
is called data transformation. This step can be simple or complex based on the
requirements. Some methods of data transformation are:
 Smoothing: With the help of algorithms we can remove noise from the dataset,
which helps in identifying the important features of the dataset. Smoothing makes
even small changes visible, which helps in prediction.
 Aggregation: In this method, the data is stored and presented in the form of a
summary. Data from multiple sources is integrated to produce the data analysis
description. This is an important step since the accuracy of the result depends on
the quantity and quality of the data; when the quality and quantity of the data
are good, the results are more relevant.

8
 Discretization: The continuous data here is split into intervals. Discretization
reduces the data size. For example, rather than specifying the class time, we can set
an interval like (3 pm-5 pm, 6 pm-8 pm).
 Normalization: It is the method of scaling the data so that it can be represented
in a smaller range, for example 0.0 to 1.0 (a short sketch follows this list).
 Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to
help the mining process.
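As referenced in the normalization item above, a minimal min-max normalization sketch (with a hypothetical income attribute) could be:

import numpy as np

# Hypothetical income values rescaled to the range [0, 1] (min-max normalization).
income = np.array([12000, 35000, 58000, 73600, 98000], dtype=float)
income_scaled = (income - income.min()) / (income.max() - income.min())
print(income_scaled)  # 12000 -> 0.0, 98000 -> 1.0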

Data Discretization
 Dividing the range of a continuous attribute into intervals.
 Data discretization converts a large number of data values into a smaller number
of values, so that data evaluation and data management become much easier.
 Interval labels can then be used to replace actual data values.
 Reduce the number of values for a given continuous attribute.
 Some classification algorithms only accept categorical attributes.
 This leads to a concise, easy-to-use, knowledge-level representation of
mining results.

Data discretization example


 We have an attribute age with the following values:
Age: 10, 11, 13, 14, 17, 19, 30, 31, 32, 38, 40, 42, 70, 72, 73, 75
 Table: Before and after discretization
Attribute               Age                       Age                       Age
Before discretization   10, 11, 13, 14, 17, 19    30, 31, 32, 38, 40, 42    70, 72, 73, 75
After discretization    Young                     Mature                    Old
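A minimal sketch of this mapping, assuming the cut-offs below 30 for Young, 30 to 69 for Mature, and 70 and above for Old (the cut-offs are implied by the table, not stated explicitly):

ages = [10, 11, 13, 14, 17, 19, 30, 31, 32, 38, 40, 42, 70, 72, 73, 75]

def age_label(age):
    # Replace a raw age value with its interval label (concept).
    if age < 30:
        return "Young"
    if age < 70:
        return "Mature"
    return "Old"

print([age_label(a) for a in ages])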

 Discretization techniques can be categorized based on whether they use class
information or not, as follows:
o Supervised Discretization - This discretization process uses class
information.
o Unsupervised Discretization - This discretization process does not use
class information.

9
Discretization techniques can also be categorized based on the direction in which
they proceed, as follows:

1. Top-down Discretization -
 The process starts by first finding one or a few points (called split points
or cut points) to split the entire attribute range, and then repeats this
recursively on the resulting intervals.

2. Bottom-up Discretization -
 Starts by considering all of the continuous values as potential split-points.
 Removes some by merging neighboring values to form intervals, and
then recursively applies this process to the resulting intervals.

Concept Hierarchies
 Discretization can be performed rapidly on an attribute to provide a
hierarchical partitioning of the attribute values, known as a Concept
Hierarchy.
 Concept hierarchies can be used to reduce the data by collecting and
replacing low-level concepts with higher-level concepts.
 In the multidimensional model, data are organized into multiple dimensions,
and each dimension contains multiple levels of abstraction defined by
concept hierarchies.
 This organization provides users with the flexibility to view data from
different perspectives.
 Data mining on a reduced data set means fewer input and output operations
and is more efficient than mining on a larger data set.
 Because of these benefits, discretization techniques and concept hierarchies
are typically applied before data mining, rather than during mining.
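For illustration only, a tiny sketch of rolling values up a location concept hierarchy (street < city < state < country; all names are hypothetical), which is how low-level concepts get replaced by higher-level ones:

# Hypothetical mappings encoding a location concept hierarchy.
city_of = {"MG Road": "Bangalore", "Park Street": "Kolkata"}
state_of = {"Bangalore": "Karnataka", "Kolkata": "West Bengal"}
country_of = {"Karnataka": "India", "West Bengal": "India"}

def roll_up(street):
    # Replace a low-level street value with the higher-level country concept.
    return country_of[state_of[city_of[street]]]

print(roll_up("MG Road"))  # India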

Typical Methods of Discretization and Concept Hierarchy Generation for
Numerical Data
1] Binning
 Binning is a top-down splitting technique based on a specified number of
bins.
 Binning is an unsupervised discretization technique because it does not use
class information.
10
 In this method, the sorted values are distributed into a number of buckets or bins,
and each value is then replaced by the bin mean or median.
 It is further classified into (see the sketch below)
o Equal-width (distance) partitioning
o Equal-depth (frequency) partitioning
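A minimal sketch contrasting the two partitioning schemes on the price values used earlier (the choice of 3 bins is illustrative only):

import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-width: 3 bins of identical width over the value range [4, 34].
edges = np.linspace(prices.min(), prices.max(), 4)     # [ 4., 14., 24., 34.]
width_bin_index = np.digitize(prices, edges[1:-1])      # bin index for each value

# Equal-depth (frequency): 3 bins holding roughly the same number of values.
depth_bins = np.array_split(np.sort(prices), 3)

print(edges, width_bin_index)
print([list(b) for b in depth_bins])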

2] Histogram Analysis
 It is an unsupervised discretization technique because histogram analysis
does not use class information.
 Histograms partition the values for an attribute into disjoint ranges called
buckets.
 It is also further classified into
o Equal-width histogram
o Equal frequency histogram
 The histogram analysis algorithm can be applied recursively to each
partition to automatically generate a multilevel concept hierarchy, with the
procedure terminating once a pre-specified number of concept levels has
been reached.

3] Cluster Analysis
 Cluster analysis is a popular data discretization method.
 A clustering algorithm can be applied to discretize a numerical attribute A
by partitioning the values of A into clusters or groups.
 Clustering considers the distribution of A, as well as the closeness of data
points, and therefore can produce high-quality discretization results.
 Each initial cluster or partition may be further decomposed into several
subclusters, forming a lower level of the hierarchy.
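A minimal clustering-based discretization sketch using scikit-learn's KMeans (the attribute values reuse the earlier price list; the cluster count of 3 is an arbitrary choice):

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical numerical attribute A discretized into 3 clusters (intervals).
A = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34], dtype=float).reshape(-1, 1)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(A)
print(km.labels_)           # cluster (interval) label assigned to each value
print(km.cluster_centers_)  # one representative value per interval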

4] Entropy-Based Discretization
 Entropy-based discretization is a supervised, top-down splitting technique.
 It explores class distribution information in its calculation and determination
of split points.
 Let D consist of data instances defined by a set of attributes and a class-label
attribute.
 The class-label attribute provides the class information per instance.
 In this, the interval boundaries or split-points defined may help to improve
classification accuracy.
 The entropy and information gain measures are used for decision tree
induction.
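To make the idea concrete, here is a small sketch that evaluates candidate split points by information gain (the (value, class) pairs are invented for illustration; the real procedure recursively applies the best split to each resulting interval):

import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def info_gain(data, split):
    labels = [c for _, c in data]
    left = [c for v, c in data if v <= split]
    right = [c for v, c in data if v > split]
    expected = (len(left) / len(data)) * entropy(left) + (len(right) / len(data)) * entropy(right)
    return entropy(labels) - expected

# Hypothetical (age, buys_computer) instances.
data = [(25, "no"), (32, "yes"), (38, "yes"), (45, "yes"), (52, "no"), (60, "no")]

# Candidate split points: midpoints between consecutive attribute values.
candidates = [(data[i][0] + data[i + 1][0]) / 2 for i in range(len(data) - 1)]
best = max(candidates, key=lambda s: info_gain(data, s))
print(f"best split point = {best}")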
11
5] Interval Merge by χ2 Analysis
 It is a bottom-up method.
 Find the best neighboring intervals and merge them to form larger intervals
recursively.
 The method is supervised in that it uses class information.
 ChiMerge treats intervals as discrete categories.
 The basic notion is that for accurate discretization, the relative class
frequencies should be fairly consistent within an interval.
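A minimal sketch of the merge test (the class counts for the two neighboring intervals are made up; ChiMerge merges the pair with the lowest χ² value, since similar class frequencies give a small χ²):

from scipy.stats import chi2_contingency

# Hypothetical class counts (yes, no) for two neighboring intervals.
interval_a = [10, 2]   # e.g. interval [0, 10): 10 "yes", 2 "no"
interval_b = [9, 3]    # e.g. interval [10, 20): 9 "yes", 3 "no"

chi2, p, dof, expected = chi2_contingency([interval_a, interval_b])
print(f"chi2 = {chi2:.3f}, p = {p:.3f}")
# A small chi2 (large p) means the relative class frequencies are consistent,
# so ChiMerge would merge these two intervals into one.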

Decision Tree
Decision Tree is a supervised learning method used in data mining for
classification and regression tasks. It is a tree that helps us in decision-making
purposes. The decision tree creates classification or regression models as a tree
structure. It separates a data set into smaller subsets, and at the same time, the
decision tree is steadily developed. The final tree is a tree with the decision nodes
and leaf nodes. A decision node has at least two branches. The leaf nodes show a
classification or decision; we cannot split leaf nodes any further. The uppermost
decision node in a tree, which corresponds to the best predictor, is called the root
node. Decision trees can deal with both categorical and numerical data.
A decision tree is a structure that includes a root node, branches, and leaf nodes.
Each internal node denotes a test on an attribute, each branch denotes the outcome
of a test, and each leaf node holds a class label. The topmost node in the tree is the
root node.
The following decision tree is for the concept buy_computer that indicates whether
a customer at a company is likely to buy a computer or not. Each internal node
represents a test on an attribute. Each leaf node represents a class.

Advantages of using decision trees:

 A decision tree does not need scaling of the data.

 Missing values in data also do not influence the process of building a decision
tree to any considerable extent.

 A decision tree model is intuitive and simple to explain to the technical
team as well as to stakeholders.

 Compared to other algorithms, decision trees need less exertion for data
preparation during pre-processing.

 A decision tree does not require a standardization of data.

What Is The Use Of A Decision Tree?


Decision Tree is used to build classification and regression models. It is used to
create data models that will predict class labels or values for the decision-making
process. The models are built from the training dataset fed to the system
(supervised learning). Using a decision tree, we can visualize the decisions, which
makes them easy to understand, and thus it is a popular data mining technique.
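As a rough illustration (using scikit-learn and its bundled iris dataset rather than the buy_computer data, which is not included in the text), a decision tree model can be built and inspected like this:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a small classification tree on a sample dataset (supervised learning).
X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

print(export_text(tree))    # internal nodes = attribute tests, leaves = class labels
print(tree.predict(X[:5]))  # predicted class labels for the first five instances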

Knowledge Discovery Process(KDD)

Knowledge Discovery in Databases (KDD) refers to the nontrivial extraction of implicit,
previously unknown, and potentially useful information from data stored in
databases. KDD is an iterative process where evaluation measures can be
enhanced, mining can be refined, new data can be integrated and transformed in
order to get different and more appropriate results. Preprocessing of
databases consists of Data cleaning and Data Integration.

Steps in Knowledge Discovery Process(KDD)


1. Data Cleaning: Data cleaning is defined as removal of noisy and irrelevant
data from collection.
 Cleaning in case of Missing values.
 Cleaning noisy data, where noise is a random or variance error.
 Cleaning with Data discrepancy detection and Data transformation tools.
2. Data Integration: Data integration is defined as heterogeneous data from
multiple sources combined into a common source (data warehouse).
 Data integration using Data Migration tools.
 Data integration using Data Synchronization tools.
 Data integration using the ETL (Extract, Transform, Load) process.
3. Data Selection: Data selection is defined as the process where data relevant to
the analysis is decided and retrieved from the data collection.
 Data selection using Neural network.
 Data selection using Decision Trees.
 Data selection using Naive bayes.
 Data selection using Clustering, Regression, etc.
4. Data Transformation: Data Transformation is defined as the process of
transforming data into appropriate form required by mining procedure.

Data Transformation is a two-step process:
 Data mapping: Assigning elements from the source base to the destination to
capture transformations.
 Code generation: Creation of the actual transformation program.
5. Data Mining: Data mining is defined as the application of intelligent techniques
to extract potentially useful patterns.
 Transforms task relevant data into patterns.
 Decides purpose of model using classification or characterization.
6. Pattern Evaluation: Pattern evaluation is defined as identifying truly interesting
patterns representing knowledge, based on given interestingness measures.
 Find interestingness score of each pattern.
 Uses summarization and Visualization to make data understandable by
user.
7. Knowledge representation: Knowledge representation is defined as the technique
that utilizes visualization tools to represent data mining results.
 Generate reports.
 Generate tables.
 Generate discriminant rules, classification rules, characterization rules,
etc

Data Mining Architecture


The major components of any data mining system are data source, data warehouse
server, data mining engine, pattern evaluation module, graphical user interface and
knowledge base.

a. Data Sources: The actual sources of data include databases, data warehouses,
and the World Wide Web (WWW). Sometimes data may even reside in plain text
files or spreadsheets. The World Wide Web, or the Internet, is another big source
of data.
b. Database or Data Warehouse Server: The database or data warehouse server
contains the actual data, ready to be processed. The server is responsible for
retrieving the relevant data based on the user's data mining request.
c. Data Mining Engine: The data mining engine is the core component of any data
mining system. It consists of a number of modules used to perform data mining
tasks, including association, classification, characterization, clustering,
prediction, etc.
d. Pattern Evaluation Module: It interacts with the data mining engine to focus
the search towards interesting patterns.
e. Graphical User Interface
The graphical user interface module communicates between the user and the data
mining system.
f. Knowledge Base
The knowledge base is helpful throughout the data mining process. It might be
useful for guiding the search or evaluating the interestingness of the resulting
patterns.

DATA MINING SYSTEM CLASSIFICATION


A data mining system can be classified according to the following criteria −
(a) databases mined
(b) knowledge mined
(c) techniques utilized
(d) applications adapted.

Classification Based on the Databases Mined


We can classify a data mining system according to the kind of databases mined.
Database system can be classified according to different criteria such as data
models, types of data, etc. And the data mining system can be classified
accordingly. For example, if we classify a database according to the data model,
then we may have a relational, transactional, object-relational, or data warehouse
mining system.

Classification Based on the kind of Knowledge Mined


We can classify a data mining system according to the kind of knowledge mined. It
means the data mining system is classified on the basis of functionalities such as −
Characterization
Discrimination
Association and Correlation Analysis
Classification
Prediction
Outlier Analysis
Evolution Analysis

Classification Based on the Techniques Utilized


We can classify a data mining system according to the kind of techniques used. We
can describe these techniques
according to the degree of user interaction involved or the methods of analysis
employed.
Classification Based on the Applications Adapted
We can classify a data mining system according to the applications adapted.
These applications are as follows −
Finance
Telecommunications
DNA
Stock Markets
E-mail

Apriori Algorithm
The Apriori algorithm was proposed by R. Agrawal and R. Srikant in 1994 for finding
frequent itemsets in a dataset for Boolean association rules. The algorithm is named
Apriori because it uses prior knowledge of frequent itemset properties. It applies an
iterative, level-wise search in which frequent k-itemsets are used to find frequent
(k+1)-itemsets.

Apriori Algorithm Working


Suppose we have the following dataset containing various transactions; from this
dataset we need to find the frequent itemsets and generate the association rules
using the Apriori algorithm.

Step-1: K = 1
(I) Create a table containing the support count of each item present in the dataset,
called C1 (the candidate set).
(II) Compare each candidate itemset's support count with the minimum support count
(here min_support = 2); if the support count of a candidate itemset is less than
min_support, remove that itemset. This gives us the itemset L1, i.e. the items
remaining after removing those with support_count < 2.
Step 2: For K = 2, form candidate 2-itemsets (pairs), count their support, and again
remove those with support_count < 2 to obtain L2.

Step 3: For K = 3, form candidate 3-itemsets. Only one combination remains,
i.e., {A, B, C}.
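Since the original transaction tables are not reproduced here, the following is a minimal, self-contained sketch of the same level-wise search on a made-up transaction list with min_support = 2 (items and transactions are hypothetical, chosen so that {A, B, C} comes out frequent as in the walk-through):

# Minimal Apriori-style level-wise search (candidate generation + support pruning).
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C"},
]
min_support = 2

def support_count(itemset):
    return sum(1 for t in transactions if itemset <= t)

# L1: frequent 1-itemsets.
items = {i for t in transactions for i in t}
frequent = [{frozenset([i]) for i in items if support_count(frozenset([i])) >= min_support}]

k = 1
while frequent[-1]:
    # Candidate (k+1)-itemsets: unions of frequent k-itemsets, pruned by min_support.
    candidates = {a | b for a in frequent[-1] for b in frequent[-1] if len(a | b) == k + 1}
    frequent.append({c for c in candidates if support_count(c) >= min_support})
    k += 1

for level, itemsets in enumerate(frequent[:-1], start=1):
    print(f"L{level}:", [sorted(s) for s in itemsets])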

