DWDM Unit 2

This document discusses data warehousing and the multidimensional data model. It defines a data warehouse as a subject-oriented, integrated, time-variant and non-volatile collection of data used to support management decision making. It contrasts operational databases with data warehouses, noting that data warehouses contain historical data organized around dimensions and facts for analysis rather than transactions. It then explains the multidimensional data model used in data warehouses, including dimensions, dimension tables, facts, and fact tables organized into a data cube structure.

E. Padma, Asst. Prof., Department of CAI, KITS

DATA WAREHOUSING AND DATA MINING


UNIT-2:

Data Warehouse and OLAP Technology for Data Mining: Data Warehouse, Multidimensional
Data Model, Data Warehouse Architecture, Data Marts, Data Warehouse Implementation, Further
Development of Data Cube Technology, From Data Warehousing to Data Mining, Data Cube
Computation and Data Generalization, Attribute-Oriented Induction.
------------------------------------------------------------------------------------------------------------------------------
Data Warehouse

“What exactly is a data warehouse?” Data warehouses have been defined in many ways, making it
difficult to formulate a rigorous definition. Loosely speaking, a data warehouse refers to a database
that is maintained separately from an organization’s operational databases. Data warehouse systems
allow for the integration of a variety of application systems. They support information processing by
providing a solid platform of consolidated historical data for analysis.

According to William H. Inmon, a leading architect in the construction of data warehouse systems,
“A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in
support of management’s decision making process.” This short but comprehensive definition
presents the major features of a data warehouse. The four keywords, subject-oriented, integrated,
time-variant, and nonvolatile, distinguish data warehouses from other data repository systems, such
as relational database systems, transaction processing systems, and file systems. Let’s take a closer
look at each of these key features.

Subject-oriented: A data warehouse is organized around major subjects, such as customer, supplier,
product, and sales. Rather than concentrating on the day-to-day operations and transaction processing
of an organization, a data warehouse focuses on the modeling and analysis of data for decision
makers. Hence, data warehouses typically provide a simple and concise view around particular
subject issues by excluding data that are not useful in the decision support process.

Integrated: A data warehouse is usually constructed by integrating multiple heterogeneous sources,
such as relational databases, flat files, and on-line transaction records. Data cleaning and data
integration techniques are applied to ensure consistency in naming conventions, encoding structures,
attribute measures, and so on.

Time-variant: Data are stored to provide information from a historical perspective (e.g., the past 5–
10 years). Every key structure in the data warehouse contains, either implicitly or explicitly, an
element of time.

Nonvolatile: A data warehouse is always a physically separate store of data transformed from the
application data found in the operational environment. Due to this separation, a data warehouse does
not require transaction processing, recovery, or concurrency control mechanisms. It usually requires
only two operations in data accessing: initial loading of data and access of data.

Differences between Operational Database Systems and Data Warehouses:

The major task of on-line operational database systems is to perform on-line transaction and query
processing. These systems are called on-line transaction processing (OLTP) systems. They cover
most of the day-to-day operations of an organization, such as purchasing, inventory, manufacturing,
banking, payroll, registration, and accounting.
Data warehouse systems, on the other hand, serve users or knowledge workers in the role of data
analysis and decision making. These systems are known as on-line analytical processing (OLAP)
systems.
The major distinguishing features between OLTP and OLAP are summarized as follows:

Users and system orientation: An OLTP system is customer-oriented and is used for transaction and
query processing by clerks, clients, and information technology professionals. An OLAP system is
market-oriented and is used for data analysis by knowledge workers, including managers, executives,
and analysts.

Data contents: An OLTP system manages current data that, typically, are too detailed to be easily
used for decision making. An OLAP system manages large amounts of historical data, provides
facilities for summarization and aggregation, and stores and manages information at different levels
of granularity. These features make the data easier to use in informed decision making.

Database design: An OLTP system usually adopts an entity-relationship (ER) data model and an
application-oriented database design. An OLAP system typically adopts either a star or snowflake
model and a subject- oriented database design.

View: An OLTP system focuses mainly on the current data within an enterprise or department,
without referring to historical data or data in different organizations. In contrast, an OLAP system
often spans multiple versions of a database schema, due to the evolutionary process of an
organization. OLAP systems also deal with information that originates from different organizations,
integrating information from many data stores. Because of their huge volume, OLAP data are stored
on multiple storage media.

Access patterns: The access patterns of an OLTP system consist mainly of short, atomic
transactions. Such a system requires concurrency control and recovery mechanisms. However,
accesses to OLAP systems are mostly read-only operations.


Comparison between OLTP and OLAP systems.

Feature                     | OLTP                                  | OLAP
----------------------------+---------------------------------------+------------------------------------------------------
Characteristic              | operational processing                | informational processing
Orientation                 | transaction                           | analysis
User                        | clerk, DBA, database professional     | knowledge worker (e.g., manager, executive, analyst)
Function                    | day-to-day operations                 | long-term informational requirements, decision support
DB design                   | ER-based, application-oriented        | star/snowflake, subject-oriented
Data                        | current; guaranteed up-to-date        | historical; accuracy maintained over time
Summarization               | primitive, highly detailed            | summarized, consolidated
View                        | detailed, flat relational             | summarized, multidimensional
Unit of work                | short, simple transaction             | complex query
Access                      | read/write                            | mostly read
Focus                       | data in                               | information out
Operations                  | index/hash on primary key             | lots of scans
Number of records accessed  | tens                                  | millions
Number of users             | thousands                             | hundreds
DB size                     | 100 MB to GB                          | 100 GB to TB
Priority                    | high performance, high availability   | high flexibility, end-user autonomy
Metric                      | transaction throughput                | query throughput, response time


A Multidimensional Data Model


Data warehouses and OLAP tools are based on a multidimensional data model, which views data in
the form of a data cube. In a DBMS, data are stored in tables of rows and columns, that is, in two
dimensions. A data warehouse must represent data in n dimensions, and for this it uses the data cube
structure.

“What is a data cube?” A data cube allows data to be modeled and viewed in multiple dimensions. It
is defined by dimensions and facts. In general terms, dimensions are the perspectives or entities with
respect to which an organization wants to keep records.

An n-dimensional data cube gives rise to a lattice of cuboids, each summarizing the data over some
subset of the dimensions. Each dimension may have a table associated with it, called a dimension
table, which further describes the dimension. For example, a dimension table for item may contain
the attributes item_name, brand, and type. Dimension tables can be specified by users or experts.
A multidimensional data model is typically organized around a central theme, such as sales. This
theme is represented by a fact table. The fact table contains the names of the facts, or measures, as
well as keys to each of the related dimension tables. Facts are numerical measures. Examples of facts
for a sales data warehouse include dollars_sold (sales amount in dollars), units_sold (number of units
sold), and amount_budgeted.
A 2-D view of sales data for All Electronics according to the dimensions time and item, where the
sales are from branches located in the city of Vancouver. The measure displayed is dollars sold (in
thousands).
location = “Vancouver”

time (quarter) | home entertainment | computer | phone | security
---------------+--------------------+----------+-------+---------
Q1             |        605         |   825    |  14   |  400
Q2             |        680         |   952    |  31   |  512
Q3             |        812         |  1023    |  30   |  501
Q4             |        927         |  1038    |  38   |  580

Now, suppose that we would like to view the sales data with a third dimension. For instance, suppose
we would like to view the data according to time and item, as well as location, for the cities Chicago,
New York, Toronto, and Vancouver.

A 3-D view of sales data for All Electronics, according to the dimensions time, item, and
location. The measure displayed is dollars sold (in thousands).

                location = “Chicago”       location = “New York”      location = “Toronto”       location = “Vancouver”
                item (type)                item (type)                item (type)                item (type)
time (quarter)  ent.  comp. phone sec.     ent.  comp. phone sec.     ent.  comp. phone sec.     ent.  comp. phone sec.
Q1              854   882   89    623      1087  968   38    872      818   746   43    591      605   825   14    400
Q2              943   890   64    698      1130  1024  41    925      894   769   52    682      680   952   31    512
Q3              1032  924   59    789      1034  1048  45    1002     940   795   58    728      812   1023  30    501
Q4              1129  992   63    870      1142  1091  54    984      978   864   59    784      927   1038  38    580

A 3-D data cube representation of the data according to the dimensions time, item, and location. The
measure displayed is dollars sold (in thousands).


Suppose that we would now like to view our sales data with an additional fourth dimension, such as
supplier. Viewing things in 4-D becomes tricky. However, we can think of a 4-D cube as being a
series of 3-D cubes, as shown in the figure below. If we continue in this way, we may display any n-D
data as a series of (n−1)-D “cubes.”

Lattice of cuboids, making up a 4-D data cube for the dimensions time, item, location, and
supplier. Each cuboid represents a different degree of summarization.


The cuboid that holds the lowest level of summarization is called the base cuboid. The 0-D cuboid,
which holds the highest level of summarization, is called the apex cuboid. In our example, this is the
total sales, or dollars sold, summarized over all four dimensions. The apex cuboid is typically denoted
by all.
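The cuboid lattice can be illustrated with a short Python sketch over a hypothetical four-row fact list (dimension names match the running example, but the rows and figures are illustrative only): every subset of the n dimensions yields one cuboid, so a 3-D cube has 2^3 = 8 cuboids, with the full dimension set as the base cuboid and the empty set as the apex.

```python
from itertools import combinations

# Hypothetical fact rows: (time, item, location, dollars_sold)
sales = [
    ("Q1", "phone", "Vancouver", 14),
    ("Q1", "computer", "Vancouver", 825),
    ("Q2", "phone", "Toronto", 52),
    ("Q2", "computer", "Toronto", 769),
]

dims = ("time", "item", "location")

def cuboid(rows, keep):
    """Aggregate dollars_sold, grouping only by the dimensions in `keep`."""
    idx = [dims.index(d) for d in keep]
    out = {}
    for *coords, measure in rows:
        key = tuple(coords[i] for i in idx)
        out[key] = out.get(key, 0) + measure
    return out

# The full lattice: one cuboid per subset of the dimensions -> 2^n cuboids.
lattice = {keep: cuboid(sales, keep)
           for r in range(len(dims) + 1)
           for keep in combinations(dims, r)}

base = lattice[dims]       # base cuboid: lowest level of summarization
apex = lattice[()][()]     # apex (0-D) cuboid: total over all dimensions
```

Materializing the lattice eagerly like this is exactly the "full materialization" choice discussed later; the sketch only shows the structure.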

Stars, Snowflakes, and Fact Constellations: Schemas for Multidimensional Databases

The entity-relationship data model is commonly used in the design of relational databases, where a
database schema consists of a set of entities and the relationships between them. Such a data model is
appropriate for on-line transaction processing. A data warehouse, however, requires a concise, subject-
oriented schema that facilitates on-line data analysis.

The most popular data model for a data warehouse is a multidimensional model. Such a model can exist
in the form of a star schema, a snowflake schema, or a fact constellation schema. Let’s look at each of
these schema types.

Star schema: The most common modeling paradigm is the star schema, in which the data
warehouse contains (1) a large central table (fact table) containing the bulk of the data, with no
redundancy, and (2) a set of smaller attendant tables (dimension tables), one for each dimension. The
schema graph resembles a starburst, with the dimension tables displayed in a radial pattern around the
central fact table.

Star schema example. A star schema for AllElectronics sales is shown in the figure. Sales are considered
along four dimensions, namely time, item, branch, and location. The schema contains a central fact table
for sales that contains keys to each of the four dimensions, along with two measures: dollars_sold and
units_sold. To minimize the size of the fact table, dimension identifiers (such as time_key and item_key)
are system-generated identifiers.

Notice that in the star schema, each dimension is represented by only one table, and each table contains
a set of attributes. For example, the location dimension table contains the attribute set {location_key,
street, city, province_or_state, country}. This constraint may introduce some redundancy. For example,
“Vancouver” and “Victoria” are both cities in the Canadian province of British Columbia. Entries for
such cities in the location dimension table will create redundancy among the attributes province_or_state
and country, that is, (..., Vancouver, British Columbia, Canada) and (..., Victoria, British Columbia,
Canada).
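The trade-off behind the star schema can be sketched in a few lines of Python (the keys and amounts are hypothetical, not AllElectronics data): the fact table stores only keys and measures, so answering a query such as "sales by country" needs one lookup into the denormalized location dimension per fact row.

```python
# Star schema sketch: a fact table of keys + measures, plus one
# denormalized dimension table per dimension (hypothetical data).
location_dim = {  # location_key -> (city, province_or_state, country)
    1: ("Vancouver", "British Columbia", "Canada"),  # note the redundancy:
    2: ("Victoria", "British Columbia", "Canada"),   # province/country repeat
}

fact = [  # (time_key, item_key, location_key, dollars_sold, units_sold)
    (1, 10, 1, 1200.0, 3),
    (1, 10, 2, 800.0, 2),
]

# "Sales by country": one dimension-table lookup per fact row.
by_country = {}
for _time_key, _item_key, loc_key, dollars, _units in fact:
    country = location_dim[loc_key][2]
    by_country[country] = by_country.get(country, 0.0) + dollars
```

In a snowflake schema the repeated (province, country) pairs would be split into their own table, trading that redundancy for an extra join.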

Snowflake schema: The snowflake schema is a variant of the star schema model, where some
dimension tables are normalized, thereby further splitting the data into additional tables. The resulting
schema graph forms a shape similar to a snowflake.

The major difference between the snowflake and star schema models is that the dimension tables of the
snowflake model may be kept in normalized form to reduce redundancies. Such tables are easy to maintain and
save storage space. However, the snowflake structure can reduce the effectiveness of browsing, since more joins
will be needed to execute a query. Hence, although the snowflake schema reduces redundancy, it is not as
popular as the star schema in data warehouse design.

Snowflake schema. A snowflake schema for All Electronics sales is given in Figure. Here, the sales fact table is
identical to that of the star schema in Figure. The main difference between the two schemas is in the definition
of dimension tables.

The single dimension table for item in the star schema is normalized in the snowflake schema, resulting in new
item and supplier tables. For example, the item dimension table now contains the attributes item key, item
name, brand, type, and supplier key, where supplier key is linked to the supplier dimension table, containing
supplier_key and supplier_type information. Similarly, the single dimension table for location in the star schema
can be normalized into two new tables: location and city. The city key in the new location table links to the city
dimension.

Fact constellation: Sophisticated applications may require multiple fact tables to share
dimension tables. This kind of schema can be viewed as a collection of stars, and hence
is called a galaxy schema or a fact constellation.

Fact constellation example: A fact constellation schema is shown in the figure. This schema specifies two fact
tables, sales and shipping. The sales table definition is identical to that of the star schema figure. The shipping
table has five dimensions, or keys: item_key, time_key, shipper_key, from_location, and to_location, and two
measures: dollars_cost and units_shipped. A fact constellation schema allows dimension tables to be shared
between fact tables. For example, the dimension tables for time, item, and location are shared between both
the sales and shipping fact tables.


Example for Defining Star, Snowflake, and Fact Constellation Schemas

A Data Mining Query Language, DMQL: Language Primitives


• Cube definition (fact table):
  define cube <cube_name> [<dimension_list>]: <measure_list>
• Dimension definition (dimension table):
  define dimension <dimension_name> as (<attribute_or_subdimension_list>)
• Special case (shared dimension tables): defined the first time as above, then reused as
  define dimension <dimension_name> as <dimension_name_first_time> in cube <cube_name_first_time>
Defining a Star Schema in DMQL

define cube sales_star [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)

Defining a Snowflake Schema in DMQL


define cube sales_snowflake [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type))
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city(city_key, province_or_state, country))

Defining a Fact Constellation in DMQL


define cube sales [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)
define cube shipping [time, item, shipper, from_location, to_location]:
    dollar_cost = sum(cost_in_dollars), units_shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales

Measures: Three Categories


Measure: a function evaluated on aggregated data corresponding to given dimension–value pairs.
Measures can be:
1. distributive: if the measure can be calculated in a distributive manner.
o E.g., count(), sum(), min(), max().
2. algebraic: if it can be computed from arguments obtained by applying distributive aggregate
functions.
o E.g., avg() = sum()/count(), min_N(), standard_deviation().
3. holistic: if it is not algebraic.

o E.g., median(), mode(), rank().
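The three categories can be illustrated with a small Python sketch over hypothetical partitioned data: distributive measures combine per-partition results directly, algebraic measures derive from a bounded number of distributive ones, and a holistic measure such as the median needs all the values at once.

```python
import statistics

# Hypothetical partitions of one measure column, e.g. one list per node.
partitions = [[3, 7, 9], [1, 5], [8, 2, 6]]
all_values = [v for part in partitions for v in part]

# Distributive: combine per-partition results directly.
total = sum(sum(p) for p in partitions)        # same as sum(all_values)
maximum = max(max(p) for p in partitions)      # same as max(all_values)

# Algebraic: computed from a fixed number of distributive arguments.
count = sum(len(p) for p in partitions)
average = total / count                        # avg = sum / count

# Holistic: no constant-size summary per partition suffices.
med = statistics.median(all_values)
```

This distinction matters for cube computation: distributive and algebraic measures can be rolled up from lower cuboids, while holistic measures generally cannot.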

A Concept Hierarchy
A concept hierarchy defines a sequence of mappings from a set of low-level concepts to
higher-level, more general concepts. Concept hierarchies allow data to be handled at varying
levels of abstraction.

OLAP operations on multidimensional data.


1. Roll-up: The roll-up operation performs aggregation on a data cube, either by climbing-up
a concept hierarchy for a dimension or by dimension reduction. Figure shows the result of a
roll-up operation performed on the central cube by climbing up the concept hierarchy for
location. This hierarchy was defined as the total order
street < city < province_or_state < country.
When roll-up is performed by dimension reduction, one or more dimensions are removed from the
given cube. For example, consider a sales data cube containing only the two dimensions location and
time. Roll-up may be performed by removing, say, the time dimension, resulting in an aggregation of
the total sales by location, rather than by location and by time.

2. Drill-down: Drill-down is the reverse of roll-up. It navigates from less detailed data to
more detailed data. Drill-down can be realized by either stepping-down a concept hierarchy
for a dimension or introducing additional dimensions. Figure shows the result of a drill-
down operation performed on the central cube by stepping down a concept hierarchy for
time defined as day < month < quarter < year. Drill-down occurs by descending the time
hierarchy from the level of quarter to the more detailed level of month.
3. Slice and dice: The slice operation performs a selection on one dimension of the given
cube, resulting in a subcube. Figure shows a slice operation where the sales data are selected
from the central cube for the dimension time using the criterion time = “Q2”. The dice
operation defines a subcube by performing a selection on two or more dimensions.
4. Pivot (rotate): Pivot is a visualization operation that rotates the data axes in view in order to
provide an alternative presentation of the data. Figure shows a pivot operation where the item and
location axes in a 2-D slice are rotated.

Figure: Examples of typical OLAP operations on multidimensional data.
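The roll-up and slice operations can also be sketched in pure Python over a hypothetical base cuboid keyed by (month, city), with a toy concept hierarchy month < quarter; the cell values are illustrative only.

```python
# Hypothetical base cuboid cells: (month, city) -> dollars_sold.
cells = {
    ("Jan", "Vancouver"): 150, ("Feb", "Vancouver"): 200, ("Apr", "Vancouver"): 330,
    ("Jan", "Toronto"): 120, ("Feb", "Toronto"): 180, ("Apr", "Toronto"): 250,
}

# Concept hierarchy used for climbing from month to quarter.
month_to_quarter = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1", "Apr": "Q2"}

def roll_up(cube, mapping, axis):
    """Climb one level of a concept hierarchy along the given axis."""
    out = {}
    for key, v in cube.items():
        k = list(key)
        k[axis] = mapping[k[axis]]
        out[tuple(k)] = out.get(tuple(k), 0) + v
    return out

def slice_(cube, axis, value):
    """Select one value along an axis, yielding a lower-dimensional subcube."""
    return {k[:axis] + k[axis + 1:]: v for k, v in cube.items() if k[axis] == value}

by_quarter = roll_up(cells, month_to_quarter, axis=0)  # roll-up on time
jan_slice = slice_(cells, axis=0, value="Jan")         # slice: month = "Jan"
```

Drill-down would run the mapping in the opposite direction (quarter back to months), which requires data at the finer level; dice is a slice over two or more axes at once.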

Data warehouse architecture


Steps for the Design and Construction of Data Warehouse
This subsection presents a business analysis framework for data warehouse design. The basic steps
involved in the design process are also described.

The Design of a Data Warehouse: A Business Analysis Framework

Four different views regarding the design of a data warehouse must be considered: the top-down
view, the data source view, the data warehouse view, and the business query view.

 The top-down view allows the selection of the relevant information necessary for the data
warehouse.
 The data source view exposes the information being captured, stored, and managed by
operational systems.
 The data warehouse view includes fact tables and dimension tables.
 Finally, the business query view is the perspective of data in the data warehouse from the
viewpoint of the end user.
Three-tier Data warehouse architecture

Data warehouses often adopt a three-tier architecture, as presented in the figure.


1. The bottom tier is a warehouse database server that is almost always a relational database system.
Back-end tools and utilities are used to feed data into the bottom tier from operational databases or
other external sources. These tools and utilities perform data extraction, cleaning, and
transformation, as well as load and refresh functions to update the data warehouse.


The data are extracted using application program interfaces known as gateways. A gateway is
supported by the underlying DBMS and allows client programs to generate SQL code to be executed
at a server. Examples of gateways include ODBC (Open Database Connection) and OLEDB (Open
Linking and Embedding for Databases) by Microsoft, and JDBC (Java Database Connection). This
tier also contains a metadata repository, which stores information about the data warehouse and its
contents.

2. The middle tier is an OLAP server that is typically implemented using either

(1) a relational OLAP (ROLAP) model, that is, an extended relational DBMS that maps operations
on multidimensional data to standard relational operations; or
(2) a multidimensional OLAP (MOLAP) model, that is, a special-purpose server that directly
implements multidimensional data and operations.

3. The top tier is a front-end client layer, which contains query and reporting tools, analysis tools,
and/or data mining tools (e.g., trend analysis, prediction, and so on).

From the architecture point of view, there are three data warehouse models: the enterprise
warehouse, the data mart, and the virtual warehouse.
 Enterprise warehouse: An enterprise warehouse collects all of the information about
subjects spanning the entire organization. It provides corporate-wide data integration,
usually from one or more operational systems or external information providers, and
is cross-functional in scope. It typically contains detailed data as well as summarized
data, and can range in size from a few gigabytes to hundreds of gigabytes, terabytes,
or beyond.
 Data mart: A data mart contains a subset of corporate-wide data that is of value to a
specific group of users. The scope is connected to specific, selected subjects. For
example, a marketing data mart may connect its subjects to customer, item, and sales.
The data contained in data marts tend to be summarized. Depending on the source of
data, data marts can be categorized into the following two classes:

(i).Independent data marts are sourced from data captured from one or more
operational systems or external information providers, or from data generated locally
within a particular department or geographic area.

(ii). Dependent data marts are sourced directly from enterprise data warehouses.

 Virtual warehouse: A virtual warehouse is a set of views over operational databases.
For efficient query processing, only some of the possible summary views may be
materialized. A virtual warehouse is easy to build but requires excess capacity on
operational database servers.

Metadata repository
Metadata are data about data. When used in a data warehouse, metadata are the data that
define warehouse objects. Metadata are created for the data names and definitions of the
given warehouse. Additional metadata are created and captured for time stamping any
extracted data, the source of the extracted data, and missing fields that have been added by
data cleaning or integration processes. A metadata repository should contain:

 A description of the structure of the data warehouse. This includes the warehouse schema,
view, dimensions, hierarchies, and derived data definitions, as well as data mart locations and
contents;
 Operational metadata, which include data lineage (history of migrated data and the sequence
of transformations applied to it), currency of data (active, archived, or purged), and
monitoring information (warehouse usage statistics, error reports, and audit trails);
 The algorithms used for summarization, which include measure and dimension definition
algorithms, data on granularity, partitions, subject areas, aggregation, summarization, and
predefined queries and reports;

 The mapping from the operational environment to the data warehouse, which includes source
databases and their contents, gateway descriptions, data partitions, data extraction, cleaning,
transformation rules and defaults, data refresh and purging rules, and security.
 Data related to system performance, which include indices and profiles that improve data
access and retrieval performance, in addition to rules for the timing and scheduling of refresh,
update, and replication cycles; and
 Business metadata, which include business terms and definitions, data ownership
information, and charging policies.

Types of OLAP Servers: ROLAP versus MOLAP versus HOLAP

1. Relational OLAP (ROLAP)

■ Uses a relational or extended-relational DBMS to store and manage warehouse data, with OLAP middleware to support missing pieces
■ Includes optimization of the DBMS back end, implementation of aggregation navigation logic, and additional tools and services
■ Greater scalability

2. Multidimensional OLAP (MOLAP)

■ Array-based multidimensional storage engine (sparse-matrix techniques)
■ Fast indexing to pre-computed summarized data

3. Hybrid OLAP (HOLAP)

■ User flexibility, e.g., low level: relational; high level: array

4. Specialized SQL servers

■ Specialized support for SQL queries over star/snowflake schemas

DATA WAREHOUSE IMPLEMENTATION


A data warehouse is represented by a data cube. The following issues must be considered when
implementing a data warehouse:

1. Efficient cube computation techniques


2. Access methods
3. Query processing techniques

1. Efficient cube computation techniques:


 Data cube computation is an essential task in data warehouse implementation. The pre-
computation of all or part of a data cube can greatly reduce the response time and enhance
the performance of on-line analytical processing.

 The compute cube operator is applied for cube computation; it computes aggregates over
all subsets of the dimensions specified in the operation.

 The storage requirements become excessive when many of the dimensions have associated
concept hierarchies; this problem is referred to as the curse of dimensionality.

 If the dimensions have no associated concept hierarchies, the total number of cuboids for an
n-dimensional data cube is 2^n.

 If the dimensions have concept hierarchies, the total number of cuboids is

        T = ∏ (i = 1 to n) (L_i + 1)

where L_i is the number of levels associated with dimension i, and the +1 accounts for the
additional all (topmost) level of each dimension.

 As the number of dimensions and concept hierarchies increases, the storage space required
can exceed that of the actual data.
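Both counting formulas can be checked with a short Python sketch (the per-dimension level counts below are hypothetical):

```python
from math import prod

# Without hierarchies: an n-dimensional cube has 2**n cuboids.
def cuboids_flat(n):
    return 2 ** n

# With hierarchies: T = prod(L_i + 1), where L_i is the number of levels
# of dimension i and the +1 covers that dimension's "all" level.
def cuboids_with_hierarchies(levels):
    return prod(l + 1 for l in levels)

# E.g. time with 4 levels (day < month < quarter < year), item with 2,
# location with 4 -- hypothetical counts for illustration.
total = cuboids_with_hierarchies([4, 2, 4])  # (4+1) * (2+1) * (4+1)
```

Even these modest numbers show why full pre-computation quickly becomes impractical as dimensions and hierarchy levels grow.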
 So it is unrealistic to pre-compute and materialize all of the cuboids that can be generated
for a cube.

 If there are many cuboids, there is another option, called partial materialization.

 There are three choices for data cube materialization:

1. No materialization
2. Full materialization
3. Partial materialization

1. No materialization: Do not pre-compute any of the non-base cuboids. This leads to computing
expensive multidimensional aggregates on the fly.

2. Full materialization: Pre-compute all of the cuboids. The resulting lattice of cuboids is referred to
as the full cube. This choice typically requires huge amounts of memory space in order to store all
of the pre-computed cuboids.

3. Partial materialization: Selectively compute the subset of cuboids that is most relevant to query
processing.


It should consider three factors:

a) Identify the subset of cuboids to materialize, on which other frequently referenced cuboids can
be based (for example, an iceberg cube materializes only cells whose aggregate value meets a
threshold);
b) Exploit the materialized cuboids during query processing;
c) Efficiently update the materialized cuboids during load, refresh, and incremental update.

2. Indexing OLAP data


The bitmap indexing method is popular in OLAP products because it allows quick searching in data
cubes.
The bitmap index is an alternative representation of the record ID (RID) list. In the bitmap index
for a given attribute, there is a distinct bit vector, Bv, for each value v in the domain of the attribute.
If the domain of a given attribute consists of n values, then n bits are needed for each entry in the
bitmap index. For a given row, the bit for the attribute value that is present is set to 1; all other bits
for that row are set to 0.

The join indexing method gained popularity from its use in relational database query processing.
Traditional indexing maps the value in a given column to a list of rows having that value. In contrast,
join indexing registers the joinable rows of two relations from a relational database. For example, if
two relations R(RID, A) and S(B, SID) join on the attributes A and B, then the join index record
contains the pair (RID, SID), where RID and SID are record identifiers from the R and S relations,
respectively. the query processing require more than 2 cuboids we are using this join index.
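A minimal sketch of building such an index of (RID, SID) pairs (the toy relations and identifiers below are illustrative):

```python
def build_join_index(r_rows, s_rows, r_attr, s_attr):
    """Register the joinable (RID, SID) pairs of R(RID, A) and S(B, SID),
    i.e. pairs of rows satisfying R.r_attr == S.s_attr."""
    # Hash S's join values once, then probe with each row of R.
    s_by_value = {}
    for sid, s in s_rows:
        s_by_value.setdefault(s[s_attr], []).append(sid)
    return [(rid, sid)
            for rid, r in r_rows
            for sid in s_by_value.get(r[r_attr], [])]

# Toy relations (identifiers and join values are made up):
R = [("r1", {"A": "x"}), ("r2", {"A": "y"})]
S = [("s1", {"B": "x"}), ("s2", {"B": "x"}), ("s3", {"B": "z"})]
print(build_join_index(R, S, "A", "B"))  # [('r1', 's1'), ('r1', 's2')]
```

In a star schema, one such index would be kept per fact-table-to-dimension-table relationship, so a query touching several dimensions can intersect the precomputed pair lists instead of re-joining the tables.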
3. Query processing techniques

Given a set of materialized cuboids, efficient OLAP query processing proceeds in two steps:
1. Determine which operations should be performed on the available cuboids. This involves
transforming any selection, projection, roll-up (group-by) and drill-down operations specified
in the query into corresponding SQL and/or OLAP operations. For example, slicing and
dicing of a data cube may correspond to selection and/or projection operations on a
materialized cuboid.
2. Determine to which materialized cuboid(s) the relevant operations should be applied. This
involves identifying all of the materialized cuboids that may potentially be used to answer the
query.
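Step 2 above amounts to a cuboid-selection problem: any materialized cuboid whose dimension set covers the query's group-by dimensions can answer the query, and among those the smallest is usually cheapest to scan. A sketch under that simple cost model (the cuboid names and cell counts are hypothetical):

```python
def choose_cuboid(query_dims, materialized):
    """Among materialized cuboids whose dimension sets cover the query's
    group-by dimensions, pick the one with the fewest cells."""
    candidates = [(size, dims) for dims, size in materialized.items()
                  if set(query_dims) <= set(dims)]
    return min(candidates)[1] if candidates else None

# Hypothetical materialized cuboids with their cell counts:
cuboids = {
    ("city", "item", "year"): 1_000_000,  # base cuboid
    ("city", "year"): 50_000,
    ("item", "year"): 30_000,
}
print(choose_cuboid(["year"], cuboids))          # ('item', 'year')
print(choose_cuboid(["city", "item"], cuboids))  # ('city', 'item', 'year')
```

A real optimizer would also weigh indexing and physical layout, but the covering-then-cheapest rule captures the core of the selection step.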

From Data Warehousing to Data Mining

Data Warehouse Usage:
Three kinds of data warehouse applications
1. Information processing
supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs

2. Analytical processing

 multidimensional analysis of data warehouse data

 supports basic OLAP operations, slice-dice, drilling, pivoting


3. Data mining

 knowledge discovery from hidden patterns


 supports associations, constructing analytical models, performing classification and prediction, and
presenting the mining results using visualization tools.
 Differences among the three tasks

From on-line analytical processing to on-line analytical mining.

On-Line Analytical Mining (OLAM) (also called OLAP mining), which integrates on-line analytical
processing (OLAP) with data mining and mining knowledge in multidimensional databases, is
particularly important for the following reasons.
1. High quality of data in data warehouses.

Most data mining tools need to work on integrated, consistent, and cleaned data, which requires
costly data cleaning, data transformation and data integration as preprocessing steps. A data
warehouse constructed by such preprocessing serves as a valuable source of high quality data for
OLAP as well as for data mining.
2. Available information processing infrastructure surrounding
data warehouses.
Comprehensive information processing and data analysis infrastructures have been or will be
systematically constructed surrounding data warehouses, which include accessing, integration,
consolidation, and transformation of multiple, heterogeneous databases, ODBC/OLEDB connections,
Web-accessing and service facilities, reporting and OLAP analysis tools.
3. OLAP-based exploratory data analysis.

Effective data mining needs exploratory data analysis. A user will often want to traverse
through a database, select portions of relevant data, analyze them at different granularities, and
present knowledge/results in different forms. On-line analytical mining provides facilities for
data mining on different subsets of data and at different levels of abstraction, by drilling,
pivoting, filtering, dicing and slicing on a data cube and on some intermediate data mining
results.

4. On-line selection of data mining functions.

By integrating OLAP with multiple data mining functions, on-line analytical mining provides users
with the flexibility to select desired data mining functions and swap data mining tasks dynamically.

Architecture for on-line analytical mining


An OLAM engine performs analytical mining in data cubes in a similar manner as an OLAP engine
performs on-line analytical processing. An integrated OLAM and OLAP architecture is shown in Figure,


where the OLAM and OLAP engines both accept users' on-line queries via a User GUI API and work
with the data cube in the data analysis via a Cube API.

A metadata directory is used to guide the access of the data cube. The data cube can be constructed by
accessing and/or integrating multiple databases and/or by filtering a data warehouse via a Database API
which may support OLEDB or ODBC connections. Since an OLAM engine may perform multiple data
mining tasks, such as concept description, association, classification, prediction, clustering, time-series
analysis, etc., it usually consists of multiple, integrated data mining modules and is more sophisticated
than an OLAP engine.

Figure: An integrated OLAM and OLAP architecture.
