DWM Unit 2 Notes

Uploaded by

Triveni Patle
Copyright
© © All Rights Reserved

COURSE CODE: BIT33505

COURSE NAME: DATA WAREHOUSING AND MINING

V SEMESTER

UNIT 2 NOTES

THREE TIER (MULTITIER) DATA WAREHOUSE ARCHITECTURE

Tier1 - Bottom Tier:


• The bottom tier is the warehouse database server, which is almost always a relational
database system.
• Back-end tools and utilities feed data into the bottom tier from operational databases or
other external sources.
• These tools and utilities perform data extraction, cleaning, and transformation, as well as load
and refresh functions to update the data warehouse.
• This tier also contains a metadata repository, which stores information about the data
warehouse and its contents.

Tier2 - Middle Tier:
• The middle tier is an OLAP server that is typically implemented using either a relational
OLAP (ROLAP) model or a multidimensional OLAP (MOLAP) model.
• For a user, this application tier presents an abstracted view of the database. This layer also
acts as a mediator between the end-user and the database.
• A relational OLAP (ROLAP) model is an extended relational DBMS that maps operations on
multidimensional data to standard relational operations.
• A multidimensional OLAP (MOLAP) model is a special-purpose server that directly
implements multidimensional data and operations.
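
The idea that a relational engine can answer multidimensional questions with ordinary relational operations can be sketched as follows. This is a minimal illustration only, not how any particular OLAP server works: the table name sales, its columns, the helper roll_up_by, and all amounts are made up for the example. It uses Python's built-in sqlite3 module and expresses an aggregation over one dimension as a plain GROUP BY.

```python
import sqlite3

# Hypothetical relational table: one row per (item, city, quarter) cell.
# The amounts are invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (item TEXT, city TEXT, quarter TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("Car", "Delhi", "Q1", 100),
        ("Car", "Kolkata", "Q1", 150),
        ("Bus", "Delhi", "Q1", 80),
        ("Bus", "Delhi", "Q2", 120),
    ],
)


def roll_up_by(dimension):
    """Total the measure per value of one dimension, summing over the others."""
    if dimension not in {"item", "city", "quarter"}:
        raise ValueError(f"unknown dimension: {dimension}")
    query = (
        f"SELECT {dimension}, SUM(amount) FROM sales "
        f"GROUP BY {dimension} ORDER BY {dimension}"
    )
    return conn.execute(query).fetchall()


totals_by_city = roll_up_by("city")
totals_by_quarter = roll_up_by("quarter")
```

Here the multidimensional request "total sales per city" becomes a single SELECT with GROUP BY, which is the kind of mapping a ROLAP server performs on the user's behalf.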

Tier3 - Top Tier:


• The top tier is a front-end client layer.
• It consists of the tools and APIs used to connect to the data warehouse and get data out of it.
• It contains query and reporting tools, managed query tools, analysis tools, and/or data mining
tools (e.g., trend analysis, prediction, and so on).

ENTERPRISE DATA WAREHOUSE


• An enterprise data warehouse is a centralized repository that stores and manages large volumes
of structured and unstructured data from various sources of an enterprise.
• An Enterprise Data Warehouse (EDW) stores and manages all the historical business data within
an organization.
• It is designed to support business intelligence, analytics, and reporting needs by providing a
unified view of the organization's data for analysis and decision-making purposes.
• The information usually comes from different systems like ERPs, CRMs, physical recordings,
and other flat files.

Enterprise Data Warehouse Components

1. Data sources
These can include internal systems such as ERP, CRM, and billing systems, as well as external
data sources such as social media feeds, market research, and website analytics.

2. Data integration and ETL


The process of extracting, transforming, and loading data from various sources into the data
warehousing system is a critical component of the architecture. This process is commonly referred
to as ETL (Extract, Transform, Load), and it involves extracting data from source systems,
transforming it into a format that is consistent with the schema, and loading it into the system.

3. Staging area
In an ETL pipeline, the staging area is where data is transformed before it is loaded into the EDW.
It is a temporary area where data is cleansed, de-duplicated, split, joined, and converted into a
unified format before loading into the warehouse.
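
The staging steps named above (cleansing, de-duplication, conversion to a unified format) can be sketched in a few lines. This is only an illustration under stated assumptions: the record fields customer, amount, and date, the function name stage, and the sample rows are all hypothetical and not taken from any real ETL tool.

```python
def stage(records):
    """Cleanse, de-duplicate, and unify raw source records before loading."""
    seen = set()
    staged = []
    for rec in records:
        # Unify the name format: trim whitespace, use title case.
        name = rec.get("customer", "").strip().title()
        if not name:
            continue  # cleansing: drop rows that have no customer name
        # Unify amounts: sources may send strings or numbers.
        amount = round(float(rec["amount"]), 2)
        key = (name, amount, rec["date"])
        if key in seen:
            continue  # de-duplication: the same sale arrived from two systems
        seen.add(key)
        staged.append({"customer": name, "amount": amount, "date": rec["date"]})
    return staged


# Hypothetical raw rows, as they might arrive from two different source systems.
raw = [
    {"customer": " asha rao ", "amount": "120.50", "date": "2024-01-05"},
    {"customer": "Asha Rao", "amount": 120.5, "date": "2024-01-05"},
    {"customer": "", "amount": "10", "date": "2024-01-06"},
]
staged = stage(raw)
```

Only one clean row survives: the two copies of the same sale collapse into one, and the row without a customer name is dropped before anything reaches the warehouse.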

4. Storage layer
After staging, the data is loaded into the storage layer, the central repository where processed
data is stored. By this stage all transformations have been applied, so the data is loaded in its
final model. This layer includes relational databases, a database management system, and
additional storage for metadata.

5. Metadata module
Metadata is data about data. It tells users and administrators which subject or domain a piece of
information relates to. This module stores metadata such as the origin, meaning, and usage of
data. Metadata can be technical (e.g., source system) or business-oriented (e.g., region of sales)
and is managed separately.

6. Data marts
EDW can have a set of smaller subsections called data marts that are built specifically for a
particular subject area, business function, or group of users. For example, there can be a separate
data mart for marketing purposes and a data mart for a financial department.

7. Presentation layer
The final building block of an EDW comprises tools that give end users access to data. Also
known as the BI interface, this layer provides tools for data visualization, dashboards, reporting,
and access for analysis or machine learning tasks.

Benefits Of Implementing EDW


1. Better data quality and accuracy:
The system cleans, organizes, and combines data into one place, giving a single, correct version
of the information. This avoids mistakes that happen when data is scattered in many places.
2. Better decision-making:
An EDW gives clear insights into how the business is doing. This helps organizations make
smarter decisions, find new opportunities, improve processes, and perform better.
3. Lower operational costs:
It simplifies how data is managed by removing duplicate data, connecting different sources, and
breaking data silos. This saves time and reduces effort and costs.
4. Scalable and flexible:
The system can grow or shrink to handle different amounts of data as business needs change,
making it easy to adapt to market changes.
5. Helps with compliance:
It ensures that data follows privacy rules and industry standards, helping organizations avoid
fines and legal problems.

Best Practices for EDW Implementation


1. Defining Clear Objectives and Scope-
Clear objectives and scope must be defined at the beginning. Understanding the purpose of the
system, the required data, and the right data structure creates a strong base for a successful data
warehouse.

2. Choosing the Right Data Warehouse Architecture-
The data warehouse structure should match the business needs, data size, and number of sources.
The design must balance cost, complexity, and future growth.
3. Ensuring Data Governance and Quality-
Data governance and quality are critical to EDW implementation. Rules and processes should be
in place to keep data accurate, clean, and secure. High-quality data and proper management make
the system reliable.
4. Implementing Effective ETL Processes-
ETL (Extract, Transform, Load) involves collecting data from different sources, cleaning and
converting it into the correct format, and storing it in the warehouse. Well-designed ETL
processes reduce errors and duplicate data.

Types of Enterprise Data Warehouse

1. On-Premises Data Warehouse


• The on-premises data warehouse is located within a physically confined, secured data center,
isolated from the outside world.
• This isolation through the infrastructure lets companies have granular control of their data
management infrastructure including hardware, software, and tools.
• Traditionally this method is used by those companies that require maximum protection and
customization for their data storage plans.

Pros:
• High Security: The fact that the data is within your organization's firewall enhances security
and privacy to a very high level.
• Customization: Every little thing in your server is customizable, whether hardware, software,
or specific configuration to solve a unique problem.
• Performance: The on-premises EDWs can be configured for particular job types and that can
give the advantage of performance when querying.

Cons:
• High Upfront Cost: Hardware, software, technical setup, and maintenance are all expensive,
making it an expensive undertaking.
• Scalability Challenges: As data grows, scaling storage and processing capacity is expensive
and resource-intensive.
• Limited Agility: Data needs may swing between large and small over time, and the
infrastructure may have to be reworked each time.

2. Cloud Data Warehouse
• The cloud data warehouse is a powerful system offering a customized storage facility that can
be accessed from any internet-connected device.
• These solutions are hosted by cloud service providers such as Amazon Web Services (AWS),
Microsoft Azure, or Google Cloud Platform (GCP), which offer organizations a scalable and
accessible way of storing their data.

Pros:
• Scalability and Elasticity: A cloud platform allows rapid increase or decrease in the amount
of data storage and processing capacity required when the situation arises.
• Reduced Cost: It eliminates the need for the business to invest in hardware and IT staff, and
lets it pay for the service as it is used.
• Accessibility and Manageability: Data can be accessed from anywhere, even on the move,
which helps remote team members complete their tasks faster and collaborate more
effectively.

Cons:
• Security Considerations: Some companies might have concerns about entrusting sensitive
data to external servers.
• Vendor Lock-In: Migrating from one cloud provider's platform to another can be a highly
complex and costly process.
• Network Dependence: Performance might be affected by connection speed and reliability.

3. Hybrid Data Warehouse


• The hybrid data warehouse, as the name implies, is a blend of on-premises infrastructure to
ensure security and control along with cloud scalability and affordability.
• Sensitive data can reside on-premises, while the less critical information is stored on the
cloud.

Pros:
• Flexibility: The storage location for each type of data can be chosen according to how much
security it needs and how much it costs to store.
• Scalability: Use the cloud's scalability for data that is needed more frequently, while keeping
private or sensitive data on-premises.
• Phased Migration: Organizations that use a hybrid deployment can migrate to the cloud
through a less disruptive, step-by-step transition.

Cons:
• Complexity: Managing a hybrid environment requires additional expertise and coordination
between on-premises and cloud infrastructure.
• Potential Vendor Management: A hybrid setup means balancing different vendors for the
on-premises and cloud components, which can lead to more complications.
• Higher Costs: Cloud solutions are less expensive than fully on-premises alternatives, but
integrating both environments brings additional expenditure.

DATA MART
• A data mart is a specialized subset of a data warehouse focused on a specific functional area
or department within an organization.
• It provides a simplified and targeted view of data, addressing specific reporting and
analytical needs.
• A data mart focuses on one particular function of an organization, e.g. finance or marketing,
and is maintained by a single authority.
• Data marts are small and flexible, typically holding the data relevant to a specific group of
users, such as sales, marketing, or finance.
• They are organized around specific subjects, such as sales, customer data, or product
information, and are structured, transformed, and optimized for efficient querying and
analysis within the domain.

Types of Data Marts

1. Dependent Data Marts


• A dependent data mart is created by extracting data from the central repository, the data
warehouse.
• First the data warehouse is created by extracting data (through an ETL tool) from external
sources, and then the data mart is created from the data warehouse.
• A dependent data mart is created in the top-down approach to data warehouse architecture.
This model of data mart is used by big organizations.
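
As a rough sketch of the top-down flow described above, a dependent data mart can be thought of as a filtered copy of rows that already exist in the warehouse. The row layout, the department field, the function name build_dependent_mart, and the figures below are hypothetical and serve only to illustrate the direction of the data flow.

```python
# Hypothetical warehouse contents: rows already extracted, cleaned, and loaded.
warehouse = [
    {"department": "marketing", "subject": "Spring campaign", "spend": 5000},
    {"department": "finance", "subject": "Payroll", "spend": 42000},
    {"department": "marketing", "subject": "Festival campaign", "spend": 7500},
]


def build_dependent_mart(warehouse_rows, department):
    """Copy only one department's rows out of the central warehouse."""
    return [dict(row) for row in warehouse_rows if row["department"] == department]


marketing_mart = build_dependent_mart(warehouse, "marketing")
marketing_spend = sum(row["spend"] for row in marketing_mart)
```

The mart never reads from the operational sources itself, which is what makes it dependent: if the warehouse is wrong or stale, so is the mart.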

2. Independent Data Mart


• An independent data mart is created directly from external sources instead of from the data
warehouse.
• First the data mart is created by extracting data from external sources, and then the data
warehouse is created from the data present in the data marts.
• An independent data mart is designed in the bottom-up approach to data warehouse architecture.
• This model of data mart is used by small organizations and is comparatively cost-effective.

3. Hybrid Data Mart
• This type of data mart is created by extracting data from operational sources or from the
data warehouse.
• Path 1 reflects accessing data directly from external sources, and Path 2 reflects the
dependent data mart model.

SCHEMA OF DATA WAREHOUSE

STAR SCHEMA
• A star schema organizes data in a database to make analysis easier and faster.
• The fact table sits at the center and stores measurable data such as sales, revenue, or quantities.
• Surrounding the fact table are dimension tables, which provide descriptive details like product
information, customer details, or dates.
• When drawn, the central fact table sits in the middle and the dimension tables radiate around
it, giving the design a star-like appearance.
• A star schema is a common way to design data warehouses and data marts. It is optimized for fast
data retrieval and analysis rather than transaction processing.

Structure:
• Fact Table (Center):
o Contains quantitative, numeric data such as sales amount, revenue, quantity, or profit.
o Includes keys (foreign keys) that link to dimension tables.
o Stores measures (facts) that are analyzed.

• Dimension Tables (Surrounding):
o Contain descriptive attributes (text or categorical data) that give context to the facts.
o Examples: customer details, product names, time/date, location.
o Dimension tables are usually denormalized (data is stored in a simple, flat form) to
make queries faster.

Example:
In a retail store data warehouse:
• Fact table: Sales (fields: Sale_ID, Date_ID, Product_ID, Customer_ID, Sales_Amount).
• Dimension tables: Date, Product, Customer, Store.
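
The retail example can be sketched with Python's built-in sqlite3 module. This is a minimal sketch, not a recommended production design: the table names dim_product and dim_date, the dimension columns, and every data value are assumptions made for illustration, and the Customer and Store dimensions are left out to keep it short. The fact table uses the field names given in the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive attributes, kept flat (denormalized).
CREATE TABLE dim_product (Product_ID INTEGER PRIMARY KEY, Name TEXT, Category TEXT);
CREATE TABLE dim_date (Date_ID INTEGER PRIMARY KEY, Quarter TEXT, Year INTEGER);

-- Fact table: one measure plus foreign keys to the dimensions.
CREATE TABLE Sales (
    Sale_ID INTEGER PRIMARY KEY,
    Date_ID INTEGER REFERENCES dim_date(Date_ID),
    Product_ID INTEGER REFERENCES dim_product(Product_ID),
    Customer_ID INTEGER,
    Sales_Amount REAL
);

-- Invented sample rows.
INSERT INTO dim_product VALUES (1, 'Pen', 'Stationery'), (2, 'Lamp', 'Home');
INSERT INTO dim_date VALUES (10, 'Q1', 2024), (11, 'Q2', 2024);
INSERT INTO Sales VALUES
    (100, 10, 1, 501, 20.0),
    (101, 10, 2, 502, 35.0),
    (102, 11, 1, 501, 15.0);
""")

# Every dimension is exactly one join away from the fact table.
star_query = """
SELECT p.Category, d.Quarter, SUM(s.Sales_Amount)
FROM Sales AS s
JOIN dim_product AS p ON p.Product_ID = s.Product_ID
JOIN dim_date AS d ON d.Date_ID = s.Date_ID
GROUP BY p.Category, d.Quarter
ORDER BY p.Category, d.Quarter
"""
sales_by_category_quarter = conn.execute(star_query).fetchall()
```

Note that Category is stored as plain text on each product row. That repetition is the denormalization the notes describe: it costs some storage but keeps the query to one join per dimension.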

Features of Star Schema:
1. Central Fact Table:
Stores quantitative data (facts/measures) such as sales, revenue, profit, etc.
2. Dimension Tables:
Surround the fact table and provide descriptive information like product details, customer data,
location, and time.
3. Star-Like Structure:
The arrangement of one central fact table connected to multiple dimension tables forms a star
shape.
4. Denormalized Dimension Tables:
Dimension tables are usually denormalized (data stored in a flat structure) to simplify queries
and speed up performance.
5. Optimized for OLAP (Online Analytical Processing):
Designed for analytical queries, reporting, and data analysis, not for day-to-day transaction
processing.
6. Simple and Easy to Understand:
The straightforward structure makes it easier for business users and analysts to work with.
7. Fast Query Performance:
Reduces the number of joins needed, which speeds up query execution.
8. Supports Aggregation and Summarization:
Ideal for generating summaries, trends, and aggregated data for decision-making.

When to Use a Star Schema


1. Simple and Fast:
Use a star schema when you want an easy design that runs queries quickly.
2. Good for Small or Medium Data:
Best for smaller datasets where you need quick reports and simple analysis.
3. Fewer Tables to Join:
Since all dimension tables connect directly to the fact table, reports like “total sales by region” run very
fast.
4. Okay with Some Repeated Data:
If storing a bit of duplicate data is not a problem, the star schema works well.
5. Best for Simple Reports:
Perfect when you want speed over complexity, for example creating sales reports quickly without
complicated queries.

SNOWFLAKE SCHEMA
• A Snowflake Schema is another way to organize data in a data warehouse.
• It is similar to a star schema but more complex because the dimension tables are broken down
into smaller tables.
• This process is called normalization (splitting data into multiple related tables to remove
repetition).
• The structure looks like a snowflake because the main fact table connects to dimension tables,
and those dimension tables connect to other sub-dimension tables.

Structure
• In a snowflake schema, there is still a central fact table just like in a star schema.
• The fact table stores numerical data such as sales, revenue, or quantities, along with keys that link
to dimension tables.
• The dimension tables are further broken down (normalized) into smaller related tables.
• These smaller tables store detailed information and are linked in a hierarchy.
• As a result, instead of a single layer of dimensions around the fact table (as in a star), there are
multiple layers of related tables, giving the structure a snowflake-like shape.

Example:
• Fact table: Sales (with keys for Date_ID, Product_ID, Customer_ID, Store_ID).
• Dimensions:
o Product table links to a Category table and a Supplier table.
o Location table links to Country → State → City tables.
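
The Product to Category link from the example can be sketched with sqlite3 to show what normalization changes in practice. This is an illustrative sketch under assumptions: the column names and all data values are made up, and only one branch of the snowflake (Product and its Category sub-dimension) is modeled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Sub-dimension: each category name is stored exactly once.
CREATE TABLE Category (Category_ID INTEGER PRIMARY KEY, Category_Name TEXT);

-- Dimension: refers to the category by key instead of repeating its name.
CREATE TABLE Product (
    Product_ID INTEGER PRIMARY KEY,
    Product_Name TEXT,
    Category_ID INTEGER REFERENCES Category(Category_ID)
);

CREATE TABLE Sales (
    Sale_ID INTEGER PRIMARY KEY,
    Product_ID INTEGER REFERENCES Product(Product_ID),
    Sales_Amount REAL
);

-- Invented sample rows.
INSERT INTO Category VALUES (1, 'Stationery'), (2, 'Home');
INSERT INTO Product VALUES (1, 'Pen', 1), (2, 'Pencil', 1), (3, 'Lamp', 2);
INSERT INTO Sales VALUES (100, 1, 20.0), (101, 2, 5.0), (102, 3, 35.0);
""")

# Reaching the category now takes two joins instead of one.
snowflake_query = """
SELECT c.Category_Name, SUM(s.Sales_Amount)
FROM Sales AS s
JOIN Product AS p ON p.Product_ID = s.Product_ID
JOIN Category AS c ON c.Category_ID = p.Category_ID
GROUP BY c.Category_Name
ORDER BY c.Category_Name
"""
sales_by_category = conn.execute(snowflake_query).fetchall()

# Renaming a category touches a single row, however many products use it.
conn.execute("UPDATE Category SET Category_Name = 'Office Supplies' WHERE Category_ID = 1")
sales_after_rename = conn.execute(snowflake_query).fetchall()
```

The sketch shows both sides of the trade-off: the query needs an extra join to reach the category name, but a rename is a one-row update that is immediately reflected for every product.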

Features of a Snowflake Schema


1. Central Fact Table:
Stores measurable data (facts) such as sales, revenue, or quantities, with keys to link to
dimensions.
2. Normalized Dimension Tables:
Dimension tables are split into smaller related tables (normalized) to reduce redundancy.
3. Snowflake Shape:
Because dimensions branch out into sub-dimensions, the structure looks like a snowflake.
4. Less Data Redundancy:
Data is stored in smaller tables, so there is less repetition compared to a star schema.
5. Better Storage Efficiency:
Uses less storage space because data is not duplicated.
6. Complex Queries:
Queries involve more table joins, making them slower and more complex compared to star
schema.
7. Good for Hierarchical Data:
Works well when dimensions have multiple levels, like Country → State → City or Product
→ Category → Brand.
8. Used in OLAP Systems:
Mainly designed for analytical processing and reporting, not for transactional systems.

When to Use a Snowflake Schema


1. For Big and Complex Data:
Use a snowflake schema when you have a large amount of data with many details and levels
(like country → state → city).
2. Best for Large Companies:
Good for big companies that deal with huge data that changes often, such as customer data
or inventory.
3. Saves Storage Space:
Since it removes repeated data, it takes up less storage.
4. Handles Changes Easily:
If names or details change (like a region name), updates happen in one place and stay
correct everywhere.
5. Good for Detailed Reports:
Works well when your reports need to use lots of linked tables and detailed information.

Difference Between Star and Snowflake Schema


Feature | Star Schema | Snowflake Schema
Structure | One central fact table linked directly to dimension tables | Fact table linked to dimension tables, which are further split (normalized)
Query Performance | Faster queries (fewer joins) | Slower queries (more joins needed)
Design Complexity | Simple, easy to design and understand | More complex design with multiple levels of tables
Storage Requirement | Uses more storage due to repeated data | Uses less storage because repeated data is removed
Data Redundancy | Higher redundancy (duplicate data) | Lower redundancy
Ease of Maintenance | Easier to maintain | Harder to maintain due to complexity
Scalability | Good for small to medium datasets | Better for very large and complex datasets
Use Cases | Best for quick, simple queries and BI reporting | Best for detailed analysis with many hierarchies and large data
Updates/Changes | Harder to update (data repeated in many places) | Easier to update (changes happen in one place)

CONCEPT HIERARCHY IN DATA MINING


• A concept hierarchy is a systematic arrangement of data concepts, organized from general to
specific.
• It allows the data to be viewed at multiple levels of abstraction, facilitating a better understanding
and analysis of the dataset.
• In a concept hierarchy, higher-level concepts represent more general abstractions, while lower-
level concepts denote more specific details.
• These hierarchies are crucial in data mining for simplifying complex data, enabling effective data
analysis, and improving the efficiency of the mining process.

Types of Concept Hierarchies:


1. Schema-Based Hierarchy: Derived from the schema structure or domain knowledge. For
example, geographical hierarchy:
o Country > State > City > Neighborhood

2. Set-Group Hierarchy: Derived from the grouping of values of attributes. For example, time
hierarchy:
o Year > Quarter > Month > Week > Day

3. Attribute-Oriented Hierarchy: Uses attribute values for constructing hierarchies. For
example, product hierarchy:
o Category > Subcategory > Brand > Product
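
A concept hierarchy can be represented very simply as a mapping from each value to its parent at the next, more general level. The sketch below follows a Country > State > City hierarchy; the PARENT table, the generalize helper, and the sales figures are hypothetical and exist only to show how data moves from a specific level to a more general one.

```python
from collections import Counter

# Child -> parent links for a location hierarchy (City -> State -> Country).
PARENT = {
    "Kolkata": "West Bengal",
    "Mumbai": "Maharashtra",
    "Pune": "Maharashtra",
    "West Bengal": "India",
    "Maharashtra": "India",
}


def generalize(value, levels=1):
    """Climb the hierarchy by the given number of levels; stop at the top."""
    for _ in range(levels):
        value = PARENT.get(value, value)
    return value


# Invented city-level figures, viewed one level up (by state).
city_sales = {"Kolkata": 150, "Mumbai": 200, "Pune": 50}
state_sales = Counter()
for city, amount in city_sales.items():
    state_sales[generalize(city)] += amount
```

Calling generalize with levels=2 takes a city straight to its country, so the same data can be summarized at whichever level of abstraction an analysis needs.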

ONLINE ANALYTICAL PROCESSING (OLAP)


• OLAP stands for Online Analytical Processing.
• It is a technology/software used to analyze data from multiple perspectives and dimensions.
• OLAP collects data from multiple sources at the same time and allows fast processing.
• It provides a platform to gain insights from large databases stored in data warehouses or data
marts.
• It helps analysts study data from different angles to create effective business strategies.
• OLAP performs high-speed, multidimensional analysis on large volumes of data.
• It is widely used for data mining, business intelligence, financial analysis, budgeting, sales
forecasting, and other complex analytical tasks.

Characteristics of OLAP
1. Fast: OLAP acts as a bridge between the data warehouse and the front end, which improves
access to data and yields faster results.
2. Shared: OLAP operations such as drill-down and roll-up navigate between the various
dimensions of a multidimensional cube, making it an effective and efficient reporting system.
3. Data and Information: OLAP has the calculation power to handle complex queries and data.
It supports data visualization using graphs and charts.
4. Multidimensional Views: OLAP provides the ability to view data from multiple
perspectives. These views are often presented in the form of a data cube, which allows users
to easily drill down into data details.
5. Fast Query Performance: OLAP systems are designed to return query results quickly, even
when working with large volumes of data. This is achieved through optimized indexing and
pre-computation of data.
6. Aggregated Data: OLAP systems often work with aggregated data, which allows users to
analyze data at different levels of granularity. For example, you can analyze sales data by
region, by product category, or by time period.
7. Complex Calculations: OLAP supports complex calculations and data modeling. This
includes operations like trend analysis, forecasting, and statistical analysis.
8. Data Integration: OLAP tools integrate data from multiple sources, providing a unified view
of the organization's data. This integration ensures consistency and accuracy in reporting and
analysis.
9. User-Friendly Interfaces: OLAP tools often come with intuitive interfaces that allow users
to interact with the data easily, perform ad-hoc queries, and generate reports without needing
extensive technical knowledge.
10. Historical Data Analysis: OLAP systems are well-suited for analyzing historical data,
allowing businesses to identify trends, patterns, and anomalies over time.

Multidimensional Data Model


• A multidimensional data model is a model that allows data to be organized and viewed in
multiple dimensions, such as product, time, and location.
• It allows users to ask analytical questions across multiple dimensions, which helps reveal
market or business trends.
• OLAP and data warehousing use multidimensional databases.
• It represents data in the form of data cubes.
• Data cubes allow data to be modeled and viewed from many dimensions and perspectives.
• It is defined by dimensions and facts and is represented by a fact table. Facts are numerical
measures, and the fact table contains the names of the facts (measures) along with keys to
each of the related dimension tables.

Example of Multidimensional Data Model


Let us take the example of the data of a factory which sells products per quarter Q. Let us consider
the data according to item, time and location. Here is the table:

This data can be represented in the form of three dimensions conceptually, which is shown in the
image below:

OLAP CUBE
• An OLAP cube is also known as a multidimensional cube or data cube.
• An OLAP cube is a way of storing data so it can be looked at from many directions.
• It allows users to analyze data from different perspectives by organizing it into a cube structure.
• Data is arranged like a 3D box (cube) whose sides are called dimensions.
• This lets businesses study data from different views at the same time.
• It is very useful when working with large amounts of data.
• Example: You can check “product sales” by region and by quarter at the same time using a cube.

OLAP Cube operations


Consider the following OLAP cube for performing OLAP operations

1. Drill down:
• In drill-down operation, the less detailed data is converted into highly detailed data.
• It can be done by moving down in the concept hierarchy or by adding a new dimension.
• Example: the drill down operation on above cube, is performed by moving down in the
concept hierarchy of Time dimension (Quarter -> Month).

2. Roll up:
• It is just opposite of the drill-down operation. It performs aggregation on the OLAP cube.
• It can be done by climbing up in the concept hierarchy or by reducing the dimensions.
• Example: In the cube given, the roll-up operation is performed by climbing up in the concept
hierarchy of Location dimension (City -> Country).

3. Dice:
• It selects a sub-cube from the OLAP cube by applying selection criteria on two or more dimensions.
• Example: In the cube given, a sub-cube is selected by selecting following dimensions with
criteria:
Location = “Delhi” or “Kolkata”
Time = “Q1” or “Q2”
Item = “Car” or “Bus”

4. Slice:
• It fixes a single value on one dimension of the OLAP cube, which results in a new sub-cube.
• Example: In the cube given, Slice is performed on the dimension Time = “Q1”.

5. Pivot:
• It is also known as rotation operation as it rotates the current view to get a new view of the
representation.
• Example: In the sub-cube obtained after the slice operation, performing pivot operation
gives a new view of it.

Example OLAP Cube and its Operations
Product | Location | Time | Sales ($)
A | New York | Q1 | 50,000
A | New York | Q2 | 60,000
B | Los Angeles | Q1 | 70,000
B | Los Angeles | Q2 | 80,000

• Roll-up: Aggregate by product, location, or year.


• Drill-down: Break "Q1" into months (January, February, etc.).
• Slice: Extract only "New York" data.
• Dice: View sales for Product A in Q1 and Q2.
• Pivot: Swap "Time" and "Location" for better analysis.
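
The operations listed above can be tried out on the four rows of the example table. The sketch below stores the cube as a Python dictionary keyed by (product, location, time); the helper names roll_up, dice, slice_cube, and pivot are made up for this illustration and are not taken from any OLAP product. Drill-down is not shown because the table has no monthly data to break a quarter into.

```python
from collections import defaultdict

DIMS = ("product", "location", "time")

# The example table as a cube: one cell per (product, location, time).
cube = {
    ("A", "New York", "Q1"): 50_000,
    ("A", "New York", "Q2"): 60_000,
    ("B", "Los Angeles", "Q1"): 70_000,
    ("B", "Los Angeles", "Q2"): 80_000,
}


def roll_up(cells, keep):
    """Sum the measure over every dimension that is not listed in keep."""
    idx = [DIMS.index(d) for d in keep]
    totals = defaultdict(int)
    for key, value in cells.items():
        totals[tuple(key[i] for i in idx)] += value
    return dict(totals)


def dice(cells, **allowed):
    """Keep the cells whose coordinates fall inside the allowed value sets."""
    return {
        key: value
        for key, value in cells.items()
        if all(key[DIMS.index(d)] in values for d, values in allowed.items())
    }


def slice_cube(cells, dim, value):
    """Fix one dimension to a single value (the coordinate is kept in the key)."""
    return dice(cells, **{dim: {value}})


def pivot(cells, order):
    """Reorder the dimensions to view the same cells from another angle."""
    idx = [DIMS.index(d) for d in order]
    return {tuple(key[i] for i in idx): value for key, value in cells.items()}


by_product = roll_up(cube, ["product"])
new_york = slice_cube(cube, "location", "New York")
product_a = dice(cube, product={"A"}, time={"Q1", "Q2"})
swapped = pivot(cube, ["product", "time", "location"])
```

For this small table, rolling up to the product level gives 110,000 for Product A and 150,000 for Product B. One simplification to be aware of: a strict slice would drop the fixed dimension from the result, while this sketch keeps it in the key so that all helpers share one cell format.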
