DWM Unit 2 Notes
V SEMESTER
UNIT 2 NOTES
THREE TIER (MULTITIER) DATA WAREHOUSE ARCHITECTURE
Tier2 - Middle Tier:
• The middle tier is an OLAP server that is typically implemented using either a relational
OLAP (ROLAP) model or a multidimensional OLAP (MOLAP).
• For a user, this application tier presents an abstracted view of the database. This layer also
acts as a mediator between the end-user and the database.
• A relational OLAP (ROLAP) model is an extended relational DBMS that maps operations on
multidimensional data to standard relational operations.
• A multidimensional OLAP (MOLAP) model is a special-purpose server that directly
implements multidimensional data and operations.
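As a rough illustration of how a ROLAP server might map a multidimensional operation onto standard relational operations, the sketch below expresses "total sales by region" as a plain SQL GROUP BY. The `sales` table, its columns, and the sample rows are hypothetical, and Python's built-in sqlite3 module is used only as a stand-in relational engine.

```python
import sqlite3

# Hypothetical relational table holding data with two dimensions
# (region, quarter) and one measure (amount).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, quarter TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("North", "Q1", 100), ("North", "Q2", 150), ("South", "Q1", 200)],
)


def rollup_by_region(connection):
    """Aggregate away the quarter dimension using a standard GROUP BY."""
    rows = connection.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    )
    return dict(rows.fetchall())


print(rollup_by_region(conn))  # {'North': 250, 'South': 200}
```

A MOLAP server would instead keep such totals in a dedicated multidimensional structure rather than deriving them from relational tables.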
Enterprise Data Warehouse Components
1. Data sources
These can include internal systems such as ERP, CRM, and billing systems, as well as external
data sources such as social media feeds, market research, and website analytics.
3. Staging area
In the case of ETL, the staging area is where data is transformed before it reaches the EDW. It is a temporary
area where data is cleansed, de-duplicated, split, joined, and converted into a unified format before
loading into the warehouse.
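The staging steps described above (cleansing, de-duplication, conversion to a unified format) can be sketched as follows. The record fields, the two date formats, and the helper names `parse_date` and `stage` are assumptions made for illustration only.

```python
from datetime import datetime

# Hypothetical raw records from two sources with inconsistent formats.
raw_records = [
    {"customer": " Alice ", "date": "2024-01-05", "amount": "100"},
    {"customer": "alice", "date": "05/01/2024", "amount": "100"},  # same sale
    {"customer": "Bob", "date": "2024-01-06", "amount": "250"},
]


def parse_date(text):
    """Accept YYYY-MM-DD or DD/MM/YYYY and return the ISO form."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text}")


def stage(records):
    """Cleanse, unify, and de-duplicate records before loading."""
    seen, staged = set(), []
    for rec in records:
        row = (
            rec["customer"].strip().title(),  # cleanse the name
            parse_date(rec["date"]),          # unify the date format
            int(rec["amount"]),               # convert text to a number
        )
        if row not in seen:                   # drop exact duplicates
            seen.add(row)
            staged.append(row)
    return staged


print(stage(raw_records))
# [('Alice', '2024-01-05', 100), ('Bob', '2024-01-06', 250)]
```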
4. Storage layer
The data is finally loaded into the storage layer, the central repository where processed data
is stored. By this stage all transformations have been applied, so the data is loaded in its
final model. This layer includes relational databases, a database management system, and
additional storage for metadata.
5. Metadata module
Metadata is data about data. It tells users and administrators which subject or domain a piece
of information relates to. This module stores metadata such as the origin, meaning, and usage
of data. Metadata can be technical (e.g., source system) or business-oriented
(e.g., region of sales) and is managed separately.
6. Data marts
EDW can have a set of smaller subsections called data marts that are built specifically for a
particular subject area, business function, or group of users. For example, there can be a separate
data mart for marketing purposes and a data mart for a financial department.
7. Presentation layer
The final building block of an EDW comprises tools that give end users access to data. Also
known as the BI interface, this layer provides tools for data visualization, dashboards, reporting,
and access for analysis or machine learning tasks.
2. Choosing the Right Data Warehouse Architecture-
The data warehouse structure should match the business needs, data size, and number of sources.
The design must balance cost, complexity, and future growth.
3. Ensuring Data Governance and Quality-
Data governance and quality are critical to EDW implementation. Rules and processes should be
in place to keep data accurate, clean, and secure. High-quality data and proper management make
the system reliable.
4. Implementing Effective ETL Processes-
ETL (Extract, Transform, Load) involves collecting data from different sources, cleaning and
converting it into the correct format, and storing it in the warehouse. Well-designed ETL
processes reduce errors and duplicate data.
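A minimal sketch of the three ETL stages is shown below. The CSV content, the `orders` table, and the function names are all hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source: a CSV export with inconsistent casing and a repeated order.
SOURCE_CSV = "order_id,product,amount\n1,widget,10.5\n2,Gadget,20\n2,Gadget,20\n"


def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: fix types and casing, and keep one row per order ID."""
    cleaned = {}
    for row in rows:
        key = int(row["order_id"])
        cleaned[key] = (row["product"].strip().lower(), float(row["amount"]))
    return [(oid, prod, amt) for oid, (prod, amt) in sorted(cleaned.items())]


def load(rows, connection):
    """Load: store the cleaned rows in the warehouse table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, product TEXT, amount REAL)"
    )
    connection.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    connection.commit()


warehouse = sqlite3.connect(":memory:")
load(transform(extract(SOURCE_CSV)), warehouse)
print(warehouse.execute("SELECT * FROM orders").fetchall())
# [(1, 'widget', 10.5), (2, 'gadget', 20.0)]
```

Keeping the stages as separate functions makes it easier to test the transform step on its own, which is where most errors and duplicates are caught.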
Pros:
• High Security: The fact that the data is within your organization's firewall enhances security
and privacy to a very high level.
• Customization: Every little thing in your server is customizable, whether hardware, software,
or specific configuration to solve a unique problem.
• Performance: The on-premises EDWs can be configured for particular job types and that can
give the advantage of performance when querying.
Cons:
• High Upfront Cost: Hardware, software, technical setup, and maintenance are all expensive,
making it an expensive undertaking.
• Scalability Challenges: As data volume grows, scaling storage and processing capacity is
costly and resource-intensive.
• Limited Agility: Data needs may grow or shrink considerably over time, and the
infrastructure may need to be reworked to keep up.
2. Cloud Data Warehouse
• The cloud data warehouse is a powerful system offering a customized storage facility that can
be accessed from any internet-connected device.
• These solutions are hosted by cloud service providers such as Amazon Web Services (AWS),
Microsoft Azure, or Google Cloud Platform (GCP), which offer organizations a scalable and
accessible way of storing their data.
Pros:
• Scalability and Elasticity: A cloud platform allows rapid increase or decrease in the amount
of data storage and processing capacity required when the situation arises.
• Reduced Cost: It eliminates the need to invest in hardware and dedicated IT staff, and
lets the business pay for the service as it is used.
• Accessibility and Manageability: Data can be accessed from anywhere, even on the
move, which helps remote team members complete their tasks faster and collaborate
more effectively.
Cons:
• Security Considerations: Some companies might have concerns about entrusting sensitive
data to external servers.
• Vendor Lock-In: Migrating from one cloud provider's platform to another is a
highly complex and costly process.
• Network Dependence: Performance might be affected by connection speed and reliability.
Pros:
• Flexibility: Decide where each type of data is stored based on its security requirements
and storage cost.
• Scalability: Use the cloud to scale storage and processing for the data that is needed most
frequently, while keeping private or sensitive data on-premises.
• Phased Migration: Organizations that employ a hybrid deployment method can
perform cloud migration with a less disruptive step-by-step transition.
Cons:
• Complexity: Managing a hybrid environment requires additional expertise and coordination
between on-premises and cloud infrastructure.
• Potential Vendor Management: Although the cloud side can be scaled to match business
requirements, a hybrid setup means juggling different vendors for the on-premises and
cloud components, which can lead to more complications.
• Higher Costs: Cloud solutions can be less expensive than fully on-premises
alternatives, but integrating the two environments brings additional expenditure.
DATA MART
• A data mart is a specialized subset of a data warehouse focused on a specific functional area
or department within an organization.
• It provides a simplified and targeted view of data, addressing specific reporting and
analytical needs.
• A data mart is a storage component dedicated to a specific department of an
organization.
• It is a subset of the data stored in the data warehouse.
• A data mart focuses on one particular function of an organization, e.g. finance or
marketing, and is maintained by a single authority.
• Data Marts are small in size and are flexible, typically holding relevant data for a specific
group of users, such as sales, marketing, or finance.
• They are organized around specific subjects, such as sales, customer data, or product
information, and are structured, transformed, and optimized for efficient querying and
analysis within the domain.
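One simple way to picture a data mart as a subset of the warehouse is a view that exposes only one department's rows. This is only a sketch: the `warehouse_facts` table, the `marketing_mart` view, and the sample values are hypothetical, and a real data mart is often a physically separate store rather than a view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical warehouse table covering several departments.
conn.execute("CREATE TABLE warehouse_facts (department TEXT, metric TEXT, value REAL)")
conn.executemany(
    "INSERT INTO warehouse_facts VALUES (?, ?, ?)",
    [
        ("marketing", "campaign_spend", 5000.0),
        ("finance", "quarterly_revenue", 120000.0),
        ("marketing", "leads", 340.0),
    ],
)

# The "data mart" here is a view exposing only the marketing subset.
conn.execute(
    "CREATE VIEW marketing_mart AS "
    "SELECT metric, value FROM warehouse_facts WHERE department = 'marketing'"
)

mart_rows = conn.execute(
    "SELECT metric, value FROM marketing_mart ORDER BY metric"
).fetchall()
print(mart_rows)  # [('campaign_spend', 5000.0), ('leads', 340.0)]
```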
Types of Data Marts
3. Hybrid Data Mart –
• This type of Data Mart is created by extracting data from operational source or from data
warehouse.
• Path 1 reflects accessing data directly from external sources, and Path 2 reflects the
dependent data mart model.
STAR SCHEMA
• A star schema organizes data in a database to make analysis easier and faster.
• The fact table sits at the center and stores measurable data such as sales, revenue, or quantities.
• Surrounding the fact table are dimension tables, which provide descriptive details like product
information, customer details, or dates.
• The arrangement of a central fact table connected to multiple dimension tables forms a star-like
shape.
• When drawn, the fact table is in the center and dimension tables radiate around it, giving the
design a star-like appearance.
• A star schema is a common way to design data warehouses and data marts. It is optimized for fast
data retrieval and analysis rather than transaction processing.
Structure:
• Fact Table (Center):
o Contains quantitative, numeric data such as sales amount, revenue, quantity, or profit.
o Includes keys (foreign keys) that link to dimension tables.
o Stores measures (facts) that are analyzed.
• Dimension Tables (Surrounding):
o Contain descriptive attributes (text or categorical data) that give context to the facts.
o Examples: customer details, product names, time/date, location.
o Dimension tables are usually denormalized (data is stored in a simple, flat form) to
make queries faster.
Example:
In a retail store data warehouse:
• Fact table: Sales (fields: Sale_ID, Date_ID, Product_ID, Customer_ID, Store_ID, Sales_Amount).
• Dimension tables: Date, Product, Customer, Store.
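The retail example can be sketched in SQL as below. Only two of the dimensions are shown to keep it short, and the `Dim_`/`Fact_` table prefixes, the extra descriptive columns, and the sample rows are assumptions rather than part of the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: flat, descriptive attributes.
CREATE TABLE Dim_Product (Product_ID INTEGER PRIMARY KEY, Name TEXT, Category TEXT);
CREATE TABLE Dim_Date (Date_ID INTEGER PRIMARY KEY, Year INTEGER, Quarter TEXT);

-- Fact table: foreign keys to the dimensions plus the measure.
CREATE TABLE Fact_Sales (
    Sale_ID INTEGER PRIMARY KEY,
    Date_ID INTEGER REFERENCES Dim_Date(Date_ID),
    Product_ID INTEGER REFERENCES Dim_Product(Product_ID),
    Sales_Amount REAL
);

INSERT INTO Dim_Product VALUES (1, 'Laptop', 'Electronics'), (2, 'Desk', 'Furniture');
INSERT INTO Dim_Date VALUES (1, 2024, 'Q1'), (2, 2024, 'Q2');
INSERT INTO Fact_Sales VALUES (1, 1, 1, 1000.0), (2, 2, 1, 1500.0), (3, 1, 2, 300.0);
""")

# A typical star-schema query: one join per dimension, then aggregate.
query = """
SELECT p.Category, SUM(s.Sales_Amount)
FROM Fact_Sales AS s
JOIN Dim_Product AS p ON p.Product_ID = s.Product_ID
GROUP BY p.Category
ORDER BY p.Category
"""
sales_by_category = conn.execute(query).fetchall()
print(sales_by_category)  # [('Electronics', 2500.0), ('Furniture', 300.0)]
```

Because the category is stored directly in the product dimension, the query reaches it with a single join.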
Features of Star Schema:
1. Central Fact Table:
Stores quantitative data (facts/measures) such as sales, revenue, profit, etc.
2. Dimension Tables:
Surround the fact table and provide descriptive information like product details, customer data,
location, and time.
3. Star-Like Structure:
The arrangement of one central fact table connected to multiple dimension tables forms a star
shape.
4. Denormalized Dimension Tables:
Dimension tables are usually denormalized (data stored in a flat structure) to simplify queries
and speed up performance.
5. Optimized for OLAP (Online Analytical Processing):
Designed for analytical queries, reporting, and data analysis, not for day-to-day transaction
processing.
6. Simple and Easy to Understand:
The straightforward structure makes it easier for business users and analysts to work with.
7. Fast Query Performance:
Reduces the number of joins needed, which speeds up query execution.
8. Supports Aggregation and Summarization:
Ideal for generating summaries, trends, and aggregated data for decision-making.
SNOWFLAKE SCHEMA
• A Snowflake Schema is another way to organize data in a data warehouse.
• It is similar to a star schema but more complex because the dimension tables are broken down
into smaller tables.
• This process is called normalization (splitting data into multiple related tables to remove
repetition).
• The structure looks like a snowflake because the main fact table connects to dimension tables,
and those dimension tables connect to other sub-dimension tables.
Structure
• In a snowflake schema, there is still a central fact table just like in a star schema.
• The fact table stores numerical data such as sales, revenue, or quantities, along with keys that link
to dimension tables.
• The dimension tables are further broken down (normalized) into smaller related tables.
• These smaller tables store detailed information and are linked in a hierarchy.
• As a result, instead of a single layer of dimensions around the fact table (as in a star), there are
multiple layers of related tables, giving the structure a snowflake-like shape.
Example:
• Fact table: Sales (with keys for Date_ID, Product_ID, Customer_ID, Store_ID).
• Dimensions:
o Product table links to a Category table and a Supplier table.
o Location table links to Country → State → City tables.
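A small sketch of the Product → Category link is given below. The column names and sample rows are hypothetical, and only one branch of the snowflake is shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Sub-dimension: each category name is stored exactly once.
CREATE TABLE Category (Category_ID INTEGER PRIMARY KEY, Category_Name TEXT);

-- Dimension: references the sub-dimension instead of repeating its text.
CREATE TABLE Product (
    Product_ID INTEGER PRIMARY KEY,
    Product_Name TEXT,
    Category_ID INTEGER REFERENCES Category(Category_ID)
);

-- Fact table, reduced to one key and one measure for brevity.
CREATE TABLE Sales (
    Sale_ID INTEGER PRIMARY KEY,
    Product_ID INTEGER REFERENCES Product(Product_ID),
    Sales_Amount REAL
);

INSERT INTO Category VALUES (1, 'Electronics'), (2, 'Furniture');
INSERT INTO Product VALUES (1, 'Laptop', 1), (2, 'Phone', 1), (3, 'Desk', 2);
INSERT INTO Sales VALUES (1, 1, 1000.0), (2, 2, 600.0), (3, 3, 300.0);
""")

# Reaching the category name takes two joins (fact -> product -> category).
query = """
SELECT c.Category_Name, SUM(s.Sales_Amount)
FROM Sales AS s
JOIN Product AS p ON p.Product_ID = s.Product_ID
JOIN Category AS c ON c.Category_ID = p.Category_ID
GROUP BY c.Category_Name
ORDER BY c.Category_Name
"""
totals = conn.execute(query).fetchall()
print(totals)  # [('Electronics', 1600.0), ('Furniture', 300.0)]

# Renaming a category touches one row, however many products belong to it.
conn.execute(
    "UPDATE Category SET Category_Name = 'Consumer Electronics' WHERE Category_ID = 1"
)
```

The extra join is the price of normalization. The single-row update is its benefit.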
Aspect | Star Schema | Snowflake Schema
Use Cases | Best for quick, simple queries and BI reporting | Best for detailed analysis with many hierarchies and large data
Updates/Changes | Harder to update (data repeated in many places) | Easier to update (changes happen in one place)
2. Set-Group Hierarchy: Derived from the grouping of values of attributes. For example, time
hierarchy:
o Year > Quarter > Month > Week > Day
3. Attribute-Oriented Hierarchy: Uses attribute values for constructing hierarchies. For
example, product hierarchy:
o Category > Subcategory > Brand > Product
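A concept hierarchy can be treated as a mapping from a detailed value to its ancestor at each level. The sketch below does this for the time hierarchy. The Week level is omitted for brevity, and the function names and label formats are assumptions.

```python
from datetime import date


def time_path(d):
    """Return the value of a date at each level of the time hierarchy."""
    quarter = (d.month - 1) // 3 + 1
    return {
        "Year": str(d.year),
        "Quarter": f"{d.year}-Q{quarter}",
        "Month": d.strftime("%Y-%m"),
        "Day": d.isoformat(),
    }


def generalize(d, level):
    """Climb the hierarchy: map a day to its ancestor at the given level."""
    return time_path(d)[level]


print(generalize(date(2024, 5, 17), "Quarter"))  # 2024-Q2
print(generalize(date(2024, 5, 17), "Year"))     # 2024
```

A product hierarchy (Category > Subcategory > Brand > Product) would work the same way, except that the mapping comes from stored attribute values instead of being computed.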
Characteristics of OLAP
1. Fast: OLAP acts as a bridge between the data warehouse and the front end, improving
accessibility of data and yielding faster results.
2. Shared: OLAP operations such as drill-down and roll-up navigate between the dimensions of a
multidimensional cube, making it an effective and efficient reporting system.
3. Data and Information: OLAP has the calculation power to handle complex queries and data. It
supports data visualization using graphs and charts.
4. Multidimensional Views: OLAP provides the ability to view data from multiple
perspectives. These views are often presented in the form of a data cube, which allows users
to easily drill down into data details.
5. Fast Query Performance: OLAP systems are designed to return query results quickly, even
when working with large volumes of data. This is achieved through optimized indexing and
pre-computation of data.
6. Aggregated Data: OLAP systems often work with aggregated data, which allows users to
analyze data at different levels of granularity. For example, you can analyze sales data by
region, by product category, or by time period.
7. Complex Calculations: OLAP supports complex calculations and data modeling. This
includes operations like trend analysis, forecasting, and statistical analysis.
8. Data Integration: OLAP tools integrate data from multiple sources, providing a unified view
of the organization's data. This integration ensures consistency and accuracy in reporting and
analysis.
9. User-Friendly Interfaces: OLAP tools often come with intuitive interfaces that allow users
to interact with the data easily, perform ad-hoc queries, and generate reports without needing
extensive technical knowledge.
10. Historical Data Analysis: OLAP systems are well-suited for analyzing historical data,
allowing businesses to identify trends, patterns, and anomalies over time.
This data can be represented in the form of three dimensions conceptually, which is shown in the
image below:
OLAP CUBE
• An OLAP cube is also known as a multidimensional cube or data cube.
• An OLAP cube is a way of storing data so it can be looked at from many directions.
• It allows users to analyze data from different perspectives by organizing it into a cube structure.
• Data is arranged like a 3D box (cube) with different sides (called dimensions).
• This lets businesses study data from different views at the same time.
• It is very useful when working with large amounts of data.
• Example: You can check “product sales” by region and by quarter at the same time using a cube.
1. Drill down:
• In the drill-down operation, less detailed data is converted into more detailed data.
• It can be done by moving down in the concept hierarchy or by adding a new dimension.
• Example: The drill-down operation on the above cube is performed by moving down in the
concept hierarchy of the Time dimension (Quarter -> Month).
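Drill-down along the Time dimension can be sketched as below. The monthly figures are made up for illustration, and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical base data at month granularity: (quarter, month) -> sales.
monthly_sales = {
    ("Q1", "Jan"): 100, ("Q1", "Feb"): 120, ("Q1", "Mar"): 130,
    ("Q2", "Apr"): 90, ("Q2", "May"): 110, ("Q2", "Jun"): 150,
}


def quarterly_view(data):
    """Summarized view: one total per quarter."""
    totals = defaultdict(int)
    for (quarter, _month), amount in data.items():
        totals[quarter] += amount
    return dict(totals)


def drill_down(data, quarter):
    """Move down the hierarchy: show the months beneath one quarter."""
    return {month: amount for (q, month), amount in data.items() if q == quarter}


print(quarterly_view(monthly_sales))    # {'Q1': 350, 'Q2': 350}
print(drill_down(monthly_sales, "Q1"))  # {'Jan': 100, 'Feb': 120, 'Mar': 130}
```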
2. Roll up:
• It is the opposite of the drill-down operation. It performs aggregation on the OLAP cube.
• It can be done by climbing up in the concept hierarchy or by reducing the number of dimensions.
• Example: In the cube given, the roll-up operation is performed by climbing up in the concept
hierarchy of the Location dimension (City -> Country).
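Roll-up along the Location dimension amounts to grouping city-level values by their country and summing them. The cities, figures, and the `city_to_country` mapping below are illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical city-level sales and the City -> Country level of the hierarchy.
city_sales = {"Delhi": 400, "Kolkata": 250, "Toronto": 300, "Vancouver": 150}
city_to_country = {
    "Delhi": "India",
    "Kolkata": "India",
    "Toronto": "Canada",
    "Vancouver": "Canada",
}


def roll_up(sales, mapping):
    """Climb the hierarchy: aggregate each city's value into its country."""
    totals = defaultdict(int)
    for city, amount in sales.items():
        totals[mapping[city]] += amount
    return dict(totals)


print(roll_up(city_sales, city_to_country))  # {'India': 650, 'Canada': 450}
```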
3. Dice:
• It selects a sub-cube from the OLAP cube by selecting two or more dimensions.
• Example: In the cube given, a sub-cube is selected by selecting following dimensions with
criteria:
Location = “Delhi” or “Kolkata”
Time = “Q1” or “Q2”
Item = “Car” or “Bus”
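The dice criteria above can be applied as a filter over cube cells. The cell values below are hypothetical, and the cube is represented simply as a dictionary keyed by (location, time, item).

```python
# Hypothetical cube cells: (location, time, item) -> units sold.
cube = {
    ("Delhi", "Q1", "Car"): 10,
    ("Delhi", "Q3", "Car"): 7,
    ("Kolkata", "Q2", "Bus"): 4,
    ("Mumbai", "Q1", "Car"): 9,
    ("Kolkata", "Q1", "Truck"): 3,
}


def dice(cells, locations, times, items):
    """Keep only the cells that satisfy the criteria on all three dimensions."""
    return {
        key: value
        for key, value in cells.items()
        if key[0] in locations and key[1] in times and key[2] in items
    }


sub_cube = dice(cube, {"Delhi", "Kolkata"}, {"Q1", "Q2"}, {"Car", "Bus"})
print(sub_cube)
# {('Delhi', 'Q1', 'Car'): 10, ('Kolkata', 'Q2', 'Bus'): 4}
```

The result is still three-dimensional. Dicing narrows each dimension without removing any of them.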
4. Slice:
• It selects a single value on one dimension of the OLAP cube, which results in a new sub-cube.
• Example: In the cube given, Slice is performed on the dimension Time = “Q1”.
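A slice on Time = "Q1" can be sketched as fixing that dimension and dropping it from the cell keys. The cube contents are hypothetical.

```python
# Hypothetical cube cells: (location, time, item) -> units sold.
cube = {
    ("Delhi", "Q1", "Car"): 10,
    ("Delhi", "Q3", "Car"): 7,
    ("Kolkata", "Q2", "Bus"): 4,
    ("Mumbai", "Q1", "Car"): 9,
}


def slice_cube(cells, time):
    """Fix Time to one value, leaving a 2-D view keyed by (location, item)."""
    return {
        (location, item): value
        for (location, t, item), value in cells.items()
        if t == time
    }


q1_slice = slice_cube(cube, "Q1")
print(q1_slice)  # {('Delhi', 'Car'): 10, ('Mumbai', 'Car'): 9}
```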
5. Pivot:
• It is also known as rotation operation as it rotates the current view to get a new view of the
representation.
• Example: In the sub-cube obtained after the slice operation, performing pivot operation
gives a new view of it.
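Pivoting a two-dimensional view swaps its row and column dimensions without changing any values. The view below is a hypothetical slice with locations as rows and items as columns.

```python
# Hypothetical 2-D view: rows are locations, columns are items.
view = {
    "Delhi": {"Car": 10, "Bus": 2},
    "Kolkata": {"Car": 5, "Bus": 4},
}


def pivot(table):
    """Rotate the view so the column dimension becomes the row dimension."""
    rotated = {}
    for row_key, columns in table.items():
        for col_key, value in columns.items():
            rotated.setdefault(col_key, {})[row_key] = value
    return rotated


print(pivot(view))
# {'Car': {'Delhi': 10, 'Kolkata': 5}, 'Bus': {'Delhi': 2, 'Kolkata': 4}}
```

Applying `pivot` twice returns the original view, which is a quick check that no data is lost by the rotation.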
Example OLAP Cube and its Operations
Product | Location | Time | Sales ($)
A | New York | Q1 | 50,000
A | New York | Q2 | 60,000
B | Los Angeles | Q1 | 70,000
B | Los Angeles | Q2 | 80,000
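Using the four rows of the table above, a cube can be approximated as a list of cells and rolled up to any one dimension. The `total_by` helper and the `DIMENSIONS` tuple are hypothetical names used for this sketch.

```python
from collections import defaultdict

# The rows of the example table: (product, location, time, sales in $).
rows = [
    ("A", "New York", "Q1", 50_000),
    ("A", "New York", "Q2", 60_000),
    ("B", "Los Angeles", "Q1", 70_000),
    ("B", "Los Angeles", "Q2", 80_000),
]

DIMENSIONS = ("Product", "Location", "Time")


def total_by(cells, dimension):
    """Roll the cube up to a single dimension by summing over the others."""
    index = DIMENSIONS.index(dimension)
    totals = defaultdict(int)
    for cell in cells:
        totals[cell[index]] += cell[3]
    return dict(totals)


print(total_by(rows, "Product"))  # {'A': 110000, 'B': 150000}
print(total_by(rows, "Time"))     # {'Q1': 120000, 'Q2': 140000}
```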