Module 2

The document discusses data warehousing, including defining data warehouses and their characteristics. It also covers techniques used in data warehouses like data cleaning, integration, transformation, loading and refreshing. The document compares operational data stores and data warehouses and OLTP vs OLAP systems.

Uploaded by

Řöbîñ Ĺèé

Data Warehousing

A data warehouse gathers information from multiple sources and stores it under a unified schema that resides at
a single site.
Schema:
a representation of a plan or theory in the form of an outline or model.
• It is a collection of integrated databases designed to support a decision support system (DSS).
According to the definition by Inmon (often called the father of data warehousing):
• It is a collection of integrated, subject-oriented databases designed to support the DSS function, where each
unit of data is non-volatile and relevant to some moment in time.
Subject-oriented databases typically provide information on a topic (such as sales, inventory, or supply chain)
rather than on company operations.
Time-variant: time-variant keys (e.g., date, month, time) are typically present.
• Operational data store: An operational data store (ODS) is a type of database that is often used as an interim (not
final) logical area for a data warehouse. ODSs are designed to integrate data from multiple sources for lightweight
data processing activities such as operational reporting and real-time analysis.
Characteristic              Operational Data Store                       Data Warehouse
How is it built?            One application or subject area at a time    Typically multiple subject areas at a time
Area of support?            Day-to-day business operations               Decision support for managerial activities
Currency of data?           Up-to-the-minute, real time                  Typically represents a static point in time
Typical unit for analysis?  Small, manageable, transaction-level units   Large, unpredictable, variable units
Design focus?               High performance, limited flexibility        High flexibility, high performance
Data warehouses use diverse techniques and processes:
1. Data Cleanup:
Data cleaning is the process of preparing data for analysis
by removing or correcting incorrect,
incomplete, irrelevant, duplicate, or irregularly formatted
information.
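The cleanup step above can be sketched in a few lines of Python. This is only an illustration under assumed conditions: the field names (name, city) and the rule "a record without a name or city is incomplete" are hypothetical and not part of the original slides.

```python
def clean_records(records):
    """Drop incomplete rows, normalise formatting, and remove duplicates."""
    seen = set()
    cleaned = []
    for rec in records:
        # Irregular formatting: trim whitespace and unify capitalisation.
        name = (rec.get("name") or "").strip().title()
        city = (rec.get("city") or "").strip().title()
        if not name or not city:      # incomplete record: discard
            continue
        key = (name, city)
        if key in seen:               # duplicate after normalisation: discard
            continue
        seen.add(key)
        cleaned.append({"name": name, "city": city})
    return cleaned


raw = [
    {"name": " asha rao ", "city": "pune"},
    {"name": "Asha Rao", "city": "Pune"},    # duplicate once normalised
    {"name": "", "city": "Mumbai"},          # incomplete
    {"name": "Vikram Shah", "city": None},   # incomplete
    {"name": "meera iyer", "city": "CHENNAI"},
]
cleaned = clean_records(raw)
```

Here five raw rows shrink to two clean ones; a real warehouse would usually correct or flag bad rows rather than silently drop them.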
2. Data Integration:
Data integration is the process of combining data from different sources
into a unified view.
• The integration steps include cleansing, ETL mapping, and transformation.
• Data integration lets analytics tools produce powerful and cost-effective
business intelligence.
• In a data integration procedure, the client sends a request for data to
the master server.
• The master server gathers the required data from internal and external sources.
• It extracts the data from the sources and integrates it into a single data
set.
• The unified data set is then returned to the client for use.
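As a rough sketch of the integration procedure above, the function below plays the role of the master server: it reads two sources that name their fields differently and returns one unified data set. The source names (CRM, billing) and all field names are assumptions made for the example.

```python
def integrate(crm_rows, billing_rows):
    """Merge two differently shaped sources into one record per customer."""
    unified = {}
    for row in crm_rows:          # source 1 uses 'cust_id' and 'full_name'
        unified[row["cust_id"]] = {
            "id": row["cust_id"],
            "name": row["full_name"],
            "total_billed": 0.0,
        }
    for row in billing_rows:      # source 2 uses 'customer' and 'amount'
        rec = unified.setdefault(
            row["customer"],
            {"id": row["customer"], "name": None, "total_billed": 0.0},
        )
        rec["total_billed"] += row["amount"]
    # The unified view that is returned to the client.
    return sorted(unified.values(), key=lambda r: r["id"])


crm = [{"cust_id": 1, "full_name": "Asha"}, {"cust_id": 2, "full_name": "Meera"}]
billing = [
    {"customer": 1, "amount": 100.0},
    {"customer": 1, "amount": 50.0},
    {"customer": 3, "amount": 20.0},   # known to billing only
]
unified = integrate(crm, billing)
```

Customer 3 appears in only one source, so the unified record keeps a missing name instead of dropping the row.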
3. Data Transformation:
• It is the process of converting data from one format or structure
to another, and it is critical for data integration and
data management tasks.
• Data transformation can take different forms: we may have to
convert data types to suit the needs of our project, enrich or
aggregate the data, or remove invalid or duplicate data.
Generally, the technique consists of two stages:
• In the first stage, we should:
– Perform data discovery to identify the sources and data types.
– Determine the structure and the data changes that need to occur.
– Perform data mapping to define how individual fields are mapped, edited, inserted, filtered, and stored.
• In the second stage, we must:
– Extract data from the original source.
– The range of sources can vary from a connected device to a database or streaming sources,
including telemetry or log files from customers who use our web application.
– Telemetry is the automatic recording and transmission of data from remote or
inaccessible sources to an IT system in a different location for monitoring and analysis.
– Send the data to the target site, which may be a database or a data warehouse that manages structured and
unstructured records.
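The two stages can be illustrated with a minimal sketch: first a field mapping is defined, then it is applied to rows extracted from a source. The field names (ord_dt, amt, qty) and the chosen target types are hypothetical.

```python
from datetime import date

# Stage 1: the mapping, source field -> (target field, converter).
FIELD_MAP = {
    "ord_dt": ("order_date", date.fromisoformat),
    "amt": ("amount", float),
    "qty": ("quantity", int),
}


def transform(row):
    """Stage 2: apply the mapping to one extracted row."""
    out = {}
    for src, (dst, convert) in FIELD_MAP.items():
        out[dst] = convert(row[src])
    return out


# Rows as they might arrive from a text-based source: every value is a string.
source_rows = [{"ord_dt": "2023-12-05", "amt": "199.50", "qty": "2"}]
target_rows = [transform(r) for r in source_rows]
```

Keeping the mapping as data makes it easy to review what each field becomes before any row is touched.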
4. Loading Data
• Data loading is the task of copying and loading data from a file, folder, or application into
a database or similar application.
• This is done by copying digital data from the source and pasting or loading it into
a data warehouse or processing tool.
• Such data is often loaded in a format different from that of the source.
5. Data Refreshing: In this process, the data stored in the warehouse is
periodically refreshed so that it maintains its integrity.
• A data warehouse is modeled as a "data cube" in which every dimension
represents an attribute or a set of attributes of the data, and each cell
stores a value.
• Data is gathered from various sources, such as hospitals, banks, and other
organizations, and goes through a process called
ETL (Extract, Transform, Load).
• Extract: This step reads data from the various sources.
• Transform: This step transforms the data stored in the source databases into data cubes so
that it can be loaded into the warehouse.
• Load: This step writes the transformed data into the data warehouse.
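A toy version of the ETL steps above is sketched below, with the data cube represented as a dictionary whose keys are (city, month, product) tuples and whose cells hold total sales. The three dimensions, the source names, and the figures are all invented for illustration.

```python
from collections import defaultdict


def extract(sources):
    """Extract: read rows from every source in turn."""
    for source in sources:
        yield from source


def transform_to_cube(rows):
    """Transform: aggregate rows into cells keyed by the cube dimensions."""
    cube = defaultdict(float)
    for row in rows:
        cube[(row["city"], row["month"], row["product"])] += row["sales"]
    return cube


def load(cube, warehouse):
    """Load: write (or refresh) the cube cells in the warehouse."""
    warehouse.update(cube)
    return warehouse


store_a = [{"city": "Pune", "month": "Dec", "product": "phone", "sales": 100.0}]
store_b = [
    {"city": "Pune", "month": "Dec", "product": "phone", "sales": 50.0},
    {"city": "Mumbai", "month": "Dec", "product": "laptop", "sales": 75.0},
]
warehouse = load(transform_to_cube(extract([store_a, store_b])), {})
```

Running load again with a newer cube overwrites the affected cells, which is one simple way to picture periodic refreshing.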
Characteristics of Data Warehouse

• Subject oriented: Data are organized based on how the users
refer to them.
• Integrated: All inconsistencies in naming conventions
and value representations are removed.
• Nonvolatile: A data warehouse is kept separate from the
operational database, so frequent changes in the
operational database are not reflected in the data warehouse.
• Time variant: Data are not current but normally time series.
• Summarized: Operational data are mapped into a decision-usable
format.
Operational data is a form of strategic data that captures information
on the internal functions and processes of a business.
• Large volume: Time series data sets are normally quite large.
• Not normalized: DW data can be, and often are, redundant.
• Metadata: Data about data are stored.
• Data sources: Data come from internal and external unintegrated
operational systems.
OLTP vs. OLAP

• OLTP: Online Transaction Processing
– Describes processing at operational sites
• OLAP: Online Analytical Processing
– Describes processing at the warehouse
OLTP vs. Data Warehouse

• OLTP systems are tuned for known transactions and workloads,
while the workload is not known a priori in a data warehouse.
• Special data organization, access methods, and implementation
methods are needed to support data warehouse queries
(typically multidimensional queries)
– e.g., average amount spent on phone calls between 9 AM and
5 PM in Pune during the month of December
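The example query above touches three dimensions at once: place, time of day, and month. A small in-memory sketch, with invented call records and a hypothetical record layout, might look like this:

```python
from datetime import datetime

calls = [
    {"city": "Pune", "start": datetime(2023, 12, 4, 10, 15), "amount": 12.0},
    {"city": "Pune", "start": datetime(2023, 12, 9, 16, 40), "amount": 8.0},
    {"city": "Pune", "start": datetime(2023, 12, 9, 20, 5), "amount": 30.0},    # after 5 PM
    {"city": "Mumbai", "start": datetime(2023, 12, 4, 11, 0), "amount": 50.0},  # other city
    {"city": "Pune", "start": datetime(2023, 11, 30, 9, 30), "amount": 40.0},   # November
]


def average_amount(rows, city, month, start_hour, end_hour):
    """Average amount for calls that match every dimension filter."""
    selected = [
        r["amount"]
        for r in rows
        if r["city"] == city
        and r["start"].month == month
        and start_hour <= r["start"].hour < end_hour
    ]
    return sum(selected) / len(selected) if selected else None


result = average_amount(calls, "Pune", 12, 9, 17)
```

Only the first two records satisfy all three filters. A warehouse answers the same question over far larger volumes, which is why it needs special data organization and access methods.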
OLTP vs. Data Warehouse

OLTP                                              Warehouse (DSS)
Application oriented                              Subject oriented
Used to run business                              Used to analyze business
Detailed data                                     Summarized and refined
Current, up to date                               Snapshot data
Isolated data                                     Integrated data
Repetitive access                                 Ad hoc access
Clerical user                                     Knowledge user (manager)
Transaction throughput is the performance metric  Query throughput is the performance metric
Thousands of users                                Hundreds of users
Managed in entirety                               Managed by subsets
Data Warehouse Architecture

A data warehouse is a heterogeneous collection of different data sources organized under a unified
schema.
Data mining: the practice of analysing large databases in order to generate new information.
• There are two approaches for constructing a data warehouse: the top-down approach and the bottom-up approach.
• 1. Top-down approach:
• External Sources –
An external source is a source from which data is collected, irrespective of
the type of data.
• Data can be structured, semi-structured, or unstructured.
• Staging Area –
Since the data extracted from the external sources does not follow a
particular format, it needs to be validated before it is loaded into the data
warehouse.
• For this purpose, it is recommended to use an ETL tool.
• E (Extract): Data is extracted from the external data sources.
• T (Transform): Data is transformed into a standard format.
• L (Load): Data is loaded into the data warehouse after it has been transformed into the
standard format.
• Data Warehouse –
After cleansing, the data is stored in the data warehouse, which acts as the
central repository.
• It actually stores the metadata, and the actual data gets stored
in the data marts.
• Data Marts
• A data mart is also a part of the storage component.
• It stores the information of a particular function of an
organization, which is handled by a single authority.
• There can be as many data marts in an organization as there are
functions.
• We can also say that a data mart contains a subset of the data stored
in the data warehouse.
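The idea that a data mart holds the subset of warehouse data belonging to one function can be sketched as a simple filter. The row layout and the function names (sales, hr) are hypothetical.

```python
warehouse_rows = [
    {"function": "sales", "month": "Dec", "value": 120},
    {"function": "hr", "month": "Dec", "value": 7},
    {"function": "sales", "month": "Nov", "value": 95},
]


def build_data_mart(rows, function):
    """Return only the warehouse rows that belong to one business function."""
    return [row for row in rows if row["function"] == function]


sales_mart = build_data_mart(warehouse_rows, "sales")
```

Each function of the organization would get its own mart by calling the same filter with a different function name.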
Data Mining – The practice of analyzing the data present in the data warehouse
to find hidden patterns in the database or data warehouse
using data mining algorithms.
• In the top-down approach, the data warehouse acts as a central repository for the
complete organization, and data marts are created from it after the data
warehouse has been created.
Advantages of Top-Down Approach
• Since the data marts are created from the data warehouse, this approach provides a
consistent dimensional view of the data marts.
• This model is considered the strongest model for business changes.
• For this reason, big organizations prefer to follow this approach.
• Creating a data mart from the data warehouse is easy.
Disadvantages of Top-Down Approach
• The cost and the time taken in design and maintenance are high.
• 2. Bottom-up approach:
• First, the data is extracted from external sources (the same as
in the top-down approach).
• Then, the data goes through the staging area and is loaded into
data marts instead of the data warehouse.
• The data marts are created first and provide reporting
capability.
• Each data mart addresses a single business area.
• These data marts are then integrated into the data warehouse.
• The data marts are created first and provide a thin view for
analysis, and the data warehouse is created after all the data
marts have been created.
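In the bottom-up direction the marts come first and the warehouse is assembled from them. A minimal sketch, assuming each mart is just a list of rows for one business area (the area names and rows are invented):

```python
def build_warehouse(data_marts):
    """Integrate independently built data marts into one warehouse table."""
    warehouse = []
    for business_area, rows in data_marts.items():
        for row in rows:
            # Tag each row with the business area its data mart covers.
            warehouse.append({"business_area": business_area, **row})
    return warehouse


data_marts = {
    "sales": [{"month": "Dec", "value": 120}],
    "hr": [{"month": "Dec", "value": 7}],
}
warehouse = build_warehouse(data_marts)
```

The sketch assumes every mart already uses the same column names. In practice that consistency is exactly what the bottom-up approach does not guarantee.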
Advantages of Bottom-Up Approach
• As the data marts are created first, reports are generated quickly.
• More data marts can be added, and in this
way the data warehouse can be extended.
• Also, the cost and time taken in designing this model are comparatively low.
Disadvantages of Bottom-Up Approach
• This model is not as strong as the top-down approach, as the dimensional
view of the data marts is not consistent.
• We may not get a holistic view of the system.
ETL Process in Data Warehouse

• ETL stands for Extract, Transform, and Load.
• It is a process in which an ETL tool extracts the data from
various data source systems, transforms it in the staging area,
and finally loads it into the data warehouse system.
Extraction: The first step of the ETL process is extraction.
• Here, data is extracted from various sources, which can be in
various formats such as relational databases, NoSQL, XML, and
flat files, into the staging area.
• It is important to extract the data from the source systems
and store it in the staging area first, not directly in the
data warehouse, because the extracted data is in various
formats and may also be corrupted.
• Loading it directly into the data warehouse may therefore
damage the warehouse, and rollback will be much more difficult.
• This is one of the most important steps of the ETL process.
Transformation: The second step of the ETL process is transformation.
In this step, a set of rules or functions is applied to the extracted data to
convert it into a single standard format. It may involve the following
processes/tasks:
• Filtering – loading only certain attributes into the data warehouse.
• Cleaning – filling NULL values with default values, mapping
U.S.A, United States, and America to USA, etc.
• Joining – joining multiple attributes into one
(e.g., day, month, and year into DOB, DOJ, …).
• Splitting – splitting a single attribute into many
(e.g., Reg. No. into year of joining, UG/PG, campus, …).
• Sorting – sorting tuples on the basis of some attribute (generally the key
attribute).
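The five tasks above can be shown together on a couple of student records. The registration number layout ("2021-PG-PUNE"), the default value "UNKNOWN", and the field names are assumptions made for this sketch only.

```python
COUNTRY_ALIASES = {"U.S.A": "USA", "United States": "USA", "America": "USA"}


def transform(rows):
    out = []
    for row in rows:
        # Cleaning: default for NULL, then one spelling per country.
        country = row.get("country") or "UNKNOWN"
        country = COUNTRY_ALIASES.get(country, country)
        # Joining: day, month, and year become a single DOB attribute.
        dob = f'{row["year"]:04d}-{row["month"]:02d}-{row["day"]:02d}'
        # Splitting: one registration number becomes three attributes.
        year_of_joining, programme, campus = row["reg_no"].split("-")
        # Filtering: only the attributes listed here reach the warehouse.
        out.append({
            "reg_no": row["reg_no"],
            "country": country,
            "dob": dob,
            "year_of_joining": int(year_of_joining),
            "programme": programme,
            "campus": campus,
        })
    # Sorting: order the tuples on the key attribute.
    return sorted(out, key=lambda r: r["reg_no"])


rows = [
    {"reg_no": "2021-PG-PUNE", "country": "America", "day": 5, "month": 3,
     "year": 1999, "phone": "n/a"},
    {"reg_no": "2020-UG-GOA", "country": None, "day": 17, "month": 11,
     "year": 2001, "phone": "n/a"},
]
result = transform(rows)
```

The phone attribute is present in the source rows but filtered out of the result.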
Loading: The final step of the ETL process is loading.
• The transformed data is loaded into the data warehouse.
• Sometimes the data warehouse is updated by loading data
very frequently, and sometimes loading is done at
longer but regular intervals.
• The rate and period of loading depend solely on the
requirements and vary from system to system.
• The ETL process can also use pipelining: as soon as
some data is extracted, it can be transformed, and during that
period new data can be extracted.
• While the transformed data is being loaded into the data
warehouse, the already extracted data can be transformed.
• [Figure: pipelining of the ETL process]
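The pipelining idea can be approximated with Python generators: each row flows through extract, transform, and load before the next row is extracted, so no stage waits for the whole data set. This is a single-threaded sketch; a real pipeline would typically run the stages concurrently. The toy rows and the "multiply by 10" transformation are placeholders.

```python
def extract(rows, log):
    for row in rows:
        log.append(f"extract {row}")
        yield row


def transform(rows, log):
    for row in rows:
        log.append(f"transform {row}")
        yield row * 10           # placeholder transformation


def load(rows, log):
    warehouse = []
    for row in rows:
        log.append(f"load {row}")
        warehouse.append(row)
    return warehouse


log = []
warehouse = load(transform(extract([1, 2], log), log), log)
```

Inspecting log shows the stages interleaved per row (extract 1, transform 1, load 10, extract 2, ...) rather than three complete passes over the data.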

Commonly used ETL tools include Sybase, Oracle Warehouse Builder, CloverETL, and MarkLogic.
Business Process Management (BPM)
• Business process management (BPM) is a discipline that uses various tools
and methods to design, model, execute, monitor, and optimize business
processes.
• A business process coordinates the behavior of people, systems, information,
and things to produce business outcomes in support of a business strategy.
• BPM focuses on putting a consistent, automated process in place for routine
transactions and human interactions.
• It helps to reduce costs by decreasing waste and rework, increasing the
overall efficiency of the team.
• Organizations engaged in BPM can follow one of the BPM methodologies, such as
Six Sigma or Lean.
Types of Business Process Management Systems
• BPM systems can be categorized into 2 types based on the purpose
that they serve.
System-Centric BPM (or Integration-Centric BPM)
• These systems handle processes that primarily depend on existing
business systems (e.g., HRMS, CRM, ERP) without much human
involvement.
• System-centric BPM software has extensive integrations and API
access so that it can create fast and efficient business processes.
• An example of an integration-centric process is online banking,
which can include different software systems coming together.
Types of Business Process Management Systems
Human-Centric BPM
• Human-centric BPM considers the people first, supported by
various automation functions.
• These are processes that are executed by humans, and
automation does not easily replace them.
• These often have a lot of approvals and tasks performed by
individuals.
• Examples of human-centric processes include providing
customer service, handling complaints, on-boarding employees,
conducting e-commerce activities, and filing expense reports.
• HR activities – advertisement, sorting, selection, …
BPM Lifecycle: The 5 Steps in Business Process Management
Step 1: Design
– Business analysts review current business rules, interview the various stakeholders, and discuss desired outcomes
with management.
– The goal of this stage is to gain an understanding of the business rules and to ensure that the results are in
alignment with the organizational goals. (e.g., to bring students to campus)

Step 2: Model
– Modeling refers to identifying, defining, and making a representation of new processes to support the
current business rules for various stakeholders. (e.g., IoE implementation)

Step 3: Execute
– Execute the business process by testing it live with a small group of users first and then open it up to all
users.
– (reopening campus with only PG students first)
– In the case of automated workflows, artificially throttle the process to minimize errors.
Step 4: Monitor - Establish Key Performance Indicators (KPIs) and track
metrics against them using reports or dashboards.
– It’s essential to focus on the macro or micro indicators – an entire process vs.
process segments.
– (Challenges faced in handling hostel, classes, labs, …)
Step 5: Optimize - With an effective reporting system in place, an
organization can effectively steer operations toward optimization or
process improvement.
– (offering few courses in online mode)
– Business Process Optimization (BPO) is the redesign of processes to streamline
and improve process efficiency and strengthen the alignment of individual
business processes with a comprehensive strategy. (keep parents, students,
teachers happy without compromising quality)
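Step 4 (Monitor) amounts to comparing measured values with KPI targets. A minimal sketch follows; the KPI names, targets, and measured values are invented, and the status labels are only one possible convention.

```python
def kpi_report(targets, actuals):
    """Compare each measured metric with its KPI target."""
    report = {}
    for name, (target, higher_is_better) in targets.items():
        actual = actuals[name]
        met = actual >= target if higher_is_better else actual <= target
        report[name] = "on track" if met else "needs attention"
    return report


# Each target is (value, True if higher is better).
targets = {
    "tickets_closed_per_day": (40, True),
    "avg_approval_hours": (24, False),
}
actuals = {"tickets_closed_per_day": 46, "avg_approval_hours": 31}
report = kpi_report(targets, actuals)
```

Metrics flagged "needs attention" are the natural candidates for the optimization work in Step 5.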
Benefits of Implementing BPM
• BPM helps organizations move toward total digital
transformation and helps them realize bigger organizational
goals (e.g., using paperless tickets at INOX).
1. Improved business agility
2. Reduced costs and higher revenues
3. Higher efficiency
4. Better visibility
5. Compliance, safety, and security
Business Process Automation (BPA) vs. Business Process Management (BPM)

• BPA and BPM are related, and in some ways complementary,
but they are not the same.
• BPA is about automating processes, while BPM is about
managing processes, which may or may not involve
automation.
• All BPA can be considered a form of BPM, but not all BPM
includes BPA.
• Business process automation (BPA) refers to any method that is
used to streamline business processes through automation.
• It can use a wide variety of applications and tools that aim to
achieve gains in productivity, agility, efficiency, and compliance
in the day-to-day tasks of a business.
Business Process Automation (BPA)

• Common examples of processes that benefit from BPA include
employee onboarding and offboarding, vacation requests, HR
requests, expense filing, new equipment requests, and IT support
requests.
• Business processes that are suitable for automation are typically
those that are started by a specific triggering event.
• For example, the filing of an expense report may trigger a
predefined series of steps that ends when the employee receives
reimbursement in their bank account.
• BPM is a systematic approach to improve business processes.
• This generally leads to a happier, more productive workforce, which
tends to result in happier customers, higher revenues, and lower
costs.
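The expense-report example of a triggering event can be sketched as a lookup from event name to a predefined list of steps. The event name, the step names, and the payload layout are all hypothetical.

```python
# Each triggering event maps to a predefined series of steps.
WORKFLOWS = {
    "expense_report_filed": [
        "check receipts",
        "manager approval",
        "finance review",
        "reimburse employee",
    ],
}


def handle_event(event_name, payload):
    """Run the predefined steps for an event and return what was done."""
    steps = WORKFLOWS.get(event_name, [])   # unknown events trigger nothing
    return [f"{step} for {payload['employee']}" for step in steps]


history = handle_event("expense_report_filed", {"employee": "Asha"})
```

The process ends with the reimbursement step, matching the example above; steps that need human approval would pause here in a real BPA tool.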
What Is a BPM Methodology?

• A BPM methodology is the way our organization approaches
our processes.
• It defines not only the steps we follow when optimizing or
building a process but also who will be responsible for
achieving the expected gains.
• The Vital Background Questions of BPM Methodology
• Who does BPM?
• Are we bringing in outside consultants or only using internal
process owners?
• How will we train and educate these people?
• Who do these consultants report to?
• How much authority does the BPM team have to change
things?
• Are all of their recommendations implemented automatically?
• If not, under what circumstances are they accepted or
rejected?
• What are the primary goals for BPM?
• Cost reduction, time saved, greater clarity, happier customers?
• Who decides the primary outcome for each process?
• How will BPM efforts be evaluated?
• What primary and secondary metrics will be used to judge
their success?
• Which processes are addressed first?
• How do we find where the greatest impact might be?
• How often are business processes analyzed?
• Are they on a regular rotation, or only when certain metrics dip
below acceptable levels?
Basic BPM Methodology

1. See the Current Condition – How is it done now? Who is
responsible for it? What works? What doesn’t?
2. Identify Possible Changes – What limitations exist? What is the
best “could-be” process?
3. Design a New Process – Does it need to be digitized? What
systems need to be linked? Will a BPM platform help?
4. Test for Improvements – What key metrics will show what has
changed for the better or worse?
5. Monitor and Repeat – Is the process optimized, or can it be
even better?
DMEMO

This is one of the most popular and best-known methodologies in the BPM world.
• Design – How are things done to complete a process?
• Model – What improvements can we make?
• Execute – How do adjustments actually work in-flight?
• Monitor – Is the process actually getting better?
• Optimize – What more can be done?
DMAIC

We may recognize this methodology from Six Sigma.
• Define – Define the problem.
• Measure – Collect data on the current process.
• Analyze – Interpret the data and make suggestions.
• Improve – Make changes to the process.
• Control – Reduce deviations and make further improvements.
Data Analytics Lifecycle

• Phase 1: Discovery
• Phase 2: Data Preparation
• Phase 3: Model Planning
• Phase 4: Model Building
• Phase 5: Communicate Results
• Phase 6: Operationalize
– The data analytic lifecycle is designed for Big Data problems and data
science projects
– The cycle is iterative to portray a real project
– Work can return to earlier phases as new information is uncovered