Module 2
Data Warehousing
A data warehouse gathers information from multiple sources and stores it under a unified schema at a single site.
Schema: a representation of a plan or theory in the form of an outline or model.
• A data warehouse is a collection of integrated databases designed to support a decision support system (DSS).
According to the definition by Inmon (the father of data warehousing):
• It is a collection of integrated, subject-oriented databases designed to support the DSS function, where each
unit of data is non-volatile and relevant to some moment in time.
• Subject-oriented: the databases provide information on a topic (such as sales, inventory, or supply chain)
rather than on company operations.
• Time-variant: time-variant keys (e.g., date, month, time) are typically present.
• Operational Data Store: An operational data store (ODS) is a type of database that is often used as an interim
(not final) logical area for a data warehouse. ODSs are designed to integrate data from multiple sources for lightweight
data processing activities such as operational reporting and real-time analysis.
Characteristic | Operational Data Store | Data Warehouse
How is it built? | One application or subject area at a time | Typically multiple subject areas at a time
Area of support? | Day-to-day business operations | Decision support for managerial activities
Currency of data? | Up-to-the-minute, real time | Typically represents a static point in time
Typical unit for analysis? | Small, manageable, transaction-level units | Large, unpredictable, variable units
Design focus? | High performance, limited flexibility | High flexibility, high performance
Data warehouses use diverse techniques and processes:
1. Data Cleanup:
Data cleaning is the process of preparing data for analysis
by removing or correcting incorrect, incomplete, irrelevant,
duplicate, or irregularly formatted records.
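A minimal sketch of the cleanup step described above, written in Python. The record layout (`name`, `email`) and the rule that an email must contain `@` are assumptions made only for illustration; real cleaning rules depend on the data.

```python
def clean_records(records):
    """Drop incomplete or duplicate rows and normalize formatting."""
    cleaned = []
    seen_emails = set()
    for rec in records:
        # Normalize irregular formatting (extra spaces, mixed case).
        name = (rec.get("name") or "").strip().title()
        email = (rec.get("email") or "").strip().lower()
        if not name or "@" not in email:
            continue  # incomplete or incorrect record
        if email in seen_emails:
            continue  # duplicate record
        seen_emails.add(email)
        cleaned.append({"name": name, "email": email})
    return cleaned


raw = [
    {"name": "  alice SMITH ", "email": "Alice@Example.com"},
    {"name": "Alice Smith", "email": "alice@example.com "},  # duplicate
    {"name": "Bob", "email": None},                          # incomplete
    {"name": "", "email": "carol@example.com"},              # incomplete
    {"name": "dave jones", "email": "dave@example.com"},
]
cleaned = clean_records(raw)
print(cleaned)
```

Only the first Alice record and the Dave record survive; the others are dropped as duplicate or incomplete.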
2. Data Integration:
Data integration is the process of combining data from different sources
into a unified view.
• The integration steps include cleansing, ETL mapping, and transformation.
• Data integration allows analytics tools to produce effective, low-cost
business intelligence.
• In a typical data integration procedure, the client sends a request for data to
the master server.
• The master server gathers the required data from internal and external sources.
• It extracts the data from those sources and integrates it into a single data set.
• The integrated data set is then returned to the client for use.
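The integration procedure above can be sketched as follows. The two sources (`crm_rows`, `billing_rows`) and their field names are hypothetical; the point is only that differently shaped sources end up in one unified view keyed by a shared identifier.

```python
def integrate(crm_rows, billing_rows):
    """Merge two sources into one unified view keyed by customer_id."""
    unified = {}
    for row in crm_rows:
        unified[row["customer_id"]] = {
            "customer_id": row["customer_id"],
            "name": row["name"],
            "total_billed": 0.0,
        }
    for row in billing_rows:
        # The billing source calls the key "cust"; map it onto customer_id.
        entry = unified.setdefault(
            row["cust"],
            {"customer_id": row["cust"], "name": None, "total_billed": 0.0},
        )
        entry["total_billed"] += row["amount"]
    return sorted(unified.values(), key=lambda r: r["customer_id"])


crm_rows = [{"customer_id": 1, "name": "Alice"}, {"customer_id": 2, "name": "Bob"}]
billing_rows = [
    {"cust": 1, "amount": 40.0},
    {"cust": 1, "amount": 10.5},
    {"cust": 3, "amount": 5.0},  # known to billing only
]
unified_view = integrate(crm_rows, billing_rows)
print(unified_view)
```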
3. Data Transformation:
• It is the process of converting data from one format or structure
to another and is critical for data integration and
data management tasks.
• Data transformation can take several forms: depending on the needs of the
project, we may change data types, enrich or aggregate the data,
or remove invalid or duplicate records.
Generally, the technique consists of 2 stages:
• In the first stage, we should:
– Perform data discovery to identify the sources and data types.
– Determine the structure and the data changes that need to occur.
– Perform data mapping to define how individual fields are mapped, edited, inserted, filtered, and stored.
• In the second stage, we must:
– Extract data from the original source. The source can range from a connected device to a database or
streaming sources, including telemetry or log files from clients who use our web application.
– Telemetry is the automatic recording and transmission of data from remote or
inaccessible sources to an IT system in a different location for monitoring and analysis.
– Send the data to the target site, which may be a database or a data warehouse that manages structured and
unstructured records.
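A minimal sketch of the mapping idea from the first stage, applied in the second. The source field names (`cust_nm`, `amt`, `dt`), the target names, and the date format are all assumptions made for illustration.

```python
from datetime import datetime

# Hypothetical field map: source field -> (target field, converter).
FIELD_MAP = {
    "cust_nm": ("customer_name", str.strip),
    "amt": ("amount", float),
    "dt": ("order_date",
           lambda s: datetime.strptime(s, "%d/%m/%Y").date().isoformat()),
}


def transform(row):
    """Rename each field and convert its value to the target type."""
    return {dst: convert(row[src]) for src, (dst, convert) in FIELD_MAP.items()}


source_rows = [
    {"cust_nm": " Alice ", "amt": "19.50", "dt": "03/02/2024"},
    {"cust_nm": "Bob", "amt": "-1", "dt": "04/02/2024"},  # invalid amount
]
# Map every row, then filter out rows with a negative amount.
transformed = [t for t in map(transform, source_rows) if t["amount"] >= 0]
print(transformed)
```

Keeping the mapping in a single table makes the "mapped, edited, filtered" decisions from the first stage visible in one place.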
4. Loading Data
• Data loading is the task of copying data from a file, folder, or application and loading it into
a database or similar application.
• This is done by copying digital data from the source and pasting or loading it into
a data warehouse or processing tool.
• The data is often loaded in a format different from the one used at the source.
5. Data Refreshing: In this process, data stored in the warehouse is
periodically refreshed so that it maintains its integrity.
• A data warehouse is modeled as a “data cube” in which every dimension
represents an attribute or a set of attributes of the data, and each cell
stores a value.
• Data is gathered from various sources such as hospitals, banks, and
other organizations and goes through a process called
ETL (Extract, Transform, Load).
• Extract: reads data from the various sources.
• Transform: converts the data stored in databases into data cubes so
that it can be loaded into the warehouse.
• Load: writes the transformed data into the data warehouse.
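The data cube idea above can be sketched with a small dictionary-based cube. This is only an illustration: the dimensions (`region`, `quarter`), the measure (`sales`), and the use of `"ALL"` as a roll-up marker are assumptions, not part of any particular warehouse product.

```python
from collections import defaultdict
from itertools import product


def build_cube(facts, dimensions, measure):
    """Aggregate a measure into cells keyed by one value per dimension.

    "ALL" in a position means the cell is rolled up over that dimension.
    """
    cube = defaultdict(float)
    for fact in facts:
        # For each dimension, contribute to the actual value and to "ALL".
        options = [(fact[d], "ALL") for d in dimensions]
        for cell in product(*options):
            cube[cell] += fact[measure]
    return dict(cube)


facts = [
    {"region": "North", "quarter": "Q1", "sales": 100.0},
    {"region": "North", "quarter": "Q2", "sales": 150.0},
    {"region": "South", "quarter": "Q1", "sales": 80.0},
]
cube = build_cube(facts, ["region", "quarter"], "sales")
print(cube[("North", "ALL")])  # sales for North across all quarters
print(cube[("ALL", "Q1")])     # sales for Q1 across all regions
```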
Characteristics of Data Warehouse
• External Sources –
An external source is a source from which data is collected, irrespective of
the type of data.
• Data can be structured, semi-structured, or unstructured.
• Staging Area –
Because the data extracted from external sources does not follow a
particular format, it must be validated before it is loaded into the data
warehouse.
• For this purpose, an ETL tool is recommended.
• E (Extract): Data is extracted from the external data source.
• T (Transform): Data is transformed into the standard format.
• L (Load): Data is loaded into the data warehouse after being transformed into the
standard format.
Data Warehouse Architecture
• Data Warehouse –
After cleansing, the data is stored in the data warehouse, which acts as the
central repository.
• It holds the integrated data together with its metadata; subject-specific
data is then served from the data marts.
• Data Marts
• A data mart is also part of the storage component.
• It stores the information of a particular function of an
organization and is handled by a single authority.
• An organization can have as many data marts as it has
functions.
• We can also say that a data mart contains a subset of the data stored
in the data warehouse.
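One way to picture a data mart as a subset of the warehouse is a filtered view over a warehouse table. The sketch below uses an in-memory SQLite database; the table `warehouse_facts`, the view `sales_mart`, and the `department` column are hypothetical names chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE warehouse_facts (department TEXT, item TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO warehouse_facts VALUES (?, ?, ?)",
    [
        ("sales", "laptop", 1200.0),
        ("sales", "mouse", 25.0),
        ("hr", "training", 300.0),
    ],
)
# The mart exposes only the rows that belong to one business function.
conn.execute(
    "CREATE VIEW sales_mart AS "
    "SELECT item, amount FROM warehouse_facts WHERE department = 'sales'"
)
sales_mart = conn.execute(
    "SELECT item, amount FROM sales_mart ORDER BY item"
).fetchall()
print(sales_mart)
```

A real data mart is often a separate physical store rather than a view; the view is used here only because it keeps the example short.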
Data Mining – The practice of analyzing the data present in the data warehouse
to find hidden patterns in the database or data warehouse
using data mining algorithms.
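As a small illustration of finding a hidden pattern, the sketch below counts which pairs of items appear together often. Pair counting is just one simple technique picked for the example; the slides do not name a specific algorithm, and the transactions and the `min_support` threshold are made up.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Return item pairs that occur together in at least min_support baskets."""
    counts = Counter()
    for basket in transactions:
        # Sort so that each pair is always counted under the same key.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


transactions = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "jam"],
    ["bread", "milk", "jam"],
]
pairs = frequent_pairs(transactions, min_support=3)
print(pairs)
```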
Top-Down Approach:
• In this approach, the data warehouse acts as a central repository for the
complete organization, and data marts are created from it after the data
warehouse has been created.
Advantages of Top-Down Approach
• Because the data marts are created from the data warehouse, they provide a
consistent dimensional view of the data.
• This model is considered the strongest for handling business changes,
so big organizations prefer this approach.
• Creating a data mart from the data warehouse is easy.
Disadvantages of Top-Down Approach
• The cost and time taken to design and maintain it are high.
Bottom-Up Approach:
• First, the data is extracted from external sources (as in the
top-down approach).
• Then, the data goes through the staging area and is loaded into
data marts instead of the data warehouse.
• The data marts are created first and provide reporting
capability.
• Each data mart addresses a single business area.
• These data marts are then integrated into the data warehouse.
• Because the data marts are created first, they provide a thin view for
analysis; the data warehouse is created only after the data
marts are complete.
Advantages of Bottom-Up Approach
• Because the data marts are created first, reports can be generated
quickly.
• More data marts can be added over time, and in this
way the data warehouse can be extended.
• The cost and time taken to design this model are comparatively
low.
Disadvantages of Bottom-Up Approach
• This model is not as strong as the top-down approach because the dimensional
view of the data marts is not consistent.
• We may not get a holistic view of the system.
ETL Process in Data Warehouse
• ETL stands for Extract, Transform, and Load.
• It is a process in which an ETL tool extracts the data from
various source systems, transforms it in the staging area,
and finally loads it into the data warehouse system.
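The three ETL stages can be sketched end to end as below. This is a toy example under stated assumptions: the source is an inline CSV string, the staging step simply skips rows whose amount is not numeric, and an in-memory SQLite table named `payments` stands in for the warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source data; the second row has an invalid amount.
SOURCE_CSV = "id,name,amount\n1, alice ,10.5\n2,bob,abc\n3,carol,4\n"


def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: validate and convert rows to the standard format."""
    staged = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # reject rows that fail validation
        staged.append((int(row["id"]), row["name"].strip().title(), amount))
    return staged


def load(conn, rows):
    """Load: write the transformed rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payments (id INTEGER, name TEXT, amount REAL)"
    )
    with conn:
        conn.executemany("INSERT INTO payments VALUES (?, ?, ?)", rows)


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(SOURCE_CSV)))
result = conn.execute("SELECT id, name, amount FROM payments ORDER BY id").fetchall()
print(result)
```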
• The ETL process can also use pipelining: as soon as
some data is extracted, it can be transformed, and during that
period new data can be extracted.
• While the transformed data is being loaded into the data
warehouse, the data already extracted can be transformed.
[Figure: pipelining of the ETL process]
Commonly used ETL tools include Sybase, Oracle Warehouse Builder, CloverETL, and MarkLogic.
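The pipelining idea described above can be sketched with one thread per stage connected by queues, so that a later stage can work on earlier records while extraction continues. The stage functions, the `DONE` marker, and the "multiply by 10" transformation are placeholders chosen for illustration; production ETL tools schedule their stages in their own ways.

```python
import queue
import threading

DONE = object()  # marker telling the next stage that no more data will come


def extract(source, out_q):
    for record in source:
        out_q.put(record)  # hand over each record as soon as it is read
    out_q.put(DONE)


def transform(in_q, out_q):
    while (record := in_q.get()) is not DONE:
        out_q.put(record * 10)  # placeholder transformation
    out_q.put(DONE)


def load(in_q, warehouse):
    while (record := in_q.get()) is not DONE:
        warehouse.append(record)


extracted, transformed = queue.Queue(), queue.Queue()
warehouse = []
stages = [
    threading.Thread(target=extract, args=([1, 2, 3], extracted)),
    threading.Thread(target=transform, args=(extracted, transformed)),
    threading.Thread(target=load, args=(transformed, warehouse)),
]
for stage in stages:
    stage.start()
for stage in stages:
    stage.join()
print(warehouse)
```

Because each stage is a single thread reading from a first-in, first-out queue, the records reach the warehouse in their original order.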
Business Process Management (BPM)
• Business process management (BPM) is a discipline that uses various tools
and methods to design, model, execute, monitor, and optimize business
processes.
• A business process coordinates the behavior of people, systems, information,
and things to produce business outcomes in support of a business strategy.
• BPM focuses on putting a consistent, automated process in place for routine
transactions and human interactions.
• It helps to reduce costs by decreasing waste and rework, increasing the
overall efficiency of the team.
• Organizations engaged in BPM can follow one of the BPM methodologies, such as
Six Sigma or Lean.
Types of Business Process Management Systems
• BPM systems can be categorized into 2 types based on the purpose
that they serve.
System-Centric BPM (or Integration-Centric BPM)
• These systems handle processes that primarily depend on existing
business systems (e.g., HRMS, CRM, ERP) without much human
involvement.
• System-centric BPM software has extensive integrations and API
access so that it can create fast and efficient business processes.
• An example of an integration-centric process is online banking,
which can include different software systems coming together.
Human-Centric BPM
• Human-centric BPM considers the people first, supported by
various automation functions.
• These are processes that are executed by humans, and
automation does not easily replace them.
• These often have a lot of approvals and tasks performed by
individuals.
• Examples of human-centric processes include providing
customer service, handling complaints, on-boarding employees,
conducting e-commerce activities, and filing expense reports.
• HR activities – advertisement, sorting, selection, …
BPM Lifecycle: The 5 Steps in Business Process Management
Step 1: Design
– Business analysts review current business rules, interview the various stakeholders, and discuss desired outcomes
with management.
– The goal of this stage is to understand the business rules and ensure that the results align
with the organizational goals (e.g., to bring students to campus).
Step 2: Model
– Modeling refers to identifying, defining, and representing new processes to support the
current business rules for various stakeholders (e.g., IoE implementation).
Step 3: Execute
– Execute the business process by testing it live with a small group of users first and then open it up to all
users.
– (reopening campus with only PG students first)
– In the case of automated workflows, artificially throttle the process to minimize errors.
Step 4: Monitor - Establish Key Performance Indicators (KPIs) and track
metrics against them using reports or dashboards.
– It’s essential to focus on the macro or micro indicators – an entire process vs.
process segments.
– (Challenges faced in handling hostel, classes, labs, …)
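Tracking metrics against KPIs, as in Step 4, can be sketched as a simple comparison of measured values with targets. The KPI names, target values, and the "max"/"min" convention below are invented for illustration.

```python
def kpi_report(metrics, targets):
    """Compare each measured metric with its KPI target."""
    report = {}
    for name, (kind, target) in targets.items():
        value = metrics[name]
        # "max" KPIs must stay at or below the target, "min" at or above it.
        ok = value <= target if kind == "max" else value >= target
        report[name] = "on track" if ok else "needs attention"
    return report


targets = {
    "avg_approval_hours": ("max", 24),      # hypothetical KPI
    "requests_completed_pct": ("min", 95),  # hypothetical KPI
}
metrics = {"avg_approval_hours": 30, "requests_completed_pct": 97}
report = kpi_report(metrics, targets)
print(report)
```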
Step 5: Optimize - With an effective reporting system in place, an
organization can effectively steer operations toward optimization or
process improvement.
– (offering few courses in online mode)
– Business Process Optimization (BPO) is the redesign of processes to streamline
and improve process efficiency and strengthen the alignment of individual
business processes with a comprehensive strategy. (keep parents, students,
teachers happy without compromising quality)
Benefits of Implementing BPM
• BPM helps organizations move toward total digital
transformation and helps them realize bigger organizational
goals (e.g., using paperless tickets in INOX).
1. Improved Business Agility.
2. Reduced Costs and Higher Revenues.
3. Higher Efficiency.
4. Better Visibility.
5. Compliance, Safety, and Security
Business Process Automation (BPA) vs Business Process Management (BPM)
Data Analytics Lifecycle
• Phase 1: Discovery
• Phase 2: Data Preparation
• Phase 3: Model Planning
• Phase 4: Model Building
• Phase 5: Communicate Results
• Phase 6: Operationalize
– The data analytics lifecycle is designed for Big Data problems and data
science projects.
– The cycle is iterative, to reflect how a real project proceeds.
– Work can return to earlier phases as new information is uncovered.