Oracle Database Data Warehousing Guide
10g Release 1 (10.1)
December 2003
Oracle Database Data Warehousing Guide, 10g Release 1 (10.1)
Part No. B10736-01
Copyright © 2001, 2003 Oracle Corporation. All rights reserved.
Primary Author: Paul Lane
Contributing Authors: Viv Schupmann, Ingrid Stuart (Change Data Capture)
Contributors: Patrick Amor, Hermann Baer, Mark Bauer, Subhransu Basu, Srikanth Bellamkonda, Randy Bello, Tolga Bozkaya, Lucy Burgess, Rushan Chen, Benoit Dageville, John Haydu, Lilian Hobbs, Hakan Jakobsson, George Lumpkin, Alex Melidis, Valarie Moore, Cetin Ozbutun, Ananth Raghavan, Jack Raitto, Ray Roccaforte, Sankar Subramanian, Gregory Smith, Murali Thiyagarajan, Ashish Thusoo, Thomas Tong, Jean-Francois Verrier, Gary Vincent, Andreas Walter, Andy Witkowski, Min Xiao, Tsae-Feng Yu

The Programs (which include both the software and documentation) contain proprietary information of Oracle Corporation; they are provided under a license agreement containing restrictions on use and disclosure and are also protected by copyright, patent and other intellectual and industrial property laws. Reverse engineering, disassembly or decompilation of the Programs, except to the extent required to obtain interoperability with other independently created software or as specified by law, is prohibited.

The information contained in this document is subject to change without notice. If you find any problems in the documentation, please report them to us in writing. Oracle Corporation does not warrant that this document is error-free. Except as may be expressly permitted in your license agreement for these Programs, no part of these Programs may be reproduced or transmitted in any form or by any means, electronic or mechanical, for any purpose, without the express written permission of Oracle Corporation.

If the Programs are delivered to the U.S. Government or anyone licensing or using the programs on behalf of the U.S. Government, the following notice is applicable:

Restricted Rights Notice
Programs delivered subject to the DOD FAR Supplement are "commercial computer software" and use, duplication, and disclosure of the Programs, including documentation, shall be subject to the licensing restrictions set forth in the applicable Oracle license agreement. Otherwise, Programs delivered subject to the Federal Acquisition Regulations are "restricted computer software" and use, duplication, and disclosure of the Programs shall be subject to the restrictions in FAR 52.227-19, Commercial Computer Software - Restricted Rights (June, 1987). Oracle Corporation, 500 Oracle Parkway, Redwood City, CA 94065.

The Programs are not intended for use in any nuclear, aviation, mass transit, medical, or other inherently dangerous applications. It shall be the licensee's responsibility to take all appropriate fail-safe, backup, redundancy, and other measures to ensure the safe use of such applications if the Programs are used for such purposes, and Oracle Corporation disclaims liability for any damages caused by such use of the Programs.

Oracle is a registered trademark, and Express, Oracle8i, Oracle9i, Oracle Store, PL/SQL, Pro*C, and SQL*Plus are trademarks or registered trademarks of Oracle Corporation. Other names may be trademarks of their respective owners.
Contents
Send Us Your Comments .......... xxix
Preface .......... xxxi
Audience .......... xxxii
Organization .......... xxxii
Related Documentation .......... xxxiv
Conventions .......... xxxv
Documentation Accessibility .......... xxxviii
Data Warehouse Architecture (Basic) .......... 1-5
Data Warehouse Architecture (with a Staging Area) .......... 1-6
Data Warehouse Architecture (with a Staging Area and Data Marts) .......... 1-6
Part II
2 Logical Design
Part III
3 Physical Design
Partition Bounds for Range Partitioning .......... 5-31
Comparing Partitioning Keys with Partition Bounds .......... 5-31
MAXVALUE .......... 5-31
Nulls .......... 5-32
DATE Datatypes .......... 5-32
Multicolumn Partitioning Keys .......... 5-33
Implicit Constraints Imposed by Partition Bounds .......... 5-33
Index Partitioning .......... 5-33
Local Partitioned Indexes .......... 5-34
Global Partitioned Indexes .......... 5-37
Summary of Partitioned Index Types .......... 5-39
The Importance of Nonprefixed Indexes .......... 5-40
Performance Implications of Prefixed and Nonprefixed Indexes .......... 5-40
Guidelines for Partitioning Indexes .......... 5-41
Physical Attributes of Index Partitions .......... 5-42
6 Indexes
Using Bitmap Indexes in Data Warehouses .......... 6-2
Benefits for Data Warehousing Applications .......... 6-2
Cardinality .......... 6-3
Bitmap Indexes and Nulls .......... 6-5
Bitmap Indexes on Partitioned Tables .......... 6-6
Using Bitmap Join Indexes in Data Warehouses .......... 6-6
Four Join Models for Bitmap Join Indexes .......... 6-6
Bitmap Join Index Restrictions and Requirements .......... 6-9
Using B-Tree Indexes in Data Warehouses .......... 6-10
Using Index Compression .......... 6-10
Choosing Between Local Indexes and Global Indexes .......... 6-11
7 Integrity Constraints
Why Integrity Constraints are Useful in a Data Warehouse .......... 7-2
Overview of Constraint States .......... 7-3
Typical Data Warehouse Integrity Constraints .......... 7-3
UNIQUE Constraints in a Data Warehouse .......... 7-4
FOREIGN KEY Constraints in a Data Warehouse .......... 7-5
RELY Constraints
Integrity Constraints and Parallelism
Integrity Constraints and Partitioning
View Constraints
Materialized View Restrictions .......... 8-24
General Query Rewrite Restrictions .......... 8-25
Refresh Options .......... 8-25
General Restrictions on Fast Refresh .......... 8-27
Restrictions on Fast Refresh on Materialized Views with Joins Only .......... 8-27
Restrictions on Fast Refresh on Materialized Views with Aggregates .......... 8-27
Restrictions on Fast Refresh on Materialized Views with UNION ALL .......... 8-29
Achieving Refresh Goals .......... 8-30
Refreshing Nested Materialized Views .......... 8-30
ORDER BY Clause .......... 8-31
Materialized View Logs .......... 8-31
Using the FORCE Option with Materialized View Logs .......... 8-33
Using Oracle Enterprise Manager .......... 8-33
Using Materialized Views with NLS Parameters .......... 8-33
Adding Comments to Materialized Views .......... 8-33
Registering Existing Materialized Views .......... 8-34
Choosing Indexes for Materialized Views .......... 8-36
Dropping Materialized Views .......... 8-37
Analyzing Materialized View Capabilities .......... 8-37
Using the DBMS_MVIEW.EXPLAIN_MVIEW Procedure .......... 8-37
DBMS_MVIEW.EXPLAIN_MVIEW Declarations .......... 8-38
Using MV_CAPABILITIES_TABLE .......... 8-38
MV_CAPABILITIES_TABLE.CAPABILITY_NAME Details .......... 8-40
MV_CAPABILITIES_TABLE Column Details .......... 8-42
Rolling Materialized Views .......... 9-9
Materialized Views in OLAP Environments .......... 9-9
OLAP Cubes .......... 9-9
Partitioning Materialized Views for OLAP .......... 9-10
Compressing Materialized Views for OLAP .......... 9-11
Materialized Views with Set Operators .......... 9-11
Examples of Materialized Views Using UNION ALL .......... 9-11
Materialized Views and Models .......... 9-13
Invalidating Materialized Views .......... 9-14
Security Issues with Materialized Views .......... 9-14
Querying Materialized Views with Virtual Private Database .......... 9-15
Using Query Rewrite with Virtual Private Database .......... 9-16
Restrictions with Materialized Views and Virtual Private Database .......... 9-16
Altering Materialized Views .......... 9-17
10 Dimensions
What are Dimensions? .......... 10-2
Creating Dimensions .......... 10-4
Dropping and Creating Attributes with Columns .......... 10-8
Multiple Hierarchies .......... 10-9
Using Normalized Dimension Tables .......... 10-10
Viewing Dimensions .......... 10-11
Using Oracle Enterprise Manager .......... 10-11
Using the DESCRIBE_DIMENSION Procedure .......... 10-11
Using Dimensions with Constraints .......... 10-12
Validating Dimensions .......... 10-12
Altering Dimensions .......... 10-14
Deleting Dimensions .......... 10-14
Part IV
11
Daily Operations in Data Warehouses .......... 11-3
Evolution of the Data Warehouse .......... 11-4
12
13
14
15
Complete Refresh .......... 15-15
Fast Refresh .......... 15-15
Partition Change Tracking (PCT) Refresh .......... 15-15
ON COMMIT Refresh .......... 15-16
Manual Refresh Using the DBMS_MVIEW Package .......... 15-16
Refresh Specific Materialized Views with REFRESH .......... 15-17
Refresh All Materialized Views with REFRESH_ALL_MVIEWS .......... 15-18
Refresh Dependent Materialized Views with REFRESH_DEPENDENT .......... 15-19
Using Job Queues for Refresh .......... 15-20
When Fast Refresh is Possible .......... 15-21
Recommended Initialization Parameters for Parallelism .......... 15-21
Monitoring a Refresh .......... 15-21
Checking the Status of a Materialized View .......... 15-22
Scheduling Refresh .......... 15-22
Tips for Refreshing Materialized Views with Aggregates .......... 15-23
Tips for Refreshing Materialized Views Without Aggregates .......... 15-26
Tips for Refreshing Nested Materialized Views .......... 15-27
Tips for Fast Refresh with UNION ALL .......... 15-28
Tips After Refreshing Materialized Views .......... 15-28
Using Materialized Views with Partitioned Tables .......... 15-29
Fast Refresh with Partition Change Tracking .......... 15-29
PCT Fast Refresh Scenario 1 .......... 15-29
PCT Fast Refresh Scenario 2 .......... 15-31
PCT Fast Refresh Scenario 3 .......... 15-32
Fast Refresh with CONSIDER FRESH .......... 15-33
16
Asynchronous .......... 16-11
HotLog .......... 16-12
AutoLog .......... 16-13
Change Sets .......... 16-14
Valid Combinations of Change Sources and Change Sets .......... 16-15
Change Tables .......... 16-16
Getting Information About the Change Data Capture Environment .......... 16-16
Preparing to Publish Change Data .......... 16-18
Creating a User to Serve As a Publisher .......... 16-18
Granting Privileges and Roles to the Publisher .......... 16-19
Creating a Default Tablespace for the Publisher .......... 16-19
Password Files and Setting the REMOTE_LOGIN_PASSWORDFILE Parameter .......... 16-19
Determining the Mode in Which to Capture Data .......... 16-20
Setting Initialization Parameters for Change Data Capture Publishing .......... 16-21
Initialization Parameters for Synchronous Publishing .......... 16-21
Initialization Parameters for Asynchronous HotLog Publishing .......... 16-21
Initialization Parameters for Asynchronous AutoLog Publishing .......... 16-22
Determining the Current Setting of an Initialization Parameter .......... 16-25
Retaining Initialization Parameter Values When a Database Is Restarted .......... 16-25
Adjusting Initialization Parameter Values When Oracle Streams Values Change .......... 16-25
Publishing Change Data .......... 16-27
Performing Synchronous Publishing .......... 16-27
Performing Asynchronous HotLog Publishing .......... 16-30
Performing Asynchronous AutoLog Publishing .......... 16-35
Subscribing to Change Data .......... 16-42
Considerations for Asynchronous Change Data Capture .......... 16-47
Asynchronous Change Data Capture and Redo Log Files .......... 16-48
Asynchronous Change Data Capture and Supplemental Logging .......... 16-50
Datatypes and Table Structures Supported for Asynchronous Change Data Capture .......... 16-51
Managing Published Data .......... 16-52
Managing Asynchronous Change Sets .......... 16-52
Creating Asynchronous Change Sets with Starting and Ending Dates .......... 16-52
Enabling and Disabling Asynchronous Change Sets .......... 16-53
Stopping Capture on DDL for Asynchronous Change Sets .......... 16-54
Recovering from Errors Returned on Asynchronous Change Sets .......... 16-55
Managing Change Tables .......... 16-58
Creating Change Tables .......... 16-59
Understanding Change Table Control Columns .......... 16-60
Understanding TARGET_COLMAP$ and SOURCE_COLMAP$ Values .......... 16-62
Controlling Subscriber Access to Change Tables .......... 16-64
Purging Change Tables of Unneeded Data .......... 16-65
Dropping Change Tables .......... 16-67
Considerations for Exporting and Importing Change Data Capture Objects .......... 16-67
Impact on Subscriptions When the Publisher Makes Changes .......... 16-70
Implementation and System Configuration .......... 16-71
Synchronous Change Data Capture Restriction on Direct-Path INSERT .......... 16-72
17 SQLAccess Advisor
Overview of the SQLAccess Advisor in the DBMS_ADVISOR Package .......... 17-2
Overview of Using the SQLAccess Advisor .......... 17-4
SQLAccess Advisor Repository .......... 17-7
Using the SQLAccess Advisor .......... 17-7
SQLAccess Advisor Flowchart .......... 17-8
SQLAccess Advisor Privileges .......... 17-9
Creating Tasks .......... 17-10
SQLAccess Advisor Templates .......... 17-10
Creating Templates .......... 17-11
Workload Objects .......... 17-12
Managing Workloads .......... 17-12
Linking a Task and a Workload .......... 17-13
Defining the Contents of a Workload .......... 17-14
SQL Tuning Set .......... 17-14
Loading a User-Defined Workload .......... 17-15
Loading a SQL Cache Workload .......... 17-16
Using a Hypothetical Workload .......... 17-17
Using a Summary Advisor 9i Workload .......... 17-18
SQLAccess Advisor Workload Parameters .......... 17-19
SQL Workload Journal .......... 17-20
Adding SQL Statements to a Workload .......... 17-20
Deleting SQL Statements from a Workload .......... 17-21
Changing SQL Statements in a Workload .......... 17-22
Maintaining Workloads .......... 17-22
Setting Workload Attributes .......... 17-23
Resetting Workloads .......... 17-23
Removing a Link Between a Workload and a Task .......... 17-23
Removing Workloads .......... 17-24
Recommendation Options .......... 17-24
Generating Recommendations .......... 17-25
EXECUTE_TASK Procedure .......... 17-26
Viewing the Recommendations .......... 17-26
Access Advisor Journal .......... 17-32
Stopping the Recommendation Process .......... 17-32
Canceling Tasks .......... 17-32
Marking Recommendations .......... 17-33
Modifying Recommendations .......... 17-33
Generating SQL Scripts .......... 17-34
When Recommendations are No Longer Required .......... 17-36
Performing a Quick Tune .......... 17-36
Managing Tasks .......... 17-37
Updating Task Attributes .......... 17-37
Deleting Tasks .......... 17-38
Setting DAYS_TO_EXPIRE .......... 17-38
Using SQLAccess Advisor Constants .......... 17-38
Examples of Using the SQLAccess Advisor .......... 17-39
Recommendations From a User-Defined Workload .......... 17-39
Generate Recommendations Using a Task Template .......... 17-42
Filter a Workload from the SQL Cache .......... 17-44
Evaluate Current Usage of Indexes and Materialized Views .......... 17-46
Tuning Materialized Views for Fast Refresh and Query Rewrite .......... 17-47
DBMS_ADVISOR.TUNE_MVIEW Procedure .......... 17-48
TUNE_MVIEW Syntax and Operations .......... 17-48
Accessing TUNE_MVIEW Output Results .......... 17-50
USER_TUNE_MVIEW and DBA_TUNE_MVIEW Views .......... 17-50
Script Generation DBMS_ADVISOR Function and Procedure .......... 17-50
Fast Refreshable with Optimized Sub-Materialized View .......... 17-56
18 Query Rewrite
Overview of Query Rewrite .......... 18-2
Cost-Based Rewrite .......... 18-3
When Does Oracle Rewrite a Query? .......... 18-4
Enabling Query Rewrite .......... 18-5
Initialization Parameters for Query Rewrite .......... 18-5
Controlling Query Rewrite .......... 18-6
Accuracy of Query Rewrite .......... 18-7
Query Rewrite Hints .......... 18-8
Privileges for Enabling Query Rewrite .......... 18-9
Sample Schema and Materialized Views .......... 18-9
How Oracle Rewrites Queries .......... 18-11
Text Match Rewrite Methods .......... 18-11
Text Match Capabilities .......... 18-13
General Query Rewrite Methods .......... 18-13
When are Constraints and Dimensions Needed? .......... 18-13
Join Back .......... 18-14
Rollup Using a Dimension .......... 18-16
Compute Aggregates .......... 18-17
Filtering the Data .......... 18-18
Dropping Selections in the Rewritten Query .......... 18-24
Handling of HAVING Clause in Query Rewrite .......... 18-25
Handling Expressions in Query Rewrite .......... 18-25
Handling IN-Lists in Query Rewrite .......... 18-26
Checks Made by Query Rewrite .......... 18-28
Join Compatibility Check .......... 18-28
Data Sufficiency Check .......... 18-33
Grouping Compatibility Check .......... 18-34
Aggregate Computability Check .......... 18-34
Other Cases for Query Rewrite .......... 18-34
Query Rewrite Using Partially Stale Materialized Views .......... 18-35
Query Rewrite Using Nested Materialized Views .......... 18-38
Query Rewrite When Using GROUP BY Extensions .......... 18-39
Hint for Queries with Extended GROUP BY .......... 18-44
Query Rewrite with Inline Views .......... 18-44
Query Rewrite with Selfjoins .......... 18-45
Query Rewrite and View Constraints .......... 18-46
Query Rewrite and Expression Matching .......... 18-49
Date Folding Rewrite .......... 18-49
Partition Change Tracking (PCT) Rewrite .......... 18-52
PCT Rewrite Based on LIST Partitioned Tables .......... 18-52
PCT and PMARKER .......... 18-55
PCT Rewrite with Materialized Views Based on Range-List Partitioned Tables .......... 18-57
PCT Rewrite Using Rowid as Pmarker .......... 18-59
Query Rewrite and Bind Variables .......... 18-61
Query Rewrite Using Set Operator Materialized Views .......... 18-62
UNION ALL Marker .......... 18-64
Did Query Rewrite Occur? .......... 18-65
Explain Plan .......... 18-65
DBMS_MVIEW.EXPLAIN_REWRITE Procedure .......... 18-66
DBMS_MVIEW.EXPLAIN_REWRITE Syntax .......... 18-66
Using REWRITE_TABLE .......... 18-67
Using a Varray .......... 18-69
EXPLAIN_REWRITE Benefit Statistics .......... 18-71
Support for Query Text Larger than 32KB in EXPLAIN_REWRITE .......... 18-71
Design Considerations for Improving Query Rewrite Capabilities .......... 18-72
Query Rewrite Considerations: Constraints .......... 18-72
Query Rewrite Considerations: Dimensions .......... 18-73
Query Rewrite Considerations: Outer Joins .......... 18-73
Query Rewrite Considerations: Text Match .......... 18-73
Query Rewrite Considerations: Aggregates .......... 18-73
Query Rewrite Considerations: Grouping Conditions .......... 18-74
Query Rewrite Considerations: Expression Matching .......... 18-74
Query Rewrite Considerations: Date Folding .......... 18-74
Query Rewrite Considerations: Statistics .......... 18-74
Advanced Rewrite Using Equivalences .......... 18-75
19
20
GROUP_ID Function .......... 20-16
GROUPING SETS Expression .......... 20-17
GROUPING SETS Syntax .......... 20-19
Composite Columns .......... 20-20
Concatenated Groupings .......... 20-22
Concatenated Groupings and Hierarchical Data Cubes .......... 20-24
Considerations when Using Aggregation .......... 20-26
Hierarchy Handling in ROLLUP and CUBE .......... 20-26
Column Capacity in ROLLUP and CUBE .......... 20-27
HAVING Clause Used with GROUP BY Extensions .......... 20-27
ORDER BY Clause Used with GROUP BY Extensions .......... 20-28
Using Other Aggregate Functions with ROLLUP and CUBE .......... 20-28
Computation Using the WITH Clause .......... 20-28
Working with Hierarchical Cubes in SQL .......... 20-29
Specifying Hierarchical Cubes in SQL .......... 20-29
Querying Hierarchical Cubes in SQL .......... 20-30
SQL for Creating Materialized Views to Store Hierarchical Cubes .......... 20-31
Examples of Hierarchical Cube Materialized Views .......... 20-32
21
Treatment of NULLs as Input to Window Functions .......... 21-16
Windowing Functions with Logical Offset .......... 21-16
Centered Aggregate Function .......... 21-18
Windowing Aggregate Functions in the Presence of Duplicates .......... 21-19
Varying Window Size for Each Row .......... 21-20
Windowing Aggregate Functions with Physical Offsets .......... 21-21
FIRST_VALUE and LAST_VALUE Functions .......... 21-21
Reporting Aggregate Functions .......... 21-22
RATIO_TO_REPORT Function .......... 21-24
LAG/LEAD Functions .......... 21-25
LAG/LEAD Syntax .......... 21-25
FIRST/LAST Functions .......... 21-26
FIRST/LAST Syntax .......... 21-26
FIRST/LAST As Regular Aggregates .......... 21-26
FIRST/LAST As Reporting Aggregates .......... 21-27
Inverse Percentile Functions .......... 21-28
Normal Aggregate Syntax .......... 21-28
Inverse Percentile Example Basis .......... 21-28
As Reporting Aggregates .......... 21-30
Inverse Percentile Restrictions .......... 21-31
Hypothetical Rank and Distribution Functions .......... 21-32
Hypothetical Rank and Distribution Syntax .......... 21-32
Linear Regression Functions .......... 21-33
REGR_COUNT Function .......... 21-34
REGR_AVGY and REGR_AVGX Functions .......... 21-34
REGR_SLOPE and REGR_INTERCEPT Functions .......... 21-34
REGR_R2 Function .......... 21-35
REGR_SXX, REGR_SYY, and REGR_SXY Functions .......... 21-35
Linear Regression Statistics Examples .......... 21-35
Sample Linear Regression Calculation .......... 21-35
Frequent Itemsets .......... 21-36
Other Statistical Functions .......... 21-37
Descriptive Statistics .......... 21-37
Hypothesis Testing - Parametric Tests .......... 21-37
Crosstab Statistics .......... 21-38
Hypothesis Testing - Non-Parametric Tests .......... 21-38
Non-Parametric Correlation .......... 21-39
WIDTH_BUCKET Function .......... 21-39
WIDTH_BUCKET Syntax .......... 21-39
User-Defined Aggregate Functions .......... 21-42
CASE Expressions .......... 21-43
Creating Histograms With User-Defined Buckets .......... 21-44
Data Densification for Reporting .......... 21-45
Partition Join Syntax .......... 21-45
Sample of Sparse Data .......... 21-46
Filling Gaps in Data .......... 21-47
Filling Gaps in Two Dimensions .......... 21-48
Filling Gaps in an Inventory Table .......... 21-50
Computing Data Values to Fill Gaps .......... 21-52
Time Series Calculations on Densified Data .......... 21-53
Period-to-Period Comparison for One Time Level: Example .......... 21-55
Period-to-Period Comparison for Multiple Time Levels: Example .......... 21-56
Creating a Custom Member in a Dimension: Example .......... 21-62
22
Rules .......... 22-17
Single Cell References .......... 22-18
Multi-Cell References on the Right Side .......... 22-18
Multi-Cell References on the Left Side .......... 22-19
Use of the ANY Wildcard .......... 22-20
Nested Cell References .......... 22-20
Order of Evaluation of Rules .......... 22-21
Differences Between Update and Upsert .......... 22-22
Treatment of NULLs and Missing Cells .......... 22-23
Use Defaults for Missing Cells and NULLs .......... 22-25
Qualifying NULLs for a Dimension .......... 22-26
Reference Models .......... 22-26
Advanced Topics in SQL Modeling .......... 22-30
FOR Loops .......... 22-30
Iterative Models .......... 22-34
Rule Dependency in AUTOMATIC ORDER Models .......... 22-35
Ordered Rules .......... 22-37
Unique Dimensions Versus Unique Single References .......... 22-38
Rules and Restrictions when Using SQL for Modeling .......... 22-40
Performance Considerations with SQL Modeling .......... 22-42
Parallel Execution .......... 22-42
Aggregate Computation .......... 22-43
Using EXPLAIN PLAN to Understand Model Queries .......... 22-45
Using ORDERED FAST: Example .......... 22-45
Using ORDERED: Example .......... 22-45
Using ACYCLIC FAST: Example .......... 22-46
Using ACYCLIC: Example .......... 22-46
Using CYCLIC: Example .......... 22-47
Examples of SQL Modeling .......... 22-47
23
Manageability .......... 23-3
Backup and Recovery .......... 23-3
Security .......... 23-4
Oracle Data Mining Overview .......... 23-4
Enabling Data Mining Applications .......... 23-5
Data Mining in the Database .......... 23-5
Data Preparation .......... 23-6
Model Building .......... 23-6
Model Evaluation .......... 23-7
Model Apply (Scoring) .......... 23-7
ODM Programmatic Interfaces .......... 23-7
ODM Java API .......... 23-7
ODM PL/SQL Packages .......... 23-8
ODM Sequence Similarity Search (BLAST) .......... 23-8
24
Parallel Queries on Object Types .......... 24-15
Parallel DDL .......... 24-15
DDL Statements That Can Be Parallelized .......... 24-15
CREATE TABLE ... AS SELECT in Parallel .......... 24-16
Recoverability and Parallel DDL .......... 24-17
Space Management for Parallel DDL .......... 24-17
Storage Space When Using Dictionary-Managed Tablespaces .......... 24-18
Free Space and Parallel DDL .......... 24-18
Parallel DML .......... 24-19
Advantages of Parallel DML over Manual Parallelism .......... 24-20
When to Use Parallel DML .......... 24-21
Enabling Parallel DML .......... 24-22
Transaction Restrictions for Parallel DML .......... 24-23
Rollback Segments .......... 24-24
Recovery for Parallel DML .......... 24-24
Space Considerations for Parallel DML .......... 24-24
Lock and Enqueue Resources for Parallel DML .......... 24-25
Restrictions on Parallel DML .......... 24-25
Data Integrity Restrictions .......... 24-26
Trigger Restrictions .......... 24-27
Distributed Transaction Restrictions .......... 24-27
Examples of Distributed Transaction Parallelization .......... 24-27
Parallel Execution of Functions .......... 24-28
Functions in Parallel Queries .......... 24-29
Functions in Parallel DML and DDL Statements .......... 24-29
Other Types of Parallelism .......... 24-29
Initializing and Tuning Parameters for Parallel Execution .......... 24-30
Using Default Parameter Settings .......... 24-31
Setting the Degree of Parallelism for Parallel Execution .......... 24-32
How Oracle Determines the Degree of Parallelism for Operations .......... 24-33
Hints and Degree of Parallelism .......... 24-33
Table and Index Definitions .......... 24-34
Default Degree of Parallelism .......... 24-34
Adaptive Multiuser Algorithm .......... 24-35
Minimum Number of Parallel Execution Servers .......... 24-35
xxiv
Limiting the Number of Available Instances ................................................................ Balancing the Workload .......................................................................................................... Parallelization Rules for SQL Statements.............................................................................. Rules for Parallelizing Queries........................................................................................ Rules for UPDATE, MERGE, and DELETE................................................................... Rules for INSERT ... SELECT........................................................................................... Rules for DDL Statements ................................................................................................ Rules for [CREATE | REBUILD] INDEX or [MOVE | SPLIT] PARTITION ........... Rules for CREATE TABLE AS SELECT ......................................................................... Summary of Parallelization Rules................................................................................... Enabling Parallelism for Tables and Queries ....................................................................... Degree of Parallelism and Adaptive Multiuser: How They Interact ................................ How the Adaptive Multiuser Algorithm Works .......................................................... Forcing Parallel Execution for a Session ............................................................................... Controlling Performance with the Degree of Parallelism .................................................. Tuning General Parameters for Parallel Execution .................................................................. Parameters Establishing Resource Limits for Parallel Operations.................................... PARALLEL_MAX_SERVERS .......................................................................................... Increasing the Number of Concurrent Users ................................................................ Limiting the Number of Resources for a User .............................................................. PARALLEL_MIN_SERVERS ........................................................................................... SHARED_POOL_SIZE ..................................................................................................... Computing Additional Memory Requirements for Message Buffers ....................... Adjusting Memory After Processing Begins ................................................................. PARALLEL_MIN_PERCENT.......................................................................................... Parameters Affecting Resource Consumption ..................................................................... PGA_AGGREGATE_TARGET........................................................................................ PARALLEL_EXECUTION_MESSAGE_SIZE ............................................................... Parameters Affecting Resource Consumption for Parallel DML and Parallel DDL Parameters Related to I/O ...................................................................................................... DB_CACHE_SIZE ............................................................................................................. DB_BLOCK_SIZE .............................................................................................................. 
DB_FILE_MULTIBLOCK_READ_COUNT................................................................... DISK_ASYNCH_IO and TAPE_ASYNCH_IO.............................................................. Monitoring and Diagnosing Parallel Execution Performance...............................................
24-35 24-36 24-37 24-37 24-38 24-40 24-41 24-41 24-42 24-43 24-45 24-45 24-46 24-46 24-47 24-47 24-47 24-48 24-49 24-49 24-49 24-50 24-51 24-53 24-55 24-55 24-56 24-56 24-56 24-59 24-60 24-60 24-60 24-60 24-61
xxv
Is There Regression? ................................................................................................................. Is There a Plan Change?........................................................................................................... Is There a Parallel Plan? ........................................................................................................... Is There a Serial Plan? .............................................................................................................. Is There Parallel Execution? .................................................................................................... Is the Workload Evenly Distributed?..................................................................................... Monitoring Parallel Execution Performance with Dynamic Performance Views........... V$PX_BUFFER_ADVICE ................................................................................................. V$PX_SESSION.................................................................................................................. V$PX_SESSTAT ................................................................................................................. V$PX_PROCESS ................................................................................................................ V$PX_PROCESS_SYSSTAT ............................................................................................. V$PQ_SESSTAT................................................................................................................. V$FILESTAT....................................................................................................................... V$PARAMETER ................................................................................................................ V$PQ_TQSTAT .................................................................................................................. V$SESSTAT and V$SYSSTAT.......................................................................................... Monitoring Session Statistics .................................................................................................. Monitoring System Statistics................................................................................................... Monitoring Operating System Statistics................................................................................ Afnity and Parallel Operations.................................................................................................. Affinity and Parallel Queries .................................................................................................. Affinity and Parallel DML....................................................................................................... Miscellaneous Parallel Execution Tuning Tips ......................................................................... Setting Buffer Cache Size for Parallel Operations................................................................ Overriding the Default Degree of Parallelism...................................................................... Rewriting SQL Statements....................................................................................................... Creating and Populating Tables in Parallel .......................................................................... 
Creating Temporary Tablespaces for Parallel Sort and Hash Join .................................... Size of Temporary Extents ............................................................................................... Executing Parallel SQL Statements ........................................................................................ Using EXPLAIN PLAN to Show Parallel Operations Plans............................................... Additional Considerations for Parallel DML ....................................................................... PDML and Direct-Path Restrictions................................................................................ Limitation on the Degree of Parallelism ........................................................................
24-62 24-63 24-63 24-63 24-64 24-64 24-65 24-65 24-65 24-65 24-65 24-66 24-66 24-66 24-66 24-67 24-68 24-68 24-70 24-71 24-71 24-72 24-72 24-73 24-74 24-74 24-74 24-75 24-76 24-76 24-77 24-77 24-78 24-78 24-79
xxvi
Using Local and Global Striping ..................................................................................... Increasing INITRANS....................................................................................................... Limitation on Available Number of Transaction Free Lists for Segments ............... Using Multiple Archivers................................................................................................. Database Writer Process (DBWn) Workload................................................................. [NO]LOGGING Clause .................................................................................................... Creating Indexes in Parallel .................................................................................................... Parallel DML Tips..................................................................................................................... Parallel DML Tip 1: INSERT............................................................................................ Parallel DML Tip 2: Direct-Path INSERT....................................................................... Parallel DML Tip 3: Parallelizing INSERT, MERGE, UPDATE, and DELETE ........ Incremental Data Loading in Parallel.................................................................................... Updating the Table in Parallel......................................................................................... Inserting the New Rows into the Table in Parallel....................................................... Merging in Parallel............................................................................................................ Using Hints with Query Optimization.................................................................................. FIRST_ROWS(n) Hint .............................................................................................................. Enabling Dynamic Sampling ..................................................................................................
24-79 24-79 24-79 24-80 24-80 24-80 24-81 24-83 24-83 24-83 24-84 24-85 24-86 24-87 24-87 24-87 24-88 24-88
Glossary Index
xxvii
xxviii
Oracle Corporation welcomes your comments and suggestions on the quality and usefulness of this publication. Your input is an important part of the information used for revision.
Did you find any errors?
Is the information clearly presented?
Do you need more information? If so, where?
Are the examples correct? Do you need more examples?
What features did you like most about this manual?
If you find any errors or have any other suggestions for improvement, please indicate the title and part number of the documentation and the chapter, section, and page number (if available). You can send comments to us in the following ways:
Electronic mail: [email protected]
FAX: (650) 506-7227. Attn: Server Technologies Documentation Manager
Postal service: Oracle Corporation, Server Technologies Documentation, 500 Oracle Parkway, Mailstop 4op11, Redwood Shores, CA 94065, USA
If you would like a reply, please give your name, address, telephone number, and electronic mail address (optional). If you have problems with the software, please contact your local Oracle Support Services.
Preface
This manual provides information about Oracle's data warehousing capabilities. This preface contains these topics:
Audience
Organization
Related Documentation
Conventions
Documentation Accessibility
Note: The Oracle Data Warehousing Guide contains information that describes the features and functionality of the Oracle Database Standard Edition, Oracle Database Enterprise Edition, and Oracle Database Personal Edition products. These products have the same basic features. However, several advanced features are available only with the Oracle Database Enterprise Edition or Oracle Database Personal Edition, and some of these are optional. For example, to create partitioned tables and indexes, you must have the Oracle Database Enterprise Edition or Oracle Database Personal Edition.
Audience
This guide is intended for database administrators, system administrators, and database application developers who design, maintain, and use data warehouses. To use this document, you need to be familiar with relational database concepts, basic Oracle server concepts, and the operating system environment under which you are running Oracle.
Organization
This document contains:
Part 1: Concepts
Chapter 1, "Data Warehousing Concepts" This chapter contains an overview of data warehousing concepts.
Chapter 7, "Integrity Constraints" This chapter describes how to use integrity constraints in data warehouses. Chapter 8, "Basic Materialized Views" This chapter introduces basic materialized views concepts. Chapter 9, "Advanced Materialized Views" This chapter describes how to use materialized views in data warehouses. Chapter 10, "Dimensions" This chapter describes how to use dimensions in data warehouses.
Related Documentation
For more information, see these Oracle resources:
Many of the examples in this book use the sample schemas of the seed database, which is installed by default when you install Oracle. Refer to Oracle Database Sample Schemas for information on how these schemas were created and how you can use them yourself. Printed documentation is available for sale in the Oracle Store at
http://oraclestore.oracle.com/
To download free release notes, installation documentation, white papers, or other collateral, please visit the Oracle Technology Network (OTN). You must register online before using OTN; registration is free and can be done at
http://otn.oracle.com/membership/
If you already have a username and password for OTN, then you can go directly to the documentation section of the OTN Web site at
http://otn.oracle.com/documentation
The Data Warehouse Toolkit by Ralph Kimball (John Wiley and Sons, 1996) Building the Data Warehouse by William Inmon (John Wiley and Sons, 1996)
Conventions
This section describes the conventions used in the text and code examples of this documentation set. It describes:
Conventions in Text
Conventions in Code Examples
Conventions in Text
We use various conventions in text to help you more quickly identify special terms. The following table describes those conventions and provides examples of their use.
Convention: Bold
Meaning: Bold typeface indicates terms that are defined in the text or terms that appear in a glossary, or both.
Example: When you specify this clause, you create an index-organized table.

Convention: Italics
Meaning: Italic typeface indicates book titles or emphasis.
Example: Oracle Database Concepts; Ensure that the recovery catalog and target database do not reside on the same disk.

Convention: UPPERCASE monospace (fixed-width) font
Meaning: Uppercase monospace typeface indicates elements supplied by the system. Such elements include parameters, privileges, datatypes, RMAN keywords, SQL keywords, SQL*Plus or utility commands, packages and methods, as well as system-supplied column names, database objects and structures, usernames, and roles.
Example: You can specify this clause only for a NUMBER column. You can back up the database by using the BACKUP command. Query the TABLE_NAME column in the USER_TABLES data dictionary view. Use the DBMS_STATS.GENERATE_STATS procedure.

Convention: lowercase monospace (fixed-width) font
Meaning: Lowercase monospace typeface indicates executables, filenames, directory names, and sample user-supplied elements. Such elements include computer and database names, net service names, and connect identifiers, as well as user-supplied database objects and structures, column names, packages and classes, usernames and roles, program units, and parameter values. Note: Some programmatic elements use a mixture of UPPERCASE and lowercase. Enter these elements as shown.
Example: Enter sqlplus to open SQL*Plus. The password is specified in the orapwd file. Back up the datafiles and control files in the /disk1/oracle/dbs directory. The department_id, department_name, and location_id columns are in the hr.departments table. Set the QUERY_REWRITE_ENABLED initialization parameter to TRUE. Connect as oe user. The JRepUtil class implements these methods.

Convention: lowercase italic monospace (fixed-width) font
Meaning: Lowercase italic monospace font represents placeholders or variables.
Example: You can specify the parallel_clause. Run Uold_release.SQL where old_release refers to the release you installed prior to upgrading.
The following table describes typographic conventions used in code examples and provides examples of their use.
Convention: [ ]
Meaning: Brackets enclose one or more optional items. Do not enter the brackets.

Convention: { }
Meaning: Braces enclose two or more items, one of which is required. Do not enter the braces.

Convention: |
Meaning: A vertical bar represents a choice of two or more options within brackets or braces. Enter one of the options. Do not enter the vertical bar.
Example: {ENABLE | DISABLE}; [COMPRESS | NOCOMPRESS]

Convention: ... (horizontal ellipsis points)
Meaning: Horizontal ellipsis points indicate either that we have omitted parts of the code that are not directly related to the example, or that you can repeat a portion of the code.
Example: CREATE TABLE ... AS subquery; SELECT col1, col2, ... , coln FROM employees;

Convention: . . . (vertical ellipsis points)
Meaning: Vertical ellipsis points indicate that we have omitted several lines of code not directly related to the example.
Example:
SQL> SELECT NAME FROM V$DATAFILE;
NAME
------------------------------------
/fsl/dbs/tbs_01.dbf
/fs1/dbs/tbs_02.dbf
.
.
.
/fsl/dbs/tbs_09.dbf
9 rows selected.

Convention: Other notation
Meaning: You must enter symbols other than brackets, braces, vertical bars, and ellipsis points as shown.
Example: acctbal NUMBER(11,2); acct CONSTANT NUMBER(4) := 3;

Convention: Italics
Meaning: Italicized text indicates placeholders or variables for which you must supply particular values.
Example: CONNECT SYSTEM/system_password; DB_NAME = database_name

Convention: UPPERCASE
Meaning: Uppercase typeface indicates elements supplied by the system. We show these terms in uppercase in order to distinguish them from terms you define. Unless terms appear in brackets, enter them in the order and with the spelling shown. However, because these terms are not case sensitive, you can enter them in lowercase.
Example: SELECT last_name, employee_id FROM employees; SELECT * FROM USER_TABLES; DROP TABLE hr.employees;

Convention: lowercase
Meaning: Lowercase typeface indicates programmatic elements that you supply. For example, lowercase indicates names of tables, columns, or files. Note: Some programmatic elements use a mixture of UPPERCASE and lowercase. Enter these elements as shown.
Example: SELECT last_name, employee_id FROM employees; sqlplus hr/hr; CREATE USER mjones IDENTIFIED BY ty3MU9;
Documentation Accessibility
Our goal is to make Oracle products, services, and supporting documentation accessible, with good usability, to the disabled community. To that end, our documentation includes features that make information available to users of assistive technology. This documentation is available in HTML format, and contains markup to facilitate access by the disabled community. Standards will continue to evolve over time, and Oracle is actively engaged with other market-leading technology vendors to address technical obstacles so that our documentation can be accessible to all of our customers. For additional information, visit the Oracle Accessibility Program Web site at
http://www.oracle.com/accessibility/
Accessibility of Code Examples in Documentation
JAWS, a Windows screen reader, may not always correctly read the code examples in this document. The conventions for writing code require that closing braces should appear on an otherwise empty line; however, JAWS may not always read a line of text that consists solely of a bracket or brace.

Accessibility of Links to External Web Sites in Documentation
This documentation may contain links to Web sites of other companies or organizations that Oracle does not own or control. Oracle neither evaluates nor makes any representations regarding the accessibility of these Web sites.
SQL Model Calculations
The MODEL clause enables you to specify complex formulas while avoiding multiple joins and UNION clauses. This clause supports OLAP queries such as share of ancestor and prior period comparisons, as well as calculations typically done in large spreadsheets. The MODEL clause provides building blocks for budgeting, forecasting, and statistical applications.
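As a minimal sketch of the general shape of such a query, the following example assumes a hypothetical sales_fact table with country, product, year, and amount columns, and a hypothetical product value 'Widget'; it treats the result set as a multidimensional array and computes a new cell from existing cells:

-- Hypothetical table, columns, and product name; a minimal MODEL sketch.
SELECT country, product, year, amount
FROM   sales_fact
MODEL
  PARTITION BY (country)        -- independent arrays, one per country
  DIMENSION BY (product, year)  -- cell addresses
  MEASURES    (amount)          -- cell values
  RULES (
    -- project 2004 as the sum of the two preceding years for one product
    amount['Widget', 2004] = amount['Widget', 2003] + amount['Widget', 2002]
  );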
See Also: Chapter 22, "SQL for Modeling"
SQLAccess Advisor
The SQLAccess Advisor tool and its related DBMS_ADVISOR package offer improved capabilities for recommending indexing and materialized view strategies.
See Also: Chapter 17, "SQLAccess Advisor"
Materialized Views
The TUNE_MVIEW procedure shows how to specify a materialized view so that it is fast refreshable and can use advanced types of query rewrite.
Materialized View Refresh Enhancements
Materialized view refresh has new optimizations for data warehousing and OLAP environments. The enhancements include more efficient calculation and update techniques, support for nested refresh, and improved cost analysis.
See Also:
Query Rewrite Enhancements Query rewrite performance and capabilities have been improved.
See Also: Chapter 18, "Query Rewrite"
Partitioning Enhancements
You can now use partitioning with index-organized tables. Also, materialized views in OLAP are able to use partitioning. You can now use hash-partitioned global indexes.
See Also: Chapter 5, "Parallelism and Partitioning in Data Warehouses"
Change Data Capture
Oracle now supports asynchronous change data capture as well as synchronous change data capture.
See Also: Chapter 16, "Change Data Capture"
ETL Enhancements
Oracle's extraction, transformation, and loading capabilities have been improved with several MERGE improvements and better external table capabilities.
See Also: Chapter 11, "Overview of Extraction, Transformation, and Loading"
Storage Management
Oracle Managed Files has simplified the administration of a database by providing functionality to automatically create and manage files, so the database administrator no longer needs to manage each database file. Automatic Storage Management provides additional functionality for managing not only files, but also the disks. In addition, you can now use ultralarge data files.
See Also: Chapter 4, "Hardware and I/O Considerations in Data Warehouses"
Part I
Concepts
This section introduces basic data warehousing concepts. It contains the following chapter:
Chapter 1, "Data Warehousing Concepts"
1
Data Warehousing Concepts
This chapter provides an overview of the Oracle data warehousing implementation. It includes:
Note that this book is meant as a supplement to standard texts about data warehousing. This book focuses on Oracle-specific material and does not reproduce in detail material of a general nature. Two standard texts are:
The Data Warehouse Toolkit by Ralph Kimball (John Wiley and Sons, 1996) Building the Data Warehouse by William Inmon (John Wiley and Sons, 1996)
and Loading" A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William Inmon:
Subject Oriented
Data warehouses are designed to help you analyze data. For example, to learn more about your company's sales data, you can build a data warehouse that concentrates on sales. Using this data warehouse, you can answer questions such as "Who was our best customer for this item last year?" This ability to define a data warehouse by subject matter, sales in this case, makes the data warehouse subject oriented.
Integrated
Integration is closely related to subject orientation. Data warehouses must put data from disparate sources into a consistent format. They must resolve such problems as naming conflicts and inconsistencies among units of measure. When they achieve this, they are said to be integrated.
Nonvolatile
Nonvolatile means that, once entered into the data warehouse, data should not change. This is logical because the purpose of a data warehouse is to enable you to analyze what has occurred.
Time Variant
In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. A data warehouse's focus on change over time is what is meant by the term time variant.
[Figure: contrasting OLTP and data warehousing environments. OLTP systems typically have many joins, a normalized DBMS, and rare duplicated data; data warehouses typically have some joins, a denormalized DBMS, and common duplicated data.]
One major difference between the types of system is that data warehouses are not usually in third normal form (3NF), a type of data normalization common in OLTP environments.
Data warehouses and OLTP systems have very different requirements. Here are some examples of differences between typical data warehouses and OLTP systems:
Workload Data warehouses are designed to accommodate ad hoc queries. You might not know the workload of your data warehouse in advance, so a data warehouse should be optimized to perform well for a wide variety of possible query operations. OLTP systems support only predefined operations. Your applications might be specifically tuned or designed to support only these operations.
Data modifications A data warehouse is updated on a regular basis by the ETL process (run nightly or weekly) using bulk data modification techniques. The end users of a data warehouse do not directly update the data warehouse. In OLTP systems, end users routinely issue individual data modification statements to the database. The OLTP database is always up to date, and reflects the current state of each business transaction.
Schema design Data warehouses often use denormalized or partially denormalized schemas (such as a star schema) to optimize query performance. OLTP systems often use fully normalized schemas to optimize update/insert/delete performance, and to guarantee data consistency.
Typical operations A typical data warehouse query scans thousands or millions of rows. For example, "Find the total sales for all customers last month." A typical OLTP operation accesses only a handful of records. For example, "Retrieve the current order for this customer."
Historical data Data warehouses usually store many months or years of data. This is to support historical analysis. OLTP systems usually store data from only a few weeks or months. The OLTP system stores only historical data as needed to successfully meet the requirements of the current transaction.
Data Warehouse Architecture (Basic)
Data Warehouse Architecture (with a Staging Area)
Data Warehouse Architecture (with a Staging Area and Data Marts)
[Figure 1-2: architecture of a basic data warehouse. Source data, including flat files, flows into the warehouse, which users access for analysis and mining.]
In Figure 1-2, the metadata and raw data of a traditional OLTP system is present, as is an additional type of data, summary data. Summaries are very valuable in data warehouses because they pre-compute long operations in advance. For example, a typical data warehouse query is to retrieve something such as August sales. A summary in an Oracle database is called a materialized view.
[Figure: data warehouse architecture with a staging area. Data from operational systems and flat files passes through a staging area into the warehouse, which users access for analysis and mining.]
[Figure 1-4: data warehouse architecture with a staging area and data marts. Data from operational systems and flat files passes through a staging area into the warehouse and then into data marts (for example, purchasing and inventory), which users access for analysis and mining.]
Part II
Logical Design
This section deals with the issues in logical design in a data warehouse. It contains the following chapter:
Chapter 2, "Logical Design in Data Warehouses"
2
Logical Design in Data Warehouses
This chapter explains how to create a logical design for a data warehousing environment and includes the following topics:
Logical Versus Physical Design in Data Warehouses
Creating a Logical Design
Data Warehousing Schemas
Data Warehousing Objects
The specific data content
Relationships within and between groups of data
The system environment supporting your data warehouse
The data transformations required
The frequency with which data is refreshed
The logical design is more conceptual and abstract than the physical design. In the logical design, you look at the logical relationships among the objects. In the physical design, you look at the most effective way of storing and retrieving the objects as well as handling them from a transportation and backup/recovery perspective. Orient your design toward the needs of the end users. End users typically want to perform analysis and look at aggregated data, rather than at individual transactions. However, end users might not know what they need until they see it. In addition, a well-planned design allows for growth and changes as the needs of users change and evolve. By beginning with the logical design, you focus on the information requirements and save the implementation details for later.
An entity represents a chunk of information. In relational databases, an entity often maps to a table. An attribute is a component of an entity that helps define the uniqueness of the entity. In relational databases, an attribute maps to a column. To be sure that your data is consistent, you need to use unique identifiers. A unique identifier is something you add to tables so that you can differentiate between the same item when it appears in different places. In a physical design, this is usually a primary key. While entity-relationship diagramming has traditionally been associated with highly normalized models such as OLTP applications, the technique is still useful for data warehouse design in the form of dimensional modeling. In dimensional modeling, instead of seeking to discover atomic units of information (such as entities and attributes) and all of the relationships between them, you identify which information belongs to a central fact table and which information belongs to its associated dimension tables. You identify business subjects or fields of data, define relationships between business subjects, and name the attributes for each subject.
See Also: Chapter 10, "Dimensions" for further information regarding dimensions
Your logical design should result in (1) a set of entities and attributes corresponding to fact tables and dimension tables and (2) a model of operational data from your source into subject-oriented information in your target data warehouse schema. You can create the logical design using a pen and paper, or you can use a design tool such as Oracle Warehouse Builder (specifically designed to support modeling the ETL process) or Oracle Designer (a general purpose modeling tool).
See Also: Oracle Designer and Oracle Warehouse Builder documentation sets
The physical implementation of the logical data warehouse model may require some changes to adapt it to your system parameters: size of machine, number of users, storage capacity, type of network, and software.
Star Schemas
The star schema is the simplest data warehouse schema. It is called a star schema because the diagram resembles a star, with points radiating from a center. The center of the star consists of one or more fact tables and the points of the star are the dimension tables, as shown in Figure 2-1.

Figure 2-1 Star Schema
[Figure: the sales fact table (amount_sold, quantity_sold) at the center, joined to the products, times, customers, and channels dimension tables at the points of the star.]
The most natural way to model a data warehouse is as a star schema, where only one join establishes the relationship between the fact table and any one of the dimension tables. A star schema optimizes performance by keeping queries simple and providing fast response time. All the information about each level is stored in one row.
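To make the single-join pattern concrete, the following sketch (assuming the column names of the sh sample schema shown in Figure 2-1) joins the sales fact table to two of its dimension tables and aggregates the measure:

-- Each dimension joins to the fact table through exactly one key.
SELECT t.calendar_month_desc,
       c.cust_city,
       SUM(s.amount_sold) AS total_sales
FROM   sales s, times t, customers c
WHERE  s.time_id = t.time_id
AND    s.cust_id = c.cust_id
GROUP  BY t.calendar_month_desc, c.cust_city;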
Note: Oracle Corporation recommends that you choose a star schema unless you have a clear reason not to.
Other Schemas
Some schemas in data warehousing environments use third normal form rather than star schemas. Another schema that is sometimes useful is the snowflake schema, which is a star schema with normalized dimensions in a tree structure.
See Also: Chapter 19, "Schema Modeling Techniques" for further information regarding star and snowflake schemas in data warehouses and Oracle Database Concepts for further conceptual material
Fact Tables
A fact table typically has two types of columns: those that contain numeric facts (often called measurements), and those that are foreign keys to dimension tables. A fact table contains either detail-level facts or facts that have been aggregated. Fact tables that contain aggregated facts are often called summary tables. A fact table usually contains facts with the same level of aggregation. Though most facts are additive, they can also be semi-additive or non-additive. Additive facts can be aggregated by simple arithmetical addition. A common example of this is sales. Non-additive facts cannot be added at all. An example of this is averages. Semi-additive facts can be aggregated along some of the dimensions and not along others. An example of this is inventory levels, where you cannot tell what a level means simply by looking at it.
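For example (again assuming the sh sample schema), an additive fact such as amount_sold can simply be summed at any level, whereas an average must always be recomputed from the detail rows rather than derived from previously stored averages:

-- Additive: totals roll up correctly from any level of detail.
SELECT prod_id, SUM(amount_sold) AS total_amount
FROM   sales
GROUP  BY prod_id;

-- Non-additive: the average is recomputed from detail rows; averaging
-- stored monthly averages would weight small and large months equally.
SELECT prod_id, AVG(amount_sold) AS avg_amount
FROM   sales
GROUP  BY prod_id;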
Dimension Tables
A dimension is a structure, often composed of one or more hierarchies, that categorizes data. Dimensional attributes help to describe the dimensional value. They are normally descriptive, textual values. Several distinct dimensions, combined with facts, enable you to answer business questions. Commonly used dimensions are customers, products, and time. Dimension data is typically collected at the lowest level of detail and then aggregated into higher level totals that are more useful for analysis. These natural rollups or aggregations within a dimension table are called hierarchies.
Hierarchies
Hierarchies are logical structures that use ordered levels as a means of organizing data. A hierarchy can be used to define data aggregation. For example, in a time dimension, a hierarchy might aggregate data from the month level to the quarter level to the year level. A hierarchy can also be used to define a navigational drill path and to establish a family structure. Within a hierarchy, each level is logically connected to the levels above and below it. Data values at lower levels aggregate into the data values at higher levels. A dimension can be composed of more than one hierarchy. For example, in the product dimension, there might be two hierarchies: one for product categories and one for product suppliers. Dimension hierarchies also group levels from general to granular. Query tools use hierarchies to enable you to drill down into your data to view different levels of granularity. This is one of the key benefits of a data warehouse. When designing hierarchies, you must consider the relationships in business structures. For example, consider a divisional multilevel sales organization. Hierarchies impose a family structure on dimension values. For a particular level value, a value at the next higher level is its parent, and values at the next lower level are its children. These familial relationships enable analysts to access data quickly.

Levels
A level represents a position in a hierarchy. For example, a time dimension might have a hierarchy that represents data at the month, quarter, and year levels. Levels range from general to specific, with the root level as the highest or most general level. The levels in a dimension are organized into one or more hierarchies.

Level Relationships
Level relationships specify top-to-bottom ordering of levels from most general (the root) to most specific information. They define the parent-child relationship between the levels in a hierarchy.
Hierarchies are also essential components in enabling more complex rewrites. For example, the database can aggregate existing sales revenue at the quarterly level up to a yearly total when the dimensional dependencies between quarter and year are known.
[Figure: a sample hierarchy in the customer dimension, rolling up from customer to country_name to subregion to region.]
See Also: Chapter 10, "Dimensions" and Chapter 18, "Query Rewrite" for further information regarding hierarchies
Unique Identifiers
Unique identifiers are specified for one distinct record in a dimension table. Artificial unique identifiers are often used to avoid the potential problem of unique identifiers changing. Unique identifiers are represented with the # character. For example, #customer_id.
Relationships
Relationships guarantee business integrity. An example is that if a business sells something, there is obviously a customer and a product. Designing a relationship between the sales information in the fact table and the dimension tables products and customers enforces the business rules in databases.
Part III
Physical Design
This section deals with the physical design of a data warehouse. It contains the following chapters:
Chapter 3, "Physical Design in Data Warehouses"
Chapter 4, "Hardware and I/O Considerations in Data Warehouses"
Chapter 5, "Parallelism and Partitioning in Data Warehouses"
Chapter 6, "Indexes"
Chapter 7, "Integrity Constraints"
Chapter 8, "Basic Materialized Views"
Chapter 9, "Advanced Materialized Views"
Chapter 10, "Dimensions"
3
Physical Design in Data Warehouses
This chapter describes the physical design of a data warehousing environment, and includes the following topics:
Chapter 5, "Parallelism and Partitioning in Data Warehouses" for further information regarding partitioning Oracle Database Concepts for further conceptual material regarding all design matters
Physical Design
During the logical design phase, you defined a model for your data warehouse consisting of entities, attributes, and relationships. The entities are linked together using relationships. Attributes are used to describe the entities. The unique identifier (UID) distinguishes between one instance of an entity and another. Figure 3-1 illustrates a graphical way of distinguishing between logical and physical designs.
Figure 3-1
[Figure: logical design constructs (such as entities and relationships) mapped to physical design constructs (such as tables, indexes, materialized views, and dimensions).]
During the physical design process, you translate the expected schemas into actual database structures. At this time, you have to map:
Entities to tables
Relationships to foreign key constraints
Attributes to columns
Primary unique identifiers to primary key constraints
Unique identifiers to unique key constraints
Some of these structures require disk space. Others exist only in the data dictionary. Additionally, the following structures may be created for performance improvement:
Tablespaces
A tablespace consists of one or more datafiles, which are physical structures within the operating system you are using. A datafile is associated with only one tablespace. From a design perspective, tablespaces are containers for physical design structures. Tablespaces should separate objects that differ in usage. For example, tables should be separated from their indexes, and small tables should be separated from large tables. Tablespaces should also represent logical business units if possible. Because a tablespace is the coarsest granularity for backup and recovery or the transportable tablespaces mechanism, the logical business design affects availability and maintenance operations. You can now use ultralarge data files, a significant improvement in very large databases.
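As an illustration only (the tablespace name, file path, and size below are hypothetical), a dedicated tablespace for fact data might be created as follows; the BIGFILE keyword is one way to take advantage of ultralarge data files:

-- A single ultralarge datafile backing a tablespace reserved for fact tables.
CREATE BIGFILE TABLESPACE sales_data
  DATAFILE '/u01/oradata/dwh/sales_data01.dbf' SIZE 50G;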
See Also: Chapter 4, "Hardware and I/O Considerations in Data Warehouses"
Partitioning large tables improves performance because each partitioned piece is more manageable. Typically, you partition based on transaction dates in a data warehouse. For example, each month, one month's worth of data can be assigned its own partition.
Table Compression
You can save disk space by compressing heap-organized tables. A typical type of heap-organized table you should consider for table compression is partitioned tables. To reduce disk use and memory use (specically, the buffer cache), you can store tables and partitioned tables in a compressed format inside the database. This often leads to a better scaleup for read-only operations. Table compression can also speed up query execution. There is, however, a cost in CPU overhead. Table compression should be used with highly redundant data, such as tables with many foreign keys. You should avoid compressing tables with much update or other DML activity. Although compressed tables or partitions are updatable, there is some overhead in updating these tables, and high update activity may work against compression by causing some space to be wasted.
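A hedged sketch of this approach (table, column, and partition names are illustrative): the table-level COMPRESS keyword causes the partitions below to store their data in compressed format.

-- Compressed, range-partitioned fact table; names and dates are illustrative.
CREATE TABLE sales_history
  (sale_id   NUMBER,
   cust_id   NUMBER,
   sale_date DATE,
   amount    NUMBER(10,2))
  COMPRESS
  PARTITION BY RANGE (sale_date)
  (PARTITION sales_2002 VALUES LESS THAN (TO_DATE('01-JAN-2003','DD-MON-YYYY')),
   PARTITION sales_2003 VALUES LESS THAN (TO_DATE('01-JAN-2004','DD-MON-YYYY')));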
See Also: Chapter 5, "Parallelism and Partitioning in Data Warehouses" and Chapter 15, "Maintaining the Data Warehouse"
Views
A view is a tailored presentation of the data contained in one or more tables or other views. A view takes the output of a query and treats it as a table. Views do not require any space in the database.
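For example, a simple view over the hypothetical sales_history table sketched earlier stores only its defining query, not data:

-- The view occupies no storage; querying it runs its defining query.
CREATE VIEW monthly_sales_v AS
  SELECT TRUNC(sale_date, 'MM') AS sale_month,
         SUM(amount)            AS total_amount
  FROM   sales_history
  GROUP  BY TRUNC(sale_date, 'MM');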
See Also: Oracle Database Concepts
Integrity Constraints
Integrity constraints are used to enforce business rules associated with your database and to prevent having invalid information in the tables. Integrity constraints in data warehousing differ from constraints in OLTP environments. In OLTP environments, they primarily prevent the insertion of invalid data into a record, which is not a big problem in data warehousing environments because accuracy has already been guaranteed. In data warehousing environments, constraints are only used for query rewrite. NOT NULL constraints are particularly common in data warehouses. Under some specific circumstances, constraints need
space in the database. These constraints are in the form of the underlying unique index.
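One common warehouse pattern, sketched here with hypothetical table and constraint names, is to declare a constraint without enforcing it and mark it RELY, so that it costs nothing at load time yet remains usable by query rewrite:

-- The ETL process has already validated the data, so the constraint is
-- declared but not enforced; RELY makes it available to query rewrite.
ALTER TABLE sales_history
  ADD CONSTRAINT sales_cust_fk
  FOREIGN KEY (cust_id) REFERENCES customers (cust_id)
  RELY DISABLE NOVALIDATE;

For query rewrite to trust such unenforced constraints, the QUERY_REWRITE_INTEGRITY parameter typically must be set to a level other than ENFORCED.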
See Also: Chapter 7, "Integrity Constraints" and Chapter 15, "Maintaining the Data Warehouse"
Materialized Views
Materialized views are query results that have been stored in advance so long-running calculations are not necessary when you actually execute your SQL statements. From a physical design point of view, materialized views resemble tables or partitioned tables and behave like indexes in that they are used transparently and improve performance.
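As a minimal sketch (assuming the sh sample schema), the following materialized view pre-computes a monthly total; because it is created with ENABLE QUERY REWRITE, suitable queries can be redirected to it transparently:

-- Pre-computed summary; the optimizer can rewrite matching queries to use it.
CREATE MATERIALIZED VIEW sales_by_month_mv
  BUILD IMMEDIATE
  ENABLE QUERY REWRITE
  AS
  SELECT t.calendar_month_desc,
         SUM(s.amount_sold) AS total_sales
  FROM   sales s, times t
  WHERE  s.time_id = t.time_id
  GROUP  BY t.calendar_month_desc;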
See Also: Chapter 10, "Dimensions" and Chapter 8, "Basic Materialized Views"
Dimensions
A dimension is a schema object that defines hierarchical relationships between columns or column sets. A hierarchical relationship is a functional dependency from one level of a hierarchy to the next one. A dimension is a container of logical relationships and does not require any space in the database. A typical dimension is city, state (or province), region, and country.
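The following sketch declares such a dimension; it assumes a customers table with cust_id, cust_city, cust_state_province, and country_id columns (as in the sh sample schema) and records the rollup from customer through city and state to country:

-- The dimension stores only the hierarchical relationships; it holds no data.
CREATE DIMENSION customers_dim
  LEVEL customer IS (customers.cust_id)
  LEVEL city     IS (customers.cust_city)
  LEVEL state    IS (customers.cust_state_province)
  LEVEL country  IS (customers.country_id)
  HIERARCHY geog_rollup (
    customer CHILD OF
    city     CHILD OF
    state    CHILD OF
    country);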
See Also: Chapter 10, "Dimensions"
4
Hardware and I/O Considerations in Data Warehouses
This chapter explains some of the hardware and I/O issues in a data warehousing environment and includes the following topics:
Configure I/O for Bandwidth not Capacity
Stripe Far and Wide
Use Redundancy
Test the I/O System Before Building the Database
Plan for Growth
The I/O configuration used by a data warehouse will depend on the characteristics of the specific storage and server capabilities, so the material in this chapter is only intended to provide guidelines for designing and tuning an I/O system.
See Also: Oracle Database Performance Tuning Guide for additional
As an example, consider a 200GB data mart. Using 72GB drives, this data mart could be built with as few as six drives in a fully-mirrored environment. However, six drives might not provide enough I/O bandwidth to handle a medium number of concurrent users on a 4-CPU server. Thus, even though six drives provide sufficient storage, a larger number of drives may be required to provide acceptable performance for this system. While it may not be practical to estimate the I/O bandwidth that will be required by a data warehouse before a system is built, it is generally practical, with the guidance of the hardware manufacturer, to estimate how much I/O bandwidth a given server can potentially utilize and to ensure that the selected I/O configuration will be able to successfully feed the server. There are many variables in sizing the I/O systems, but one basic rule of thumb is that your data warehouse system should have multiple disks for each CPU (at least two disks for each CPU at a bare minimum) in order to achieve optimal performance.
Use Redundancy
Because data warehouses are often the largest database systems in a company, they have the most disks and thus are also the most susceptible to the failure of a single disk. Therefore, disk redundancy is a requirement for data warehouses to protect against a hardware failure. Like disk-striping, redundancy can be achieved in many ways using software or hardware. A key consideration is that occasionally a balance must be made between redundancy and performance. For example, a storage system in a RAID-5 configuration may be less expensive than a RAID-0+1 configuration, but it may not perform as well, either. Redundancy is necessary for any data warehouse, but the approach to redundancy may vary depending upon the performance and cost constraints of each data warehouse.
Storage Management
Two features to consider for managing disks are Oracle Managed Files and Automatic Storage Management. Without these features, a database administrator must manage the database files, which, in a data warehouse, can be hundreds or even thousands of files. Oracle Managed Files simplifies the administration of a database by providing functionality to automatically create and manage files, so the database administrator no longer needs to manage each database file. Automatic Storage Management provides additional functionality for managing not only files but also the disks. With Automatic Storage Management, the database administrator would administer a small number of disk groups. Automatic Storage Management handles the tasks of striping and providing disk redundancy, including rebalancing the database files when new disks are added to the system.
See Also:
5
Parallelism and Partitioning in Data Warehouses
Data warehouses often contain large tables and require techniques both for managing these large tables and for providing good query performance across these large tables. This chapter discusses two key methodologies for addressing these needs: parallelism and partitioning. These topics are discussed:
Large table scans and joins
Creation of large indexes
Partitioned index scans
Bulk inserts, updates, and deletes
Aggregations and copying
You can also use parallel execution to access object types within an Oracle database. For example, use parallel execution to access LOBs (large objects). Parallel execution benefits systems that have all of the following characteristics:
Symmetric multi-processors (SMP), clusters, or massively parallel systems
Sufficient I/O bandwidth
Underutilized or intermittently used CPUs (for example, systems where CPU usage is typically less than 30%)
Sufficient memory to support additional memory-intensive processes such as sorts, hashing, and I/O buffers
If your system lacks any of these characteristics, parallel execution might not significantly improve performance. In fact, parallel execution can reduce system performance on overutilized systems or systems with small I/O bandwidth.
See Also: Chapter 24, "Using Parallel Execution" for further information regarding parallel execution requirements
Granules of Parallelism
Different parallel operations use different types of parallelism. The optimal physical database layout depends on the parallel operations that are most prevalent in your application, and even on whether partitions are necessary at all. The basic unit of work in parallelism is called a granule. Oracle Database divides the operation being parallelized (for example, a table scan, table update, or index creation) into granules. Parallel execution processes execute the operation one granule at a time. The number of granules and their size correlates with the degree of parallelism (DOP). It also affects how well the work is balanced across query server processes. There is no way you can enforce a specific granule strategy as Oracle Database makes this decision internally.
Administrative considerations (such as loading and deleting portions of data) might influence partition layout more than performance considerations.
Partition Granules
When partition granules are used, a query server process works on an entire partition or subpartition of a table or index. Because partition granules are statically determined by the structure of the table or index when a table or index is created, partition granules do not give you the flexibility in parallelizing an operation that block granules do. The maximum allowable DOP is the number of partitions. This might limit the utilization of the system and the load balancing across parallel execution servers. When partition granules are used for parallel access to a table or index, you should use a relatively large number of partitions (ideally, three times the DOP), so that Oracle can effectively balance work across the query server processes. Partition granules are the basic unit of parallel index range scans and of parallel operations that modify multiple partitions of a partitioned table or index. These operations include parallel creation of partitioned indexes, and parallel creation of partitioned tables.
See Also: Oracle Database Concepts for information on disk
Types of Partitioning
This section describes the partitioning features that significantly enhance data access and improve overall application performance. This is especially true for applications that access tables and indexes with millions of rows and many gigabytes of data.
Partitioned tables and indexes facilitate administrative operations by enabling these operations to work on subsets of data. For example, you can add a new partition, organize an existing partition, or drop a partition and cause less than a second of interruption to a read-only application. Using the partitioning methods described in this section can help you tune SQL statements to avoid unnecessary index and table scans (using partition pruning). You can also improve the performance of massive join operations when large amounts of data (for example, several million rows) are joined together by using partition-wise joins. Finally, partitioning data greatly improves manageability of very large databases and dramatically reduces the time required for administrative tasks such as backup and restore. Granularity can easily be added to or removed from the partitioning scheme by splitting partitions. Thus, if a table's data is skewed to fill some partitions more than others, the ones that contain more data can be split to achieve a more even distribution. Partitioning also allows one to swap partitions with a table. By being able to easily add, remove, or swap a large amount of data quickly, swapping can be used to keep a large amount of data that is being loaded inaccessible until loading is completed, or can be used as a way to stage data between different phases of use. Some examples are current day's transactions or online archives.
See Also: Oracle Database Concepts for an introduction to the ideas behind partitioning
Partitioning Methods
Oracle offers four partitioning methods:
Range partitioning
Hash partitioning
List partitioning
Composite partitioning
Each partitioning method has different advantages and design considerations. Thus, each method is more appropriate for a particular situation.

Range Partitioning
Range partitioning maps data to partitions based on ranges of partition key values that you establish for each partition. It is the most common type of partitioning and is often used with dates. For example, you might want to partition sales data into monthly partitions.
Range partitioning maps rows to partitions based on ranges of column values. Range partitioning is defined by the partitioning specification for a table or index in PARTITION BY RANGE(column_list) and by the partitioning specifications for each individual partition in VALUES LESS THAN(value_list), where column_list is an ordered list of columns that determines the partition to which a row or an index entry belongs. These columns are called the partitioning columns. The values in the partitioning columns of a particular row constitute that row's partitioning key. value_list is an ordered list of values for the columns in the column list. Each value must be either a literal or a TO_DATE or RPAD function with constant arguments. Only the VALUES LESS THAN clause is allowed. This clause specifies a non-inclusive upper bound for the partitions. All partitions, except the first, have an implicit low value specified by the VALUES LESS THAN literal on the previous partition. Any binary values of the partition key equal to or higher than this literal are added to the next higher partition. The highest partition is the one whose bound is the MAXVALUE literal. The keyword MAXVALUE represents a virtual infinite value that sorts higher than any other value for the data type, including the null value. The following statement creates a table sales_range that is range partitioned on the sales_date field:
CREATE TABLE sales_range
  (salesman_id   NUMBER(5),
   salesman_name VARCHAR2(30),
   sales_amount  NUMBER(10),
   sales_date    DATE)
  COMPRESS
  PARTITION BY RANGE(sales_date)
  (PARTITION sales_jan2000 VALUES LESS THAN (TO_DATE('01-FEB-2000','DD-MON-YYYY')),
   PARTITION sales_feb2000 VALUES LESS THAN (TO_DATE('01-MAR-2000','DD-MON-YYYY')),
   PARTITION sales_mar2000 VALUES LESS THAN (TO_DATE('01-APR-2000','DD-MON-YYYY')),
   PARTITION sales_apr2000 VALUES LESS THAN (TO_DATE('01-MAY-2000','DD-MON-YYYY')));
Note: This table was created with the COMPRESS keyword, thus all partitions inherit this attribute.
See Also: Oracle Database SQL Reference for partitioning syntax and the Oracle Database Administrator's Guide for more examples
Hash Partitioning
Hash partitioning maps data to partitions based on a hashing algorithm that Oracle applies to a partitioning key that you identify. The hashing algorithm evenly distributes rows among partitions, giving partitions approximately the same size. Hash partitioning is the ideal method for distributing data evenly across devices. Hash partitioning is a good and easy-to-use alternative to range partitioning when data is not historical and there is no obvious column or column list where logical range partition pruning can be advantageous. Oracle Database uses a linear hashing algorithm, and to prevent data from clustering within specific partitions, you should define the number of partitions by a power of two (for example, 2, 4, 8). The following statement creates a table sales_hash, which is hash partitioned on the salesman_id field:
CREATE TABLE sales_hash (salesman_id NUMBER(5), salesman_name VARCHAR2(30), sales_amount NUMBER(10), week_no NUMBER(2)) PARTITION BY HASH(salesman_id) PARTITIONS 4;
See Also: Oracle Database SQL Reference for partitioning syntax and the Oracle Database Administrator's Guide for more examples
List Partitioning
List partitioning enables you to explicitly control how rows map to partitions. You do this by specifying a list of discrete values for the partitioning column in the description for each partition. This is different from range partitioning, where a range of values is associated with a partition, and from hash partitioning, where you have no control of the row-to-partition mapping. The advantage of list partitioning is that you can group and organize unordered and unrelated sets of data in a natural way. The following example creates a list partitioned table grouping states according to their sales regions:
CREATE TABLE sales_list (salesman_id NUMBER(5), salesman_name VARCHAR2(30), sales_state VARCHAR2(20),
sales_amount NUMBER(10), sales_date DATE) PARTITION BY LIST(sales_state) (PARTITION sales_west VALUES('California', 'Hawaii') COMPRESS, PARTITION sales_east VALUES('New York', 'Virginia', 'Florida'), PARTITION sales_central VALUES('Texas', 'Illinois'));
Partition sales_west is furthermore created as a single compressed partition within sales_list. For details about partitioning and compression, see "Partitioning and Table Compression" on page 5-16. An additional capability with list partitioning is that you can use a default partition, so that all rows that do not map to any other partition do not generate an error. For example, modifying the previous example, you can create a default partition as follows:
CREATE TABLE sales_list (salesman_id NUMBER(5), salesman_name VARCHAR2(30), sales_state VARCHAR2(20), sales_amount NUMBER(10), sales_date DATE) PARTITION BY LIST(sales_state) (PARTITION sales_west VALUES('California', 'Hawaii'), PARTITION sales_east VALUES ('New York', 'Virginia', 'Florida'), PARTITION sales_central VALUES('Texas', 'Illinois'), PARTITION sales_other VALUES(DEFAULT));
See Also: Oracle Database SQL Reference for partitioning syntax, "Partitioning and Table Compression" on page 5-16 for information regarding data segment compression, and the Oracle Database Administrator's Guide for more examples
Composite Partitioning Composite partitioning combines range and hash or list partitioning. Oracle Database first distributes data into partitions according to boundaries established by the partition ranges. Then, for range-hash partitioning, Oracle uses a hashing algorithm to further divide the data into subpartitions within each range partition. For range-list partitioning, Oracle divides the data into subpartitions within each range partition based on the explicit list you chose.
Index Partitioning
You can choose whether or not an index inherits the partitioning strategy of the underlying table. You can create both local and global indexes on a table partitioned by range, hash, or composite methods. Local indexes inherit the partitioning attributes of their related tables. For example, if you create a local index on a composite table, Oracle automatically partitions the local index using the composite method. See Chapter 6, "Indexes" for more information.
The following sections describe when to use each partitioning method:
- When to Use Range Partitioning
- When to Use Hash Partitioning
- When to Use List Partitioning
- When to Use Composite Range-Hash Partitioning
- When to Use Composite Range-List Partitioning
When to Use Range Partitioning Range partitioning is a convenient method for partitioning historical data. The boundaries of range partitions define the ordering of the partitions in the tables or indexes. Range partitioning organizes data by time intervals on a column of type DATE. Thus, most SQL statements accessing range partitions focus on timeframes. An example of this is a SQL statement similar to "select data from a particular period in time." In such a scenario, if each partition represents data for one month, the query "find data of month 98-DEC" needs to access only the December partition of year 98. This reduces the amount of data scanned to a fraction of the total data available, an optimization method called partition pruning.

Range partitioning is also ideal when you periodically load new data and purge old data. It is easy to add or drop partitions. It is common to keep a rolling window of data, for example keeping the past 36 months' worth of data online. Range partitioning simplifies this process. To add data from a new month, you load it into a separate table, clean it, index it, and then add it to the range-partitioned table using the EXCHANGE PARTITION statement, all while the original table remains online. Once you add the new partition, you can drop the trailing month with the DROP PARTITION statement. The alternative to using the DROP PARTITION statement can be to archive the partition and make it read only, but this works only when your partitions are in separate tablespaces.
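The following is a minimal sketch of this rolling window maintenance, assuming a range-partitioned table named sales, a cleaned and indexed staging table named sales_jan2004_stage, and a trailing partition named sales_dec2000 (all names are illustrative):

-- Add a partition for the new month's data.
ALTER TABLE sales ADD PARTITION sales_jan2004
  VALUES LESS THAN (TO_DATE('01-FEB-2004', 'DD-MON-YYYY'));

-- Swap the prepared staging table into the new partition; sales stays online.
ALTER TABLE sales EXCHANGE PARTITION sales_jan2004
  WITH TABLE sales_jan2004_stage INCLUDING INDEXES;

-- Age out the trailing month.
ALTER TABLE sales DROP PARTITION sales_dec2000;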
Consider using range partitioning when:
- Very large tables are frequently scanned by a range predicate on a good partitioning column, such as ORDER_DATE or PURCHASE_DATE. Partitioning the table on that column enables partition pruning.
- You want to maintain a rolling window of data.
- You cannot complete administrative operations, such as backup and restore, on large tables in an allotted time frame, but you can divide them into smaller logical pieces based on the partition range column.
The following example creates the table salestable for a period of two years, 1999 and 2000, and partitions it by range according to the column s_saledate to separate the data into eight quarters, each corresponding to a partition.
CREATE TABLE salestable
  (s_productid  NUMBER,
   s_saledate   DATE,
   s_custid     NUMBER,
   s_totalprice NUMBER)
PARTITION BY RANGE(s_saledate)
  (PARTITION sal99q1 VALUES LESS THAN (TO_DATE('01-APR-1999', 'DD-MON-YYYY')),
   PARTITION sal99q2 VALUES LESS THAN (TO_DATE('01-JUL-1999', 'DD-MON-YYYY')),
   PARTITION sal99q3 VALUES LESS THAN (TO_DATE('01-OCT-1999', 'DD-MON-YYYY')),
   PARTITION sal99q4 VALUES LESS THAN (TO_DATE('01-JAN-2000', 'DD-MON-YYYY')),
   PARTITION sal00q1 VALUES LESS THAN (TO_DATE('01-APR-2000', 'DD-MON-YYYY')),
   PARTITION sal00q2 VALUES LESS THAN (TO_DATE('01-JUL-2000', 'DD-MON-YYYY')),
   PARTITION sal00q3 VALUES LESS THAN (TO_DATE('01-OCT-2000', 'DD-MON-YYYY')),
   PARTITION sal00q4 VALUES LESS THAN (TO_DATE('01-JAN-2001', 'DD-MON-YYYY')));
When to Use Hash Partitioning The way Oracle Database distributes data in hash partitions does not correspond to a business or a logical view of the data, as it does in range partitioning. Consequently, hash partitioning is not an effective way to manage historical data. However, hash partitions share some performance characteristics with range partitions. For example, partition pruning is limited to equality predicates. You can also use partition-wise joins, parallel index access, and parallel DML. See "Partition-Wise Joins" on page 5-20 for more information. As a general rule, use hash partitioning for these purposes:
- To improve the availability and manageability of large tables or to enable parallel DML in tables that do not store historical data.
- To avoid data skew among partitions. Hash partitioning is an effective means of distributing data because Oracle hashes the data into a number of partitions, each of which can reside on a separate device. Thus, data is evenly spread over a sufficient number of devices to maximize I/O throughput. Similarly, you can use hash partitioning to distribute data evenly among the nodes of an MPP platform that uses Oracle Real Application Clusters.
- If it is important to use partition pruning and partition-wise joins according to a partitioning key that is mostly constrained by a distinct value or value list.
Note: In hash partitioning, partition pruning uses only equality or IN-list predicates.

If you add or merge a hashed partition, Oracle automatically rearranges the rows to reflect the change in the number of partitions and subpartitions. The hash function that Oracle uses is especially designed to limit the cost of this reorganization. Instead of reshuffling all the rows in the table, Oracle uses an "add partition" logic that splits one and only one of the existing hashed partitions. Conversely, Oracle coalesces a partition by merging two existing hashed partitions. Although the hash function's use of "add partition" logic dramatically improves the manageability of hash partitioned tables, it means that the hash function can cause skew if the number of partitions of a hash partitioned table, or the number of subpartitions in each partition of a composite table, is not a power of two. In the worst case, the largest partition can be twice the size of the smallest. So for optimal performance, create a number of partitions and subpartitions for each partition that is a power of two, for example, 2, 4, 8, 16, 32, 64, 128, and so on. The following example creates four hashed partitions for the table sales_hash using the column s_productid as the partition key:
CREATE TABLE sales_hash (s_productid NUMBER, s_saledate DATE, s_custid NUMBER, s_totalprice NUMBER) PARTITION BY HASH(s_productid) PARTITIONS 4;
Specify partition names if you want to choose the names of the partitions. Otherwise, Oracle automatically generates internal names for the partitions. Also, you can use the STORE IN clause to assign hash partitions to tablespaces in a round-robin manner.
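For example, the following sketches show both approaches (table, partition, and tablespace names are illustrative):

-- Explicitly named hash partitions, each placed in its own tablespace.
CREATE TABLE sales_hash_named
  (s_productid NUMBER, s_saledate DATE, s_custid NUMBER, s_totalprice NUMBER)
PARTITION BY HASH (s_productid)
  (PARTITION p1 TABLESPACE tbs1,
   PARTITION p2 TABLESPACE tbs2,
   PARTITION p3 TABLESPACE tbs3,
   PARTITION p4 TABLESPACE tbs4);

-- System-named partitions assigned to tablespaces in a round-robin manner.
CREATE TABLE sales_hash_stored
  (s_productid NUMBER, s_saledate DATE, s_custid NUMBER, s_totalprice NUMBER)
PARTITION BY HASH (s_productid)
PARTITIONS 4 STORE IN (tbs1, tbs2);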
See Also: Oracle Database SQL Reference for partitioning syntax and the Oracle Database Administrator's Guide for more examples
When to Use List Partitioning You should use list partitioning when you want to specifically map rows to partitions based on discrete values. Unlike range and hash partitioning, multi-column partition keys are not supported for list partitioning. If a table is partitioned by list, the partitioning key can only consist of a single column of the table.

When to Use Composite Range-Hash Partitioning Composite range-hash partitioning offers the benefits of both range and hash partitioning. With composite range-hash partitioning, Oracle first partitions by range. Then, within each range, Oracle creates subpartitions and distributes data within them using the same hashing algorithm it uses for hash partitioned tables. Data placed in composite partitions is logically ordered only by the boundaries that define the range level partitions. The partitioning of data within each partition has no logical organization beyond the identity of the partition to which the subpartitions belong. Consequently, tables and local indexes partitioned using the composite range-hash method:
- Support historical data at the partition level.
- Support the use of subpartitions as units of parallelism for parallel operations such as PDML or space management and backup and recovery.
- Are eligible for partition pruning and partition-wise joins on the range and hash partitions.
Using Composite Range-Hash Partitioning Use the composite range-hash partitioning method for tables and local indexes if:
- Partitions must have a logical meaning to efficiently support historical data.
- The contents of a partition can be spread across multiple tablespaces, devices, or nodes (of an MPP system).
- You require both partition pruning and partition-wise joins even when the pruning and join predicates use different columns of the partitioned table.
- You require a degree of parallelism that is greater than the number of partitions for backup, recovery, and parallel operations.
Most large tables in a data warehouse should use range partitioning. Composite partitioning should be used for very large tables or for data warehouses with a well-defined need for these conditions. When using the composite method, Oracle stores each subpartition on a different segment. Thus, the subpartitions may have properties that differ from the properties of the table or from the partition to which the subpartitions belong.

The following example partitions the table sales_range_hash by range on the column s_saledate to create four partitions that order data by time. Then, within each range partition, the data is further subdivided into eight subpartitions by hash on the column s_productid:
CREATE TABLE sales_range_hash(
  s_productid  NUMBER,
  s_saledate   DATE,
  s_custid     NUMBER,
  s_totalprice NUMBER)
PARTITION BY RANGE (s_saledate)
SUBPARTITION BY HASH (s_productid) SUBPARTITIONS 8
 (PARTITION sal99q1 VALUES LESS THAN (TO_DATE('01-APR-1999', 'DD-MON-YYYY')),
  PARTITION sal99q2 VALUES LESS THAN (TO_DATE('01-JUL-1999', 'DD-MON-YYYY')),
  PARTITION sal99q3 VALUES LESS THAN (TO_DATE('01-OCT-1999', 'DD-MON-YYYY')),
  PARTITION sal99q4 VALUES LESS THAN (TO_DATE('01-JAN-2000', 'DD-MON-YYYY')));
Each hashed subpartition contains sales data for a single quarter ordered by product code. The total number of subpartitions is 4 x 8, or 32. In addition to this syntax, you can create subpartitions by using a subpartition template. A template makes it easier to name subpartitions and to control their tablespace placement. The following statement illustrates this:
CREATE TABLE sales_range_hash(
  s_productid  NUMBER,
  s_saledate   DATE,
  s_custid     NUMBER,
  s_totalprice NUMBER)
PARTITION BY RANGE (s_saledate)
SUBPARTITION BY HASH (s_productid)
SUBPARTITION TEMPLATE(
  SUBPARTITION sp1 TABLESPACE tbs1,
  SUBPARTITION sp2 TABLESPACE tbs2,
  SUBPARTITION sp3 TABLESPACE tbs3,
  SUBPARTITION sp4 TABLESPACE tbs4,
  SUBPARTITION sp5 TABLESPACE tbs5,
  SUBPARTITION sp6 TABLESPACE tbs6,
  SUBPARTITION sp7 TABLESPACE tbs7,
  SUBPARTITION sp8 TABLESPACE tbs8)
 (PARTITION sal99q1 VALUES LESS THAN (TO_DATE('01-APR-1999', 'DD-MON-YYYY')),
  PARTITION sal99q2 VALUES LESS THAN (TO_DATE('01-JUL-1999', 'DD-MON-YYYY')),
  PARTITION sal99q3 VALUES LESS THAN (TO_DATE('01-OCT-1999', 'DD-MON-YYYY')),
  PARTITION sal99q4 VALUES LESS THAN (TO_DATE('01-JAN-2000', 'DD-MON-YYYY')));
In this example, every partition has the same number of subpartitions. A sample mapping for sal99q1 is illustrated in Table 5-1. Similar mappings exist for sal99q2 through sal99q4.
Table 5-1 Subpartition Mapping

Subpartition    Tablespace
sal99q1_sp1     tbs1
sal99q1_sp2     tbs2
sal99q1_sp3     tbs3
sal99q1_sp4     tbs4
sal99q1_sp5     tbs5
sal99q1_sp6     tbs6
sal99q1_sp7     tbs7
sal99q1_sp8     tbs8
See Also: Oracle Database SQL Reference for details regarding syntax and restrictions
When to Use Composite Range-List Partitioning Composite range-list partitioning offers the benefits of both range and list partitioning. With composite range-list partitioning, Oracle first partitions by range. Then, within each range, Oracle creates subpartitions and distributes data within them to organize sets of data in a natural way as assigned by the list. Data placed in composite partitions is logically ordered only by the boundaries that define the range level partitions.

Using Composite Range-List Partitioning Use the composite range-list partitioning method for tables and local indexes if:
- The contents of a partition can be spread across multiple tablespaces, devices, or nodes (of an MPP system).
- You require both partition pruning and partition-wise joins even when the pruning and join predicates use different columns of the partitioned table.
- You require a degree of parallelism that is greater than the number of partitions for backup, recovery, and parallel operations.
Most large tables in a data warehouse should use range partitioning. Composite partitioning should be used for very large tables or for data warehouses with a well-defined need for these conditions. When using the composite method, Oracle stores each subpartition on a different segment. Thus, the subpartitions may have properties that differ from the properties of the table or from the partition to which the subpartitions belong. This statement creates a table quarterly_regional_sales that is range partitioned on the txn_date field and list subpartitioned on state.
CREATE TABLE quarterly_regional_sales (deptno NUMBER, item_no VARCHAR2(20), txn_date DATE, txn_amount NUMBER, state VARCHAR2(2)) PARTITION BY RANGE (txn_date) SUBPARTITION BY LIST (state) ( PARTITION q1_1999 VALUES LESS THAN(TO_DATE('1-APR-1999','DD-MON-YYYY')) (SUBPARTITION q1_1999_northwest VALUES ('OR', 'WA'), SUBPARTITION q1_1999_southwest VALUES ('AZ', 'UT', 'NM'), SUBPARTITION q1_1999_northeast VALUES ('NY', 'VM', 'NJ'), SUBPARTITION q1_1999_southeast VALUES ('FL', 'GA'), SUBPARTITION q1_1999_northcentral VALUES ('SD', 'WI'), SUBPARTITION q1_1999_southcentral VALUES ('NM', 'TX')), PARTITION q2_1999 VALUES LESS THAN(TO_DATE('1-JUL-1999','DD-MON-YYYY')) (SUBPARTITION q2_1999_northwest VALUES ('OR', 'WA'), SUBPARTITION q2_1999_southwest VALUES ('AZ', 'UT', 'NM'), SUBPARTITION q2_1999_northeast VALUES ('NY', 'VM', 'NJ'), SUBPARTITION q2_1999_southeast VALUES ('FL', 'GA'), SUBPARTITION q2_1999_northcentral VALUES ('SD', 'WI'), SUBPARTITION q2_1999_southcentral VALUES ('NM', 'TX')), PARTITION q3_1999 VALUES LESS THAN (TO_DATE('1-OCT-1999','DD-MON-YYYY')) (SUBPARTITION q3_1999_northwest VALUES ('OR', 'WA'), SUBPARTITION q3_1999_southwest VALUES ('AZ', 'UT', 'NM'), SUBPARTITION q3_1999_northeast VALUES ('NY', 'VM', 'NJ'), SUBPARTITION q3_1999_southeast VALUES ('FL', 'GA'), SUBPARTITION q3_1999_northcentral VALUES ('SD', 'WI'), SUBPARTITION q3_1999_southcentral VALUES ('NM', 'TX')),
PARTITION q4_1999 VALUES LESS THAN (TO_DATE('1-JAN-2000','DD-MON-YYYY')) (SUBPARTITION q4_1999_northwest VALUES('OR', 'WA'), SUBPARTITION q4_1999_southwest VALUES('AZ', 'UT', 'NM'), SUBPARTITION q4_1999_northeast VALUES('NY', 'VM', 'NJ'), SUBPARTITION q4_1999_southeast VALUES('FL', 'GA'), SUBPARTITION q4_1999_northcentral VALUES ('SD', 'WI'), SUBPARTITION q4_1999_southcentral VALUES ('NM', 'TX')));
You can create subpartitions in a composite partitioned table using a subpartition template. A subpartition template simplifies the specification of subpartitions by not requiring that a subpartition descriptor be specified for every partition in the table. Instead, you describe subpartitions only once in a template, then apply that subpartition template to every partition in the table. The following statement illustrates an example where you can choose the subpartition name and tablespace locations:
CREATE TABLE quarterly_regional_sales (deptno NUMBER, item_no VARCHAR2(20), txn_date DATE, txn_amount NUMBER, state VARCHAR2(2)) PARTITION BY RANGE (txn_date) SUBPARTITION BY LIST (state) SUBPARTITION TEMPLATE( SUBPARTITION northwest VALUES ('OR', 'WA') TABLESPACE ts1, SUBPARTITION southwest VALUES ('AZ', 'UT', 'NM') TABLESPACE ts2, SUBPARTITION northeast VALUES ('NY', 'VM', 'NJ') TABLESPACE ts3, SUBPARTITION southeast VALUES ('FL', 'GA') TABLESPACE ts4, SUBPARTITION northcentral VALUES ('SD', 'WI') TABLESPACE ts5, SUBPARTITION southcentral VALUES ('NM', 'TX') TABLESPACE ts6) ( PARTITION q1_1999 VALUES LESS THAN(TO_DATE('1-APR-1999','DD-MON-YYYY')), PARTITION q2_1999 VALUES LESS THAN(TO_DATE('1-JUL-1999','DD-MON-YYYY')), PARTITION q3_1999 VALUES LESS THAN(TO_DATE('1-OCT-1999','DD-MON-YYYY')), PARTITION q4_1999 VALUES LESS THAN(TO_DATE('1-JAN-2000','DD-MON-YYYY')));
See Also: Oracle Database SQL Reference for details regarding syntax and restrictions
Partitioning and Table Compression

You can compress several partitions or a complete partitioned heap-organized table. You do this either by defining a complete partitioned table as being compressed or by defining compression on a per-partition level. Partitions without a specific compression declaration inherit the attribute from the table definition or, if nothing is specified at the table level, from the tablespace definition. The decision whether a partition should be compressed or stay uncompressed follows the same rules as for a nonpartitioned table. However, because range and composite partitioning can separate data logically into distinct partitions, such a partitioned table is an ideal candidate for compressing the parts of the data (partitions) that are mainly read-only. It is, for example, beneficial in all rolling window operations as a kind of intermediate stage before aging out old data. With data segment compression, you can keep more old data online, minimizing the burden of additional storage consumption.

You can also change any existing uncompressed table partition later on, add new compressed and uncompressed partitions, or change the compression attribute as part of any partition maintenance operation that requires data movement, such as MERGE PARTITION, SPLIT PARTITION, or MOVE PARTITION. The partitions can contain data or can be empty.

The access and maintenance of a partially or fully compressed partitioned table are the same as for a fully uncompressed partitioned table. Everything that applies to fully uncompressed partitioned tables is also valid for partially or fully compressed partitioned tables.
See Also: Chapter 3, "Physical Design in Data Warehouses" for a generic discussion of table compression, Chapter 15, "Maintaining the Data Warehouse" for a sample rolling window operation with a range-partitioned table, and Oracle Database Performance Tuning Guide for an example of calculating the compression ratio
To use table compression on a partitioned table with bitmap indexes, you must do the following before you introduce the compression attribute for the first time:
1. Mark the bitmap indexes unusable.
2. Set the compression attribute.
3. Rebuild the indexes.
The first time you make a compressed partition part of an existing, fully uncompressed partitioned table, you must either drop all existing bitmap indexes or mark them UNUSABLE before adding the compressed partition. This must be done irrespective of whether any partition contains data, and it is independent of the operation that causes one or more compressed partitions to become part of the table. It does not apply to a partitioned table having only B-tree indexes. This rebuilding of the bitmap index structures is necessary to accommodate the potentially higher number of rows stored for each data block with table compression enabled, and it needs to be done only the first time. All subsequent operations, whether they affect compressed or uncompressed partitions, or change the compression attribute, behave identically for uncompressed, partially compressed, or fully compressed partitioned tables.

To avoid the re-creation of any bitmap index structure, Oracle recommends creating every partitioned table with at least one compressed partition whenever you plan to partially or fully compress the partitioned table in the future. This compressed partition can stay empty or can even be dropped after the table creation.

Having a partitioned table with compressed partitions can lead to slightly larger bitmap index structures for the uncompressed partitions. The bitmap index structures for the compressed partitions, however, are in most cases smaller than the corresponding bitmap index structures before table compression. This depends strongly on the achieved compression rates.
Note: Oracle Database will raise an error if compression is introduced to an object for the first time and there are usable bitmap index segments.
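For example, a statement along the following lines (a sketch, assuming an archive tablespace named ts_arch) moves a partition and compresses its data:

-- Move the partition into the archive tablespace and compress it.
ALTER TABLE sales MOVE PARTITION sales_q1_1998 TABLESPACE ts_arch COMPRESS;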
If you use the MOVE statement, the local indexes for partition sales_q1_1998 become unusable. You have to rebuild them afterward, as follows:
ALTER TABLE sales MODIFY PARTITION sales_q1_1998 REBUILD UNUSABLE LOCAL INDEXES;
The following statement merges two existing partitions into a new, compressed partition residing in a separate tablespace (the target partition and tablespace names here are illustrative). The local bitmap indexes of the resulting partition have to be rebuilt afterward with a MODIFY PARTITION ... REBUILD UNUSABLE LOCAL INDEXES statement, as shown earlier:

ALTER TABLE sales MERGE PARTITIONS sales_q1_1998, sales_q2_1998
  INTO PARTITION sales_q1q2_1998 TABLESPACE ts_arch COMPRESS;
See Also: Oracle Database Performance Tuning Guide for details regarding how to estimate the compression ratio when using table compression
Partition Pruning
Partition pruning is an essential performance feature for data warehouses. In partition pruning, the optimizer analyzes FROM and WHERE clauses in SQL statements to eliminate unneeded partitions when building the partition access list. This enables Oracle Database to perform operations only on those partitions that are relevant to the SQL statement. Oracle prunes partitions when you use range, LIKE, equality, and IN-list predicates on the range or list partitioning columns, and when you use equality and IN-list predicates on the hash partitioning columns.

Partition pruning dramatically reduces the amount of data retrieved from disk and reduces processing time, improving query performance and resource utilization. If you partition the index and table on different columns (with a global, partitioned index), partition pruning also eliminates index partitions even when the partitions of the underlying table cannot be eliminated.

On composite partitioned objects, Oracle can prune at both the range partition level and at the hash or list subpartition level using the relevant predicates. Refer to the table sales_range_hash earlier, partitioned by range on the column s_saledate and subpartitioned by hash on the column s_productid, and consider the following example:
SELECT * FROM sales_range_hash WHERE s_saledate BETWEEN (TO_DATE('01-JUL-1999', 'DD-MON-YYYY')) AND (TO_DATE('01-OCT-1999', 'DD-MON-YYYY')) AND s_productid = 1200;
Oracle uses the predicate on the partitioning columns to perform partition pruning as follows:
- When using range partitioning, Oracle accesses only partitions sal99q2 and sal99q3.
- When using hash subpartitioning, Oracle accesses only the one subpartition in each partition that stores the rows with s_productid=1200. The mapping between the subpartition and the predicate is calculated based on Oracle's internal hash distribution function.
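The optimizer can also prune when the date format in the predicate differs from the one used in the partition bounds. A sketch against the same sales_range_hash table:

SELECT * FROM sales_range_hash
WHERE s_saledate BETWEEN TO_DATE('01-JUL-99', 'DD-MON-RR')
                     AND TO_DATE('01-OCT-99', 'DD-MON-RR')
  AND s_productid = 1200;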
Although this query uses the DD-MON-RR format, which is not the same as that of the partition bounds, the optimizer can still prune properly. If you execute an EXPLAIN PLAN statement on the query, the PARTITION_START and PARTITION_STOP columns of the output table do not specify which partitions Oracle is accessing. Instead, you see the keyword KEY for both columns, which means that partition pruning occurs at run time. It can also affect the execution plan, because the information about the pruned partitions is missing compared with the same statement using the same TO_DATE format as the table's partition definition.
Partition-Wise Joins
Partition-wise joins reduce query response time by minimizing the amount of data exchanged among parallel execution servers when joins execute in parallel. This significantly reduces response time and improves the use of both CPU and memory resources. In Oracle Real Application Clusters environments, partition-wise joins also avoid or at least limit the data traffic over the interconnect, which is the key to achieving good scalability for massive join operations. Partition-wise joins can be full or partial. Oracle decides which type of join to use.
Full Partition-Wise Joins

A full partition-wise join divides a large join into smaller joins between pairs of partitions from the two joined tables; to use it, you must equipartition both tables on their join keys. For example, consider a large join between a sales table and a customers table on the column customerid. The query "find the records of all customers who bought more than 100 articles in Quarter 3 of 1999" is a typical example of a SQL statement performing such a join. The following is an example of this:
SELECT c.cust_last_name, COUNT(*) FROM sales s, customers c WHERE s.cust_id = c.cust_id AND s.time_id BETWEEN TO_DATE('01-JUL-1999', 'DD-MON-YYYY') AND (TO_DATE('01-OCT-1999', 'DD-MON-YYYY')) GROUP BY c.cust_last_name HAVING COUNT(*) > 100;
This large join is typical in data warehousing environments. The entire customer table is joined with one quarter of the sales data. In large data warehouse applications, this might mean joining millions of rows. The join method to use in that case is obviously a hash join. You can reduce the processing time for this hash join even more if both tables are equipartitioned on the customerid column. This enables a full partition-wise join. When you execute a full partition-wise join in parallel, the granule of parallelism, as described under "Granules of Parallelism" on page 5-3, is a partition. As a result, the degree of parallelism is limited to the number of partitions. For example, you require at least 16 partitions to set the degree of parallelism of the query to 16. You can use various partitioning methods to equipartition both tables on the column customerid with 16 partitions. These methods are described in these subsections. Hash-Hash This is the simplest method: the customers and sales tables are both partitioned by hash into 16 partitions, on the s_customerid and c_customerid columns. This partitioning method enables full partition-wise join when the tables are joined on c_customerid and s_customerid, both representing the same customer identication number. Because you are using the same hash function to distribute the same information (customer ID) into the same number of hash partitions, you can join the equivalent partitions. They are storing the same values. In serial, this join is performed between pairs of matching hash partitions, one at a time. When one partition pair has been joined, the join of another partition pair begins. The join completes when the 16 partition pairs have been processed.
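A minimal sketch of such an equipartitioned layout (table names and column lists are illustrative and abbreviated):

CREATE TABLE customers_hh
  (c_customerid NUMBER, c_name VARCHAR2(30))
PARTITION BY HASH (c_customerid) PARTITIONS 16;

CREATE TABLE sales_hh
  (s_customerid NUMBER, s_saledate DATE, s_totalprice NUMBER)
PARTITION BY HASH (s_customerid) PARTITIONS 16;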
A pair of matching hash partitions is defined as one partition with the same partition number from each table. For example, with full partition-wise joins we join partition 0 of sales with partition 0 of customers, partition 1 of sales with partition 1 of customers, and so on.

Parallel execution of a full partition-wise join is a straightforward parallelization of the serial execution. Instead of joining one partition pair at a time, 16 partition pairs are joined in parallel by the 16 query servers. Figure 5-1 illustrates the parallel execution of a full partition-wise join.
Figure 5-1 Parallel Execution of a Full Partition-wise Join
[The figure shows the sales and customers tables, each divided into partitions P1 through P16; each of the 16 parallel execution servers joins one matching pair of partitions.]
In Figure 5-1, assume that the degree of parallelism and the number of partitions are the same, in other words, 16 for both. Defining more partitions than the degree of parallelism may improve load balancing and limit possible skew in the execution. If you have more partitions than query servers, when one query server completes the join of one pair of partitions, it requests that the query coordinator give it another pair to join. This process repeats until all pairs have been processed. This method enables the load to be balanced dynamically when the number of partition pairs is greater than the degree of parallelism, for example, 64 partitions with a degree of parallelism of 16.
Note: To guarantee an equal work distribution, the number of partition pairs should be a multiple of the degree of parallelism.
In Oracle Real Application Clusters environments running on shared-nothing or MPP platforms, placing partitions on nodes is critical to achieving good scalability. To avoid remote I/O, both matching partitions should have affinity to the same node. Partition pairs should be spread over all nodes to avoid bottlenecks and to use all CPU resources available on the system. Nodes can host multiple pairs when there are more pairs than nodes. For example, with an 8-node system and 16 partition pairs, each node receives two pairs.

See Also: Oracle Real Application Clusters Deployment and Performance Guide for more information on data affinity
(Composite-Hash)-Hash This method is a variation of the hash-hash method. The sales table is a typical example of a table storing historical data. For all the reasons mentioned under the heading "When to Use Range Partitioning" on page 5-9, range is the logical initial partitioning method. For example, assume you want to partition the sales table into eight partitions by range on the column s_saledate. Also assume you have two years of data and that each partition represents a quarter. Instead of using range partitioning alone, you can use composite partitioning to enable a full partition-wise join while preserving the partitioning on s_saledate. Partition the sales table by range on s_saledate and then subpartition each partition by hash on s_customerid using 16 subpartitions for each partition, for a total of 128 subpartitions. The customers table can still use hash partitioning with 16 partitions.

When you use the method just described, a full partition-wise join works similarly to the one created by the hash-hash method. The join is still divided into 16 smaller joins between hash partition pairs from both tables. The difference is that now each hash partition in the sales table is composed of a set of 8 subpartitions, one from each range partition.

Figure 5-2 illustrates how the hash partitions are formed in the sales table. Each cell represents a subpartition. Each row corresponds to one range partition, for a total of 8 range partitions. Each range partition has 16 subpartitions. Each column corresponds to one hash partition, for a total of 16 hash partitions; each hash partition has 8 subpartitions. Note that hash partitions can be defined only if all partitions have the same number of subpartitions, in this case, 16. Hash partitions are implicit in a composite table. However, Oracle does not record them in the data dictionary, and you cannot manipulate them with DDL commands as you can range partitions.
Figure 5-2 Hash Partitions in a Composite sales Table
[The figure shows an 8-by-16 grid of subpartitions: each of the 8 rows is a range partition and each of the 16 columns is a hash partition; hash partition #9 is highlighted as one column of eight subpartitions.]
(Composite-Hash)-Hash partitioning is effective because it lets you combine pruning (on s_saledate) with a full partition-wise join (on customerid). In the previous example query, pruning is achieved by scanning only the subpartitions corresponding to Q3 of 1999, in other words, row number 3 in Figure 5-2. Oracle then joins these subpartitions with the customers table, using a full partition-wise join. All characteristics of the hash-hash partition-wise join apply to the composite-hash partition-wise join. In particular, for this example, these two points are common to both methods:
- The degree of parallelism for this full partition-wise join cannot exceed 16. Even though the sales table has 128 subpartitions, it has only 16 hash partitions.
- The rules for data placement on MPP systems apply here. The only difference is that a hash partition is now a collection of subpartitions. You must ensure that all these subpartitions are placed on the same node as the matching hash partition from the other table. For example, in Figure 5-2, store hash partition 9 of the sales table, shown by the eight circled subpartitions, on the same node as hash partition 9 of the customers table.
(Composite-List)-List The (Composite-List)-List method resembles that for (Composite-Hash)-Hash partition-wise joins.

Composite-Composite (Hash/List Dimension) If needed, you can also partition the customers table by the composite method. For example, you partition it by range on a postal code column to enable pruning based on postal code. You then subpartition it by hash on customerid using the same number of partitions (16) to enable a partition-wise join on the hash dimension.

Range-Range and List-List You can also join range partitioned tables with range partitioned tables and list partitioned tables with list partitioned tables in a partition-wise manner, but this is relatively uncommon. This is more complex to implement because you must know the distribution of the data before performing the join. Furthermore, if you do not correctly identify the partition bounds so that you have partitions of equal size, data skew during the execution may result. The basic principle for using range-range and list-list is the same as for using hash-hash: you must equipartition both tables. This means that the number of partitions must be the same and the partition bounds must be identical. For example, assume that you know in advance that you have 10 million customers, and that the values for customerid vary from 1 to 10,000,000. In other words, you have 10 million possible different values. To create 16 partitions, you can range partition both tables, sales on s_customerid and customers on c_customerid. You should define partition bounds for both tables in order to generate partitions of the same size. In this example, partition bounds should be defined as 625001, 1250001, 1875001, ..., 10000001, so that each partition contains 625,000 rows. A sketch of this layout appears at the end of this section.

Range-Composite, Composite-Composite (Range Dimension) Finally, you can also subpartition one or both tables on another column. Therefore, the range-composite and composite-composite methods on the range dimension are also valid for enabling a full partition-wise join on the range dimension.
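The following sketch shows the range-range idea described earlier, with only the first three of the 16 partitions shown (table names are illustrative; both tables use identical bounds on the customer key):

CREATE TABLE sales_rr
  (s_customerid NUMBER, s_totalprice NUMBER)
PARTITION BY RANGE (s_customerid)
  (PARTITION cust_p1 VALUES LESS THAN (625001),
   PARTITION cust_p2 VALUES LESS THAN (1250001),
   PARTITION cust_p3 VALUES LESS THAN (1875001));

CREATE TABLE customers_rr
  (c_customerid NUMBER, c_name VARCHAR2(30))
PARTITION BY RANGE (c_customerid)
  (PARTITION cust_p1 VALUES LESS THAN (625001),
   PARTITION cust_p2 VALUES LESS THAN (1250001),
   PARTITION cust_p3 VALUES LESS THAN (1875001));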
Partial Partition-Wise Joins

In a partial partition-wise join, only one of the joined tables (here, sales) needs to be partitioned on the join key. Oracle dynamically repartitions the other table (customers) on the join key: one set of parallel execution servers scans customers and redistributes its rows to a second set of servers, each of which handles one partition of sales. For example, all rows in customers that could have matching rows in partition P1 of sales are sent to query server 1 in the second set. Rows received by the second set of query servers are joined with the rows from the corresponding partitions in sales. Query server number 1 in the second set joins all customers rows that it receives with partition P1 of sales.
Figure 5-3 Partial Partition-Wise Join
[The figure shows one set of parallel execution servers scanning the customers table (SELECT) and redistributing its rows by hash on c_customerid to a second set of servers, each of which joins the received rows with one of the sales partitions P1 through P16.]
Note: This discussion is based on hash partitioning, but it also applies for range-list partial partition-wise joins.

Considerations for full partition-wise joins also apply to partial partition-wise joins:
- The degree of parallelism does not need to equal the number of partitions. In Figure 5-3, the query executes with two sets of 16 query servers. In this case, Oracle assigns 1 partition to each query server of the second set. Again, the number of partitions should always be a multiple of the degree of parallelism.
- In Oracle Real Application Clusters environments on shared-nothing platforms (MPPs), each hash partition of sales should preferably have affinity to only one node in order to avoid remote I/Os. Also, spread partitions over all nodes to avoid bottlenecks and use all CPU resources available on the system. A node can host multiple partitions when there are more partitions than nodes.
See Also: Oracle Real Application Clusters Deployment and Performance Guide for more information on data affinity

Composite As with full partition-wise joins, the prime partitioning method for the sales table is to use the range method on column s_saledate. This is because sales is a typical example of a table that stores historical data. To enable a partial partition-wise join while preserving this range partitioning, subpartition sales by hash on column s_customerid using 16 subpartitions for each partition. Pruning and partial partition-wise joins can be used together if a query joins customers and sales and if the query has a selection predicate on s_saledate. When sales is composite, the granule of parallelism for a partial partition-wise join is a hash partition and not a subpartition. Refer to Figure 5-2 for an illustration of a hash partition in a composite table. Again, the number of hash partitions should be a multiple of the degree of parallelism. Also, on an MPP system, ensure that each hash partition has affinity to a single node. In the previous example, the eight subpartitions composing a hash partition should have affinity to the same node.
Note: This section is based on range-hash, but it also applies for range-list partial partition-wise joins.

Range Finally, you can use range partitioning on s_customerid to enable a partial partition-wise join. This works similarly to the hash method, but a side effect of range partitioning is that the resulting data distribution could be skewed if the size of the partitions differs. Moreover, this method is more complex to implement because it requires prior knowledge of the values of the partitioning column that is also a join key.
Reduction of Communications Overhead When executed in parallel, partition-wise joins reduce communications overhead. This is because, in the default case, parallel execution of a join operation by a set of parallel execution servers requires the redistribution of each table on the join column into disjoint subsets of rows. These disjoint subsets of rows are then joined pair-wise by a single parallel execution server. Oracle can avoid redistributing the partitions because the two tables are already partitioned on the join column. This enables each parallel execution server to join a pair of matching partitions. This improved performance from using parallel execution is even more noticeable in Oracle Real Application Clusters configurations with internode parallel execution. Partition-wise joins dramatically reduce interconnect traffic, which makes this feature particularly valuable for large DSS configurations that use Oracle Real Application Clusters. Currently, most Oracle Real Application Clusters platforms, such as MPP and SMP clusters, provide limited interconnect bandwidths compared with their processing powers. Ideally, interconnect bandwidth should be comparable to disk bandwidth, but this is seldom the case. As a result, most join operations in Oracle Real Application Clusters experience high interconnect latencies without parallel execution of partition-wise joins.

Reduction of Memory Requirements Partition-wise joins require less memory than the equivalent join operation of the complete data set of the tables being joined. In the case of serial joins, the join is performed on one pair of matching partitions at a time. If data is evenly distributed across partitions, the memory requirement is divided by the number of partitions, and there is no skew. In the parallel case, memory requirements depend on the number of partition pairs that are joined in parallel. For example, if the degree of parallelism is 20 and the number of partitions is 100, 5 times less memory is required because only 20 joins of two partitions are performed at the same time. The fact that partition-wise joins require less memory has a direct effect on performance. For example, the join probably does not need to write blocks to disk during the build phase of a hash join.
When executing partition-wise joins in parallel, consider the following:
- In range partitioning where partition sizes differ, data skew increases response time; some parallel execution servers take longer than others to finish their joins. Oracle recommends the use of hash partitioning and subpartitioning to enable partition-wise joins because hash partitioning, if the number of partitions is a power of two, limits the risk of skew.
- The number of partitions used for partition-wise joins should, if possible, be a multiple of the number of query servers. With a degree of parallelism of 16, for example, you can have 16, 32, or even 64 partitions. If the number of partition pairs is not a multiple of the degree of parallelism, some parallel execution servers are used less than others. For example, if there are 17 evenly distributed partition pairs, only one server will work on the last join, while the other servers have to wait. This is because, at the beginning of the execution, each parallel execution server works on a different partition pair. At the end of this first phase, only one pair is left. Thus, a single parallel execution server joins this remaining pair while all other parallel execution servers are idle.
- Sometimes, parallel joins can cause remote I/Os. For example, in Oracle Real Application Clusters environments running on MPP configurations, if a pair of matching partitions is not collocated on the same node, a partition-wise join requires extra internode communication due to remote I/O. This is because Oracle must transfer at least one partition to the node where the join is performed. In this case, it is better to explicitly redistribute the data than to use a partition-wise join.
Partitioning and Subpartitioning Columns and Keys

The partitioning columns (or subpartitioning columns) of a table or index consist of an ordered list of columns whose values determine how the data is partitioned or subpartitioned. A partitioning column cannot be any of the following:
- A LEVEL or ROWID pseudocolumn
- A column of the ROWID datatype
- A nested table, VARRAY, object type, or REF column
- A LOB column (BLOB, CLOB, NCLOB, or BFILE datatype)
A row's partitioning key is an ordered list of its values for the partitioning columns. Similarly, in composite partitioning a row's subpartitioning key is an ordered list of its values for the subpartitioning columns. Oracle applies either the range, list, or hash method to each row's partitioning key or subpartitioning key to determine which partition or subpartition the row belongs in.
Every partition of a range-partitioned table or index has a noninclusive upper bound, which is specified by the VALUES LESS THAN clause. Every partition except the first partition also has an inclusive lower bound, which is specified by the VALUES LESS THAN clause on the next-lower partition.

The partition bounds collectively define an ordering of the partitions in a table or index. The first partition is the partition with the lowest VALUES LESS THAN clause, and the last or highest partition is the partition with the highest VALUES LESS THAN clause.
MAXVALUE
You can specify the keyword MAXVALUE for any value in the partition bound value_list. This keyword represents a virtual infinite value that sorts higher than any other value for the data type, including the NULL value.
For example, you might partition the OFFICE table on STATE (a CHAR(10) column) into three partitions with the following partition bounds:
- VALUES LESS THAN ('I'): States whose names start with A through H
- VALUES LESS THAN ('S'): States whose names start with I through R
- VALUES LESS THAN (MAXVALUE): States whose names start with S through Z, plus special codes for non-U.S. regions
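A minimal sketch of such a table definition (the office_id column is illustrative; only the partitioning clause matters here):

CREATE TABLE office
  (office_id NUMBER,
   state     CHAR(10))
PARTITION BY RANGE (state)
  (PARTITION office_a_h VALUES LESS THAN ('I'),
   PARTITION office_i_r VALUES LESS THAN ('S'),
   PARTITION office_s_z VALUES LESS THAN (MAXVALUE));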
Nulls
NULL cannot be specified as a value in a partition bound value_list. An empty string also cannot be specified as a value in a partition bound value_list, because it is treated as NULL within the database server. For the purpose of assigning rows to partitions, Oracle Database sorts nulls greater than all other values except MAXVALUE. Nulls sort less than MAXVALUE. This means that if a table is partitioned on a nullable column, and the column is to contain nulls, then the highest partition should have a partition bound of MAXVALUE for that column. Otherwise, the rows that contain nulls will map above the highest partition in the table and the insert will fail.
DATE Datatypes
If the partition key includes a column that has the DATE datatype and the NLS date format does not specify the century with the year, you must specify partition bounds using the TO_DATE function with a 4-character format mask for the year. Otherwise, you will not be able to create the table or index. For example, with the sales_range table using a DATE column:
CREATE TABLE sales_range
  (salesman_id   NUMBER(5),
   salesman_name VARCHAR2(30),
   sales_amount  NUMBER(10),
   sales_date    DATE)
COMPRESS
PARTITION BY RANGE(sales_date)
  (PARTITION sales_jan2000 VALUES LESS THAN (TO_DATE('01-FEB-2000', 'DD-MON-YYYY')),
   PARTITION sales_feb2000 VALUES LESS THAN (TO_DATE('01-MAR-2000', 'DD-MON-YYYY')),
   PARTITION sales_mar2000 VALUES LESS THAN (TO_DATE('01-APR-2000', 'DD-MON-YYYY')),
   PARTITION sales_apr2000 VALUES LESS THAN (TO_DATE('01-MAY-2000', 'DD-MON-YYYY')));
When you query or modify data, it is recommended that you use the TO_DATE function in the WHERE clause so that the value of the date information can be determined at compile time. However, the optimizer can prune partitions using a selection criterion on partitioning columns of type DATE when you use another format, as in the following examples:
SELECT * FROM sales_range
WHERE sales_date BETWEEN TO_DATE('01-JUL-00', 'DD-MON-YY')
                     AND TO_DATE('01-OCT-00', 'DD-MON-YY');

SELECT * FROM sales_range
WHERE sales_date BETWEEN '01-JUL-2000' AND '01-OCT-2000';
In this case, the date value will be complete only at runtime. Therefore you will not be able to see which partitions Oracle is accessing as is usually shown on the partition_start and partition_stop columns of the EXPLAIN PLAN statement output on the SQL statement. Instead, you will see the keyword KEY for both columns.
Index Partitioning
The rules for partitioning indexes are similar to those for tables:
- You can mix partitioned and nonpartitioned indexes with partitioned and nonpartitioned tables: a partitioned table can have partitioned or nonpartitioned indexes, and a nonpartitioned table can have partitioned or nonpartitioned B-tree indexes.
- Bitmap indexes on nonpartitioned tables cannot be partitioned.
- A bitmap index on a partitioned table must be a local index.
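For example, a bitmap index on a partitioned table is created with the LOCAL keyword (a sketch, assuming a partitioned sales table with a cust_id column):

CREATE BITMAP INDEX sales_cust_bix
  ON sales (cust_id)
  LOCAL;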
However, partitioned indexes are more complicated than partitioned tables because there are three types of partitioned indexes:
- Local prefixed indexes
- Local nonprefixed indexes
- Global prefixed indexes
These types are described in the following sections. Oracle supports all three types.
Local Partitioned Indexes

In a local index, all keys in a particular index partition refer only to rows stored in a single underlying table partition, so a local index is equipartitioned with the underlying table. Local indexes offer the following advantages:
- Only one index partition needs to be rebuilt when a maintenance operation other than SPLIT PARTITION or ADD PARTITION is performed on an underlying table partition.
- The duration of a partition maintenance operation remains proportional to partition size if the partitioned table has only local indexes.
- Local indexes support partition independence.
- Local indexes support smooth roll-out of old data and roll-in of new data in historical tables.
- Oracle can take advantage of the fact that a local index is equipartitioned with the underlying table to generate better query access plans.
- Local indexes simplify the task of tablespace incomplete recovery. In order to recover a partition or subpartition of a table to a point in time, you must also recover the corresponding index entries to the same point in time. The only way to accomplish this is with a local index. Then you can recover the corresponding table and index partitions or subpartitions together.
See Also: PL/SQL Packages and Types Reference for a description of the DBMS_PCLXUTIL package
Local Prefixed Indexes A local index is prefixed if it is partitioned on a left prefix of the index columns. For example, if the sales table and its local index sales_ix are partitioned on the week_num column, then index sales_ix is local prefixed if it is defined on the columns (week_num, xaction_num). On the other hand, if index sales_ix is defined on column product_num then it is not prefixed. Local prefixed indexes can be unique or nonunique. Figure 5-4 illustrates another example of a local prefixed index.
Figure 5-4 Local Prefixed Index
[The figure shows an index whose partitions correspond to DEPTNO ranges 0-9, 10-19, ..., 90-99, matching the DEPTNO range partitions of the underlying table.]
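A sketch of the local prefixed sales_ix index described earlier, assuming sales is partitioned on week_num:

CREATE INDEX sales_ix ON sales (week_num, xaction_num)
  LOCAL;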
Local Nonprefixed Indexes A local index is nonprefixed if it is not partitioned on a left prefix of the index columns. You cannot have a unique local nonprefixed index unless the partitioning key is a subset of the index key. Figure 5-5 illustrates an example of a local nonprefixed index.
Figure 5-5 Local Nonprefixed Index
[The figure shows an index on ACCTNO whose partitions correspond to the CHKDATE range partitions (1/97 through 12/97) of the underlying table; ACCTNO values such as 31, 82, 54, 15, and 35 appear in different index partitions.]
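A sketch of a nonprefixed local index, assuming a checks table range-partitioned on chkdate as in Figure 5-5 (the index name is illustrative):

CREATE INDEX checks_acctno_ix ON checks (acctno)
  LOCAL;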
Global Partitioned Indexes

In a global partitioned index, the keys in a particular index partition may refer to rows stored in more than one underlying table partition. The highest partition of a global index must have a partition bound all of whose values are MAXVALUE. This ensures that all rows in the underlying table can be represented in the index.

Prefixed and Nonprefixed Global Partitioned Indexes A global partitioned index is prefixed if it is partitioned on a left prefix of the index columns. See Figure 5-6 for an example. A global partitioned index is nonprefixed if it is not partitioned on a left prefix of the index columns. Oracle does not support global nonprefixed partitioned indexes. Global prefixed partitioned indexes can be unique or nonunique. Nonpartitioned indexes are treated as global prefixed nonpartitioned indexes.

Management of Global Partitioned Indexes Global partitioned indexes are harder to manage than local indexes:
- When the data in an underlying table partition is moved or removed (SPLIT, MOVE, DROP, or TRUNCATE), all partitions of a global index are affected. Consequently, global indexes do not support partition independence.
- When an underlying table partition or subpartition is recovered to a point in time, all corresponding entries in a global index must be recovered to the same point in time. Because these entries may be scattered across all partitions or subpartitions of the index, mixed in with entries for other partitions or subpartitions that are not being recovered, there is no way to accomplish this except by re-creating the entire global index.
Figure 5-6 Global Prefixed Partitioned Index
[The figure shows an index partitioned by range on EMPNO (0-39, 40-69, and 70-MAXVALUE) defined on a table partitioned by range on DEPTNO (0-9, 10-19, ..., 90-99); entries in a given index partition, for example EMPNO values 73, 82, and 96, can point into any of the DEPTNO table partitions.]
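A sketch of an index partitioned as in Figure 5-6 (table, column, and index names are illustrative):

CREATE INDEX employees_empno_gix ON employees (empno)
  GLOBAL PARTITION BY RANGE (empno)
    (PARTITION p1 VALUES LESS THAN (40),
     PARTITION p2 VALUES LESS THAN (70),
     PARTITION p3 VALUES LESS THAN (MAXVALUE));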
Table 5-2 summarizes the types of partitioned indexes that Oracle supports. If an index is local, it is equipartitioned with the underlying table; otherwise, it is global. A prefixed index is partitioned on a left prefix of the index columns; otherwise, it is nonprefixed.
Table 5-2 Types of Partitioned Indexes

Type of Index: Local Prefixed (any partitioning method)
  Index equipartitioned with table: Yes
  Index partitioned on left prefix of index columns: Yes
  UNIQUE attribute allowed: Yes
  Example: table partitioning key A, index columns (A, B), index partitioning key A

Type of Index: Local Nonprefixed (any partitioning method)
  Index equipartitioned with table: Yes
  Index partitioned on left prefix of index columns: No
  UNIQUE attribute allowed: Yes (Note 1)
  Example: table partitioning key A, index columns (B, A), index partitioning key A

Type of Index: Global Prefixed (range partitioning only)
  Index equipartitioned with table: No (Note 2)
  Index partitioned on left prefix of index columns: Yes
  UNIQUE attribute allowed: Yes
  Example: table partitioning key A, index columns (B), index partitioning key B
Note 1: For a unique local nonprefixed index, the partitioning key must be a subset of the index key. Note 2: Although a global partitioned index may be equipartitioned with the underlying table, Oracle does not take advantage of the partitioning or maintain equipartitioning after partition maintenance operations such as DROP or SPLIT PARTITION.
Of course, if there is also a predicate on the partitioning columns, then multiple index probes might not be necessary. Oracle takes advantage of the fact that a local index is equipartitioned with the underlying table to prune partitions based on the partition key. For example, if the predicate in Figure 5-5 is chkdate < 3/97, Oracle only has to probe two partitions. So for a nonprefixed index, if the partition key is a part of the WHERE clause but not of the index key, then the optimizer determines which index partitions to probe based on the underlying table partition. When many queries and DML statements using keys of local nonprefixed indexes have to probe all index partitions, this effectively reduces the degree of partition independence provided by such indexes.
Table 5-3 Comparing Prefixed Local, Nonprefixed Local, and Global Indexes

Characteristic: Unique possible?
  Prefixed Local: Yes
  Nonprefixed Local: Yes
  Global: Yes. Must be global if using indexes on columns other than the partitioning columns
Characteristic: Manageability
  Prefixed Local: Easy to manage
  Nonprefixed Local: Easy to manage
  Global: Harder to manage
Characteristic: OLTP
  Prefixed Local: Good
  Nonprefixed Local: Bad
  Global: Good
Characteristic: Long-running (DSS)
  Prefixed Local: Good
  Nonprefixed Local: Good
  Global: Not Good
For OLTP applications:
- Global indexes and local prefixed indexes provide better performance than local nonprefixed indexes because they minimize the number of index partition probes.
- Local indexes support more availability when there are partition or subpartition maintenance operations on the table.
- Local nonprefixed indexes are very useful for historical databases.
For DSS applications, local nonprefixed indexes can improve performance because many index partitions can be scanned in parallel by range queries on the index key. For example, a query using the predicate "acctno between 40 and 45" on the table checks of Figure 5-5 causes parallel scans of all the partitions of the nonprefixed index ix3. On the other hand, a query using the predicate "deptno BETWEEN 40 AND 45" on the table deptno of Figure 5-4 cannot be parallelized because it accesses a single partition of the prefixed index ix1.
For historical tables, indexes should be local if possible. This limits the impact of regularly scheduled drop partition operations. Unique indexes on columns other than the partitioning columns must be global because unique local nonprefixed indexes whose key does not contain the partitioning key are not supported.
Values of physical attributes specified (explicitly or by default) for the index are used whenever the value of a corresponding partition attribute is not specified. Handling of the TABLESPACE attribute of partitions of a LOCAL index constitutes an important exception to this rule in that, in the absence of a user-specified TABLESPACE value (at both partition and index levels), that of the corresponding partition of the underlying table is used. Physical attributes (other than TABLESPACE, as explained in the preceding) of partitions of local indexes created in the course of processing ALTER TABLE ADD PARTITION are set to the default physical attributes of each index.
Physical attributes (other than TABLESPACE) of index partitions created by ALTER TABLE SPLIT PARTITION are determined as follows:
- Values of physical attributes of the index partition being split are used.
Physical attributes of an existing index partition can be modified by ALTER INDEX MODIFY PARTITION and ALTER INDEX REBUILD PARTITION. Resulting attributes are determined as follows:
- Values of physical attributes of the partition before the statement was issued are used whenever a new value is not specified. Note that ALTER INDEX REBUILD PARTITION can be used to change the tablespace in which a partition resides.
Physical attributes of global index partitions created by ALTER INDEX SPLIT PARTITION are determined as follows:
- Values of physical attributes of the partition being split are used whenever a new value is not specified.
Physical attributes of all partitions of an index (along with default values) may be modified by ALTER INDEX; for example, ALTER INDEX indexname NOLOGGING changes the logging mode of all partitions of indexname to NOLOGGING.

See Also: Oracle Database Administrator's Guide for more detailed examples of adding partitions and examples of rebuilding indexes
6
Indexes
This chapter describes how to use the following types of indexes in a data warehousing environment:
- Using Bitmap Indexes in Data Warehouses
- Using B-Tree Indexes in Data Warehouses
- Using Index Compression
- Choosing Between Local Indexes and Global Indexes
See Also: Oracle Database Concepts for general information regarding indexing
Note: Bitmap indexes are available only if you have purchased the Oracle Database Enterprise Edition.
Using Bitmap Indexes in Data Warehouses

Bitmap indexes are widely used in data warehousing environments, which typically have large amounts of data and ad hoc queries but a low level of concurrent DML transactions. For such applications, bitmap indexing provides:
- Reduced response time for large classes of ad hoc queries.
- Reduced storage requirements compared to other indexing techniques.
- Dramatic performance gains even on hardware with a relatively small number of CPUs or a small amount of memory.
- Efficient maintenance during parallel DML and loads.
Fully indexing a large table with a traditional B-tree index can be prohibitively expensive in terms of space because the indexes can be several times larger than the data in the table. Bitmap indexes are typically only a fraction of the size of the indexed data in the table. An index provides pointers to the rows in a table that contain a given key value. A regular index stores a list of rowids for each key corresponding to the rows with that key value. In a bitmap index, a bitmap for each key value replaces a list of rowids. Each bit in the bitmap corresponds to a possible rowid, and if the bit is set, it means that the row with the corresponding rowid contains the key value. A mapping function converts the bit position to an actual rowid, so that the bitmap index provides the same functionality as a regular index. Bitmap indexes store the bitmaps in a compressed way. If the number of distinct key values is small, bitmap indexes compress better and the space saving benefit compared to a B-tree index becomes even better. Bitmap indexes are most effective for queries that contain multiple conditions in the WHERE clause. Rows that satisfy some, but not all, conditions are filtered out before the table itself is accessed. This improves response time, often dramatically. If you are unsure of which indexes to create, the SQLAccess Advisor can generate recommendations on what to create.
Parallel query and parallel DML work with bitmap indexes. Bitmap indexing also supports parallel create indexes and concatenated indexes. Bitmap indexes are required to take advantage of Oracle's star transformation capabilities.
See Also: Chapter 19, "Schema Modeling Techniques" for further information about using bitmap indexes in data warehousing environments
Cardinality
The advantages of using bitmap indexes are greatest for columns in which the ratio of the number of distinct values to the number of rows in the table is small. We refer to this ratio as the degree of cardinality. A gender column, which has only two distinct values (male and female), is optimal for a bitmap index. However, data warehouse administrators also build bitmap indexes on columns with higher cardinalities. For example, on a table with one million rows, a column with 10,000 distinct values is a candidate for a bitmap index. A bitmap index on this column can outperform a B-tree index, particularly when this column is often queried in conjunction with other indexed columns. In fact, in a typical data warehouse environment, a bitmap index can be considered for any non-unique column.

B-tree indexes are most effective for high-cardinality data: that is, for data with many possible values, such as customer_name or phone_number. In a data warehouse, B-tree indexes should be used only for unique columns or other columns with very high cardinalities (that is, columns that are almost unique). The majority of indexes in a data warehouse should be bitmap indexes. In ad hoc queries and similar situations, bitmap indexes can dramatically improve query performance. AND and OR conditions in the WHERE clause of a query can be resolved quickly by performing the corresponding Boolean operations directly on the bitmaps before converting the resulting bitmap to rowids. If the resulting number of rows is small, the query can be answered quickly without resorting to a full table scan.
Example 6-1 Bitmap Index
C CUST_MARITAL_STATUS  CUST_INCOME_LEVEL
- -------------------- ---------------------
F                      D: 70,000 - 89,999
F                      H: 150,000 - 169,999
M                      H: 150,000 - 169,999
F                      I: 170,000 - 189,999
F                      C: 50,000 - 69,999
M                      F: 110,000 - 129,999
M                      J: 190,000 - 249,999
M                      G: 130,000 - 149,999
Because cust_gender, cust_marital_status, and cust_income_level are all low-cardinality columns (there are only three possible values for marital status and region, two possible values for gender, and 12 for income level), bitmap indexes are ideal for these columns. Do not create a bitmap index on cust_id because this is a unique column. Instead, a unique B-tree index on this column provides the most efficient representation and retrieval. Table 6-1 illustrates the bitmap index for the cust_gender column in this example. It consists of two separate bitmaps, one for each gender.
Table 6-1 Sample Bitmap Index

              gender='M'   gender='F'
cust_id 70        0            1
cust_id 80        0            1
cust_id 90        1            0
cust_id 100       0            1
cust_id 110       0            1
cust_id 120       1            0
cust_id 130       1            0
cust_id 140       1            0
Each entry (or bit) in the bitmap corresponds to a single row of the customers table. The value of each bit depends upon the values of the corresponding row in the table. For example, the bitmap cust_gender='F' contains a one as its first bit because the gender is F in the first row of the customers table. The bitmap cust_gender='F' has a zero for its third bit because the gender of the third row is not F.
An analyst investigating demographic trends of the company's customers might ask, "How many of our married customers have an income level of G or H?" This corresponds to the following SQL query:
SELECT COUNT(*) FROM customers WHERE cust_marital_status = 'married' AND cust_income_level IN ('H: 150,000 - 169,999', 'G: 130,000 - 149,999');
Bitmap indexes can efficiently process this query by merely counting the number of ones in the bitmap illustrated in Figure 6-1. The result set will be found by using bitmap or merge operations without the necessity of a conversion to rowids. To identify additional specific customer attributes that satisfy the criteria, use the resulting bitmap to access the table after a bitmap to rowid conversion.
Figure 6-1 Executing a Query Using Bitmap Indexes

status = 'married'    0 1 1 0 0 1
region = 'central'    0 1 0 0 1 1
region = 'west'       0 0 1 1 0 0

region = 'central' OR region = 'west'                           0 1 1 1 1 1
status = 'married' AND (region = 'central' OR region = 'west')  0 1 1 0 0 1
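For reference, bitmap indexes like the ones assumed throughout this example are created with ordinary CREATE BITMAP INDEX statements. The following is a minimal sketch only; the index names are hypothetical and not taken from this guide:

CREATE BITMAP INDEX customers_marital_bix
ON customers (cust_marital_status);

CREATE BITMAP INDEX customers_income_bix
ON customers (cust_income_level);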
This query uses a bitmap index on cust_marital_status. Note that this query would not be able to use a B-tree index, because B-tree indexes do not store the NULL values.
SELECT COUNT(*) FROM customers;
Any bitmap index can be used for this query because all table rows are indexed, including those that have NULL data. If nulls were not indexed, the optimizer would be able to use indexes only on columns with NOT NULL constraints.
Example 6-3 Bitmap Join Index: One Dimension Table Column Joins One Fact Table
Unlike the example in "Bitmap Index" on page 6-3, where a bitmap index on the cust_gender column on the customers table was built, we now create a bitmap join index on the fact table sales for the joined column customers(cust_gender). Table sales stores cust_id values only:
SELECT time_id, cust_id, amount_sold FROM sales;

TIME_ID     CUST_ID AMOUNT_SOLD
--------- --------- -----------
01-JAN-98     29700        2291
01-JAN-98      3380         114
01-JAN-98     67830         553
01-JAN-98    179330           0
01-JAN-98    127520         195
01-JAN-98     33030         280
...
To create such a bitmap join index, column customers(cust_gender) has to be joined with table sales. The join condition is specified as part of the CREATE statement for the bitmap join index as follows:
CREATE BITMAP INDEX sales_cust_gender_bjix
ON sales(customers.cust_gender)
FROM sales, customers
WHERE sales.cust_id = customers.cust_id
LOCAL;
The following query illustrates the join result that is used to create the bitmaps that are stored in the bitmap join index:
SELECT sales.time_id, customers.cust_gender, sales.amount_sold
FROM sales, customers
WHERE sales.cust_id = customers.cust_id;

TIME_ID   C AMOUNT_SOLD
--------- - -----------
01-JAN-98 M        2291
01-JAN-98 F         114
01-JAN-98 M         553
01-JAN-98 M           0
01-JAN-98 M         195
01-JAN-98 M         280
01-JAN-98 M          32
...
Table 6-2 illustrates the bitmap representation for the bitmap join index in this example.
Table 6-2 Sample Bitmap Join Index

                 cust_gender='M'   cust_gender='F'
sales record 1          1                 0
sales record 2          0                 1
sales record 3          1                 0
sales record 4          1                 0
sales record 5          1                 0
sales record 6          1                 0
sales record 7          1                 0
You can create other bitmap join indexes using more than one column or more than one table, as shown in these examples.
Example 6-4 Bitmap Join Index: Multiple Dimension Columns Join One Fact Table
You can create a bitmap join index on more than one column from a single dimension table, as in the following example, which uses customers(cust_gender, cust_marital_status) from the sh schema:
CREATE BITMAP INDEX sales_cust_gender_ms_bjix
ON sales(customers.cust_gender, customers.cust_marital_status)
FROM sales, customers
WHERE sales.cust_id = customers.cust_id
LOCAL NOLOGGING;

Example 6-5 Bitmap Join Index: Multiple Dimension Tables Join One Fact Table
You can create a bitmap join index on multiple dimension tables, as in the following, which uses customers(cust_gender) and products(prod_category):
CREATE BITMAP INDEX sales_c_gender_p_cat_bjix
ON sales(customers.cust_gender, products.prod_category)
FROM sales, customers, products
WHERE sales.cust_id = customers.cust_id
AND sales.prod_id = products.prod_id
LOCAL NOLOGGING;
Example 6-6
You can create a bitmap join index on more than one table, in which the indexed column is joined to the indexed table by using another table. For example, we can build an index on countries.country_name, even though the countries table is not joined directly to the sales table. Instead, the countries table is joined to the customers table, which is joined to the sales table. This type of schema is commonly called a snowflake schema.
CREATE BITMAP INDEX sales_co_country_name_bjix
ON sales(countries.country_name)
FROM sales, customers, countries
WHERE sales.cust_id = customers.cust_id
AND customers.country_id = countries.country_id
LOCAL NOLOGGING COMPUTE STATISTICS;
The following restrictions apply to bitmap join indexes:

- Parallel DML is currently only supported on the fact table. Parallel DML on one of the participating dimension tables will mark the index as unusable.
- Only one table can be updated concurrently by different transactions when using the bitmap join index.
- No table can appear twice in the join.
- You cannot create a bitmap join index on an index-organized table or a temporary table.
- The columns in the index must all be columns of the dimension tables.
- The dimension table columns participating in the join with the fact table must be either primary key columns or have unique constraints.
- If a dimension table has a composite primary key, each column in the primary key must be part of the join.
- The current restrictions for creating a regular bitmap index also apply to a bitmap join index. For example, you cannot create a bitmap index with the UNIQUE attribute. See Oracle Database SQL Reference for other restrictions.
choose to compress four columns, the repetitiveness will be almost gone, and the compression ratio will be worse. Although key compression reduces the storage requirements of an index, it can increase the CPU time required to reconstruct the key column values during an index scan. It also incurs some additional storage overhead, because every prefix entry has an overhead of four bytes associated with it.
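As an illustration only (the index name and column order here are hypothetical, not taken from this guide), key compression is requested with the COMPRESS clause of CREATE INDEX, where the integer gives the number of leading key columns to compress:

CREATE INDEX sales_cust_time_ix
ON sales (cust_id, time_id, prod_id)
COMPRESS 2;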
7
Integrity Constraints
This chapter describes integrity constraints, and discusses:
- Why Integrity Constraints are Useful in a Data Warehouse
- Overview of Constraint States
- Typical Data Warehouse Integrity Constraints
- FOREIGN KEY constraints: To ensure that two keys share a primary key to foreign key relationship
- Data cleanliness: Constraints verify that the data in the data warehouse conforms to a basic level of data consistency and correctness, preventing the introduction of dirty data.
- Query optimization: The Oracle Database utilizes constraints when optimizing SQL queries. Although constraints can be useful in many aspects of query optimization, constraints are particularly important for query rewrite of materialized views.
Unlike data in many relational database environments, data in a data warehouse is typically added or modified under controlled circumstances during the extraction, transformation, and loading (ETL) process. Multiple users normally do not update the data warehouse directly, as they do in an OLTP system.
See Also: Chapter 11, "Overview of Extraction, Transformation, and Loading"

Many significant constraint features have been introduced for data warehousing. Readers familiar with Oracle's constraint functionality in Oracle database version 7 and Oracle database version 8.x should take special note of the functionality described in this chapter. In fact, many Oracle database version 7-based and version 8-based data warehouses lacked constraints because of concerns about constraint performance. Newer constraint functionality addresses these concerns.
Enforcement: In order to use a constraint for enforcement, the constraint must be in the ENABLE state. An enabled constraint ensures that all data modifications upon a given table (or tables) satisfy the conditions of the constraints. Data modification operations which produce data that violates the constraint fail with a constraint violation error.
Validation: To use a constraint for validation, the constraint must be in the VALIDATE state. If the constraint is validated, then all data that currently resides in the table satisfies the constraint. Note that validation is independent of enforcement. Although the typical constraint in an operational system is both enabled and validated, any constraint could be validated but not enabled or vice versa (enabled but not validated). These latter two cases are useful for data warehouses.
Belief: In some cases, you will know that the conditions for a given constraint are true, so you do not need to validate or enforce the constraint. However, you may wish for the constraint to be present anyway to improve query optimization and performance. When you use a constraint in this way, it is called a belief or RELY constraint, and the constraint must be in the RELY state. The RELY state provides you with a mechanism for telling Oracle that a given constraint is believed to be true. Note that the RELY state only affects constraints that have not been validated.
- RELY Constraints
- Integrity Constraints and Parallelism
- Integrity Constraints and Partitioning
- View Constraints
By default, this constraint is both enabled and validated. Oracle implicitly creates a unique index on sales_id to support this constraint. However, this index can be problematic in a data warehouse for three reasons:
- The unique index can be very large, because the sales table can easily have millions or even billions of rows.
- The unique index is rarely used for query execution. Most data warehousing queries do not have predicates on unique keys, so creating this index will probably not improve performance.
- If sales is partitioned along a column other than sales_id, the unique index must be global. This can detrimentally affect all maintenance operations on the sales table.
A unique index is required for unique constraints to ensure that each individual row modied in the sales table satises the UNIQUE constraint. For data warehousing tables, an alternative mechanism for unique constraints is illustrated in the following statement:
ALTER TABLE sales ADD CONSTRAINT sales_uk UNIQUE (prod_id, cust_id, promo_id, channel_id, time_id) DISABLE VALIDATE;
This statement creates a unique constraint, but, because the constraint is disabled, a unique index is not required. This approach can be advantageous for many data warehousing environments because the constraint now ensures uniqueness without the cost of a unique index. However, there are trade-offs for the data warehouse administrator to consider with DISABLE VALIDATE constraints. Because this constraint is disabled, no DML statements that modify the unique column are permitted against the sales table. You can use one of two strategies for modifying this table in the presence of a constraint:
- Use DDL to add data to this table (such as exchanging partitions). See the example in Chapter 15, "Maintaining the Data Warehouse".
- Before modifying this table, drop the constraint. Then, make all necessary data modifications. Finally, re-create the disabled constraint. Re-creating the constraint is more efficient than re-creating an enabled constraint. However, this approach does not guarantee that data added to the sales table while the constraint has been dropped is unique. A sketch of this drop and re-create cycle follows this list.
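The following is a minimal sketch of the second strategy, reusing the sales_uk constraint definition shown earlier; the comment marks where your data modifications would run:

ALTER TABLE sales DROP CONSTRAINT sales_uk;

-- ... perform the necessary data modifications here ...

ALTER TABLE sales ADD CONSTRAINT sales_uk
UNIQUE (prod_id, cust_id, promo_id, channel_id, time_id) DISABLE VALIDATE;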
However, in some situations, you may choose to use a different state for the FOREIGN KEY constraints, in particular, the ENABLE NOVALIDATE state. A data warehouse administrator might use an ENABLE NOVALIDATE constraint when either:
- The tables contain data that currently disobeys the constraint, but the data warehouse administrator wishes to create a constraint for future enforcement.
- An enforced constraint is required immediately.
Suppose that the data warehouse loaded new data into the fact tables every day, but refreshed the dimension tables only on the weekend. During the week, the dimension tables and fact tables may in fact disobey the FOREIGN KEY constraints. Nevertheless, the data warehouse administrator might wish to maintain the enforcement of this constraint to prevent any changes that might affect the
FOREIGN KEY constraint outside of the ETL process. Thus, you can create the FOREIGN KEY constraints every night, after performing the ETL process, as shown here:
ALTER TABLE sales ADD CONSTRAINT sales_time_fk FOREIGN KEY (time_id) REFERENCES times (time_id) ENABLE NOVALIDATE;
ENABLE NOVALIDATE can quickly create an enforced constraint, even when the constraint is believed to be true. Suppose that the ETL process verifies that a FOREIGN KEY constraint is true. Rather than have the database re-verify this FOREIGN KEY constraint, which would require time and database resources, the data warehouse administrator could instead create a FOREIGN KEY constraint using ENABLE NOVALIDATE.
RELY Constraints
The ETL process commonly verifies that certain constraints are true. For example, it can validate all of the foreign keys in the data coming into the fact table. This means that you can trust it to provide clean data, instead of implementing constraints in the data warehouse. You create a RELY constraint as follows:
ALTER TABLE sales ADD CONSTRAINT sales_time_fk FOREIGN KEY (time_id) REFERENCES times (time_id) RELY DISABLE NOVALIDATE;
This statement assumes that the primary key is in the RELY state. RELY constraints, even though they are not used for data validation, can:
- Enable more sophisticated query rewrites for materialized views. See Chapter 18, "Query Rewrite" for further details.
- Enable other data warehousing tools to retrieve information regarding constraints directly from the Oracle data dictionary.
Creating a RELY constraint is inexpensive and does not impose any overhead during DML or load. Because the constraint is not being validated, no data processing is necessary to create it.
The degree of parallelism for a given constraint operation is determined by the default degree of parallelism of the underlying table.
View Constraints
You can create constraints on views. The only type of constraint supported on a view is a RELY constraint. This type of constraint is useful when queries typically access views instead of base tables, and the database administrator thus needs to define the data relationships between views rather than tables. View constraints are particularly useful in OLAP environments, where they may enable more sophisticated rewrites for materialized views.
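As a rough sketch only (the view name, constraint name, and column list here are hypothetical, not taken from this guide), a view constraint is declared in the DISABLE NOVALIDATE state and can carry the RELY flag:

ALTER VIEW monthly_sales_v
ADD CONSTRAINT monthly_sales_v_pk
PRIMARY KEY (calendar_month_desc, prod_id)
RELY DISABLE NOVALIDATE;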
See Also: Chapter 8, "Basic Materialized Views" and Chapter 18, "Query Rewrite"
8
Basic Materialized Views
This chapter introduces you to the use of materialized views, and discusses:
- Overview of Data Warehousing with Materialized Views
- Types of Materialized Views
- Creating Materialized Views
- Registering Existing Materialized Views
- Choosing Indexes for Materialized Views
- Dropping Materialized Views
- Analyzing Materialized View Capabilities
In data warehouses, materialized views are often referred to as summaries, because they store summarized data. They can also be used to precompute joins with or without aggregations. A materialized view eliminates the overhead associated with expensive joins and aggregations for a large or important class of queries.
Figure 8-1 Query results strategy: compare plan cost and pick the best.
When using query rewrite, create materialized views that satisfy the largest number of queries. For example, if you identify 20 queries that are commonly applied to the detail or fact tables, then you might be able to satisfy them with five or six well-written materialized views. A materialized view definition can include any number of aggregations (SUM, COUNT(x), COUNT(*), COUNT(DISTINCT x), AVG, VARIANCE, STDDEV, MIN, and MAX). It can also include any number of joins. If you are unsure of which materialized views to create, Oracle provides the SQLAccess Advisor, which is a set of advisory procedures in the DBMS_ADVISOR package to help in designing and evaluating materialized views for query rewrite. See Chapter 17, "SQLAccess Advisor" for further details.

If a materialized view is to be used by query rewrite, it must be stored in the same database as the detail tables on which it relies. A materialized view can be partitioned, and you can define a materialized view on a partitioned table. You can also define one or more indexes on the materialized view. Unlike indexes, materialized views can be accessed directly using a SELECT statement. However, it is recommended that you try to avoid writing SQL statements that directly reference the materialized view, because then it is difficult to change them without affecting the application. Instead, let query rewrite transparently rewrite your query to use the materialized view.
Note that the techniques shown in this chapter illustrate how to use materialized views in data warehouses. Materialized views can also be used by Oracle Replication. See Oracle Database Advanced Replication for further information.
Summary management consists of:

- Mechanisms to define materialized views and dimensions.
- A refresh mechanism to ensure that all materialized views contain the latest data.
- A query rewrite capability to transparently rewrite a query to use a materialized view.
- The SQLAccess Advisor, which recommends materialized views and indexes to create. See Chapter 17, "SQLAccess Advisor" for more information.
- TUNE_MVIEW, which shows you how to make your materialized view fast refreshable and use general query rewrite.
The use of summary management features imposes no schema restrictions, and can enable some existing DSS database applications to improve performance without the need to redesign the database or the application. Figure 8-2 illustrates the use of summary management in the warehousing cycle. After the data has been transformed, staged, and loaded into the detail data in the warehouse, you can invoke the summary management process. First, use the SQLAccess Advisor to plan how you will use materialized views. Then, create materialized views and design how queries will be rewritten. If you are having problems trying to get your materialized views to work, then use TUNE_MVIEW to obtain an optimized materialized view.
Figure 8-2 Summary management in the warehousing cycle (elements shown: operational databases, the data warehouse, summary management, query rewrite, and workload statistics).
Understanding the summary management process during the earliest stages of data warehouse design can yield large dividends later in the form of higher performance, lower summary administration costs, and reduced storage requirements.
Dimension tables describe the business entities of an enterprise, represented as hierarchical, categorical information such as time, departments, locations, and products. Dimension tables are sometimes called lookup or reference tables. Dimension tables usually change slowly over time and are not modified on a periodic schedule. They are used in long-running decision support queries to aggregate the data returned from the query into appropriate levels of the dimension hierarchy.
Hierarchies describe the business relationships and common access patterns in the database. An analysis of the dimensions, combined with an understanding of the typical work load, can be used to create materialized views. See Chapter 10, "Dimensions" for more information.

Fact tables describe the business transactions of an enterprise. The vast majority of data in a data warehouse is stored in a few very large fact tables that are updated periodically with data from one or more operational OLTP databases. Fact tables include facts (also called measures) such as sales, units, and inventory. A simple measure is a numeric or character column of one table such as fact.sales. A computed measure is an expression involving measures of one table, for example, fact.revenues - fact.expenses. A multitable measure is a computed measure defined on multiple tables, for example, fact_a.revenues - fact_b.expenses.
Fact tables also contain one or more foreign keys that organize the business transactions by the relevant business entities such as time, product, and market. In most cases, these foreign keys are non-null, form a unique compound key of the fact table, and each foreign key joins with exactly one row of a dimension table.
A materialized view is a precomputed table comprising aggregated and joined data from fact and possibly from dimension tables. Among builders of data warehouses, a materialized view is also known as a summary.
The purpose of a materialized view is to increase query execution performance. The existence of a materialized view is transparent to SQL applications, so that a database administrator can create or drop materialized views at any time without affecting the validity of SQL applications. A materialized view consumes storage space. The contents of the materialized view must be updated when the underlying detail tables are modied.
Guideline 3 Dimensions
If you are concerned with the time required to enable constraints and whether any constraints might be violated, then use ENABLE NOVALIDATE with the RELY clause to turn on constraint checking without validating any of the existing constraints. The risk with this approach is that incorrect query results could occur if any constraints are broken. Therefore, as the designer, you must determine how clean the data is and whether the risk of incorrect results is too great.
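For instance, a minimal sketch using the sales_time_fk constraint named elsewhere in this guide (which constraint to modify depends on your schema):

ALTER TABLE sales
MODIFY CONSTRAINT sales_time_fk RELY ENABLE NOVALIDATE;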
1. Data is first loaded into a temporary table in the warehouse.
2. Quality assurance procedures are applied to the data.
3. Referential integrity constraints on the target table are disabled, and the local index in the target partition is marked unusable.
4. The data is copied from the temporary area into the appropriate partition of the target table using INSERT AS SELECT with the PARALLEL or APPEND hint. The temporary table is then dropped. Alternatively, if the target table is partitioned, you can create a new (empty) partition in the target table and use ALTER TABLE ... EXCHANGE PARTITION to incorporate the temporary table into the target table. See Oracle Database SQL Reference for more information. (See the sketch after this list.)
5. The constraints are enabled, usually with the NOVALIDATE option.
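The following is a minimal sketch of steps 4 and 5; the staging table, partition, and constraint names are hypothetical and would be replaced with your own:

-- Step 4, option 1: copy the staged rows into the target partition
INSERT /*+ APPEND */ INTO sales PARTITION (sales_q4_2003)
SELECT * FROM sales_stage;

-- Step 4, option 2: swap the staging table in as a partition
ALTER TABLE sales
EXCHANGE PARTITION sales_q4_2003 WITH TABLE sales_stage
INCLUDING INDEXES WITHOUT VALIDATION;

-- Step 5: re-enable the constraint without rechecking existing rows
ALTER TABLE sales ENABLE NOVALIDATE CONSTRAINT sales_time_fk;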
Immediately after loading the detail data and updating the indexes on the detail data, the database can be opened for operation, if desired. You can disable query rewrite at the system level by issuing an ALTER SYSTEM SET QUERY_REWRITE_ENABLED = FALSE statement until all the materialized views are refreshed. If QUERY_REWRITE_INTEGRITY is set to STALE_TOLERATED, access to the materialized view can be allowed at the session level to any users who do not require the materialized views to reflect the data from the latest load by issuing an
ALTER SESSION SET QUERY_REWRITE_ENABLED=TRUE statement. This scenario does not apply when QUERY_REWRITE_INTEGRITY is either ENFORCED or TRUSTED because the system ensures in these modes that only materialized views with updated data participate in a query rewrite.
Typical materialized view management activities include:

- Identifying what materialized views to create initially.
- Indexing the materialized views.
- Ensuring that all materialized views and materialized view indexes are refreshed properly each time the database is updated.
- Checking which materialized views have been used.
- Determining how effective each materialized view has been on workload performance.
- Measuring the space being used by materialized views.
- Determining which new materialized views should be created.
- Determining which existing materialized views should be dropped.
- Archiving old detail and materialized view data that is no longer useful.
After the initial effort of creating and populating the data warehouse or data mart, the major administration overhead is the update process, which involves:
- Periodic extraction of incremental changes from the operational systems.
- Transforming the data.
- Verifying that the incremental changes are correct, consistent, and complete.
- Bulk-loading the data into the warehouse.
- Refreshing indexes and materialized views so that they are consistent with the detail data.
The update process must generally be performed within a limited period of time known as the update window. The update window depends on the update
frequency (such as daily or weekly) and the nature of the business. For a daily update frequency, an update window of two to six hours might be typical. You need to know your update window for the following activities:
- Loading the detail data
- Updating or rebuilding the indexes on the detail data
- Performing quality assurance tests on the data
- Refreshing the materialized views
- Updating the indexes on the materialized views
The types of materialized views are:

- Materialized Views with Aggregates
- Materialized Views Containing Only Joins
- Nested Materialized Views
Fast refresh for a materialized view containing joins and aggregates is possible after any type of DML to the base tables (direct load or conventional INSERT, UPDATE, or DELETE). It can be defined to be refreshed ON COMMIT or ON DEMAND. A REFRESH ON COMMIT materialized view will be refreshed automatically when a transaction that does DML to one of the materialized view's detail tables commits. The time taken to complete the commit may be slightly longer than usual when this method is chosen. This is because the refresh operation is performed as part of the commit process. Therefore, this method may not be suitable if many users are concurrently changing the tables upon which the materialized view is based. Here are some examples of materialized views with aggregates. Note that the materialized view logs are created only because these materialized views will be fast refreshed.
Example 8-1 Creating a Materialized View
CREATE MATERIALIZED VIEW LOG ON products
WITH SEQUENCE, ROWID
(prod_id, prod_name, prod_desc, prod_subcategory, prod_subcategory_desc,
 prod_category, prod_category_desc, prod_weight_class, prod_unit_of_measure,
 prod_pack_size, supplier_id, prod_status, prod_list_price, prod_min_price)
INCLUDING NEW VALUES;

CREATE MATERIALIZED VIEW LOG ON sales
WITH SEQUENCE, ROWID
(prod_id, cust_id, time_id, channel_id, promo_id, quantity_sold, amount_sold)
INCLUDING NEW VALUES;

CREATE MATERIALIZED VIEW product_sales_mv
PCTFREE 0 TABLESPACE demo
STORAGE (INITIAL 8k NEXT 8k PCTINCREASE 0)
BUILD IMMEDIATE
REFRESH FAST
ENABLE QUERY REWRITE
AS SELECT p.prod_name, SUM(s.amount_sold) AS dollar_sales,
   COUNT(*) AS cnt, COUNT(s.amount_sold) AS cnt_amt
FROM sales s, products p
WHERE s.prod_id = p.prod_id
GROUP BY p.prod_name;
This example creates a materialized view product_sales_mv that computes total number and value of sales for a product. It is derived by joining the tables sales and products on the column prod_id. The materialized view is populated with data immediately because the build method is immediate and it is available for use by query rewrite. In this example, the default refresh method is FAST, which is
allowed because the appropriate materialized view logs have been created on the tables products and sales.
Example 8-2 Creating a Materialized View
CREATE MATERIALIZED VIEW product_sales_mv
PCTFREE 0 TABLESPACE demo
STORAGE (INITIAL 16k NEXT 16k PCTINCREASE 0)
BUILD DEFERRED
REFRESH COMPLETE ON DEMAND
ENABLE QUERY REWRITE
AS SELECT p.prod_name, SUM(s.amount_sold) AS dollar_sales
FROM sales s, products p
WHERE s.prod_id = p.prod_id
GROUP BY p.prod_name;
This example creates a materialized view product_sales_mv that computes the sum of sales by prod_name. It is derived by joining the tables sales and products on the column prod_id. The materialized view does not initially contain any data, because the build method is DEFERRED. A complete refresh is required for the first refresh of a build deferred materialized view. Once it has been refreshed and populated, this materialized view can be used by query rewrite.
Example 8-3 Creating a Materialized View
CREATE MATERIALIZED VIEW LOG ON sales
WITH SEQUENCE, ROWID
(prod_id, cust_id, time_id, channel_id, promo_id, quantity_sold, amount_sold)
INCLUDING NEW VALUES;

CREATE MATERIALIZED VIEW sum_sales
PARALLEL
BUILD IMMEDIATE
REFRESH FAST ON COMMIT
AS SELECT s.prod_id, s.time_id, COUNT(*) AS count_grp,
   SUM(s.amount_sold) AS sum_dollar_sales,
   COUNT(s.amount_sold) AS count_dollar_sales,
   SUM(s.quantity_sold) AS sum_quantity_sales,
   COUNT(s.quantity_sold) AS count_quantity_sales
FROM sales s
GROUP BY s.prod_id, s.time_id;
This example creates a materialized view that contains aggregates on a single table. Because the materialized view log has been created with all referenced columns in the materialized view's defining query, the materialized view is fast refreshable. If
DML is applied against the sales table, then the changes will be reected in the materialized view when the commit is issued.
Table 8-2 Requirements for Materialized Views with Aggregates

If aggregate X is present, aggregate Y is required and aggregate Z is optional.

X                 Y                         Z
COUNT(expr)       -                         -
SUM(expr)         COUNT(expr)               -
AVG(expr)         COUNT(expr)               SUM(expr)
STDDEV(expr)      COUNT(expr), SUM(expr)    SUM(expr * expr)
VARIANCE(expr)    COUNT(expr), SUM(expr)    SUM(expr * expr)
Note that COUNT(*) must always be present to guarantee all types of fast refresh. Otherwise, you may be limited to fast refresh after inserts only. Oracle recommends that you include the optional aggregates in column Z in the materialized view in order to obtain the most efficient and accurate fast refresh of the aggregates.
If you specify REFRESH FAST, Oracle performs further verification of the query definition to ensure that fast refresh can be performed if any of the detail tables change. These additional checks are:
- A materialized view log must be present for each detail table and the ROWID column must be present in each materialized view log.
- The rowids of all the detail tables must appear in the SELECT list of the materialized view query definition.
- If there are no outer joins, you may have arbitrary selections and joins in the WHERE clause. However, if there are outer joins, the WHERE clause cannot have any selections. Further, if there are outer joins, all the joins must be connected by ANDs and must use the equality (=) operator.
- If there are outer joins, unique constraints must exist on the join columns of the inner table. For example, if you are joining the fact table and a dimension table and the join is an outer join with the fact table being the outer table, there must exist unique constraints on the join columns of the dimension table.
If some of these restrictions are not met, you can create the materialized view as REFRESH FORCE to take advantage of fast refresh when it is possible. If one of the tables did not meet all of the criteria, but the other tables did, the materialized view would still be fast refreshable with respect to the other tables for which all the criteria are met.
CREATE MATERIALIZED VIEW LOG ON sales WITH ROWID;
CREATE MATERIALIZED VIEW LOG ON times WITH ROWID;
CREATE MATERIALIZED VIEW LOG ON customers WITH ROWID;
CREATE MATERIALIZED VIEW detail_sales_mv
PARALLEL
BUILD IMMEDIATE
REFRESH FAST
AS SELECT s.rowid "sales_rid", t.rowid "times_rid", c.rowid "customers_rid",
   c.cust_id, c.cust_last_name, s.amount_sold, s.quantity_sold, s.time_id
FROM sales s, times t, customers c
WHERE s.cust_id = c.cust_id(+) AND s.time_id = t.time_id(+);
In this example, to perform a fast refresh, UNIQUE constraints should exist on c.cust_id and t.time_id. You should also create indexes on the columns sales_rid, times_rid, and customers_rid, as illustrated in the following. This will improve the refresh performance.
CREATE INDEX mv_ix_salesrid ON detail_sales_mv("sales_rid");
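The indexes on the two remaining rowid columns follow the same pattern; the index names shown here are hypothetical:

CREATE INDEX mv_ix_timesrid ON detail_sales_mv("times_rid");
CREATE INDEX mv_ix_customersrid ON detail_sales_mv("customers_rid");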
Alternatively, if the previous example did not include the columns times_rid and customers_rid, and if the refresh method was REFRESH FORCE, then this materialized view would be fast refreshable only if the sales table was updated but not if the tables times or customers were updated.
CREATE MATERIALIZED VIEW detail_sales_mv
PARALLEL
BUILD IMMEDIATE
REFRESH FORCE
AS SELECT s.rowid "sales_rid", c.cust_id, c.cust_last_name, s.amount_sold,
   s.quantity_sold, s.time_id
FROM sales s, times t, customers c
WHERE s.cust_id = c.cust_id(+) AND s.time_id = t.time_id(+);
In addition, optimizations can be performed for this class of single-table aggregate materialized view and thus refresh is very efcient.
Example 8-5 Nested Materialized View
You can create a nested materialized view on materialized views that contain joins only or joins and aggregates. All the underlying objects (materialized views or tables) on which the materialized view is defined must have a materialized view log. All the underlying objects are treated as if they were tables. In addition, you can use all the existing options for materialized views. Using the tables and their columns from the sh sample schema, the following materialized views illustrate how nested materialized views can be created.
CREATE MATERIALIZED VIEW LOG ON sales WITH ROWID;
CREATE MATERIALIZED VIEW LOG ON customers WITH ROWID;
CREATE MATERIALIZED VIEW LOG ON times WITH ROWID;

/* create materialized view join_sales_cust_time as fast refreshable at COMMIT time */
CREATE MATERIALIZED VIEW join_sales_cust_time
REFRESH FAST ON COMMIT
AS SELECT c.cust_id, c.cust_last_name, s.amount_sold, t.time_id,
   t.day_number_in_week, s.rowid srid, t.rowid trid, c.rowid crid
FROM sales s, customers c, times t
WHERE s.time_id = t.time_id AND s.cust_id = c.cust_id;
To create a nested materialized view on the table join_sales_cust_time, you would have to create a materialized view log on the table. Because this will be a single-table aggregate materialized view on join_sales_cust_time, you need to log all the necessary columns and use the INCLUDING NEW VALUES clause.
/* create materialized view log on join_sales_cust_time */
CREATE MATERIALIZED VIEW LOG ON join_sales_cust_time
WITH ROWID (cust_last_name, day_number_in_week, amount_sold)
INCLUDING NEW VALUES;

/* create the single-table aggregate materialized view sum_sales_cust_time
   on join_sales_cust_time as fast refreshable at COMMIT time */
CREATE MATERIALIZED VIEW sum_sales_cust_time
REFRESH FAST ON COMMIT
AS SELECT COUNT(*) cnt_all, SUM(amount_sold) sum_sales,
   COUNT(amount_sold) cnt_sales, cust_last_name, day_number_in_week
FROM join_sales_cust_time
GROUP BY cust_last_name, day_number_in_week;
If you want to use fast refresh, you should fast refresh all the materialized views along any chain. If you want the highest-level materialized view to be fresh with respect to the detail tables, you need to ensure that all materialized views in a tree are refreshed in the correct dependency order before refreshing the highest-level one. You can automatically refresh intermediate materialized views in a nested hierarchy using the nested = TRUE parameter, as described in "Nesting Materialized Views with Joins and Aggregates" on page 8-19. If you do not specify nested = TRUE and the materialized views under the highest-level materialized view are stale, refreshing only the highest-level one will succeed, but makes it fresh only with respect to its underlying materialized view, not the detail tables at the base of the tree.

When refreshing materialized views, you need to ensure that all materialized views in a tree are refreshed. If you only refresh the highest-level materialized view, the materialized views under it will be stale and you must explicitly refresh them. If you use the REFRESH procedure with the nested parameter value set to TRUE, only specified materialized views and their child materialized views in the tree are refreshed, and not their top-level materialized views. Use the REFRESH_DEPENDENT procedure with the nested parameter value set to TRUE if you want to ensure that all materialized views in a tree are refreshed.

Freshness of a materialized view is calculated relative to the objects directly referenced by the materialized view. When a materialized view references another materialized view, the freshness of the topmost materialized view is calculated relative to changes in the materialized view it directly references, not relative to changes in the detail tables it references indirectly.
CREATE MATERIALIZED VIEW cust_sales_mv
PCTFREE 0 TABLESPACE demo
STORAGE (INITIAL 16k NEXT 16k PCTINCREASE 0)
PARALLEL
BUILD IMMEDIATE
REFRESH COMPLETE
ENABLE QUERY REWRITE
AS SELECT c.cust_last_name, SUM(amount_sold) AS sum_amount_sold
FROM customers c, sales s
WHERE s.cust_id = c.cust_id
GROUP BY c.cust_last_name;
It is not uncommon in a data warehouse to have already created summary or aggregation tables, and you might not wish to repeat this work by building a new materialized view. In this case, the table that already exists in the database can be registered as a prebuilt materialized view. This technique is described in "Registering Existing Materialized Views" on page 8-34. Once you have selected the materialized views you want to create, follow these steps for each materialized view.
1. Design the materialized view. Existing user-defined materialized views do not require this step. If the materialized view contains many rows, then, if appropriate, the materialized view should be partitioned (if possible) and should match the partitioning of the largest or most frequently updated detail or fact table (if possible). Refresh performance benefits from partitioning,
because it can take advantage of parallel DML capabilities and possible PCT-based refresh.
2. Use the CREATE MATERIALIZED VIEW statement to create and, optionally, populate the materialized view. If a user-defined materialized view already exists, then use the ON PREBUILT TABLE clause in the CREATE MATERIALIZED VIEW statement. Otherwise, use the BUILD IMMEDIATE clause to populate the materialized view immediately, or the BUILD DEFERRED clause to populate the materialized view later. A BUILD DEFERRED materialized view is disabled for use by query rewrite until the first COMPLETE REFRESH, after which it will be automatically enabled, provided the ENABLE QUERY REWRITE clause has been specified.
See Also: Oracle Database SQL Reference for descriptions of the SQL statements CREATE MATERIALIZED VIEW, ALTER MATERIALIZED VIEW, and DROP MATERIALIZED VIEW
SELECT s.time_id, c.time_id
FROM sales s, products p, costs c
WHERE s.prod_id = p.prod_id AND c.prod_id = p.prod_id
AND p.prod_name IN (SELECT prod_name FROM products);
Even though the materialized view's defining query is almost identical and logically equivalent to the user's input query, query rewrite does not happen because of the failure of full text match, which is the only rewrite possibility for some queries (for example, a subquery in the WHERE clause). You can add a column alias list to a CREATE MATERIALIZED VIEW statement. The column alias list explicitly resolves any column name conflict without attaching aliases in the SELECT clause of the materialized view. The syntax of the materialized view column alias list is illustrated in the following example:
CREATE MATERIALIZED VIEW sales_mv (sales_tid, costs_tid)
ENABLE QUERY REWRITE
AS SELECT s.time_id, c.time_id
FROM sales s, products p, costs c
WHERE s.prod_id = p.prod_id AND c.prod_id = p.prod_id
AND p.prod_name IN (SELECT prod_name FROM products);
In this example, the defining query of sales_mv now matches exactly with the user query Q1, so full text match rewrite will take place. Note that when aliases are specified in both the SELECT clause and the new alias list clause, the alias list clause supersedes the ones in the SELECT clause.
materialized view should be specified in terms of the tablespace where it is to reside and the size of the extents. If you do not know how much space the materialized view will require, then the DBMS_MVIEW.ESTIMATE_SIZE package can estimate the number of bytes required to store this uncompressed materialized view. This information can then assist the design team in determining the tablespace in which the materialized view should reside.

You should use table compression with highly redundant data, such as tables with many foreign keys. This is particularly useful for materialized views created with the ROLLUP clause. Table compression reduces disk use and memory use (specifically, the buffer cache), often leading to a better scaleup for read-only operations. Table compression can also speed up query execution at the expense of update cost.
See Also: Oracle Database SQL Reference for a complete description of STORAGE semantics, Oracle Database Performance Tuning Guide, and Chapter 5, "Parallelism and Partitioning in Data Warehouses" for table compression examples
Build Methods
Two build methods are available for creating the materialized view, as shown in Table 8-3. If you select BUILD IMMEDIATE, the materialized view definition is added to the schema objects in the data dictionary, and then the fact or detail tables are scanned according to the SELECT expression and the results are stored in the materialized view. Depending on the size of the tables to be scanned, this build process can take a considerable amount of time. An alternative approach is to use the BUILD DEFERRED clause, which creates the materialized view without data, thereby enabling it to be populated at a later date using the DBMS_MVIEW.REFRESH package described in Chapter 15, "Maintaining the Data Warehouse".
Table 8-3 Build Methods

Build Method      Description
BUILD IMMEDIATE   Create the materialized view and then populate it with data.
BUILD DEFERRED    Create the materialized view definition but do not populate it with data.
- The defining query of the materialized view cannot contain any non-repeatable expressions (ROWNUM, SYSDATE, non-repeatable PL/SQL functions, and so on).
- The query cannot contain any references to RAW or LONG RAW datatypes or object REFs.
- If the materialized view was registered as PREBUILT, the precision of the columns must agree with the precision of the corresponding SELECT expressions unless overridden by the WITH REDUCED PRECISION clause.
- If a query has both local and remote tables, only local tables will be considered for potential rewrite.
- Neither the detail tables nor the materialized view can be owned by SYS.
- If a column or expression is present in the GROUP BY clause of the materialized view, it must also be present in the SELECT list.
- Aggregate functions must occur only as the outermost part of the expression. That is, aggregates such as AVG(AVG(x)) or AVG(x) + AVG(x) are not allowed.
- CONNECT BY clauses are not allowed.
Refresh Options
When you define a materialized view, you can specify three refresh options: how to refresh, what type of refresh, and whether trusted constraints can be used. If unspecified, the defaults are assumed as ON DEMAND, FORCE, and ENFORCED constraints respectively. The two refresh execution modes are ON COMMIT and ON DEMAND. Depending on the materialized view you create, some of the options may not be available. Table 8-4 describes the refresh modes.
Table 8-4 Refresh Modes

Refresh Mode   Description
ON COMMIT      Refresh occurs automatically when a transaction that modified one of the materialized view's detail tables commits. This can be specified as long as the materialized view is fast refreshable (in other words, not complex). The ON COMMIT privilege is necessary to use this mode.
ON DEMAND      Refresh occurs when a user manually executes one of the available refresh procedures contained in the DBMS_MVIEW package (REFRESH, REFRESH_ALL_MVIEWS, REFRESH_DEPENDENT).
When a materialized view is maintained using the ON COMMIT method, the time required to complete the commit may be slightly longer than usual. This is because the refresh operation is performed as part of the commit process. Therefore this method may not be suitable if many users are concurrently changing the tables upon which the materialized view is based. If you anticipate performing insert, update or delete operations on tables referenced by a materialized view concurrently with the refresh of that materialized view, and
that materialized view includes joins and aggregation, Oracle recommends you use ON COMMIT fast refresh rather than ON DEMAND fast refresh. If you think the materialized view did not refresh, check the alert log or trace le. If a materialized view fails during refresh at COMMIT time, you must explicitly invoke the refresh procedure using the DBMS_MVIEW package after addressing the errors specied in the trace les. Until this is done, the materialized view will no longer be refreshed automatically at commit time. You can specify how you want your materialized views to be refreshed from the detail tables by selecting one of four options: COMPLETE, FAST, FORCE, and NEVER. Table 85 describes the refresh options.
Table 8-5 Refresh Options

Refresh Option   Description
COMPLETE         Refreshes by recalculating the materialized view's defining query.
FAST             Applies incremental changes to refresh the materialized view using the information logged in the materialized view logs, or from a SQL*Loader direct-path load or a partition maintenance operation.
FORCE            Applies FAST refresh if possible; otherwise, it applies COMPLETE refresh.
NEVER            Indicates that the materialized view will not be refreshed with refresh mechanisms.
Whether the fast refresh option is available depends upon the type of materialized view. You can call the procedure DBMS_MVIEW.EXPLAIN_MVIEW to determine whether fast refresh is possible.

You can also specify if it is acceptable to use trusted constraints and REWRITE_INTEGRITY = TRUSTED during refresh. Any nonvalidated RELY constraint is a trusted constraint. For example, nonvalidated foreign key/primary key relationships, functional dependencies defined in dimensions, or a materialized view in the UNKNOWN state. If query rewrite is enabled during refresh, these can improve the performance of refresh by enabling more performant query rewrites. Any materialized view that uses TRUSTED constraints for refresh is left in a state of trusted freshness (the UNKNOWN state) after refresh. This is reflected in the column STALENESS in the view USER_MVIEWS. The column UNKNOWN_TRUSTED_FD in the same view is also set to Y, which means yes. You can define this property of the materialized view either at create time, by specifying REFRESH USING TRUSTED [ENFORCED] CONSTRAINTS, or later by using the ALTER MATERIALIZED VIEW DDL.
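For instance, a minimal sketch that switches the product_sales_mv materialized view created earlier in this chapter over to trusted constraints (which refresh clause you combine this with depends on your requirements):

ALTER MATERIALIZED VIEW product_sales_mv
REFRESH USING TRUSTED CONSTRAINTS;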
Table 8-6 Constraints

Constraints            Description
TRUSTED CONSTRAINTS    Refresh can use trusted constraints and REWRITE_INTEGRITY = TRUSTED during refresh. This allows use of non-validated RELY constraints and rewrite against materialized views in UNKNOWN or FRESH state during refresh.
ENFORCED CONSTRAINTS   Refresh can use validated constraints and REWRITE_INTEGRITY = ENFORCED during refresh. This allows use of only validated, enforced constraints and rewrite against materialized views in FRESH state during refresh.
General Restrictions on Fast Refresh

- The materialized view must not contain references to non-repeating expressions like SYSDATE and ROWNUM.
- The materialized view must not contain references to RAW or LONG RAW data types.
The following restrictions apply to fast refresh for materialized views containing only joins:

- All restrictions from "General Restrictions on Fast Refresh" on page 8-27.
- They cannot have GROUP BY clauses or aggregates.
- If the WHERE clause of the query contains outer joins, then unique constraints must exist on the join columns of the inner join table.
- If there are no outer joins, you can have arbitrary selections and joins in the WHERE clause. However, if there are outer joins, the WHERE clause cannot have any selections. Furthermore, if there are outer joins, all the joins must be connected by ANDs and must use the equality (=) operator.
- Rowids of all the tables in the FROM list must appear in the SELECT list of the query.
- Materialized view logs must exist with rowids for all the base tables in the FROM list of the query.
Fast refresh is supported for both ON COMMIT and ON DEMAND materialized views; however, the following restrictions apply:

- All tables in the materialized view must have materialized view logs, and the materialized view logs must:
  - Contain all columns from the table referenced in the materialized view.
  - Specify with ROWID and INCLUDING NEW VALUES.
  - Specify the SEQUENCE clause if the table is expected to have a mix of inserts/direct-loads, deletes, and updates.
- Only SUM, COUNT, AVG, STDDEV, VARIANCE, MIN and MAX are supported for fast refresh.
- COUNT(*) must be specified.
- For each aggregate such as AVG(expr), the corresponding COUNT(expr) must be present.
- If VARIANCE(expr) or STDDEV(expr) is specified, COUNT(expr) and SUM(expr) must be specified. Oracle recommends that SUM(expr * expr) be specified. See Table 8-2 on page 8-15 for further details.
- The SELECT list must contain all GROUP BY columns.
- If the materialized view has one of the following, then fast refresh is supported only on conventional DML inserts and direct loads:
  - Materialized views with MIN or MAX aggregates
  - Materialized views which have SUM(expr) but no COUNT(expr)
  - Materialized views without COUNT(*)
- A materialized view with MAX or MIN is fast refreshable after delete or mixed DML statements if it does not have a WHERE clause.
- Materialized views with named views or subqueries in the FROM clause can be fast refreshed provided the views can be completely merged. For information on which views will merge, refer to the Oracle Database Performance Tuning Guide.
- If there are no outer joins, you may have arbitrary selections and joins in the WHERE clause.
Materialized aggregate views with outer joins are fast refreshable after conventional DML and direct loads, provided only the outer table has been modied. Also, unique constraints must exist on the join columns of the inner join table. If there are outer joins, all the joins must be connected by ANDs and must use the equality (=) operator. For materialized views with CUBE, ROLLUP, grouping sets, or concatenation of them, the following restrictions apply:
- The SELECT list should contain a grouping distinguisher, which can be either a GROUPING_ID function on all GROUP BY expressions or GROUPING functions, one for each GROUP BY expression. For example, if the GROUP BY clause of the materialized view is "GROUP BY CUBE(a, b)", then the SELECT list should contain either "GROUPING_ID(a, b)" or "GROUPING(a) AND GROUPING(b)" for the materialized view to be fast refreshable.
- GROUP BY should not result in any duplicate groupings. For example, "GROUP BY a, ROLLUP(a, b)" is not fast refreshable because it results in duplicate groupings "(a), (a, b), AND (a)".
The defining query must have the UNION ALL operator at the top level. The UNION ALL operator cannot be embedded inside a subquery, with one exception: The UNION ALL can be in a subquery in the FROM clause provided the defining query is of the form SELECT * FROM (view or subquery with UNION ALL) as in the following example:
CREATE VIEW view_with_unionall AS
(SELECT c.rowid crid, c.cust_id, 2 umarker
 FROM customers c
 WHERE c.cust_last_name = 'Smith'
 UNION ALL
 SELECT c.rowid crid, c.cust_id, 3 umarker
 FROM customers c
 WHERE c.cust_last_name = 'Jones');

CREATE MATERIALIZED VIEW unionall_inside_view_mv
REFRESH FAST ON DEMAND
AS SELECT * FROM view_with_unionall;
Note that the view view_with_unionall satisfies all requirements for fast refresh.
- Each query block in the UNION ALL query must satisfy the requirements of a fast refreshable materialized view with aggregates or a fast refreshable materialized view with joins. The appropriate materialized view logs must be created on the tables as required for the corresponding type of fast refreshable materialized view. Note that the Oracle Database also allows the special case of a single table materialized view with joins only provided the ROWID column has been included in the SELECT list and in the materialized view log. This is shown in the defining query of the view view_with_unionall.
- The SELECT list of each query must include a maintenance column, called a UNION ALL marker. The UNION ALL column must have a distinct constant numeric or string value in each UNION ALL branch. Further, the marker column must appear in the same ordinal position in the SELECT list of each query block.
- Some features such as outer joins, insert-only aggregate materialized view queries and remote tables are not supported for materialized views with UNION ALL.
- PCT-based refresh is not supported for UNION ALL materialized views.
- The compatibility initialization parameter must be set to 9.2.0 or higher to create a fast refreshable materialized view with UNION ALL.
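Rather than checking these restriction lists by hand, you can ask the database which capabilities a particular materialized view has using the DBMS_MVIEW.EXPLAIN_MVIEW procedure mentioned earlier. The following is a minimal sketch; the materialized view name is one created earlier in this chapter, and the output table is created by the utlxmv.sql script shipped with the database:

-- Create MV_CAPABILITIES_TABLE once in the current schema:
-- @?/rdbms/admin/utlxmv.sql

EXECUTE DBMS_MVIEW.EXPLAIN_MVIEW('PRODUCT_SALES_MV');

SELECT capability_name, possible, msgtxt
FROM mv_capabilities_table
ORDER BY seq;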
set to TRUE on the base tables. If you use DBMS_MVIEW.REFRESH, the entire materialized view chain is refreshed from the top down. With DBMS_MVIEW.REFRESH_DEPENDENT, the entire chain is refreshed from the bottom up.
Example 8-7 Refreshing a Nested Materialized View
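The statement being discussed does not appear in this copy of the chapter. Based on the description that follows, it is a nested refresh call of roughly this form (the materialized view names are taken from the surrounding text):

EXECUTE DBMS_MVIEW.REFRESH('SALES_MV,COST_MV', nested => TRUE);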
This statement will first refresh all child materialized views of sales_mv and cost_mv based on the dependency analysis and then refresh the two specified materialized views. You can query the STALE_SINCE column in the *_MVIEWS views to find out when a materialized view became stale.
ORDER BY Clause
An ORDER BY clause is allowed in the CREATE MATERIALIZED VIEW statement. It is used only during the initial creation of the materialized view. It is not used during a full refresh or a fast refresh. To improve the performance of queries against large materialized views, store the rows in the materialized view in the order specified in the ORDER BY clause. This initial ordering provides physical clustering of the data. If indexes are built on the columns by which the materialized view is ordered, accessing the rows of the materialized view using the index often reduces the time for disk I/O due to the physical clustering. The ORDER BY clause is not considered part of the materialized view definition. As a result, there is no difference in the manner in which Oracle Database detects the various types of materialized views (for example, materialized join views with no aggregates). For the same reason, query rewrite is not affected by the ORDER BY clause. This feature is similar to the CREATE TABLE ... ORDER BY capability.
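As an illustration only (the materialized view name and defining query here are hypothetical), the clause is simply appended to the defining query:

CREATE MATERIALIZED VIEW cust_sales_ordered_mv
BUILD IMMEDIATE
REFRESH COMPLETE
AS SELECT cust_id, SUM(amount_sold) AS total_sold
FROM sales
GROUP BY cust_id
ORDER BY cust_id;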
are not created on the materialized view. For fast refresh of materialized views, the definition of the materialized view logs must normally specify the ROWID clause. In addition, for aggregate materialized views, it must also contain every column in the table referenced in the materialized view, the INCLUDING NEW VALUES clause and the SEQUENCE clause. An example of a materialized view log is shown as follows where one is created on the table sales.
CREATE MATERIALIZED VIEW LOG ON sales
WITH ROWID
(prod_id, cust_id, time_id, channel_id, promo_id, quantity_sold, amount_sold)
INCLUDING NEW VALUES;
Alternatively, a materialized view log can be amended to include the rowid, as in the following:
ALTER MATERIALIZED VIEW LOG ON sales
ADD ROWID
(prod_id, cust_id, time_id, channel_id, promo_id, quantity_sold, amount_sold)
INCLUDING NEW VALUES;
Oracle recommends that the keyword SEQUENCE be included in your materialized view log statement unless you are sure that you will never perform a mixed DML operation (a combination of INSERT, UPDATE, or DELETE operations on multiple tables). The SEQUENCE column is required in the materialized view log to support fast refresh with a combination of INSERT, UPDATE, or DELETE statements on multiple tables. You can, however, add the SEQUENCE number to the materialized view log after it has been created. The boundary of a mixed DML operation is determined by whether the materialized view is ON COMMIT or ON DEMAND.
- For ON COMMIT, the mixed DML statements occur within the same transaction because the refresh of the materialized view will occur upon commit of this transaction.
- For ON DEMAND, the mixed DML statements occur between refreshes.

The following example illustrates a materialized view log created on the table sales that includes the SEQUENCE keyword:
CREATE MATERIALIZED VIEW LOG ON sales WITH SEQUENCE, ROWID (prod_id, cust_id, time_id, channel_id, promo_id, quantity_sold, amount_sold) INCLUDING NEW VALUES;
The following kinds of expressions are not supported:

- Expressions that may return different values, depending on NLS parameter settings. For example, (date > "01/02/03") or (rate <= "2.150") are NLS parameter dependent expressions.
- Equijoins where one side of the join is character data. The result of this equijoin depends on collation and this can change on a session basis, giving an incorrect result in the case of query rewrite or an inconsistent materialized view after a refresh operation.
- Expressions that generate internal conversion to character data in the SELECT list of a materialized view, or inside an aggregate of a materialized aggregate view. This restriction does not apply to expressions that involve only numeric data, for example, a+b where a and b are numeric fields.
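The comment referred to in the next paragraph can be created with a statement of the following form (the comment text is illustrative):

COMMENT ON MATERIALIZED VIEW sales_mv IS 'sales materialized view';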
To view the comment after the preceding statement has been executed, you can query the catalog views USER_MVIEW_COMMENTS, DBA_MVIEW_COMMENTS, and ALL_MVIEW_COMMENTS. For example:
SELECT MVIEW_NAME, COMMENTS FROM USER_MVIEW_COMMENTS WHERE MVIEW_NAME = 'SALES_MV';
Note: If the compatibility is set to 10.0.1 or higher, COMMENT ON TABLE is not allowed on the materialized view container table. If it is issued, the following error message is returned:
ORA-12098: cannot comment on the materialized view.
In the case of a prebuilt table, if it has an existing comment, the comment is inherited by the materialized view after the materialized view has been created, prefixed with '(from table)'. For example, suppose the table sales_summary was created to contain sales summary information and carries the comment 'Sales summary data'. A materialized view of the same name is created to use the prebuilt table as its container table. After the materialized view creation, the comment becomes '(from table) Sales summary data'. However, if the prebuilt table sales_summary does not have a comment and the comment 'Sales summary data' is added to the materialized view, then, if the materialized view is dropped, the comment is passed to the prebuilt table as '(from materialized view) Sales summary data'.
- Provide query rewrite to all SQL applications.
- Enable materialized views defined in one application to be transparently accessed in another application.
- Generally support fast parallel or fast materialized view refresh.
Because of these limitations, and because existing materialized views can be extremely large and expensive to rebuild, you should register your existing
materialized view tables whenever possible. You can register a user-defined materialized view with the CREATE MATERIALIZED VIEW ... ON PREBUILT TABLE statement. Once registered, the materialized view can be used for query rewrites or maintained by one of the refresh methods, or both.

The contents of the table must reflect the materialization of the defining query at the time you register it as a materialized view, and each column in the defining query must correspond to a column in the table that has a matching datatype. However, you can specify WITH REDUCED PRECISION to allow the precision of columns in the defining query to be different from that of the table columns.

The table and the materialized view must have the same name, but the table retains its identity as a table and can contain columns that are not referenced in the defining query of the materialized view. These extra columns are known as unmanaged columns. If rows are inserted during a refresh operation, each unmanaged column of the row is set to its default value. Therefore, the unmanaged columns cannot have NOT NULL constraints unless they also have default values.

Materialized views based on prebuilt tables are eligible for selection by query rewrite provided the parameter QUERY_REWRITE_INTEGRITY is set to STALE_TOLERATED or TRUSTED. See Chapter 18, "Query Rewrite" for details about integrity levels. When you drop a materialized view that was created on a prebuilt table, the table still exists; only the materialized view is dropped.

The following example illustrates the two steps required to register a user-defined table. First, the table is created, then the materialized view is defined using exactly the same name as the table. This materialized view sum_sales_tab is eligible for use in query rewrite.
CREATE TABLE sum_sales_tab
  PCTFREE 0 TABLESPACE demo
  STORAGE (INITIAL 16k NEXT 16k PCTINCREASE 0) AS
  SELECT s.prod_id, SUM(amount_sold) AS dollar_sales,
         SUM(quantity_sold) AS unit_sales
  FROM sales s GROUP BY s.prod_id;

CREATE MATERIALIZED VIEW sum_sales_tab
ON PREBUILT TABLE WITHOUT REDUCED PRECISION
ENABLE QUERY REWRITE AS
SELECT s.prod_id, SUM(amount_sold) AS dollar_sales,
       SUM(quantity_sold) AS unit_sales
FROM sales s GROUP BY s.prod_id;
You could have compressed this table to save space. See "Storage And Table Compression" on page 8-22 for details regarding table compression. In some cases, user-defined materialized views are refreshed on a schedule that is longer than the update cycle. For example, a monthly materialized view might be updated only at the end of each month, and the materialized view values always refer to complete time periods. Reports written directly against these materialized views implicitly select only data that is not in the current (incomplete) time period. If a user-defined materialized view already contains a time dimension:
- It should be registered and then fast refreshed each update cycle.
- You can create a view that selects the complete time period of interest. The reports should be modified to refer to the view instead of referring directly to the user-defined materialized view.
If the user-defined materialized view does not contain a time dimension, then:
- Create a new materialized view that does include the time dimension (if possible).
- The view should aggregate over the time column in the new materialized view.
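The statement being described in the next paragraph is of the form:

DROP MATERIALIZED VIEW sales_sum_mv;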
This statement drops the materialized view sales_sum_mv. If the materialized view was prebuilt on a table, then the table is not dropped, but it can no longer be maintained with the refresh mechanism or used by query rewrite. Alternatively, you can drop a materialized view using Oracle Enterprise Manager.
The DBMS_MVIEW.EXPLAIN_MVIEW procedure tells you, among other things:

- Whether a materialized view is fast refreshable
- What types of query rewrite you can perform with this materialized view
- Whether PCT refresh is possible
Using this procedure is straightforward. You simply call DBMS_MVIEW.EXPLAIN_MVIEW, passing in as a single parameter the schema and materialized view name for an existing materialized view. Alternatively, you can specify the SELECT string for a potential materialized view or the complete CREATE MATERIALIZED VIEW statement. The materialized view or potential materialized view is then analyzed and the results are written into either a table called MV_CAPABILITIES_TABLE, which is the default, or to an array called MSG_ARRAY. Note that you must run the utlxmv.sql script prior to calling EXPLAIN_MVIEW except when you are placing the results in MSG_ARRAY. The script is found in the admin directory and creates the MV_CAPABILITIES_TABLE in the current schema. An explanation of the various capabilities is in Table 8-7 on page 8-41, and all the possible messages are listed in Table 8-8 on page 8-43.
The parameters are as follows:

mv
The name of an existing materialized view or the query definition or the entire CREATE MATERIALIZED VIEW statement of a potential materialized view you want to analyze.

stmt_id
An optional parameter. A client-supplied unique identifier to associate output rows with specific invocations of EXPLAIN_MVIEW.
EXPLAIN_MVIEW analyzes the specied materialized view in terms of its refresh and rewrite capabilities and inserts its results (in the form of multiple rows) into MV_CAPABILITIES_TABLE or MSG_ARRAY.
See Also: PL/SQL Packages and Types Reference for further information about the DBMS_MVIEW package
DBMS_MVIEW.EXPLAIN_MVIEW Declarations
The following PL/SQL declarations that are made for you in the DBMS_MVIEW package show the order and datatypes of these parameters for explaining an existing materialized view and a potential materialized view with output to a table and to a VARRAY. Explain an existing or potential materialized view with output to MV_ CAPABILITIES_TABLE:
DBMS_MVIEW.EXPLAIN_MVIEW (mv IN VARCHAR2, stmt_id IN VARCHAR2:= NULL);
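Explain an existing or potential materialized view with output to a VARRAY. The following declaration is a sketch of the second overload described above; consult PL/SQL Packages and Types Reference for the authoritative signature:

DBMS_MVIEW.EXPLAIN_MVIEW (mv IN VARCHAR2, msg_array OUT SYS.ExplainMVArrayType);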
Using MV_CAPABILITIES_TABLE
One of the simplest ways to use DBMS_MVIEW.EXPLAIN_MVIEW is with the MV_CAPABILITIES_TABLE, which has the following structure:
CREATE TABLE MV_CAPABILITIES_TABLE
(STMT_ID          VARCHAR(30),   -- client-supplied unique statement identifier
 MV               VARCHAR(30),   -- NULL for SELECT based EXPLAIN_MVIEW
 CAPABILITY_NAME  VARCHAR(30),   -- A descriptive name of particular capabilities,
                                 -- such as REWRITE. See Table 8-7.
 POSSIBLE         CHARACTER(1),  -- Y = capability is possible
                                 -- N = capability is not possible
 RELATED_TEXT     VARCHAR(2000), -- owner.table.column, and so on related to this message
 RELATED_NUM      NUMBER,        -- When there is a numeric value associated with a row,
                                 -- it goes here
 MSGNO            INTEGER,       -- When available, message # explaining why disabled or
                                 -- more details when enabled
 MSGTXT           VARCHAR(2000), -- Text associated with MSGNO
 SEQ              NUMBER);       -- Useful in ORDER BY clause when selecting from this table
You can use the utlxmv.sql script found in the admin directory to create MV_CAPABILITIES_TABLE.
Example 8-8 DBMS_MVIEW.EXPLAIN_MVIEW
First, create the materialized view. Alternatively, you can use EXPLAIN_MVIEW on a potential materialized view using its SELECT statement or the complete CREATE MATERIALIZED VIEW statement.
CREATE MATERIALIZED VIEW cal_month_sales_mv BUILD IMMEDIATE REFRESH FORCE ENABLE QUERY REWRITE AS SELECT t.calendar_month_desc, SUM(s.amount_sold) AS dollars FROM sales s, times t WHERE s.time_id = t.time_id GROUP BY t.calendar_month_desc;
Then, you invoke EXPLAIN_MVIEW with the materialized view to explain. You need to use the SEQ column in an ORDER BY clause so the rows will display in a logical order. If a capability is not possible, N will appear in the P column and an explanation in the MSGTXT column. If a capability is not possible for more than one reason, a row is displayed for each reason.
EXECUTE DBMS_MVIEW.EXPLAIN_MVIEW ('SH.CAL_MONTH_SALES_MV'); SELECT capability_name, possible, SUBSTR(related_text,1,8) AS rel_text, SUBSTR(msgtxt,1,60) AS msgtxt FROM MV_CAPABILITIES_TABLE ORDER BY seq;
CAPABILITY_NAME                P  REL_TEXT  MSGTXT
------------------------------ -  --------  ------
PCT                            N
REFRESH_COMPLETE               Y
REFRESH_FAST                   N
REWRITE                        Y
PCT_TABLE                      N            no partition key or PMARKER in select list
PCT_TABLE                      N            relation is not a partitioned table
REFRESH_FAST_AFTER_INSERT      N  SH.TIMES  mv log must have new values
REFRESH_FAST_AFTER_INSERT      N  SH.TIMES  mv log must have ROWID
REFRESH_FAST_AFTER_INSERT      N  SH.TIMES  mv log does not have all necessary columns
REFRESH_FAST_AFTER_INSERT      N  SH.SALES  mv log must have new values
REFRESH_FAST_AFTER_INSERT      N  SH.SALES  mv log must have ROWID
REFRESH_FAST_AFTER_INSERT      N  SH.SALES  mv log does not have all necessary columns
REFRESH_FAST_AFTER_ONETAB_DML  N            SUM(expr) without COUNT(expr)
REFRESH_FAST_AFTER_ONETAB_DML  N            see the reason why REFRESH_FAST_AFTER_INSERT is disabled
REFRESH_FAST_AFTER_ONETAB_DML  N            COUNT(*) is not present in the select list
REFRESH_FAST_AFTER_ONETAB_DML  N            SUM(expr) without COUNT(expr)
REFRESH_FAST_AFTER_ANY_DML     N            see the reason why REFRESH_FAST_AFTER_ONETAB_DML is disabled
REFRESH_FAST_AFTER_ANY_DML     N            mv log must have sequence
REFRESH_FAST_AFTER_ANY_DML     N            mv log must have sequence
REFRESH_PCT                    N            PCT is not possible on any of the detail tables in the materialized view
REWRITE_FULL_TEXT_MATCH        Y
REWRITE_PARTIAL_TEXT_MATCH     Y
REWRITE_GENERAL                Y
REWRITE_PCT                    N
MV_CAPABILITIES_TABLE.CAPABILITY_NAME Details
Table 8-7 lists explanations for values in the CAPABILITY_NAME column.
Table 8-7  CAPABILITY_NAME Column Details

PCT
If this capability is possible, Partition Change Tracking (PCT) is possible on at least one detail relation. If this capability is not possible, PCT is not possible with any detail relation referenced by the materialized view.

REFRESH_COMPLETE
If this capability is possible, complete refresh of the materialized view is possible.

REFRESH_FAST
If this capability is possible, fast refresh is possible at least under certain circumstances.

REWRITE
If this capability is possible, at least full text match query rewrite is possible. If this capability is not possible, no form of query rewrite is possible.

PCT_TABLE
If this capability is possible, it is possible with respect to a particular partitioned table in the top level FROM list. When possible, PCT applies to the partitioned table named in the RELATED_TEXT column. PCT is needed to support fast refresh after partition maintenance operations on the table named in the RELATED_TEXT column. PCT may also support fast refresh with regard to updates to the table named in the RELATED_TEXT column when fast refresh from a materialized view log is not possible. PCT is also needed to support query rewrite in the presence of partial staleness of the materialized view with regard to the table named in the RELATED_TEXT column. When disabled, PCT does not apply to the table named in the RELATED_TEXT column. In this case, fast refresh is not possible after partition maintenance operations on the table named in the RELATED_TEXT column. In addition, PCT-based refresh of updates to the table named in the RELATED_TEXT column is not possible. Finally, query rewrite cannot be supported in the presence of partial staleness of the materialized view with regard to the table named in the RELATED_TEXT column.

PCT_TABLE_REWRITE
If this capability is possible, it is possible with respect to a particular partitioned table in the top level FROM list. When possible, PCT applies to the partitioned table named in the RELATED_TEXT column. This capability is needed to support query rewrite against this materialized view in partial stale state with regard to the table named in the RELATED_TEXT column. When disabled, query rewrite cannot be supported if this materialized view is in partial stale state with regard to the table named in the RELATED_TEXT column.

REFRESH_FAST_AFTER_INSERT
If this capability is possible, fast refresh from a materialized view log is possible at least in the case where the updates are restricted to INSERT operations; complete refresh is also possible. If this capability is not possible, no form of fast refresh from a materialized view log is possible.

REFRESH_FAST_AFTER_ONETAB_DML
If this capability is possible, fast refresh from a materialized view log is possible regardless of the type of update operation, provided all update operations are performed on a single table. If this capability is not possible, fast refresh from a materialized view log may not be possible when the update operations are performed on multiple tables.

REFRESH_FAST_AFTER_ANY_DML
If this capability is possible, fast refresh from a materialized view log is possible regardless of the type of update operation or the number of tables updated. If this capability is not possible, fast refresh from a materialized view log may not be possible when the update operations (other than INSERT) affect multiple tables.

REFRESH_FAST_PCT
If this capability is possible, fast refresh using PCT is possible. Generally, this means that refresh is possible after partition maintenance operations on those detail tables where PCT is indicated as possible.

REWRITE_FULL_TEXT_MATCH
If this capability is possible, full text match query rewrite is possible. If this capability is not possible, full text match query rewrite is not possible.

REWRITE_PARTIAL_TEXT_MATCH
If this capability is possible, at least full and partial text match query rewrite are possible. If this capability is not possible, at least partial text match query rewrite and general query rewrite are not possible.

REWRITE_GENERAL
If this capability is possible, all query rewrite capabilities are possible, including general query rewrite and full and partial text match query rewrite. If this capability is not possible, at least general query rewrite is not possible.

REWRITE_PCT
If this capability is possible, query rewrite can use a partially stale materialized view even in QUERY_REWRITE_INTEGRITY = ENFORCED or TRUSTED modes. When this capability is not possible, query rewrite can use a partially stale materialized view only in QUERY_REWRITE_INTEGRITY = STALE_TOLERATED mode.
Table 8-8  MV_CAPABILITIES_TABLE Column Details

MSGNO NULL
MSGTXT: NULL. RELATED_TEXT: For PCT capability only: [owner.]name of the table upon which PCT is enabled.

MSGNO 2066
MSGTXT: This statement resulted in an Oracle error. RELATED_NUM: Oracle error number that occurred.

MSGNO 2067
MSGTXT: No partition key or PMARKER or join dependent expression in select list. RELATED_TEXT: [owner.]name of relation for which PCT is not supported.

MSGNO 2068
MSGTXT: Relation is not partitioned. RELATED_TEXT: [owner.]name of relation for which PCT is not supported.

MSGNO 2069
MSGTXT: PCT not supported with multicolumn partition key. RELATED_TEXT: [owner.]name of relation for which PCT is not supported.

MSGNO 2070
MSGTXT: PCT not supported with this type of partitioning. RELATED_TEXT: [owner.]name of relation for which PCT is not supported.

MSGNO 2071
MSGTXT: Internal error: undefined PCT failure code. RELATED_NUM: The unrecognized numeric PCT failure code. RELATED_TEXT: [owner.]name of relation for which PCT is not supported.

MSGNO 2072
MSGTXT: Requirements not satisfied for fast refresh of nested mv.

MSGNO 2077
MSGTXT: Mv log is newer than last full refresh. RELATED_TEXT: [owner.]table_name of table upon which the mv log is needed.

MSGNO 2078
MSGTXT: Mv log must have new values. RELATED_TEXT: [owner.]table_name of table upon which the mv log is needed.

MSGNO 2079
MSGTXT: Mv log must have ROWID. RELATED_TEXT: [owner.]table_name of table upon which the mv log is needed.

MSGNO 2080
MSGTXT: Mv log must have primary key. RELATED_TEXT: [owner.]table_name of table upon which the mv log is needed.

MSGNO 2081
MSGTXT: Mv log does not have all necessary columns. RELATED_TEXT: [owner.]table_name of table upon which the mv log is needed.

MSGNO 2082
MSGTXT: Problem with mv log. RELATED_TEXT: [owner.]table_name of table upon which the mv log is needed.

MSGNO 2099
MSGTXT: Mv references a remote table or view in the FROM list. RELATED_NUM: Offset from the SELECT keyword to the table or view in question. RELATED_TEXT: [owner.]name of the table or view in question.

MSGNO 2126
MSGTXT: Multiple master sites. RELATED_TEXT: Name of the first different node, or NULL if the first different node is local.

MSGTXT: Join or filter condition(s) are complex. RELATED_TEXT: [owner.]name of the table involved with the join or filter condition (or NULL when not available).

MSGNO 2130
MSGTXT: Expression not supported for fast refresh. RELATED_NUM: Offset from the SELECT keyword to the expression in question. RELATED_TEXT: The alias name in the select list of the expression in question.

MSGNO 2150
MSGTXT: Select lists must be identical across the UNION operator. RELATED_NUM: Offset from the SELECT keyword to the first different select item in the SELECT list. RELATED_TEXT: The alias name of the first different select item in the SELECT list.

MSGNO 2182
MSGTXT: PCT is enabled through a join dependency. RELATED_TEXT: [owner.]name of relation for which PCT_TABLE_REWRITE is not enabled.

MSGNO 2183
MSGTXT: Expression to enable PCT not in PARTITION BY of analytic function or spreadsheet. RELATED_NUM: The unrecognized numeric PCT failure code. RELATED_TEXT: [owner.]name of relation for which PCT is not enabled.

MSGTXT: Expression to enable PCT cannot be rolled up. RELATED_TEXT: [owner.]name of relation for which PCT is not enabled.

MSGTXT: No partition key or PMARKER in the SELECT list. RELATED_TEXT: [owner.]name of relation for which PCT_TABLE_REWRITE is not enabled.

MSGTXT: GROUP OUTER JOIN is present.

MSGTXT: Materialized view on external table.
9
Advanced Materialized Views
This chapter discusses advanced topics in using materialized views:
- Partitioning and Materialized Views
- Materialized Views in OLAP Environments
- Materialized Views and Models
- Invalidating Materialized Views
- Security Issues with Materialized Views
- Altering Materialized Views
To support PCT, a materialized view must satisfy the following requirements:

- At least one of the detail tables referenced by the materialized view must be partitioned.
- Partitioned tables must use either range, list or composite partitioning.
- The top level partition key must consist of only a single column.
- The materialized view must contain either the partition key column or a partition marker or ROWID or join dependent expression of the detail table. See PL/SQL Packages and Types Reference for details regarding the DBMS_MVIEW.PMARKER function.
- If you use a GROUP BY clause, the partition key column or the partition marker or ROWID or join dependent expression must be present in the GROUP BY clause.
- If you use an analytic window function or the MODEL clause, the partition key column or the partition marker or ROWID or join dependent expression must be present in their respective PARTITION BY subclauses.
- Data modifications can only occur on the partitioned table. If PCT refresh is being done for a table which has a join dependent expression in the materialized view, then data modifications should not have occurred in any of the join dependent tables.
- The COMPATIBILITY initialization parameter must be a minimum of 9.0.0.0.0.
- PCT is not supported for a materialized view that refers to views, remote tables, or outer joins.
- PCT-based refresh is not supported for UNION ALL materialized views.
Partition Key
Partition change tracking requires sufficient information in the materialized view to be able to correlate a detail row in the source partitioned detail table to the corresponding materialized view row. This can be accomplished by including the detail table partition key columns in the SELECT list and, if GROUP BY is used, in the GROUP BY list.

Consider an example of a materialized view storing daily customer sales. The following example uses the sh sample schema and the three detail tables sales, products, and times to create the materialized view. The sales table is partitioned by the time_id column and products is partitioned by the prod_id column. times is not a partitioned table.
Example 9-1 Partition Key
The detail tables must have materialized view logs for FAST REFRESH. The following is an example:
CREATE MATERIALIZED VIEW LOG ON SALES WITH ROWID (prod_id, time_id, quantity_sold, amount_sold) INCLUDING NEW VALUES;
CREATE MATERIALIZED VIEW LOG ON PRODUCTS WITH ROWID
  (prod_id, prod_name, prod_desc)
  INCLUDING NEW VALUES;

CREATE MATERIALIZED VIEW LOG ON TIMES WITH ROWID
  (time_id, calendar_month_name, calendar_year)
  INCLUDING NEW VALUES;

CREATE MATERIALIZED VIEW cust_dly_sales_mv
BUILD DEFERRED REFRESH FAST ON DEMAND
ENABLE QUERY REWRITE AS
SELECT s.time_id, p.prod_id, p.prod_name, COUNT(*),
       SUM(s.quantity_sold), SUM(s.amount_sold),
       COUNT(s.quantity_sold), COUNT(s.amount_sold)
FROM sales s, products p, times t
WHERE s.time_id = t.time_id AND s.prod_id = p.prod_id
GROUP BY s.time_id, p.prod_id, p.prod_name;
For cust_dly_sales_mv, PCT is enabled on both the sales table and products table because their respective partitioning key columns time_id and prod_id are in the materialized view.
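Consider, for example, a query along the following lines (a sketch, not the original example), which joins times to sales on the partitioning key column time_id and selects the calendar_month_name attribute:

SELECT s.time_id, t.calendar_month_name
FROM   sales s, times t
WHERE  s.time_id = t.time_id;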
In this query, the times table is a join dependent table because it is joined to the sales table on the partitioning key column time_id. Moreover, calendar_month_name is a dimension hierarchical attribute of times.time_id, because calendar_month_name is an attribute of times.mon_id and times.mon_id is a dimension hierarchical parent of times.time_id. Hence, the expression calendar_month_name from the times table is a join dependent expression. Let's look at another example:
SELECT s.time_id, y.calendar_year_name FROM sales s, times_d d, times_m m, times_y y WHERE s.time_id = d.time_id AND d.day_id = m.day_id AND m.mon_id = y.mon_id;
Here, times table is denormalized into times_d, times_m and times_y tables. The expression calendar_year_name from times_y table is a join dependent
expression and the tables times_d, times_m and times_y are join dependent tables. This is because the times_y table is joined indirectly through the times_m and times_d tables to the sales table on its partitioning key column time_id. This lets users create materialized views containing aggregates on some level higher than the partitioning key of the detail table. Consider the following example of a materialized view storing monthly customer sales.
Example 9-2 Join Dependent Expression
Assuming the presence of the materialized view logs defined earlier, the materialized view can be created using the following DDL:
CREATE MATERIALIZED VIEW cust_mth_sales_mv
BUILD DEFERRED REFRESH FAST ON DEMAND
ENABLE QUERY REWRITE AS
SELECT t.calendar_month_name, p.prod_id, p.prod_name, COUNT(*),
       SUM(s.quantity_sold), SUM(s.amount_sold),
       COUNT(s.quantity_sold), COUNT(s.amount_sold)
FROM sales s, products p, times t
WHERE s.time_id = t.time_id AND s.prod_id = p.prod_id
GROUP BY t.calendar_month_name, p.prod_id, p.prod_name;
Here, you can correlate a detail table row to its corresponding materialized view row using the join dependent table times and the relationship that times.calendar_month_name is a dimensional attribute determined by times.time_id. This enables partition change tracking on the sales table. In addition, PCT is enabled on the products table because of the presence of its partitioning key column prod_id in the materialized view.
Partition Marker
The DBMS_MVIEW.PMARKER function is designed to significantly reduce the cardinality of the materialized view (see Example 9-3 on page 9-6 for an example). The function returns a partition identifier that uniquely identifies the partition for a specified row within a specified partitioned table. Therefore, the DBMS_MVIEW.PMARKER function is used instead of the partition key column in the SELECT and GROUP BY clauses.

Unlike the general case of a PL/SQL function in a materialized view, use of DBMS_MVIEW.PMARKER does not prevent rewrite with that materialized view even when the rewrite mode is QUERY_REWRITE_INTEGRITY = ENFORCED.

As an example of using the PMARKER function, consider calculating a typical number, such as revenue generated by a product category during a given year. If
there were 1000 different products sold each month, it would result in 12,000 rows in the materialized view.
Example 9-3 Partition Marker
Consider an example of a materialized view storing the yearly sales revenue for each product category. With hundreds of different products in each product category, including the partitioning key column prod_id of the products table in the materialized view would substantially increase the cardinality. Instead, this materialized view uses the DBMS_MVIEW.PMARKER function, which increases the cardinality of the materialized view only by a factor of the number of partitions in the products table.
CREATE MATERIALIZED VIEW prod_yr_sales_mv
BUILD DEFERRED REFRESH FAST ON DEMAND
ENABLE QUERY REWRITE AS
SELECT DBMS_MVIEW.PMARKER(p.rowid), p.prod_category, t.calendar_year, COUNT(*),
       SUM(s.amount_sold), SUM(s.quantity_sold),
       COUNT(s.amount_sold), COUNT(s.quantity_sold)
FROM sales s, products p, times t
WHERE s.time_id = t.time_id AND s.prod_id = p.prod_id
GROUP BY DBMS_MVIEW.PMARKER(p.rowid), p.prod_category, t.calendar_year;
prod_yr_sales_mv includes the DBMS_MVIEW.PMARKER function on the products table in its SELECT list. This enables partition change tracking on the products table with significantly less cardinality impact than grouping by the partition key column prod_id. In this example, the desired level of aggregation for prod_yr_sales_mv is to group by products.prod_category. Using the DBMS_MVIEW.PMARKER function, the materialized view cardinality is increased only by a factor of the number of partitions in the products table. This is generally significantly less than the cardinality impact of including the partition key columns. Note that partition change tracking is enabled on the sales table because of the presence of the join dependent expression calendar_year in the SELECT list.
Partial Rewrite
A subsequent INSERT statement adds a new row to the sales_part3 partition of table sales. At this point, because cust_dly_sales_mv has PCT available on table sales using a partition key, Oracle can identify the stale rows in the materialized view cust_dly_sales_mv corresponding to sales_part3 partition (The other rows are unchanged in their freshness state). Query rewrite cannot
identify the fresh portion of materialized views cust_mth_sales_mv and prod_yr_sales_mv because PCT is available on table sales using join dependent expressions. Query rewrite can determine the fresh portion of a materialized view on changes to a detail table only if PCT is available on the detail table using a partition key or partition marker.
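For illustration, an INSERT of the kind described above might look like the following; the values are hypothetical and simply need to fall into the sales_part3 partition:

INSERT INTO sales (prod_id, cust_id, time_id, channel_id, promo_id,
                   quantity_sold, amount_sold)
VALUES (13, 987, TO_DATE('15-NOV-2000', 'DD-MON-YYYY'), 3, 999, 1, 1232.16);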
-- The opening of this CREATE TABLE statement is reconstructed here; the prebuilt
-- container table must share the materialized view's name (part_sales_tab_mv).
CREATE TABLE part_sales_tab_mv (time_id, cust_id, sum_dollar_sales, sum_unit_sales)
PARTITION BY RANGE (time_id)
  (PARTITION month1 VALUES LESS THAN (TO_DATE('31-12-1998', 'DD-MM-YYYY'))
     PCTFREE 0 PCTUSED 99
     STORAGE (INITIAL 64k NEXT 16k PCTINCREASE 0)
     TABLESPACE sf1,
   PARTITION month2 VALUES LESS THAN (TO_DATE('31-12-1999', 'DD-MM-YYYY'))
     PCTFREE 0 PCTUSED 99
     STORAGE (INITIAL 64k NEXT 16k PCTINCREASE 0)
     TABLESPACE sf2,
   PARTITION month3 VALUES LESS THAN (TO_DATE('31-12-2000', 'DD-MM-YYYY'))
     PCTFREE 0 PCTUSED 99
     STORAGE (INITIAL 64k NEXT 16k PCTINCREASE 0)
     TABLESPACE sf3) AS
SELECT s.time_id, s.cust_id, SUM(s.amount_sold) AS sum_dollar_sales,
       SUM(s.quantity_sold) AS sum_unit_sales
FROM sales s GROUP BY s.time_id, s.cust_id;

CREATE MATERIALIZED VIEW part_sales_tab_mv
ON PREBUILT TABLE
ENABLE QUERY REWRITE AS
SELECT s.time_id, s.cust_id, SUM(s.amount_sold) AS sum_dollar_sales,
       SUM(s.quantity_sold) AS sum_unit_sales
FROM sales s GROUP BY s.time_id, s.cust_id;
In this example, the table part_sales_tab_mv has been partitioned over three months and then the materialized view was registered to use the prebuilt table. This materialized view is eligible for query rewrite because the ENABLE QUERY REWRITE clause has been included.
The materialized view is partitioned on the partitioning key column or join dependent expressions of the detail table.
If PCT is enabled using either the partitioning key column or join expressions, the materialized view should be range or list partitioned. PCT refresh is non-atomic.
OLAP Cubes
While data warehouse environments typically view data in the form of a star schema, OLAP environments view data in the form of a hierarchical cube. A hierarchical cube includes the data aggregated along the rollup hierarchy of each of its dimensions and these aggregations are combined across dimensions. It includes the typical set of aggregations needed for business intelligence queries.
Example 9-4 Hierarchical Cube
Consider a sales data set with two dimensions, each of which has a 4-level hierarchy:
- Time, which contains (all times), year, quarter, and month.
- Product, which contains (all products), division, brand, and item.
This means there are 16 aggregate groups in the hierarchical cube. This is because the four levels of time are multiplied by four levels of product to produce the cube. Table 91 shows the four levels of each dimension.
Table 9-1  ROLLUP By Time and Product

ROLLUP By Time          ROLLUP By Product
year, quarter, month    division, brand, item
year, quarter           division, brand
year                    division
all times               all products
Note that as you increase the number of dimensions and levels, the number of groups to calculate increases dramatically. This example involves 16 groups, but if you were to add just two more dimensions with the same number of levels, you would have 4 x 4 x 4 x 4 = 256 different groups. Also, consider that a similar increase in groups occurs if you have multiple hierarchies in your dimensions. For example, the time dimension might have an additional hierarchy of fiscal month rolling up to fiscal quarter and then fiscal year. Handling the explosion of groups has historically been the major challenge in data storage for OLAP systems.

Typical OLAP queries slice and dice different parts of the cube comparing aggregations from one level to aggregations from another level. For instance, a query might find sales of the grocery division for the month of January, 2002 and compare them with total sales of the grocery division for all of 2001.
Hence, the most effective partitioning scheme for these materialized views is to use composite partitioning (range-list on (time, GROUPING_ID) columns). By partitioning the materialized views this way, you enable:
PCT refresh, thereby improving refresh performance. Partition pruning: only relevant aggregate groups will be accessed, thereby greatly reducing the query processing cost.
If you do not want to use PCT refresh, you can simply partition by list on the GROUPING_ID column.
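A minimal sketch of this approach follows; the materialized view name, the partition names, and the choice of rollup columns are hypothetical, and the point is only that each aggregation group lands in a list partition keyed on the GROUPING_ID value:

-- hypothetical hierarchical cube materialized view, list partitioned on gid
CREATE MATERIALIZED VIEW sales_cube_mv
PARTITION BY LIST (gid)
  (PARTITION p_base    VALUES (0),
   PARTITION p_rollups VALUES (DEFAULT))
ENABLE QUERY REWRITE AS
SELECT t.calendar_year, p.prod_category,
       GROUPING_ID(t.calendar_year, p.prod_category) AS gid,
       SUM(s.amount_sold) AS dollars
FROM   sales s, times t, products p
WHERE  s.time_id = t.time_id AND s.prod_id = p.prod_id
GROUP BY ROLLUP(t.calendar_year, p.prod_category);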
Example 9-5 Materialized View Using UNION ALL with Two Join Views

To create a UNION ALL materialized view with two join views, the materialized view logs must have the rowid column and, in the following example, the UNION ALL marker column is produced by the constant expressions 1 marker and 2 marker.
CREATE MATERIALIZED VIEW LOG ON sales WITH ROWID;
CREATE MATERIALIZED VIEW LOG ON customers WITH ROWID;

CREATE MATERIALIZED VIEW unionall_sales_cust_joins_mv
REFRESH FAST ON COMMIT
ENABLE QUERY REWRITE AS
(SELECT c.rowid crid, s.rowid srid, c.cust_id, s.amount_sold, 1 marker
 FROM sales s, customers c
 WHERE s.cust_id = c.cust_id AND c.cust_last_name = 'Smith')
UNION ALL
(SELECT c.rowid crid, s.rowid srid, c.cust_id, s.amount_sold, 2 marker
 FROM sales s, customers c
 WHERE s.cust_id = c.cust_id AND c.cust_last_name = 'Brown');

Example 9-6 Materialized View Using UNION ALL with Joins and Aggregates
The following example shows a UNION ALL of a materialized view with joins and a materialized view with aggregates. A couple of things can be noted in this example. Nulls or constants can be used to ensure that the data types of the corresponding SELECT list columns match. Also the UNION ALL marker column can be a string literal, which is 'Year' umarker, 'Quarter' umarker, or 'Daily' umarker in the following example:
CREATE MATERIALIZED VIEW LOG ON sales WITH ROWID, SEQUENCE (amount_sold, time_id) INCLUDING NEW VALUES; CREATE MATERIALIZED VIEW LOG ON times WITH ROWID, SEQUENCE (time_id, fiscal_year, fiscal_quarter_number, day_number_in_week) INCLUDING NEW VALUES; CREATE MATERIALIZED VIEW unionall_sales_mix_mv REFRESH FAST ON DEMAND AS (SELECT 'Year' umarker, NULL, NULL, t.fiscal_year, SUM(s.amount_sold) amt, COUNT(s.amount_sold), COUNT(*) FROM sales s, times t WHERE s.time_id = t.time_id GROUP BY t.fiscal_year) UNION ALL (SELECT 'Quarter' umarker, NULL, NULL, t.fiscal_quarter_number,
SUM(s.amount_sold) amt, COUNT(s.amount_sold), COUNT(*) FROM sales s, times t WHERE s.time_id = t.time_id and t.fiscal_year = 2001 GROUP BY t.fiscal_quarter_number) UNION ALL (SELECT 'Daily' umarker, s.rowid rid, t.rowid rid2, t.day_number_in_week, s.amount_sold amt, 1, 1 FROM sales s, times t WHERE s.time_id = t.time_id AND t.time_id between '01-Jan-01' AND '01-Dec-31');
By using two materialized views, you can incrementally maintain the materialized view my_groupby_mv. The materialized view my_model_mv is on a much smaller data set because it is built on my_groupby_mv and can be maintained by a complete refresh. Materialized views with models can use complete refresh or PCT refresh only, and are available for partial text query rewrite only. See Chapter 22, "SQL for Modeling" for further details about model calculations.
The state of a materialized view can be checked by querying the data dictionary views USER_MVIEWS or ALL_MVIEWS. The column STALENESS will show one of the values FRESH, STALE, UNUSABLE, UNKNOWN, UNDEFINED, or NEEDS_COMPILE to indicate whether the materialized view can be used. The state is maintained automatically. However, if the staleness of a materialized view is marked as NEEDS_COMPILE, you could issue an ALTER MATERIALIZED VIEW ... COMPILE statement to validate the materialized view and get the correct staleness state.
another schema. Moreover, if you enable query rewrite on a materialized view that references tables outside your schema, you must have the GLOBAL QUERY REWRITE privilege or the QUERY REWRITE object privilege on each table outside your schema. If the materialized view is on a prebuilt container, the creator, if different from the owner, must have the SELECT WITH GRANT privilege on the container table.

If you continue to get a privilege error while trying to create a materialized view and you believe that all the required privileges have been granted, then the problem is most likely due to a privilege not being granted explicitly and trying to inherit the privilege from a role instead. The owner of the materialized view must have explicitly been granted SELECT access to the referenced tables if the tables are in a different schema.

If the materialized view is being created with ON COMMIT REFRESH specified, then the owner of the materialized view requires an additional privilege if any of the tables in the defining query are outside the owner's schema. In that case, the owner requires the ON COMMIT REFRESH system privilege or the ON COMMIT REFRESH object privilege on each table outside the owner's schema.
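As a minimal sketch of explicit grants of this kind (the user name dw_builder is hypothetical), the owner of the detail tables or a DBA might issue:

GRANT GLOBAL QUERY REWRITE TO dw_builder;
GRANT SELECT ON sh.sales TO dw_builder;
GRANT ON COMMIT REFRESH ON sh.sales TO dw_builder;

Because the SELECT privilege is granted directly to dw_builder rather than through a role, it satisfies the explicit-grant requirement described above.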
up the VPD-generated predicate on the request query with the predicate you directly specify when you create the materialized view.
You can alter a materialized view in the following ways:

- Change its refresh option (FAST/FORCE/COMPLETE/NEVER).
- Change its refresh mode (ON COMMIT/ON DEMAND).
- Recompile it.
- Enable or disable its use for query rewrite.
- Consider it fresh.
- Perform partition maintenance operations.
All other changes are achieved by dropping and then re-creating the materialized view. The COMPILE clause of the ALTER MATERIALIZED VIEW statement can be used when the materialized view has been invalidated. This compile process is quick, and allows the materialized view to be used by query rewrite again.
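For illustration, statements of the following kind, using the cust_dly_sales_mv materialized view from the earlier examples, perform some of these alterations:

ALTER MATERIALIZED VIEW cust_dly_sales_mv REFRESH COMPLETE ON DEMAND;
ALTER MATERIALIZED VIEW cust_dly_sales_mv ENABLE QUERY REWRITE;
ALTER MATERIALIZED VIEW cust_dly_sales_mv COMPILE;
ALTER MATERIALIZED VIEW cust_dly_sales_mv CONSIDER FRESH;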
See Also: Oracle Database SQL Reference for further information about the ALTER MATERIALIZED VIEW statement and "Invalidating Materialized Views" on page 9-14
10
Dimensions
The following sections will help you create and manage a data warehouse:
- What are Dimensions?
- Creating Dimensions
- Viewing Dimensions
- Using Dimensions with Constraints
- Validating Dimensions
- Altering Dimensions
- Deleting Dimensions
What is the effect of promoting one product on the sale of a related product that is not promoted? What are the sales of a product before and after a promotion? How does a promotion affect the various distribution channels?
The data in the retailer's data warehouse system has two important components: dimensions and facts. The dimensions are products, customers, promotions, channels, and time. One approach for identifying your dimensions is to review your reference tables, such as a product table that contains everything about a product, or a promotion table containing all information about promotions. The facts are sales (units sold) and profits. A data warehouse contains facts about the sales of each product on a daily basis.

A typical relational implementation for such a data warehouse is a star schema. The fact information is stored in what is called a fact table, whereas the dimensional information is stored in dimension tables. In our example, each sales transaction record is uniquely defined for each customer, for each product, for each sales channel, for each promotion, and for each day (time).
In Oracle Database, the dimensional information itself is stored in a dimension table. In addition, the database object dimension helps to organize and group dimensional information into hierarchies. This represents natural 1:n relationships between columns or column groups (the levels of a hierarchy) that cannot be represented with constraint conditions. Going up a level in the hierarchy is called rolling up the data and going down a level in the hierarchy is called drilling down the data. In the retailer example:
Within the time dimension, months roll up to quarters, quarters roll up to years, and years roll up to all years.
Within the product dimension, products roll up to subcategories, subcategories roll up to categories, and categories roll up to all products. Within the customer dimension, customers roll up to city. Then cities roll up to state. Then states roll up to country. Then countries roll up to subregion. Finally, subregions roll up to region, as shown in Figure 10-1.
Figure 10-1 Sample Rollup for a Customer Dimension

customer -> city -> state -> country -> subregion -> region
Data analysis typically starts at higher levels in the dimensional hierarchy and gradually drills down if the situation warrants such analysis. Dimensions do not have to be defined. However, if your application uses dimensional modeling, it is worth spending time creating them as it can yield significant benefits, because they help query rewrite perform more complex types of rewrite. Dimensions are also beneficial to certain types of materialized view refresh operations and with the SQLAccess Advisor. They are only mandatory if you use the SQLAccess Advisor (a GUI tool for materialized view and index management) without a workload to recommend which materialized views and indexes to create, drop, or retain.
See Also: Chapter 18, "Query Rewrite" for further details regarding query rewrite and Chapter 17, "SQLAccess Advisor" for further details regarding the SQLAccess Advisor

In spite of the benefits of dimensions, you must not create dimensions in any schema that does not fully satisfy the dimensional relationships described in this chapter. Incorrect results can be returned from queries otherwise.
Creating Dimensions
Before you can create a dimension object, the dimension tables must exist in the database possibly containing the dimension data. For example, if you create a customer dimension, one or more tables must exist that contain the city, state, and country information. In a star schema data warehouse, these dimension tables already exist. It is therefore a simple task to identify which ones will be used. Now you can draw the hierarchies of a dimension as shown in Figure 101. For example, city is a child of state (because you can aggregate city-level data up to state), and country. This hierarchical information will be stored in the database object dimension. In the case of normalized or partially normalized dimension representation (a dimension that is stored in more than one table), identify how these tables are joined. Note whether the joins between the dimension tables can guarantee that each child-side row joins with one and only one parent-side row. In the case of denormalized dimensions, determine whether the child-side columns uniquely determine the parent-side (or attribute) columns. If you use constraints to represent these relationships, they can be enabled with the NOVALIDATE and RELY clauses if the relationships represented by the constraints are guaranteed by other means. You create a dimension using either the CREATE DIMENSION statement or the Dimension Wizard in Oracle Enterprise Manager. Within the CREATE DIMENSION statement, use the LEVEL clause to identify the names of the dimension levels.
See Also: Oracle Database SQL Reference for a complete description of the CREATE DIMENSION statement

This customer dimension contains a single hierarchy with a geographical rollup, with arrows drawn from the child level to the parent level, as shown in Figure 10-1 on page 10-3.

Each arrow in this graph indicates that for any child there is one and only one parent. For example, each city must be contained in exactly one state and each state
must be contained in exactly one country. States that belong to more than one country, or that belong to no country, violate hierarchical integrity. Hierarchical integrity is necessary for the correct operation of management functions for materialized views that include aggregates. For example, you can declare a dimension products_dim, which contains levels product, subcategory, and category:
CREATE DIMENSION products_dim
  LEVEL product     IS (products.prod_id)
  LEVEL subcategory IS (products.prod_subcategory)
  LEVEL category    IS (products.prod_category)
  ...
Each level in the dimension must correspond to one or more columns in a table in the database. Thus, level product is identified by the column prod_id in the products table and level subcategory is identified by a column called prod_subcategory in the same table. In this example, the database tables are denormalized and all the columns exist in the same table. However, this is not a prerequisite for creating dimensions. "Using Normalized Dimension Tables" on page 10-10 shows how to create a dimension customers_dim that has a normalized schema design using the JOIN KEY clause.

The next step is to declare the relationship between the levels with the HIERARCHY statement and give that hierarchy a name. A hierarchical relationship is a functional dependency from one level of a hierarchy to the next level in the hierarchy. Using the level names defined previously, the CHILD OF relationship denotes that each child's level value is associated with one and only one parent level value. The following statement declares a hierarchy prod_rollup and defines the relationship between products, subcategory, and category.
HIERARCHY prod_rollup (product CHILD OF subcategory CHILD OF category)
In addition to the 1:n hierarchical relationships, dimensions also include 1:1 attribute relationships between the hierarchy levels and their dependent, determined dimension attributes. For example, the dimension times_dim, as defined in Oracle Database Sample Schemas, has columns fiscal_month_desc, fiscal_month_name, and days_in_fiscal_month. Their relationship is defined as follows:
LEVEL fis_month ... IS TIMES.FISCAL_MONTH_DESC
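Based on the relationship described next, the corresponding clause would take roughly this form (a sketch, not the statement from the sample schema):

ATTRIBUTE fis_month DETERMINES
  (fiscal_month_name, days_in_fiscal_month)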
The ATTRIBUTE ... DETERMINES clause relates fis_month to fiscal_month_name and days_in_fiscal_month. Note that this is a unidirectional determination. It is only guaranteed that, for a specific fiscal_month, for example, 1999-11, you will find exactly one matching value for fiscal_month_name, for example, November, and for days_in_fiscal_month, for example, 28. You cannot determine a specific fiscal_month_desc based on the fiscal_month_name, which is November for every fiscal year.

In this example, suppose a query were issued that queried by fiscal_month_name instead of fiscal_month_desc. Because this 1:1 relationship exists between the attribute and the level, an already aggregated materialized view containing fiscal_month_desc can be joined back to the dimension information and used to identify the data.
See Also: Chapter 18, "Query Rewrite" for further details of using dimensional information
Alternatively, the extended_attribute_clause could have been used instead of the attribute_clause, as shown in the following example:
CREATE DIMENSION products_dim LEVEL product IS (products.prod_id) LEVEL subcategory IS (products.prod_subcategory)
LEVEL category IS (products.prod_category) HIERARCHY prod_rollup ( product CHILD OF subcategory CHILD OF category ) ATTRIBUTE product_info LEVEL product DETERMINES (products.prod_name, products.prod_desc, prod_weight_class, prod_unit_of_measure, prod_pack_size, prod_status, prod_list_price, prod_min_price) ATTRIBUTE subcategory DETERMINES (prod_subcategory, prod_subcategory_desc) ATTRIBUTE category DETERMINES (prod_category, prod_category_desc);
The design, creation, and maintenance of dimensions is part of the design, creation, and maintenance of your data warehouse schema. Once the dimension has been created, check that it meets these requirements:
- There must be a 1:n relationship between a parent and children. A parent can have one or more children, but a child can have only one parent.
- There must be a 1:1 attribute relationship between hierarchy levels and their dependent dimension attributes. For example, if there is a column fiscal_month_desc, then a possible attribute relationship would be fiscal_month_desc to fiscal_month_name.
- If the columns of a parent level and child level are in different relations, then the connection between them also requires a 1:n join relationship. Each row of the child table must join with one and only one row of the parent table. This relationship is stronger than referential integrity alone, because it requires that the child join key must be non-null, that referential integrity must be maintained from the child join key to the parent join key, and that the parent join key must be unique.
- You must ensure (using database constraints if necessary) that the columns of each hierarchy level are non-null and that hierarchical integrity is maintained.
- The hierarchies of a dimension can overlap or be disconnected from each other. However, the columns of a hierarchy level cannot be associated with more than one dimension.
- Join relationships that form cycles in the dimension graph are not supported. For example, a hierarchy level cannot be joined to itself either directly or indirectly.
declarative. The previously discussed relationships are not enforced with the creation of a dimension object. You should validate any dimension definition with the DBMS_DIMENSION.VALIDATE_DIMENSION procedure, as discussed in "Validating Dimensions" on page 10-12.
Multiple Hierarchies
A single dimension definition can contain multiple hierarchies. Suppose our retailer wants to track the sales of certain items over time. The first step is to define the time dimension over which sales will be tracked. Figure 10-2 illustrates a dimension times_dim with two time hierarchies.
Figure 10-2 times_dim Dimension with Two Time Hierarchies

Calendar rollup: day -> month -> quarter -> year
Fiscal rollup:   day -> fis_week -> fis_month -> fis_quarter -> fis_year
From the illustration, you can construct the hierarchy of the denormalized times_dim dimension's CREATE DIMENSION statement as follows. The complete CREATE DIMENSION statement as well as the CREATE TABLE statement are shown in Oracle Database Sample Schemas.
CREATE DIMENSION times_dim LEVEL day IS times.time_id LEVEL month IS times.calendar_month_desc LEVEL quarter IS times.calendar_quarter_desc LEVEL year IS times.calendar_year LEVEL fis_week IS times.week_ending_day LEVEL fis_month IS times.fiscal_month_desc LEVEL fis_quarter IS times.fiscal_quarter_desc LEVEL fis_year IS times.fiscal_year HIERARCHY cal_rollup ( day CHILD OF
month CHILD OF quarter CHILD OF year ) HIERARCHY fis_rollup ( day CHILD OF fis_week CHILD OF fis_month CHILD OF fis_quarter CHILD OF fis_year ) <attribute determination clauses>;
Viewing Dimensions
Dimensions can be viewed through one of two methods:
LEVEL CHANNEL_CLASS IS SH.CHANNELS.CHANNEL_CLASS HIERARCHY CHANNEL_ROLLUP ( CHANNEL CHILD OF CHANNEL_CLASS) ATTRIBUTE CHANNEL LEVEL CHANNEL DETERMINES SH.CHANNELS.CHANNEL_DESC ATTRIBUTE CHANNEL_CLASS LEVEL CHANNEL_CLASS DETERMINES SH.CHANNELS.CHANNEL_CLASS
Primary and foreign keys should be implemented also. Referential integrity constraints and NOT NULL constraints on the fact tables provide information that query rewrite can use to extend the usefulness of materialized views. In addition, you should use the RELY clause to inform query rewrite that it can rely upon the constraints being correct as follows:
ALTER TABLE time MODIFY CONSTRAINT pk_time RELY;
This information is also used for query rewrite. See Chapter 18, "Query Rewrite" for more information.
Validating Dimensions
The information of a dimension object is declarative only and not enforced by the database. If the relationships described by the dimensions are incorrect, incorrect results could occur. Therefore, you should verify the relationships specified by CREATE DIMENSION using the DBMS_DIMENSION.VALIDATE_DIMENSION procedure periodically.
The VALIDATE_DIMENSION procedure takes the following parameters:

- The dimension owner and name.
- A flag, set to TRUE to check only the new rows for tables of this dimension.
- A flag, set to TRUE to verify that all columns are not null.
- A user-supplied unique identifier to identify the result of each run of the procedure.
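A sketch of a typical invocation, assuming the dimension SH.TIME_FN used in the sample output below and passing the parameters positionally in the order just listed, is:

EXECUTE DBMS_DIMENSION.VALIDATE_DIMENSION('SH.TIME_FN', FALSE, TRUE, 'my 1st example');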
Before running the VALIDATE_DIMENSION procedure, you need to create a local table, DIMENSION_EXCEPTIONS, by running the provided script utldim.sql. If the VALIDATE_DIMENSION procedure encounters any errors, they are placed in this table. Querying this table will identify the exceptions that were found. The following illustrates a sample:
SELECT * FROM dimension_exceptions WHERE statement_id = 'my 1st example';
STATEMENT_ID    OWNER  TABLE_NAME  DIMENSION_NAME  RELATIONSHIP  BAD_ROWID
--------------  -----  ----------  --------------  ------------  ------------------
my 1st example  SH     MONTH       TIME_FN         FOREIGN KEY   AAAAuwAAJAAAARwAAA
However, rather than query this table, it may be better to query the rowid of the invalid row to retrieve the actual row that has violated the constraint. In this example, the dimension TIME_FN is checking a table called month. It has found a row that violates the constraints. Using the rowid, you can see exactly which row in the month table is causing the problem, as in the following:
SELECT * FROM month
WHERE rowid IN (SELECT bad_rowid
                FROM dimension_exceptions
                WHERE statement_id = 'my 1st example');

MONTH   QUARTER  FISCAL_QTR  YEAR  FULL_MONTH_NAME  MONTH_NUMB
------  -------  ----------  ----  ---------------  ----------
199903  19981    19981       1998  March            3
Altering Dimensions
You can modify a dimension using the ALTER DIMENSION statement. You can add or drop a level, hierarchy, or attribute from the dimension using this command. Referring to the time dimension in Figure 10-2 on page 10-9, you can remove the attribute fis_year, drop the hierarchy fis_rollup, or remove the level fis_year. In addition, you can add a new level called f_year, as in the following:
ALTER DIMENSION times_dim DROP ATTRIBUTE fis_year;
ALTER DIMENSION times_dim DROP HIERARCHY fis_rollup;
ALTER DIMENSION times_dim DROP LEVEL fis_year;
ALTER DIMENSION times_dim ADD LEVEL f_year IS times.fiscal_year;
If you used the extended_attribute_clause when creating the dimension, you can drop one attribute column without dropping all attribute columns. This is illustrated in "Dropping and Creating Attributes with Columns" on page 10-8, which shows the following statement:
ALTER DIMENSION product_dim DROP ATTRIBUTE size LEVEL prod_type COLUMN Prod_TypeSize;
If you try to remove anything with further dependencies inside the dimension, Oracle Database rejects the altering of the dimension. A dimension becomes invalid if you change any schema object that the dimension is referencing. For example, if the table on which the dimension is dened is altered, the dimension becomes invalid. To check the status of a dimension, view the contents of the column invalid in the ALL_DIMENSIONS data dictionary view. To revalidate the dimension, use the COMPILE option as follows:
ALTER DIMENSION times_dim COMPILE;
Dimensions can also be modied or deleted using Oracle Enterprise Manager. See Oracle Enterprise Manager Administrator's Guide for more information.
Deleting Dimensions
A dimension is removed using the DROP DIMENSION statement. For example:
DROP DIMENSION times_dim;
Part IV
Managing the Data Warehouse Environment
This section deals with the tasks for managing a data warehouse. It contains the following chapters:
- Chapter 11, "Overview of Extraction, Transformation, and Loading"
- Chapter 12, "Extraction in Data Warehouses"
- Chapter 13, "Transportation in Data Warehouses"
- Chapter 14, "Loading and Transformation"
- Chapter 15, "Maintaining the Data Warehouse"
- Chapter 16, "Change Data Capture"
- Chapter 17, "SQLAccess Advisor"
11
Overview of Extraction, Transformation, and Loading
This chapter discusses the process of extracting, transporting, transforming, and loading data in a data warehousing environment, and includes the following:
example, a SQL statement which directly accesses a remote target through a gateway can concatenate two columns as part of the SELECT statement. The emphasis in many of the examples in this section is scalability. Many long-time users of Oracle Database are experts in programming complex data transformation logic using PL/SQL. These chapters suggest alternatives for many such data manipulation operations, with a particular emphasis on implementations that take advantage of Oracle's new SQL functionality, especially for ETL and the parallel query infrastructure.
12
Extraction in Data Warehouses
This chapter discusses extraction, which is the process of taking data from an operational system and moving it to your data warehouse or staging system. The chapter discusses:
- Overview of Extraction in Data Warehouses
- Introduction to Extraction Methods in Data Warehouses
- Data Warehousing Extraction Examples
- Which extraction method do I choose? This influences the source system, the transportation process, and the time needed for refreshing the warehouse.
- How do I provide the extracted data for further processing? This influences the transportation method, and the need for cleaning and transforming the data.
an incremental extraction of data due to the performance or the increased workload of these systems. Sometimes even the customer is not allowed to add anything to an out-of-the-box application system. The estimated amount of the data to be extracted and the stage in the ETL process (initial load or maintenance of data) may also impact the decision of how to extract, from a logical and a physical perspective. Basically, you have to decide how to extract data logically and physically.
Full Extraction
The data is extracted completely from the source system. Because this extraction reflects all the data currently available on the source system, there is no need to keep track of changes to the data source since the last successful extraction. The source data will be provided as-is and no additional logical information (for example, timestamps) is necessary on the source site. An example of a full extraction may be an export file of a distinct table or a remote SQL statement scanning the complete source table.
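As a minimal sketch of the remote SQL variant (the database link source_db and the staging table name are hypothetical), a full extraction of a table could be as simple as:

-- full extraction of one source table over a database link into a staging table
CREATE TABLE countries_full_extract AS
  SELECT * FROM countries@source_db;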
Incremental Extraction
At a specific point in time, only the data that has changed since a well-defined event back in history will be extracted. This event may be the last time of extraction or a more complex business event like the last booking day of a fiscal period. To identify this delta change, it must be possible to identify all the changed information since this specific time event. This information can be provided either by the source data itself, such as an application column reflecting the last-changed timestamp, or by a change table where an appropriate additional mechanism keeps track of the changes besides the originating transactions. In most cases, using the latter method means adding extraction logic to the source system. Many data warehouses do not use any change-capture techniques as part of the extraction process. Instead, entire tables from the source systems are extracted to the data warehouse or staging area, and these tables are compared with a previous extract from the source system to identify the changed data. This approach may not
have significant impact on the source systems, but it clearly can place a considerable burden on the data warehouse processes, particularly if the data volumes are large. Oracle's Change Data Capture mechanism can extract and maintain such delta information. See Chapter 16, "Change Data Capture" for further details about the Change Data Capture framework.
Online Extraction
The data is extracted directly from the source system itself. The extraction process can connect directly to the source system to access the source tables themselves or to an intermediate system that stores the data in a preconfigured manner (for example, snapshot logs or change tables). Note that the intermediate system is not necessarily physically different from the source system. With online extractions, you need to consider whether the distributed transactions are using original source objects or prepared source objects.
Offline Extraction
The data is not extracted directly from the source system but is staged explicitly outside the original source system. The data already has an existing structure (for example, redo logs, archive logs or transportable tablespaces) or was created by an extraction routine. You should consider the following structures:
Flat files. Data in a defined, generic format. Additional information about the source object is necessary for further processing.
Dump files. Oracle-specific format. Information about the containing objects may or may not be included, depending on the chosen utility.
Transportable tablespaces. A powerful way to extract and move large volumes of data between Oracle databases. A more detailed example of using this feature to extract and transport data is provided in Chapter 13, "Transportation in Data Warehouses". Oracle Corporation recommends that you use transportable tablespaces whenever possible, because they can provide considerable advantages in performance and manageability over other extraction techniques. See Oracle Database Utilities for more information on using export/import.
These techniques are based upon the characteristics of the source systems, or may require modifications to the source systems. Thus, each of these techniques must be carefully evaluated by the owners of the source system prior to implementation. Each of these techniques can work in conjunction with the data extraction technique discussed previously. For example, timestamps can be used whether the data is being unloaded to a file or accessed through a distributed query. See Chapter 16, "Change Data Capture" for further details.
Timestamps
The tables in some operational systems have timestamp columns. The timestamp specifies the time and date that a given row was last modified. If the tables in an operational system have columns containing timestamps, then the latest data can easily be identified using the timestamp columns. For example, the following query might be useful for extracting today's data from an orders table:
SELECT * FROM orders
WHERE TRUNC(CAST(order_date AS DATE)) = TRUNC(SYSDATE);
If the timestamp information is not available in an operational source system, you will not always be able to modify the system to include timestamps. Such modification would require, first, modifying the operational system's tables to include a new timestamp column and then creating a trigger to update the timestamp column following every operation that modifies a given row.
Partitioning
Some source systems might use range partitioning, such that the source tables are partitioned along a date key, which allows for easy identification of new data. For example, if you are extracting from an orders table, and the orders table is partitioned by week, then it is easy to identify the current week's data.
Triggers
Triggers can be created in operational systems to keep track of recently updated records. They can then be used in conjunction with timestamp columns to identify the exact time and date when a given row was last modied. You do this by creating a trigger on each source table that requires change data capture. Following each DML statement that is executed on the source table, this trigger updates the timestamp column with the current time. Thus, the timestamp column provides the exact time and date when a given row was last modied.
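As an illustration only (the orders table and its last_modified column are assumptions for this sketch, not part of the examples in this guide), such a trigger could look like the following:

CREATE OR REPLACE TRIGGER orders_mod_trg
BEFORE INSERT OR UPDATE ON orders
FOR EACH ROW
BEGIN
  -- stamp the row with the time of this modification
  :NEW.last_modified := SYSDATE;
END;
/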
A similar internalized trigger-based technique is used for Oracle materialized view logs. These logs are used by materialized views to identify changed data, and these logs are accessible to end users. However, the format of the materialized view logs is not documented and might change over time. Materialized view logs rely on triggers, and they offer the advantage that the creation and maintenance of this change-data system is largely managed by the database. Nevertheless, if you want to use a trigger-based mechanism, Oracle Corporation recommends synchronous Change Data Capture, because CDC provides an externalized interface for accessing the change information and a framework for maintaining the distribution of this information to various clients. Trigger-based techniques might affect performance on the source systems, and this impact should be carefully considered prior to implementation on a production source system.
extraction or the results of joining multiple tables together. Different extraction techniques vary in their capabilities to support these two scenarios. When the source system is an Oracle database, several alternatives are available for extracting data into les:
Extracting into Flat Files Using SQL*Plus
Extracting into Flat Files Using OCI or Pro*C Programs
Exporting into Export Files Using the Export Utility
Extracting into Export Files Using External Tables
The exact format of the output file can be specified using SQL*Plus system variables. This extraction technique offers the advantage of storing the result in a customized format. Note that, using the external table data pump unload facility, you can also extract the result of an arbitrary SQL operation. The previous example extracts the results of a join. This extraction technique can be parallelized by initiating multiple, concurrent SQL*Plus sessions, each session running a separate query representing a different portion of the data to be extracted. For example, suppose that you wish to extract data from an orders table, and that the orders table has been range partitioned by month, with partitions orders_jan1998, orders_feb1998, and so on. To extract a single year of data from the orders table, you could initiate 12 concurrent SQL*Plus sessions, each extracting a single partition. The SQL script for one such session could be:
SPOOL order_jan.dat
SELECT * FROM orders PARTITION (orders_jan1998);
SPOOL OFF
These 12 SQL*Plus processes would concurrently spool data to 12 separate files. You can then concatenate them if necessary (using operating system utilities) following the extraction. If you are planning to use SQL*Loader for loading into the target, these 12 files can be used as is for a parallel load with 12 SQL*Loader sessions. See Chapter 13, "Transportation in Data Warehouses" for an example. Even if the orders table is not partitioned, it is still possible to parallelize the extraction either based on logical or physical criteria. The logical method is based on logical ranges of column values, for example:
SELECT ... WHERE order_date BETWEEN TO_DATE('01-JAN-99') AND TO_DATE('31-JAN-99');
The physical method is based on ranges of physical storage. By viewing the data dictionary, it is possible to identify the Oracle Database data blocks that make up the orders table. Using this information, you could then derive a set of rowid-range queries for extracting data from the orders table:
SELECT * FROM orders WHERE rowid BETWEEN value1 and value2;
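As a sketch only (the owner SH is an assumption), the rowid boundaries could be derived from the extent information in the data dictionary using the DBMS_ROWID package; each row returned describes the rowid range of one extent of the orders table:

SELECT DBMS_ROWID.ROWID_CREATE(1, o.data_object_id, e.relative_fno,
                               e.block_id, 0)                    AS start_rowid,
       DBMS_ROWID.ROWID_CREATE(1, o.data_object_id, e.relative_fno,
                               e.block_id + e.blocks - 1, 9999)  AS end_rowid
FROM   dba_extents e, dba_objects o
WHERE  e.segment_name = 'ORDERS'
AND    e.owner        = 'SH'
AND    o.owner        = e.owner
AND    o.object_name  = e.segment_name;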
Parallelizing the extraction of complex SQL queries is sometimes possible, although the process of breaking a single complex query into multiple components can be challenging. In particular, the coordination of independent processes to guarantee a globally consistent view can be difficult. Unlike the SQL*Plus approach, the new external table data pump unload functionality provides transparent parallel capabilities. Note that all parallel techniques can use considerably more CPU and I/O resources on the source system, and the impact on the source system should be evaluated before parallelizing any extraction technique.
columns. It is also helpful to know the extraction format, which might be the separator between distinct columns.
The export files contain metadata as well as data. An export file contains not only the raw data of a table, but also information on how to re-create the table, potentially including any indexes, constraints, grants, and other attributes associated with that table. A single export file may contain a subset of a single object, many database objects, or even an entire schema. Export cannot be directly used to export the results of a complex SQL query. Export can be used only to extract subsets of distinct database objects. The output of the Export utility must be processed using the Import utility.
Oracle provides the original Export and Import utilities for backward compatibility and the Data Pump export/import infrastructure for high-performance, scalable, and parallel extraction. See Oracle Database Utilities for further details.
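As an illustration only (the directory object dpump_dir and the credentials are assumptions), a Data Pump export of a single table could be invoked from the command line as follows:

expdp sh/sh DIRECTORY=dpump_dir DUMPFILE=customers.dmp LOGFILE=customers.log TABLES=customers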
SELECT c.*, co.country_name, co.country_subregion, co.country_region FROM customers c, countries co where co.country_id=c.country_id;
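The join query shown above forms, in the full example, the AS SELECT portion of an external table data pump unload. A minimal sketch of that statement (the table name extract_cust matches the DBMS_METADATA call shown later; the directory object def_dir and the file names are assumptions) could be:

CREATE TABLE extract_cust
ORGANIZATION EXTERNAL
(TYPE ORACLE_DATAPUMP
 DEFAULT DIRECTORY def_dir
 ACCESS PARAMETERS (NOLOGFILE)
 LOCATION ('extract_cust1.exp', 'extract_cust2.exp'))
PARALLEL 2
AS
SELECT c.*, co.country_name, co.country_subregion, co.country_region
FROM customers c, countries co
WHERE co.country_id = c.country_id;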
The total number of extraction files specified limits the maximum degree of parallelism for the write operation. Note that the parallelizing of the extraction does not automatically parallelize the SELECT portion of the statement. Unlike using any kind of export/import, the metadata for the external table is not part of the created files when using the external table data pump unload. To extract the appropriate metadata for the external table, use the DBMS_METADATA package, as illustrated in the following statement:
SET LONG 2000 SELECT DBMS_METADATA.GET_DDL('TABLE','EXTRACT_CUST') FROM DUAL;
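The distributed-query statement discussed next is not reproduced in this excerpt. A minimal sketch, assuming a database link named source_db pointing to the source system, could be:

CREATE TABLE country_city AS
SELECT DISTINCT co.country_name, cu.cust_city
FROM   countries@source_db co, customers@source_db cu
WHERE  co.country_id = cu.country_id;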
This statement creates a local table in a data mart, country_city, and populates it with data from the countries and customers tables on the source system. This technique is ideal for moving small volumes of data. However, the data is transported from the source system to the data warehouse through a single Oracle Net connection. Thus, the scalability of this technique is limited. For larger data
volumes, file-based data extraction and transportation techniques are often more scalable and thus more appropriate. See Oracle Database Heterogeneous Connectivity Administrator's Guide and Oracle Database Concepts for more information on distributed queries.
13
Transportation in Data Warehouses
The following topics provide information about transporting data into a data warehouse:
A source system to a staging database or a data warehouse database
A staging database to a data warehouse
A data warehouse to a data mart
Transportation is often one of the simpler portions of the ETL process, and can be integrated with other portions of the process. For example, as shown in Chapter 12, "Extraction in Data Warehouses", distributed query technology provides a mechanism for both extracting and transporting data.
Transportation Using Flat Files
Transportation Through Distributed Operations
Transportation Using Transportable Tablespaces
systems, thus providing both extraction and transformation in a single step. Depending on the tolerable impact on time and system resources, these mechanisms can be well suited for both extraction and transformation. As opposed to flat file transportation, the success or failure of the transportation is recognized immediately with the result of the distributed query or transaction. See Chapter 12, "Extraction in Data Warehouses" for further information.
hold a copy of the current month's data. Using the CREATE TABLE ... AS SELECT statement, the current month's data can be efficiently copied to this tablespace:
CREATE TABLE temp_jan_sales NOLOGGING TABLESPACE ts_temp_sales
AS SELECT * FROM sales
WHERE time_id BETWEEN '31-DEC-1999' AND '01-FEB-2000';
A tablespace cannot be transported unless there are no active transactions modifying the tablespace. Setting the tablespace to read-only enforces this. The tablespace ts_temp_sales may be a tablespace that has been especially created to temporarily store data for use by the transportable tablespace features. Following "Copy the Datafiles and Export File to the Target System", this tablespace can be set to read/write, and, if desired, the table temp_jan_sales can be dropped, or the tablespace can be reused for other transportations or for other purposes.

In a given transportable tablespace operation, all of the objects in a given tablespace are transported. Although only one table is being transported in this example, the tablespace ts_temp_sales could contain multiple tables. For example, perhaps the data mart is refreshed not only with the new month's worth of sales transactions, but also with a new copy of the customer table. Both of these tables could be transported in the same tablespace. Moreover, this tablespace could also contain other database objects such as indexes, which would also be transported. Additionally, in a given transportable-tablespace operation, multiple tablespaces can be transported at the same time. This makes it easier to move very large volumes of data between databases. Note, however, that the transportable tablespace feature can only transport a set of tablespaces which contain a complete set of database objects without dependencies on other tablespaces. For example, an index cannot be transported without its table, nor can a partition be transported without the rest of the table. You can use the DBMS_TTS package to check that a tablespace is transportable.
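For example, the tablespace can be made read-only and checked for self-containment as follows (a sketch; any violations are listed in the TRANSPORT_SET_VIOLATIONS view):

ALTER TABLESPACE ts_temp_sales READ ONLY;

EXECUTE DBMS_TTS.TRANSPORT_SET_CHECK('ts_temp_sales', TRUE);
SELECT * FROM TRANSPORT_SET_VIOLATIONS;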
See Also: PL/SQL Packages and Types Reference for detailed information about the DBMS_TTS package

In this step, we have copied the January sales data into a separate tablespace; however, in some cases, it may be possible to leverage the transportable tablespace feature without even moving data to a separate tablespace. If the sales table has
been partitioned by month in the data warehouse and if each partition is in its own tablespace, then it may be possible to directly transport the tablespace containing the January data. Suppose the January partition, sales_jan2000, is located in the tablespace ts_sales_jan2000. Then the tablespace ts_sales_jan2000 could potentially be transported, rather than creating a temporary copy of the January sales data in ts_temp_sales. However, the same conditions must be satisfied in order to transport the tablespace ts_sales_jan2000 as are required for the specially created tablespace. First, this tablespace must be set to READ ONLY. Second, because a single partition of a partitioned table cannot be transported without the remainder of the partitioned table also being transported, it is necessary to exchange the January partition into a separate table (using the ALTER TABLE statement) to transport the January data. The EXCHANGE operation is very quick, but the January data will no longer be a part of the underlying sales table, and thus may be unavailable to users until this data is exchanged back into the sales table after the export of the metadata. The January data can be exchanged back into the sales table after you complete step 3.

Step 2 Export the Metadata

The Export utility is used to export the metadata describing the objects contained in the transported tablespace. For our example scenario, the Export command could be:
EXP TRANSPORT_TABLESPACE=y TABLESPACES=ts_temp_sales FILE=jan_sales.dmp
This operation will generate an export file, jan_sales.dmp. The export file will be small, because it contains only metadata. In this case, the export file will contain information describing the table temp_jan_sales, such as the column names, column datatypes, and all other information that the target Oracle database will need in order to access the objects in ts_temp_sales.

Step 3 Copy the Datafiles and Export File to the Target System

Copy the datafiles that make up ts_temp_sales, as well as the export file jan_sales.dmp, to the data mart platform, using any transportation mechanism for flat files. Once the datafiles have been copied, the tablespace ts_temp_sales can be set to READ WRITE mode if desired.

Step 4 Import the Metadata

Once the files have been copied to the data mart, the metadata should be imported into the data mart:
IMP TRANSPORT_TABLESPACE=y DATAFILES='/db/tempjan.f'
TABLESPACES=ts_temp_sales FILE=jan_sales.dmp
At this point, the tablespace ts_temp_sales and the table temp_sales_jan are accessible in the data mart. You can incorporate this new data into the data mart's tables. You can insert the data from the temp_sales_jan table into the data mart's sales table in one of two ways:
INSERT /*+ APPEND */ INTO sales SELECT * FROM temp_sales_jan;
Following this operation, you can delete the temp_sales_jan table (and even the entire ts_temp_sales tablespace). Alternatively, if the data mart's sales table is partitioned by month, then the new transported tablespace and the temp_sales_jan table can become a permanent part of the data mart. The temp_sales_jan table can become a partition of the data mart's sales table:
ALTER TABLE sales ADD PARTITION sales_00jan VALUES
  LESS THAN (TO_DATE('01-feb-2000','dd-mon-yyyy'));
ALTER TABLE sales EXCHANGE PARTITION sales_00jan
  WITH TABLE temp_sales_jan INCLUDING INDEXES WITH VALIDATION;
14
Loading and Transformation
This chapter helps you create and manage a data warehouse, and discusses:
Overview of Loading and Transformation in Data Warehouses
Loading Mechanisms
Transformation Mechanisms
Loading and Transformation Scenarios
Transformation Flow
From an architectural perspective, you can transform your data in two ways:
Multistage Data Transformation
Pipelined Data Transformation
Figure 14-1 (multistage transformation through the staging tables new_sales_step1, new_sales_step2, and new_sales_step3)
When using Oracle Database as a transformation engine, a common strategy is to implement each transformation as a separate SQL operation and to create a separate, temporary staging table (such as the tables new_sales_step1 and new_sales_step2 in Figure 14-1) to store the incremental results for each step. This load-then-transform strategy also provides a natural checkpointing scheme for the entire transformation process, which enables the process to be more easily monitored and restarted. However, a disadvantage to multistaging is that the space and time requirements increase. It may also be possible to combine many simple logical transformations into a single SQL statement or single PL/SQL procedure. Doing so may provide better performance than performing each step independently, but it may also introduce difficulties in modifying, adding, or dropping individual transformations, as well as recovering from failed transformations.
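As a sketch of one such staging step (the column names below are illustrative and not part of the sample schema), a single transformation in such a chain could be written as a CREATE TABLE ... AS SELECT operation:

CREATE TABLE new_sales_step2 NOLOGGING PARALLEL AS
SELECT s.sales_transaction_id,
       s.product_id,
       UPPER(s.customer_name) AS customer_name,   -- example cleansing step
       s.sales_amount
FROM   new_sales_step1 s;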
The new functionality renders some of the formerly necessary process steps obsolete, while others can be remodeled to enhance the data flow and the data transformation to become more scalable and non-interruptive. The task shifts from a serial transform-then-load process (with most of the tasks done outside the database) or load-then-transform process to an enhanced transform-while-loading. Oracle offers a wide variety of new capabilities to address all the issues and tasks relevant in an ETL scenario. It is important to understand that the database offers toolkit functionality rather than trying to address a one-size-fits-all solution. The underlying database has to enable the most appropriate ETL process flow for a specific customer need, and not dictate or constrain it from a technical perspective. Figure 14-2 illustrates the new functionality, which is discussed throughout later sections.
Figure 14-2 Pipelined Data Transformation (flat files are read through an external table, source product keys are converted to warehouse product keys, and the result is inserted into the sales warehouse table)
Loading Mechanisms
You can use the following mechanisms for loading a data warehouse:
Loading a Data Warehouse with SQL*Loader
Loading a Data Warehouse with External Tables
Loading a Data Warehouse with OCI and Direct-Path APIs
Loading a Data Warehouse with Export/Import
data to be first loaded in the database. You can then use SQL, PL/SQL, and Java to access the external data. External tables enable the pipelining of the loading phase with the transformation phase. The transformation process can be merged with the loading process without any interruption of the data streaming. It is no longer necessary to stage the data inside the database for further processing inside the database, such as comparison or transformation. For example, the conversion functionality of a conventional load can be used for a direct-path INSERT AS SELECT statement in conjunction with the SELECT from an external table. The main difference between external tables and regular tables is that externally organized tables are read-only. No DML operations (UPDATE/INSERT/DELETE) are possible and no indexes can be created on them. External tables are mostly compliant with the existing SQL*Loader functionality and provide superior functionality in most cases. External tables are especially useful for environments where the complete external source has to be joined with existing database objects or where the data has to be transformed in a complex manner. For example, unlike SQL*Loader, you can apply any arbitrary SQL transformation and use the direct-path insert method.

You can create an external table named sales_transactions_ext, representing the structure of the complete sales transaction data, represented in the external file sh_sales.dat. The product department is especially interested in a cost analysis on product and time. We thus create a fact table named cost in the sales history schema. The operational source data is the same as for the sales fact table. However, because we are not interested in every piece of dimensional information that is provided, the data in the cost fact table has a coarser granularity than in the sales fact table; for example, all different distribution channels are aggregated. We cannot load the data into the cost fact table without applying the previously mentioned aggregation of the detailed information, due to the suppression of some of the dimensions. The external table framework offers a solution to this. Unlike SQL*Loader, where you would have to load the data before applying the aggregation, you can combine the loading and transformation within a single SQL DML statement, as shown in the following. You do not have to stage the data temporarily before inserting into the target table. The object directories must already exist, and point to the directory containing the sh_sales.dat file as well as the directory containing the bad and log files.
CREATE TABLE sales_transactions_ext
(PROD_ID NUMBER, CUST_ID NUMBER, TIME_ID DATE, CHANNEL_ID NUMBER,
 PROMO_ID NUMBER, QUANTITY_SOLD NUMBER, AMOUNT_SOLD NUMBER(10,2),
 UNIT_COST NUMBER(10,2), UNIT_PRICE NUMBER(10,2))
ORGANIZATION external
(TYPE oracle_loader
 DEFAULT DIRECTORY data_file_dir
 ACCESS PARAMETERS
 (RECORDS DELIMITED BY NEWLINE CHARACTERSET US7ASCII
  BADFILE log_file_dir:'sh_sales.bad_xt'
  LOGFILE log_file_dir:'sh_sales.log_xt'
  FIELDS TERMINATED BY "|" LDRTRIM
  (PROD_ID, CUST_ID, TIME_ID DATE(10) "YYYY-MM-DD", CHANNEL_ID, PROMO_ID,
   QUANTITY_SOLD, AMOUNT_SOLD, UNIT_COST, UNIT_PRICE))
 location ('sh_sales.dat')
) REJECT LIMIT UNLIMITED;
The external table can now be used from within the database, accessing some columns of the external data only, grouping the data, and inserting it into the costs fact table:
INSERT /*+ APPEND */ INTO COSTS (TIME_ID, PROD_ID, UNIT_COST, UNIT_PRICE) SELECT TIME_ID, PROD_ID, AVG(UNIT_COST), AVG(amount_sold/quantity_sold) FROM sales_transactions_ext GROUP BY time_id, prod_id;
See Also: Oracle Database SQL Reference for a complete description of external table syntax and restrictions and Oracle Database Utilities for usage examples
Transformation Mechanisms
You have the following choices for transforming data inside the database:
Transformation Using SQL
Transformation Using PL/SQL
Transformation Using Table Functions

Transformation using SQL itself offers the following techniques:

CREATE TABLE ... AS SELECT and INSERT /*+APPEND*/ AS SELECT
Transformation Using UPDATE
Transformation Using MERGE
Transformation Using Multitable INSERT
value. For example, you can do this efficiently using a SQL function as part of the INSERT statement into the target sales table. The structure of the source table sales_activity_direct is as follows:
DESC sales_activity_direct
 Name            Null?    Type
 --------------- -------- ------
 SALES_DATE               DATE
 PRODUCT_ID               NUMBER
 CUSTOMER_ID              NUMBER
 PROMOTION_ID             NUMBER
 AMOUNT                   NUMBER
 QUANTITY                 NUMBER

INSERT /*+ APPEND NOLOGGING PARALLEL */ INTO sales
SELECT product_id, customer_id, TRUNC(sales_date), 3, promotion_id,
       quantity, amount
FROM sales_activity_direct;
MERGE INTO products t          -- the head of this statement is missing in this excerpt;
USING products_delta s         -- the target alias t and source alias s are taken from the
ON (t.prod_id = s.prod_id)     -- body below, the source table name is an assumption
WHEN MATCHED THEN UPDATE
  SET t.prod_list_price = s.prod_list_price, t.prod_min_price = s.prod_min_price
WHEN NOT MATCHED THEN INSERT
  (prod_id, prod_name, prod_desc, prod_subcategory, prod_subcategory_desc,
   prod_category, prod_category_desc, prod_status, prod_list_price, prod_min_price)
  VALUES
  (s.prod_id, s.prod_name, s.prod_desc, s.prod_subcategory, s.prod_subcategory_desc,
   s.prod_category, s.prod_category_desc, s.prod_status, s.prod_list_price,
   s.prod_min_price);
The following statement aggregates the transactional sales information, stored in sales_activity_direct, on a daily basis and inserts the results into both the sales and the costs fact tables for the current day.
INSERT ALL
  INTO sales VALUES (product_id, customer_id, today, 3, promotion_id,
    quantity_per_day, amount_per_day)
  INTO costs VALUES (product_id, today, promotion_id, 3, product_cost,
    product_price)
SELECT TRUNC(s.sales_date) AS today, s.product_id, s.customer_id,
       s.promotion_id, SUM(s.amount) AS amount_per_day,
       SUM(s.quantity) quantity_per_day,
       p.prod_min_price*0.8 AS product_cost, p.prod_list_price AS product_price
FROM sales_activity_direct s, products p
WHERE s.product_id = p.prod_id AND TRUNC(sales_date) = TRUNC(SYSDATE)
GROUP BY TRUNC(sales_date), s.product_id, s.customer_id, s.promotion_id,
         p.prod_min_price*0.8, p.prod_list_price;

Example 14-3 Conditional ALL Insert
The following statement inserts a row into the sales and costs tables for all sales transactions with a valid promotion and stores the information about multiple identical orders of a customer in a separate table cum_sales_activity. It is possible two rows will be inserted for some sales transactions, and none for others.
INSERT ALL
  WHEN promotion_id IN (SELECT promo_id FROM promotions) THEN
    INTO sales VALUES (product_id, customer_id, today, 3, promotion_id,
      quantity_per_day, amount_per_day)
    INTO costs VALUES (product_id, today, promotion_id, 3, product_cost,
      product_price)
  WHEN num_of_orders > 1 THEN
    INTO cum_sales_activity VALUES (today, product_id, customer_id,
      promotion_id, quantity_per_day, amount_per_day, num_of_orders)
SELECT TRUNC(s.sales_date) AS today, s.product_id, s.customer_id,
       s.promotion_id, SUM(s.amount) AS amount_per_day,
       SUM(s.quantity) quantity_per_day, COUNT(*) num_of_orders,
       p.prod_min_price*0.8 AS product_cost, p.prod_list_price AS product_price
FROM sales_activity_direct s, products p
WHERE s.product_id = p.prod_id AND TRUNC(sales_date) = TRUNC(SYSDATE)
GROUP BY TRUNC(sales_date), s.product_id, s.customer_id, s.promotion_id,
         p.prod_min_price*0.8, p.prod_list_price;

Example 14-4 Conditional FIRST Insert
The following statement inserts rows into an appropriate shipping manifest according to the total quantity and the weight of a product order. An exception is made for high-value orders: they are sent by express shipping unless their quantity and weight classification already route them to large-freight shipping. It assumes the existence of appropriate tables large_freight_shipping, express_shipping, and default_shipping.
INSERT FIRST
  WHEN (sum_quantity_sold > 10 AND prod_weight_class < 5)
    OR (sum_quantity_sold > 5 AND prod_weight_class > 5) THEN
    INTO large_freight_shipping VALUES
      (time_id, cust_id, prod_id, prod_weight_class, sum_quantity_sold)
  WHEN sum_amount_sold > 1000 THEN
    INTO express_shipping VALUES
      (time_id, cust_id, prod_id, prod_weight_class, sum_amount_sold,
       sum_quantity_sold)
  ELSE
    INTO default_shipping VALUES
      (time_id, cust_id, prod_id, sum_quantity_sold)
SELECT s.time_id, s.cust_id, s.prod_id, p.prod_weight_class,
       SUM(amount_sold) AS sum_amount_sold,
       SUM(quantity_sold) AS sum_quantity_sold
FROM sales s, products p
WHERE s.prod_id = p.prod_id AND s.time_id = TRUNC(SYSDATE)
GROUP BY s.time_id, s.cust_id, s.prod_id, p.prod_weight_class;

Example 14-5 Mixed Conditional and Unconditional Insert
The following example inserts new customers into the customers table and stores all new customers with cust_credit_limit higher than 4500 in an additional, separate table for further promotions.
INSERT FIRST
  WHEN cust_credit_limit >= 4500 THEN
    INTO customers
    INTO customers_special VALUES (cust_id, cust_credit_limit)
  ELSE
    INTO customers
SELECT * FROM customers_new;
See Chapter 15, "Maintaining the Data Warehouse" for more information regarding MERGE operations.
other without the necessity of intermediate staging. You can use table functions to implement such behavior.
Table functions are not limited in these ways; they extend database functionality by allowing:
Multiple rows to be returned from a function.
Results of SQL subqueries (that select multiple rows) to be passed directly to functions.
Functions to take cursors as input.
Functions to be parallelized.
Result sets to be returned incrementally for further processing as soon as they are created. This is called incremental pipelining.
Table functions can be defined in PL/SQL using a native PL/SQL interface, or in Java or C using the Oracle Data Cartridge Interface (ODCI).
See Also: PL/SQL User's Guide and Reference for further information and Oracle Data Cartridge Developer's Guide
Figure 14-3 illustrates a typical aggregation where you input a set of rows and output a set of rows, in this case after performing a SUM operation.
Figure 14-3 (a table function reads input rows of Region and Sales — North 10, South 20, North 25, East 5, West 10, South 10, and so on — and returns output rows of Region and Sum of Sales — North 35, South 30, West 10, East 5)
The table function takes the result of the SELECT on In as input and delivers a set of records in a different format as output for a direct insertion into Out. Additionally, a table function can fan out data within the scope of an atomic transaction. This can be used in many situations, such as an efficient logging mechanism or a fan-out to other independent transformations. In such a scenario, a single staging table is needed.
Figure 14-4 Pipelined Parallel Transformation with Fanout (table function tf1 fans out to a staging table while passing its rows on to tf2, which inserts into the target table)
This will insert into target and, as part of tf1, into Stage Table 1 within the scope of an atomic transaction.
INSERT INTO target SELECT * FROM tf3(SELECT * FROM stage_table1);
Example 14-6
The following examples demonstrate the fundamentals of table functions, without complex business rules implemented inside those functions. They are chosen for demonstration purposes only, and are all implemented in PL/SQL. Table functions return sets of records and can take cursors as input. Besides the sh sample schema, you have to set up the following database objects before using the examples:
CREATE TYPE product_t AS OBJECT ( prod_id NUMBER(6) , prod_name VARCHAR2(50) , prod_desc VARCHAR2(4000) , prod_subcategory VARCHAR2(50) , prod_subcategory_desc VARCHAR2(2000) , prod_category VARCHAR2(50) , prod_category_desc VARCHAR2(2000) , prod_weight_class NUMBER(2) , prod_unit_of_measure VARCHAR2(20) , prod_pack_size VARCHAR2(30) , supplier_id NUMBER(6) , prod_status VARCHAR2(20) , prod_list_price NUMBER(8,2) , prod_min_price NUMBER(8,2) ); / CREATE TYPE product_t_table AS TABLE OF product_t; / COMMIT; CREATE OR REPLACE PACKAGE cursor_PKG AS TYPE product_t_rec IS RECORD ( prod_id NUMBER(6) , prod_name VARCHAR2(50) , prod_desc VARCHAR2(4000) , prod_subcategory VARCHAR2(50) , prod_subcategory_desc VARCHAR2(2000) , prod_category VARCHAR2(50) , prod_category_desc VARCHAR2(2000) , prod_weight_class NUMBER(2) , prod_unit_of_measure VARCHAR2(20) , prod_pack_size VARCHAR2(30) , supplier_id NUMBER(6) , prod_status VARCHAR2(20) , prod_list_price NUMBER(8,2)
, prod_min_price NUMBER(8,2)); TYPE product_t_rectab IS TABLE OF product_t_rec; TYPE strong_refcur_t IS REF CURSOR RETURN product_t_rec; TYPE refcur_t IS REF CURSOR; END; / REM artificial help table, used later CREATE TABLE obsolete_products_errors (prod_id NUMBER, msg VARCHAR2(2000));
The following example demonstrates a simple filtering; it shows all obsolete products except those of the prod_category Electronics. The table function returns the result set as a set of records and uses a weakly typed REF cursor as input.
CREATE OR REPLACE FUNCTION obsolete_products(cur cursor_pkg.refcur_t) RETURN product_t_table IS prod_id NUMBER(6); prod_name VARCHAR2(50); prod_desc VARCHAR2(4000); prod_subcategory VARCHAR2(50); prod_subcategory_desc VARCHAR2(2000); prod_category VARCHAR2(50); prod_category_desc VARCHAR2(2000); prod_weight_class NUMBER(2); prod_unit_of_measure VARCHAR2(20); prod_pack_size VARCHAR2(30); supplier_id NUMBER(6); prod_status VARCHAR2(20); prod_list_price NUMBER(8,2); prod_min_price NUMBER(8,2); sales NUMBER:=0; objset product_t_table := product_t_table(); i NUMBER := 0; BEGIN LOOP -- Fetch from cursor variable FETCH cur INTO prod_id, prod_name, prod_desc, prod_subcategory, prod_subcategory_desc, prod_category, prod_category_desc, prod_weight_class, prod_unit_of_measure, prod_pack_size, supplier_id, prod_status, prod_list_price, prod_min_price; EXIT WHEN cur%NOTFOUND; -- exit when last row is fetched -- Category Electronics is not meant to be obsolete and will be suppressed IF prod_status='obsolete' AND prod_category != 'Electronics' THEN -- append to collection
i:=i+1; objset.extend; objset(i):=product_t( prod_id, prod_name, prod_desc, prod_subcategory, prod_subcategory_desc, prod_category, prod_category_desc, prod_weight_class, prod_unit_of_measure, prod_pack_size, supplier_id, prod_status, prod_list_price, prod_min_price); END IF; END LOOP; CLOSE cur; RETURN objset; END; /
You can use the table function in a SQL statement to show the results. Here we use additional SQL functionality for the output:
SELECT DISTINCT UPPER(prod_category), prod_status FROM TABLE(obsolete_products( CURSOR(SELECT prod_id, prod_name, prod_desc, prod_subcategory, prod_subcategory_desc, prod_category, prod_category_desc, prod_weight_class, prod_unit_of_measure, prod_pack_size, supplier_id, prod_status, prod_list_price, prod_min_price FROM products)));
The following example implements the same filtering as the first one. The main differences between the two are:
This example uses a strongly typed REF cursor as input and can be parallelized based on the objects of the strongly typed cursor, as shown in one of the following examples.
The table function returns the result set incrementally as soon as records are created.
CREATE OR REPLACE FUNCTION obsolete_products_pipe(cur cursor_pkg.strong_refcur_t) RETURN product_t_table PIPELINED PARALLEL_ENABLE (PARTITION cur BY ANY) IS prod_id NUMBER(6); prod_name VARCHAR2(50); prod_desc VARCHAR2(4000); prod_subcategory VARCHAR2(50); prod_subcategory_desc VARCHAR2(2000); prod_category VARCHAR2(50); prod_category_desc VARCHAR2(2000); prod_weight_class NUMBER(2);
prod_unit_of_measure VARCHAR2(20); prod_pack_size VARCHAR2(30); supplier_id NUMBER(6); prod_status VARCHAR2(20); prod_list_price NUMBER(8,2); prod_min_price NUMBER(8,2); sales NUMBER:=0; BEGIN LOOP -- Fetch from cursor variable FETCH cur INTO prod_id, prod_name, prod_desc, prod_subcategory, prod_subcategory_desc, prod_category, prod_category_desc, prod_weight_class, prod_unit_of_measure, prod_pack_size, supplier_id, prod_status, prod_list_price, prod_min_price; EXIT WHEN cur%NOTFOUND; -- exit when last row is fetched IF prod_status='obsolete' AND prod_category !='Electronics' THEN PIPE ROW (product_t( prod_id, prod_name, prod_desc, prod_subcategory, prod_subcategory_desc, prod_category, prod_category_desc, prod_weight_class, prod_unit_of_measure, prod_pack_size, supplier_id, prod_status, prod_list_price, prod_min_price)); END IF; END LOOP; CLOSE cur; RETURN; END; /
We now change the degree of parallelism for the input table products and issue the same statement again:
ALTER TABLE products PARALLEL 4;
The session statistics show that the statement has been parallelized:
SELECT * FROM V$PQ_SESSTAT WHERE statistic='Queries Parallelized';
STATISTIC                      LAST_QUERY SESSION_TOTAL
------------------------------ ---------- -------------
Queries Parallelized                    1             3
Table functions are also capable of fanning out results into persistent table structures. This is demonstrated in the next example. The function filters and returns all obsolete products except those of a specific prod_category (default Electronics), which was set to status obsolete by error. The detected wrong prod_id values are stored in a separate table structure, obsolete_products_errors. Note that if a table function is part of an autonomous transaction, it must COMMIT or ROLLBACK before each PIPE ROW statement to avoid an error in the calling subprogram. The example furthermore demonstrates how normal variables can be used in conjunction with table functions:
CREATE OR REPLACE FUNCTION obsolete_products_dml(cur cursor_pkg.strong_refcur_t, prod_cat varchar2 DEFAULT 'Electronics') RETURN product_t_table PIPELINED PARALLEL_ENABLE (PARTITION cur BY ANY) IS PRAGMA AUTONOMOUS_TRANSACTION; prod_id NUMBER(6); prod_name VARCHAR2(50); prod_desc VARCHAR2(4000); prod_subcategory VARCHAR2(50); prod_subcategory_desc VARCHAR2(2000); prod_category VARCHAR2(50); prod_category_desc VARCHAR2(2000); prod_weight_class NUMBER(2); prod_unit_of_measure VARCHAR2(20); prod_pack_size VARCHAR2(30); supplier_id NUMBER(6); prod_status VARCHAR2(20); prod_list_price NUMBER(8,2); prod_min_price NUMBER(8,2); sales NUMBER:=0; BEGIN LOOP -- Fetch from cursor variable FETCH cur INTO prod_id, prod_name, prod_desc, prod_subcategory, prod_subcategory_desc, prod_category, prod_category_desc, prod_weight_class, prod_unit_of_measure, prod_pack_size, supplier_id, prod_status,
prod_list_price, prod_min_price; EXIT WHEN cur%NOTFOUND; -- exit when last row is fetched IF prod_status='obsolete' THEN IF prod_category=prod_cat THEN INSERT INTO obsolete_products_errors VALUES (prod_id, 'correction: category '||UPPER(prod_cat)||' still available'); COMMIT; ELSE PIPE ROW (product_t( prod_id, prod_name, prod_desc, prod_subcategory, prod_subcategory_desc, prod_category, prod_category_desc, prod_weight_class, prod_unit_of_measure, prod_pack_size, supplier_id, prod_status, prod_list_price, prod_min_price)); END IF; END IF; END LOOP; CLOSE cur; RETURN; END; /
The following query shows all obsolete product groups except the prod_category Electronics, which was wrongly set to status obsolete:
SELECT DISTINCT prod_category, prod_status FROM TABLE(obsolete_products_dml( CURSOR(SELECT prod_id, prod_name, prod_desc, prod_subcategory, prod_subcategory_desc, prod_category, prod_category_desc, prod_weight_class, prod_unit_of_measure, prod_pack_size, supplier_id, prod_status, prod_list_price, prod_min_price FROM products)));
As you can see, there are some products of the prod_category Electronics that were obsoleted by accident:
SELECT DISTINCT msg FROM obsolete_products_errors;
Taking advantage of the second input variable, you can specify a different product group than Electronics to be considered:
SELECT DISTINCT prod_category, prod_status FROM TABLE(obsolete_products_dml( CURSOR(SELECT prod_id, prod_name, prod_desc, prod_subcategory, prod_subcategory_desc, prod_category, prod_category_desc, prod_weight_class, prod_unit_of_measure, prod_pack_size, supplier_id, prod_status, prod_list_price, prod_min_price FROM products),'Photo'));
Because table functions can be used like a normal table, they can be nested, as shown in the following:
SELECT DISTINCT prod_category, prod_status FROM TABLE(obsolete_products_dml(CURSOR(SELECT * FROM TABLE(obsolete_products_pipe(CURSOR(SELECT prod_id, prod_name, prod_desc, prod_subcategory, prod_subcategory_desc, prod_category, prod_category_desc, prod_weight_class, prod_unit_of_measure, prod_pack_size, supplier_id, prod_status, prod_list_price, prod_min_price FROM products))))));
The biggest advantage of Oracle Database's ETL is its toolkit functionality, where you can combine any of the previously discussed functionality to improve and speed up your ETL processing. For example, you can take an external table as input, join it with an existing table, and use it as input for a parallelized table function to process complex business logic. This table function can be used as the input source for a MERGE operation, thus streaming the new information for the data warehouse, provided in a flat file, through the complete ETL process within one single statement. See PL/SQL User's Guide and Reference for details about table functions and PL/SQL programming. For details about table functions implemented in other languages, see Oracle Data Cartridge Developer's Guide.
In order to execute this transformation, a lookup table must relate the product_id values to the UPC codes. This table might be the product dimension table, or perhaps another table in the data warehouse that has been created specifically to support this transformation. For this example, we assume that there is a table named product, which has a product_id and an upc_code column. This data substitution transformation can be implemented using the following CTAS statement:
CREATE TABLE temp_sales_step2 NOLOGGING PARALLEL AS SELECT sales_transaction_id, product.product_id sales_product_id, sales_customer_id, sales_time_id, sales_channel_id, sales_quantity_sold, sales_dollar_amount FROM temp_sales_step1, product WHERE temp_sales_step1.upc_code = product.upc_code;
This CTAS statement will convert each valid UPC code to a valid product_id value. If the ETL process has guaranteed that each UPC code is valid, then this statement alone may be sufficient to implement the entire transformation.
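If the UPC codes are not guaranteed to be valid, one way to isolate the non-matching rows (a sketch only; the exact statement is not reproduced in this excerpt) is an anti-join into a separate table:

CREATE TABLE temp_sales_step1_invalid NOLOGGING PARALLEL AS
SELECT s.*
FROM   temp_sales_step1 s
WHERE  NOT EXISTS
       (SELECT 1 FROM product p WHERE p.upc_code = s.upc_code);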
This invalid data is now stored in a separate table, temp_sales_step1_invalid, and can be handled separately by the ETL process. Another way to handle invalid data is to modify the original CTAS to use an outer join:
CREATE TABLE temp_sales_step2 NOLOGGING PARALLEL AS SELECT sales_transaction_id, product.product_id sales_product_id, sales_customer_id, sales_time_id, sales_channel_id, sales_quantity_sold, sales_dollar_amount FROM temp_sales_step1, product WHERE temp_sales_step1.upc_code = product.upc_code (+);
Using this outer join, the sales transactions that originally contained invalid UPC codes will be assigned a product_id of NULL. These transactions can be handled later.
Additional approaches to handling invalid UPC codes exist. Some data warehouses may choose to insert null-valued product_id values into their sales table, while other data warehouses may not allow any new data from the entire batch to be inserted into the sales table until all invalid UPC codes have been addressed. The correct approach is determined by the business requirements of the data warehouse. Regardless of the specic requirements, exception handling can be addressed by the same basic SQL techniques as transformations.
Pivoting Scenarios
A data warehouse can receive data from many different sources. Some of these source systems may not be relational databases and may store data in very different formats from the data warehouse. For example, suppose that you receive a set of sales records from a nonrelational database having the form:
product_id, customer_id, weekly_start_date, sales_sun, sales_mon, sales_tue, sales_wed, sales_thu, sales_fri, sales_sat
In your data warehouse, you would want to store the records in a more typical relational form in a fact table sales of the sh sample schema:
prod_id, cust_id, time_id, amount_sold
Note: A number of constraints on the sales table have been disabled for purposes of this example, because the example ignores a number of table columns for the sake of brevity.
Thus, you need to build a transformation such that each record in the input stream is converted into seven records for the data warehouse's sales table. This operation is commonly referred to as pivoting, and Oracle Database offers several ways to do this. The result of the previous example will resemble the following:
SELECT prod_id, cust_id, time_id, amount_sold FROM sales;

   PROD_ID    CUST_ID TIME_ID   AMOUNT_SOLD
---------- ---------- --------- -----------
       111        222 01-OCT-00         100
       111        222 02-OCT-00         200
       111        222 03-OCT-00         300
       111        222 04-OCT-00         400
       111        222 05-OCT-00         500
       111        222 06-OCT-00         600
       111        222 07-OCT-00         700
       222        333 08-OCT-00         200
       222        333 09-OCT-00         300
       222        333 10-OCT-00         400
       222        333 11-OCT-00         500
       222        333 12-OCT-00         600
       222        333 13-OCT-00         700
       222        333 14-OCT-00         800
       333        444 15-OCT-00         300
       333        444 16-OCT-00         400
       333        444 17-OCT-00         500
       333        444 18-OCT-00         600
       333        444 19-OCT-00         700
       333        444 20-OCT-00         800
       333        444 21-OCT-00         900

Example 14-7 Pivoting
The following example uses the multitable insert syntax to insert into the demo table sh.sales some data from an input table with a different structure. The multitable insert statement looks like the following:
INSERT ALL INTO sales (prod_id, cust_id, time_id, amount_sold) VALUES (product_id, customer_id, weekly_start_date, sales_sun) INTO sales (prod_id, cust_id, time_id, amount_sold) VALUES (product_id, customer_id, weekly_start_date+1, sales_mon) INTO sales (prod_id, cust_id, time_id, amount_sold) VALUES (product_id, customer_id, weekly_start_date+2, sales_tue) INTO sales (prod_id, cust_id, time_id, amount_sold) VALUES (product_id, customer_id, weekly_start_date+3, sales_wed) INTO sales (prod_id, cust_id, time_id, amount_sold) VALUES (product_id, customer_id, weekly_start_date+4, sales_thu) INTO sales (prod_id, cust_id, time_id, amount_sold) VALUES (product_id, customer_id, weekly_start_date+5, sales_fri) INTO sales (prod_id, cust_id, time_id, amount_sold)
VALUES (product_id, customer_id, weekly_start_date+6, sales_sat) SELECT product_id, customer_id, weekly_start_date, sales_sun, sales_mon, sales_tue, sales_wed, sales_thu, sales_fri, sales_sat FROM sales_input_table;
This statement only scans the source table once and then inserts the appropriate data for each day.
15
Maintaining the Data Warehouse
This chapter discusses how to load and refresh a data warehouse, and includes:
Using Partitioning to Improve Data Warehouse Refresh
Optimizing DML Operations During Refresh
Refreshing Materialized Views
Using Materialized Views with Partitioned Tables
1. Place the new data into a separate table, sales_01_2001. This data can be directly loaded into sales_01_2001 from outside the data warehouse, or this data can be the result of previous data transformation operations that have already occurred in the data warehouse. sales_01_2001 has the exact same columns, datatypes, and so forth, as the sales table.
2. Gather statistics on the sales_01_2001 table. Create indexes and add constraints on sales_01_2001. Again, the indexes and constraints on sales_01_2001 should be identical to the indexes and constraints on sales. Indexes can be built in parallel and should use the NOLOGGING and the COMPUTE STATISTICS options. For example:
CREATE BITMAP INDEX sales_01_2001_customer_id_bix ON sales_01_2001(customer_id) TABLESPACE sales_idx NOLOGGING PARALLEL 8 COMPUTE STATISTICS;
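The statistics-gathering part of this step could, for example, use the DBMS_STATS package (a sketch only; the schema name SH and the degree of parallelism are assumptions):

EXECUTE DBMS_STATS.GATHER_TABLE_STATS(ownname => 'SH', tabname => 'SALES_01_2001', degree => 8);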
Apply all constraints to the sales_01_2001 table that are present on the sales table. This includes referential integrity constraints. A typical constraint would be:
ALTER TABLE sales_01_2001 ADD CONSTRAINT sales_customer_id
FOREIGN KEY (customer_id) REFERENCES customer(customer_id)
ENABLE NOVALIDATE;
If the partitioned table sales has a primary or unique key that is enforced with a global index structure, ensure that the constraint on sales_pk_jan01 is validated without the creation of an index structure, as in the following:
ALTER TABLE sales_01_2001 ADD CONSTRAINT sales_pk_jan01 PRIMARY KEY (sales_transaction_id) DISABLE VALIDATE;
The creation of the constraint with the ENABLE clause would cause the creation of a unique index, which does not match a local index structure of the partitioned table. You must not build any index structure on the nonpartitioned table to be exchanged that corresponds to an existing global index of the partitioned table; otherwise, the exchange command would fail.
3. Add the sales_01_2001 table to the sales table. In order to add this new data to the sales table, we need to do two things. First, we need to add a new partition to the sales table, using the ALTER TABLE ... ADD PARTITION statement. This adds an empty partition to the sales table:
ALTER TABLE sales ADD PARTITION sales_01_2001 VALUES LESS THAN (TO_DATE('01-FEB-2001', 'DD-MON-YYYY'));
Then, we can add our newly created table to this partition using the EXCHANGE PARTITION operation. This will exchange the new, empty partition with the newly loaded table.
ALTER TABLE sales EXCHANGE PARTITION sales_01_2001 WITH TABLE sales_01_2001 INCLUDING INDEXES WITHOUT VALIDATION UPDATE GLOBAL INDEXES;
The EXCHANGE operation will preserve the indexes and constraints that were already present on the sales_01_2001 table. For unique constraints (such as the unique constraint on sales_transaction_id), you can use the UPDATE GLOBAL INDEXES clause, as shown previously. This will automatically maintain your global index structures as part of the partition maintenance operation and keep them accessible throughout the whole process. If there were only foreign-key constraints, the exchange operation would be instantaneous.
The benefits of this partitioning technique are significant. First, the new data is loaded with minimal resource utilization. The new data is loaded into an entirely separate table, and the index processing and constraint processing are applied only to the new partition. If the sales table was 50 GB and had 12 partitions, then a new month's worth of data contains approximately 4 GB. Only the new month's worth of data needs to be indexed. None of the indexes on the remaining 46 GB of data needs to be modified at all. This partitioning scheme additionally ensures that the load processing time is directly proportional to the amount of new data being loaded, not to the total size of the sales table. Second, the new data is loaded with minimal impact on concurrent queries. All of the operations associated with data loading are occurring on a separate sales_01_2001 table. Therefore, none of the existing data or indexes of the sales table is affected during this data refresh process. The sales table and its indexes remain entirely untouched throughout this refresh process. Third, in case of the existence of any global indexes, those are incrementally maintained as part of the exchange command. This maintenance does not affect the availability of the existing global index structures. The exchange operation can be viewed as a publishing mechanism. Until the data warehouse administrator exchanges the sales_01_2001 table into the sales table, end users cannot see the new data. Once the exchange has occurred, then any end user query accessing the sales table will immediately be able to see the sales_01_2001 data. Partitioning is useful not only for adding new data but also for removing and archiving data. Many data warehouses maintain a rolling window of data. For example, the data warehouse stores the most recent 36 months of sales data. Just as a new partition can be added to the sales table (as described earlier), an old partition can be quickly (and independently) removed from the sales table. These two benefits (reduced resource utilization and minimal end-user impact) are just as pertinent to removing a partition as they are to adding a partition. Removing data from a partitioned table does not necessarily mean that the old data is physically deleted from the database. There are two alternatives for removing old data from a partitioned table. First, you can physically delete all data from the database by dropping the partition containing the old data, thus freeing the allocated space:
ALTER TABLE sales DROP PARTITION sales_01_1998;
Also, you can exchange the old partition with an empty table of the same structure; this empty table is created in the same way as in steps 1 and 2 of the load process.
Assuming the new empty table stub is named sales_archive_01_1998, the following SQL statement will 'empty' partition sales_01_1998:
ALTER TABLE sales EXCHANGE PARTITION sales_01_1998 WITH TABLE sales_archive_01_1998
INCLUDING INDEXES WITHOUT VALIDATION UPDATE GLOBAL INDEXES;
Note that the old data still exists as the exchanged, nonpartitioned table sales_archive_01_1998. If the partitioned table was set up in a way that every partition is stored in a separate tablespace, you can archive (or transport) this table using Oracle Database's transportable tablespace framework before dropping the actual data (the tablespace). See "Transportation Using Transportable Tablespaces" in Chapter 13 for further details regarding transportable tablespaces. In some situations, you might not want to drop the old data immediately, but keep it as part of the partitioned table; although the data is no longer of main interest, there are still potential queries accessing this old, read-only data. You can use Oracle's data compression to minimize the space usage of the old data. We also assume that at least one compressed partition is already part of the partitioned table. See Chapter 3, "Physical Design in Data Warehouses" for a generic discussion of table compression and Chapter 5, "Parallelism and Partitioning in Data Warehouses" for partitioning and table compression.
Refresh Scenarios
A typical scenario might not only need to compress old data, but also to merge several old partitions to reflect the granularity for a later backup of several merged partitions. Let us assume that the backup (partition) granularity is on a quarterly basis for any quarter where the oldest month is more than 36 months behind the most recent month. In this case, we are therefore compressing and merging sales_01_1998, sales_02_1998, and sales_03_1998 into a new, compressed partition sales_q1_1998.
1. Create the new merged partition in parallel in another tablespace. The partition will be compressed as part of the MERGE operation:
ALTER TABLE sales MERGE PARTITIONS sales_01_1998, sales_02_1998, sales_03_1998
INTO PARTITION sales_q1_1998 TABLESPACE archive_q1_1998
COMPRESS UPDATE GLOBAL INDEXES PARALLEL 4;
2. The partition MERGE operation invalidates the local indexes for the new merged partition. We therefore have to rebuild them:

ALTER TABLE sales MODIFY PARTITION sales_q1_1998
REBUILD UNUSABLE LOCAL INDEXES;
Alternatively, you can choose to create the new compressed table outside the partitioned table and exchange it back. The performance and the temporary space consumption are identical for both methods:
1. Create an intermediate table to hold the new merged information. The following statement inherits all NOT NULL constraints from the origin table by default:
CREATE TABLE sales_q1_1998_out TABLESPACE archive_q1_1998
NOLOGGING COMPRESS PARALLEL 4 AS
SELECT * FROM sales
WHERE time_id >= TO_DATE('01-JAN-1998','dd-mon-yyyy')
AND time_id < TO_DATE('01-APR-1998','dd-mon-yyyy');
2. Create the equivalent index structure for table sales_q1_1998_out as for the existing table sales.

3. Prepare the existing table sales for the exchange with the new compressed table sales_q1_1998_out. Because the table to be exchanged contains data actually covered by three partitions, we have to create one matching partition that has the range boundaries we are looking for. You simply have to drop two of the existing partitions. Note that you have to drop the lower two partitions sales_01_1998 and sales_02_1998; the lower boundary of a range partition is always defined by the upper (exclusive) boundary of the previous partition:
ALTER TABLE sales DROP PARTITION sales_01_1998; ALTER TABLE sales DROP PARTITION sales_02_1998;
4. You can now exchange table sales_q1_1998_out with partition sales_03_1998. Unlike what the name of the partition suggests, its boundaries now cover Q1-1998.
ALTER TABLE sales EXCHANGE PARTITION sales_03_1998 WITH TABLE sales_q1_1998_out INCLUDING INDEXES WITHOUT VALIDATION UPDATE GLOBAL INDEXES;
Both methods apply to slightly different business scenarios: Using the MERGE PARTITION approach invalidates the local index structures for the affected partition, but it keeps all data accessible all the time. Any attempt to access the affected partition through one of the unusable index structures raises an error. The limited availability time is approximately the time for re-creating the local bitmap index structures. In most cases this can be neglected, since this part of the partitioned table shouldn't be touched too often.
The CTAS approach, however, minimizes unavailability of any index structures close to zero, but there is a specific time window where the partitioned table does not have all the data, because we dropped two partitions. The limited availability time is approximately the time for exchanging the table. Depending on the existence and number of global indexes, this time window varies. Without any existing global indexes, this time window is a matter of a fraction of a second to a few seconds.

These examples are a simplification of the data warehouse rolling window load scenario. Real-world data warehouse refresh characteristics are always more complex. However, the advantages of this rolling window approach are not diminished in more complex scenarios.

Note that before you add single or multiple compressed partitions to a partitioned table for the first time, all local bitmap indexes must be either dropped or marked unusable. After the first compressed partition is added, no additional actions are necessary for all subsequent operations involving compressed partitions. It is irrelevant how the compressed partitions are added to the partitioned table. See Chapter 5, "Parallelism and Partitioning in Data Warehouses" for further details about partitioning and table compression.
Refresh Scenario 1
Data is loaded daily. However, the data warehouse contains two years of data, so that partitioning by day might not be desired. The solution is to partition by week or month (as appropriate). Use INSERT to add the new data to an existing partition. The INSERT operation only affects a single partition, so the benefits described previously remain intact. The INSERT operation could occur while the partition remains a part of the table. Inserts into a single partition can be parallelized:
INSERT /*+ APPEND*/ INTO sales PARTITION (sales_01_2001) SELECT * FROM new_sales;
The indexes of this sales partition will be maintained in parallel as well. An alternative is to use the EXCHANGE operation. You can do this by exchanging the sales_01_2001 partition of the sales table and then using an INSERT operation. You might prefer this technique when dropping and rebuilding indexes is more efficient than maintaining them.
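A rough sketch of that EXCHANGE-based alternative follows; the work table name is hypothetical, and you would drop or re-create its indexes between the two exchanges as appropriate:

-- exchange the partition out into a standalone work table
ALTER TABLE sales EXCHANGE PARTITION sales_01_2001 WITH TABLE sales_01_2001_work
INCLUDING INDEXES WITHOUT VALIDATION;
-- load the new rows into the work table (indexes can be dropped first and rebuilt afterwards)
INSERT /*+ APPEND */ INTO sales_01_2001_work SELECT * FROM new_sales;
COMMIT;
-- exchange the work table back into the partitioned table
ALTER TABLE sales EXCHANGE PARTITION sales_01_2001 WITH TABLE sales_01_2001_work
INCLUDING INDEXES WITHOUT VALIDATION;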
Refresh Scenario 2
New data feeds, although consisting primarily of data for the most recent day, week, and month, also contain some data from previous time periods.

Solution 1: Use parallel SQL operations (such as CREATE TABLE ... AS SELECT) to separate the new data from the data in previous time periods. Process the old data separately using other techniques.

New data feeds are not solely time based. You can also feed new data into a data warehouse with data from multiple operational systems on a business need basis. For example, the sales data from direct channels may come into the data warehouse separately from the data from indirect channels. For business reasons, it may furthermore make sense to keep the direct and indirect data in separate partitions.

Solution 2: Oracle supports composite range-list partitioning. The primary partitioning strategy of the sales table could be range partitioning based on time_id as shown in the example. However, the subpartitioning is a list based on the channel attribute. Each subpartition can now be loaded independently of each other (for each distinct channel) and added in a rolling window operation as discussed before. The partitioning strategy addresses the business needs in the most optimal manner.
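As an illustration of Solution 2, a composite range-list sales table might be defined as follows; the column definitions, channel values, and partition bounds are assumptions for this sketch, not taken from this guide:

CREATE TABLE sales_rl
( prod_id NUMBER, cust_id NUMBER, time_id DATE, channel_id CHAR(1),
  quantity_sold NUMBER, amount_sold NUMBER )
PARTITION BY RANGE (time_id)
SUBPARTITION BY LIST (channel_id)
SUBPARTITION TEMPLATE
( SUBPARTITION direct VALUES ('D'),
  SUBPARTITION indirect VALUES ('I'),
  SUBPARTITION other VALUES (DEFAULT) )
( PARTITION sales_01_2001 VALUES LESS THAN (TO_DATE('01-FEB-2001','DD-MON-YYYY')),
  PARTITION sales_02_2001 VALUES LESS THAN (TO_DATE('01-MAR-2001','DD-MON-YYYY')) );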
As a typical scenario, suppose that there is a table called new_sales that contains both inserts and updates that will be applied to the sales table. When designing the entire data warehouse load process, it was determined that the new_sales table would contain records with the following semantics:
- If a given sales_transaction_id of a record in new_sales already exists in sales, then update the sales table by adding the sales_dollar_amount and sales_quantity_sold values from the new_sales table to the existing row in the sales table.
- Otherwise, insert the entire new record from the new_sales table into the sales table.
This UPDATE-ELSE-INSERT operation is often called a merge. A merge can be executed using one SQL statement.
Example 15-1 MERGE Operation
MERGE INTO sales s USING new_sales n
ON (s.sales_transaction_id = n.sales_transaction_id)
WHEN MATCHED THEN UPDATE
SET s.sales_quantity_sold = s.sales_quantity_sold + n.sales_quantity_sold,
    s.sales_dollar_amount = s.sales_dollar_amount + n.sales_dollar_amount
WHEN NOT MATCHED THEN INSERT (sales_transaction_id, sales_quantity_sold, sales_dollar_amount)
VALUES (n.sales_transaction_id, n.sales_quantity_sold, n.sales_dollar_amount);
In addition to using the MERGE statement for unconditional UPDATE ELSE INSERT functionality into a target table, you can also use it to:
- Perform an UPDATE only or INSERT only statement.
- Apply additional WHERE conditions for the UPDATE or INSERT portion of the MERGE statement.
- The UPDATE operation can even delete rows if a specific condition yields true.
Example 15-2 Omitting the INSERT Clause

In some data warehouse applications, it is not allowed to add new rows to historical information, but only to update them. It may also happen that you do not want to update but only insert new information. The following examples demonstrate the UPDATE-only and the INSERT-only functionality, respectively:
MERGE INTO Products D1                      -- Destination table 1
USING Product_Changes S                     -- Source/Delta table
ON (D1.PROD_ID = S.PROD_ID)                 -- Search/Join condition
WHEN MATCHED THEN UPDATE                    -- update if join
SET D1.PROD_LIST_PRICE = S.PROD_NEW_PRICE;
When the INSERT clause is omitted, Oracle performs a regular join of the source and the target tables. When the UPDATE clause is omitted, Oracle performs an antijoin of the source and the target tables. This makes the join between the source and target table more efficient.
Example 15-4 Skipping the UPDATE Clause
In some situations, you may want to skip the UPDATE operation when merging a given row into the table. In this case, you can use an optional WHERE clause in the UPDATE clause of the MERGE. As a result, the UPDATE operation only executes when a given condition is true. The following statement illustrates an example of skipping the UPDATE operation:
MERGE INTO Products P                       -- Destination table 1
USING Product_Changes S                     -- Source/Delta table
ON (P.PROD_ID = S.PROD_ID)                  -- Search/Join condition
WHEN MATCHED THEN UPDATE                    -- update if join
SET P.PROD_LIST_PRICE = S.PROD_NEW_PRICE
WHERE P.PROD_STATUS <> 'OBSOLETE';          -- Conditional UPDATE
This shows how the UPDATE operation would be skipped if the condition P.PROD_STATUS <> 'OBSOLETE' is not true. The condition predicate can refer to both the target and the source table.
Example 15-5 Conditional Inserts with MERGE Statements
You may want to skip the INSERT operation when merging a given row into the table. So an optional WHERE clause is added to the INSERT clause of the MERGE. As a result, the INSERT operation only executes when a given condition is true. The following statement offers an example:
MERGE INTO Products P                       -- Destination table 1
USING Product_Changes S                     -- Source/Delta table
ON (P.PROD_ID = S.PROD_ID)                  -- Search/Join condition
WHEN MATCHED THEN UPDATE                    -- update if join
SET P.PROD_LIST_PRICE = S.PROD_NEW_PRICE
WHERE P.PROD_STATUS <> 'OBSOLETE'           -- Conditional UPDATE
WHEN NOT MATCHED THEN INSERT                -- insert if not join
(PROD_ID, PROD_LIST_PRICE) VALUES (S.PROD_ID, S.PROD_NEW_PRICE)
WHERE S.PROD_STATUS <> 'OBSOLETE';          -- Conditional INSERT
This example shows that the INSERT operation would be skipped if the condition S.PROD_STATUS <> 'OBSOLETE' is not true, and the INSERT will only occur if the condition is true. The condition predicate can refer to the source table only. This predicate would most likely be a column filter.
Example 15-6 Using the DELETE Clause with MERGE Statements
You may want to cleanse tables while populating or updating them. To do this, you may want to consider using the DELETE clause in a MERGE statement, as in the following example:
MERGE INTO Products D                       -- Destination table
USING Product_Changes S                     -- Source/Delta table
ON (D.PROD_ID = S.PROD_ID)
WHEN MATCHED THEN UPDATE
SET D.PROD_LIST_PRICE = S.PROD_NEW_PRICE, D.PROD_STATUS = S.PROD_NEW_STATUS
DELETE WHERE (D.PROD_STATUS = 'OBSOLETE')
WHEN NOT MATCHED THEN INSERT (PROD_ID, PROD_LIST_PRICE, PROD_STATUS)
VALUES (S.PROD_ID, S.PROD_NEW_PRICE, S.PROD_NEW_STATUS);
Thus when a row is updated in products, Oracle checks the delete condition D.PROD_STATUS = 'OBSOLETE', and deletes the row if the condition yields true. The DELETE operation is not the same as that of a complete DELETE statement. Only the rows from the destination of the MERGE can be deleted. The only rows that will be affected by the DELETE are the ones that are updated by this MERGE statement. Thus, although a given row of the destination table meets the delete condition, if it does not join under the ON clause condition, it will not be deleted.
Example 15-7 Unconditional Inserts with MERGE Statements
You may want to insert all of the source rows into a table. In this case, the join between the source and target table can be avoided. By identifying special constant join conditions that always evaluate to FALSE, for example, 1=0, such MERGE statements will be optimized and the join condition will be suppressed.
MERGE INTO Products P                       -- Destination table 1
USING New_Product S                         -- Source/Delta table
ON (1 = 0)                                  -- Search/Join condition
WHEN NOT MATCHED THEN                       -- insert if no join
INSERT (PROD_ID, PROD_STATUS) VALUES (S.PROD_ID, S.PROD_NEW_STATUS);
Purging Data
Occasionally, it is necessary to remove large amounts of data from a data warehouse. A very common scenario is the rolling window discussed previously, in which older data is rolled out of the data warehouse to make room for new data.
However, sometimes other data might need to be removed from a data warehouse. Suppose that a retail company has previously sold products from MS Software, and that MS Software has subsequently gone out of business. The business users of the warehouse may decide that they are no longer interested in seeing any data related to MS Software, so this data should be deleted. One approach to removing a large volume of data is to use parallel delete as shown in the following statement:
DELETE FROM sales WHERE sales_product_id IN (SELECT product_id FROM product WHERE product_category = 'MS Software');
This SQL statement will spawn one parallel process for each partition. This approach will be much more efficient than a serial DELETE statement, and none of the data in the sales table will need to be moved. However, this approach also has some disadvantages. When removing a large percentage of rows, the DELETE statement will leave many empty row-slots in the existing partitions. If new data is being loaded using a rolling window technique (or is being loaded using direct-path INSERT or load), then this storage space will not be reclaimed. Moreover, even though the DELETE statement is parallelized, there might be more efficient methods. An alternative method is to re-create the entire sales table, keeping the data for all product categories except MS Software.
CREATE TABLE sales2 NOLOGGING PARALLEL (DEGREE 8)
-- PARTITION BY ... (add the appropriate partitioning clause)
AS SELECT sales.* FROM sales, product
WHERE sales.sales_product_id = product.product_id
AND product.product_category <> 'MS Software';
-- create indexes, constraints, and so on
DROP TABLE sales;
RENAME sales2 TO sales;
This approach may be more efficient than a parallel delete. However, it is also costly in terms of the amount of disk space, because the sales table must effectively be instantiated twice. An alternative method to utilize less space is to re-create the sales table one partition at a time:
CREATE TABLE sales_temp AS SELECT * FROM sales WHERE 1=0;

INSERT INTO sales_temp
SELECT sales.* FROM sales PARTITION (sales_99jan), product
WHERE sales.sales_product_id = product.product_id
AND product.product_category <> 'MS Software';

-- create appropriate indexes and constraints on sales_temp

ALTER TABLE sales EXCHANGE PARTITION sales_99jan WITH TABLE sales_temp;
DBMS_MVIEW.REFRESH_DEPENDENT refreshes all materialized views that depend on a specified master table or materialized view, or on a list of master tables or materialized views.
See Also: "Manual Refresh Using the DBMS_MVIEW Package"
on page 15-16 for more information about this package Performing a refresh operation requires temporary space to rebuild the indexes and can require additional space for performing the refresh operation itself. Some sites might prefer not to refresh all of their materialized views at the same time: as soon as some underlying detail data has been updated, all materialized views using this data will become stale. Therefore, if you defer refreshing your materialized views, you can either rely on your chosen rewrite integrity level to determine whether or not a stale materialized view can be used for query rewrite, or you can temporarily disable query rewrite with an ALTER SYSTEM SET QUERY_REWRITE_ENABLED = FALSE statement. After refreshing the materialized views, you can re-enable query rewrite as the default for all sessions in the current database instance by specifying ALTER SYSTEM SET QUERY_REWRITE_ENABLED as TRUE. Refreshing a materialized view automatically updates all of its indexes. In the case of full refresh,
this requires temporary sort space to rebuild all indexes during refresh. This is because the full refresh truncates or deletes the table before inserting the new full data volume. If insufficient temporary space is available to rebuild the indexes, then you must explicitly drop each index or mark it UNUSABLE prior to performing the refresh operation. If you anticipate performing insert, update or delete operations on tables referenced by a materialized view concurrently with the refresh of that materialized view, and that materialized view includes joins and aggregation, Oracle recommends you use ON COMMIT fast refresh rather than ON DEMAND fast refresh.
Complete Refresh
A complete refresh occurs when the materialized view is initially defined as BUILD IMMEDIATE, unless the materialized view references a prebuilt table. For materialized views using BUILD DEFERRED, a complete refresh must be requested before it can be used for the first time. A complete refresh may be requested at any time during the life of any materialized view. The refresh involves reading the detail tables to compute the results for the materialized view. This can be a very time-consuming process, especially if there are huge amounts of data to be read and processed. Therefore, you should always consider the time required to process a complete refresh before requesting it. There are, however, cases when the only refresh method available for an already built materialized view is complete refresh because the materialized view does not satisfy the conditions specified in the following section for a fast refresh.
Fast Refresh
Most data warehouses have periodic incremental updates to their detail data. As described in "Materialized View Schema Design" on page 8-8, you can use SQL*Loader or any bulk load utility to perform incremental loads of detail data. Fast refresh of your materialized views is usually efficient, because instead of having to recompute the entire materialized view, the changes are applied to the existing data. Thus, processing only the changes can result in a very fast refresh time.
Partition Change Tracking (PCT) fast refresh of a materialized view is enabled only if all the conditions described in "Partition Change Tracking" on page 9-2 are satisfied. In the absence of partition maintenance operations on detail tables, when you request a FAST method (method => 'F') of refresh through procedures in the DBMS_MVIEW package, Oracle will choose PCT refresh if it is enabled on the materialized view and is determined to be better than log-based fast refresh. Similarly, when you request a FORCE method (method => '?'), Oracle will choose PCT refresh if it is enabled on the materialized view and is calculated to be better than log-based fast refresh and complete refresh. Alternatively, you can request the PCT method (method => 'P'), and Oracle will use the PCT method provided all PCT requirements are satisfied. Oracle can use TRUNCATE PARTITION on a materialized view if it satisfies the conditions in "Benefits of Partitioning a Materialized View" on page 9-8 and hence make the PCT refresh process more efficient.
ON COMMIT Refresh
A materialized view can be refreshed automatically using the ON COMMIT method. Therefore, whenever a transaction commits which has updated the tables on which a materialized view is defined, those changes will be automatically reflected in the materialized view. The advantage of using this approach is that you never have to remember to refresh the materialized view. The only disadvantage is that the time required to complete the commit will be slightly longer because of the extra processing involved. However, in a data warehouse, this should not be an issue because there are unlikely to be concurrent processes trying to update the same table.
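As a simple illustration (the materialized view name is hypothetical, and a materialized view log with INCLUDING NEW VALUES on sales is assumed to exist), an ON COMMIT materialized view might be defined as:

CREATE MATERIALIZED VIEW prod_sales_sum_mv
REFRESH FAST ON COMMIT AS
SELECT prod_id, SUM(amount_sold) AS sum_amount,
       COUNT(amount_sold) AS cnt_amount, COUNT(*) AS cnt
FROM sales
GROUP BY prod_id;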
Table 15-1 Refresh Methods

Refresh Method   Parameter   Description
FAST_PCT         P           Refreshes by recomputing the rows in the materialized view affected by changed partitions in the detail tables.
FORCE            ?           Attempts a fast refresh. If that is not possible, it does a complete refresh. For local materialized views, it chooses the refresh method which is estimated by the optimizer to be most efficient. The refresh methods considered are log-based FAST, FAST_PCT, and COMPLETE.
Three refresh procedures are available in the DBMS_MVIEW package for performing ON DEMAND refresh. Each has its own unique set of parameters.
See Also: PL/SQL Packages and Types Reference for detailed information about the DBMS_MVIEW package and Oracle Database Advanced Replication for information showing how to use it in a replication environment
The DBMS_MVIEW.REFRESH procedure takes the following parameters:

- The comma-delimited list of materialized views to refresh
- The refresh method: F-Fast, P-Fast_PCT, ?-Force, C-Complete
- The rollback segment to use
- Refresh after errors (TRUE or FALSE): a Boolean parameter. If set to TRUE, the number_of_failures output parameter will be set to the number of refreshes that failed, and a generic error message will indicate that failures occurred. The alert log for the instance will give details of refresh errors. If set to FALSE, the default, then refresh will stop after it encounters the first error, and any remaining materialized views in the list will not be refreshed.
The following four parameters are used by the replication process. For warehouse refresh, set them to FALSE, 0,0,0.
- Atomic refresh (TRUE or FALSE): if set to TRUE, then all refreshes are done in one transaction. If set to FALSE, then the refresh of each specified materialized view is done in a separate transaction. If set to FALSE, Oracle can optimize refresh by using parallel DML and truncate DDL on the materialized views.
For example, to perform a fast refresh on the materialized view cal_month_sales_mv, the DBMS_MVIEW package would be called as follows:
DBMS_MVIEW.REFRESH('CAL_MONTH_SALES_MV', 'F', '', TRUE, FALSE, 0,0,0, FALSE);
Multiple materialized views can be refreshed at the same time, and they do not all have to use the same refresh method. To give them different refresh methods, specify multiple method codes in the same order as the list of materialized views (without commas). For example, the following specifies that cal_month_sales_mv be completely refreshed and fweek_pscat_sales_mv receive a fast refresh:
DBMS_MVIEW.REFRESH('CAL_MONTH_SALES_MV, FWEEK_PSCAT_SALES_MV', 'CF', '', TRUE, FALSE, 0,0,0, FALSE);
If the refresh method is not specied, the default refresh method as specied in the materialized view denition will be used.
The DBMS_MVIEW.REFRESH_ALL_MVIEWS procedure takes the following parameters:

- The number of failures (this is an OUT variable)
- The refresh method: F-Fast, P-Fast_PCT, ?-Force, C-Complete
- Refresh after errors (TRUE or FALSE): a Boolean parameter. If set to TRUE, the number_of_failures output parameter will be set to the number of refreshes that failed, and a generic error message will indicate that failures occurred. The alert log for the instance will give details of refresh errors. If set to FALSE, the default, then refresh will stop after it encounters the first error, and any remaining materialized views in the list will not be refreshed.
- Atomic refresh (TRUE or FALSE): if set to TRUE, then all refreshes are done in one transaction. If set to FALSE, then the refresh of each specified materialized view is done in a separate transaction. If set to FALSE, Oracle can optimize refresh by using parallel DML and truncate DDL on the materialized views.

An example of refreshing all materialized views is the following:
DBMS_MVIEW.REFRESH_ALL_MVIEWS(failures,'C','', TRUE, FALSE);
The DBMS_MVIEW.REFRESH_DEPENDENT procedure takes the following parameters:

- The number of failures (this is an OUT variable)
- The dependent table
- The refresh method: F-Fast, P-Fast_PCT, ?-Force, C-Complete
- The rollback segment to use
- Refresh after errors (TRUE or FALSE): a Boolean parameter. If set to TRUE, the number_of_failures output parameter will be set to the number of refreshes that failed, and a generic error message will indicate that failures occurred. The alert log for the instance will give details of refresh errors. If set to FALSE, the default, then refresh will stop after it encounters the first error, and any remaining materialized views in the list will not be refreshed.
- Atomic refresh (TRUE or FALSE): if set to TRUE, then all refreshes are done in one transaction. If set to FALSE, then the refresh of each specified materialized view is done in a separate transaction. If set to FALSE, Oracle can optimize refresh by using parallel DML and truncate DDL on the materialized views.
- Whether it is nested or not: if set to TRUE, refresh all the dependent materialized views of the specified set of tables based on a dependency order to ensure the materialized views are truly fresh with respect to the underlying base tables.
To perform a full refresh on all materialized views that reference the customers table, specify:
DBMS_MVIEW.REFRESH_DEPENDENT(failures, 'CUSTOMERS', 'C', '', FALSE, FALSE );
To obtain the list of materialized views that are directly dependent on a given object (table or materialized view), use the procedure DBMS_MVIEW.GET_MV_DEPENDENCIES to determine the dependent materialized views for a given table, or for deciding the order to refresh nested materialized views.
DBMS_MVIEW.GET_MV_DEPENDENCIES(mvlist IN VARCHAR2, deplist OUT VARCHAR2)
The input to this function is the name or names of the materialized view. The output is a comma-separated list of the materialized views that are defined on it. For example, the following statement:
DBMS_MVIEW.GET_MV_DEPENDENCIES("JOHN.SALES_REG, SCOTT.PROD_TIME", deplist)
This populates deplist with the list of materialized views dened on the input arguments. For example:
deplist <= "JOHN.SUM_SALES_WEST, JOHN.SUM_SALES_EAST, SCOTT.SUM_PROD_MONTH".
PARALLEL_MAX_SERVERS should be set high enough to take care of parallelism. You need to consider the number of slaves needed for the refresh statement. For example, with a DOP of eight, you need 16 slave processes. PGA_AGGREGATE_TARGET should be set for the instance to manage the memory usage for sorts and joins automatically. If the memory parameters are set manually, SORT_AREA_SIZE should be less than HASH_AREA_SIZE. OPTIMIZER_MODE should equal all_rows.
Remember to analyze all tables and indexes for better optimization. See Chapter 24, "Using Parallel Execution" for further information.
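For example, the settings described above might be put in place with statements such as the following; the actual values are placeholders that have to be sized for your own system:

ALTER SYSTEM SET PARALLEL_MAX_SERVERS = 16;
ALTER SYSTEM SET PGA_AGGREGATE_TARGET = 1000M;
ALTER SESSION SET OPTIMIZER_MODE = ALL_ROWS;
EXECUTE DBMS_STATS.GATHER_TABLE_STATS('SH', 'SALES', cascade => TRUE);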
Monitoring a Refresh
While a job is running, you can query the V$SESSION_LONGOPS view to monitor the progress of each materialized view being refreshed.
SELECT * FROM V$SESSION_LONGOPS;
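In addition, the overall state of each materialized view can be checked in the data dictionary; for example (a sketch, assuming the standard USER_MVIEWS columns):

SELECT mview_name, staleness, compile_state
FROM user_mviews;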
If the compile_state column shows NEEDS COMPILE, the other displayed column values cannot be trusted as reflecting the true status. To revalidate the materialized view, issue the following statement:
ALTER MATERIALIZED VIEW [materialized_view_name] COMPILE;
Scheduling Refresh
Very often you will have multiple materialized views in the database. Some of these can be computed by rewriting against others. This is very common in data warehousing environments where you may have nested materialized views or materialized views at different levels of some hierarchy. In such cases, you should create the materialized views as BUILD DEFERRED, and then issue one of the refresh procedures in the DBMS_MVIEW package to refresh all the materialized views. Oracle Database will compute the dependencies and refresh the materialized views in the right order. Consider the example of a complete hierarchical cube described in "Examples of Hierarchical Cube Materialized Views" on page 20-32. Suppose all the materialized views have been created as BUILD DEFERRED. Creating the materialized views as BUILD DEFERRED only creates the metadata for all the materialized views. Then, you can call one of the refresh procedures in the DBMS_MVIEW package to refresh all the materialized views in the right order:
EXECUTE DBMS_MVIEW.REFRESH_DEPENDENT(list=>'SALES', method => 'C');
The procedure will refresh the materialized views in the order of their dependencies (first sales_hierarchical_mon_cube_mv, followed by sales_hierarchical_qtr_cube_mv, then sales_hierarchical_yr_cube_mv, and finally sales_hierarchical_all_cube_mv). Each of these materialized views gets rewritten against the one prior to it in the list.

The same kind of rewrite can also be used while doing PCT refresh. PCT refresh recomputes rows in a materialized view corresponding to changed rows in the detail tables. And, if there are other fresh materialized views available at the time of refresh, it can go directly against them as opposed to going against the detail tables. Hence, it is always beneficial to pass a list of materialized views to any of the refresh procedures in the DBMS_MVIEW package (irrespective of the method specified) and let the procedure figure out the order of doing refresh on materialized views.
For fast refresh, create materialized view logs on all detail tables involved in a materialized view with the ROWID, SEQUENCE and INCLUDING NEW VALUES clauses. Include all columns from the table likely to be used in materialized views in the materialized view logs. Fast refresh may be possible even if the SEQUENCE option is omitted from the materialized view log. If it can be determined that only inserts or deletes will occur on all the detail tables, then the materialized view log does not require the SEQUENCE clause. However, if updates to multiple tables are likely or required or if the specific update scenarios are unknown, make sure the SEQUENCE clause is included.
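A materialized view log following these recommendations might look like the following sketch; the column list is illustrative and should include the columns your materialized views actually use:

CREATE MATERIALIZED VIEW LOG ON sales
WITH ROWID, SEQUENCE (prod_id, cust_id, time_id, quantity_sold, amount_sold)
INCLUDING NEW VALUES;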
Use Oracle's bulk loader utility or direct-path INSERT (INSERT with the APPEND hint for loads). This is a lot more efficient than conventional insert. During loading, disable all constraints and re-enable them when finished loading. Note that materialized view logs are required regardless of whether you use direct load or conventional DML. Try to optimize the sequence of conventional mixed DML operations, direct-path INSERT and the fast refresh of materialized views. You can use fast refresh with a mixture of conventional DML and direct loads. Fast refresh can perform significant optimizations if it finds that only direct loads have occurred, as illustrated in the following:
1. Direct-path INSERT (SQL*Loader or INSERT /*+ APPEND */) into the detail table
2. Refresh materialized view
3. Conventional mixed DML
4. Refresh materialized view
You can use fast refresh with conventional mixed DML (INSERT, UPDATE, and DELETE) to the detail tables. However, fast refresh will be able to perform significant optimizations in its processing if it detects that only inserts or deletes have been done to the tables, such as:
1. DML INSERT or DELETE to the detail table
2. Refresh materialized views
3. DML update to the detail table
4. Refresh materialized view
Even more optimal is the separation of INSERT and DELETE. If possible, refresh should be performed after each type of data change (as shown earlier) rather than issuing only one refresh at the end. If that is not possible, restrict the conventional DML to the table to inserts only, to get much better refresh performance. Avoid mixing deletes and direct loads. Furthermore, for refresh ON COMMIT, Oracle keeps track of the type of DML done in the committed transaction. Therefore, do not perform direct-path INSERT and DML to other tables in the same transaction, as Oracle may not be able to optimize the refresh phase. For ON COMMIT materialized views, where refreshes automatically occur at the end of each transaction, it may not be possible to isolate the DML statements, in which case keeping the transactions short will help. However, if you plan to make numerous modications to the detail table, it may be better to perform them in one transaction, so that refresh of the materialized view will be performed just once at commit time rather than after each update.
- Parallel DML: for large loads or refresh, enabling parallel DML will help shorten the length of time for the operation.
You can refresh your materialized views fast after partition maintenance operations on the detail tables. See "Partition Change Tracking" on page 9-2 for details on enabling PCT for materialized views.
- Partitioning the materialized view will also help refresh performance as refresh can update the materialized view using parallel DML. For example, assume that the detail tables and materialized view are partitioned and have a parallel clause. The following sequence would enable Oracle to parallelize the refresh of the materialized view.
1. Bulk load into the detail table.
2. Enable parallel DML with an ALTER SESSION ENABLE PARALLEL DML statement.
3. Refresh the materialized view.
For COMPLETE refresh, this will TRUNCATE to delete existing rows in the materialized view, which is faster than a delete. For PCT refresh, if the materialized view is partitioned appropriately, this will use TRUNCATE PARTITION to delete rows in the affected partitions of the materialized view, which is faster than a delete. For FAST or FORCE refresh, if COMPLETE or PCT refresh is chosen, this will be able to use the TRUNCATE optimizations described earlier.
When using DBMS_MVIEW.REFRESH with JOB_QUEUES, remember to set atomic to FALSE. Otherwise, JOB_QUEUES will not get used. Set the number of job queue processes greater than the number of processors. If job queues are enabled and there are many materialized views to refresh, it is faster to refresh all of them in a single command than to call them individually.
Use REFRESH FORCE to ensure refreshing a materialized view so that it can definitely be used for query rewrite. The best refresh method will be chosen. If a fast refresh cannot be done, a complete refresh will be performed. Refresh all the materialized views in a single procedure call. This gives Oracle an opportunity to schedule refresh of all the materialized views in the right order taking into account dependencies imposed by nested materialized views and potential for efficient refresh by using query rewrite against other materialized views.
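For instance, reusing the materialized views from the earlier examples, a single call can force-refresh both of them (the '?' code requests the FORCE method for each view):

DBMS_MVIEW.REFRESH('CAL_MONTH_SALES_MV, FWEEK_PSCAT_SALES_MV', '??');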
Indexes should be created on columns sales_rid, times_rid and cust_rid. Partitioning is highly recommended, as is enabling parallel DML in the session before invoking refresh, because it will greatly enhance refresh performance. This type of materialized view can also be fast refreshed if DML is performed on the detail table. It is recommended that the same procedure be applied to this type of materialized view as for a single table aggregate. That is, perform one type of change (direct-path INSERT or DML) and then refresh the materialized view. This is because Oracle Database can perform significant optimizations if it detects that only one type of change has been done. Also, Oracle recommends that the refresh be invoked after each table is loaded, rather than load all the tables and then perform the refresh. For refresh ON COMMIT, Oracle keeps track of the type of DML done in the committed transaction. Oracle therefore recommends that you do not perform direct-path and conventional DML to other tables in the same transaction because Oracle may not be able to optimize the refresh phase. For example, the following is not recommended:
1. Direct load new data into the fact table
2. DML into the store table
3. Commit
Also, try not to mix different types of conventional DML statements if possible. This would again prevent using various optimizations during fast refresh. For example, try to avoid the following:
1. Insert into the fact table
2. Delete from the fact table
3. Commit
If many updates are needed, try to group them all into one transaction because refresh will be performed just once at commit time, rather than after each update. When you use the DBMS_MVIEW package to refresh a number of materialized views containing only joins with the ATOMIC parameter set to TRUE, if you disable parallel DML, refresh performance may degrade. In a data warehousing environment, assuming that the materialized view has a parallel clause, the following sequence of steps is recommended:
1. Bulk load into the fact table
2. Enable parallel DML with an ALTER SESSION ENABLE PARALLEL DML statement
3. Refresh the materialized view
These procedures have the following behavior when used with nested materialized views:
- If REFRESH is applied to a materialized view my_mv that is built on other materialized views, then my_mv will be refreshed with respect to the current contents of the other materialized views (that is, the other materialized views will not be made fresh first) unless you specify nested => TRUE (see the sketch following this list).
- If REFRESH_DEPENDENT is applied to materialized view my_mv, then only materialized views that directly depend on my_mv will be refreshed (that is, a materialized view that depends on a materialized view that depends on my_mv will not be refreshed) unless you specify nested => TRUE.
- If REFRESH_ALL_MVIEWS is used, the order in which the materialized views will be refreshed is guaranteed to respect the dependencies between nested materialized views.
- GET_MV_DEPENDENCIES provides a list of the immediate (or direct) materialized view dependencies for an object.
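For example, a sketch of refreshing a nested materialized view together with the materialized views it is built on (MY_MV is a placeholder name):

EXECUTE DBMS_MVIEW.REFRESH('MY_MV', method => 'C', nested => TRUE);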
The form of a maintenance marker column, column MARKER in the example, must be numeric_or_string_literal AS column_alias, where each UNION ALL member has a distinct value for numeric_or_string_literal.
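To make this concrete, a minimal UNION ALL materialized view with a maintenance marker might look like the following sketch; the view name, the channel values, and the assumption that a ROWID materialized view log exists on sales are all illustrative:

CREATE MATERIALIZED VIEW sales_by_channel_mv
REFRESH FAST ON DEMAND AS
SELECT 1 AS marker, s.rowid AS rid, s.prod_id, s.amount_sold
FROM sales s WHERE s.channel_id = 3
UNION ALL
SELECT 2 AS marker, s.rowid AS rid, s.prod_id, s.amount_sold
FROM sales s WHERE s.channel_id <> 3;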
Constraints are normally enabled with the NOVALIDATE or RELY options. An important decision to make before performing a refresh operation is whether the refresh needs to be recoverable. Because materialized view data is redundant and can always be reconstructed from the detail tables, it might be preferable to disable logging on the materialized view. To disable logging and run incremental refresh non-recoverably, use the ALTER MATERIALIZED VIEW ... NOLOGGING statement prior to refreshing. If the materialized view is being refreshed using the ON COMMIT method, then, following refresh operations, consult the alert log alert_SID.log and the trace file ora_SID_number.trc to check that no errors have occurred.
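For example, the NOLOGGING statement mentioned above might look like this for one of the materialized views used later in this chapter:

ALTER MATERIALIZED VIEW cust_mth_sales_mv NOLOGGING;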
All detail tables must have materialized view logs. To avoid redundancy, only the materialized view log for the sales table is provided in the following:
CREATE MATERIALIZED VIEW LOG ON SALES
WITH ROWID, SEQUENCE (prod_id, time_id, quantity_sold, amount_sold)
INCLUDING NEW VALUES;
You can use the DBMS_MVIEW.EXPLAIN_MVIEW procedure to determine which tables will allow PCT refresh. See "Analyzing Materialized View Capabilities" on page 8-37 for how to use this procedure.
CAPABILITY_NAME  POSSIBLE  RELATED_TEXT  MSGTXT
---------------  --------  ------------  ----------------
PCT              Y         SALES
PCT_TABLE        Y         SALES
PCT_TABLE        N         PRODUCTS
PCT_TABLE        N         TIMES
As can be seen from the partial sample output from EXPLAIN_MVIEW, any partition maintenance operation performed on the sales table will allow PCT fast refresh. However, PCT is not possible after partition maintenance operations or updates to the products table as there is insufficient information contained in cust_mth_sales_mv for PCT refresh to be possible. Note that the times table is not partitioned and hence can never allow for PCT refresh. Oracle will apply PCT refresh if it can determine that the materialized view has sufficient information to support PCT for all the updated tables.
1. Suppose at some later point, a SPLIT operation of one partition in the sales table becomes necessary.
ALTER TABLE SALES SPLIT PARTITION month3 AT (TO_DATE('05-02-1998', 'DD-MM-YYYY')) INTO (PARTITION month3_1 TABLESPACE summ, PARTITION month3 TABLESPACE summ);
2. Insert some data into the sales table.
3. Fast refresh cust_mth_sales_mv using the DBMS_MVIEW.REFRESH procedure.
EXECUTE DBMS_MVIEW.REFRESH('CUST_MTH_SALES_MV', 'F', '',TRUE,FALSE,0,0,0,FALSE);
Fast refresh will automatically do a PCT refresh as it is the only fast refresh possible in this scenario. However, fast refresh will not occur if a partition maintenance operation occurs when any update has taken place to a table on which PCT is not enabled. This is shown in "PCT Fast Refresh Scenario 2". "PCT Fast Refresh Scenario 1" would also be appropriate if the materialized view was created using the PMARKER clause as illustrated in the following:
CREATE MATERIALIZED VIEW cust_sales_marker_mv BUILD IMMEDIATE REFRESH FAST ON DEMAND ENABLE QUERY REWRITE AS SELECT DBMS_MVIEW.PMARKER(s.rowid) s_marker, SUM(s.quantity_sold), SUM(s.amount_sold), p.prod_name, t.calendar_month_name, COUNT(*), COUNT(s.quantity_sold), COUNT(s.amount_sold) FROM sales s, products p, times t WHERE s.time_id = t.time_id AND s.prod_id = p.prod_id GROUP BY DBMS_MVIEW.PMARKER(s.rowid), p.prod_name, t.calendar_month_name;
The same as in "PCT Fast Refresh Scenario 1". The same as in "PCT Fast Refresh Scenario 1". The same as in "PCT Fast Refresh Scenario 1". The same as in "PCT Fast Refresh Scenario 1". After issuing the same SPLIT operation, as shown in "PCT Fast Refresh Scenario 1", some data will be inserted into the times table.
ALTER TABLE SALES SPLIT PARTITION month3 AT (TO_DATE('05-02-1998', 'DD-MM-YYYY'))
INTO (PARTITION month3_1 TABLESPACE summ, PARTITION month3 TABLESPACE summ);
6. Refresh cust_mth_sales_mv.
EXECUTE DBMS_MVIEW.REFRESH('CUST_MTH_SALES_MV', 'F', '',TRUE,FALSE,0,0,0,FALSE); ORA-12052: cannot fast refresh materialized view SH.CUST_MTH_SALES_MV
The materialized view is not fast refreshable because DML has occurred to a table on which PCT fast refresh is not possible. To avoid this occurring, Oracle recommends performing a fast refresh immediately after any partition maintenance operation on detail tables for which partition tracking fast refresh is available.

If the situation in "PCT Fast Refresh Scenario 2" occurs, there are two possibilities: perform a complete refresh or switch to the CONSIDER FRESH option outlined in the following, if suitable. However, it should be noted that CONSIDER FRESH and partition change tracking fast refresh are not compatible. Once the ALTER MATERIALIZED VIEW cust_mth_sales_mv CONSIDER FRESH statement has been issued, PCT refresh will no longer be applied to this materialized view, until a complete refresh is done. Moreover, you should not use CONSIDER FRESH unless you have taken manual action to ensure that the materialized view is indeed fresh.

A common situation in a data warehouse is the use of rolling windows of data. In this case, the detail table and the materialized view may contain, say, the last 12 months of data. Every month, new data for a month is added to the table and the oldest month is deleted (or maybe archived). PCT refresh provides a very efficient mechanism to maintain the materialized view in this case.
The new data is usually added to the detail table by adding a new partition and exchanging it with a table containing the new data.
ALTER TABLE sales ADD PARTITION month_new ...
ALTER TABLE sales EXCHANGE PARTITION month_new WITH TABLE month_new_table;
Now, provided the materialized view satisfies all conditions for PCT refresh, refresh it:
EXECUTE DBMS_MVIEW.REFRESH('CUST_MTH_SALES_MV', 'F', '', TRUE, FALSE,0,0,0,FALSE);
Fast refresh will automatically detect that PCT is available and perform a PCT refresh.
The materialized view is now considered stale and requires a refresh because of the partition operation. However, as the detail table no longer contains all the data associated with the partition, fast refresh cannot be attempted.
Issue the ALTER MATERIALIZED VIEW cust_mth_sales_mv CONSIDER FRESH statement. This statement informs Oracle that cust_mth_sales_mv is fresh for your purposes. However, the materialized view now has a status that is neither known fresh nor known stale. Instead, it is UNKNOWN. If the materialized view has query rewrite enabled in QUERY_REWRITE_INTEGRITY=stale_tolerated mode, it will be used for rewrite.
Because the fast refresh detects that only INSERT statements occurred against the sales table, it will update the materialized view with the new data. However, the status of the materialized view will remain UNKNOWN. The only way to return the materialized view to FRESH status is with a complete refresh, which also removes the historical data from the materialized view.
16
Change Data Capture
Change Data Capture efficiently identifies and captures data that has been added to, updated in, or removed from, Oracle relational tables and makes this change data available for use by applications or individuals. Change Data Capture is provided as a database component beginning with Oracle9i. This chapter describes Change Data Capture in the following sections:
- Overview of Change Data Capture
- Change Sources and Modes of Data Capture
- Change Sets
- Change Tables
- Getting Information About the Change Data Capture Environment
- Preparing to Publish Change Data
- Publishing Change Data
- Subscribing to Change Data
- Considerations for Asynchronous Change Data Capture
- Managing Published Data
- Implementation and System Configuration
See PL/SQL Packages and Types Reference for reference information about the Change Data Capture publish and subscribe PL/SQL packages.
Moreover, you can obtain the deleted rows and old versions of updated rows with the following query:
SELECT * FROM old_version MINUS SELECT * FROM new_version;
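By the same token, the inserted rows and the new versions of updated rows could be obtained by reversing the operands of the MINUS (a sketch implied by, but not shown in, the text above):

SELECT * FROM new_version
MINUS
SELECT * FROM old_version;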
However, there are several problems with this table differencing method:

- It requires that the new version of the entire table be transported to the staging database, not just the change data, thereby greatly increasing transport costs.
- The computational cost of performing the two MINUS operations on the staging database can be very high.
- Table differencing cannot capture data changes that have reverted to their old values. For example, suppose the price of a product changes several times between the old version and the new version of the product's table. If the price in the new version ends up being the same as the old, table differencing cannot detect that the price has fluctuated. Moreover, any intermediate price values between the old and new versions of the product's table cannot be captured using table differencing.
- There is no way to determine which changes were made as part of the same transaction. For example, suppose a sales manager creates a special discount to close a deal. The fact that the creation of the discount and the creation of the sale occurred as part of the same transaction cannot be captured, unless the source database is specifically designed to do so.
Change-value selection involves capturing the data on the source database by selecting the new and changed data from the source tables based on the value of a specific column. For example, suppose the source table has a LAST_UPDATE_DATE column. To capture changes, you base your selection from the source table on the LAST_UPDATE_DATE column value. However, there are also several problems with this method:
- The overhead of capturing the change data must be borne on the source database, and you must run potentially expensive queries against the source table on the source database. The need for these queries may force you to add indexes that would otherwise be unneeded. There is no way to offload this overhead to the staging database.
- This method is no better at capturing intermediate values than the table differencing method. If the price in the product's table fluctuates, you will not be able to capture all the intermediate values, or even tell if the price had changed, if the ending value is the same as it was the last time that you captured change data.
- This method is also no better than the table differencing method at capturing which data changes were made together in the same transaction. If you need to capture information concerning which changes occurred together in the same transaction, you must include specific designs for this purpose in your source database.
- The granularity of the change-value column may not be fine enough to uniquely identify the new and changed rows. For example, suppose you capture data changes using change-value selection on a date column such as LAST_UPDATE_DATE, the capture happens at a particular instant in time, 14-FEB-2003 17:10:00, and additional updates occur to the table during the same second that you performed your capture. When you next capture data changes, you will select rows with a LAST_UPDATE_DATE strictly after 14-FEB-2003 17:10:00, and thereby miss the changes that occurred during the remainder of that second. To use change-value selection, you either have to accept that anomaly, add an artificial change-value column with the granularity you need, or lock out changes to the source table during the capture process, thereby further burdening the performance of the source database.
- You have to design your source database in advance with this capture mechanism in mind: all tables from which you wish to capture change data must have a change-value column. If you want to build a data warehouse with data sources from legacy systems, those legacy systems may not supply the necessary change-value columns you need.
Change Data Capture does not depend on expensive and cumbersome table differencing or change-value selection mechanisms. Instead, it captures the change data resulting from INSERT, UPDATE, and DELETE operations made to user tables. The change data is then stored in a relational table called a change table, and the change data is made available to applications or individuals in a controlled way.
Change Data Capture captures change data in two modes:

- Synchronous: Change data is captured immediately, as each SQL statement that performs a data manipulation language (DML) operation (INSERT, UPDATE, or DELETE) is made, by using triggers on the source database. In this mode, change data is captured as part of the transaction modifying the source table. Synchronous Change Data Capture is available with Oracle Standard Edition and Enterprise Edition.
- Asynchronous: Change data is captured after a SQL statement that performs a DML operation is committed, by taking advantage of the data sent to the redo log files. In this mode, change data is not captured as part of the transaction that is modifying the source table, and therefore has no effect on that transaction. Asynchronous Change Data Capture is available with Oracle Enterprise Edition only.
Asynchronous Change Data Capture is built on, and provides a relational interface to, Oracle Streams. See Oracle Streams Concepts and Administration for information on Oracle Streams. The following list describes the advantages of capturing change data with Change Data Capture:
- Completeness: Change Data Capture can capture all effects of INSERT, UPDATE, and DELETE operations, including data values before and after UPDATE operations.
- Performance: Asynchronous Change Data Capture can be configured to have minimal performance impact on the source database.
- Interface: Change Data Capture includes the PL/SQL DBMS_CDC_PUBLISH and DBMS_CDC_SUBSCRIBE packages, which provide easy-to-use publish and subscribe interfaces.
- Cost: Change Data Capture reduces overhead cost because it simplifies the extraction of change data from the database and is part of Oracle9i and later databases.
A Change Data Capture system is based on the interaction of a publisher and subscribers to capture and distribute change data, as described in the next section.
Publisher
The publisher is usually a database administrator (DBA) who creates and maintains the schema objects that make up the Change Data Capture system. Typically, a publisher deals with two databases:
- Source database: This is the production database that contains the data of interest. Its associated tables are referred to as the source tables.
- Staging database: This is the database where the change data capture takes place. Depending on the capture mode that the publisher uses, the staging database can be the same as, or different from, the source database. The following Change Data Capture objects reside on the staging database:
  - Change table: A change table is a relational table that contains change data for a single source table. To subscribers, a change table is known as a publication.
  - Change set: A change set is a set of change data that is guaranteed to be transactionally consistent. It contains one or more change tables.
  - Change source: The change source is a logical representation of the source database. It contains one or more change sets.
The publisher performs the following tasks:

- Determines the source databases and tables from which the subscribers are interested in viewing change data, and the mode (synchronous or asynchronous) in which to capture the change data.
- Uses the Oracle-supplied package, DBMS_CDC_PUBLISH, to set up the system to capture change data from the source tables of interest.
- Allows subscribers to have controlled access to the change data in the change tables by using the SQL GRANT and REVOKE statements to grant and revoke the SELECT privilege on change tables for users and roles. (Keep in mind, however, that subscribers use views, not change tables directly, to access change data.)
In Figure 16-1, the publisher determines that subscribers are interested in viewing change data from the HQ source database. In particular, subscribers are interested in change data from the SH.SALES and SH.PROMOTIONS source tables. The publisher creates a change source HQ_SRC on the DW staging database, a change set, SH_SET, and two change tables: sales_ct and promo_ct. The sales_ct change table contains all the columns from the source table, SH.SALES. For the promo_ct change table, however, the publisher has decided to exclude the PROMO_COST column.
Figure 16-1 Publisher Components in a Change Data Capture System
Subscribers
The subscribers are consumers of the published change data. A subscriber performs the following tasks:
- Uses the Oracle-supplied package, DBMS_CDC_SUBSCRIBE, to:
  - Create subscriptions. A subscription controls access to the change data from one or more source tables of interest within a single change set. A subscription contains one or more subscriber views. A subscriber view is a view that specifies the change data from a specific publication in a subscription. The subscriber is restricted to seeing change data that the publisher has published and has granted the subscriber access to use. See "Subscribing to Change Data" on page 16-42 for more information on choosing a method for specifying a subscriber view.
  - Notify Change Data Capture when ready to receive a set of change data. A subscription window defines the time range of rows in a publication that the subscriber can currently see in subscriber views. The oldest row in the window is called the low boundary; the newest row in the window is called the high boundary. Each subscription has its own subscription window that applies to all of its subscriber views.
  - Notify Change Data Capture when finished with a set of change data.
- Uses SELECT statements to retrieve change data from the subscriber views.
A subscriber has the privileges of the user account under which the subscriber is running, plus any additional privileges that have been granted to the subscriber. In Figure 16-2, the subscriber is interested in a subset of columns that the publisher (in Figure 16-1) has published. Note that the publications shown in Figure 16-2 are represented as change tables in Figure 16-1; this reflects the different terminology used by subscribers and publishers, respectively. The subscriber creates a subscription, sales_promos_list, and two subscriber views (spl_sales and spl_promos) on the SH_SET change set on the DW staging database. Within each subscriber view, the subscriber includes a subset of the columns that were made available by the publisher. Note that because the publisher did not create a change table that includes the PROMO_COST column, there is no way for the subscriber to view change data for that column.
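As a rough sketch of this subscriber workflow using DBMS_CDC_SUBSCRIBE (the column list is illustrative, and the procedure and parameter names should be checked against PL/SQL Packages and Types Reference for your release):

BEGIN
  -- create a subscription on the SH_SET change set
  DBMS_CDC_SUBSCRIBE.CREATE_SUBSCRIPTION(
    change_set_name   => 'SH_SET',
    description       => 'Change data for sales and promotions',
    subscription_name => 'SALES_PROMOS_LIST');

  -- subscribe to a subset of the published SH.SALES columns through a subscriber view
  DBMS_CDC_SUBSCRIBE.SUBSCRIBE(
    subscription_name => 'SALES_PROMOS_LIST',
    source_schema     => 'SH',
    source_table      => 'SALES',
    column_list       => 'PROD_ID, CUST_ID, AMOUNT_SOLD',
    subscriber_view   => 'SPL_SALES');

  -- activate the subscription and open the first subscription window
  DBMS_CDC_SUBSCRIBE.ACTIVATE_SUBSCRIPTION(subscription_name => 'SALES_PROMOS_LIST');
  DBMS_CDC_SUBSCRIBE.EXTEND_WINDOW(subscription_name => 'SALES_PROMOS_LIST');
END;
/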
Figure 16-2 Subscription: sales_promos_list
Change Data Capture:

- Guarantees that each subscriber sees all the changes
- Keeps track of multiple subscribers and gives each subscriber shared access to change data
- Handles all the storage management by automatically removing data from change tables when it is no longer required by any of the subscribers
Note: Oracle provides the previously listed benefits only when the subscriber accesses change data through a subscriber view.
Synchronous
This mode uses triggers on the source database to capture change data. It has no latency because the change data is captured continuously and in real time on the source database. The change tables are populated when DML operations on the source table are committed. While the synchronous mode of Change Data Capture adds overhead to the source database at capture time, this mode can reduce costs (as compared to attempting to extract change data using table differencing or change-value selection) by simplifying the extraction of change data. There is a single, predefined synchronous change source, SYNC_SOURCE, that represents the source database. This is the only synchronous change source. It cannot be altered or dropped. Change tables for this mode of Change Data Capture must reside locally in the source database. Figure 16-3 illustrates the synchronous configuration. Triggers executed after DML operations occur on the source tables populate the change tables in the change sets within the SYNC_SOURCE change source.
Figure 16-3 Synchronous Configuration
Asynchronous
This mode captures change data after the changes have been committed to the source database by using the database redo log files. The asynchronous mode of Change Data Capture is dependent on the level of supplemental logging enabled at the source database. Supplemental logging adds redo logging overhead at the source database, so it must be carefully balanced with the needs of the applications or individuals using Change Data Capture. See "Asynchronous Change Data Capture and Supplemental Logging" on page 16-50 for information on supplemental logging. There are two methods of capturing change data asynchronously, HotLog and AutoLog, as described in the following sections:
HotLog
Change data is captured from the online redo log file on the source database. There is a brief latency between the act of committing source table transactions and the arrival of change data. There is a single, predefined HotLog change source, HOTLOG_SOURCE, that represents the current redo log files of the source database. This is the only HotLog change source. It cannot be altered or dropped. Change tables for this mode of Change Data Capture must reside locally in the source database. Figure 16-4 illustrates the asynchronous HotLog configuration. The Logwriter Process (LGWR) records committed transactions in the online redo log files on the source database. Change Data Capture uses Oracle Streams processes to automatically populate the change tables in the change sets within the HOTLOG_SOURCE change source as newly committed transactions arrive.
Figure 16-4 Asynchronous HotLog Configuration (source database transactions are recorded by LGWR; Oracle Streams local capture populates the change tables, and therefore the subscriber views, in the change sets of the HOTLOG_SOURCE change source)
AutoLog
Change data is captured from a set of redo log files managed by log transport services. Log transport services control the automated transfer of redo log files from the source database to the staging database. Using database initialization parameters (described in "Initialization Parameters for Asynchronous AutoLog Publishing" on page 16-22), the publisher configures log transport services to copy the redo log files from the source database system to the staging database system and to automatically register the redo log files. Change sets are populated automatically as new redo log files arrive. The degree of latency depends on the frequency of redo log switches on the source database. There is no predefined AutoLog change source. The publisher provides information about the source database to create an AutoLog change source. See "Performing Asynchronous AutoLog Publishing" on page 16-35 for details. Change sets for this mode of Change Data Capture can be remote from or local to the source database. Typically, they are remote. Figure 16-5 shows a typical Change Data Capture asynchronous AutoLog configuration in which, when the log switches on the source database, archiver processes archive the redo log file on the source database to the destination specified by the LOG_ARCHIVE_DEST_1 parameter and copy the redo log file to the staging database as specified by the LOG_ARCHIVE_DEST_2 parameter. (Although the figure presents these parameters as LOG_ARCHIVE_DEST_1 and LOG_ARCHIVE_DEST_2, the integer value in these parameter strings can be any value between 1 and 10.) Note that the archiver processes use Oracle Net to send redo data over the network to the remote file server (RFS) process. Transmitting redo log files to a remote destination requires uninterrupted connectivity through Oracle Net. On the staging database, the RFS process writes the redo data to the copied log files in the location specified by the value of the TEMPLATE attribute in the LOG_ARCHIVE_DEST_2 parameter (specified in the source database initialization parameter file). Then, Change Data Capture uses Oracle Streams downstream capture to populate the change tables in the change sets within the AutoLog change source.
Figure 16-5 Asynchronous AutoLog Configuration (on the source database, transactions are recorded by LGWR and archived by ARCn to LOG_ARCHIVE_DEST_1; redo log files travel over Oracle Net to the staging database, where the RFS process writes them to the LOG_ARCHIVE_DEST_2 location and Streams downstream capture populates the change tables, and therefore the subscriber views, in the change sets of the AutoLog change source)
Change Sets
A change set is a logical grouping of change data that is guaranteed to be transactionally consistent and that can be managed as a unit. A change set is a member of one (and only one) change source. A change source can contain one or more change sets. Conceptually, a change set shares the same mode as its change source. For example, an AutoLog change set is a change set contained in an AutoLog change source. When a publisher includes two or more change tables in the same change set, subscribers can perform join operations across the tables represented within the change set and be assured of transactional consistency.
Synchronous: New change data arrives automatically as DML operations on the source tables are committed. Publishers can define new change sets in the predefined SYNC_SOURCE change source or use the predefined change set, SYNC_SET. The SYNC_SET change set cannot be altered or dropped.
Asynchronous HotLog: New change data arrives automatically, on a transaction-by-transaction basis from the current online redo log file. Publishers define change sets in the predefined HOTLOG_SOURCE change source.
Asynchronous AutoLog: New change data arrives automatically, on a log-by-log basis, as log transport services make redo log files available. Publishers define change sets in publisher-defined AutoLog change sources.
Publishers can purge unneeded change data from change tables at the change set level to keep the change tables in the change set from growing larger indefinitely. See "Purging Change Tables of Unneeded Data" on page 16-65 for more information on purging change data.
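For illustration only, a minimal sketch of a change-set-level purge follows. It assumes the DBMS_CDC_PUBLISH.PURGE_CHANGE_SET procedure described in the referenced section and the CHICAGO_DAILY change set used in the examples later in this chapter.

BEGIN
  -- Remove change rows that no subscriber still needs from every
  -- change table in the CHICAGO_DAILY change set (assumed procedure).
  DBMS_CDC_PUBLISH.PURGE_CHANGE_SET(
    change_set_name => 'CHICAGO_DAILY');
END;
/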
Although the source database and the staging database can be the same, this arrangement is rarely used. AutoLog examples in this chapter assume that the source and staging databases are different.
Change Tables
A given change table contains the change data resulting from DML operations performed on a given source table. A change table consists of two things: the change data itself, which is stored in a database table; and the system metadata necessary to maintain the change table, which includes control columns. The publisher specifies the source columns that are to be included in the change table. Typically, for a change table to contain useful data, the publisher needs to include the primary key column in the change table along with any other columns of interest to subscribers. For example, suppose subscribers are interested in changes that occur to the UNIT_COST and the UNIT_PRICE columns in the SH.COSTS table. If the publisher does not include the PROD_ID column in the change table, subscribers will know only that the unit cost and unit price of some products have changed, but will be unable to determine for which products these changes have occurred. There are optional and required control columns. The required control columns are always included in a change table; the optional ones are included if specified by the publisher when creating the change table. Control columns are managed by Change Data Capture. See "Understanding Change Table Control Columns" on page 16-60 and "Understanding TARGET_COLMAP$ and SOURCE_COLMAP$ Values" on page 16-62 for detailed information on control columns.
A view with the ALL prefix allows the user to display all the information accessible to the user, including information from the current user's schema as
well as information from objects in other schemas, if the current user has access to those objects by way of grants of privileges or roles.
A view with the USER prefix allows the user to display all the information from the schema of the user issuing the query without the use of additional special privileges or roles.
Note: To look at all the views (those intended for both the publisher and the subscriber), a user must have the SELECT_CATALOG_ROLE privilege.
Views Intended for Use by Change Data Capture Publishers

CHANGE_SOURCES - Describes existing change sources.
CHANGE_SETS - Describes existing change sets.
CHANGE_TABLES - Describes existing change tables.
DBA_SOURCE_TABLES - Describes all existing source tables in the database.
DBA_PUBLISHED_COLUMNS - Describes all published columns of source tables in the database.
DBA_SUBSCRIPTIONS - Describes all subscriptions.
DBA_SUBSCRIBED_TABLES - Describes all source tables to which any subscriber has subscribed.
DBA_SUBSCRIBED_COLUMNS - Describes the columns of source tables to which any subscriber has subscribed.
Views Intended for Use by Change Data Capture Subscribers

ALL_SOURCE_TABLES - Describes all existing source tables accessible to the current user.
USER_SOURCE_TABLES - Describes all existing source tables owned by the current user.
ALL_PUBLISHED_COLUMNS - Describes all published columns of source tables accessible to the current user.
USER_PUBLISHED_COLUMNS - Describes all published columns of source tables owned by the current user.
ALL_SUBSCRIPTIONS - Describes all subscriptions accessible to the current user.
(Cont.) Views Intended for Use by Change Data Capture Subscribers

USER_SUBSCRIPTIONS - Describes all the subscriptions owned by the current user.
ALL_SUBSCRIBED_TABLES - Describes the source tables to which any subscription accessible to the current user has subscribed.
USER_SUBSCRIBED_TABLES - Describes the source tables to which the current user has subscribed.
ALL_SUBSCRIBED_COLUMNS - Describes the columns of source tables to which any subscription accessible to the current user has subscribed.
USER_SUBSCRIBED_COLUMNS - Describes the columns of source tables to which the current user has subscribed.
See Oracle Database Reference for complete information about these views.
- Gather requirements from the subscribers.
- Determine which source database contains the relevant source tables.
- Choose the capture mode: synchronous, asynchronous HotLog, or asynchronous AutoLog, as described in "Determining the Mode in Which to Capture Data" on page 16-20.
- Ensure that the source and staging database DBAs have set database initialization parameters, as described in "Setting Initialization Parameters for Change Data Capture Publishing" on page 16-21 and "Publishing Change Data" on page 16-27.
- EXECUTE_CATALOG_ROLE privilege
- SELECT_CATALOG_ROLE privilege
- CREATE TABLE and CREATE SESSION privileges
- CONNECT and RESOURCE roles
In addition, for asynchronous HotLog and AutoLog publishing, the publisher must:
- Be granted the CREATE SEQUENCE privilege
- Be granted the DBA role
- Be the GRANTEE specified in a DBMS_STREAMS_AUTH.GRANT_ADMIN_PRIVILEGE() subprogram issued by the staging database DBA
In other words, for asynchronous publishing, the publisher should be configured as an Oracle Streams administrator. See Oracle Streams Concepts and Administration for information on configuring an Oracle Streams administrator.
This example creates a password file with 10 entries, where the password for SYS is mypassword. For redo log file transmission to succeed, the password for the SYS user account must be identical for the source and staging databases.
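The password file itself is typically created with the orapwd utility before the instance is started. The following is a hedged sketch only; the file name and location are assumptions rather than part of the original example.

# Create a password file with 10 entries; the SYS password is mypassword.
# The file name (orapwSID) and its location are assumptions for illustration.
orapwd file=$ORACLE_HOME/dbs/orapwSID password=mypassword entries=10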
- Whether or not the staging database is remote from the source database
- Tolerance for latency between changes made on the source database and changes captured by Change Data Capture
- Performance impact on the source database transactions and overall database performance
Table 16-4 summarizes the factors that might influence the mode decision.
Table 16-4 Factors Influencing Choice of Change Data Capture Mode

Synchronous
  Location of Staging Database: Must be the same as the source database.
  Latency: None - change data is automatically committed as part of the same transaction it reflects.
  Source Database Performance Impact: Adds overhead to source database transactions to perform change data capture.

Asynchronous HotLog
  Location of Staging Database: Must be the same as the source database.
  Latency: Change data is captured from the current online redo log file. Change sets are populated automatically as new committed transactions arrive.
  Source Database Performance Impact: Minimal impact on source database transactions to perform supplemental logging. Additional source database overhead to perform change data capture.

Asynchronous AutoLog
  Location of Staging Database: Can be remote from or the same as the source database.
  Latency: Depends on the frequency of redo log switches on the source database. Change sets are populated automatically as new log files arrive.
  Source Database Performance Impact: Minimal impact on source database transactions to perform supplemental logging. Minimal source database overhead for log transport services. When the source database is remote from the staging database, this mode has the least impact on the source database.
PARALLEL_MAX_SERVERS: (current value) + (5 * (the number of change sets planned))
PROCESSES: (current value) + (7 * (the number of change sets planned))
(Cont.) Source Database Initialization Parameters for Asynchronous HotLog Publishing

SESSIONS: (current value) + (2 * (the number of change sets planned))
STREAMS_POOL_SIZE:
  If the current value of the STREAMS_POOL_SIZE parameter is 50 MB or greater, then set this parameter to: (current value) + ((the number of change sets planned) * (21 MB))
  If the current value of the STREAMS_POOL_SIZE parameter is less than 50 MB, then set the value of this parameter to: 50 MB + ((the number of change sets planned) * (21 MB))
  See Oracle Streams Concepts and Administration for information on how the STREAMS_POOL_SIZE parameter is applied when changed dynamically.
UNDO_RETENTION: 3600
(Cont.) Source Database Initialization Parameters for Asynchronous AutoLog Publishing

LOG_ARCHIVE_DEST_2 (see note 1): This parameter must include the SERVICE, ARCH or LGWR ASYNC, OPTIONAL, NOREGISTER, and REOPEN attributes so that log transport services are configured to copy the redo log files from the source database to the staging database. These attributes are set as follows:
  SERVICE specifies the network name of the staging database.
  ARCH or LGWR ASYNC: ARCH specifies that the archiver process (ARCn) copy the redo log files to the staging database after a source database log switch occurs. LGWR ASYNC specifies that the log writer process (LGWR) copy redo data to the staging database as the redo is generated on the source database. Note that the copied redo data becomes available to Change Data Capture only after a source database log switch occurs.
  OPTIONAL specifies that the copying of a redo log file to the staging database need not succeed before the corresponding online redo log at the source database can be overwritten. This is needed to avoid stalling operations on the source database due to a transmission failure to the staging database. The original redo log file remains available to the source database in either archived or backed up form, if it is needed.
  NOREGISTER specifies that the staging database location is not recorded in the staging database control file.
  REOPEN specifies the minimum number of seconds the archiver process (ARCn) should wait before trying to access the staging database if a previous attempt to access this location failed.
  TEMPLATE defines a directory specification and a format template for the file name used for the redo log files that are copied to the staging database (see note 2).
LOG_ARCHIVE_DEST_STATE_1 (see note 1): ENABLE. Indicates that log transport services can transmit archived redo log files to this destination.
LOG_ARCHIVE_DEST_STATE_2 (see note 1): ENABLE. Indicates that log transport services can transmit redo log files to this destination.
LOG_ARCHIVE_FORMAT: "arch1_%s_%t_%r.dbf". Specifies a format template for the default file name when archiving redo log files (see note 2). The string value (arch1) and the file name extension (.dbf) do not have to be exactly as specified here.
REMOTE_LOGIN_PASSWORDFILE: SHARED

Notes:
1. The integer value in this parameter can be any value between 1 and 10. In this manual, the values 1 and 2 are used. For each LOG_ARCHIVE_DEST_n parameter, there must be a corresponding LOG_ARCHIVE_DEST_STATE_n parameter that specifies the same value for n.
2. In the format template, %t corresponds to the thread number, %s corresponds to the sequence number, and %r corresponds to the resetlogs ID. Together, these ensure that unique names are constructed for the copied redo log files.
Staging Database Initialization Parameters for Asynchronous AutoLog Publishing

COMPATIBLE: 10.1.0
GLOBAL_NAMES: TRUE
JAVA_POOL_SIZE: 50000000
JOB_QUEUE_PROCESSES: 2
PARALLEL_MAX_SERVERS: (current value) + (5 * (the number of change sets planned))
PROCESSES: (current value) + (7 * (the number of change sets planned))
REMOTE_LOGIN_PASSWORDFILE: SHARED
SESSIONS: (current value) + (2 * (the number of change sets planned))
STREAMS_POOL_SIZE:
  If the current value of the STREAMS_POOL_SIZE parameter is 50 MB or greater, then set this parameter to: (current value) + ((the number of change sets planned) * (21 MB))
  If the current value of the STREAMS_POOL_SIZE parameter is less than 50 MB, then set the value of this parameter to: 50 MB + ((the number of change sets planned) * (21 MB))
  See Oracle Streams Concepts and Administration for information on how the STREAMS_POOL_SIZE parameter is applied when changed dynamically.
UNDO_RETENTION: 3600
If Oracle Streams capture or apply parallelism values are increased after the change sets are created, the source or staging database DBA (depending on the mode of capture) must adjust initialization parameter values as described in "Adjusting Initialization Parameter Values When Oracle Streams Values Change" on page 16-25.
The current setting of a particular initialization parameter (in this case, STREAMS_POOL_SIZE) can be displayed using a SQL statement such as the following:
SQL> SHOW PARAMETER STREAMS_POOL_SIZE
- If the database is using a pfile, manually update the pfile with any parameter values you change with the SQL ALTER SYSTEM statement.
- If the database is using an spfile, then parameter values you change with the SQL ALTER SYSTEM statement are automatically changed in the parameter file.
- If the database is using an spfile and a given initialization parameter is not or cannot be changed dynamically (such as the PROCESSES, LOG_ARCHIVE_FORMAT, and REMOTE_LOGIN_PASSWORDFILE parameters), you must change the value with the SQL ALTER SYSTEM statement, and then restart the database.
- If the ORA-04031 error is returned when the DBA attempts to set the JAVA_POOL_SIZE dynamically, he or she should place the parameter in the database initialization parameter file and restart the database.
See Oracle Database Administrator's Guide for information on managing initialization parameters using a pfile or an spfile. In any case, if a parameter is reset dynamically, the new value should also be placed in the initialization parameter file, so that the new value is retained if the database is restarted.
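As a brief illustration of these cases (the values shown are arbitrary and not recommendations), a dynamic parameter such as STREAMS_POOL_SIZE can be changed and recorded in an spfile in one statement, while a static parameter such as PROCESSES can only be recorded in the spfile and takes effect after a restart:

-- Dynamic parameter: takes effect immediately and is saved in the spfile.
ALTER SYSTEM SET STREAMS_POOL_SIZE = 100M SCOPE=BOTH;

-- Static parameter: saved in the spfile only; restart the database to apply.
ALTER SYSTEM SET PROCESSES = 300 SCOPE=SPFILE;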
Asynchronous Change Data Capture uses an Oracle Streams configuration for each change set. This configuration consists of a Streams capture process and a Streams apply process, with an accompanying queue and queue table. Each Streams configuration uses additional processes, parallel execution servers, and memory. Oracle Streams capture and apply processes each have a parallelism parameter that is used to improve performance. When a publisher first creates a change set, its capture parallelism value and apply parallelism value are each 1. If desired, a publisher can increase one or both of these values using Streams interfaces. If Oracle Streams capture parallelism and apply parallelism values are increased after the change sets are created, the staging database DBA must adjust initialization parameter values accordingly. Example 16-1 and Example 16-2 demonstrate how to obtain the capture parallelism and apply parallelism values for change set CHICAGO_DAILY. By default, each parallelism value is 1, so the amount by which a given parallelism value has been increased is the returned value minus 1.
Example 16-1 Obtaining the Oracle Streams Capture Parallelism Value

SELECT cp.value
FROM DBA_CAPTURE_PARAMETERS cp, CHANGE_SETS cset
WHERE cset.SET_NAME = 'CHICAGO_DAILY' AND
      cset.CAPTURE_NAME = cp.CAPTURE_NAME AND
      cp.PARAMETER = 'PARALLELISM';

Example 16-2 Obtaining the Oracle Streams Apply Parallelism Value

SELECT ap.value
FROM DBA_APPLY_PARAMETERS ap, CHANGE_SETS cset
WHERE cset.SET_NAME = 'CHICAGO_DAILY' AND
      cset.APPLY_NAME = ap.APPLY_NAME AND
      ap.PARAMETER = 'PARALLELISM';
The staging database DBA must adjust the staging database initialization parameters as described in the following list to accommodate the parallel execution servers and other processes and memory required for asynchronous Change Data Capture:
PARALLEL_MAX_SERVERS: For each change set for which Oracle Streams capture or apply parallelism values were increased, increase the value of this parameter by the sum of increased Streams parallelism values. For example, if the statement in Example 16-1 returns a value of 2, and the statement in Example 16-2 returns a value of 3, then the staging database DBA should increase the value of the PARALLEL_MAX_SERVERS parameter by (2-1) + (3-1), or 3, for the CHICAGO_DAILY change set. If the Streams capture or apply parallelism values have increased for other change sets, increases for those change sets must also be made.
PROCESSES: For each change set for which Oracle Streams capture or apply parallelism values were changed, increase the value of this parameter by the sum of increased Streams parallelism values. See the previous list item, PARALLEL_MAX_SERVERS, for an example.
STREAMS_POOL_SIZE: For each change set for which Oracle Streams capture or apply parallelism values were changed, increase the value of this parameter by (10 MB * (the increased capture parallelism value)) + (1 MB * (the increased apply parallelism value)). For example, if the statement in Example 16-1 returns a value of 2, and the statement in Example 16-2 returns a value of 3, then the staging database DBA should increase the value of the STREAMS_POOL_SIZE parameter by (10 MB * (2-1)) + (1 MB * (3-1)), or 12 MB, for the CHICAGO_DAILY change set. If the Oracle Streams capture or apply parallelism values have increased for other change sets, increases for those change sets must also be made. A sketch of the corresponding ALTER SYSTEM statements follows this list. See Oracle Streams Concepts and Administration for more information on Streams capture parallelism and apply parallelism values. See Oracle Database Reference for more information about database initialization parameters.
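The following is a minimal sketch of applying the CHICAGO_DAILY increases computed in the preceding list. The starting values shown in the comments are assumptions used only for illustration.

-- Assumed current values: PARALLEL_MAX_SERVERS = 150, PROCESSES = 300,
-- STREAMS_POOL_SIZE = 100M; the increases are 3, 3, and 12 MB respectively.
ALTER SYSTEM SET PARALLEL_MAX_SERVERS = 153 SCOPE=BOTH;
ALTER SYSTEM SET STREAMS_POOL_SIZE = 112M SCOPE=BOTH;
ALTER SYSTEM SET PROCESSES = 303 SCOPE=SPFILE;  -- static; restart the database to apply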
- Performing Synchronous Publishing
- Performing Asynchronous HotLog Publishing
- Performing Asynchronous AutoLog Publishing
Note that the publisher cannot create change tables on source tables owned by SYS or SYSTEM because triggers will not fire and therefore changes will not be captured. This example shows how to create a change set. If the publisher wants to use the predefined SYNC_SET, he or she should skip Step 3 and specify SYNC_SET as the change set name in the remaining steps. This example assumes that the publisher and the source database DBA are two different people. Step 1 Source Database DBA: Set the JAVA_POOL_SIZE parameter. The source database DBA sets the database initialization parameters, as described in "Setting Initialization Parameters for Change Data Capture Publishing" on page 16-21.
java_pool_size = 50000000
Step 2 Source Database DBA: Create and grant privileges to the publisher. The source database DBA creates a user (for example, cdcpub) to serve as the Change Data Capture publisher and grants the necessary privileges to the publisher so that he or she can perform the operations needed to create Change Data Capture change sets and change tables on the source database, as described in "Creating a User to Serve As a Publisher" on page 16-18. This example assumes that the tablespace ts_cdcpub has already been created.
CREATE USER cdcpub IDENTIFIED BY cdcpub DEFAULT TABLESPACE ts_cdcpub
  QUOTA UNLIMITED ON SYSTEM
  QUOTA UNLIMITED ON SYSAUX;
GRANT CREATE SESSION TO cdcpub;
GRANT CREATE TABLE TO cdcpub;
GRANT CREATE TABLESPACE TO cdcpub;
GRANT UNLIMITED TABLESPACE TO cdcpub;
GRANT SELECT_CATALOG_ROLE TO cdcpub;
GRANT EXECUTE_CATALOG_ROLE TO cdcpub;
GRANT CONNECT, RESOURCE TO cdcpub;
Step 3 Staging Database Publisher: Create a change set. The publisher uses the DBMS_CDC_PUBLISH.CREATE_CHANGE_SET procedure on the staging database to create change sets. The following example shows how to create a change set called CHICAGO_DAILY:
BEGIN
  DBMS_CDC_PUBLISH.CREATE_CHANGE_SET(
    change_set_name    => 'CHICAGO_DAILY',
    description        => 'Change set for job history info',
    change_source_name => 'SYNC_SOURCE');
END;
/
The change set captures changes from the predefined change source SYNC_SOURCE. Because begin_date and end_date parameters cannot be specified for synchronous change sets, capture begins at the earliest available change data and continues capturing change data indefinitely. Step 4 Staging Database Publisher: Create a change table. The publisher uses the DBMS_CDC_PUBLISH.CREATE_CHANGE_TABLE procedure to create change tables. The publisher can set the options_string field of the DBMS_CDC_PUBLISH.CREATE_CHANGE_TABLE procedure to have more control over the physical properties and tablespace properties of the change table. The options_string field can contain any option, except partitioning, that is available in the CREATE TABLE statement. The following example creates a change table that captures changes that occur on a source table. The example uses the sample table HR.JOB_HISTORY as the source table. It assumes that the publisher has already created the TS_CHICAGO_DAILY tablespace.
BEGIN
  DBMS_CDC_PUBLISH.CREATE_CHANGE_TABLE(
    owner             => 'cdcpub',
    change_table_name => 'jobhist_ct',
    change_set_name   => 'CHICAGO_DAILY',
    source_schema     => 'HR',
    source_table      => 'JOB_HISTORY',
    column_type_list  => 'EMPLOYEE_ID NUMBER(6), START_DATE DATE, END_DATE DATE, JOB_ID VARCHAR2(10), DEPARTMENT_ID NUMBER(4)',
    capture_values    => 'both',
    rs_id             => 'y',
    row_id            => 'n',
    user_id           => 'n',
    timestamp         => 'n',
    object_id         => 'n',
    source_colmap     => 'y',
    target_colmap     => 'y',
    options_string    => 'TABLESPACE TS_CHICAGO_DAILY');
END;
/
This statement creates a change table named jobhist_ct within the change set CHICAGO_DAILY. The column_type_list parameter identifies the columns captured by the change table. The source_schema and source_table parameters identify the schema and source table that reside in the source database. The capture_values setting in the example indicates that for update operations, the change data will contain two separate rows for each row that changed: one row will contain the row values before the update occurred, and the other row will contain the row values after the update occurred. Step 5 Staging Database Publisher: Grant access to subscribers. The publisher controls subscriber access to change data by granting and revoking the SELECT privilege on change tables for users and roles. The publisher grants access to specific change tables. Without this step, a subscriber cannot access any change data. This example assumes that user subscriber1 already exists.
GRANT SELECT ON cdcpub.jobhist_ct TO subscriber1;
The Change Data Capture synchronous system is now ready for subscriber1 to create subscriptions.
Step 1 Source Database DBA: Set the database initialization parameters. The source database DBA sets the database initialization parameters, as described in "Setting Initialization Parameters for Change Data Capture Publishing" on page 16-21. In this example, one change set will be defined and the current value of the STREAMS_POOL_SIZE parameter is 50 MB or greater.
compatible = 10.1.0
java_pool_size = 50000000
job_queue_processes = 2
parallel_max_servers = <current value> + 5
processes = <current value> + 7
sessions = <current value> + 2
streams_pool_size = <current value> + 21 MB
undo_retention = 3600
Step 2 Source Database DBA: Alter the source database. The source database DBA performs the following three tasks. The second is required. The first and third are optional, but recommended. It is assumed that the database is currently running in ARCHIVELOG mode.
1. Place the database into FORCE LOGGING logging mode to protect against unlogged direct write operations in the source database that cannot be captured by asynchronous Change Data Capture:
ALTER DATABASE FORCE LOGGING;
2. Enable supplemental logging. Supplemental logging places additional column data into a redo log file whenever an UPDATE operation is performed. Minimally, database-level minimal supplemental logging must be enabled for any Change Data Capture source database:
ALTER DATABASE ADD SUPPLEMENTAL LOG DATA;
3. Create an unconditional log group on all columns to be captured in the source table. Source table columns that are unchanged and are not in an unconditional log group will be null in the change table, instead of reflecting their actual source table values. (This example captures rows in the HR.JOB_HISTORY table only. The source database DBA would repeat this step for each source table for which change tables will be created.)
ALTER TABLE HR.JOB_HISTORY ADD SUPPLEMENTAL LOG GROUP log_group_jobhist (EMPLOYEE_ID, START_DATE, END_DATE, JOB_ID, DEPARTMENT_ID) ALWAYS;
If you intend to capture all the column values in a row whenever a column in that row is updated, you can use the following statement instead of listing each column one-by-one in the ALTER TABLE statement. However, do not use this form of the ALTER TABLE statement if all columns are not needed. Logging all columns incurs more overhead than logging selected columns.
ALTER TABLE HR.JOB_HISTORY ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;
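The DBA can confirm these supplemental logging settings from the data dictionary. The following queries are a hedged sketch using standard views and are not part of the original example.

-- Confirm that database-level minimal supplemental logging is enabled.
SELECT SUPPLEMENTAL_LOG_DATA_MIN FROM V$DATABASE;

-- List the unconditional (ALWAYS) log groups defined on the source table.
SELECT LOG_GROUP_NAME, ALWAYS
FROM   DBA_LOG_GROUPS
WHERE  OWNER = 'HR' AND TABLE_NAME = 'JOB_HISTORY';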
See Oracle Database Administrator's Guide for information about running a database in ARCHIVELOG mode. See "Asynchronous Change Data Capture and Supplemental Logging" on page 16-50 for more information on enabling supplemental logging. Step 3 Source Database DBA: Create and grant privileges to the publisher. The source database DBA creates a user (for example, cdcpub) to serve as the Change Data Capture publisher and grants the necessary privileges to the publisher so that he or she can perform the underlying Oracle Streams operations needed to create Change Data Capture change sets and change tables on the source database, as described in "Creating a User to Serve As a Publisher" on page 16-18. This example assumes that the ts_cdcpub tablespace has already been created. For example:
CREATE USER cdcpub IDENTIFIED BY cdcpub DEFAULT TABLESPACE ts_cdcpub
  QUOTA UNLIMITED ON SYSTEM
  QUOTA UNLIMITED ON SYSAUX;
GRANT CREATE SESSION TO cdcpub;
GRANT CREATE TABLE TO cdcpub;
GRANT CREATE TABLESPACE TO cdcpub;
GRANT UNLIMITED TABLESPACE TO cdcpub;
GRANT SELECT_CATALOG_ROLE TO cdcpub;
GRANT EXECUTE_CATALOG_ROLE TO cdcpub;
GRANT CREATE SEQUENCE TO cdcpub;
GRANT CONNECT, RESOURCE, DBA TO cdcpub;
EXECUTE DBMS_STREAMS_AUTH.GRANT_ADMIN_PRIVILEGE(GRANTEE => 'cdcpub');
Note that for HotLog Change Data Capture, the source database and the staging database are the same database. Step 4 Source Database DBA: Prepare the source tables. The source database DBA must prepare the source tables on the source database for asynchronous Change Data Capture by instantiating each source table so that the underlying Oracle Streams environment records the information it needs to capture
each source table's changes. The source table structure and the column datatypes must be supported by Change Data Capture. See "Datatypes and Table Structures Supported for Asynchronous Change Data Capture" on page 16-51 for more information.
BEGIN
  DBMS_CAPTURE_ADM.PREPARE_TABLE_INSTANTIATION(TABLE_NAME => 'hr.job_history');
END;
/
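To confirm that the table has been prepared, the DBA can query the DBA_CAPTURE_PREPARED_TABLES view. This check is a hedged sketch and is not part of the original procedure.

-- A row for HR.JOB_HISTORY indicates the table is prepared for instantiation.
SELECT TABLE_OWNER, TABLE_NAME, SCN
FROM   DBA_CAPTURE_PREPARED_TABLES
WHERE  TABLE_OWNER = 'HR' AND TABLE_NAME = 'JOB_HISTORY';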
Step 5 Staging Database Publisher: Create change sets. The publisher uses the DBMS_CDC_PUBLISH.CREATE_CHANGE_SET procedure on the staging database to create change sets. Note that when Change Data Capture creates a change set, its associated Oracle Streams capture and apply processes are also created (but not started). The following example creates a change set called CHICAGO_DAILY that captures changes starting today, and stops capturing change data 5 days from now.
BEGIN
  DBMS_CDC_PUBLISH.CREATE_CHANGE_SET(
    change_set_name    => 'CHICAGO_DAILY',
    description        => 'Change set for job history info',
    change_source_name => 'HOTLOG_SOURCE',
    stop_on_ddl        => 'y',
    begin_date         => sysdate,
    end_date           => sysdate+5);
END;
/
The change set captures changes from the predefined HOTLOG_SOURCE change source. Step 6 Staging Database Publisher: Create the change tables that will contain the changes to the source tables. The publisher uses the DBMS_CDC_PUBLISH.CREATE_CHANGE_TABLE procedure on the staging database to create change tables. The publisher creates one or more change tables for each source table to be published, specifies which columns should be included, and specifies the combination of before and after images of the change data to capture.
The following example creates a change table on the staging database that captures changes made to a source table on the source database. The example uses the sample table HR.JOB_HISTORY as the source table.
BEGIN
  DBMS_CDC_PUBLISH.CREATE_CHANGE_TABLE(
    owner             => 'cdcpub',
    change_table_name => 'job_hist_ct',
    change_set_name   => 'CHICAGO_DAILY',
    source_schema     => 'HR',
    source_table      => 'JOB_HISTORY',
    column_type_list  => 'EMPLOYEE_ID NUMBER(6), START_DATE DATE, END_DATE DATE, JOB_ID VARCHAR2(10), DEPARTMENT_ID NUMBER(4)',
    capture_values    => 'both',
    rs_id             => 'y',
    row_id            => 'n',
    user_id           => 'n',
    timestamp         => 'n',
    object_id         => 'n',
    source_colmap     => 'n',
    target_colmap     => 'y',
    options_string    => 'TABLESPACE TS_CHICAGO_DAILY');
END;
/
This statement creates a change table named job_hist_ct within change set CHICAGO_DAILY. The column_type_list parameter identifies the columns to be captured by the change table. The source_schema and source_table parameters identify the schema and source table that reside on the source database. The capture_values setting in this statement indicates that for update operations, the change data will contain two separate rows for each row that changed: one row will contain the row values before the update occurred and the other row will contain the row values after the update occurred. The options_string parameter in this statement specifies a tablespace for the change table. (This example assumes that the publisher previously created the TS_CHICAGO_DAILY tablespace.) Step 7 Staging Database Publisher: Enable the change set. Because asynchronous change sets are always disabled when they are created, the publisher must alter the change set to enable it. The Oracle Streams capture and apply processes are started when the change set is enabled.
BEGIN
  DBMS_CDC_PUBLISH.ALTER_CHANGE_SET(
    change_set_name => 'CHICAGO_DAILY',
    enable_capture  => 'y');
END;
/
Step 8 Staging Database Publisher: Grant access to subscribers. The publisher controls subscriber access to change data by granting and revoking the SELECT privilege on change tables for users and roles. The publisher grants access to specific change tables. Without this step, a subscriber cannot access change data. This example assumes that user subscriber1 already exists.
GRANT SELECT ON cdcpub.job_hist_ct TO subscriber1;
The Change Data Capture Asynchronous HotLog system is now ready for subscriber1 to create subscriptions.
1. The source database DBA configures Oracle Net so that the source database can communicate with the staging database. (See Oracle Net Services Administrator's Guide for information about Oracle Net.)
2. The source database DBA sets the database initialization parameters on the source database as described in "Setting Initialization Parameters for Change Data Capture Publishing" on page 16-21. In the following code example, stagingdb is the network name of the staging database:
compatible = 10.1.0
java_pool_size = 50000000
log_archive_dest_1="location=/oracle/dbs mandatory reopen=5"
log_archive_dest_2 = "service=stagingdb arch optional noregister reopen=5 template=/usr/oracle/dbs/arch1_%s_%t_%r.dbf"
log_archive_dest_state_1 = enable
log_archive_dest_state_2 = enable
log_archive_format="arch1_%s_%t_%r.dbf"
remote_login_passwordfile=shared
See Oracle Data Guard Concepts and Administration for information on log transport services. Step 2 Staging Database DBA: Set the database initialization parameters. The staging database DBA sets the database initialization parameters on the staging database, as described in "Setting Initialization Parameters for Change Data Capture Publishing" on page 16-21. In this example, one change set will be defined and the current value for the STREAMS_POOL_SIZE parameter is 50 MB or greater:
compatible = 10.1.0
global_names = true
java_pool_size = 50000000
job_queue_processes = 2
parallel_max_servers = <current value> + 5
processes = <current value> + 7
remote_login_passwordfile = shared
sessions = <current value> + 2
streams_pool_size = <current value> + 21 MB
undo_retention = 3600
Step 3 Source Database DBA: Alter the source database. The source database DBA performs the following three tasks. The second is required. The first and third are optional, but recommended. It is assumed that the database is currently running in ARCHIVELOG mode.
1. Place the database into FORCE LOGGING logging mode to protect against unlogged direct writes in the source database that cannot be captured by asynchronous Change Data Capture:
ALTER DATABASE FORCE LOGGING;
2. Enable supplemental logging. Supplemental logging places additional column data into a redo log file whenever an update operation is performed.
ALTER DATABASE ADD SUPPLEMENTAL LOG DATA;
3. Create an unconditional log group on all columns to be captured in the source table. Source table columns that are unchanged and are not in an unconditional log group will be null in the change table, instead of reflecting their actual source table values. (This example captures rows in the HR.JOB_HISTORY table only. The source database DBA would repeat this step for each source table for which change tables will be created.)
ALTER TABLE HR.JOB_HISTORY ADD SUPPLEMENTAL LOG GROUP log_group_job_hist (EMPLOYEE_ID, START_DATE, END_DATE, JOB_ID, DEPARTMENT_ID) ALWAYS;
If you intend to capture all the column values in a row whenever a column in that row is updated, you can use the following statement instead of listing each column one-by-one in the ALTER TABLE statement. However, do not use this form of the ALTER TABLE statement if all columns are not needed. Logging all columns incurs more overhead than logging selected columns.
ALTER TABLE HR.JOB_HISTORY ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;
See Oracle Database Administrator's Guide for information about running a database in ARCHIVELOG mode. See "Asynchronous Change Data Capture and Supplemental Logging" on page 16-50 for more information on enabling supplemental logging. Step 4 Staging Database DBA: Create and grant privileges to the publisher. The staging database DBA creates a user (for example, cdcpub) to serve as the Change Data Capture publisher and grants the necessary privileges to the publisher so that he or she can perform the underlying Oracle Streams operations needed to create Change Data Capture change sources, change sets, and change tables on the staging database, as described in "Creating a User to Serve As a Publisher" on page 16-18. For example:
CREATE USER cdcpub IDENTIFIED BY cdcpub DEFAULT TABLESPACE ts_cdcpub
  QUOTA UNLIMITED ON SYSTEM
  QUOTA UNLIMITED ON SYSAUX;
GRANT CREATE SESSION TO cdcpub;
GRANT CREATE TABLE TO cdcpub;
GRANT CREATE TABLESPACE TO cdcpub;
GRANT UNLIMITED TABLESPACE TO cdcpub;
GRANT SELECT_CATALOG_ROLE TO cdcpub;
GRANT EXECUTE_CATALOG_ROLE TO cdcpub;
GRANT CONNECT, RESOURCE, DBA TO cdcpub;
GRANT CREATE SEQUENCE TO cdcpub;
EXECUTE DBMS_STREAMS_AUTH.GRANT_ADMIN_PRIVILEGE(grantee => 'cdcpub');
Step 5 Source Database DBA: Build the LogMiner data dictionary. The source database DBA builds a LogMiner data dictionary at the source database so that log transport services can transport this data dictionary to the staging database. This LogMiner data dictionary build provides the table definitions as they were just prior to beginning to capture change data. Change Data Capture automatically updates the data dictionary with any source table data definition language (DDL) operations that are made during the course of change data capture to ensure that the dictionary is always synchronized with the source database tables. When building the LogMiner data dictionary, the source database DBA should get the SCN value of the data dictionary build. In Step 8, when the publisher creates a change source, he or she will need to provide this value as the first_scn parameter.
SET SERVEROUTPUT ON
VARIABLE f_scn NUMBER;
BEGIN
  :f_scn := 0;
  DBMS_CAPTURE_ADM.BUILD(:f_scn);
  DBMS_OUTPUT.PUT_LINE('The first_scn value is ' || :f_scn);
END;
/
The first_scn value is 207722
For asynchronous AutoLog publishing to work, it is critical that the source database DBA build the data dictionary before the source tables are prepared. The source database DBA must be careful to follow Step 5 and Step 6 in the order they are presented here. See Oracle Streams Concepts and Administration for more information on the LogMiner data dictionary.
Step 6 Source Database DBA: Prepare the source tables. The source database DBA must prepare the source tables on the source database for asynchronous Change Data Capture by instantiating each source table so that the underlying Oracle Streams environment records the information it needs to capture each source table's changes. The source table structure and the column datatypes must be supported by Change Data Capture. See "Datatypes and Table Structures Supported for Asynchronous Change Data Capture" on page 16-51 for more information.
BEGIN
  DBMS_CAPTURE_ADM.PREPARE_TABLE_INSTANTIATION(TABLE_NAME => 'hr.job_history');
END;
/
Step 7 Source Database DBA: Get the global name of the source database. In Step 8, the publisher will need to reference the global name of the source database. The source database DBA can query the GLOBAL_NAME column in the GLOBAL_NAME view on the source database to retrieve this information for the publisher:
SELECT GLOBAL_NAME FROM GLOBAL_NAME;

GLOBAL_NAME
----------------------------------------------------------------------------
HQDB
Step 8 Staging Database Publisher: Identify each change source database and create the change sources. The publisher uses the DBMS_CDC_PUBLISH.CREATE_AUTOLOG_CHANGE_SOURCE procedure on the staging database to create change sources. The process of managing the capture system begins with the creation of a change source. A change source describes the source database from which the data will be captured, and manages the relationship between the source database and the staging database. A change source always specifies the SCN of a data dictionary build from the source database as its first_scn parameter. The publisher gets the SCN of the data dictionary build and the global database name from the source database DBA (as shown in Step 5 and Step 7, respectively). If the publisher cannot get the value to use for the first_scn parameter value from the source database DBA, then, with the appropriate privileges, he or she can query the V$ARCHIVED_LOG view on the source database to determine the value. This is
described in the DBMS_CDC_PUBLISH chapter of the PL/SQL Packages and Types Reference. On the staging database, the publisher creates the AutoLog change source and specifies the global name as the source_database parameter value and the SCN of the data dictionary build as the first_scn parameter value:
BEGIN
  DBMS_CDC_PUBLISH.CREATE_AUTOLOG_CHANGE_SOURCE(
    change_source_name => 'CHICAGO',
    description        => 'test source',
    source_database    => 'HQDB',
    first_scn          => 207722);
END;
/
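If the publisher must determine the first_scn value directly, a query along the following lines on the source database locates archived redo log files that contain the beginning of a LogMiner data dictionary build. This is a hedged sketch; the authoritative query is in the DBMS_CDC_PUBLISH documentation referenced earlier.

-- Candidate first_scn values come from logs in which a dictionary build begins.
SELECT DISTINCT FIRST_CHANGE#, NAME
FROM   V$ARCHIVED_LOG
WHERE  DICTIONARY_BEGIN = 'YES';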
Step 9 Staging Database Publisher: Create change sets. The publisher uses the DBMS_CDC_PUBLISH.CREATE_CHANGE_SET procedure on the staging database to create change sets. The publisher can optionally provide beginning and ending dates to indicate where to begin and end the data capture. Note that when Change Data Capture creates a change set, its associated Oracle Streams capture and apply processes are also created (but not started). The following example shows how to create a change set called CHICAGO_DAILY that captures changes starting today, and continues capturing change data indefinitely. (If, at some time in the future, the publisher decides that he or she wants to stop capturing change data for this change set, he or she should disable the change set and then drop it.)
BEGIN
  DBMS_CDC_PUBLISH.CREATE_CHANGE_SET(
    change_set_name    => 'CHICAGO_DAILY',
    description        => 'change set for job history info',
    change_source_name => 'CHICAGO',
    stop_on_ddl        => 'y');
END;
/
Step 10 Staging Database Publisher: Create the change tables. The publisher uses the DBMS_CDC_PUBLISH.CREATE_CHANGE_TABLE procedure on the staging database to create change tables.
The publisher creates one or more change tables for each source table to be published, specifies which columns should be included, and specifies the combination of before and after images of the change data to capture. The publisher can set the options_string field of the DBMS_CDC_PUBLISH.CREATE_CHANGE_TABLE procedure to have more control over the physical properties and tablespace properties of the change tables. The options_string field can contain any option available (except partitioning) on the CREATE TABLE statement. In this example, it specifies a tablespace for the change table. (This example assumes that the publisher previously created the TS_CHICAGO_DAILY tablespace.) The following example creates a change table on the staging database that captures changes made to a source table in the source database. The example uses the sample table HR.JOB_HISTORY.
BEGIN
  DBMS_CDC_PUBLISH.CREATE_CHANGE_TABLE(
    owner             => 'cdcpub',
    change_table_name => 'JOB_HIST_CT',
    change_set_name   => 'CHICAGO_DAILY',
    source_schema     => 'HR',
    source_table      => 'JOB_HISTORY',
    column_type_list  => 'EMPLOYEE_ID NUMBER(6), START_DATE DATE, END_DATE DATE, JOB_ID VARCHAR2(10), DEPARTMENT_ID NUMBER(4)',
    capture_values    => 'both',
    rs_id             => 'y',
    row_id            => 'n',
    user_id           => 'n',
    timestamp         => 'n',
    object_id         => 'n',
    source_colmap     => 'n',
    target_colmap     => 'y',
    options_string    => 'TABLESPACE TS_CHICAGO_DAILY');
END;
/
This example creates a change table named job_hist_ct within change set CHICAGO_DAILY. The column_type_list parameter identifies the columns captured by the change table. The source_schema and source_table parameters identify the schema and source table that reside in the source database, not the staging database. The capture_values setting in the example indicates that for update operations, the change data will contain two separate rows for each row that changed: one row
will contain the row values before the update occurred and the other row will contain the row values after the update occurred. Step 11 Staging Database Publisher: Enable the change set. Because asynchronous change sets are always disabled when they are created, the publisher must alter the change set to enable it. The Oracle Streams capture and apply processes are started when the change set is enabled.
BEGIN
  DBMS_CDC_PUBLISH.ALTER_CHANGE_SET(
    change_set_name => 'CHICAGO_DAILY',
    enable_capture  => 'y');
END;
/
Step 12 Source Database DBA: Switch the redo log files at the source database. To begin capturing data, a log file must be archived. The source database DBA can initiate the process by switching the current redo log file:
ALTER SYSTEM SWITCH LOGFILE;
Step 13 Staging Database Publisher: Grant access to subscribers. The publisher controls subscriber access to change data by granting and revoking the SQL SELECT privilege on change tables for users and roles on the staging database. The publisher grants access to specific change tables. Without this step, a subscriber cannot access any change data. This example assumes that user subscriber1 already exists.
GRANT SELECT ON cdcpub.job_hist_ct TO subscriber1;
The Change Data Capture asynchronous AutoLog system is now ready for subscriber1 to create subscriptions.
A subscriber may subscribe to any source tables for which the publisher has created one or more change tables by doing one of the following:
- Specifying the source tables and columns of interest. When there are multiple publications that contain the columns of interest, then Change Data Capture selects one on behalf of the user.
- Specifying the publication IDs and columns of interest. When there are multiple publications on a single source table and these publications share some columns, the subscriber should specify publication IDs (rather than source tables) if any of the shared columns will be used in a single subscription.
The following steps provide an example to demonstrate the second scenario: Step 1 Find the source tables for which the subscriber has access privileges. The subscriber queries the ALL_SOURCE_TABLES view to see all the published source tables for which the subscriber has access privileges:
SELECT * FROM ALL_SOURCE_TABLES;

SOURCE_SCHEMA_NAME             SOURCE_TABLE_NAME
------------------------------ ------------------------------
HR                             JOB_HISTORY
Step 2 Find the change set names and columns for which the subscriber has access privileges. The subscriber queries the ALL_PUBLISHED_COLUMNS view to see all the change sets, columns, and publication IDs for the HR.JOB_HISTORY table for which the subscriber has access privileges:
SELECT UNIQUE CHANGE_SET_NAME, COLUMN_NAME, PUB_ID
FROM ALL_PUBLISHED_COLUMNS
WHERE SOURCE_SCHEMA_NAME = 'HR' AND SOURCE_TABLE_NAME = 'JOB_HISTORY';

CHANGE_SET_NAME  COLUMN_NAME        PUB_ID
---------------- ------------------ ------
CHICAGO_DAILY    DEPARTMENT_ID       34883
CHICAGO_DAILY    EMPLOYEE_ID         34883
CHICAGO_DAILY    END_DATE            34883
CHICAGO_DAILY    JOB_ID              34883
CHICAGO_DAILY    START_DATE          34883
Step 3 Create a subscription. The subscriber calls the DBMS_CDC_SUBSCRIBE.CREATE_SUBSCRIPTION procedure to create a subscription. The following example shows how the subscriber identifies the change set of interest (CHICAGO_DAILY), and then specifies a unique subscription name that will be used throughout the life of the subscription:
BEGIN
  DBMS_CDC_SUBSCRIBE.CREATE_SUBSCRIPTION(
    change_set_name   => 'CHICAGO_DAILY',
    description       => 'Change data for JOB_HISTORY',
    subscription_name => 'JOBHIST_SUB');
END;
/
Step 4 Subscribe to a source table and the columns in the source table. The subscriber calls the DBMS_CDC_SUBSCRIBE.SUBSCRIBE procedure to specify which columns of the source tables are of interest to the subscriber. A subscription can contain one or more source tables referenced by the same change set. In the following example, the subscriber wants to see the EMPLOYEE_ID, START_DATE, END_DATE, and JOB_ID columns from the JOB_HISTORY table. Because all these columns are contained in the same publication (and the subscriber has privileges to access that publication) as shown in the query in Step 2, the following call can be used:
BEGIN
  DBMS_CDC_SUBSCRIBE.SUBSCRIBE(
    subscription_name => 'JOBHIST_SUB',
    source_schema     => 'HR',
    source_table      => 'JOB_HISTORY',
    column_list       => 'EMPLOYEE_ID, START_DATE, END_DATE, JOB_ID',
    subscriber_view   => 'JOBHIST_VIEW');
END;
/
However, assume that for security reasons the publisher has not created a single change table that includes all these columns. Suppose that instead of the results shown in Step 2, the query of the ALL_PUBLISHED_COLUMNS view shows that the columns of interest are included in multiple publications as shown in the following example:
SELECT UNIQUE CHANGE_SET_NAME, SOURCE_TABLE_NAME, COLUMN_NAME, PUB_ID
FROM ALL_PUBLISHED_COLUMNS
WHERE SOURCE_SCHEMA_NAME = 'HR' AND SOURCE_TABLE_NAME = 'JOB_HISTORY';

CHANGE_SET_NAME  SOURCE_TABLE_NAME  COLUMN_NAME        PUB_ID
---------------- ------------------ ------------------ ------
CHICAGO_DAILY    JOB_HISTORY        DEPARTMENT_ID       34883
CHICAGO_DAILY    JOB_HISTORY        EMPLOYEE_ID         34883
CHICAGO_DAILY    JOB_HISTORY        END_DATE            34885
CHICAGO_DAILY    JOB_HISTORY        JOB_ID              34883
CHICAGO_DAILY    JOB_HISTORY        START_DATE          34885
CHICAGO_DAILY    JOB_HISTORY        EMPLOYEE_ID         34885
This returned data shows that the EMPLOYEE_ID column is included in both publication 34883 and publication 34885. A single subscribe call must specify columns available in a single publication. Therefore, if the subscriber wants to subscribe to columns in both publications, using EMPLOYEE_ID to join across the subscriber views, then the subscriber must use two calls, each specifying a different publication ID:
BEGIN
  DBMS_CDC_SUBSCRIBE.SUBSCRIBE(
    subscription_name => 'MULTI_PUB',
    publication_id    => 34885,
    column_list       => 'EMPLOYEE_ID, START_DATE, END_DATE',
    subscriber_view   => 'job_dates');

  DBMS_CDC_SUBSCRIBE.SUBSCRIBE(
    subscription_name => 'MULTI_PUB',
    publication_id    => 34883,
    column_list       => 'EMPLOYEE_ID, JOB_ID',
    subscriber_view   => 'job_type');
END;
/
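Once the MULTI_PUB subscription has been activated and its window extended (as shown for JOBHIST_SUB in the following steps), the subscriber could join the two subscriber views on the shared EMPLOYEE_ID column. This is a simplified sketch; a production query would typically also match rows on the control columns requested with rs_id => 'y' so that before and after images line up.

SELECT d.EMPLOYEE_ID, d.START_DATE, d.END_DATE, t.JOB_ID
FROM   job_dates d, job_type t
WHERE  d.EMPLOYEE_ID = t.EMPLOYEE_ID;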
Note that each DBMS_CDC_SUBSCRIBE.SUBSCRIBE call specifies a unique subscriber view. Step 5 Activate the subscription. The subscriber calls the DBMS_CDC_SUBSCRIBE.ACTIVATE_SUBSCRIPTION procedure to activate the subscription. A subscriber calls this procedure when finished subscribing to source tables (or publications), and ready to receive change data. Whether subscribing to one or
multiple source tables, the subscriber needs to call the ACTIVATE_SUBSCRIPTION procedure only once. The ACTIVATE_SUBSCRIPTION procedure creates empty subscriber views. At this point, no additional source tables can be added to the subscription.
BEGIN
  DBMS_CDC_SUBSCRIBE.ACTIVATE_SUBSCRIPTION(
    subscription_name => 'JOBHIST_SUB');
END;
/
Step 6 Get the next set of change data. The subscriber calls the DBMS_CDC_SUBSCRIBE.EXTEND_WINDOW procedure to get the next available set of change data. This sets the high boundary of the subscription window. For example:
BEGIN
  DBMS_CDC_SUBSCRIBE.EXTEND_WINDOW(
    subscription_name => 'JOBHIST_SUB');
END;
/
If this is the subscriber's first call to the EXTEND_WINDOW procedure, then the subscription window contains all the change data in the publication. Otherwise, the subscription window contains all the new change data that was created since the last call to the EXTEND_WINDOW procedure. If no new change data has been added, then the subscription window remains unchanged. Step 7 Read and query the contents of the subscriber views. The subscriber uses the SQL SELECT statement on the subscriber view to query the change data (within the current boundaries of the subscription window). The subscriber can do this for each subscriber view in the subscription. For example:
SELECT EMPLOYEE_ID, START_DATE, END_DATE FROM JOBHIST_VIEW;

EMPLOYEE_ID START_DAT END_DATE
----------- --------- ---------
        176 24-MAR-98 31-DEC-98
        180 24-MAR-98 31-DEC-98
        190 01-JAN-99 31-DEC-99
        200 01-JAN-99 31-DEC-99
The subscriber view name, JOBHIST_VIEW, was specified when the subscriber called the DBMS_CDC_SUBSCRIBE.SUBSCRIBE procedure in Step 4. Step 8 Indicate that the current set of change data is no longer needed. The subscriber uses the DBMS_CDC_SUBSCRIBE.PURGE_WINDOW procedure to let Change Data Capture know that the subscriber no longer needs the current set of change data. This helps Change Data Capture to manage the amount of data in the change table and sets the low boundary of the subscription window. Calling the DBMS_CDC_SUBSCRIBE.PURGE_WINDOW procedure causes the subscription window to be empty. For example:
BEGIN
  DBMS_CDC_SUBSCRIBE.PURGE_WINDOW(
    subscription_name => 'JOBHIST_SUB');
END;
/
Step 9 Repeat Steps 6 through 8. The subscriber repeats Steps 6 through 8 as long as the subscriber is interested in additional change data. Step 10 End the subscription. The subscriber uses the DBMS_CDC_SUBSCRIBE.DROP_SUBSCRIPTION procedure to end the subscription. This is necessary to prevent the publications that underlie the subscription from holding change data indefinitely.
BEGIN
  DBMS_CDC_SUBSCRIBE.DROP_SUBSCRIPTION(
    subscription_name => 'JOBHIST_SUB');
END;
/
HotLog: Asynchronous HotLog Change Data Capture reads online redo log files whenever possible and archived redo log files otherwise.
AutoLog: Asynchronous AutoLog Change Data Capture reads redo log files that have been copied from the source database to the staging database by log transport services. In ARCH mode, log transport services copies archived redo log files to the staging database after a log switch occurs on the source database. In LGWR mode, log transport services copies redo data to the staging database while it is being written to the online redo log file on the source database, and then makes it available to Change Data Capture when a log switch occurs on the source database.
For log files to be archived, the source databases for asynchronous Change Data Capture must run in ARCHIVELOG mode, as specified with the following SQL statement:
ALTER DATABASE ARCHIVELOG;
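The current log mode can be confirmed with a simple query (shown here for convenience; it is not part of the original text):

SELECT LOG_MODE FROM V$DATABASE;   -- returns ARCHIVELOG when archiving is enabled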
See Oracle Database Administrator's Guide for information about running a database in ARCHIVELOG mode. A redo log file used by Change Data Capture must remain available on the staging database until Change Data Capture has captured it. However, it is not necessary that the redo log file remain available until the Change Data Capture subscriber is done with the change data. To determine which redo log files are no longer needed by Change Data Capture for a given change set, the publisher alters the change set's Streams capture process, which causes Streams to perform some internal cleanup and populates the DBA_LOGMNR_PURGED_LOG view. The publisher follows these steps:
1. Uses the following query on the staging database to get the three SCN values needed to determine an appropriate new first_scn value for the change set, CHICAGO_DAILY:
SELECT cap.CAPTURE_NAME, cap.FIRST_SCN, cap.APPLIED_SCN, cap.SAFE_PURGE_SCN
FROM DBA_CAPTURE cap, CHANGE_SETS cset
WHERE cset.SET_NAME = 'CHICAGO_DAILY' AND
      cap.CAPTURE_NAME = cset.CAPTURE_NAME;

CAPTURE_NAME                    FIRST_SCN APPLIED_SCN SAFE_PURGE_SCN
------------------------------ ---------- ----------- --------------
CDC$C_CHICAGO_DAILY                778059      778293         778293

2. Determines a new first_scn value that is greater than the original first_scn value and less than or equal to the applied_scn and safe_purge_scn values returned by the query in step 1. In this example, this value is 778293, and the capture process name is CDC$C_CHICAGO_DAILY, therefore the publisher can alter the first_scn value for the capture process as follows:
BEGIN
  DBMS_CAPTURE_ADM.ALTER_CAPTURE(
    capture_name => 'CDC$C_CHICAGO_DAILY',
    first_scn    => 778293);
END;
/
If there is not an SCN value that meets these criteria, then the change set needs all of its redo log files.
3.  Queries the DBA_LOGMNR_PURGED_LOG view to see any log files that are no longer needed by Change Data Capture:
SELECT FILE_NAME FROM DBA_LOGMNR_PURGED_LOG;
Note: Redo log files may be required on the staging database for purposes other than Change Data Capture. Before deleting a redo log file, the publisher should be sure that no other users need it.

See the information on setting the first SCN for an existing capture process and on capture process checkpoints in Oracle Streams Concepts and Administration for more information. The first_scn value can be updated for all change sets in an AutoLog change source by using the DBMS_CDC_PUBLISH.ALTER_AUTOLOG_CHANGE_SOURCE first_scn parameter. Note that the new first_scn value must meet the criteria stated in step 2 of the preceding list for all change sets in the AutoLog change source.
Both the size of the redo log files and the frequency with which a log switch occurs can affect the generation of the archived log files at the source database. For Change Data Capture, the most important factor in deciding what size to make a redo log file is the tolerance for latency between when a change is made and when that change data is available to subscribers. However, because the Oracle Database software attempts a checkpoint at each log switch, if the redo log file is too small, frequent log switches will lead to frequent checkpointing and negatively impact the performance of the source database. See Oracle Data Guard Concepts and Administration for step-by-step instructions on monitoring log file archival information. Substitute the terms source and staging database for the Oracle Data Guard terms primary database and archiving destinations, respectively. When using log transport services to supply redo log files to an AutoLog change source, gaps in the sequence of redo log files are automatically detected and resolved. If a situation arises where it is necessary to manually add a log file to an AutoLog change set, the publisher can use the instructions on explicitly assigning log files to a downstream capture process described in Oracle Streams Concepts and Administration. These instructions require the name of the capture process for the AutoLog change set. The publisher can obtain the name of the capture process for an AutoLog change set from the CHANGE_SETS data dictionary view.
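One way to gauge whether the current redo log file size gives an acceptable log switch rate is to look at how often log switches have been occurring on the source database. The following query is a simple sketch (not taken from this guide) that counts log switches per hour using V$LOG_HISTORY:

SELECT TO_CHAR(first_time, 'YYYY-MM-DD HH24') AS switch_hour,
       COUNT(*)                                AS log_switches
FROM   V$LOG_HISTORY
GROUP  BY TO_CHAR(first_time, 'YYYY-MM-DD HH24')
ORDER  BY switch_hour;

A consistently high number of switches per hour suggests the redo log files may be too small for the source workload.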
Supplementally log all source table columns that are part of a primary key or that function to uniquely identify a row. This can be done using database-level or table-level identification key logging, or through a table-level unconditional log group. Create an unconditional log group for all source table columns that are captured by any asynchronous change table. This should be done before any change tables are created on a source table. For example:
ALTER TABLE SH.PROMOTIONS ADD SUPPLEMENTAL LOG GROUP log_group_cust (PROMO_NAME, PROMO_SUBCATEGORY, PROMO_CATEGORY) ALWAYS;
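Database-level identification key logging, mentioned above, is an alternative to per-table log groups; it instructs the database to supplementally log key columns for every table. A minimal sketch (issued on the source database; not part of the original example) is:

ALTER DATABASE ADD SUPPLEMENTAL LOG DATA (PRIMARY KEY) COLUMNS;

Because this applies to all tables, it is simpler to administer, but it generates more redo than targeted table-level log groups.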
If an unconditional log group is not created for all source table columns to be captured, then when an update DML operation occurs, some unchanged user column values in change tables will be null instead of reflecting the actual source table value. For example, suppose a source table contains two columns, X and Y, and that the source database DBA has defined an unconditional log group for that table that includes only column Y. Furthermore, assume that a user updates only column Y in that table row. When the subscriber views the change data for that row, the value of the unchanged column X will be null. However, because the actual column value for X is excluded from the redo log file and therefore cannot be included in the change table, the subscriber cannot assume that the actual source table value for column X is null. The subscriber must rely on the contents of the TARGET_COLMAP$ control column to determine whether the actual source table value for column X is null or it is unchanged. See Oracle Database Utilities for more information on the various types of supplemental logging.
Datatypes and Table Structures Supported for Asynchronous Change Data Capture
Asynchronous Change Data Capture supports columns of all built-in Oracle datatypes except the following:
-  BFILE
-  BLOB
-  CLOB
-  LONG
-  NCLOB
-  ROWID
-  UROWID
-  object types (for example, XMLType)
Asynchronous Change Data Capture does not support the following table structures:
-  Source tables that are temporary tables
-  Source tables that are object tables
-  Index-organized tables with columns of unsupported datatypes (including LOB columns) or with overflow segments
-  Managing Asynchronous Change Sets
-  Managing Change Tables
-  Considerations for Exporting and Importing Change Data Capture Objects
-  Impact on Subscriptions When the Publisher Makes Changes

-  Creating Asynchronous Change Sets with Starting and Ending Dates
-  Enabling and Disabling Asynchronous Change Sets
-  Stopping Capture on DDL for Asynchronous Change Sets
-  Recovering from Errors Returned on Asynchronous Change Sets
The following example creates a change set, JOBHIST_SET, in the AutoLog change source, HQ_SOURCE, that starts capture two days from now and continues indefinitely:
BEGIN
  DBMS_CDC_PUBLISH.CREATE_CHANGE_SET(
    change_set_name    => 'JOBHIST_SET',
    description        => 'Job History Application Change Set',
    change_source_name => 'HQ_SOURCE',
    stop_on_ddl        => 'Y',
    begin_date         => sysdate+2);
END;
/
The Oracle Streams capture and apply processes for the change set are started when the change set is enabled. The publisher can disable the JOBHIST_SET asynchronous change set with the following call:
BEGIN
  DBMS_CDC_PUBLISH.ALTER_CHANGE_SET(
    change_set_name => 'JOBHIST_SET',
    enable_capture  => 'n');
END;
/
The Oracle Streams capture and apply processes for the change set are stopped when the change set is disabled. Although a disabled change set cannot process new change data, it does not lose any change data provided that the necessary archived redo log files remain available until the change set is enabled and processes them. Oracle recommends that change sets be enabled as much as possible to avoid accumulating archived redo log files. See "Asynchronous Change Data Capture and Redo Log Files" on page 16-48 for more information. Change Data Capture can automatically disable an asynchronous change set if DDL is encountered during capture and the stop_on_ddl parameter is set to 'Y', or if there is an internal capture error. The publisher must check the alert log for more information, take any necessary actions to adjust to the DDL or recover from the internal error, and explicitly enable the change set. See "Recovering from Errors Returned on Asynchronous Change Sets" on page 16-55 for more information.
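The call to re-enable the change set is symmetric with the disable example above; a minimal sketch is:

BEGIN
  DBMS_CDC_PUBLISH.ALTER_CHANGE_SET(
    change_set_name => 'JOBHIST_SET',
    enable_capture  => 'y');
END;
/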
The publisher can alter the JOBHIST_SET change set so that it does not stop on DDL by using the following call:
BEGIN
  DBMS_CDC_PUBLISH.ALTER_CHANGE_SET(
    change_set_name => 'JOBHIST_SET',
    stop_on_ddl     => 'n');
END;
/
If a DDL statement causes processing to stop, a message is written to the alert log indicating the DDL statement and change set involved. For example, if a TRUNCATE TABLE DDL statement causes the JOB_HIST change set to stop processing, the alert log contains lines such as the following:
Change Data Capture received DDL for change set JOB_HIST
Change Data Capture received DDL and stopping: truncate table job_history
Because they do not affect the column data itself, the following DDL statements do not cause Change Data Capture to stop capturing change data when the stop_on_ddl parameter is set to 'Y':
-  ANALYZE TABLE
-  LOCK TABLE
-  GRANT privileges to access a table
-  REVOKE privileges to access a table
-  COMMENT on a table
-  COMMENT on a column
These statements can be issued on the source database without concern for their impact on Change Data Capture processing. For example, when an ANALYZE TABLE command is issued on the JOB_HISTORY source table, the alert log on the staging database will contain a line similar to the following when the stop_on_ddl parameter is set to 'Y':
Change Data Capture received DDL and ignoring: analyze table job_history compute statistics
The publisher must check the alert log for more information and attempt to fix the underlying problem. The publisher can then attempt to recover from the error by calling ALTER_CHANGE_SET with the recover_after_error and remove_ddl parameters set appropriately. The publisher can retry this procedure as many times as necessary to resolve the problem. When recovery succeeds, the error is removed from the change set and the publisher can enable the asynchronous change set (as described in "Enabling and Disabling Asynchronous Change Sets" on page 16-53). If more information is needed to resolve capture errors, the publisher can query the DBA_APPLY_ERROR view to see information about Streams apply errors; capture errors correspond to Streams apply errors. The publisher must always use the DBMS_CDC_PUBLISH.ALTER_CHANGE_SET procedure to recover from capture errors because both Streams and Change Data Capture actions are needed for recovery and only the DBMS_CDC_PUBLISH.ALTER_CHANGE_SET procedure performs both sets of actions. See Oracle Streams Concepts and Administration for information about the error queue and apply errors. The following two scenarios demonstrate how a publisher might investigate and then recover from two different types of errors returned to Change Data Capture:

An Error Due to Running Out of Disk Space

The publisher can view the contents of the alert log to determine which error is being returned for a given change set and which SCN is not being processed. For example, the alert log may contain lines such as the following (where LCR refers to a logical change record):
Change Data Capture has encountered error number: 1688 for change set: CHICAGO_DAILY
Change Data Capture did not process LCR with scn 219337
The publisher can determine the message associated with the error number specified in the alert log by querying the DBA_APPLY_ERROR view for the error message text, where the APPLY_NAME in the DBA_APPLY_ERROR view equals the APPLY_NAME of the change set specified in the alert log. For example:
SQL> SELECT ERROR_MESSAGE FROM DBA_APPLY_ERROR
     WHERE APPLY_NAME =
       (SELECT APPLY_NAME FROM CHANGE_SETS WHERE SET_NAME ='CHICAGO_DAILY');

ERROR_MESSAGE
--------------------------------------------------------------------------------
ORA-01688: unable to extend table LOGADMIN.CT1 partition P1 by 32 in tablespace
TS_CHICAGO_DAILY
After taking action to fix the problem that is causing the error, the publisher can attempt to recover from the error. For example, the publisher can attempt to recover the CHICAGO_DAILY change set after an error with the following call:
BEGIN
  DBMS_CDC_PUBLISH.ALTER_CHANGE_SET(
    change_set_name     => 'CHICAGO_DAILY',
    recover_after_error => 'y');
END;
/
If the recovery does not succeed, then an error is returned and the publisher can take further action to attempt to resolve the problem. The publisher can retry the recovery procedure as many times as necessary to resolve the problem.
Note: When recovery succeeds, the publisher must remember to enable the change set.

An Error Due to Stopping on DDL

Suppose a SQL TRUNCATE TABLE statement is issued against the JOB_HISTORY source table and the stop_on_ddl parameter is set to 'Y'. In this case, an error such as the following is returned from an attempt to enable the change set:
ERROR at line 1:
ORA-31468: cannot process DDL change record
ORA-06512: at "SYS.DBMS_CDC_PUBLISH", line 79
ORA-06512: at line 2
Because the TRUNCATE TABLE statement removes all rows from a table, the publisher will want to notify subscribers before taking action to reenable Change Data Capture processing. He or she might suggest to subscribers that they purge and extend their subscription windows. The publisher can then attempt to restore Change Data Capture processing by altering the change set and specifying the remove_ddl => 'Y' parameter along with the recover_after_error => 'Y' parameter, as follows:
BEGIN
  DBMS_CDC_PUBLISH.ALTER_CHANGE_SET(
    change_set_name     => 'JOB_HIST',
    recover_after_error => 'y',
    remove_ddl          => 'y');
END;
/
After this procedure completes, the alert log will contain lines similar to the following:
Mon Jun 9 16:20:17 2003
Change Data Capture received DDL and ignoring: truncate table JOB_HISTORY
The scn for the truncate statement is 202998
-  Creating Change Tables
-  Understanding Change Table Control Columns
-  Understanding TARGET_COLMAP$ and SOURCE_COLMAP$ Values
-  Controlling Subscriber Access to Change Tables
-  Purging Change Tables of Unneeded Data
-  Dropping Change Tables
For all modes of Change Data Capture, publishers should not create change tables in system tablespaces. Either of the following methods can be used to ensure that change tables are created in tablespaces managed by the publisher. The first method creates all the change tables created by the publisher in a single tablespace, while the second method allows the publisher to specify a different tablespace for each change table. When the database administrator creates the account for the publisher, he or she can specify a default tablespace. For example:
CREATE USER cdcpub DEFAULT TABLESPACE ts_cdcpub;
When the publisher creates a change table, he or she can use the options_string parameter to specify a tablespace for the change table being created. See Step 4 in "Performing Synchronous Publishing" on page 16-27 for an example.
If both methods are used, the tablespace specified by the publisher in the options_string parameter takes precedence over the default tablespace specified in the SQL CREATE USER statement.
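For illustration, the following sketch shows how the options_string parameter might carry a tablespace clause in a DBMS_CDC_PUBLISH.CREATE_CHANGE_TABLE call. The owner, table, column, and tablespace names here are hypothetical, and the full set of parameters is described with the synchronous publishing example referenced above:

BEGIN
  DBMS_CDC_PUBLISH.CREATE_CHANGE_TABLE(
    owner             => 'cdcpub',
    change_table_name => 'jobhist_ct',
    change_set_name   => 'JOBHIST_SET',
    source_schema     => 'HR',
    source_table      => 'JOB_HISTORY',
    column_type_list  => 'EMPLOYEE_ID NUMBER(6), JOB_ID VARCHAR2(10)',
    capture_values    => 'both',
    rs_id             => 'y',
    row_id            => 'n',
    user_id           => 'n',
    timestamp         => 'n',
    object_id         => 'n',
    source_colmap     => 'n',
    target_colmap     => 'y',
    options_string    => 'TABLESPACE ts_jobhist');  -- tablespace chosen by the publisher
END;
/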
For asynchronous Change Data Capture, the publisher should be certain that the source table that will be referenced in a DBMS_CDC_PUBLISH.CREATE_CHANGE_TABLE procedure has been created prior to calling this procedure, particularly if the change set that will be specified in the procedure has the stop_on_ddl parameter set to 'Y'. Suppose the publisher created a change set with the stop_on_ddl parameter set to 'Y', then created the change table, and then the source table was created. In this scenario, the DDL that creates the source table would trigger the stop_on_ddl condition and cause Change Data Capture processing to stop.
Note: The publisher must not attempt to control a change table's partitioning properties. Change Data Capture automatically manages the change table partitioning as part of its change table management.
For asynchronous Change Data Capture, the source database DBA should create an unconditional log group for all source table columns that will be captured in a change table. This should be done before any change tables are created on a source table. If an unconditional log group is not created for source table columns to be captured, then when an update DML operation occurs, some unchanged user column values in change tables will be null instead of reflecting the actual source table value. This will require the publisher to evaluate the TARGET_COLMAP$ control column to distinguish unchanged column values from column values that are actually null. See "Asynchronous Change Data Capture and Supplemental Logging" on page 16-50 for information on creating unconditional log groups and see "Understanding Change Table Control Columns" on page 16-60 for information on control columns.
Table 16-8 Control Columns for a Change Table

OPERATION$ (CHAR(2); Mode: All; Optional Column: No)
    The value in this column can be any one of the following (see note 1 below):
    I:  Indicates this row represents an insert operation.
    UO: Indicates this row represents the before-image of an updated source table row for the following cases:
        -  Asynchronous Change Data Capture
        -  Synchronous Change Data Capture when the change table includes a primary key-based object ID and a captured column that is not a primary key has changed.
    UU: Indicates this row represents the before-image of an updated source table row for synchronous Change Data Capture, in cases other than those represented by UO.
    UN: Indicates this row represents the after-image of an updated source table row.
    D:  Indicates this row represents a delete operation.

CSCN$ (NUMBER; Mode: All; Optional Column: No)
    Commit SCN of this transaction.

RSID$ (NUMBER; Mode: All; Optional Column: Yes)
    Unique row sequence ID within this transaction (see note 2 below). The RSID$ column reflects an operation's capture order within a transaction, but not across transactions. The publisher cannot use the RSID$ column value by itself to order committed operations across transactions; it must be used in conjunction with the CSCN$ column value.

SOURCE_COLMAP$
    Bit mask (see note 3 below) of updated columns in the source table.

TARGET_COLMAP$
    Bit mask (see note 3 below) of updated columns in the change table.

COMMIT_TIMESTAMP$
    Commit time of this transaction.

TIMESTAMP$
    Time when the operation occurred in the source table.
Table 16-8 (Cont.) Control Columns for a Change Table

USERNAME$ (Optional Column: Yes)
    Name of the user who caused the operation.

ROW_ID$ (Optional Column: Yes)
    Row ID of affected row in source table.

XIDUSN$
    Transaction ID undo segment number.

XIDSLT$
    Transaction ID slot number.

XIDSEQ$
    Transaction ID sequence number.

SYS_NC_OID$ (RAW(16))
    Object ID.

Notes on Table 16-8:
1.  If you specify a query based on this column, specify the I or D column values as "I " or "D ", respectively. The OPERATION$ column is a 2-character column; values are left-justified and space-filled. A query that specifies a value of "I" or "D" will return no values.
2.  You can use the RSID$ column to associate the after-image with the before-image of a given operation. The value of the after-image RSID$ column always matches the value of the before-image RSID$ column value for a given update operation.
3.  A bit mask is an array of binary values that indicate which columns in a row have been updated.
In Example 16-3, the first 'FE' is the low order byte and the last '00' is the high order byte. To correctly interpret the meaning of the values, you must consider which bits are set in each byte. The bits in the bitmap are counted starting at zero. The first bit is bit 0, the second bit is bit 1, and so on. Bit 0 is always ignored. For the other bits, if a particular bit is set to 1, it means that the value for that column has been changed.

To interpret the string of bytes as presented in Example 16-3, you read from left to right. The first byte is the string 'FE'. Broken down into bits (again from left to right) this string is "1111 1110", which maps to columns "7,6,5,4 3,2,1,-" in the change table (where the hyphen represents the ignored bit). The first bit tells you if column 7 in the change table has changed. The right-most bit is ignored. The values in Example 16-3 indicate that the first 7 columns have a value present. This is typical: the first several columns in a change table are control columns.

The next byte in Example 16-3 is the string '11'. Broken down into bits, this string is "0001 0001", which maps to columns "15,14,13,12 11,10,9,8" in the change table. These bits indicate that columns 8 and 12 are changed. Columns 9, 10, 11, 13, 14, and 15 are not changed. The rest of the string is all '00', indicating that none of the other columns has been changed.

A publisher can issue the following query to determine the mapping of column numbers to column names:
SELECT COLUMN_NAME, COLUMN_ID FROM ALL_TAB_COLUMNS
WHERE OWNER='PUBLISHER_STEWART' AND TABLE_NAME='MY_CT';

COLUMN_NAME                     COLUMN_ID
------------------------------ ----------
OPERATION$                              1
CSCN$                                   2
COMMIT_TIMESTAMP$                       3
XIDUSN$                                 4
XIDSLT$                                 5
XIDSEQ$                                 6
RSID$                                   7
TARGET_COLMAP$                          8
C_ID                                    9
C_KEY                                  10
C_ZIP                                  11
C_DATE                                 12
C_1                                    13
C_3                                    14
C_5                                    15
C_7                                    16
C_9                                    17
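The byte-and-bit arithmetic described above can also be scripted. The following PL/SQL block is a minimal sketch (not part of the original guide) that prints the changed column numbers for a given bit mask; the sample value assigned to v_colmap is hypothetical and corresponds to the 'FE 11 00 ...' pattern discussed in Example 16-3:

SET SERVEROUTPUT ON
DECLARE
  -- hypothetical TARGET_COLMAP$ value: first byte FE, second byte 11, rest 00
  v_colmap RAW(128) := HEXTORAW('FE11000000000000000000000000000000000000');
  v_byte   PLS_INTEGER;
  v_col    PLS_INTEGER;
BEGIN
  FOR i IN 1 .. UTL_RAW.LENGTH(v_colmap) LOOP
    -- byte i covers column numbers (i-1)*8 through (i-1)*8 + 7
    v_byte := TO_NUMBER(RAWTOHEX(UTL_RAW.SUBSTR(v_colmap, i, 1)), 'XX');
    FOR b IN 0 .. 7 LOOP
      v_col := (i - 1) * 8 + b;
      -- bit 0 of the first byte (column 0) is always ignored
      IF v_col > 0 AND BITAND(v_byte, POWER(2, b)) > 0 THEN
        DBMS_OUTPUT.PUT_LINE('Column ' || v_col || ' changed');
      END IF;
    END LOOP;
  END LOOP;
END;
/

For the sample value, the block reports columns 1 through 8 and column 12, matching the manual reading given above.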
Using Example 16-3, the publisher can conclude that the following columns were changed in the particular change row in the change table represented by this TARGET_COLMAP$ value: OPERATION$, CSCN$, COMMIT_TIMESTAMP$, XIDUSN$, XIDSLT$, XIDSEQ$, RSID$, TARGET_COLMAP$, and C_DATE. Note that Change Data Capture generates values for all control columns in all change rows, so the bits corresponding to control columns are always set to 1 in every TARGET_COLMAP$ column. Bits that correspond to user columns that have changed are set to 1 for the OPERATION$ column values UN and I, as appropriate. (See Table 16-8 for information about the OPERATION$ column values.) A common use for the values in the TARGET_COLMAP$ column is for determining the meaning of a null value in a change table. A column value in a change table can be null for two reasons: the value was changed to null by a user or application, or Change Data Capture inserted a null value into the column because a value was not present in the redo data from the source table. If a user changed the value to null, the bit for that column will be set to 1; if Change Data Capture set the value to null, then the bit for that column will be set to 0. Values in the SOURCE_COLMAP$ column are interpreted in a similar manner, with the following exceptions:
-  The SOURCE_COLMAP$ column refers to columns of source tables, not columns of change tables.
-  The SOURCE_COLMAP$ column does not reference control columns because these columns are not present in the source table.
-  Changed source columns are set to 1 in the SOURCE_COLMAP$ column for OPERATION$ column values UO, UU, UN, and I, as appropriate. (See Table 16-8 for information about the OPERATION$ column values.)
-  The SOURCE_COLMAP$ column is valid only for synchronous change tables.
The publisher controls subscriber access to change data by granting and revoking the SELECT privilege on change tables for users and roles. The publisher must grant the SELECT privilege before a subscriber can subscribe to the change table. The publisher must not grant any DML access (use of INSERT, UPDATE, or DELETE statements) to the subscribers on the change tables because of the risk that a subscriber might inadvertently change the data in the change table, making it inconsistent with its source. Furthermore, the publisher should avoid creating change tables in schemas to which subscribers have DML access.
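As a simple illustration, the publisher might grant read access on a change table to a subscriber as follows; the change table and subscriber names here are hypothetical:

GRANT SELECT ON cdcpub.jobhist_ct TO subscriber1;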
-  Subscriber: When finished using change data, a subscriber must call the DBMS_CDC_SUBSCRIBE.PURGE_WINDOW procedure. This indicates to Change Data Capture and the publisher that the change data is no longer needed by this subscriber. The DBMS_CDC_SUBSCRIBE.PURGE_WINDOW procedure does not physically remove rows from the change tables. In addition, as shown in "Subscribing to Change Data" beginning on page 16-42, the subscriber should call the DBMS_CDC_SUBSCRIBE.DROP_SUBSCRIPTION procedure to drop unneeded subscriptions. See PL/SQL Packages and Types Reference for information about the DBMS_CDC_SUBSCRIBE.DROP_SUBSCRIPTION and the DBMS_CDC_SUBSCRIBE.PURGE_WINDOW procedures.
-  Change Data Capture: Change Data Capture creates a purge job using the DBMS_JOB PL/SQL package (which runs under the account of the publisher who created the first change table). This purge job automatically calls the DBMS_CDC_PUBLISH.PURGE procedure to remove data that subscribers are no longer using from the change tables. This ensures that the size of the change tables does not grow without limit. The automatic call to the DBMS_CDC_PUBLISH.PURGE procedure evaluates all active subscription windows to determine which change data is still needed. It will not purge any data that could be referenced by one or more subscribers with active subscription windows.
   By default, this purge job runs every 24 hours. The publisher who created the first change table can adjust this interval using the PL/SQL DBMS_JOB.CHANGE procedure. The values for the JOB parameter for this procedure can be found by querying the USER_JOBS view for the job number that corresponds to the WHAT column containing the string 'SYS.DBMS_CDC_PUBLISH.PURGE();'. See PL/SQL Packages and Types Reference for information about the DBMS_JOB package and the Oracle Database Reference for information about the USER_JOBS view.
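   For example, the publisher who owns the purge job could locate its job number and change the purge interval to every 12 hours with a sketch such as the following; the job number 42 is hypothetical and would be replaced by the value returned from the first query:

   SELECT JOB FROM USER_JOBS
   WHERE WHAT LIKE '%DBMS_CDC_PUBLISH.PURGE%';

   BEGIN
     DBMS_JOB.CHANGE(
       job       => 42,
       what      => NULL,             -- leave the job action unchanged
       next_date => NULL,             -- leave the next run time unchanged
       interval  => 'SYSDATE + 12/24');
   END;
   /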
-  Publisher: The publisher can manually execute a purge operation at any time. The publisher has the ability to perform purge operations at a finer granularity than the automatic purge operation performed by Change Data Capture. There are three purge operations available to the publisher:

   DBMS_CDC_PUBLISH.PURGE
       Purges all change tables on the staging database. This is the same PURGE operation as is performed automatically by Change Data Capture.

   DBMS_CDC_PUBLISH.PURGE_CHANGE_SET
       Purges all the change tables in a named change set.

   DBMS_CDC_PUBLISH.PURGE_CHANGE_TABLE
       Purges a named change table.
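   For illustration, the publisher-level purge operations might be invoked as in the following sketch; the change set, owner, and change table names are hypothetical, and the argument order follows the signatures documented in PL/SQL Packages and Types Reference:

   EXECUTE DBMS_CDC_PUBLISH.PURGE;
   EXECUTE DBMS_CDC_PUBLISH.PURGE_CHANGE_SET('JOBHIST_SET');
   EXECUTE DBMS_CDC_PUBLISH.PURGE_CHANGE_TABLE('cdcpub', 'jobhist_ct');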
Thus, calls to the DBMS_CDC_SUBSCRIBE.PURGE_WINDOW procedure by subscribers and calls to the PURGE procedure by Change Data Capture (or one of the PURGE procedures by the publisher) work together: when each subscriber purges a subscription window, it indicates change data that is no longer needed; the PURGE procedure evaluates the sum of the input from all the subscribers before actually purging data. Note that it is possible that a subscriber could fail to call PURGE_WINDOW, with the result being that unneeded rows would not be deleted by the purge job. The publisher can query the DBA_SUBSCRIPTIONS view to determine if this is happening. In extreme circumstances, a publisher may decide to manually drop an active subscription so that space can be reclaimed. One such circumstance is a subscriber that is an applications program that fails to call the PURGE_WINDOW procedure when appropriate. The DBMS_CDC_PUBLISH.DROP_SUBSCRIPTION procedure lets the publisher drop active subscriptions if circumstances require it;
however, the publisher should first consider that subscribers may still be using the change data.
The DBMS_CDC_PUBLISH.DROP_CHANGE_TABLE procedure also safeguards the publisher from inadvertently dropping a change table while there are active subscribers using the change table. If DBMS_CDC_PUBLISH.DROP_CHANGE_TABLE is called while subscriptions are active, the procedure will fail with the following error:
ORA-31424 change table has active subscriptions
If the publisher still wants to drop the change table, in spite of active subscriptions, he or she must call the DROP_CHANGE_TABLE procedure using the force_flag => 'Y' parameter. This tells Change Data Capture to override its normal safeguards and allow the change table to be dropped despite active subscriptions. The subscriptions will no longer be valid, and subscribers will lose access to the change data.
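A sketch of such a forced drop follows; the owner and change table names are hypothetical, and the positional argument order is as documented in PL/SQL Packages and Types Reference:

EXECUTE DBMS_CDC_PUBLISH.DROP_CHANGE_TABLE('cdcpub', 'jobhist_ct', 'Y');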
Note: The SQL DROP USER CASCADE statement will drop all the publisher's change tables, and if any other users have active subscriptions to the (dropped) change table, these will no longer be valid. In addition to dropping the change tables, the DROP USER CASCADE statement drops any change sources, change sets, and subscriptions that are owned by the user specified in the DROP USER CASCADE statement.
Change Data Capture objects are exported and imported only as part of full database export and import operations (those in which the expdp and impdp commands specify the FULL=y parameter). Schema-level import and export operations include some underlying objects (for example, the table underlying a change table), but not the Change Data Capture metadata needed for change data capture to occur.
-  AutoLog change sources, change sets, and change tables are not supported.
-  You should export asynchronous change sets and change tables at a time when users are not making DDL and DML changes to the database being exported.
-  When importing asynchronous change sets and change tables, you must also import the underlying Oracle Streams configuration; set the Oracle Data Pump import parameter STREAMS_CONFIGURATION to y explicitly (or implicitly by accepting the default), so that the necessary Streams objects are imported. If you perform an import operation and specify STREAMS_CONFIGURATION=n, then imported asynchronous change sets and change tables will not be able to continue capturing change data.
-  Change Data Capture objects never overwrite existing objects when they are imported (similar to the effect of the import command TABLE_EXISTS_ACTION=skip parameter for tables). Change Data Capture generates warnings in the import log for these cases.
-  Change Data Capture objects are validated at the end of an import operation to determine if all expected underlying objects are present in the correct form. Change Data Capture generates validation warnings in the import log if it detects validation problems. Imported Change Data Capture objects with validation warnings usually cannot continue capturing change data.
The following are examples of Data Pump export and import commands that support Change Data Capture objects:
> expdp system/manager DIRECTORY=dpump_dir FULL=y
> impdp system/manager DIRECTORY=dpump_dir FULL=y STREAMS_CONFIGURATION=y
After a Data Pump full database import operation completes for a database containing AutoLog Change Data Capture objects, the following steps must be performed to restore these objects:
1.  The publisher must manually drop the change tables with the SQL DROP TABLE command. This is needed because the tables are imported without the accompanying Change Data Capture metadata.
2.  The publisher must re-create the AutoLog change sources, change sets, and change tables using the appropriate DBMS_CDC_PUBLISH procedures.
3.  Subscribers must re-create their subscriptions to the AutoLog change sets.
Change data may be lost in the interval between a Data Pump full database export operation involving AutoLog Change Data Capture objects and their re-creation after a Data Pump full database import operation in the preceding step. This can be minimized by preventing changes to the source tables during this interval, if possible. See Oracle Database Utilities for information on Oracle Data Pump. The following are publisher considerations for exporting and importing change tables:
-  When change tables are imported, the job queue is checked for a Change Data Capture purge job. If no purge job is found, then one is submitted automatically (using the DBMS_CDC_PUBLISH.PURGE procedure). If a change table is imported, but no subscriptions are taken out before the purge job runs (24 hours later, by default), then all rows in the table will be purged. The publisher can use one of the following methods to prevent the purging of data from a change table (a sketch of the first method follows this list):

   -  Suspend the purge job using the DBMS_JOB package to either disable the job (using the BROKEN procedure) or execute the job sometime in the future when there are subscriptions (using the NEXT_DATE procedure).
      Note: If you disable the purge job by marking it as broken, you need to remember to reset it once subscriptions have been activated. This prevents the change table from growing indefinitely.
   -  Create a temporary subscription to preserve the change table data until real subscriptions appear. Then, drop the temporary subscription.
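The following is a minimal sketch of the first method, suspending and later re-enabling the purge job with the DBMS_JOB package; the job number 42 is hypothetical and would come from the USER_JOBS query shown earlier:

BEGIN
  DBMS_JOB.BROKEN(job => 42, broken => TRUE);    -- suspend the purge job
END;
/

BEGIN
  DBMS_JOB.BROKEN(job => 42, broken => FALSE);   -- re-enable it once subscriptions exist
END;
/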
-  When importing data into a source table for which a change table already exists, the imported data is also recorded in any associated change tables. Assume that the publisher has a source table SALES that has an associated change table ct_sales. When the publisher imports data into SALES, that data is also recorded in ct_sales.
-  When importing a change table having the optional control ROW_ID column, the ROW_ID columns stored in the change table have meaning only if the associated source table has not been imported. If a source table is re-created or imported, each row will have a new ROW_ID that is unrelated to the ROW_ID that was previously recorded in a change table.

The original level of export and import support available in Oracle9i is retained for backward compatibility. Synchronous change tables that reside in the SYNC_SET change set can be exported as part of a full database, schema, or individual table export operation and can be imported as needed. The following Change Data Capture objects are not included in the original export and import support: change sources, change sets, change tables that do not reside in the SYNC_SET change set, and subscriptions.
Subscribers do not get explicit notification if the publisher adds a new change table or adds columns to an existing change table. A subscriber can check the ALL_PUBLISHED_COLUMNS view to see if new columns have been added, and whether or not the subscriber has access to them. Table 16-9 describes what happens when the publisher adds a column to a change table.
Table 16-9 Effects of Publisher Adding a Column to a Change Table

If the publisher adds a user column and a new subscription includes this column, then the subscription window for this subscription starts at the point the column was added.

If the publisher adds a user column and a new subscription does not include this newly added column, then the subscription window for this subscription starts at the earliest available change data. The new column will not be seen.

If the publisher adds a user column and a subscription exists, then the subscription window for this subscription remains unchanged.

If the publisher adds a control column and a new subscription is created, then the subscription window for this subscription starts at the earliest available change data. The subscription can see the control column immediately. All change table rows that existed prior to adding the control column will have the null value for the newly added control column.

If the publisher adds a control column and a subscription exists, then this subscription can see the new control column when the subscription window is extended (DBMS_CDC_PUBLISH.EXTEND_WINDOW procedure) such that the low boundary for the window crosses over the point when the control column was added.
To reinstall Change Data Capture, the SQL script initcdc.sql is provided in the admin directory. It creates the Change Data Capture system triggers and Java classes that are required by Change Data Capture.
17
SQLAccess Advisor
This chapter illustrates how to use the SQLAccess Advisor, which is a tuning tool that provides advice on materialized views, indexes, and materialized view logs. The chapter contains:
-  Overview of the SQLAccess Advisor in the DBMS_ADVISOR Package
-  Using the SQLAccess Advisor
-  Tuning Materialized Views for Fast Refresh and Query Rewrite
Figure 17-1 (diagram) Workload sources (the SQL cache and a user-defined workload) feed the SQLAccess Advisor, which recommends materialized views, indexes, and dimensions for the Oracle warehouse.
Using the SQLAccess Advisor Wizard or API, you can do the following:
-  Recommend materialized views and indexes based on collected or hypothetical workload information.
-  Manage workloads.
-  Mark, update, and remove recommendations.
In addition, you can use the SQLAccess Advisor API to do the following:
-  Perform a quick tune using a single SQL statement.
-  Show how to make a materialized view fast refreshable.
-  Show how to change a materialized view so that general query rewrite is possible.
The SQLAccess Advisor's recommendations are significantly improved if you gather structural statistics about table and index cardinalities, and the distinct cardinalities of every dimension level column, JOIN KEY column, and fact table key column. You do this by gathering either exact or estimated statistics with the DBMS_STATS package. Because gathering statistics is time-consuming and extreme statistical accuracy is not required, it is generally preferable to estimate statistics. Without these statistics, any queries referencing that table will be marked as invalid in the workload, resulting in no recommendations being made for those queries. It is also recommended that all existing indexes and materialized views have been analyzed. See PL/SQL Packages and Types Reference for more information regarding the DBMS_STATS package.
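For instance, estimated statistics for a fact table and its indexes might be gathered as in the following sketch; the schema and table names are those of the sample SH schema used elsewhere in this chapter, and the call is an illustration rather than a prescribed setting:

BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname          => 'SH',
    tabname          => 'SALES',
    estimate_percent => DBMS_STATS.AUTO_SAMPLE_SIZE,
    cascade          => TRUE);   -- also gather statistics on the table's indexes
END;
/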
-  Create a task
-  Define the workload
-  Generate the recommendations
-  View and implement the recommendations
Step 1 Create a task

Before any recommendations can be made, a task must be created. The task is important because it is where all information relating to the recommendation process resides, including the results of the recommendation process. If you use the wizard in Oracle Enterprise Manager or the DBMS_ADVISOR.QUICK_TUNE procedure, the task is created automatically for you. In all other cases, you must create a task using the DBMS_ADVISOR.CREATE_TASK procedure. You can control what a task does by defining parameters for that task using the DBMS_ADVISOR.SET_TASK_PARAMETER procedure. See "Creating Tasks" on page 17-10 for more information about creating tasks.

Step 2 Define the workload

The workload is one of the primary inputs for the SQLAccess Advisor, and it consists of one or more SQL statements, plus various statistics and attributes that fully describe each statement. If the workload contains all SQL statements from a target business application, the workload is considered a full workload; if the workload contains a subset of SQL statements, it is known as a partial workload. The difference between a full and a partial workload is that in the former case, the SQLAccess Advisor may recommend dropping certain existing materialized views and indexes if it finds that they are not being used effectively.

Typically, the SQLAccess Advisor uses the workload as the basis for all analysis activities. Although the workload may contain a wide variety of statements, it carefully ranks the entries according to a specific statistic, business importance, or a combination of statistics and business importance. This ranking is critical in that it enables the SQLAccess Advisor to process the most important SQL statements ahead of those with less business impact.

For a collection of data to be considered a valid workload, the SQLAccess Advisor may require particular attributes to be present. Although analysis can be performed if some of the items are missing, the quality of the recommendations may be greatly diminished. For example, the SQLAccess Advisor requires a workload to contain a SQL query and the user who executed the query. All other attributes are optional; however, if the workload also contained I/O and CPU information, then the SQLAccess Advisor may be able to better evaluate the current efficiency of the statement.

The workload is stored as a separate object, which is created using the DBMS_ADVISOR.CREATE_SQLWKLD procedure, and can easily be shared among many Advisor tasks. Because the workload is independent, it must be linked to a task using the DBMS_ADVISOR.ADD_SQLWKLD_REF procedure. Once this link has been established, the workload cannot be deleted or modified until all Advisor tasks have removed their dependency on the workload. A workload reference will be removed when a parent Advisor task is deleted or when the workload reference is manually removed from the Advisor task by the user using the DBMS_ADVISOR.DELETE_SQLWKLD_REF procedure.

You can use the SQLAccess Advisor without a workload; however, for best results, a workload must be provided in the form of a user-supplied table, a SQL Tuning Set, or an import from the SQL cache. If a workload is not provided, the SQLAccess Advisor can generate and use a hypothetical workload based on the dimensions defined in your schema.

Once the workload is loaded into the repository, or at the time the recommendations are generated, a filter can be applied to the workload to restrict what is analyzed. This provides the ability to generate different sets of recommendations based on different workload scenarios.

The recommendation process and customization of the workload are controlled by SQLAccess Advisor parameters. These parameters control various aspects of the recommendation process, such as the type of recommendation that is required and
the naming conventions for what it recommends. With respect to the workload, parameters control how long the workload exists and what filtering is to be applied to the workload.

To set these parameters, use the SET_TASK_PARAMETER and SET_SQLWKLD_PARAMETER procedures. Parameters are persistent in that they remain set for the lifespan of the task or workload object. When a parameter value is set using the SET_TASK_PARAMETER procedure, it does not change until you make another call to SET_TASK_PARAMETER. See "Defining the Contents of a Workload" on page 17-14 for more information about workloads.

Step 3 Generate the recommendations

Once a task exists, a workload is linked to the task, and the appropriate parameters are set, you can generate recommendations using the DBMS_ADVISOR.EXECUTE_TASK procedure. These recommendations are stored in the SQLAccess Advisor Repository. The recommendation process generates a number of recommendations, and each recommendation comprises one or more actions: for example, create a materialized view and then analyze it to gather statistical information. A task recommendation can range from a simple suggestion to a complex solution that requires implementation of a set of database objects such as indexes, materialized views, and materialized view logs. When an Advisor task is executed, it carefully analyzes collected data and user-adjusted task parameters. The SQLAccess Advisor will then attempt to form a resolution based on its built-in knowledge. The resolutions are then refined and stored in the form of a structured recommendation that can be viewed and implemented by the user. See "Generating Recommendations" on page 17-25 for more information about generating recommendations.

Step 4 View and implement the recommendations

There are two ways to view the recommendations from the SQLAccess Advisor: using the catalog views or by generating a script using the DBMS_ADVISOR.GET_TASK_SCRIPT procedure. In Enterprise Manager, the recommendations may be displayed once the SQLAccess Advisor process has completed. See "Viewing the Recommendations" on page 17-26 for a description of using the catalog views to view the recommendations. See "Generating SQL Scripts" on page 17-34 to see how to create a script.
Not all recommendations have to be accepted, and you can mark the ones that should be included in the recommendation script. The final step is then implementing the recommendations and verifying that query performance has improved.
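As an illustration of the script-based route, the recommendations for a task can be written to a SQL file with DBMS_ADVISOR.GET_TASK_SCRIPT and DBMS_ADVISOR.CREATE_FILE; the directory object ADVISOR_RESULTS, its path, and the file name below are hypothetical:

CREATE DIRECTORY advisor_results AS '/tmp';

EXECUTE DBMS_ADVISOR.CREATE_FILE(DBMS_ADVISOR.GET_TASK_SCRIPT('MYTASK'), 'ADVISOR_RESULTS', 'advscript.sql');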
Collects a complete workload for the SQLAccess Advisor. Supports historical data. Is managed by the server.
-  SQLAccess Advisor Flowchart
-  SQLAccess Advisor Privileges
-  Creating Tasks
-  SQLAccess Advisor Templates
-  Creating Templates
-  Workload Objects
-  Managing Workloads
-  Linking a Task and a Workload
-  Defining the Contents of a Workload
-  SQL Workload Journal
-  Adding SQL Statements to a Workload
-  Deleting SQL Statements from a Workload
-  Changing SQL Statements in a Workload
-  Maintaining Workloads
-  Removing Workloads
-  Recommendation Options
-  Generating Recommendations
-  Viewing the Recommendations
-  Access Advisor Journal
-  Stopping the Recommendation Process
-  Marking Recommendations
-  Modifying Recommendations
-  Generating SQL Scripts
-  When Recommendations are No Longer Required
-  Performing a Quick Tune
-  Managing Tasks
-  Using SQLAccess Advisor Constants
To avoid missing critical workload queries, the current database user must have SELECT privileges on the tables targeted for materialized view analysis. For those tables, these SELECT privileges cannot be obtained through a role.
Creating Tasks
An Advisor task is where you define what it is you want to analyze and where the results of this analysis should go. A user can create any number of tasks, each with its own specialization. All are based on the same Advisor task model and share the same repository. You create a task using the CREATE_TASK procedure. The syntax is as follows:
DBMS_ADVISOR.CREATE_TASK (
   advisor_name     IN VARCHAR2,
   task_id          OUT NUMBER,
   task_name        IN OUT VARCHAR2,
   task_desc        IN VARCHAR2 := NULL,
   task_or_template IN VARCHAR2 := NULL,
   is_template      IN VARCHAR2 := 'FALSE');
See PL/SQL Packages and Types Reference for more information regarding the CREATE_TASK and CREATE_SQLWKLD procedures and their parameters.
You can set a task to be a template by setting the template attribute when creating the task or later using the UPDATE_TASK_ATTRIBUTE procedure. To use a task as a template, you tell the SQLAccess Advisor to use a task when a new task is created. At that time, the SQLAccess Advisor copies the task template's data and parameter settings into the newly created task. You can also set an existing task to be a template by setting the template attribute. A workload object can also be used as a template for creating new workload objects. Following the same guidelines for using a task as a template, a workload object can benefit from having a well-defined starting point. Like a task template, a template workload object can only be used to create similar workload objects.
Creating Templates
You can create a template as in the following example.
1.  Create a task to serve as the template, named MY_TEMPLATE, by calling DBMS_ADVISOR.CREATE_TASK with is_template set to 'TRUE'.

2.
Set template parameters. For example, the following sets the naming conventions for recommended indexes and materialized views and the default tablespaces:
-- set naming conventions for recommended indexes/mvs
EXECUTE DBMS_ADVISOR.SET_TASK_PARAMETER(:template_name, 'INDEX_NAME_TEMPLATE', 'SH_IDX$$_<SEQ>');
EXECUTE DBMS_ADVISOR.SET_TASK_PARAMETER(:template_name, 'MVIEW_NAME_TEMPLATE', 'SH_MV$$_<SEQ>');

-- set default tablespace for recommended indexes/mvs
EXECUTE DBMS_ADVISOR.SET_TASK_PARAMETER(:template_name, 'DEF_INDEX_TABLESPACE', 'SH_INDEXES');
EXECUTE DBMS_ADVISOR.SET_TASK_PARAMETER(:template_name, 'DEF_MVIEW_TABLESPACE', 'SH_MVIEWS');
3.  This template can now be used as a starting point to create a task as follows:
VARIABLE task_id NUMBER;
VARIABLE task_name VARCHAR2(255);
EXECUTE :task_name := 'MYTASK';
EXECUTE DBMS_ADVISOR.CREATE_TASK('SQL Access Advisor', :task_id, :task_name, template=>'MY_TEMPLATE');
The following example uses a predefined template SQLACCESS_WAREHOUSE. See Table 17-4 for more information.
EXECUTE DBMS_ADVISOR.CREATE_TASK('SQL Access Advisor', :task_id, :task_name, template=>'SQLACCESS_WAREHOUSE');
Workload Objects
Because the workload is stored as a separate workload object, it can easily be shared among many Advisor tasks. Once a workload object has been referenced by an Advisor task, a workload object cannot be deleted or modified until all Advisor tasks have removed their dependency on the data. A workload reference will be removed when a parent Advisor task is deleted or when the workload reference is manually removed from the Advisor task by the user. The SQLAccess Advisor performs best when a workload based on usage is available. The SQLAccess Workload Repository is capable of storing multiple workloads, so that the different uses of a real-world data warehousing or transaction processing environment can be viewed over a long period of time and across the life cycle of database instance startup and shutdown. Before the actual SQL statements for a workload can be defined, the workload must be created using the CREATE_SQLWKLD procedure. Then, the workload is loaded using the appropriate IMPORT_SQLWKLD procedure. A specific workload can be removed by calling the DELETE_SQLWKLD procedure and passing it a valid workload name. To remove all workloads for the current user, call DELETE_SQLWKLD and pass the constant value ADVISOR_ALL or %.
Managing Workloads
The CREATE_SQLWKLD procedure creates a workload and it must exist prior to performing any other workload operations, such as importing or updating SQL statements. The workload is identified by its name, so you should define a name that is unique and is relevant for the operation. Its syntax is as follows:
DBMS_ADVISOR.CREATE_SQLWKLD (
   workload_name IN VARCHAR2,
   description   IN VARCHAR2 := NULL,
   template      IN VARCHAR2 := NULL,
   is_template   IN VARCHAR2 := 'FALSE');
VARIABLE workload_name VARCHAR2(255);
EXECUTE :workload_name := 'MYWORKLOAD';
EXECUTE DBMS_ADVISOR.CREATE_SQLWKLD(:workload_name,'This is my first workload');

Example 17-2 Creating a Workload from a Template

1.
2.
3.  Set template parameters. For example, the following sets the filter so only tables in the sh schema are tuned:

    -- set USERNAME_LIST filter to SH
    EXECUTE DBMS_ADVISOR.SET_SQLWKLD_PARAMETER(:template_name, 'USERNAME_LIST', 'SH');

4.
See PL/SQL Packages and Types Reference for more information regarding the CREATE_SQLWKLD procedure and its parameters.
Once a link between an Advisor task and a workload is made, the workload is protected from removal. The syntax is as follows:
DBMS_ADVISOR.ADD_SQLWKLD_REF (
   task_name     IN VARCHAR2,
   workload_name IN VARCHAR2);
The following example links the MYTASK task created earlier to the MYWORKLOAD SQL workload.
EXECUTE DBMS_ADVISOR.ADD_SQLWKLD_REF('MYTASK', 'MYWORKLOAD');
See PL/SQL Packages and Types Reference for more information regarding the ADD_SQLWKLD_REF procedure and its parameters.
-  SQL Tuning Set
-  Loading a User-Defined Workload
-  Loading a SQL Cache Workload
-  Using a Hypothetical Workload
-  Using a Summary Advisor 9i Workload
After a workload has been collected and the statements filtered, the SQLAccess Advisor computes usage statistics with respect to the DML statements in the workload.
The following example creates a workload from a SQL Tuning Set named MY_STS_WORKLOAD.
VARIABLE sqlsetname VARCHAR2(30);
VARIABLE workload_name VARCHAR2(30);
VARIABLE saved_stmts NUMBER;
VARIABLE failed_stmts NUMBER;
EXECUTE :sqlsetname := 'MY_STS_WORKLOAD';
EXECUTE :workload_name := 'MY_WORKLOAD';
EXECUTE DBMS_ADVISOR.CREATE_SQLWKLD (:workload_name);
EXECUTE DBMS_ADVISOR.IMPORT_SQLWKLD_STS (:workload_name, :sqlsetname, 'NEW', 1, :saved_stmts, :failed_stmts);
The following example loads the MYWORKLOAD workload created earlier, using a user table SH.USER_WORKLOAD. The table is assumed to be populated with SQL statements and conforms to the format specified in Table 17-1.
VARIABLE saved_stmts NUMBER;
VARIABLE failed_stmts NUMBER;
EXECUTE DBMS_ADVISOR.IMPORT_SQLWKLD_USER('MYWORKLOAD', 'NEW', 'SH', 'USER_WORKLOAD', :saved_stmts, :failed_stmts);
See PL/SQL Packages and Types Reference for more information regarding the IMPORT_SQLWKLD_USER procedure and its parameters.
Table 17-1 USER_WORKLOAD Table Format

MODULE (VARCHAR2(64), default: empty string)
    Application module name.
ACTION (VARCHAR2(64), default: empty string)
    Application action.
BUFFER_GETS (NUMBER, default: 0)
    Total buffer-gets for the statement.
CPU_TIME (NUMBER, default: 0)
    Total CPU time in seconds for the statement.
ELAPSED_TIME (NUMBER, default: 0)
    Total elapsed time in seconds for the statement.
DISK_READS (NUMBER, default: 0)
    Total number of disk-read operations used by the statement.
ROWS_PROCESSED (NUMBER, default: 0)
    Total number of rows processed by this SQL statement.
EXECUTIONS (NUMBER, default: 1)
    Total number of times the statement is executed.
OPTIMIZER_COST (NUMBER, default: 0)
    Optimizer's calculated cost value for executing the query.
LAST_EXECUTION_DATE (default: SYSDATE)
    Last time the query is used. Defaults to not available.
PRIORITY (default: 2)
    Must be one of the following values: 1 - HIGH, 2 - MEDIUM, or 3 - LOW.
SQL_TEXT
    The SQL statement. This is a required column.
STAT_PERIOD
    Period of time that corresponds to the execution statistics, in seconds.
USERNAME
    User submitting the query. This is a required column.
DBMS_ADVISOR.IMPORT_SQLWKLD_SQLCACHE (
   workload_name IN VARCHAR2,
   import_mode   IN VARCHAR2,
   priority      IN NUMBER,
   saved_rows    OUT NUMBER,
   failed_rows   OUT NUMBER);
See PL/SQL Packages and Types Reference for more information regarding the IMPORT_SQLWKLD_SQLCACHE procedure and its parameters. The following example loads the MYWORKLOAD workload created earlier from the SQL Cache. The priority of the loaded workload statements is 2 (medium).
VARIABLE saved_stmts NUMBER;
VARIABLE failed_stmts NUMBER;
EXECUTE DBMS_ADVISOR.IMPORT_SQLWKLD_SQLCACHE('MYWORKLOAD', 'APPEND', 2, :saved_stmts, :failed_stmts);
The SQLAccess Advisor can retrieve workload information from the SQL cache. If the collected data was retrieved from a server with the instance parameter cursor_sharing set to SIMILAR or FORCE, then user queries with embedded literal values will be converted to a statement that contains system-generated bind variables. If you are going to use the SQLAccess Advisor to recommend materialized views, then the server should set the instance parameter cursor_sharing to EXACT so that materialized views with WHERE clauses can be recommended.
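For example, an administrator who intends to rely on materialized view recommendations might set the parameter as follows (a simple sketch; whether to change it system-wide is a separate decision for your environment):

ALTER SYSTEM SET cursor_sharing = 'EXACT';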
Hypothetical workloads:
-  Require only schema and table relationships.
-  Are effective for modeling what-if scenarios.

However, hypothetical workloads:
-  Only work if dimensions or primary and foreign key constraints have been defined.
-  Offer no information about the impact of DML on the recommended access structures.
-  Are not necessarily complete.
To successfully import a hypothetical workload, the target schemas must contain dimension or primary and foreign key information. You use the IMPORT_SQLWKLD_SCHEMA procedure. The syntax is as follows:
DBMS_ADVISOR.IMPORT_SQLWKLD_SCHEMA (
   workload_name IN VARCHAR2,
   import_mode   IN VARCHAR2,
   priority      IN VARCHAR2,
   saved_rows    OUT NUMBER,
   failed_rows   OUT NUMBER);
See PL/SQL Packages and Types Reference for more information regarding the IMPORT_SQLWKLD_SCHEMA procedure and its parameters. You must configure external procedures to use this procedure. The following example creates a hypothetical workload called SCHEMA_WKLD, sets VALID_TABLE_LIST to sh and calls IMPORT_SQLWKLD_SCHEMA to generate a hypothetical workload.
VARIABLE workload_name VARCHAR2(255);
VARIABLE saved_stmts NUMBER;
VARIABLE failed_stmts NUMBER;
EXECUTE :workload_name := 'SCHEMA_WKLD';
EXECUTE DBMS_ADVISOR.CREATE_SQLWKLD(:workload_name);
EXECUTE DBMS_ADVISOR.SET_SQLWKLD_PARAMETER(:workload_name, 'VALID_TABLE_LIST', 'SH');
EXECUTE DBMS_ADVISOR.IMPORT_SQLWKLD_SCHEMA(:workload_name, 'NEW', 2, :saved_stmts, :failed_stmts);
When using IMPORT_SQLWKLD_SCHEMA, the VALID_TABLE_LIST parameter cannot contain wildcards such as SCO% or SCOTT.EMP%. The only form of wildcards supported is SCOTT.%, which specifies all tables in a given schema.
See PL/SQL Packages and Types Reference for more information regarding the IMPORT_SQLWKLD_SUMADV procedure and its parameters. The following example creates a SQL workload from an Oracle9i Summary Advisor workload. The workload_id of the Oracle9i workload is 777.
1.
2.
3.
See PL/SQL Packages and Types Reference for details of all the settings for the JOURNALING parameter. The information in the journal is for diagnostic purposes only and subject to change in future releases. It should not be used within any application.
See PL/SQL Packages and Types Reference for more information regarding the ADD_SQLWKLD_STATEMENT procedure and its parameters. The following example adds a single statement to the MYWORKLOAD workload.
VARIABLE sql_text VARCHAR2(400);
EXECUTE :sql_text := 'SELECT AVG(amount_sold) FROM sales';
EXECUTE DBMS_ADVISOR.ADD_SQLWKLD_STATEMENT('MYWORKLOAD', 'MONTHLY', 'ROLLUP', priority=>1, executions=>10, username => 'SH', sql_text => :sql_text);
The second form lets you delete statements that match a specified search condition.
DBMS_ADVISOR.DELETE_SQLWKLD_STATEMENT (
   workload_name IN VARCHAR2,
   search        IN VARCHAR2,
   deleted       OUT NUMBER);
The following example deletes from MYWORKLOAD all statements that satisfy the condition executions less than 5:
VARIABLE deleted_stmts NUMBER;
EXECUTE DBMS_ADVISOR.DELETE_SQLWKLD_STATEMENT('MYWORKLOAD', 'executions < 5', :deleted_stmts);
A workload cannot be modified or deleted if it is currently referenced by an active task. A task is considered active if it is not in its initial state. See the RESET_TASK procedure to set a task to its initial state. See PL/SQL Packages and Types Reference for more information regarding the DELETE_SQLWKLD_STATEMENT procedure and its parameters.
The second form enables you to update all SQL statements satisfying a given search condition.
DBMS_ADVISOR.UPDATE_SQLWKLD_STATEMENT (
   workload_name IN VARCHAR2,
   search        IN VARCHAR2,
   updated       OUT NUMBER,
   module        IN VARCHAR2,
   action        IN VARCHAR2,
   priority      IN NUMBER,
   username      IN VARCHAR2);
The following example changes the priority to 3 for all statements in MYWORKLOAD that have executions less than 10. The count of updated statements is returned in the updated_stmts variable.
VARIABLE updated_stmts NUMBER;
EXECUTE DBMS_ADVISOR.UPDATE_SQLWKLD_STATEMENT('MYWORKLOAD', 'executions < 10', :updated_stmts, priority => 3);
See PL/SQL Packages and Types Reference for more information regarding the UPDATE_SQLWKLD_STATEMENT procedure and its parameters.
Maintaining Workloads
There are several other operations that can be performed upon a workload, including the following:
-  Setting Workload Attributes
-  Resetting Workloads
-  Removing a Link Between a Workload and a Task
See PL/SQL Packages and Types Reference for more information regarding the UPDATE_SQLWKLD_ATTRIBUTES procedure and its parameters.
Resetting Workloads
The RESET_SQLWKLD procedure resets a workload to its initial starting point. This has the effect of removing all journal and log messages and recalculating volatility statistics, while leaving the workload data untouched. This procedure should be executed after any workload adjustments, such as adding or removing SQL statements. The following example resets workload MYWORKLOAD.
EXECUTE DBMS_ADVISOR.RESET_SQLWKLD('MYWORKLOAD');
See PL/SQL Packages and Types Reference for more information regarding the RESET_ SQLWKLD procedure and its parameters.
A link between a task and a workload can be removed using the DELETE_SQLWKLD_REF procedure. The following example deletes the link between task MYTASK and SQL workload MYWORKLOAD:
EXECUTE DBMS_ADVISOR.DELETE_SQLWKLD_REF('MYTASK', 'MYWORKLOAD');
Removing Workloads
When workloads are no longer needed, they can be removed using the procedure DELETE_SQLWKLD. You can delete all workloads or a specific collection, but a workload cannot be deleted if it is still linked to a task. The following procedure is an example of removing a specific workload. It deletes an existing workload from the repository.
DBMS_ADVISOR.DELETE_SQLWKLD (workload_name IN VARCHAR2);

EXECUTE DBMS_ADVISOR.DELETE_SQLWKLD('MYWORKLOAD');
See PL/SQL Packages and Types Reference for more information regarding the DELETE_SQLWKLD procedure and its parameters.
Recommendation Options
Before recommendations can be generated, the parameters for the task must first be defined using the SET_TASK_PARAMETER procedure; if they are not defined, the defaults are used. The syntax is as follows.
DBMS_ADVISOR.SET_TASK_PARAMETER (
   task_name  IN VARCHAR2,
   parameter  IN VARCHAR2,
   value      IN VARCHAR2);
There are many task parameters and, to help identify the relevant ones, they have been grouped into categories in Table 17-2.
Table 17-2 Types of Advisor Task Parameters and Their Uses

Task Configuration: DAYS_TO_EXPIRE, REPORT_DATE_FORMAT, JOURNALING
Schema Attributes: DEF_INDEX_OWNER, DEF_INDEX_TABLESPACE, DEF_MVIEW_OWNER
Recommendation Options: DML_VOLATILITY, EVALUATION_ONLY, EXECUTION_TYPE
Workload Filtering: SQL_LIMIT, START_TIME, USERNAME_LIST, VALID_TABLE_LIST, WORKLOAD_SCOPE, MODULE_LIMIT, TIME_LIMIT, END_TIME, COMMENTED_FILTER_LIST
The following example sets the storage change of task MYTASK to 100MB. This indicates that 100MB of additional space can be used for recommendations. A zero value would indicate that no additional space can be allocated. A negative value indicates that the advisor must attempt to trim the current space utilization by the specified amount.
EXECUTE DBMS_ADVISOR.SET_TASK_PARAMETER('MYTASK','STORAGE_CHANGE', 100000000);
In the following example, we set the VALID_TABLE_LIST parameter to filter out all queries that do not reference the tables SH.SALES and SH.CUSTOMERS.
EXECUTE DBMS_ADVISOR.SET_TASK_PARAMETER ( 'MYTASK', 'VALID_TABLE_LIST', 'SH.SALES, SH.CUSTOMERS');
See PL/SQL Packages and Types Reference for more information regarding the SET_TASK_PARAMETER procedure and its parameters.
Generating Recommendations
You can generate recommendations by using the EXECUTE_TASK procedure with your task name. After the procedure finishes, you can check the DBA_ADVISOR_LOG table for the actual execution status and the number of recommendations and actions that have been produced. The recommendations can be queried by task name in {DBA, USER}_ADVISOR_RECOMMENDATIONS and the actions for these recommendations can be viewed by task in {DBA, USER}_ADVISOR_ACTIONS.
EXECUTE_TASK Procedure
This procedure performs the SQLAccess Advisor analysis or evaluation for the specified task. Task execution is a synchronous operation, so control will not be returned to the user until the operation has completed, or a user-interrupt was detected. Upon return or execution of the task, you can check the DBA_ADVISOR_LOG table for the actual execution status. Running EXECUTE_TASK generates recommendations, where a recommendation comprises one or more actions, such as creating a materialized view log and a materialized view. The syntax is as follows:
DBMS_ADVISOR.EXECUTE_TASK (task_name IN VARCHAR2);
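For illustration, a hedged sketch of running the task from the earlier examples and then checking its status (the STATUS column name is assumed from the advisor log views and should be verified against your release):

EXECUTE DBMS_ADVISOR.EXECUTE_TASK('MYTASK');

SELECT task_name, status FROM USER_ADVISOR_LOG WHERE task_name = 'MYTASK';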
See PL/SQL Packages and Types Reference for more information regarding the EXECUTE_TASK procedure and its parameters.
To identify which query benefits from which recommendation, you can use the views DBA_ADVISOR_SQLA_WK_STMTS and USER_ADVISOR_SQLA_WK_STMTS. The precost and postcost numbers are in terms of the estimated optimizer cost (shown in EXPLAIN PLAN) without and with the recommended access structure changes, respectively. To see recommendations for each query, issue the following statement:
SELECT sql_id, rec_id, precost, postcost,
       (precost-postcost)*100/precost AS percent_benefit
FROM USER_ADVISOR_SQLA_WK_STMTS
WHERE TASK_NAME = :task_name AND workload_name = :workload_name;

    SQL_ID     REC_ID    PRECOST   POSTCOST PERCENT_BENEFIT
---------- ---------- ---------- ---------- ---------------
       121          1       3003        249      91.7082917
       122          2       1404        182       87.037037
       123          3       5503          4      99.9273124
       124          4        730        136       81.369863
Each recommendation consists of one or more actions, which must be implemented together to realize the benet provided by the recommendation. The SQLAccess Advisor produces the following types of actions:
- CREATE|DROP|RETAIN MATERIALIZED VIEW
- CREATE|ALTER|RETAIN MATERIALIZED VIEW LOG
- CREATE|DROP|RETAIN INDEX
- GATHER STATS
The CREATE actions correspond to new access structures. RETAIN recommendations indicate that existing access structures must be kept. DROP recommendations are only produced if the WORKLOAD_SCOPE parameter is set to FULL. The GATHER STATS action generates a call to the DBMS_STATS procedure to gather statistics on a newly generated access structure. Note that multiple recommendations may refer to the same action; however, when generating a script for the recommendation, you will only see each action once. In the following example, you can see how many distinct actions there are for this set of recommendations.
SELECT 'Action Count', COUNT(DISTINCT action_id) cnt
FROM user_advisor_actions WHERE task_name = :task_name;

'ACTIONCOUNT        CNT
------------ ----------
Action Count         20

-- see the actions for each recommendation
SELECT rec_id, action_id, SUBSTR(command,1,30) AS command
FROM user_advisor_actions
WHERE task_name = :task_name ORDER BY rec_id, action_id;

    REC_ID  ACTION_ID COMMAND
---------- ---------- ------------------------------
         1          5 CREATE MATERIALIZED VIEW LOG
         1          6 ALTER MATERIALIZED VIEW LOG
         1          7 CREATE MATERIALIZED VIEW LOG
         1          8 ALTER MATERIALIZED VIEW LOG
         1          9 CREATE MATERIALIZED VIEW LOG
         1         10 ALTER MATERIALIZED VIEW LOG
         1         11 CREATE MATERIALIZED VIEW
         1         12 GATHER TABLE STATISTICS
         1         19 CREATE INDEX
         1         20 GATHER INDEX STATISTICS
         2          5 CREATE MATERIALIZED VIEW LOG
         2          6 ALTER MATERIALIZED VIEW LOG
         2          9 CREATE MATERIALIZED VIEW LOG
...
Each action has several attributes that pertain to the properties of the access structure. The name and tablespace for each access structure, when applicable, are placed in attr1 and attr2, respectively. The space occupied by each new access structure is in num_attr1. All other attributes are different for each action. Table 17-3 maps SQLAccess Advisor action information to the corresponding columns in DBA_ADVISOR_ACTIONS.
Table 17-3 SQLAccess Advisor Action Attributes

CREATE INDEX: ATTR1 = index name; ATTR2 = tablespace; ATTR3 = target table; ATTR4 = BITMAP or BTREE; ATTR5 = index column list; ATTR6 = unused; NUM_ATTR1 = storage size in bytes for the index.

CREATE MATERIALIZED VIEW: ATTR1 = materialized view name; ATTR2 = tablespace; ATTR3 = REFRESH COMPLETE, REFRESH FAST, REFRESH FORCE, or NEVER REFRESH; ATTR4 = ENABLE QUERY REWRITE or DISABLE QUERY REWRITE; ATTR5 = SQL SELECT statement; ATTR6 = unused; NUM_ATTR1 = storage size in bytes for the materialized view.

CREATE MATERIALIZED VIEW LOG: ATTR1 = target table; ATTR3 = ROWID, PRIMARY KEY, SEQUENCE, OBJECT ID; remaining attributes unused.
Table 17-3 (Cont.) SQLAccess Advisor Action Attributes

RETAIN INDEX: ATTR1 = index name; ATTR3 = target table; ATTR4 = BITMAP or BTREE; ATTR5 = index columns; NUM_ATTR1 = storage size in bytes for the index.

RETAIN MATERIALIZED VIEW: ATTR1 = materialized view name; ATTR5 = SQL SELECT statement; NUM_ATTR1 = storage size in bytes for the materialized view.
The following PL/SQL procedure can be used to print out some of the attributes of the recommendations.
CONNECT SH/SH;

CREATE OR REPLACE PROCEDURE show_recm (in_task_name IN VARCHAR2) IS
  CURSOR curs IS
    SELECT DISTINCT action_id, command, attr1, attr2, attr3, attr4
    FROM user_advisor_actions
    WHERE task_name = in_task_name
    ORDER BY action_id;
  v_action  NUMBER;
  v_command VARCHAR2(32);
  v_attr1   VARCHAR2(4000);
  v_attr2   VARCHAR2(4000);
  v_attr3   VARCHAR2(4000);
  v_attr4   VARCHAR2(4000);
  v_attr5   VARCHAR2(4000);
BEGIN
  OPEN curs;
  DBMS_OUTPUT.PUT_LINE('=========================================');
  DBMS_OUTPUT.PUT_LINE('Task_name = ' || in_task_name);
  LOOP
    FETCH curs INTO v_action, v_command, v_attr1, v_attr2, v_attr3, v_attr4;
    EXIT WHEN curs%NOTFOUND;
    DBMS_OUTPUT.PUT_LINE('Action ID: ' || v_action);
    DBMS_OUTPUT.PUT_LINE('Command : ' || v_command);
    DBMS_OUTPUT.PUT_LINE('Attr1 (name) : ' || SUBSTR(v_attr1,1,30));
    DBMS_OUTPUT.PUT_LINE('Attr2 (tablespace): ' || SUBSTR(v_attr2,1,30));
    DBMS_OUTPUT.PUT_LINE('Attr3 : ' || SUBSTR(v_attr3,1,30));
    DBMS_OUTPUT.PUT_LINE('Attr4 : ' || v_attr4);
    DBMS_OUTPUT.PUT_LINE('Attr5 : ' || v_attr5);
    DBMS_OUTPUT.PUT_LINE('----------------------------------------');
  END LOOP;
  CLOSE curs;
  DBMS_OUTPUT.PUT_LINE('=========END RECOMMENDATIONS============');
END show_recm;
/

-- see what the actions are using sample procedure
set serveroutput on size 99999
EXECUTE show_recm(:task_name);

A fragment of a sample output from this procedure is as follows:

Task_name = MYTASK
Action ID: 1
Command : CREATE MATERIALIZED VIEW LOG
Attr1 (name) : "SH"."CUSTOMERS"
Attr2 (tablespace):
Attr3 : ROWID, SEQUENCE
Attr4 : INCLUDING NEW VALUES
Attr5 :
----------------------------------------
..
----------------------------------------
Action ID: 15
Command : CREATE MATERIALIZED VIEW
Attr1 (name) : "SH"."SH_MV$$_0004"
Attr2 (tablespace): "SH_MVIEWS"
Attr3 : REFRESH FAST WITH ROWID
Attr4 : ENABLE QUERY REWRITE
Attr5 :
----------------------------------------
..
----------------------------------------
Action ID: 19
Command : CREATE INDEX
Attr1 (name) : "SH"."SH_IDX$$_0013"
Attr2 (tablespace): "SH_INDEXES"
Attr3 : "SH"."SH_MV$$_0002"
Attr4 : BITMAP
Attr5 :
See PL/SQL Packages and Types Reference for details regarding Attr5 and Attr6.
See PL/SQL Packages and Types Reference for details of all the settings for the journaling parameter. The information in the journal is for diagnostic purposes only and subject to change in future releases. It should not be used within any application.
Canceling Tasks
The CANCEL_TASK procedure causes a currently executing operation to terminate. An Advisor operation may take a few seconds to respond to the call. Because all Advisor task procedures are synchronous, to cancel an operation, you must use a separate database session. A cancel command effectively restores the task to its condition prior to the start of the cancelled operation. Therefore, a cancelled task or data object cannot be restarted.
DBMS_ADVISOR.CANCEL_TASK (task_name IN VARCHAR2);
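For example, a minimal sketch, issued from a second session while the task MYTASK from the earlier examples is still executing:

EXECUTE DBMS_ADVISOR.CANCEL_TASK('MYTASK');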
See PL/SQL Packages and Types Reference for more information regarding the CANCEL_TASK procedure and its parameters.
Marking Recommendations
By default, all SQLAccess Advisor recommendations are ready to be implemented; however, the user can choose to skip or exclude selected recommendations by using the MARK_RECOMMENDATION procedure. MARK_RECOMMENDATION allows the user to annotate a recommendation with a REJECT or IGNORE setting, which causes GET_TASK_SCRIPT to skip it when producing the implementation script.
DBMS_ADVISOR.MARK_RECOMMENDATION (
   task_name  IN VARCHAR2,
   id         IN NUMBER,
   action     IN VARCHAR2);
The following example marks a recommendation with ID 2 as REJECT. This recommendation and any dependent recommendations will not appear in the script.
EXECUTE DBMS_ADVISOR.MARK_RECOMMENDATION('MYTASK', 2, 'REJECT');
See PL/SQL Packages and Types Reference for more information regarding the MARK_RECOMMENDATION procedure and its parameters.
Modifying Recommendations
The SQLAccess Advisor names and assigns ownership to new objects, such as indexes and materialized views, during the analysis operation. However, it does not necessarily choose appropriate names, so you may manually set the owner, name, and tablespace values for new objects by using the UPDATE_REC_ATTRIBUTES procedure. For recommendations referencing existing database objects, owner and name values cannot be changed. The syntax is as follows:
DBMS_ADVISOR.UPDATE_REC_ATTRIBUTES (
   task_name       IN VARCHAR2,
   rec_id          IN NUMBER,
   action_id       IN NUMBER,
   attribute_name  IN VARCHAR2,
   value           IN VARCHAR2);
Valid attribute_name values include BASE_TABLE, which specifies the base table reference for the recommended object, OWNER, and TABLESPACE (used in the following example).
The following example modifies the attribute TABLESPACE for recommendation ID 1, action ID 1 to SH_MVIEWS.
EXECUTE DBMS_ADVISOR.UPDATE_REC_ATTRIBUTES('MYTASK', 1, 1, 'TABLESPACE', 'SH_MVIEWS');
See PL/SQL Packages and Types Reference for more information regarding the UPDATE_REC_ATTRIBUTES procedure and its parameters.
To save the script to a file, a directory path must be supplied so that the procedure CREATE_FILE knows where to store the script. In addition, read and write privileges must be granted on this directory. The following example shows how to save an advisor script CLOB to a file:
-- create a directory and grant permissions to read/write to it
CONNECT SH/SH;
CREATE DIRECTORY ADVISOR_RESULTS AS '/mydir';
GRANT READ ON DIRECTORY ADVISOR_RESULTS TO PUBLIC;
GRANT WRITE ON DIRECTORY ADVISOR_RESULTS TO PUBLIC;
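The script itself can then be written to that directory with GET_TASK_SCRIPT and CREATE_FILE, as shown later in this chapter for TUNE_MVIEW. A hedged sketch, assuming the task MYTASK from the earlier examples and a hypothetical file name advscript.sql:

EXECUTE DBMS_ADVISOR.CREATE_FILE(DBMS_ADVISOR.GET_TASK_SCRIPT('MYTASK'), -
   'ADVISOR_RESULTS', 'advscript.sql');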
The following is a fragment of a script generated by this procedure. The script also includes PL/SQL calls to gather stats on the recommended access structures and marks the recommendations as IMPLEMENTED at the end.
Rem Access Advisor V10.0.0.0.0 - Beta Rem Rem Username: SH Rem Task: MYTASK Rem Execution date: 15/04/2003 11:35 Rem set feedback 1 set linesize 80 set trimspool on set tab off set pagesize 60 whenever sqlerror CONTINUE CREATE MATERIALIZED VIEW LOG ON "SH"."PRODUCTS" WITH ROWID, SEQUENCE("PROD_ID","PROD_SUBCATEGORY") INCLUDING NEW VALUES; ALTER MATERIALIZED VIEW LOG FORCE ON "SH"."PRODUCTS" ADD ROWID, SEQUENCE("PROD_ID","PROD_SUBCATEGORY") INCLUDING NEW VALUES; .. CREATE MATERIALIZED VIEW "SH"."MV$$_00510002" REFRESH FAST WITH ROWID ENABLE QUERY REWRITE AS SELECT SH.CUSTOMERS.CUST_STATE_PROVINCE C1, COUNT(*) M1 FROM SH.CUSTOMERS WHERE (SH.CUSTOMERS.CUST_STATE_PROVINCE = 'CA') GROUP BY SH.CUSTOMERS.CUST_STATE_PROVINCE; BEGIN DBMS_STATS.GATHER_TABLE_STATS('"SH"', '"MV$$_00510002"', NULL, DBMS_STATS.AUTO_SAMPLE_SIZE); END; / .. CREATE BITMAP INDEX "SH"."MV$$_00510004_IDX$$_00510013" ON "SH"."MV$$_00510004" ("C4"); whenever sqlerror EXIT SQL.SQLCODE BEGIN DBMS_ADVISOR.MARK_RECOMMENDATION('"MYTASK"',1,'IMPLEMENTED');
See Also: Oracle Database SQL Reference for CREATE DIRECTORY syntax and PL/SQL Packages and Types Reference for detailed information about the GET_TASK_SCRIPT procedure
See PL/SQL Packages and Types Reference for more information regarding the RESET_TASK procedure and its parameters.
The following example shows how to quick tune a single SQL statement:
VARIABLE task_name VARCHAR2(255);
VARIABLE sql_stmt VARCHAR2(4000);
EXECUTE :sql_stmt := 'SELECT COUNT(*) FROM customers WHERE cust_state_province=''CA''';
EXECUTE :task_name := 'MY_QUICKTUNE_TASK';
EXECUTE DBMS_ADVISOR.QUICK_TUNE(DBMS_ADVISOR.SQLACCESS_ADVISOR, :task_name, :sql_stmt);
See PL/SQL Packages and Types Reference for more information regarding the QUICK_TUNE procedure and its parameters.
Managing Tasks
Every time recommendations are generated, tasks are created and, unless some maintenance is performed on these tasks, they will grow over time and occupy storage space. There may also be tasks that you want to keep and protect from accidental deletion. Therefore, several management operations can be performed on tasks:
- Change the name of a task.
- Give a task a description.
- Set the task to be read-only so it cannot be changed.
- Make the task a template upon which other tasks can be defined.
- Change various attributes of a task or a task template.
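These operations are performed with the UPDATE_TASK_ATTRIBUTES procedure. A hedged sketch follows; the named parameters (new_name, read_only, is_template) and their literal values are assumptions to be checked against PL/SQL Packages and Types Reference:

EXECUTE DBMS_ADVISOR.UPDATE_TASK_ATTRIBUTES('MYTASK', new_name => 'TUNING1');
EXECUTE DBMS_ADVISOR.UPDATE_TASK_ATTRIBUTES('TUNING1', read_only => 'TRUE');
EXECUTE DBMS_ADVISOR.UPDATE_TASK_ATTRIBUTES('TUNING1', is_template => 'TRUE');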
See PL/SQL Packages and Types Reference for more information regarding the UPDATE_TASK_ATTRIBUTES procedure and its parameters.
Deleting Tasks
The DELETE_TASK procedure deletes existing Advisor tasks from the repository. The syntax is as follows:
DBMS_ADVISOR.DELETE_TASK (task_name IN VARCHAR2);
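For example, a minimal sketch that removes the task named in the earlier examples:

EXECUTE DBMS_ADVISOR.DELETE_TASK('MYTASK');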
See PL/SQL Packages and Types Reference for more information regarding the DELETE_TASK procedure and its parameters.
Setting DAYS_TO_EXPIRE
When a task or workload object is created, the parameter DAYS_TO_EXPIRE is set to 30. The value indicates the number of days until the task or object will automatically be deleted by the system. If you wish to save a task or workload indefinitely, the DAYS_TO_EXPIRE parameter should be set to ADVISOR_UNLIMITED.
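A hedged sketch of such a call, using SET_TASK_PARAMETER as shown earlier (whether the value is passed as the literal shown here or as the package constant DBMS_ADVISOR.ADVISOR_UNLIMITED should be confirmed in PL/SQL Packages and Types Reference):

EXECUTE DBMS_ADVISOR.SET_TASK_PARAMETER('MYTASK', 'DAYS_TO_EXPIRE', 'ADVISOR_UNLIMITED');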
Table 17-4 lists the DBMS_ADVISOR constants, such as ADVISOR_ALL and ADVISOR_UNLIMITED; see PL/SQL Packages and Types Reference for the complete list and their meanings.
SUM(s.amount_sold) AS dollars, s.channel_id, s.promo_id FROM sales s, times t, products p WHERE s.time_id = t.time_id AND s.prod_id = p.prod_id AND s.prod_id > 10 AND s.prod_id < 50 GROUP BY t.week_ending_day, p.prod_subcategory, s.channel_id, s.promo_id') / -- aggregation with selection INSERT INTO user_workload (username, module, action, priority, sql_text) VALUES ('SH', 'Example1', 'Action', 2, 'SELECT t.calendar_month_desc, SUM(s.amount_sold) AS dollars FROM sales s , times t WHERE s.time_id = t.time_id AND s.time_id between TO_DATE(''01-JAN-2000'', ''DD-MON-YYYY'') AND TO_DATE(''01-JUL-2000'', ''DD-MON-YYYY'') GROUP BY t.calendar_month_desc') / --Load all SQL queries. INSERT INTO user_workload (username, module, action, priority, sql_text) VALUES ('SH', 'Example1', 'Action', 2, 'SELECT ch.channel_class, c.cust_city, t.calendar_quarter_desc, SUM(s.amount_sold) sales_amount FROM sales s, times t, customers c, channels ch WHERE s.time_id = t.time_id AND s.cust_id = c.cust_id AND s.channel_id = ch.channel_id AND c.cust_state_province = ''CA'' AND ch.channel_desc IN (''Internet'',''Catalog'') AND t.calendar_quarter_desc IN (''1999-Q1'',''1999-Q2'') GROUP BY ch.channel_class, c.cust_city, t.calendar_quarter_desc') / -- order by INSERT INTO user_workload (username, module, action, priority, sql_text) VALUES ('SH', 'Example1', 'Action', 2, 'SELECT c.country_id, c.cust_city, c.cust_last_name FROM customers c WHERE c.country_id IN (52790, 52789) ORDER BY c.country_id, c.cust_city, c.cust_last_name') / COMMIT; CONNECT SH/SH; set serveroutput on; VARIABLE task_id NUMBER; VARIABLE task_name VARCHAR2(255);
See "Viewing the Recommendations" on page 17-26 or "Generating SQL Scripts" on page 17-34 for further details.
-- See recommendation for each query. SELECT sql_id, rec_id, precost, postcost, (precost-postcost)*100/precost AS percent_benefit FROM user_advisor_sqla_wk_stmts
WHERE task_name = :task_name AND workload_name = :workload_name; -- See the actions for each recommendations. SELECT rec_id, action_id, SUBSTR(command,1,30) AS command FROM user_advisor_actions WHERE task_name = :task_name ORDER BY rec_id, action_id; -- See what the actions are using sample procedure. SET SERVEROUTPUT ON SIZE 99999 EXECUTE show_recm(:task_name);
Step 2: Set template parameters
Set naming conventions for recommended indexes and materialized views.
EXECUTE DBMS_ADVISOR.SET_TASK_PARAMETER ( :template_name, 'INDEX_NAME_TEMPLATE', 'SH_IDX$$_<SEQ>'); EXECUTE DBMS_ADVISOR.SET_TASK_PARAMETER ( :template_name, 'MVIEW_NAME_TEMPLATE', 'SH_MV$$_<SEQ>'); --Set default owners for recommended indexes/materialized views. EXECUTE DBMS_ADVISOR.SET_TASK_PARAMETER ( :template_name, 'DEF_INDEX_OWNER', 'SH'); EXECUTE DBMS_ADVISOR.SET_TASK_PARAMETER ( :template_name, 'DEF_MVIEW_OWNER', 'SH');
--Set default tablespace for recommended indexes/materialized views. EXECUTE DBMS_ADVISOR.SET_TASK_PARAMETER ( :template_name, 'DEF_INDEX_TABLESPACE', 'SH_INDEXES'); EXECUTE DBMS_ADVISOR.SET_TASK_PARAMETER ( :template_name, 'DEF_MVIEW_TABLESPACE', 'SH_MVIEWS');
With the TUNE_MVIEW procedure, you no longer require a detailed understanding of materialized views to create a materialized view in an application, because the materialized view and its required components (such as a materialized view log) will be created correctly through the procedure. See PL/SQL Packages and Types Reference for detailed information about the TUNE_MVIEW procedure.
DBMS_ADVISOR.TUNE_MVIEW Procedure
This section discusses the following information:
- TUNE_MVIEW Syntax and Operations
- Accessing TUNE_MVIEW Output Results
- USER_TUNE_MVIEW and DBA_TUNE_MVIEW Views
The TUNE_MVIEW procedure takes two input parameters: task_name and mv_create_stmt. task_name is a user-provided task identifier used to access the output results. mv_create_stmt is a complete CREATE MATERIALIZED VIEW statement that is to be tuned. If the input CREATE MATERIALIZED VIEW statement does not have the REFRESH FAST or ENABLE QUERY REWRITE clauses, or both, TUNE_MVIEW uses the default clauses REFRESH FORCE and DISABLE QUERY REWRITE to tune the statement to be fast refreshable if possible, or only complete refreshable otherwise. The TUNE_MVIEW procedure handles a broad range of CREATE MATERIALIZED VIEW statements that can have arbitrary defining queries in them. The defining query could be a simple SELECT statement or a complex query with set operators or inline views. When the defining query of the materialized view contains the clause REFRESH FAST, TUNE_MVIEW analyzes the query and checks to see if it is fast refreshable. If it is already fast refreshable, the procedure returns a message saying "the materialized view is already optimal and cannot be further tuned". Otherwise, the TUNE_MVIEW procedure starts the tuning work on the given statement.
The TUNE_MVIEW procedure can generate output statements that correct the defining query by adding extra columns, such as required aggregate columns, or fix the materialized view logs to achieve the FAST REFRESH goal. In the case of a complex defining query, the TUNE_MVIEW procedure decomposes the query and generates two or more fast refreshable materialized views, or restates the materialized view in a way that fulfills the fast refresh requirements as much as possible. The TUNE_MVIEW procedure supports defining queries with the following complex query constructs:
- Set operators (UNION, UNION ALL, MINUS, and INTERSECT)
- COUNT DISTINCT
- SELECT DISTINCT
- Inline views
When the ENABLE QUERY REWRITE clause is specified, TUNE_MVIEW also fixes the statement, using a process similar to REFRESH FAST, by redefining the materialized view so that as many of the advanced forms of query rewrite as possible can be used. The TUNE_MVIEW procedure generates two sets of output results as executable statements. One set of the output (IMPLEMENTATION) is for implementing materialized views and required components, such as materialized view logs or rewrite equivalences, to achieve fast refreshability and query rewritability as much as possible. The other set of the output (UNDO) is for dropping the materialized views and the rewrite equivalences in case you decide they are not required. The output statements for the IMPLEMENTATION process include:
- CREATE MATERIALIZED VIEW LOG statements: creates any missing materialized view logs required for fast refresh.
- ALTER MATERIALIZED VIEW LOG FORCE statements: fixes any materialized view log related requirements, such as missing filter columns, sequence, and so on, required for fast refresh.
- One or more CREATE MATERIALIZED VIEW statements: in the case of one output statement, the original defining query is directly restated and transformed. Simple query transformation could be just adding required columns; for example, adding a rowid column for a materialized join view or an aggregate column for a materialized aggregate view. In the case of decomposition, multiple CREATE MATERIALIZED VIEW statements are generated and form a nested materialized view hierarchy in which one or more submaterialized views are referenced by a new top-level materialized view modified from the original statement.
This is done to achieve fast refresh and query rewrite as much as possible. Submaterialized views are often fast refreshable.
- BUILD_SAFE_REWRITE_EQUIVALENCE statement: enables the rewrite of the top-level materialized view using submaterialized views. It is required to enable query rewrite when decomposition occurs.
Note that the decomposition result implies no sharing of submaterialized views. That is, in the case of decomposition, the TUNE_MVIEW output will always contain a new submaterialized view and will not reference existing materialized views. The output statements for the UNDO process include:
- DROP MATERIALIZED VIEW statements to reverse the materialized view creations (including submaterialized views) made by the IMPLEMENTATION process.
- DROP_REWRITE_EQUIVALENCE statement to remove the rewrite equivalence relationship built by the IMPLEMENTATION process, if needed.
Note that the UNDO process does not include statements to drop materialized view logs. This is because materialized view logs can be shared by many different materialized views, some of which may reside on remote Oracle instances.
You can access the TUNE_MVIEW output results in two ways:
- By generating a script, using the DBMS_ADVISOR.GET_TASK_SCRIPT function and the DBMS_ADVISOR.CREATE_FILE procedure
- By querying the USER_TUNE_MVIEW or DBA_TUNE_MVIEW views
Now generate both the implementation and undo scripts and place them in /tmp/script_dir/mv_create.sql and /tmp/script_dir/mv_undo.sql, respectively.
EXECUTE DBMS_ADVISOR.CREATE_FILE(DBMS_ADVISOR.GET_TASK_SCRIPT(:task_name),'TUNE_RESULTS', 'mv_create.sql'); EXECUTE DBMS_ADVISOR.CREATE_FILE(DBMS_ADVISOR.GET_TASK_SCRIPT(:task_name, 'UNDO'), 'TUNE_RESULTS', 'mv_undo.sql');
This example shows how TUNE_MVIEW changes the defining query to be fast refreshable. A CREATE MATERIALIZED VIEW statement is defined in the variable create_mv_ddl. This create statement has the REFRESH FAST clause specified. Its defining query contains a single query block in which an aggregate column, SUM(s.amount_sold), does not have the required aggregate columns to support fast refresh. If you execute the TUNE_MVIEW statement with this CREATE MATERIALIZED VIEW statement, the output produced will be fast refreshable:
VARIABLE task_cust_mv VARCHAR2(30); VARIABLE create_mv_ddl VARCHAR2(4000); EXECUTE :task_cust_mv := 'cust_mv'; EXECUTE :create_mv_ddl := ' CREATE MATERIALIZED VIEW cust_mv REFRESH FAST DISABLE QUERY REWRITE AS SELECT s.prod_id, s.cust_id, SUM(s.amount_sold) sum_amount FROM sales s, customers cs WHERE s.cust_id = cs.cust_id GROUP BY s.prod_id, s.cust_id'; EXECUTE DBMS_ADVISOR.TUNE_MVIEW(:task_cust_mv, :create_mv_ddl);
The projected output of TUNE_MVIEW includes an optimized materialized view defining query as follows:
CREATE MATERIALIZED VIEW SH.CUST_MV REFRESH FAST WITH ROWID DISABLE QUERY REWRITE AS SELECT SH.SALES.PROD_ID C1, SH.CUSTOMERS.CUST_ID C2, SUM("SH"."SALES"."AMOUNT_SOLD") M1, COUNT("SH"."SALES"."AMOUNT_SOLD") M2,
COUNT(*) M3 FROM SH.SALES, SH.CUSTOMERS WHERE SH.CUSTOMERS.CUST_ID = SH.SALES.CUST_ID GROUP BY SH.SALES.PROD_ID, SH.CUSTOMERS.CUST_ID;
The original defining query of cust_mv has been modified by adding aggregate columns in order to be fast refreshable.
Example 17-4 Access IMPLEMENTATION Output Through USER_TUNE_MVIEW View
SELECT * FROM USER_TUNE_MVIEW
WHERE TASK_NAME = :task_cust_mv AND SCRIPT_TYPE = 'IMPLEMENTATION';

Example 17-5 Save IMPLEMENTATION Output in a Script File
CREATE DIRECTORY TUNE_RESULTS AS '/myscript';
GRANT READ, WRITE ON DIRECTORY TUNE_RESULTS TO PUBLIC;

EXECUTE DBMS_ADVISOR.CREATE_FILE(DBMS_ADVISOR.GET_TASK_SCRIPT(:task_cust_mv), -
   'TUNE_RESULTS', 'mv_create.sql');

Example 17-6 Enable Query Rewrite by Creating Multiple Materialized Views
This example shows how a materialized view's defining query with set operators can be decomposed into a number of submaterialized views. The input detail tables are assumed to be sales, customers, and countries, and they do not have materialized view logs. First, you need to execute the TUNE_MVIEW statement with the materialized view CREATE statement defined in the variable create_mv_ddl.
EXECUTE :task_cust_mv := 'cust_mv2'; EXECUTE :create_mv_ddl := ' CREATE MATERIALIZED VIEW cust_mv ENABLE QUERY REWRITE AS SELECT s.prod_id, s.cust_id, COUNT(*) cnt, SUM(s.amount_sold) sum_amount FROM sales s, customers cs, countries cn WHERE s.cust_id = cs.cust_id AND cs.country_id = cn.country_id AND cn.country_name IN (''USA'',''Canada'') GROUP BY s.prod_id, s.cust_id UNION -
SELECT s.prod_id, s.cust_id, COUNT(*) cnt, SUM(s.amount_sold) sum_amount FROM sales s, customers cs WHERE s.cust_id = cs.cust_id AND s.cust_id IN (1005,1010,1012) GROUP BY s.prod_id, s.cust_id'; EXECUTE DBMS_ADVISOR.TUNE_MVIEW(:task_cust_mv, :create_mv_ddl);
The materialized view defining query contains a UNION set operator and does not support general query rewrite. In order to support general query rewrite, the materialized view defining query will be decomposed. The projected output for the IMPLEMENTATION statement will be created along with materialized view log statements and two submaterialized views as follows:
CREATE MATERIALIZED VIEW LOG ON "SH"."SALES" WITH ROWID, SEQUENCE("CUST_ID") INCLUDING NEW VALUES; ALTER MATERIALIZED VIEW LOG FORCE ON "SH"."CUSTOMERS" ADD ROWID, SEQUENCE("CUST_ID") INCLUDING NEW VALUES; CREATE MATERIALIZED VIEW LOG ON "SH"."SALES" WITH ROWID, SEQUENCE("PROD_ID","CUST_ID","AMOUNT_SOLD") INCLUDING NEW VALUES; ALTER MATERIALIZED VIEW LOG FORCE ON "SH"."SALES" ADD ROWID, SEQUENCE("PROD_ID","CUST_ID","AMOUNT_SOLD") INCLUDING NEW VALUES; CREATE MATERIALIZED VIEW LOG ON "SH"."COUNTRIES" WITH ROWID, SEQUENCE("COUNTRY_ID","COUNTRY_NAME") INCLUDING NEW VALUES; ALTER MATERIALIZED VIEW LOG FORCE ON "SH"."COUNTRIES" ADD ROWID, SEQUENCE("COUNTRY_ID","COUNTRY_NAME") INCLUDING NEW VALUES; ALTER MATERIALIZED VIEW LOG FORCE ON "SH"."CUSTOMERS"
ADD ROWID, SEQUENCE("CUST_ID","COUNTRY_ID") INCLUDING NEW VALUES; ALTER MATERIALIZED VIEW LOG FORCE ON "SH"."SALES" ADD ROWID, SEQUENCE("PROD_ID","CUST_ID","AMOUNT_SOLD") INCLUDING NEW VALUES; CREATE MATERIALIZED VIEW SH.CUST_MV$SUB1 REFRESH FAST WITH ROWID ON COMMIT ENABLE QUERY REWRITE AS SELECT SH.SALES.PROD_ID C1, SH.CUSTOMERS.CUST_ID C2, SUM("SH"."SALES"."AMOUNT_SOLD") M1, COUNT("SH"."SALES"."AMOUNT_SOLD") M2, COUNT(*) M3 FROM SH.SALES, SH.CUSTOMERS WHERE SH.CUSTOMERS.CUST_ID = SH.SALES.CUST_ID AND (SH.SALES.CUST_ID IN (1012, 1010, 1005)) GROUP BY SH.SALES.PROD_ID, SH.CUSTOMERS.CUST_ID; CREATE MATERIALIZED VIEW SH.CUST_MV$SUB2 REFRESH FAST WITH ROWID ON COMMIT ENABLE QUERY REWRITE AS SELECT SH.SALES.PROD_ID C1, SH.CUSTOMERS.CUST_ID C2, SH.COUNTRIES.COUNTRY_NAME C3, SUM("SH"."SALES"."AMOUNT_SOLD") M1, COUNT("SH"."SALES". "AMOUNT_SOLD") M2, COUNT(*) M3 FROM SH.SALES, SH.CUSTOMERS, SH.COUNTRIES WHERE SH.CUSTOMERS.CUST_ID = SH.SALES.CUST_ID AND SH.COUNTRIES.COUNTRY_ID = SH.CUSTOMERS.COUNTRY_ID AND (SH.COUNTRIES.COUNTRY_NAME IN ('USA', 'Canada')) GROUP BY SH.SALES.PROD_ID, SH.CUSTOMERS.CUST_ID, SH.COUNTRIES.COUNTRY_NAME; CREATE MATERIALIZED VIEW SH.CUST_MV REFRESH FORCE WITH ROWID ENABLE QUERY REWRITE AS (SELECT "CUST_MV$SUB2"."C1" "PROD_ID","CUST_MV$SUB2"."C2" "CUST_ID",SUM("CUST_MV$SUB2"."M3") "CNT",SUM("CUST_MV$SUB2"."M1") "SUM_AMOUNT" FROM "SH"."CUST_MV$SUB2" "CUST_MV$SUB2" GROUP BY "CUST_MV$SUB2"."C1","CUST_MV$SUB2"."C2")UNION (SELECT "CUST_MV$SUB1"."C1" "PROD_ID","CUST_MV$SUB1"."C2" "CUST_ID",SUM("CUST_MV$SUB1"."M3") "CNT",SUM("CUST_MV$SUB1"."M1") "SUM_AMOUNT" FROM "SH"."CUST_MV$SUB1" "CUST_MV$SUB1" GROUP BY "CUST_MV$SUB1"."C1","CUST_MV$SUB1"."C2"); BEGIN
DBMS_ADVANCED_REWRITE.BUILD_SAFE_REWRITE_EQUIVALENCE ('SH.CUST_MV$RWEQ', 'SELECT s.prod_id, s.cust_id, COUNT(*) cnt, SUM(s.amount_sold) sum_amount FROM sales s, customers cs, countries cn WHERE s.cust_id = cs.cust_id AND cs.country_id = cn.country_id AND cn.country_name IN (''USA'',''Canada'') GROUP BY s.prod_id, s.cust_id UNION SELECT s.prod_id, s.cust_id, COUNT(*) cnt, SUM(s.amount_sold) sum_amount FROM sales s, customers cs WHERE s.cust_id = cs.cust_id AND s.cust_id IN (1005,1010,1012) GROUP BY s.prod_id, s.cust_id', '(SELECT "CUST_MV$SUB2"."C3" "PROD_ID","CUST_MV$SUB2"."C2" "CUST_ID", SUM("CUST_MV$SUB2"."M3") "CNT", SUM("CUST_MV$SUB2"."M1") "SUM_AMOUNT" FROM "SH"."CUST_MV$SUB2" "CUST_MV$SUB2" GROUP BY "CUST_MV$SUB2"."C3","CUST_MV$SUB2"."C2") UNION (SELECT "CUST_MV$SUB1"."C2" "PROD_ID","CUST_MV$SUB1"."C1" "CUST_ID", "CUST_MV$SUB1"."M3" "CNT","CUST_MV$SUB1"."M1" "SUM_AMOUNT" FROM "SH"."CUST_MV$SUB1" "CUST_MV$SUB1")',-1553577441) END; /;
The original defining query of cust_mv has been decomposed into two submaterialized views, seen as cust_mv$SUB1 and cust_mv$SUB2. One additional column, COUNT(amount_sold), has been added in cust_mv$SUB1 to make that materialized view fast refreshable. The original defining query of cust_mv has been modified to query the two submaterialized views instead, where both submaterialized views are fast refreshable and support general query rewrite. The required materialized view logs are added to enable fast refresh of the submaterialized views. Note that for each detail table, two materialized view log statements are generated: one is a CREATE MATERIALIZED VIEW LOG statement and
the other is an ALTER MATERIALIZED VIEW LOG FORCE statement. This is to ensure that the CREATE script can be run multiple times. The BUILD_SAFE_REWRITE_EQUIVALENCE statement connects the original defining query to the defining query of the new top-level materialized view. It ensures that query rewrite will make use of the new top-level materialized view to answer the query.
Example 17-7 Access IMPLEMENTATION Output Through USER_TUNE_MVIEW View
SELECT * FROM USER_TUNE_MVIEW
WHERE TASK_NAME = 'cust_mv2' AND SCRIPT_TYPE = 'IMPLEMENTATION';

Example 17-8 Save IMPLEMENTATION Output in a Script File
The following statements save the IMPLEMENTATION output in a script file located at /myscript/mv_create2.sql:
CREATE DIRECTORY TUNE_RESULTS AS '/myscript';
GRANT READ, WRITE ON DIRECTORY TUNE_RESULTS TO PUBLIC;

EXECUTE DBMS_ADVISOR.CREATE_FILE(DBMS_ADVISOR.GET_TASK_SCRIPT('cust_mv2'), -
   'TUNE_RESULTS', 'mv_create2.sql');
EXECUTE :task_cust_mv := 'cust_mv3'; EXECUTE :create_mv_ddl := 'CREATE MATERIALIZED VIEW cust_mv REFRESH FAST ON DEMAND ENABLE QUERY REWRITE AS SELECT s.prod_id, s.cust_id, COUNT(*) cnt, SUM(s.amount_sold) sum_amount FROM sales s, customers cs WHERE s.cust_id = cs.cust_id AND s.cust_id IN (2005,1020) -
GROUP BY s.prod_id, s.cust_id UNION SELECT s.prod_id, s.cust_id, COUNT(*) cnt, SUM(s.amount_sold) sum_amount FROM sales s, customers cs WHERE s.cust_id = cs.cust_id AND s.cust_id IN (1005,1010,1012) GROUP BY s.prod_id, s.cust_id'; EXECUTE DBMS_ADVISOR.TUNE_MVIEW(:task_cust_mv, :create_mv_ddl);
The materialized view defining query contains a UNION set operator, so the materialized view itself is not fast refreshable. However, the two subselect queries in the materialized view defining query can be combined into one single query. The projected output for the CREATE statement will be created with an optimized submaterialized view combining the two subselect queries, and the submaterialized view is referenced by a new top-level materialized view as follows:
CREATE MATERIALIZED VIEW LOG ON "SH"."SALES" WITH ROWID, SEQUENCE ("PROD_ID","CUST_ID","AMOUNT_SOLD") INCLUDING NEW VALUES ALTER MATERIALIZED VIEW LOG FORCE ON "SH"."SALES" ADD ROWID, SEQUENCE ("PROD_ID","CUST_ID","AMOUNT_SOLD") INCLUDING NEW VALUES CREATE MATERIALIZED VIEW LOG ON "SH"."CUSTOMERS" WITH ROWID, SEQUENCE ("CUST_ID") INCLUDING NEW VALUES ALTER MATERIALIZED VIEW LOG FORCE ON "SH"."CUSTOMERS" ADD ROWID, SEQUENCE ("CUST_ID") INCLUDING NEW VALUES CREATE MATERIALIZED VIEW SH.CUST_MV$SUB1 REFRESH FAST WITH ROWID ENABLE QUERY REWRITE AS SELECT SH.SALES.CUST_ID C1, SH.SALES.PROD_ID C2, SUM("SH"."SALES"."AMOUNT_SOLD") M1, COUNT("SH"."SALES"."AMOUNT_SOLD")M2, COUNT(*) M3 FROM SH.CUSTOMERS, SH.SALES WHERE SH.SALES.CUST_ID = SH.CUSTOMERS.CUST_ID AND (SH.SALES.CUST_ID IN (2005, 1020, 1012, 1010, 1005)) GROUP BY SH.SALES.CUST_ID, SH.SALES.PROD_ID CREATE MATERIALIZED VIEW SH.CUST_MV REFRESH FORCE WITH ROWID ENABLE QUERY REWRITE AS (SELECT "CUST_MV$SUB1"."C2" "PROD_ID","CUST_MV$SUB1"."C1" "CUST_ID", "CUST_MV$SUB1"."M3" "CNT","CUST_MV$SUB1"."M1" "SUM_AMOUNT" FROM "SH"."CUST_MV$SUB1" "CUST_MV$SUB1" WHERE "CUST_MV$SUB1"."C1"=2005 OR "CUST_MV$SUB1"."C1"=1020) UNION (SELECT "CUST_MV$SUB1"."C2" "PROD_ID","CUST_MV$SUB1"."C1" "CUST_ID", "CUST_MV$SUB1"."M3" "CNT","CUST_MV$SUB1"."M1" "SUM_AMOUNT"
FROM "SH"."CUST_MV$SUB1" "CUST_MV$SUB1" WHERE "CUST_MV$SUB1"."C1"=1012 OR "CUST_MV$SUB1"."C1"=1010 OR "CUST_MV$SUB1"."C1"=1005) DBMS_ADVANCED_REWRITE.BUILD_SAFE_REWRITE_EQUIVALENCE ('SH.CUST_MV$RWEQ', 'SELECT s.prod_id, s.cust_id, COUNT(*) cnt, SUM(s.amount_sold) sum_amount FROM sales s, customers cs WHERE s.cust_id = cs.cust_id AND s.cust_id in (2005,1020) GROUP BY s.prod_id, s.cust_id UNION SELECT s.prod_id, s.cust_id, COUNT(*) cnt, SUM(s.amount_sold) sum_amount FROM sales s, customers cs WHERE s.cust_id = cs.cust_id AND s.cust_id IN (1005,1010,1012) GROUP BY s.prod_id, s.cust_id', '(SELECT "CUST_MV$SUB1"."C2" "PROD_ID", "CUST_MV$SUB1"."C1" "CUST_ID", "CUST_MV$SUB1"."M3" "CNT","CUST_MV$SUB1"."M1" "SUM_AMOUNT" FROM "SH"."CUST_MV$SUB1" "CUST_MV$SUB1" WHERE "CUST_MV$SUB1"."C1"=2005OR "CUST_MV$SUB1"."C1"=1020) UNION (SELECT "CUST_MV$SUB1"."C2" "PROD_ID", "CUST_MV$SUB1"."C1" "CUST_ID", "CUST_MV$SUB1"."M3" "CNT","CUST_MV$SUB1"."M1" "SUM_AMOUNT" FROM "SH"."CUST_MV$SUB1" "CUST_MV$SUB1" WHERE "CUST_MV$SUB1"."C1"=1012 OR "CUST_MV$SUB1"."C1"=1010 OR "CUST_MV$SUB1"."C1"=1005)', 1811223110)
The original defining query of cust_mv has been optimized by combining the predicates of the two subselect queries in the submaterialized view CUST_MV$SUB1. The required materialized view logs are also added to enable fast refresh of the submaterialized views.
Example 17-10 Access IMPLEMENTATION Output Through the USER_TUNE_MVIEW View
The following query accesses the IMPLEMENTATION output through USER_TUNE_MVIEW:
SELECT * FROM USER_TUNE_MVIEW
WHERE TASK_NAME = 'cust_mv3' AND SCRIPT_TYPE = 'IMPLEMENTATION';

Example 17-11 Save IMPLEMENTATION Output in a Script File
The following statements save the IMPLEMENTATION output in a script file located at /myscript/mv_create3.sql:
CREATE DIRECTORY TUNE_RESULTS AS '/myscript';
GRANT READ, WRITE ON DIRECTORY TUNE_RESULTS TO PUBLIC;

EXECUTE DBMS_ADVISOR.CREATE_FILE(DBMS_ADVISOR.GET_TASK_SCRIPT('cust_mv3'), -
   'TUNE_RESULTS', 'mv_create3.sql');
Part V
Data Warehouse Performance
This section deals with ways to improve your data warehouse's performance, and contains the following chapters:
- Chapter 18, "Query Rewrite"
- Chapter 19, "Schema Modeling Techniques"
- Chapter 20, "SQL for Aggregation in Data Warehouses"
- Chapter 21, "SQL for Analysis and Reporting"
- Chapter 22, "SQL for Modeling"
- Chapter 23, "OLAP and Data Mining"
- Chapter 24, "Using Parallel Execution"
18
Query Rewrite
This chapter discusses query rewrite in Oracle, and contains:
- Overview of Query Rewrite
- Enabling Query Rewrite
- How Oracle Rewrites Queries
- Did Query Rewrite Occur?
- Design Considerations for Improving Query Rewrite Capabilities
- Advanced Rewrite Using Equivalences
It also operates on subqueries in the set operators UNION, UNION ALL, INTERSECT, and MINUS, and subqueries in DML statements such as INSERT, DELETE, and UPDATE. Several factors affect whether or not a given query is rewritten to use one or more materialized views:
- Enabling or disabling query rewrite:
  - By the CREATE or ALTER statement for individual materialized views
  - By the initialization parameter QUERY_REWRITE_ENABLED
  - By the REWRITE and NOREWRITE hints in SQL statements
The DBMS_MVIEW.EXPLAIN_REWRITE procedure advises whether query rewrite is possible on a query and, if so, which materialized views will be used. It also explains why a query cannot be rewritten.
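As an illustration, a hedged sketch of calling EXPLAIN_REWRITE for a single query; it assumes the REWRITE_TABLE output table has been created in the current schema (for example, by running the utlxrw.sql script), uses the materialized view defined later in this chapter, and uses a hypothetical statement identifier. Parameter names follow the DBMS_MVIEW documentation and should be verified against your release:

EXECUTE DBMS_MVIEW.EXPLAIN_REWRITE( -
   query => 'SELECT p.prod_subcategory, SUM(s.amount_sold) FROM sales s, products p WHERE s.prod_id = p.prod_id GROUP BY p.prod_subcategory', -
   mv => 'SUM_SALES_PSCAT_MONTH_CITY_MV', -
   statement_id => 'STMT1');

SELECT message FROM rewrite_table WHERE statement_id = 'STMT1';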
Cost-Based Rewrite
Query rewrite is available with cost-based optimization. Oracle Database optimizes the input query with and without rewrite and selects the least costly alternative. The optimizer rewrites a query by rewriting one or more query blocks, one at a time. If query rewrite has a choice between several materialized views to rewrite a query block, it will select the ones which can result in reading in the least amount of data. After a materialized view has been selected for a rewrite, the optimizer then tests whether the rewritten query can be rewritten further with other materialized views. This process continues until no further rewrites are possible. Then the rewritten query is optimized and the original query is optimized. The optimizer compares these two optimizations and selects the least costly alternative. Because optimization is based on cost, it is important to collect statistics both on tables involved in the query and on the tables representing materialized views. Statistics are fundamental measures, such as the number of rows in a table, that are used to calculate the cost of a rewritten query. They are created by using the DBMS_STATS package. Queries that contain inline or named views are also candidates for query rewrite. When a query contains a named view, the view name is used to do the matching between a materialized view and the query. When a query contains an inline view, the inline view can be merged into the query before matching between a materialized view and the query occurs. In addition, if the inline view's text definition exactly matches that of an inline view present in any eligible materialized view, general rewrite may be possible. This is because, whenever a materialized view contains exactly identical inline view text to the one present in a query, query rewrite treats such an inline view as a named view or a table. Figure 18-1 presents a graphical view of the cost-based approach used during the rewrite process.
Figure 18-1 (figure not reproduced here) illustrates the cost-based rewrite process: the user's SQL is optimized with and without rewrite, a plan is generated for each alternative, and the less costly plan is executed.
- Query rewrite must be enabled for the session.
- A materialized view must be enabled for query rewrite.
- The rewrite integrity level should allow the use of the materialized view. For example, if a materialized view is not fresh and query rewrite integrity is set to ENFORCED, then the materialized view is not used.
- Either all or part of the results requested by the query must be obtainable from the precomputed result stored in the materialized view or views.
To determine this, the optimizer may depend on some of the data relationships declared by the user using constraints and dimensions. Such data relationships include hierarchies, referential integrity, and uniqueness of key data, and so on.
1. Individual materialized views must have the ENABLE QUERY REWRITE clause.
2. The initialization parameter QUERY_REWRITE_ENABLED must be set to TRUE (this is the default).
3. Cost-based optimization must be used, either by setting the initialization parameter OPTIMIZER_MODE to ALL_ROWS or FIRST_ROWS, or by analyzing the tables and setting OPTIMIZER_MODE to CHOOSE.
If step 1 has not been completed, a materialized view will never be eligible for query rewrite. You can specify ENABLE QUERY REWRITE either with the ALTER MATERIALIZED VIEW statement or when the materialized view is created, as illustrated in the following:
CREATE MATERIALIZED VIEW join_sales_time_product_mv ENABLE QUERY REWRITE AS SELECT p.prod_id, p.prod_name, t.time_id, t.week_ending_day, s.channel_id, s.promo_id, s.cust_id, s.amount_sold FROM sales s, products p, times t WHERE s.time_id=t.time_id AND s.prod_id = p.prod_id;
The NOREWRITE hint disables query rewrite in a SQL statement, overriding the QUERY_REWRITE_ENABLED parameter, and the REWRITE hint (when used with mv_name) restricts the eligible materialized views to those named in the hint. You can use TUNE_MVIEW to optimize a CREATE MATERIALIZED VIEW statement to enable general QUERY REWRITE. This procedure is described in "Tuning Materialized Views for Fast Refresh and Query Rewrite" on page 17-47.
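For illustration, a hedged sketch of the hint syntax; the queries themselves are illustrative, and join_sales_time_product_mv is the materialized view created in the preceding example:

SELECT /*+ REWRITE(join_sales_time_product_mv) */ p.prod_name, t.week_ending_day, s.amount_sold
FROM sales s, products p, times t
WHERE s.time_id = t.time_id AND s.prod_id = p.prod_id;

SELECT /*+ NOREWRITE */ p.prod_name, t.week_ending_day, s.amount_sold
FROM sales s, products p, times t
WHERE s.time_id = t.time_id AND s.prod_id = p.prod_id;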
OPTIMIZER_MODE = ALL_ROWS, FIRST_ROWS, or CHOOSE
With OPTIMIZER_MODE set to CHOOSE, a query will not be rewritten unless at least one table referenced by it has been analyzed. This is because the rule-based optimizer is used when OPTIMIZER_MODE is set to CHOOSE and none of the tables referenced in a query have been analyzed.
QUERY_REWRITE_ENABLED = TRUE
This option enables the query rewrite feature of the optimizer, enabling the optimizer to utilize materialized views to enhance performance. If set to FALSE, this option disables the query rewrite feature of the optimizer.
QUERY_REWRITE_ENABLED = FORCE
This option enables the query rewrite feature of the optimizer and directs the optimizer to rewrite queries using materialized views even when the estimated query cost of the unrewritten query is lower.
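The parameter can also be changed dynamically; for example, at the session level:

ALTER SESSION SET QUERY_REWRITE_ENABLED = TRUE;
ALTER SESSION SET QUERY_REWRITE_ENABLED = FORCE;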
QUERY_REWRITE_INTEGRITY
This parameter is optional, but must be set to STALE_TOLERATED, TRUSTED, or ENFORCED if it is specified (see "Accuracy of Query Rewrite" on page 18-7). It defaults to ENFORCED if it is undefined. By default, the integrity level is set to ENFORCED. In this mode, all constraints must be validated. Therefore, if you use ENABLE NOVALIDATE, certain types of query rewrite might not work. To enable query rewrite in this environment (where constraints have not been validated), you should set the integrity level to a lower level of granularity, such as TRUSTED or STALE_TOLERATED.
You can set the level of query rewrite for a session, thus allowing different users to work at different integrity levels. The possible statements are:
ALTER SESSION SET QUERY_REWRITE_INTEGRITY = STALE_TOLERATED; ALTER SESSION SET QUERY_REWRITE_INTEGRITY = TRUSTED; ALTER SESSION SET QUERY_REWRITE_INTEGRITY = ENFORCED;
ENFORCED
This is the default mode. The optimizer only uses fresh data from the materialized views and only uses those relationships that are based on ENABLED VALIDATED primary, unique, or foreign key constraints.
TRUSTED
In TRUSTED mode, the optimizer trusts that the data in the materialized views is fresh and the relationships declared in dimensions and RELY constraints are correct. In this mode, the optimizer also uses prebuilt materialized views or materialized views based on views, and it uses relationships that are not enforced as well as those that are enforced. In this mode, the optimizer also trusts declared but not ENABLED VALIDATED primary or unique key constraints and data relationships specified using dimensions.
STALE_TOLERATED
In STALE_TOLERATED mode, the optimizer uses materialized views that are valid but contain stale data as well as those that contain fresh data. This mode offers the maximum rewrite capability but creates the risk of generating inaccurate results.
If rewrite integrity is set to the safest level, ENFORCED, the optimizer uses only enforced primary key constraints and referential integrity constraints to ensure that the results of the query are the same as the results when accessing the detail tables directly. If the rewrite integrity is set to levels other than ENFORCED, there are several situations where the output with rewrite can be different from that without it:
- A materialized view can be out of synchronization with the master copy of the data. This generally happens because the materialized view refresh procedure is pending following bulk load or DML operations to one or more detail tables of a materialized view. At some data warehouse sites, this situation is desirable because it is not uncommon for some materialized views to be refreshed at certain time intervals.
- The relationships implied by the dimension objects are invalid. For example, values at a certain level in a hierarchy do not roll up to exactly one parent value.
- The values stored in a prebuilt materialized view table might be incorrect.
- A wrong answer can occur because of bad data relationships defined by unenforced table or view constraints.
Note that the scope of a rewrite hint is a query block. If a SQL statement consists of several query blocks (SELECT clauses), you need to specify a rewrite hint on each query block to control the rewrite for the entire statement. Using the REWRITE_OR_ERROR hint in a query causes the following error if the query failed to rewrite:
ORA-30393: a query block in the statement did not rewrite
For example, the following query issues the ORA-30393 error when there are no suitable materialized views for query rewrite to use:
SELECT /*+ REWRITE_OR_ERROR */ p.prod_subcategory, SUM(s.amount_sold) FROM sales s, products p WHERE s.prod_id = p.prod_id GROUP BY p.prod_subcategory;
ENABLE QUERY REWRITE AS SELECT p.prod_id, t.week_ending_day, s.cust_id, SUM(s.amount_sold) AS sum_amount_sold FROM sales s, products p, times t WHERE s.time_id=t.time_id AND s.prod_id=p.prod_id GROUP BY p.prod_id, t.week_ending_day, s.cust_id; CREATE MATERIALIZED VIEW sum_sales_pscat_month_city_mv ENABLE QUERY REWRITE AS SELECT p.prod_subcategory, t.calendar_month_desc, c.cust_city, SUM(s.amount_sold) AS sum_amount_sold, COUNT(s.amount_sold) AS count_amount_sold FROM sales s, products p, times t, customers c WHERE s.time_id=t.time_id AND s.prod_id=p.prod_id AND s.cust_id=c.cust_id GROUP BY p.prod_subcategory, t.calendar_month_desc, c.cust_city;
Although it is not a requirement, it is recommended that you collect statistics on the materialized views so that the optimizer can determine whether to rewrite the queries. You can do this either on a per-object basis or for all newly created objects without statistics. The following is an example on a per-object basis, shown for join_sales_time_product_mv:
EXECUTE DBMS_STATS.GATHER_TABLE_STATS ('SH', 'JOIN_SALES_TIME_PRODUCT_MV', -
   estimate_percent => 20, block_sample => TRUE, cascade => TRUE);
The following illustrates a statistics collection for all newly created objects without statistics:
EXECUTE DBMS_STATS.GATHER_SCHEMA_STATS ('SH', options => 'GATHER EMPTY', -
   block_sample => TRUE, cascade => TRUE);
- Joins present in the materialized view are present in the SQL.
- There is sufficient data in the materialized view or views to answer the query.
After that, it must determine how it will rewrite the query. The simplest case occurs when the result stored in a materialized view exactly matches what is requested by a query. The optimizer makes this type of determination by comparing the text of the query with the text of the materialized view definition. This text match method is most straightforward but the number of queries eligible for this type of query rewrite is minimal. When the text comparison test fails, the optimizer performs a series of generalized checks based on the joins, selections, grouping, aggregates, and column data fetched. This is accomplished by individually comparing various clauses (SELECT, FROM, WHERE, HAVING, or GROUP BY) of a query with those of a materialized view. There are many different types of query rewrite that are possible and they can be categorized into the following areas:
- Join Back
- Rollup Using a Dimension
- Compute Aggregates
- Filtering the Data
is, the entire SELECT expression), ignoring the white space during text comparison. Given the following query:
SELECT p.prod_subcategory, t.calendar_month_desc, c.cust_city, SUM(s.amount_sold) AS sum_amount_sold, COUNT(s.amount_sold) AS count_amount_sold FROM sales s, products p, times t, customers c WHERE s.time_id=t.time_id AND s.prod_id=p.prod_id AND s.cust_id=c.cust_id GROUP BY p.prod_subcategory, t.calendar_month_desc, c.cust_city;
This query matches sum_sales_pscat_month_city_mv (white space excluded) and is rewritten as:
SELECT prod_subcategory, calendar_month_desc, cust_city, sum_amount_sold, count_amount_sold FROM sum_sales_pscat_month_city_mv;
When full text match fails, the optimizer then attempts a partial text match. In this method, the text starting from the FROM clause of a query is compared against the text starting with the FROM clause of a materialized view denition. Therefore, the following query can be rewritten:
SELECT p.prod_subcategory, t.calendar_month_desc, c.cust_city, AVG(s.amount_sold) FROM sales s, products p, times t, customers c WHERE s.time_id=t.time_id AND s.prod_id=p.prod_id AND s.cust_id=c.cust_id GROUP BY p.prod_subcategory, t.calendar_month_desc, c.cust_city;
Note that, under the partial text match rewrite method, the average of sales aggregate required by the query is computed using the sum of sales and count of sales aggregates stored in the materialized view. When neither text match succeeds, the optimizer uses a general query rewrite method.
Table 18-1 summarizes the rewrite checks performed during general query rewrite: join back, rollup using a dimension, aggregate rollup, and compute aggregates.
Join Back
If some column data requested by a query cannot be obtained from a materialized view, the optimizer further determines if it can be obtained based on a data relationship called a functional dependency. When the data in a column can determine data in another column, such a relationship is called a functional dependency or functional determinance. For example, if a table contains a primary key column called prod_id and another column called prod_name, then, given a prod_id value, it is possible to look up the corresponding prod_name. The opposite is not true, which means a prod_name value need not relate to a unique prod_id. When the column data required by a query is not available from a materialized view, such column data can still be obtained by joining the materialized view back to the table that contains required column data provided the materialized view contains a key that functionally determines the required column data. For example, consider the following query:
SELECT p.prod_category, t.week_ending_day, SUM(s.amount_sold) FROM sales s, products p, times t WHERE s.time_id=t.time_id AND s.prod_id=p.prod_id AND p.prod_category='CD' GROUP BY p.prod_category, t.week_ending_day;
The materialized view sum_sales_prod_week_mv contains p.prod_id, but not p.prod_category. However, you can join sum_sales_prod_week_mv back to products to retrieve prod_category because prod_id functionally determines prod_category. The optimizer rewrites this query using sum_sales_prod_week_mv as follows:
SELECT p.prod_category, mv.week_ending_day, SUM(mv.sum_amount_sold) FROM sum_sales_prod_week_mv mv, products p WHERE mv.prod_id=p.prod_id AND p.prod_category='Photo' GROUP BY p.prod_category, mv.week_ending_day;
Here the products table is called a joinback table because it was originally joined in the materialized view but joined again in the rewritten query. You can declare functional dependency in two ways:
- Using the primary key constraint (as shown in the previous example)
- Using the DETERMINES clause of a dimension
The DETERMINES clause of a dimension definition might be the only way you could declare functional dependency when the column that determines another column cannot be a primary key. For example, the products table is a denormalized dimension table that has columns prod_id, prod_name, and prod_subcategory that functionally determines prod_subcat_desc, and prod_category that determines prod_cat_desc. The first functional dependency can be established by declaring prod_id as the primary key, but not the second functional dependency, because the prod_subcategory column contains duplicate values. In this situation, you can use the DETERMINES clause of a dimension to declare the second functional dependency. The following dimension definition illustrates how functional dependencies are declared:
CREATE DIMENSION products_dim LEVEL product IS (products.prod_id) LEVEL subcategory IS (products.prod_subcategory) LEVEL category IS (products.prod_category) HIERARCHY prod_rollup ( product CHILD OF subcategory CHILD OF category ) ATTRIBUTE product DETERMINES products.prod_name ATTRIBUTE product DETERMINES products.prod_desc ATTRIBUTE subcategory DETERMINES products.prod_subcategory_desc ATTRIBUTE category DETERMINES products.prod_category_desc;
The hierarchy prod_rollup declares hierarchical relationships that are also 1:n functional dependencies. The 1:1 functional dependencies are declared using the DETERMINES clause, as seen when prod_subcategory functionally determines prod_subcat_desc. Consider the following query:
SELECT p.prod_subcategory_desc, t.week_ending_day, SUM(s.amount_sold) FROM sales s, products p, times t
WHERE s.time_id=t.time_id AND s.prod_id=p.prod_id AND p.prod_subcategory_desc LIKE '%Audio' GROUP BY p.prod_subcategory_desc, t.week_ending_day;
This can be rewritten by joining sum_sales_pscat_week_mv to the products table so that prod_subcat_desc is available to evaluate the predicate. However, the join will be based on the prod_subcategory column, which is not a primary key in the products table; therefore, it allows duplicates. This is accomplished by using an inline view that selects distinct values and this view is joined to the materialized view as shown in the rewritten query.
SELECT iv.prod_subcat_desc, mv.week_ending_day, SUM(mv.sum_amount_sold) FROM sum_sales_pscat_week_mv mv, (SELECT DISTINCT prod_subcategory, prod_subcategory_desc FROM products) iv WHERE mv.prod_subcategory=iv.prod_subcategory AND iv.prod_subcategory_desc LIKE '%Men' GROUP BY iv.prod_subcategory_desc, mv.week_ending_day;
This type of rewrite is possible because prod_subcategory functionally determines prod_subcategory_desc as declared in the dimension.
Because prod_subcategory functionally determines prod_category, sum_sales_pscat_week_mv can be used with a joinback to products to retrieve prod_category column data, and then aggregates can be rolled up to prod_category level, as shown in the following:
SELECT pv.prod_category, mv.week_ending_day, SUM(mv.sum_amount_sold) FROM sum_sales_pscat_week_mv mv, (SELECT DISTINCT prod_subcategory, prod_category FROM products) pv WHERE mv.prod_subcategory= pv.prod_subcategory GROUP BY pv.prod_category, mv.week_ending_day;
Compute Aggregates
Query rewrite can also occur when the optimizer determines whether the aggregates requested by a query can be derived or computed from one or more aggregates stored in a materialized view. For example, if a query requests AVG(X) and a materialized view contains SUM(X) and COUNT(X), then AVG(X) can be computed as SUM(X)/COUNT(X). In addition, if it is determined that the rollup of aggregates stored in a materialized view is required, then, if it is possible, query rewrite also rolls up each aggregate requested by the query using aggregates in the materialized view. For example, SUM(sales) at the city level can be rolled up to SUM(sales) at the state level by summing all SUM(sales) aggregates in a group with the same state value. However, AVG(sales) cannot be rolled up to a coarser level unless COUNT(sales) is also available in the materialized view. Similarly, VARIANCE(sales) or STDDEV(sales) cannot be rolled up unless COUNT(sales) and SUM(sales) are also available in the materialized view. For example, consider the following statements and query:
ALTER TABLE times MODIFY CONSTRAINT time_pk RELY;
ALTER TABLE customers MODIFY CONSTRAINT customers_pk RELY;
ALTER TABLE sales MODIFY CONSTRAINT sales_time_pk RELY;
ALTER TABLE sales MODIFY CONSTRAINT sales_customer_fk RELY;

SELECT p.prod_subcategory, AVG(s.amount_sold) AS avg_sales
FROM sales s, products p
WHERE s.prod_id = p.prod_id
GROUP BY p.prod_subcategory;
This statement can be rewritten with materialized view sum_sales_pscat_month_city_mv provided the join between sales and times and sales and customers are lossless and non-duplicating. Further, the query groups by prod_subcategory whereas the materialized view groups by prod_subcategory, calendar_month_desc and cust_city, which means the aggregates stored in
the materialized view will have to be rolled up. The optimizer rewrites the query as the following:
SELECT mv.prod_subcategory, SUM(mv.sum_amount_sold)/COUNT(mv.count_amount_sold) AS avg_sales FROM sum_sales_pscat_month_city_mv mv GROUP BY mv.prod_subcategory;
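The definition of sum_sales_pscat_month_city_mv is not included in this excerpt; a sketch consistent with the rewrite above (an assumption for illustration, the actual definition may differ) is a materialized view that stores both SUM and COUNT of amount_sold so that the average can be rolled up:

CREATE MATERIALIZED VIEW sum_sales_pscat_month_city_mv ENABLE QUERY REWRITE AS
SELECT p.prod_subcategory, t.calendar_month_desc, c.cust_city,
  SUM(s.amount_sold) AS sum_amount_sold,
  COUNT(s.amount_sold) AS count_amount_sold
FROM sales s, products p, times t, customers c
WHERE s.prod_id = p.prod_id AND s.time_id = t.time_id AND s.cust_id = c.cust_id
GROUP BY p.prod_subcategory, t.calendar_month_desc, c.cust_city;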
The argument of an aggregate such as SUM can be an arithmetic expression such as A+B. The optimizer tries to match an aggregate SUM(A+B) in a query with an aggregate SUM(A+B) or SUM(B+A) stored in a materialized view. In other words, expression equivalence is used when matching the argument of an aggregate in a query with the argument of a similar aggregate in a materialized view. To accomplish this, Oracle converts the aggregate argument expression into a canonical form such that two different but equivalent expressions convert into the same canonical form. For example, A*(B-C), A*B-C*A, (B-C)*A, and -A*C+A*B all convert into the same canonical form and, therefore, they are successfully matched.
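For instance, consider a hypothetical materialized view (sum_net_sales_mv is an illustrative name, not part of the sh sample schema) that aggregates an arithmetic expression:

CREATE MATERIALIZED VIEW sum_net_sales_mv ENABLE QUERY REWRITE AS
SELECT s.prod_id, SUM(s.quantity_sold * (s.amount_sold - 5)) AS sum_net
FROM sales s
GROUP BY s.prod_id;

A query that requests SUM((s.amount_sold - 5) * s.quantity_sold) or SUM(s.quantity_sold * s.amount_sold - 5 * s.quantity_sold) grouped by prod_id converts to the same canonical form as the stored aggregate and can therefore be matched against sum_net.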
Query Rewrite Definitions

Before describing what is possible when query rewrite works with filtered data, the following definitions are useful:
join relop Is one of the following: (=, <, <=, >, >=)
selection relop Is one of the following: (=, <, <=, >, >=, !=, [NOT] BETWEEN | IN | LIKE | NULL)
join predicate Is of the form (column1 join relop column2), where columns are from different tables within the same FROM clause in the current query block. So, for example, an outer reference is not possible.
selection predicate Is of the form LHS-expression relop RHS-expression, where LHS means left-hand side and RHS means right-hand side. All non-join predicates are selection predicates. The left-hand side usually contains a column and the right-hand side contains the values. For example, color='red' means the left-hand side is color and the right-hand side is 'red' and the relational operator is (=).
LHS-constrained When comparing a selection from the query with a selection from the materialized view, if the left-hand side of both selections match, the selections are said to be LHS-constrained or just constrained for short.
RHS-constrained When comparing a selection from the query with a selection from the materialized view, if the right-hand side of both selections match, the selections are said to be RHS-constrained or just constrained. Note that before comparing the selections, the LHS/RHS-expression is converted to a canonical form and then the comparison is done. This means that expressions such as column1 + 5 and 5 + column1 will match and be constrained.
WHERE Clause Guidelines

Although query rewrite on filtered data does not restrict the general form of the WHERE clause, there is an optimal pattern and, normally, most queries fall into this pattern as follows:
(join predicate AND join predicate AND ....) AND (selection predicate AND|OR selection predicate .... )
If the WHERE clause has an OR at the top, then the optimizer first checks for common predicates under the OR. If found, the common predicates are factored out from under the OR, and then joined with an AND back to the OR. This helps to put the WHERE clause into the optimal pattern. This is done only if the OR occurs at the top of the WHERE clause. For example, if the WHERE clause is the following:
(sales.prod_id = prod.prod_id AND prod.prod_name = 'Kids Polo Shirt') OR (sales.prod_id = prod.prod_id AND prod.prod_name = 'Kids Shorts')
Then the common join predicate sales.prod_id = prod.prod_id is factored out and joined back with an AND to the OR of the remaining selections:

(sales.prod_id = prod.prod_id) AND (prod.prod_name = 'Kids Polo Shirt' OR prod.prod_name = 'Kids Shorts')

thus putting the WHERE clause into the optimal pattern.

Selection Categories

Selections are categorized into the following cases:
- Range: Range selections are of a form such as WHERE (cust_last_name BETWEEN 'abacrombe' AND 'anakin'). Note that simple selections with relational operators (<, <=, >, >=) are also considered range selections.
- IN-lists: Single and multi-column IN-lists such as WHERE (prod_id) IN (102, 233, ....). Note that selections of the form (column1='v1' OR column1='v2' OR column1='v3' OR ....) are treated as a group and classified as an IN-list.
- IS [NOT] NULL
- [NOT] LIKE
- Other: Other selections are those for which query rewrite cannot determine the boundaries for the data. For example, EXISTS.
When comparing a selection from the query with a selection from the materialized view, the left-hand sides of both selections are compared, and if they match, they are said to be LHS-constrained or constrained for short. If the selections are constrained, then the right-hand side values are checked for containment. That is, the RHS values of the query selection must be contained by the right-hand side values of the materialized view selection.

Examples of Query Rewrite Selection

Here are a number of examples showing how query rewrite can still occur when the data is being filtered.
Example 18-1 Single Value Selection
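The query and materialized view clauses for this example are not shown; an assumed pair consistent with the discussion that follows is a query containing:

WHERE prod_id = 102

and a materialized view containing:

WHERE prod_id BETWEEN 0 AND 200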
Then, the selections are constrained on prod_id and the right-hand side value of the query 102 is within the range of the materialized view, so query rewrite is possible.
Example 18-2 Bounded Range Selection
A selection can be a bounded range (a range with an upper and lower value). For example, if the query contains the following clause:
WHERE prod_id > 10 AND prod_id < 50
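Assume, for illustration (the actual clause is not shown here), that the materialized view contains a selection such as:

WHERE prod_id BETWEEN 0 AND 200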
Then, the selections are constrained on prod_id and the query range is within the materialized view range. In this example, notice that both query selections are constrained by the same materialized view selection.
Example 18-3 Selection with Expression
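The clauses for this example are not shown; assume, for illustration, a query containing:

WHERE (sales.amount_sold * .07) > 70

and a materialized view containing:

WHERE (sales.amount_sold * .07) > 50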
Then, the selections are constrained on (sales.amount_sold * .07) and the right-hand side value of the query is within the range of the materialized view, therefore query rewrite is possible. Complex selections such as this require that both the left-hand side and the right-hand side be matched (that is, the left-hand side and the right-hand side are constrained).
Example 18-4 Exact Match Selections
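The clauses for this example are not shown; assume, for illustration, that both the query and the materialized view contain the selection:

WHERE prod_id = 102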
If the left-hand side and the right-hand side are constrained and the selection relop is the same, then the selection can usually be dropped from the rewritten query. Otherwise, the selection must be kept to filter out extra data from the materialized view. If query rewrite can drop the selection from the rewritten query, then not all columns from the selection need to be in the materialized view, so more rewrites can be done. This ensures that the materialized view data is not more restrictive than the query.
Example 18-5 More Selection in the Query
Selections in the query do not have to be constrained by any selections in the materialized view but, if they are, then the right-hand side values must be contained by the materialized view. For example, if the query contains the following clause:
WHERE prod_name = 'Shorts' AND prod_category = 'Men'
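Assume, for illustration (the actual clause is not shown here), that the materialized view selects prod_name and contains the selection:

WHERE prod_category = 'Men'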
Then, in this example, only the selection with prod_category is constrained. The query has an extra selection that is not constrained, but this is acceptable because if
the materialized view selects prod_name or selects a column that can be joined back to the detail table to get prod_name, then query rewrite is possible.
Example 18-6 No Rewrite Because of Fewer Selections in the Query
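The clauses for this example are not shown; assume, for illustration, a query containing:

WHERE prod_category = 'Men'

and a materialized view containing:

WHERE prod_name = 'Shorts' AND prod_category = 'Men'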
Then, the materialized view selection with prod_name is not constrained. The materialized view is more restrictive than the query because it only contains the product Shorts; therefore, query rewrite will not occur.
Example 18-7 Multi-Column IN-List Selections
Query rewrite also checks for cases where the query has a multi-column IN-list where the columns are fully constrained by individual columns from the materialized view single column IN-lists. For example, if the query contains the following:
WHERE (prod_id, cust_id) IN ((1022, 1000), (1033, 2000))
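Assume, for illustration (the actual clause is not shown here), that the materialized view contains single-column IN-lists such as:

WHERE prod_id IN (1022, 1033, 1044) AND cust_id IN (1000, 2000, 3000)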
Then, the materialized view IN-lists are constrained by the columns in the query multi-column IN-list. Furthermore, the right-hand side values of the query selection are contained by the materialized view so that rewrite will occur.
Example 18-8 Selections Using IN-Lists
Selection compatibility also checks for cases where the materialized view has a multi-column IN-list where the columns are fully constrained by individual columns or columns from IN-lists in the query. For example, if the query contains the following:
WHERE prod_id = 1022 AND cust_id IN (1000, 2000)
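Assume, for illustration (the actual clause is not shown here), that the materialized view contains a multi-column IN-list such as:

WHERE (prod_id, cust_id) IN ((1022, 1000), (1022, 2000))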
Then, the materialized view IN-list columns are fully constrained by the columns in the query selections. Furthermore, the right-hand side values of the query selection are contained by the materialized view. So rewrite succeeds.
Example 18-9 Multiple Selections and Disjuncts
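The clauses for this example are not shown; assume, for illustration, a query containing the single disjunct:

WHERE prod_id = 30 AND cust_id = 100

and a materialized view containing the two disjuncts:

WHERE (prod_id < 10 AND cust_id < 10) OR (prod_id < 100 AND cust_id < 500)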
Then, the query has a single disjunct (group of selections separated by AND) and the materialized view has two disjuncts separated by OR. The query disjunct is contained by the second materialized view disjunct so selection compatibility succeeds. It is clear that the materialized view contains more data than needed by the query so the query can be rewritten.
The following query could be rewritten to use cal_month_sales_id_mv because the query asks for the amount where the cust_id is 10 and this is contained in the materialized view.
SELECT t.calendar_month_desc, SUM(s.amount_sold) AS dollars FROM times t, sales s WHERE s.time_id = t.time_id AND s.cust_id = 10 GROUP BY t.calendar_month_desc;
Because the predicate s.cust_id = 10 selects the same data in the query and in the materialized view, it is dropped from the rewritten query. This means the rewritten query is the following:
SELECT mv.calendar_month_desc, mv.dollars FROM cal_month_sales_id_mv mv;
FROM times t, sales s WHERE s.time_id = t.time_id AND t.time_id = TO_DATE('14-FEB-1999', 'DD-MON-YYYY');
You can also use expressions in selection predicates. This process resembles the following:
expression relational operator constant
Where expression can be any arbitrary arithmetic expression allowed by the Oracle Database. The expression in the materialized view and the query must match. Oracle attempts to discern expressions that are logically equivalent, such as A+B and B+A, and will always recognize identical expressions as being equivalent.
You can also use queries with an expression on both sides of the operator or user-defined functions as operators. Query rewrite occurs when the complex predicate in the materialized view and the query are logically equivalent. This means that, unlike exact text match, terms could be in a different order and rewrite can still occur, as long as the expressions are equivalent. In addition, selection predicates can be joined with an AND operator in a query and the query can still be rewritten to use a materialized view, as long as every restriction on the data selected by the query is matched by a restriction in the definition of the materialized view. Again, this does not mean an exact text match, but that the restrictions on the data selected must be a logical match. Also, the query may be more restrictive in its selection of data and still be eligible, but it can never be less restrictive than the definition of the materialized view and still be eligible for rewrite. For example, given the preceding materialized view definition, a query such as the following can be rewritten:
SELECT p.promo_name, SUM(s.amount_sold)
FROM promotions p, sales s
WHERE s.promo_id = p.promo_id AND promo_name = 'coupon'
GROUP BY promo_name
HAVING SUM(s.amount_sold) > 1000;
In this case, the query is more restrictive than the definition of the materialized view, so rewrite can occur. However, if the query had selected promo_category, then it could not have been rewritten against the materialized view, because the materialized view definition does not contain that column. For another example, if the definition of a materialized view restricts a city name column to Boston, then a query that selects Seattle as a value for this column can never be rewritten with that materialized view, but a query that restricts city name to Boston and restricts a column value that is not restricted in the materialized view could be rewritten to use the materialized view. All the rules noted previously also apply when predicates are combined with an OR operator. The simple predicates, or simple predicates connected by OR operators, are considered separately. Each predicate in the query must be contained in the materialized view if rewrite is to occur. For example, the query could have a restriction such as city='Boston' OR city='Seattle' and, to be eligible for rewrite, the materialized view that the query might be rewritten against must have
the same restriction. In fact, the materialized view could have additional restrictions, such as city='Boston' OR city='Seattle' OR city='Cleveland' and rewrite might still be possible. Note, however, that the reverse is not true. If the query had the restriction city = 'Boston' OR city='Seattle' OR city='Cleveland' and the materialized view only had the restriction city='Boston' OR city='Seattle', then rewrite would not be possible because, with a single materialized view, the query seeks more data than is contained in the restricted subset of data stored in the materialized view.
- Join Compatibility Check
- Data Sufficiency Check
- Grouping Compatibility Check
- Aggregate Computability Check
- Common joins that occur in both the query and the materialized view. These joins form the common subgraph.
- Delta joins that occur in the query but not in the materialized view. These joins form the query delta subgraph.
- Delta joins that occur in the materialized view but not in the query. These joins form the materialized view delta subgraph.
Figure 18-2 [figure showing the common, query delta, and materialized view delta join subgraphs]
Common Joins

The common join pairs between the two must be of the same type, or the join in the query must be derivable from the join in the materialized view. For example, if a materialized view contains an outer join of table A with table B, and a query contains an inner join of table A with table B, the result of the inner join can be derived by filtering the antijoin rows from the result of the outer join. For example, consider the following query:
SELECT p.prod_name, t.week_ending_day, SUM(amount_sold)
FROM sales s, products p, times t
WHERE s.time_id = t.time_id AND s.prod_id = p.prod_id
  AND t.week_ending_day BETWEEN TO_DATE('01-AUG-1999', 'DD-MON-YYYY')
      AND TO_DATE('10-AUG-1999', 'DD-MON-YYYY')
GROUP BY prod_name, week_ending_day;
The common joins between this query and the materialized view join_sales_time_product_mv are:
s.time_id = t.time_id AND s.prod_id = p.prod_id
In this case, the query can be rewritten in terms of join_sales_time_product_mv as follows:

SELECT prod_name, week_ending_day, SUM(amount_sold)
FROM join_sales_time_product_mv
WHERE week_ending_day BETWEEN TO_DATE('01-AUG-1999','DD-MON-YYYY')
  AND TO_DATE('10-AUG-1999','DD-MON-YYYY')
GROUP BY prod_name, week_ending_day;
The query could also be answered using the join_sales_time_product_oj_mv materialized view, where inner joins in the query can be derived from outer joins in the materialized view. The rewritten version will (transparently to the user) filter out the antijoin rows. The rewritten query will have the following structure:
SELECT prod_name, week_ending_day, SUM(amount_sold) FROM join_sales_time_product_oj_mv WHERE week_ending_day BETWEEN TO_DATE('01-AUG-1999','DD-MON-YYYY') AND TO_DATE('10-AUG-1999','DD-MON-YYYY') AND prod_id IS NOT NULL GROUP BY prod_name, week_ending_day;
In general, if you use an outer join in a materialized view containing only joins, you should put in the materialized view either the primary key or the rowid on the right side of the outer join. For example, in the previous example, join_sales_time_product_oj_mv, there is a primary key on both sales and products. Another case in which a materialized view containing only joins is used is that of semijoin rewrites. That is, a query contains either an EXISTS or an IN subquery with a single table. Consider the following query, which reports the products that had sales greater than $1,000:
SELECT DISTINCT prod_name FROM products p WHERE EXISTS (SELECT * FROM sales s WHERE p.prod_id=s.prod_id AND s.amount_sold > 1000);
This query contains a semijoin (s.prod_id = p.prod_id) between the products and the sales table. This query can be rewritten to use either the join_sales_time_product_mv materialized view, if foreign key constraints are active, or the join_sales_time_product_oj_mv materialized view, if primary keys are active. Observe that both materialized views contain s.prod_id=p.prod_id, which can be used to derive
the semijoin in the query. The query is rewritten with join_sales_time_product_mv as follows:
SELECT prod_name
FROM (SELECT DISTINCT prod_name FROM join_sales_time_product_mv
      WHERE amount_sold > 1000);
If the materialized view join_sales_time_product_mv is partitioned by time_id, then this query is likely to be more efficient than the original query because the original join between sales and products has been avoided. The query could be rewritten using join_sales_time_product_oj_mv as follows.
SELECT prod_name FROM (SELECT DISTINCT prod_name FROM join_sales_time_product_oj_mv WHERE amount_sold > 1000 AND prod_id IS NOT NULL);
Rewrites with semijoins are restricted to materialized views with joins only and are not possible for materialized views with joins and aggregates.

Query Delta Joins

A query delta join is a join that appears in the query but not in the materialized view. Any number and type of delta joins in a query are allowed and they are simply retained when the query is rewritten with a materialized view. In order for the retained join to work, the materialized view must contain the joining key. Upon rewrite, the materialized view is joined to the appropriate tables in the query delta. For example, consider the following query:
SELECT p.prod_name, t.week_ending_day, c.cust_city, SUM(s.amount_sold) FROM sales s, products p, times t, customers c WHERE s.time_id=t.time_id AND s.prod_id = p.prod_id AND s.cust_id = c.cust_id GROUP BY prod_name, week_ending_day, cust_city;
Using the materialized view join_sales_time_product_mv, common joins are: s.time_id=t.time_id and s.prod_id=p.prod_id. The delta join in the query is s.cust_id=c.cust_id. The rewritten form will then join the join_sales_time_product_mv materialized view with the customers table as follows:
SELECT mv.prod_name, mv.week_ending_day, c.cust_city, SUM(mv.amount_sold) FROM join_sales_time_product_mv mv, customers c WHERE mv.cust_id = c.cust_id GROUP BY prod_name, week_ending_day, cust_city;
Materialized View Delta Joins

A materialized view delta join is a join that appears in the materialized view but not the query. All delta joins in a materialized view are required to be lossless with respect to the result of common joins. A lossless join
guarantees that the result of common joins is not restricted. A lossless join is one where, if two tables called A and B are joined together, rows in table A will always match with rows in table B and no data will be lost, hence the term lossless join. For example, every row with a foreign key matches a row with a primary key provided no nulls are allowed in the foreign key. Therefore, to guarantee a lossless join, it is necessary to have FOREIGN KEY, PRIMARY KEY, and NOT NULL constraints on appropriate join keys. Alternatively, if the join between tables A and B is an outer join (A being the outer table), it is lossless because it preserves all rows of table A.

All delta joins in a materialized view are required to be non-duplicating with respect to the result of common joins. A non-duplicating join guarantees that the result of common joins is not duplicated. For example, a non-duplicating join is one where, if table A and table B are joined together, rows in table A will match with at most one row in table B and no duplication occurs. To guarantee a non-duplicating join, the key in table B must be constrained to unique values by using a primary key or unique constraint.

Consider the following query that joins sales and times:
SELECT t.week_ending_day, SUM(s.amount_sold) FROM sales s, times t WHERE s.time_id = t.time_id AND t.week_ending_day BETWEEN TO_DATE ('01-AUG-1999', 'DD-MON-YYYY') AND TO_DATE('10-AUG-1999', 'DD-MON-YYYY') GROUP BY week_ending_day;
The materialized view join_sales_time_product_mv has an additional join (s.prod_id=p.prod_id) between sales and products. This is the delta join in join_sales_time_product_mv. You can rewrite the query if this join is lossless and non-duplicating. This is the case if s.prod_id is a foreign key to p.prod_id and is not null. The query is therefore rewritten as:
SELECT week_ending_day, SUM(amount_sold) FROM join_sales_time_product_mv WHERE week_ending_day BETWEEN TO_DATE('01-AUG-1999', 'DD-MON-YYYY') AND TO_DATE('10-AUG-1999', 'DD-MON-YYYY') GROUP BY week_ending_day;
The query can also be rewritten with the materialized view join_sales_time_product_oj_mv, where foreign key constraints are not needed. This view contains an outer join (s.prod_id=p.prod_id(+)) between sales and products. This makes the join lossless. If p.prod_id is a primary key, then the non-duplicating condition is satisfied as well and the optimizer rewrites the query as follows:
SELECT week_ending_day, SUM(amount_sold) FROM join_sales_time_product_oj_mv
WHERE week_ending_day BETWEEN TO_DATE('01-AUG-1999', 'DD-MON-YYYY') AND TO_DATE('10-AUG-1999', 'DD-MON-YYYY') GROUP BY week_ending_day;
Note that the outer join in the definition of join_sales_time_product_oj_mv is not necessary because the parent key - foreign key relationship between sales and products in the sh schema is already lossless. It is used for demonstration purposes only, and would be necessary if sales.prod_id were nullable, thus violating the losslessness of the join condition sales.prod_id = products.prod_id. Current limitations restrict most rewrites with outer joins to materialized views with joins only. There is limited support for rewrites with materialized aggregate views with outer joins, so such views should rely on foreign key constraints to assure losslessness of materialized view delta joins.

Join Equivalence Recognition

Query rewrite is able to make many transformations based upon the recognition of equivalent joins. Query rewrite recognizes the following construct as being equivalent to a join:
WHERE table1.column1 = F(args)   /* sub-expression A */
  AND table2.column2 = F(args)   /* sub-expression B */
If F(args) is a PL/SQL function that is declared to be deterministic and the arguments to both invocations of F are the same, then the combination of subexpression A with subexpression B can be recognized as a join between table1.column1 and table2.column2. That is, the following expression is equivalent to the previous expression:
WHERE table1.column1 = F(args)            /* sub-expression A */
  AND table2.column2 = F(args)            /* sub-expression B */
  AND table1.column1 = table2.column2     /* join-expression J */
Because join-expression J can be inferred from sub-expression A and subexpression B, the inferred join can be used to match a corresponding join of table1.column1 = table2.column2 in a materialized view.
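For this inference to apply, the function must be declared DETERMINISTIC. The following sketch shows such a declaration (std_name is a hypothetical function name used only for illustration, not part of the sh schema):

CREATE OR REPLACE FUNCTION std_name (name IN VARCHAR2)
  RETURN VARCHAR2 DETERMINISTIC
IS
BEGIN
  -- Always returns the same output for the same input, which is what
  -- allows query rewrite to infer the join between the two columns.
  RETURN UPPER(TRIM(name));
END std_name;
/

With such a function, a WHERE clause of the form t1.c1 = std_name(x) AND t2.c2 = std_name(x) can be matched against a materialized view containing the join t1.c1 = t2.c2.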
Data Sufficiency Check

In this check, the optimizer determines whether the column data requested by a query can be obtained from a materialized view. If a column is equivalent to another column through a join such as A.X = B.X, query rewrite can match column A.X in a query with column B.X in a materialized view or vice versa. For example, consider the following query:
SELECT p.prod_name, s.time_id, t.week_ending_day, SUM(s.amount_sold) FROM sales s, products p, times t WHERE s.time_id=t.time_id AND s.prod_id = p.prod_id GROUP BY p.prod_name, s.time_id, t.week_ending_day;
This query can be answered with join_sales_time_product_mv even though the materialized view does not have s.time_id. Instead, it has t.time_id, which, through a join condition s.time_id=t.time_id, is equivalent to s.time_id. Thus, the optimizer might select the following rewrite:
SELECT prod_name, time_id, week_ending_day, SUM(amount_sold) FROM join_sales_time_product_mv GROUP BY prod_name, time_id, week_ending_day;
- Query Rewrite Using Partially Stale Materialized Views
- Query Rewrite Using Nested Materialized Views
- Query Rewrite When Using GROUP BY Extensions
- Query Rewrite with Inline Views
- Query Rewrite with Selfjoins
- Query Rewrite and View Constraints
- Query Rewrite and Expression Matching
- Date Folding Rewrite
Query rewrite can use a materialized view in ENFORCED or TRUSTED mode if the rows from the materialized view used to answer the query are known to be FRESH. The fresh rows in the materialized view are identified by adding selection predicates to the materialized view's WHERE clause. Oracle will rewrite a query with this materialized view if its answer is contained within this (restricted) materialized view.
CREATE MATERIALIZED VIEW sum_sales_per_city_mv ENABLE QUERY REWRITE AS SELECT s.time_id, p.prod_subcategory, c.cust_city, SUM(s.amount_sold) AS sum_amount_sold FROM sales s, products p, customers c WHERE s.cust_id = c.cust_id AND s.prod_id = p.prod_id GROUP BY time_id, prod_subcategory, cust_city;
Also suppose that new data will be inserted for December 2000, which will be assigned to partition sales_q4_2000. For testing purposes, you can apply an arbitrary DML operation on sales that changes a partition other than sales_q1_2000 while this materialized view is fresh, because the query shown later requests data only from partitions that remain untouched. For example:
INSERT INTO SALES VALUES(17, 10, '01-DEC-2000', 4, 380, 123.45, 54321);
Until a refresh is done, the materialized view is generically stale and cannot be used for unlimited rewrite in enforced mode. However, because the table sales is partitioned and not all partitions have been modified, Oracle can identify all partitions that have not been touched. The optimizer can identify the fresh rows in the materialized view (the data which is unaffected by updates since the last refresh operation) by implicitly adding selection predicates to the materialized view defining query as follows:
SELECT s.time_id, p.prod_subcategory, c.cust_city, SUM(s.amount_sold) AS sum_amount_sold
FROM sales s, products p, customers c
WHERE s.cust_id = c.cust_id AND s.prod_id = p.prod_id
  AND (s.time_id < TO_DATE('01-OCT-2000','DD-MON-YYYY')
       OR s.time_id >= TO_DATE('01-OCT-2001','DD-MON-YYYY'))
GROUP BY time_id, prod_subcategory, cust_city;
Note that the freshness of partially stale materialized views is tracked on a per-partition basis, and not on a logical basis. Because the partitioning strategy of the sales fact table is quarterly, changes in December 2000 cause the complete partition sales_q4_2000 to become stale. Consider the following query, which asks for sales in quarters 1 and 2 of 2000:
SELECT s.time_id, p.prod_subcategory, c.cust_city, SUM(s.amount_sold) AS sum_amount_sold
FROM sales s, products p, customers c
WHERE s.cust_id = c.cust_id AND s.prod_id = p.prod_id
  AND s.time_id BETWEEN TO_DATE('01-JAN-2000', 'DD-MON-YYYY')
      AND TO_DATE('01-JUL-2000', 'DD-MON-YYYY')
GROUP BY time_id, prod_subcategory, cust_city;
Oracle Database knows that those ranges of rows in the materialized view are fresh and can therefore rewrite the query with the materialized view. The rewritten query looks as follows:
SELECT time_id, prod_subcategory, cust_city, sum_amount_sold FROM sum_sales_per_city_mv WHERE time_id BETWEEN TO_DATE('01-JAN-2000', 'DD-MON-YYYY') AND TO_DATE('01-JUL-2000', 'DD-MON-YYYY');
Instead of the partitioning key, a partition marker (a function that identifies the partition given a rowid) can be present in the SELECT (and GROUP BY) list of the materialized view. You can use the materialized view to rewrite queries that require data from only certain partitions (identifiable by the partition marker), for instance, queries that have a predicate specifying ranges of the partitioning keys containing entire partitions. See Chapter 9, "Advanced Materialized Views" for details regarding the supplied partition marker function DBMS_MVIEW.PMARKER. The following example illustrates the use of a partition marker in the materialized view instead of directly using the partition key column:
CREATE MATERIALIZED VIEW sum_sales_per_city_2_mv ENABLE QUERY REWRITE AS SELECT DBMS_MVIEW.PMARKER(s.rowid) AS pmarker, t.fiscal_quarter_desc, p.prod_subcategory, c.cust_city, SUM(s.amount_sold) AS sum_amount_sold FROM sales s, products p, customers c, times t WHERE s.cust_id = c.cust_id AND s.prod_id = p.prod_id AND s.time_id = t.time_id GROUP BY DBMS_MVIEW.PMARKER(s.rowid), prod_subcategory, cust_city, fiscal_quarter_desc;
Suppose you know that the partition sales_q1_2000 is fresh and that DML changes have taken place for other partitions of the sales table. For testing purposes, you can apply an arbitrary DML operation on sales that changes a partition other than sales_q1_2000 while the materialized view is fresh. An example is the following:
INSERT INTO SALES VALUES(17, 10, '01-DEC-2000', 4, 380, 123.45, 54321);
Although the materialized view sum_sales_per_city_2_mv is now considered generically stale, Oracle Database can rewrite the following query using this materialized view. This query restricts the data to the partition sales_q1_2000, and selects only certain values of cust_city, as shown in the following:
SELECT p.prod_subcategory, c.cust_city, SUM(s.amount_sold) AS sum_amount_sold FROM sales s, products p, customers c, times t WHERE s.cust_id = c.cust_id AND s.prod_id = p.prod_id AND s.time_id = t.time_id AND c.cust_city= 'Nuernberg' AND s.time_id >=TO_DATE('01-JAN-2000','dd-mon-yyyy') AND s.time_id < TO_DATE('01-APR-2000','dd-mon-yyyy') GROUP BY prod_subcategory, cust_city;
Note that rewrite with a partially stale materialized view that contains a PMARKER function can only take place when the complete data content of one or more partitions is accessed and the predicate condition is on the partitioned fact table itself, as shown in the earlier example. The DBMS_MVIEW.PMARKER function gives you exactly one distinct value for each partition. This dramatically reduces the number of rows in a potential materialized view compared to the partitioning key itself, but you are also giving up any detailed information about this key. The only thing you know is the partition number and, therefore, the lower and upper boundary values. This is the trade-off for reducing the cardinality of the range partitioning column and thus the number of rows. Assuming the value of pmarker for partition sales_q1_2000 is 31070, the previously shown query can be rewritten against the materialized view as follows:
SELECT mv.prod_subcategory, mv.cust_city, SUM(mv.sum_amount_sold) FROM sum_sales_per_city_2_mv mv WHERE mv.pmarker = 31070 AND mv.cust_city= 'Nuernberg' GROUP BY prod_subcategory, cust_city;
So the query can be rewritten against the materialized view without accessing stale data.
CREATE MATERIALIZED VIEW sum_sales_time_product_mv ENABLE QUERY REWRITE AS SELECT mv.prod_name, mv.week_ending_day, COUNT(*) cnt_all, SUM(mv.amount_sold) sum_amount_sold, COUNT(mv.amount_sold) cnt_amount_sold FROM join_sales_time_product_mv mv GROUP BY mv.prod_name, mv.week_ending_day;
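The user query driving this nested rewrite is not reproduced in this excerpt; a query consistent with the rewrites shown below (an assumption for illustration) would be:

SELECT p.prod_name, t.week_ending_day, SUM(s.amount_sold)
FROM sales s, products p, times t
WHERE s.time_id = t.time_id AND s.prod_id = p.prod_id
GROUP BY p.prod_name, t.week_ending_day;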
Oracle first tries to rewrite it with a materialized aggregate view and finds there is none eligible (note that the single-table aggregate materialized view sum_sales_time_product_mv cannot yet be used), and then tries a rewrite with a materialized join view and finds that join_sales_time_product_mv is eligible for rewrite. The rewritten query has this form:
SELECT mv.prod_name, mv.week_ending_day, SUM(mv.amount_sold) FROM join_sales_time_product_mv mv GROUP BY mv.prod_name, mv.week_ending_day;
Because a rewrite occurred, Oracle tries the process again. This time, the query can be rewritten with the single-table aggregate materialized view sum_sales_time_product_mv into the following form:
SELECT mv.prod_name, mv.week_ending_day, mv.sum_amount_sold FROM sum_sales_time_product_mv mv;
The term base grouping for queries with GROUP BY extensions denotes all unique expressions present in the GROUP BY clause. In the previous query, the grouping (p.prod_subcategory, t.calendar_month_desc, c.cust_city) is the base grouping. The extensions can be present in user queries and in the queries defining materialized views. In both cases, materialized view rewrite applies and you can distinguish rewrite capabilities into the following scenarios:
- Materialized View has Simple GROUP BY and Query has Extended GROUP BY
- Materialized View has Extended GROUP BY and Query has Simple GROUP BY
- Both Materialized View and Query Have Extended GROUP BY
Materialized View has Simple GROUP BY and Query has Extended GROUP BY

When a query contains an extended GROUP BY clause, it can be rewritten with a materialized view if its base grouping can be rewritten using the materialized view, as listed in the rewrite rules explained in "When Does Oracle Rewrite a Query?" on page 18-4. For example, in the following query:
SELECT p.prod_subcategory, t.calendar_month_desc, c.cust_city, SUM(s.amount_sold) AS sum_amount_sold FROM sales s, customers c, products p, times t WHERE s.time_id=t.time_id AND s.prod_id = p.prod_id AND s.cust_id = c.cust_id GROUP BY GROUPING SETS ((p.prod_subcategory, t.calendar_month_desc), (c.cust_city, p.prod_subcategory));
The base grouping is (p.prod_subcategory, t.calendar_month_desc, c.cust_city) and, consequently, Oracle can rewrite the query using sum_sales_pscat_month_city_mv as follows:
SELECT mv.prod_subcategory, mv.calendar_month_desc, mv.cust_city, SUM(mv.sum_amount_sold) AS sum_amount_sold FROM sum_sales_pscat_month_city_mv mv GROUP BY GROUPING SETS ((mv.prod_subcategory, mv.calendar_month_desc), (mv.cust_city, mv.prod_subcategory));
A special situation arises if the query uses the EXPAND_GSET_TO_UNION hint. See "Hint for Queries with Extended GROUP BY" on page 18-44 for an example of using EXPAND_GSET_TO_UNION.
Materialized View has Extended GROUP BY and Query has Simple GROUP BY

In order for a materialized view with an extended GROUP BY to be used for rewrite, it must satisfy two additional conditions:
- It must contain a grouping distinguisher, which is the GROUPING_ID function on all GROUP BY expressions. For example, if the GROUP BY clause of the materialized view is GROUP BY CUBE(a, b), then the SELECT list should contain GROUPING_ID(a, b).
- The GROUP BY clause of the materialized view should not result in any duplicate groupings. For example, GROUP BY GROUPING SETS ((a, b), (a, b)) would disqualify a materialized view from general rewrite.
A materialized view with an extended GROUP BY contains multiple groupings. Oracle finds the grouping with the lowest cost from which the query can be computed and uses that for rewrite. For example, consider the following materialized view:
CREATE MATERIALIZED VIEW sum_grouping_set_mv ENABLE QUERY REWRITE AS SELECT p.prod_category, p.prod_subcategory, c.cust_state_province, c.cust_city, GROUPING_ID(p.prod_category,p.prod_subcategory, c.cust_state_province,c.cust_city) AS gid, SUM(s.amount_sold) AS sum_amount_sold FROM sales s, products p, customers c WHERE s.prod_id = p.prod_id AND s.cust_id = c.cust_id GROUP BY GROUPING SETS ((p.prod_category, p.prod_subcategory, c.cust_city), (p.prod_category, p.prod_subcategory, c.cust_state_province, c.cust_city), (p.prod_category, p.prod_subcategory));
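The query being rewritten in this example is not reproduced in this excerpt; a simple GROUP BY query consistent with the rewrite shown below (an assumption for illustration) would be:

SELECT p.prod_subcategory, c.cust_city, SUM(s.amount_sold) AS sum_amount_sold
FROM sales s, products p, customers c
WHERE s.prod_id = p.prod_id AND s.cust_id = c.cust_id
GROUP BY p.prod_subcategory, c.cust_city;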
This query will be rewritten with the closest matching grouping from the materialized view. That is, the (prod_category, prod_subcategory, cust_city) grouping:
SELECT prod_subcategory, cust_city, SUM(sum_amount_sold) AS sum_amount_sold
FROM sum_grouping_set_mv
WHERE gid = <grouping id of (prod_category, prod_subcategory, cust_city)>
GROUP BY prod_subcategory, cust_city;
Both Materialized View and Query Have Extended GROUP BY

When both the materialized view and the query contain GROUP BY extensions, Oracle uses two strategies for rewrite: grouping match and UNION ALL rewrite. First, Oracle tries grouping match. The groupings in the query are matched against groupings in the materialized view and, if all are matched with no rollup, Oracle selects them from the materialized view. For example, consider the following query:
SELECT p.prod_category, p.prod_subcategory, c.cust_city, SUM(s.amount_sold) AS sum_amount_sold FROM sales s, products p, customers c WHERE s.prod_id = p.prod_id AND s.cust_id = c.cust_id GROUP BY GROUPING SETS ((p.prod_category, p.prod_subcategory, c.cust_city), (p.prod_category, p.prod_subcategory));
This query matches two groupings from sum_grouping_set_mv and Oracle rewrites the query as the following:
SELECT prod_category, prod_subcategory, cust_city, sum_amount_sold
FROM sum_grouping_set_mv
WHERE gid = <grouping id of (prod_category, prod_subcategory, cust_city)>
   OR gid = <grouping id of (prod_category, prod_subcategory)>;
If grouping match fails, Oracle tries a general rewrite mechanism called UNION ALL rewrite. Oracle first represents the query with the extended GROUP BY clause as an equivalent UNION ALL query. Every grouping of the original query is placed in a separate UNION ALL branch. The branch will have a simple GROUP BY clause. For example, consider this query:
SELECT p.prod_category, p.prod_subcategory, c.cust_state_province, t.calendar_month_desc, SUM(s.amount_sold) AS sum_amount_sold FROM sales s, products p, customers c, times t WHERE s.prod_id = p.prod_id AND s.cust_id = c.cust_id GROUP BY GROUPING SETS ((p.prod_subcategory, t.calendar_month_desc), (t.calendar_month_desc), (p.prod_category, p.prod_subcategory, c.cust_state_province), (p.prod_category, p.prod_subcategory));
This query is internally represented as the following equivalent UNION ALL query, with one branch for each grouping:

SELECT null, p.prod_subcategory, null, t.calendar_month_desc,
  SUM(s.amount_sold) AS sum_amount_sold
FROM sales s, products p, customers c, times t
WHERE s.prod_id = p.prod_id AND s.cust_id = c.cust_id
GROUP BY p.prod_subcategory, t.calendar_month_desc
UNION ALL
SELECT null, null, null, t.calendar_month_desc, SUM(s.amount_sold) AS sum_amount_sold
FROM sales s, products p, customers c, times t
WHERE s.prod_id = p.prod_id AND s.cust_id = c.cust_id
GROUP BY t.calendar_month_desc
UNION ALL
SELECT p.prod_category, p.prod_subcategory, c.cust_state_province, null,
  SUM(s.amount_sold) AS sum_amount_sold
FROM sales s, products p, customers c, times t
WHERE s.prod_id = p.prod_id AND s.cust_id = c.cust_id
GROUP BY p.prod_category, p.prod_subcategory, c.cust_state_province
UNION ALL
SELECT p.prod_category, p.prod_subcategory, null, null,
  SUM(s.amount_sold) AS sum_amount_sold
FROM sales s, products p, customers c, times t
WHERE s.prod_id = p.prod_id AND s.cust_id = c.cust_id
GROUP BY p.prod_category, p.prod_subcategory;
Each branch is then rewritten separately using the rules from "When Does Oracle Rewrite a Query?" on page 18-4. Using the materialized view sum_grouping_set_mv, Oracle can rewrite only branches three (which requires materialized view rollup) and four (which matches the materialized view exactly). The unrewritten branches will be converted back to the extended GROUP BY form. Thus, eventually, the query is rewritten as:
SELECT null, p.prod_subcategory, null, t.calendar_month_desc,
  SUM(s.amount_sold) AS sum_amount_sold
FROM sales s, products p, customers c, times t
WHERE s.prod_id = p.prod_id AND s.cust_id = c.cust_id
GROUP BY GROUPING SETS ((p.prod_subcategory, t.calendar_month_desc), (t.calendar_month_desc))
UNION ALL
SELECT prod_category, prod_subcategory, cust_state_province, null,
  SUM(sum_amount_sold) AS sum_amount_sold
FROM sum_grouping_set_mv
WHERE gid = <grouping id of (prod_category, prod_subcategory, cust_state_province, cust_city)>
GROUP BY prod_category, prod_subcategory, cust_state_province
UNION ALL
SELECT prod_category, prod_subcategory, null, null, sum_amount_sold
FROM sum_grouping_set_mv
WHERE gid = <grouping id of (prod_category, prod_subcategory)>;
Note that a query with extended GROUP BY is represented as an equivalent UNION ALL and recursively submitted for rewrite optimization. The groupings that cannot be rewritten stay in the last branch of UNION ALL and access the base data instead.
And here is the query that will be rewritten to use the materialized view:
SELECT t.calendar_month_name, t.calendar_year, p.prod_category, SUM(X1.revenue) AS sum_revenue FROM times t, products p, (SELECT time_id, prod_id, amount_sold*0.2 AS revenue FROM sales) X1 WHERE t.time_id = X1.time_id AND p.prod_id = X1.prod_id GROUP BY calendar_month_name, calendar_year, prod_category;
The following query fails the exact text match test but is rewritten because the aliases for the table references match:
SELECT s.prod_id, t2.fiscal_week_number - t1.fiscal_week_number AS lag FROM times t1, sales s, times t2 WHERE t1.time_id = s.time_id AND t2.time_id = s.time_id_ship;
Note that Oracle performs other checks to ensure the correct match of an instance of a multiply instanced table in the request query with the corresponding table instance in the materialized view. For instance, in the following example, Oracle correctly determines that the matching alias names used for the multiple instances of table times do not establish a match between the multiple instances of table times in the materialized view. The following query cannot be rewritten using sales_shipping_lag_mv, even though the alias names of the multiply instanced table times match, because the joins are not compatible between the instances of times aliased by t2:
SELECT s.prod_id, t2.fiscal_week_number - t1.fiscal_week_number AS lag FROM times t1, sales s, times t2 WHERE t1.time_id = s.time_id AND t2.time_id = s.time_id_paid;
This request query joins the instance of the times table aliased by t2 on the s.time_id_paid column, while the materialized view joins the instance of the times table aliased by t2 on the s.time_id_ship column. Because the join conditions differ, Oracle correctly determines that rewrite cannot occur. The following query does not have any matching alias in the materialized view, sales_shipping_lag_mv, for the table times. But query rewrite will now compare the joins between the query and the materialized view and correctly match the multiple instances of times.
SELECT s.prod_id, x2.fiscal_week_number - x1.fiscal_week_number AS lag FROM times x1, sales s, times x2 WHERE x1.time_id = s.time_id AND x2.time_id = s.time_id_ship;
Defining constraints on base tables is necessary, not only for data correctness and cleanliness, but also for materialized view query rewrite purposes using the original base objects. Materialized view rewrite extensively uses constraints for query rewrite. They are used for determining lossless joins, which, in turn, determine if joins in the materialized view are compatible with joins in the query and thus if rewrite is possible. DISABLE NOVALIDATE is the only valid state for a view constraint. However, you can choose RELY or NORELY as the view constraint state to enable more sophisticated query rewrites. For example, a view constraint in the RELY state allows query rewrite to occur when the query integrity level is set to TRUSTED. Table 18-2 illustrates when view constraints are used for determining lossless joins. Note that view constraints cannot be used for query rewrite integrity level ENFORCED. This level enforces the highest degree of constraint enforcement: ENABLE VALIDATE.
Table 18-2 View Constraints and Rewrite Integrity Modes

Rewrite Integrity Mode    View Constraint RELY    View Constraint NORELY
ENFORCED                  No                      No
TRUSTED                   Yes                     No
STALE_TOLERATED           Yes                     No
To demonstrate the rewrite capabilities on views, you need to extend the sh sample schema as follows:
CREATE VIEW time_view AS SELECT time_id, TO_NUMBER(TO_CHAR(time_id, 'ddd')) AS day_in_year FROM times;
You can now establish a foreign key/primary key relationship (in RELY mode) between the view and the fact table, and thus rewrite will take place as described in Table 18-2, by adding the following constraints. Rewrite will then work, for example, in TRUSTED mode.
ALTER VIEW time_view ADD (CONSTRAINT time_view_pk PRIMARY KEY (time_id) DISABLE NOVALIDATE); ALTER VIEW time_view MODIFY CONSTRAINT time_view_pk RELY; ALTER TABLE sales ADD (CONSTRAINT time_view_fk FOREIGN KEY (time_id) REFERENCES time_view(time_id) DISABLE NOVALIDATE);
The following query, omitting the dimension table products, will also be rewritten without the primary key/foreign key relationships, because the suppressed join between sales and products is known to be lossless.
SELECT t.day_in_year, SUM(s.amount_sold) AS sum_amount_sold FROM time_view t, sales s WHERE t.time_id = s.time_id GROUP BY t.day_in_year;
However, if the materialized view sales_pcat_cal_day_mv were defined only in terms of the view time_view, then you could not rewrite the following query, suppressing the join between sales and time_view, because there is no basis for losslessness of the delta materialized view join. With the additional constraints as shown previously, this query will also rewrite.
SELECT p.prod_category, SUM(s.amount_sold) AS sum_amount_sold FROM sales s, products p WHERE p.prod_id = s.prod_id GROUP BY p.prod_category;
To undo the changes you have made to the sh schema, issue the following statements:
ALTER TABLE sales DROP CONSTRAINT time_view_fk; DROP VIEW time_view;
View Constraints Restrictions

If the referential constraint definition involves a view, that is, either the foreign key or the referenced key resides in a view, the constraint can only be in DISABLE NOVALIDATE mode. A RELY constraint on a view is allowed only if the referenced UNIQUE or PRIMARY KEY constraint in DISABLE NOVALIDATE mode is also a RELY constraint. The specification of ON DELETE actions associated with a referential integrity constraint is not allowed (for example, DELETE CASCADE). However, DELETE, UPDATE, and INSERT operations are allowed on views and their base tables as view constraints are in DISABLE NOVALIDATE mode.
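Query Rewrite and Expression Matching

The materialized view and query for the following expression-matching example are not shown in this excerpt; a sketch consistent with the rewrite below (an assumption for illustration, the exact definitions may differ) is:

CREATE MATERIALIZED VIEW sales_by_age_bracket_mv ENABLE QUERY REWRITE AS
SELECT (2000 - c.cust_year_of_birth)/10 - 0.5 AS age_bracket,
  SUM(s.amount_sold) AS sum_amount_sold
FROM sales s, customers c
WHERE s.cust_id = c.cust_id
GROUP BY (2000 - c.cust_year_of_birth)/10 - 0.5;

SELECT (2000 - c.cust_year_of_birth)/10 - 0.5 AS age_bracket,
  SUM(s.amount_sold) AS sum_amount_sold
FROM sales s, customers c
WHERE s.cust_id = c.cust_id
GROUP BY (2000 - c.cust_year_of_birth)/10 - 0.5;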
This query is rewritten in terms of sales_by_age_bracket_mv based on the matching of the canonical forms of the age bracket expressions (that is, (2000 - c.cust_year_of_birth)/10 - 0.5), as follows:
SELECT age_bracket, sum_amount_sold FROM sales_by_age_bracket_mv;
Date folding rewrite applies when the underlying datatype of the date column is an Oracle DATE. The expression matching is done based on the use of canonical forms for the expressions. DATE is a built-in datatype which represents ordered time units such as seconds, days, and months, and incorporates a time hierarchy (second -> minute -> hour -> day -> month -> quarter -> year). This hard-coded knowledge about DATE is used in folding date ranges from lower-date granules to higher-date granules. Specifically, folding a date value to the beginning of a month, quarter, year, or to the end of a month, quarter, year is supported. For example, the date value 1-jan-1999 can be folded into the beginning of either year 1999 or quarter 1999-1 or month 1999-01. And, the date value 30-sep-1999 can be folded into the end of either quarter 1999-03 or month 1999-09.
Note: Due to the way date folding works, you should be careful
when using BETWEEN and date columns. The best way to use BETWEEN and date columns is to increment the later date by 1. In other words, instead of using date_col BETWEEN '1-jan-1999' AND '30-jun-1999', you should use date_col BETWEEN '1-jan-1999' AND '1-jul-1999'. You could also use the TRUNC function to get the equivalent result, as in TRUNC(date_col) BETWEEN '1-jan-1999' AND '30-jun-1999'. TRUNC will, however, strip time values.

Because date values are ordered, any range predicate specified on date columns can be folded from lower level granules into higher level granules provided the date range represents an integral number of higher level granules. For example, the range predicate date_col >= '1-jan-1999' AND date_col < '1-jul-1999' can be folded into either a month range or a quarter range using the TO_CHAR function, which extracts specific date components from a date value. The advantage of aggregating data by folded date values is the compression of data achieved. Without date folding, the data is aggregated at the lowest granularity level, resulting in increased disk space for storage and increased I/O to scan the materialized view. Consider a query that asks for the sum of sales by product category for the year 1998:
SELECT p.prod_category, SUM(s.amount_sold) FROM sales s, products p WHERE s.prod_id=p.prod_id AND s.time_id >= TO_DATE('01-jan-1998', 'dd-mon-yyyy') AND s.time_id < TO_DATE('01-jan-1999', 'dd-mon-yyyy') GROUP BY p.prod_category;
You can create a materialized view that aggregates sales by month, so that the query's date range can be folded into month granules and matched against it:

CREATE MATERIALIZED VIEW sum_sales_pcat_monthly_mv
ENABLE QUERY REWRITE AS
SELECT p.prod_category, TO_CHAR(s.time_id,'YYYY-MM') AS month,
  SUM(s.amount_sold) AS sum_amount
FROM sales s, products p
WHERE s.prod_id = p.prod_id
GROUP BY p.prod_category, TO_CHAR(s.time_id, 'YYYY-MM');

The query is first folded into the following form:

SELECT p.prod_category, SUM(s.amount_sold)
FROM sales s, products p
WHERE s.prod_id = p.prod_id
  AND TO_CHAR(s.time_id, 'YYYY-MM') >= '1998-01'
  AND TO_CHAR(s.time_id, 'YYYY-MM') < '1999-01'
GROUP BY p.prod_category;

It is then rewritten against the materialized view as:

SELECT mv.prod_category, SUM(mv.sum_amount)
FROM sum_sales_pcat_monthly_mv mv
WHERE month >= '1998-01' AND month < '1999-01'
GROUP BY mv.prod_category;
The range specified in the query represents an integral number of years, quarters, or months. Assume that there is a materialized view mv3 that contains pre-summarized sales by prod_type and is defined as follows:
CREATE MATERIALIZED VIEW mv3 ENABLE QUERY REWRITE AS SELECT prod_type, TO_CHAR(sale_date,'yyyy-mm') AS month, SUM(sales) AS sum_sales FROM sales, products WHERE sales.prod_id = products.prod_id GROUP BY prod_type, TO_CHAR(sale_date, 'yyyy-mm');
The query can be rewritten by first folding the date range into the month range and then matching the expressions representing the months with the month expression in mv3. This rewrite is shown in two steps (first folding the date range followed by the actual rewrite).
SELECT prod_type, SUM(sales) AS sum_sales
FROM sales, products
WHERE sales.prod_id = products.prod_id
  AND TO_CHAR(sale_date, 'yyyy-mm') >= TO_CHAR('01-jan-1998', 'yyyy-mm')
  AND TO_CHAR(sale_date, 'yyyy-mm') < TO_CHAR('01-jan-1999', 'yyyy-mm')
GROUP BY prod_type;

SELECT prod_type, SUM(sum_sales)
FROM mv3
WHERE month >= TO_CHAR('01-jan-1998', 'yyyy-mm')
  AND month < TO_CHAR('01-jan-1999', 'yyyy-mm')
GROUP BY prod_type;
If mv3 had pre-summarized sales by prod_type and year instead of prod_type and month, the query could still be rewritten by folding the date range into year range and then matching the year expressions.
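As a sketch of that alternative (mv3_annual is a hypothetical name used only for illustration, and the exact definition is an assumption), the materialized view and the corresponding rewrite might look like this:

CREATE MATERIALIZED VIEW mv3_annual ENABLE QUERY REWRITE AS
SELECT prod_type, TO_CHAR(sale_date,'yyyy') AS year, SUM(sales) AS sum_sales
FROM sales, products
WHERE sales.prod_id = products.prod_id
GROUP BY prod_type, TO_CHAR(sale_date,'yyyy');

SELECT prod_type, SUM(sum_sales) AS sum_sales
FROM mv3_annual
WHERE year = '1998'
GROUP BY prod_type;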
- PCT Rewrite Based on LIST Partitioned Tables
- PCT and PMARKER
- PCT Rewrite with Materialized Views Based on Range-List Partitioned Tables
- PCT Rewrite Using Rowid as Pmarker
CREATE TABLE sales_par_list
PARTITION BY LIST (country_name)
  (PARTITION America VALUES ('United States of America', 'Argentina'),
   PARTITION Asia    VALUES ('Japan', 'India'),
   PARTITION Europe  VALUES ('France', 'Spain', 'Ireland'))
AS SELECT t.calendar_year, t.calendar_month_number, t.day_number_in_month,
   c1.country_name, s.prod_id, s.quantity_sold, s.amount_sold
FROM times t, countries c1, sales s, customers c2
WHERE s.time_id = t.time_id AND s.cust_id = c2.cust_id
  AND c2.country_id = c1.country_id
  AND c1.country_name IN ('United States of America', 'Argentina', 'Japan',
      'India', 'France', 'Spain', 'Ireland');
If a materialized view is created on the table sales_par_list, which has a list partitioning key, PCT rewrite will use that materialized view for potential rewrites. To illustrate this feature, the following example creates a materialized view that has the total amounts sold of every product in each country for each year. The view depends on the detail tables sales_par_list and products.
CREATE MATERIALIZED VIEW sales_per_country_mv BUILD IMMEDIATE REFRESH FORCE ON DEMAND ENABLE QUERY REWRITE AS SELECT s.calendar_year calendar_year, s.country_name country_name, p.prod_name prod_name, SUM(s.amount_sold) AS sum_sales, COUNT(*) AS cnt FROM sales_par_list s, products p WHERE s.prod_id = p.prod_id AND s.calendar_year <= 2000 GROUP BY s.calendar_year, s.country_name, prod_name;
sales_per_country_mv supports PCT against sales_par_list as its list partition key country_name is in its SELECT and GROUP BY list. Table products is not partitioned, so sales_per_country_mv does not support PCT against this table. A query could be rewritten (in ENFORCED or TRUSTED modes) in terms of sales_per_country_mv even if sales_per_country_mv is stale, if the incoming query accesses only fresh parts of the materialized view. You can determine which parts of the materialized view are FRESH only if the updated tables are PCT enabled in the materialized view; if non-PCT enabled tables have been updated, then the rewrite is not possible with fresh data from that specific materialized view, as you cannot identify the FRESH portions of the materialized view. sales_per_country_mv supports PCT on sales_par_list and does not support PCT on table products. If table products is updated, then PCT rewrite is not possible with sales_per_country_mv as you cannot tell which portions of the materialized view are FRESH.
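The INSERT statement referred to in the next paragraph is not shown in this excerpt; for illustration (the values are assumptions, following the column order of sales_par_list as created above), it could be of the following form:

INSERT INTO sales_par_list
VALUES (2000, 10, 22, 'France', 900, 2, 200.99);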
This statement inserted a row into partition Europe in table sales_par_list. Now sales_per_country_mv is stale, but PCT rewrite (in ENFORCED and TRUSTED modes) is possible as this materialized view supports PCT against table sales_par_list. The fresh and stale areas of the materialized view are identied based on the partitioned detail table sales_par_list. Figure 183 illustrates what is fresh and what is stale in this example.
Figure 18-3 PCT Rewrite and List Partitioning
[Figure: the freshness regions of sales_per_country_mv are determined by country_name; rows corresponding to the updated partition Europe (France, Spain, Ireland) of sales_par_list are stale, while rows for partitions America (United States of America, Argentina) and Asia (Japan, India) remain fresh.]
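The query referred to in the next paragraph is not reproduced in this excerpt; a query consistent with the rewritten form shown below (an assumption for illustration) is:

SELECT s.country_name, p.prod_name, SUM(s.amount_sold) AS sum_sales, COUNT(*) AS cnt
FROM sales_par_list s, products p
WHERE s.prod_id = p.prod_id AND s.calendar_year = 2000
  AND s.country_name IN ('United States of America', 'Japan')
GROUP BY s.country_name, p.prod_name;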
This query accesses partitions America and Asia in sales_par_list; these partitions have not been updated, so rewrite is possible with the stale materialized view
sales_per_country_mv as this query will access only FRESH portions of the materialized view. The query is rewritten in terms of sales_per_country_mv as follows:
SELECT country_name, prod_name, SUM(sum_sales) AS sum_sales, SUM(cnt) AS cnt
FROM sales_per_country_mv
WHERE calendar_year = 2000
  AND country_name IN ('United States of America', 'Japan')
GROUP BY country_name, prod_name;
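By contrast, consider a query of the following form (an assumed example; the actual statement is not shown in this excerpt), which touches the updated Europe partition:

SELECT s.country_name, p.prod_name, SUM(s.amount_sold) AS sum_sales, COUNT(*) AS cnt
FROM sales_par_list s, products p
WHERE s.prod_id = p.prod_id AND s.calendar_year = 2000
  AND s.country_name IN ('France', 'Japan')
GROUP BY s.country_name, p.prod_name;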
This query accesses partitions Europe and Asia in sales_par_list; partition Europe has been updated, so this query cannot be rewritten in terms of sales_per_country_mv because the required data from the materialized view is stale. You will be able to rewrite after any kind of update to sales_par_list, that is, DMLs, direct loads, and Partition Maintenance Operations (PMOPs), if the incoming query accesses FRESH parts of the materialized view.
CREATE MATERIALIZED VIEW sales_per_dt_partition_mv
ENABLE QUERY REWRITE AS
SELECT s.calendar_year calendar_year, p.prod_name prod_name,
  DBMS_MVIEW.PMARKER(s.rowid) pmarker,
  SUM(s.amount_sold) AS sum_sales, COUNT(*) AS cnt
FROM sales_par_list s, products p
WHERE s.prod_id = p.prod_id AND s.calendar_year > 2000
GROUP BY s.calendar_year, DBMS_MVIEW.PMARKER(s.rowid), p.prod_name;
The materialized view sales_per_dt_partition_mv provides the sum of sales for each detail table partition. This materialized view supports PCT rewrite against table sales_par_list because the partition marker is in its SELECT and GROUP BY clauses. Table 18-3 lists the partition names and their pmarkers for this example.
Table 18-3 Partition Names and Their Pmarkers

Partition    Pmarker
America      1000
Asia         1001
Europe       1002
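The deletion referred to in the next paragraph is not shown in this excerpt; for illustration (a hypothetical statement), it could be:

DELETE FROM sales_par_list WHERE country_name = 'India';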
You have deleted rows from partition Asia in table sales_par_list. Now sales_per_dt_partition_mv is stale, but PCT rewrite (in ENFORCED and TRUSTED modes) is possible as this materialized view supports PCT (pmarker based) against table sales_par_list. Now consider the following query:
SELECT p.prod_name, SUM(s.amount_sold) AS sum_sales, COUNT(*) AS cnt FROM sales_par_list s, products p WHERE s.prod_id = p.prod_id AND s.calendar_year = 2001 AND s.country_name IN ('United States of America', 'Argentina') GROUP BY p.prod_name;
This query can be rewritten in terms of sales_per_dt_partition_mv as all the data corresponding to a detail table partition is accessed, and the materialized view is FRESH with respect to this data. This query accesses all data in partition America, which has not been updated. The query is rewritten in terms of sales_per_dt_partition_mv as follows:
SELECT prod_name, SUM(sum_sales) AS sum_sales, SUM(cnt) AS cnt FROM sales_per_dt_partition_mv WHERE calendar_year = 2001 AND pmarker = 1000 GROUP BY prod_name;
FROM times t, countries c1, products p, sales s, customers c2 WHERE s.time_id = t.time_id AND s.prod_id = p.prod_id AND s.cust_id = c2.cust_id AND c2.country_id = c1.country_id AND c1.country_name IN ('United States of America', 'Argentina', 'Japan', 'India', 'France', 'Spain', 'Ireland');
Let us consider the following materialized view, sum_sales_per_year_month_mv, which contains the total amount of products sold for each month of each year:
CREATE MATERIALIZED VIEW sum_sales_per_year_month_mv BUILD IMMEDIATE REFRESH FORCE ON DEMAND ENABLE QUERY REWRITE AS SELECT s.calendar_year, s.calendar_month_number, SUM(s.amount_sold) AS sum_sales, COUNT(*) AS cnt FROM sales_par_range_list s WHERE s.calendar_year > 1990 GROUP BY s.calendar_year, s.calendar_month_number;
sum_sales_per_year_month_mv supports PCT against sales_par_range_list at the range partitioning level because the range partitioning key, calendar_month_number, is in its SELECT and GROUP BY lists:
INSERT INTO sales_par_range_list VALUES (2001, 3, 25, 'Spain', 20, 'PROD20', 300, 20.50);
This statement inserts a row with calendar_month_number = 3 and country_name = 'Spain'. This row is inserted into partition q1, subpartition Europe. After this INSERT statement, sum_sales_per_year_month_mv is stale with respect to partition q1 of sales_par_range_list. So any incoming query that accesses data from this partition in sales_par_range_list cannot be rewritten, for example, the following statement:
SELECT s.calendar_year, SUM(s.amount_sold) AS sum_sales, COUNT(*) AS cnt FROM sales_par_range_list s WHERE s.calendar_year = 2000 AND s.calendar_month_number BETWEEN 5 AND 9 GROUP BY s.calendar_year;
An example of a statement that does rewrite after the INSERT statement is the following, because it accesses fresh material:
SELECT s.calendar_year, SUM(s.amount_sold) AS sum_sales, COUNT(*) AS cnt FROM sales_par_range_list s WHERE s.calendar_year = 2000 AND s.calendar_month_number BETWEEN 2 AND 6 GROUP BY s.calendar_year;
Figure 18-4 offers a graphical illustration of what is stale and what is fresh.
[Figure 18-4 shows table sales_par_range_list with range partitions q1 through q4, each list-subpartitioned into America, Asia, and Europe; partition q1 is the updated (stale) partition.]
Let us create the following materialized view on tables, sales_par_list and product_par_list:
CREATE MATERIALIZED VIEW sum_sales_per_category_mv BUILD IMMEDIATE REFRESH FORCE ON DEMAND ENABLE QUERY REWRITE AS SELECT p.rowid prid, p.prod_category, SUM (s.amount_sold) sum_sales, COUNT(*) cnt FROM sales_par_list s, product_par_list p WHERE s.prod_id = p.prod_id and s.calendar_year <= 2000 GROUP BY p.rowid, p.prod_category;
All the limitations that apply to pmarker rewrite apply here as well. The incoming query must access a whole partition for the query to be rewritten. The following pmarker table is used in this case:
product_par_list   pmarker value
----------------   -------------
prod_cat1          1000
prod_cat2          1001
prod_cat3          1002
So sum_sales_per_category_mv is stale with respect to partition prod_cat1 of product_par_list. Now consider the following query:
SELECT p.prod_category, SUM(s.amount_sold) AS sum_sales, COUNT(*) AS cnt FROM sales_par_list s, product_par_list p WHERE s.prod_id = p.prod_id AND p.prod_category IN ('Girls', 'Women') AND s.calendar_year <= 2000 GROUP BY p.prod_category;
This query can be rewritten in terms of sum_sales_per_category_mv because all the data corresponding to a detail table partition is accessed, and the materialized view is FRESH with respect to this data. This query accesses all data in partition prod_cat2, which has not been updated. Following is the rewritten query in terms of sum_sales_per_category_mv:
SELECT prod_category, sum_sales, cnt FROM sum_sales_per_category_mv WHERE dbms_mview.pmarker(srid) IN (1000) GROUP BY prod_category;
Consider the following query, which has a user bind variable, :user_id, in its WHERE clause:
SELECT CUST_ID, PROD_ID, SUM(AMOUNT_SOLD) AS SUM_AMOUNT FROM SALES WHERE CUST_ID > :user_id GROUP BY CUST_ID, PROD_ID;
Because the materialized view, customer_mv, has a selection in its WHERE clause, query rewrite is dependent on the actual value of the user bind variable, user_id, to compute the containment. Because user_id is not available at query rewrite time, this query cannot be rewritten. The same restriction applies wherever a user bind variable appears in a query: if query rewrite depends on its value, the query cannot be rewritten. Now consider the following query, which has a user bind variable, :user_id, in its SELECT list:
SELECT CUST_ID + :user_id, PROD_ID, SUM(AMOUNT_SOLD) AS TOTAL_AMOUNT FROM SALES WHERE CUST_ID >= 2000 GROUP BY CUST_ID, PROD_ID;
Because the value of the user bind variable, user_id, is not required during query rewrite time, the preceding query will rewrite.
If you have the following query, which displays the postal codes for male customers from San Francisco or Los Angeles:
SELECT c.cust_city, c.cust_postal_code
FROM customers c
WHERE c.cust_city = 'Los Angeles' AND c.cust_gender = 'M'
UNION ALL
SELECT c.cust_city, c.cust_postal_code
FROM customers c
WHERE c.cust_city = 'San Francisco' AND c.cust_gender = 'M';
The rewritten query has dropped the UNION ALL and replaced it with the materialized view. Normally, query rewrite has to use the existing set of general eligibility rules to determine if the SELECT subselections under the UNION ALL are equivalent in the query and the materialized view. If, for example, you have a query that retrieves the postal codes for male customers from San Francisco, Palmdale, or Los Angeles, the same rewrite can occur as in the previous example but query rewrite must keep the UNION ALL with the base tables, as in the following:
SELECT c.cust_city, c.cust_postal_code
FROM customers c
WHERE c.cust_city = 'Palmdale' AND c.cust_gender = 'M'
UNION ALL
SELECT c.cust_city, c.cust_postal_code
FROM customers c
WHERE c.cust_city = 'Los Angeles' AND c.cust_gender = 'M'
UNION ALL
SELECT c.cust_city, c.cust_postal_code
FROM customers c
WHERE c.cust_city = 'San Francisco' AND c.cust_gender = 'M';
So query rewrite will detect the case where a subset of the UNION ALL can be rewritten using the materialized view cust_male_postal_mv. UNION, UNION ALL, and INTERSECT are commutative, so query rewrite can rewrite regardless of the order in which the subselects appear in the query or the materialized view. However, MINUS is not commutative: A MINUS B is not equivalent to B MINUS A. Therefore, the subselects under the MINUS operator must appear in the same order in the query and the materialized view for rewrite to happen. As an example, consider the case where there exists an old version of the customer table called customers_old and you want to find the difference between the old one and the current customer table, but only for male customers who live in Los Angeles. That is, you want to find those customers in the current table that were not in the old one. The following example shows how this is done using MINUS:
SELECT c.cust_city, c.cust_postal_code FROM customers c WHERE c.cust_city= 'Los Angeles' AND c.cust_gender = 'M' MINUS SELECT c.cust_city, c.cust_postal_code FROM customers_old c WHERE c.cust_city = 'Los Angeles' AND c.cust_gender = 'M';
Switching the subselects would yield a different answer. This illustrates that MINUS is not commutative.
The WHERE clause of the first subselect includes mv.marker = 2 and mv.cust_gender = 'M', which selects only the rows that represent male customers in the second subselect of the UNION ALL. The WHERE clause of the second subselect includes mv.marker = 1 and mv.cust_gender = 'F', which selects only those rows that represent female customers in the first subselect of the UNION ALL. Note that query rewrite cannot take advantage of set operators that drop duplicate or distinct rows. For example, UNION drops duplicates, so query rewrite cannot tell what rows have been dropped. The rules for using a marker are that it must:
■  Be a constant number or string and be the same datatype for all UNION ALL subselects.
■  Yield a constant, distinct value for each UNION ALL subselect. You cannot reuse the same value in more than one subselect.
■  Be in the same ordinal position for all subselects.
Explain Plan
The EXPLAIN PLAN facility is used as described in Oracle Database SQL Reference. For query rewrite, all you need to check is that the object_name column in PLAN_TABLE contains the materialized view name. If it does, then query rewrite has occurred when this query is executed. An example is the following, which creates the materialized view cal_month_sales_mv:
CREATE MATERIALIZED VIEW cal_month_sales_mv ENABLE QUERY REWRITE AS SELECT t.calendar_month_desc, SUM(s.amount_sold) AS dollars FROM sales s, times t WHERE s.time_id = t.time_id GROUP BY t.calendar_month_desc;
If EXPLAIN PLAN is used on the following SQL statement, the results are placed in the default table PLAN_TABLE. However, PLAN_TABLE must first be created using the utlxplan.sql script. Note that EXPLAIN PLAN does not actually execute the query.
EXPLAIN PLAN FOR SELECT t.calendar_month_desc, SUM(s.amount_sold)
For the purposes of query rewrite, the only information of interest from PLAN_TABLE is the OBJECT_NAME, which identifies the objects that will be used to execute this query. Therefore, you would expect to see the object name cal_month_sales_mv in the output, as illustrated in the following:
SELECT OPERATION, OBJECT_NAME FROM PLAN_TABLE;

OPERATION                      OBJECT_NAME
-----------------------------  ------------------
SELECT STATEMENT
MAT_VIEW REWRITE ACCESS        CAL_MONTH_SALES_MV
DBMS_MVIEW.EXPLAIN_REWRITE Procedure
It can be difficult to understand why a query did not rewrite. The rules governing query rewrite eligibility are quite complex, involving various factors such as constraints, dimensions, query rewrite integrity modes, freshness of the materialized views, and the types of queries themselves. In addition, you may want to know why query rewrite chose a particular materialized view instead of another. To help with this matter, Oracle provides the DBMS_MVIEW.EXPLAIN_REWRITE procedure to advise you when a query can be rewritten and, if not, why not. Using the results from DBMS_MVIEW.EXPLAIN_REWRITE, you can take the appropriate action needed to make a query rewrite if at all possible. Note that the query specified in the EXPLAIN_REWRITE statement is never actually executed.
DBMS_MVIEW.EXPLAIN_REWRITE Syntax
You can obtain the output from DBMS_MVIEW.EXPLAIN_REWRITE in two ways. The first is to use a table, while the second is to create a varray. The following shows the basic syntax for using an output table:
DBMS_MVIEW.EXPLAIN_REWRITE ( query VARCHAR2, mv VARCHAR2(30), statement_id VARCHAR2(30));
You can create an output table called REWRITE_TABLE by executing the utlxrw.sql script.
The query parameter is a text string representing the SQL query. The parameter, mv, is a fully qualified materialized view name in the form schema.mv. This is an optional parameter. When it is not specified, EXPLAIN_REWRITE returns any relevant messages regarding all the materialized views considered for rewriting the given query. When schema is omitted and only mv is specified, EXPLAIN_REWRITE looks for the materialized view in the current schema. The syntax for calling the EXPLAIN_REWRITE procedure with an output table is as follows:
DBMS_MVIEW.EXPLAIN_REWRITE ( query [VARCHAR2 | CLOB], mv VARCHAR2(30), statement_id VARCHAR2(30));
If you want to direct the output of EXPLAIN_REWRITE to a varray instead of a table, you should call the procedure as follows:
DBMS_MVIEW.EXPLAIN_REWRITE ( query [VARCHAR2 | CLOB], mv VARCHAR2(30), output_array SYS.RewriteArrayType);
Note that if the query is less than 256 characters long, EXPLAIN_REWRITE can be easily invoked with the EXECUTE command from SQL*Plus. Otherwise, the recommended method is to use a PL/SQL BEGIN ... END block, as shown in the examples in /rdbms/demo/smxrw*.
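For a short query, a SQL*Plus invocation might look like the following sketch (the materialized view name and statement identifier are illustrative):

EXECUTE DBMS_MVIEW.EXPLAIN_REWRITE( -
  'SELECT t.calendar_month_desc, SUM(s.amount_sold) FROM sales s, times t WHERE s.time_id = t.time_id GROUP BY t.calendar_month_desc', -
  'CAL_MONTH_SALES_MV', 'ID1');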
Using REWRITE_TABLE
Output of EXPLAIN_REWRITE can be directed to a table named REWRITE_TABLE. You can create this output table by running the utlxrw.sql script. This script can be found in the admin directory. The format of REWRITE_TABLE is as follows.
CREATE TABLE REWRITE_TABLE(
  statement_id    VARCHAR2(30),   -- ID for the query
  mv_owner        VARCHAR2(30),   -- MV's schema
  mv_name         VARCHAR2(30),   -- Name of the MV
  sequence        INTEGER,        -- Seq # of error msg
  query           VARCHAR2(2000), -- user query
  message         VARCHAR2(512),  -- EXPLAIN_REWRITE error msg
  pass            VARCHAR2(3),    -- Query Rewrite pass no
  mv_in_msg       VARCHAR2(30),   -- MV in current message
  measure_in_msg  VARCHAR2(30),   -- Measure in current message
  join_back_tbl   VARCHAR2(30),   -- Join back table in current msg
  join_back_col   VARCHAR2(30),   -- Join back column in current msg
  original_cost   NUMBER(10),     -- Cost of original query
  rewritten_cost  NUMBER(10),     -- Cost of rewritten query
  flags           NUMBER,         -- Associated flags
  reserved1       NUMBER,         -- For future use
  reserved2       VARCHAR2(10));  -- For future use
The following is another example where you can see a more detailed explanation of why some materialized views were not considered and eventually the materialized view sales_mv was chosen as the best one.
DECLARE qrytext VARCHAR2(500) :='SELECT cust_first_name, cust_last_name, SUM(amount_sold) AS dollar_sales FROM sales s, customers c WHERE s.cust_id= c.cust_id GROUP BY cust_first_name, cust_last_name'; idno VARCHAR2(30) :='ID1'; BEGIN DBMS_MVIEW.EXPLAIN_REWRITE(qrytext, '', idno); END; / SELECT message FROM rewrite_table ORDER BY sequence;
Joining materialized view, CAL_MONTH_SALES_MV, with table, SALES, not possible
a more optimal materialized view than PRODUCT_SALES_MV was used to rewrite
a more optimal materialized view than FWEEK_PSCAT_SALES_MV was used to rewrite
query rewritten with materialized view, SALES_MV
Using a Varray
You can save the output of EXPLAIN_REWRITE in a PL/SQL varray. The elements of this array are of the type RewriteMessage, which is predefined in the SYS schema as shown in the following:
TYPE RewriteMessage IS RECORD(
  mv_owner        VARCHAR2(30),   -- MV's schema
  mv_name         VARCHAR2(30),   -- Name of the MV
  sequence        INTEGER,        -- Seq # of error msg
  query_text      VARCHAR2(2000), -- user query
  message         VARCHAR2(512),  -- EXPLAIN_REWRITE error msg
  pass            VARCHAR2(3),    -- Query Rewrite pass no
  mv_in_msg       VARCHAR2(30),   -- MV in current message
  measure_in_msg  VARCHAR2(30),   -- Measure in current message
  join_back_tbl   VARCHAR2(30),   -- Join back table in current msg
  join_back_col   VARCHAR2(30),   -- Join back column in current msg
  original_cost   NUMBER(10),     -- Cost of original query
  rewritten_cost  NUMBER(10),     -- Cost of rewritten query
  flags           NUMBER,         -- Associated flags
  reserved1       NUMBER,         -- For future use
  reserved2       VARCHAR2(10)    -- For future use
);
The array type, RewriteArrayType, which is a varray of RewriteMessage objects, is predened in the SYS schema as follows:
TYPE RewriteArrayType AS VARRAY(256) OF RewriteMessage;

Using this array type, you can declare an array variable and specify it in the EXPLAIN_REWRITE statement. Each RewriteMessage record provides a message concerning rewrite processing. The parameters are the same as for REWRITE_TABLE, except for statement_id, which is not used when a varray is the output. The mv_owner field defines the owner of the materialized view that is relevant to the message.
The mv_name field defines the name of a materialized view that is relevant to the message. The sequence field defines the order in which messages should be sorted. The query_text field contains the first 2000 characters of the query text under analysis. The message field contains the text of the message relevant to rewrite processing of the query. The flags, reserved1, and reserved2 fields are reserved for future use.
Example 18-12 EXPLAIN_REWRITE Using a VARRAY
The query will not rewrite with this materialized view. This can be quite confusing to a novice user because it seems as if all information required for rewrite is present in the materialized view. You can find out from DBMS_MVIEW.EXPLAIN_REWRITE that AVG cannot be computed from the given materialized view. The problem is that a ROLLUP is required here, and AVG requires a COUNT or a SUM to do a ROLLUP. An example PL/SQL block for the previous query, using a varray as its output, is as follows:
SET SERVEROUTPUT ON DECLARE Rewrite_Array SYS.RewriteArrayType := SYS.RewriteArrayType(); querytxt VARCHAR2(1500) := 'SELECT c.cust_state_province, AVG(s.amount_sold) FROM sales s, customers c WHERE s.cust_id = c.cust_id GROUP BY c.cust_state_province'; i NUMBER;
BEGIN DBMS_MVIEW.EXPLAIN_REWRITE(querytxt, 'AVG_SALES_CITY_STATE_MV', Rewrite_Array); FOR i IN 1..Rewrite_Array.count LOOP DBMS_OUTPUT.PUT_LINE(Rewrite_Array(i).message); END LOOP; END; /
The second argument, mv, and the third argument, statement_id, can be NULL. Similarly, the syntax for using EXPLAIN_REWRITE using CLOB to obtain the output into a varray is shown as follows:
DBMS_MVIEW.EXPLAIN_REWRITE( query IN CLOB, mv IN VARCHAR2, msg_array IN OUT SYS.RewriteArrayType);
As before, the second argument, mv, can be NULL. Note that long query texts in CLOB can be generated using the procedures provided in the DBMS_LOB package.
■  Query Rewrite Considerations: Constraints
■  Query Rewrite Considerations: Dimensions
■  Query Rewrite Considerations: Outer Joins
■  Query Rewrite Considerations: Text Match
■  Query Rewrite Considerations: Aggregates
■  Query Rewrite Considerations: Grouping Conditions
■  Query Rewrite Considerations: Expression Matching
■  Query Rewrite Considerations: Date Folding
■  Query Rewrite Considerations: Statistics
You should avoid using the ON DELETE clause as it can lead to unexpected results.
cost-based choice. Materialized views should thus have statistics collected using the DBMS_STATS package.
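For example, statistics on a materialized view can be gathered with a call such as the following (the schema and object names are illustrative):

EXECUTE DBMS_STATS.GATHER_TABLE_STATS( -
  ownname => 'SH', tabname => 'CAL_MONTH_SALES_MV', -
  estimate_percent => DBMS_STATS.AUTO_SAMPLE_SIZE);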
Example 18-13

There are two base tables, sales_fact and geog_dim. You can compute the total sales for each city, state, and region with a rollup by issuing the following statement:
SELECT g.region, g.state, g.city, GROUPING_ID(g.city, g.state, g.region), SUM(sales) FROM sales_fact f, geog_dim g WHERE f.geog_key = g.geog_key GROUP BY ROLLUP(g.region, g.state, g.city);
OLAP Server would want to materialize this query for quick results. Unfortunately, the resulting materialized view occupies too much disk space. However, if you have a dimension rolling up city to state to region, you can easily compress the three grouping columns into one column using a decode statement. (This is also known as an embedded total):
DECODE (gid, 0, city, 1, state, 3, region, 7, "grand_total")
This expression uses the lowest level of the hierarchy that is present to represent the entire grouping. For example, saying Boston means Boston, MA, New England Region, and saying CA means CA, Western Region. OLAP Server stores these embedded total results in a table, say, embedded_total_sales. However, when returning the result to the user, you want all the data columns (city, state, region). In order to return the results efficiently and quickly, OLAP Server may use a custom table function (et_function) to retrieve the data from the embedded_total_sales table in expanded form, as follows:
SELECT * FROM TABLE (et_function);
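For illustration only, the embedded-total result stored in embedded_total_sales might be computed with a query along the following lines; the GROUPING_ID argument order below is chosen so that its values match the 0, 1, 3, 7 values used in the DECODE expression above:

SELECT DECODE(GROUPING_ID(g.region, g.state, g.city),
              0, g.city, 1, g.state, 3, g.region, 7, 'grand_total') AS embedded_total,
       GROUPING_ID(g.region, g.state, g.city) AS gid,   -- 0 = city, 1 = state, 3 = region, 7 = grand total
       SUM(sales) AS sales
FROM   sales_fact f, geog_dim g
WHERE  f.geog_key = g.geog_key
GROUP BY ROLLUP(g.region, g.state, g.city);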
In other words, this feature allows OLAP Server to declare the equivalence of the user's preceding query to the alternative query OLAP Server uses to compute it, as in the following:
DBMS_ADVANCED_REWRITE.DECLARE_REWRITE_EQUIVALENCE ( 'OLAPI_EMBEDDED_TOTAL', 'SELECT g.region, g.state, g.city, GROUPING_ID(g.city, g.state, g.region), SUM(sales) FROM sales_fact f, geog_dim g WHERE f.geog_key = g.geog_key GROUP BY ROLLUP(g.region, g.state, g.city)', 'SELECT * FROM TABLE(et_function)');
This invocation of DECLARE_REWRITE_EQUIVALENCE creates an equivalence declaration named OLAPI_EMBEDDED_TOTAL stating that the specified SOURCE_STMT and the specified DESTINATION_STMT are functionally equivalent, and that the specified DESTINATION_STMT is preferable for performance. After the DBA creates such a declaration, the user need have no knowledge of the space optimization being performed underneath the covers. This capability also allows OLAPI to perform specialized partial materializations of a SQL query. For instance, it could perform a rollup using a UNION ALL of three relations as shown in Example 18-14.
Example 18-14 Rewrite Using Equivalence (UNION ALL)
CREATE MATERIALIZED VIEW T1 AS
SELECT g.region, g.state, g.city, 0 AS gid, SUM(sales) sales
FROM sales_fact f, geog_dim g
WHERE f.geog_key = g.geog_key
GROUP BY g.region, g.state, g.city;

CREATE MATERIALIZED VIEW T2 AS
SELECT region, state, SUM(sales) sales FROM T1 GROUP BY region, state;

CREATE VIEW T3 AS
SELECT region, SUM(sales) sales FROM T2 GROUP BY region;
By specifying this equivalence, Oracle would use the more efficient second form of the query to compute the ROLLUP query asked by the user.
DBMS_ADVANCED_REWRITE.DECLARE_REWRITE_EQUIVALENCE (
  'OLAPI_ROLLUP',
  'SELECT g.region, g.state, g.city, GROUPING_ID(g.city, g.state, g.region), SUM(sales)
   FROM sales_fact f, geog_dim g
   WHERE f.geog_key = g.geog_key
   GROUP BY ROLLUP(g.region, g.state, g.city)',
  'SELECT * FROM T1
   UNION ALL
   SELECT region, state, NULL, 1 AS gid, sales FROM T2
   UNION ALL
   SELECT region, NULL, NULL, 3 AS gid, sales FROM T3
   UNION ALL
   SELECT NULL, NULL, NULL, 7 AS gid, SUM(sales) FROM T3');
Another application of this feature is to provide users special aggregate computations that may be conceptually simple but extremely complex to express in SQL. In this case, OLAP Server asks the user to use a specified custom aggregate function and internally computes it using complex SQL.
Example 18-15 Rewrite Using Equivalence (Using a Custom Aggregate)
Suppose the application users want to see the sales for each city, state, and region and also additional sales information for specific seasons. For example, the New England user wants additional sales information for cities in New England for the winter months. OLAP Server provides a special aggregate, Seasonal_Agg, that computes this aggregate. You issue a classic summary query but use Seasonal_Agg(sales, region) rather than SUM(sales).
SELECT g.region, t.monthname, Seasonal_Agg(sales, region) AS sales
FROM sales_fact f, geog_dim g, time t
WHERE f.geog_key = g.geog_key AND f.time_key = t.time_key
GROUP BY g.region, t.monthname;
Instead of asking the user to write SQL that does the extra computation, OLAP Server does it for them by using this feature. In this example, Seasonal_Agg is computed using the spreadsheet functionality (see Chapter 22, "SQL for Modeling"). Note that even though Seasonal_Agg is a user-defined aggregate, the required behavior is to add extra rows to the query's answer, which cannot be easily done with simple PL/SQL functions.
DBMS_ADVANCED_REWRITE.DECLARE_REWRITE_EQUIVALENCE (
  'OLAPI_SEASONAL_AGG',
  'SELECT g.region, t.monthname, Seasonal_Agg(sales, region) AS sales
   FROM sales_fact f, geog_dim g, time t
   WHERE f.geog_key = g.geog_key AND f.time_key = t.time_key
   GROUP BY g.region, t.monthname',
  'SELECT g.region, t.monthname, SUM(sales) AS sales
   FROM sales_fact f, geog_dim g
   WHERE f.geog_key = g.geog_key AND t.time_key = f.time_key
   GROUP BY g.region, g.state, g.city, t.monthname
   DIMENSION BY g.region, t.monthname
   (sales ['New England', 'Winter'] = AVG(sales) OVER monthname IN ('Dec', 'Jan', 'Feb', 'Mar'),
    sales ['Western', 'Summer'] = AVG(sales) OVER monthname IN ('May', 'Jun', 'July', 'Aug'), ...)');
19
Schema Modeling Techniques
The following topics provide information about schemas in a data warehouse:
■  Schemas in Data Warehouses
■  Third Normal Form
■  Star Schemas
■  Optimizing Star Queries
■  Provide a neutral schema design, independent of any application or data-usage considerations
■  May require less data transformation than more denormalized schemas such as star schemas
[Figure: a third normal form schema containing the tables customers, orders, order items, and products.]
Star Schemas
The star schema is perhaps the simplest data warehouse schema. It is called a star schema because the entity-relationship diagram of this schema resembles a star, with points radiating from a central table. The center of the star consists of a large fact table and the points of the star are the dimension tables.
A star query is a join between a fact table and a number of dimension tables. Each dimension table is joined to the fact table using a primary key to foreign key join, but the dimension tables are not joined to each other. The optimizer recognizes star queries and generates efficient execution plans for them. A typical fact table contains keys and measures. For example, in the sh sample schema, the fact table, sales, contains the measures quantity_sold, amount, and cost, and the keys cust_id, time_id, prod_id, channel_id, and promo_id. The dimension tables are customers, times, products, channels, and promotions. The products dimension table, for example, contains information about each product number that appears in the fact table. A star join is a primary key to foreign key join of the dimension tables to a fact table. The main advantages of star schemas are that they:
■  Provide a direct and intuitive mapping between the business entities being analyzed by end users and the schema design.
■  Provide highly optimized performance for typical star queries.
■  Are widely supported by a large number of business intelligence tools, which may anticipate or even require that the data warehouse schema contain dimension tables.
Star schemas are used for both simple data marts and very large data warehouses. Figure 19-2 presents a graphical representation of a star schema.

Figure 19-2 Star Schema
[Figure 19-2 shows a star schema: the sales fact table (amount_sold, quantity_sold) at the center, surrounded by the dimension tables products, times, customers, and channels.]
Snowflake Schemas
The snowflake schema is a more complex data warehouse model than a star schema, and is a type of star schema. It is called a snowflake schema because the diagram of the schema resembles a snowflake. Snowflake schemas normalize dimensions to eliminate redundancy. That is, the dimension data has been grouped into multiple tables instead of one large table. For example, a product dimension table in a star schema might be normalized into a products table, a product_category table, and a product_manufacturer table in a snowflake schema. While this saves space, it increases the number of dimension tables and requires more foreign key joins. The result is more complex queries and reduced query performance. Figure 19-3 presents a graphical representation of a snowflake schema.
Figure 19-3 Snowflake Schema
[Figure 19-3 shows a snowflake schema in which the sales fact table joins to the dimension tables products, times, customers, and channels, with products further joined to suppliers and customers further joined to countries.]
It is generally suggested that you choose a star schema over a snowflake schema unless you have a clear reason not to.
To get the best possible performance for star queries, it is important to follow these basic guidelines:

■  A bitmap index should be built on each of the foreign key columns of the fact table or tables.
■  The initialization parameter STAR_TRANSFORMATION_ENABLED should be set to TRUE. This enables an important optimizer feature for star queries. It is set to FALSE by default for backward compatibility.
When a data warehouse satises these conditions, the majority of the star queries running in the data warehouse will use a query execution strategy known as the star transformation. The star transformation provides very efcient query performance for star queries.
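As a sketch of these prerequisites in practice (the index and column names below are illustrative, following the sh sample schema):

ALTER SESSION SET star_transformation_enabled = TRUE;

CREATE BITMAP INDEX sales_prod_bix
  ON sales (prod_id)
  LOCAL NOLOGGING COMPUTE STATISTICS;   -- LOCAL because sales is partitioned

A bitmap index would be created in the same way on each remaining foreign key column of the fact table (time_id, cust_id, channel_id, and promo_id).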
These features are available only with Oracle Database Enterprise Edition. In Oracle Database Standard Edition, bitmap indexes and star transformation are not available.
For example, the sales table of the sh sample schema has bitmap indexes on the time_id, channel_id, cust_id, prod_id, and promo_id columns. Consider the following star query:
SELECT ch.channel_class, c.cust_city, t.calendar_quarter_desc, SUM(s.amount_sold) sales_amount FROM sales s, times t, customers c, channels ch WHERE s.time_id = t.time_id AND s.cust_id = c.cust_id AND s.channel_id = ch.channel_id AND c.cust_state_province = 'CA' AND ch.channel_desc in ('Internet','Catalog') AND t.calendar_quarter_desc IN ('1999-Q1','1999-Q2') GROUP BY ch.channel_class, c.cust_city, t.calendar_quarter_desc;
This query is processed in two phases. In the first phase, Oracle Database uses the bitmap indexes on the foreign key columns of the fact table to identify and retrieve only the necessary rows from the fact table. That is, Oracle Database will retrieve the result set from the fact table using essentially the following query:
SELECT ... FROM sales WHERE time_id IN (SELECT time_id FROM times WHERE calendar_quarter_desc IN('1999-Q1','1999-Q2')) AND cust_id IN (SELECT cust_id FROM customers WHERE cust_state_province='CA') AND channel_id IN (SELECT channel_id FROM channels WHERE channel_desc IN('Internet','Catalog'));
This is the transformation step of the algorithm, because the original star query has been transformed into this subquery representation. This method of accessing the fact table leverages the strengths of bitmap indexes. Intuitively, bitmap indexes provide a set-based processing scheme within a relational database. Oracle has implemented very fast methods for doing set operations such as AND (an intersection in standard set-based terminology), OR (a set-based union), MINUS, and COUNT. In this star query, a bitmap index on time_id is used to identify the set of all rows in the fact table corresponding to sales in 1999-Q1. This set is represented as a bitmap (a string of 1's and 0's that indicates which rows of the fact table are members of the set).
A similar bitmap is retrieved for the fact table rows corresponding to the sales from 1999-Q2. The bitmap OR operation is used to combine this set of Q1 sales with the set of Q2 sales. Additional set operations will be done for the customer dimension and the product dimension. At this point in the star query processing, there are three bitmaps. Each bitmap corresponds to a separate dimension table, and each bitmap represents the set of rows of the fact table that satisfy that individual dimension's constraints.

These three bitmaps are combined into a single bitmap using the bitmap AND operation. This final bitmap represents the set of rows in the fact table that satisfy all of the constraints on the dimension tables. This is the result set, the exact set of rows from the fact table needed to evaluate the query. Note that none of the actual data in the fact table has been accessed. All of these operations rely solely on the bitmap indexes and the dimension tables. Because of the bitmap indexes' compressed data representations, the bitmap set-based operations are extremely efficient. Once the result set is identified, the bitmap is used to access the actual data from the sales table. Only those rows that are required for the end user's query are retrieved from the fact table. At this point, Oracle has effectively joined all of the dimension tables to the fact table using bitmap indexes. This technique provides excellent performance because Oracle is joining all of the dimension tables to the fact table with one logical join operation, rather than joining each dimension table to the fact table independently.

The second phase of this query is to join these rows from the fact table (the result set) to the dimension tables. Oracle will use the most efficient method for accessing and joining the dimension tables. Many dimension tables are very small, and table scans are typically the most efficient access method for these dimension tables. For large dimension tables, table scans may not be the most efficient access method. In the previous example, a bitmap index on product.department can be used to quickly identify all of those products in the grocery department. Oracle's optimizer automatically determines which access method is most appropriate for a given dimension table, based upon the optimizer's knowledge about the sizes and data distributions of each dimension table. The specific join method (as well as indexing method) for each dimension table will likewise be intelligently determined by the optimizer. A hash join is often the most efficient algorithm for joining the dimension tables. The final answer is returned to the user once all of the dimension tables have been joined. The query technique of retrieving only the matching rows from one table and then joining to another table is commonly known as a semijoin.
[Execution plan excerpt: the sales fact table is accessed through a bitmap AND of bitmaps merged from the bitmap indexes SALES_CUST_BIX (CUSTOMERS), SALES_CHANNEL_BIX (CHANNELS), and SALES_TIME_BIX (TIMES).]
In this plan, the fact table is accessed through a bitmap access path based on a bitmap AND, of three merged bitmaps. The three bitmaps are generated by the BITMAP MERGE row source being fed bitmaps from row source trees underneath it. Each such row source tree consists of a BITMAP KEY ITERATION row source which fetches values from the subquery row source tree, which in this example is a full table access. For each such value, the BITMAP KEY ITERATION row source retrieves the bitmap from the bitmap index. After the relevant fact table rows have been retrieved using this access path, they are joined with the dimension tables and temporary tables to produce the answer to the query.
The processing of the same star query using the bitmap join index is similar to the previous example. The only difference is that Oracle will utilize the join index, instead of a single-table bitmap index, to access the customer data in the first phase of the star query.
[Execution plan excerpt: the same plan as before, except that the customer dimension is accessed through the bitmap join index SALES_C_STATE_BJIX, with SALES_CHANNEL_BIX (CHANNELS) and SALES_TIME_BIX (TIMES) unchanged.]
The difference between this plan as compared to the previous one is that the inner part of the bitmap index scan for the customer dimension has no subselect. This is because the join predicate information on customer.cust_state_province can be satised with the bitmap join index sales_c_state_bjix.
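For reference, a bitmap join index such as sales_c_state_bjix could be created along the following lines (a sketch, assuming the sh sample schema):

CREATE BITMAP INDEX sales_c_state_bjix
  ON sales (customers.cust_state_province)   -- indexed column comes from the joined dimension
  FROM sales, customers
  WHERE sales.cust_id = customers.cust_id
  LOCAL NOLOGGING COMPUTE STATISTICS;

Because the index already encodes the join to customers, the predicate on cust_state_province can be evaluated directly against the index without a subquery.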
Star transformation is not supported in the following cases:

■  Queries with a table hint that is incompatible with a bitmap access path
■  Queries that contain bind variables
■  Tables with too few bitmap indexes. There must be a bitmap index on a fact table column for the optimizer to generate a subquery for it.
■  Remote fact tables. However, remote dimension tables are allowed in the subqueries that are generated.
■  Anti-joined tables
■  Tables that are already used as a dimension table in a subquery
■  Tables that are really unmerged views, which are not view partitions
The star transformation may not be chosen by the optimizer for the following cases:

■  Tables that have a good single-table access path
■  Tables that are too small for the transformation to be worthwhile
In addition, temporary tables will not be used by star transformation under the following conditions:

■  The database is in read-only mode
■  The star query is part of a transaction that is in serializable mode
20
SQL for Aggregation in Data Warehouses
This chapter discusses aggregation in SQL, a basic aspect of data warehousing. It contains these topics:
■  Overview of SQL for Aggregation in Data Warehouses
■  ROLLUP Extension to GROUP BY
■  CUBE Extension to GROUP BY
■  GROUPING Functions
■  GROUPING SETS Expression
■  Composite Columns
■  Concatenated Groupings
■  Considerations when Using Aggregation
■  Computation Using the WITH Clause
■  Working with Hierarchical Cubes in SQL
Oracle provides the following extensions to the GROUP BY clause:

■  CUBE and ROLLUP extensions to the GROUP BY clause
■  Three GROUPING functions
■  GROUPING SETS expression
The CUBE, ROLLUP, and GROUPING SETS extensions to SQL make querying and reporting easier and faster. CUBE, ROLLUP, and GROUPING SETS produce a single result set that is equivalent to a UNION ALL of differently grouped rows. ROLLUP calculates aggregations such as SUM, COUNT, MAX, MIN, and AVG at increasing levels of aggregation, from the most detailed up to a grand total. CUBE is an extension similar to ROLLUP, enabling a single statement to calculate all possible combinations of aggregations. The GROUPING SETS extension lets you specify just the groupings needed in the GROUP BY clause. This allows efficient analysis across multiple dimensions without performing a CUBE operation. Computing a CUBE creates a heavy processing load, so replacing cubes with grouping sets can significantly increase performance. To enhance performance, CUBE, ROLLUP, and GROUPING SETS can be parallelized: multiple processes can simultaneously execute all of these statements. These capabilities make aggregate calculations more efficient, thereby enhancing database performance and scalability. The three GROUPING functions help you identify the group each row belongs to and enable sorting subtotal rows and filtering results.
■  Show total sales across all products at increasing aggregation levels for a geography dimension, from state to country to region, for 1999 and 2000.
■  Create a cross-tabular analysis of our operations showing expenses by territory in South America for 1999 and 2000. Include all possible subtotals.
■  List the top 10 sales representatives in Asia according to 2000 sales revenue for automotive products, and rank their commissions.
All these requests involve multiple dimensions. Many multidimensional questions require aggregated data and comparisons of data sets, often across time, geography, or budgets. To visualize data that has many dimensions, analysts commonly use the analogy of a data cube, that is, a space where facts are stored at the intersection of n dimensions. Figure 20-1 shows a data cube and how it can be used differently by various groups. The cube stores sales data organized by the dimensions of product, market, sales, and time. Note that this is only a metaphor: the actual data is physically stored in normal tables. The cube data consists of both detail and aggregated data.
Figure 20-1 Logical Cubes and Views by Different Users
[Figure 20-1 depicts a sales data cube organized by the Product, Market, and Time dimensions, along with an ad hoc view of a subset of the cube.]
You can retrieve slices of data from the cube. These correspond to cross-tabular reports such as the one shown in Table 20-1. Regional managers might study the data by comparing slices of the cube applicable to different markets. In contrast, product managers might compare slices that apply to different products. An ad hoc user might work with a wide variety of constraints, working in a subset cube. Answering multidimensional questions often involves accessing and querying huge quantities of data, sometimes in millions of rows. Because the flood of detailed data generated by large organizations cannot be interpreted at the lowest level, aggregated views of the information are essential. Aggregations, such as sums and counts, across many dimensions are vital to multidimensional analyses. Therefore, analytical tasks require convenient and efficient data aggregation.
Optimized Performance
Not only multidimensional issues, but all types of processing can benefit from enhanced aggregation facilities. Transaction processing, financial, and manufacturing systems all generate large numbers of production reports needing substantial system resources. Improved efficiency when creating these reports will reduce system load. In fact, any computer process that aggregates data from details to higher levels will benefit from optimized aggregation performance. These extensions provide aggregation features and bring many benefits, including:
■  Simplified programming requiring less SQL code for many tasks.
■  Quicker and more efficient query processing.
■  Reduced client processing loads and network traffic because aggregation work is shifted to servers.
■  Opportunities for caching aggregations because similar queries can leverage existing work.
An Aggregate Scenario
To illustrate the use of the GROUP BY extensions, this chapter uses the sh data of the sample schema. All the examples refer to data from this scenario. The hypothetical company has sales across the world and tracks sales by both dollars and quantities. Because there are many rows of data, the queries shown here typically have tight constraints on their WHERE clauses to limit the results to a small number of rows.
Example 20-1

Table 20-1 is a sample cross-tabular report showing the total sales by country_id and channel_desc for the US and France through the Internet and direct sales in September 2000.
Table 20-1  Simple Cross-Tabular Report With Subtotals

Channel         France      US         Total
------------    -------     -------    -------
Internet          9,597     124,224    133,821
Direct Sales     61,202     638,201    699,403
Total            70,799     762,425    833,224
Consider that even a simple report such as this, with just nine values in its grid, generates four subtotals and a grand total. Half of the values needed for this report would not be calculated with a query that requested SUM(amount_sold) and did a GROUP BY(channel_desc, country_id). To get the higher-level aggregates would require additional queries. Database commands that offer improved calculation of subtotals bring major benets to querying, reporting, and analytical operations.
SELECT channels.channel_desc, countries.country_iso_code, TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$ FROM sales, customers, times, channels, countries WHERE sales.time_id=times.time_id AND sales.cust_id=customers.cust_id AND sales.channel_id= channels.channel_id AND channels.channel_desc IN ('Direct Sales', 'Internet') AND times.calendar_month_desc='2000-09' AND customers.country_id=countries.country_id AND countries.country_iso_code IN ('US','FR') GROUP BY CUBE(channels.channel_desc, countries.country_iso_code); CHANNEL_DESC CO SALES$ -------------------- -- -------------833,224 FR 70,799 US 762,425 Internet 133,821 Internet FR 9,597 Internet US 124,224 Direct Sales 699,403 Direct Sales FR 61,202 Direct Sales US 638,201
It is very helpful for subtotaling along a hierarchical dimension such as time or geography. For instance, a query could specify a ROLLUP(y, m, day) or ROLLUP(country, state, city). For data warehouse administrators using summary tables, ROLLUP can simplify and speed up the maintenance of summary tables.
ROLLUP Syntax
ROLLUP appears in the GROUP BY clause in a SELECT statement. Its form is:

SELECT ...  GROUP BY ROLLUP(grouping_column_reference_list)
This example uses the data in the sh sample schema, the same data as was used in Figure 20-1. The ROLLUP is across three dimensions.
SELECT channels.channel_desc, calendar_month_desc, countries.country_iso_code,
       TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$
FROM sales, customers, times, channels, countries
WHERE sales.time_id = times.time_id AND sales.cust_id = customers.cust_id
  AND sales.channel_id = channels.channel_id
  AND channels.channel_desc IN ('Direct Sales', 'Internet')
  AND times.calendar_month_desc IN ('2000-09', '2000-10')
  AND countries.country_iso_code IN ('GB', 'US')
GROUP BY ROLLUP(channels.channel_desc, calendar_month_desc, countries.country_iso_code);

CHANNEL_DESC         CALENDAR CO         SALES$
-------------------- -------- -- --------------
Internet             2000-09  GB        228,241
Internet             2000-09  US        228,241
Internet             2000-09            456,482
Internet             2000-10  GB        239,236
Internet             2000-10  US        239,236
Internet             2000-10            478,473
Internet                                934,955
Direct Sales         2000-09  GB      1,217,808
Direct Sales         2000-09  US      1,217,808
Direct Sales         2000-09          2,435,616
Direct Sales         2000-10  GB      1,225,584
Direct Sales         2000-10  US      1,225,584
Direct Sales         2000-10          2,451,169
Direct Sales                          4,886,784
                                      5,821,739
Note that results do not always add up exactly, due to rounding. This query returns the following sets of rows:

■  Regular aggregation rows that would be produced by GROUP BY without using ROLLUP.
■  First-level subtotals aggregating across country_id for each combination of channel_desc and calendar_month_desc.
■  Second-level subtotals aggregating across calendar_month_desc and country_id for each channel_desc value.
■  A grand total row.
Partial Rollup
You can also roll up so that only some of the sub-totals will be included. This partial rollup uses the following syntax:
GROUP BY expr1, ROLLUP(expr2, expr3);
In this case, the GROUP BY clause creates subtotals at (2+1=3) aggregation levels. That is, at level (expr1, expr2, expr3), (expr1, expr2), and (expr1).
Example 20-3 Partial ROLLUP
SELECT channel_desc, calendar_month_desc, countries.country_iso_code,
       TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$
FROM sales, customers, times, channels, countries
WHERE sales.time_id = times.time_id AND sales.cust_id = customers.cust_id
  AND sales.channel_id = channels.channel_id
  AND channels.channel_desc IN ('Direct Sales', 'Internet')
  AND times.calendar_month_desc IN ('2000-09', '2000-10')
  AND countries.country_iso_code IN ('GB', 'US')
GROUP BY channel_desc, ROLLUP(calendar_month_desc, countries.country_iso_code);

CHANNEL_DESC         CALENDAR CO         SALES$
-------------------- -------- -- --------------
Internet             2000-09  GB        228,241
Internet             2000-09  US        228,241
Internet             2000-09            456,482
Internet             2000-10  GB        239,236
Internet             2000-10  US        239,236
Internet             2000-10            478,473
Internet                                934,955
Direct Sales         2000-09  GB      1,217,808
Direct Sales         2000-09  US      1,217,808
Direct Sales         2000-09          2,435,616
Direct Sales         2000-10  GB      1,225,584
Direct Sales         2000-10  US      1,225,584
Direct Sales         2000-10          2,451,169
Direct Sales                          4,886,784
This query returns the following sets of rows:

■  Regular aggregation rows that would be produced by GROUP BY without using ROLLUP.
■  First-level subtotals aggregating across country_id for each combination of channel_desc and calendar_month_desc.
■  Second-level subtotals aggregating across calendar_month_desc and country_id for each channel_desc value.
■  It does not produce a grand total row.
"Hierarchy Handling in ROLLUP and CUBE" on page 20-26 for an example of handling rollup calculations efciently.
CUBE Syntax
CUBE appears in the GROUP BY clause in a SELECT statement. Its form is:
SELECT ...  GROUP BY CUBE(grouping_column_reference_list)
Example 20-4
SELECT channel_desc, calendar_month_desc, countries.country_iso_code,
       TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$
FROM sales, customers, times, channels, countries
WHERE sales.time_id = times.time_id AND sales.cust_id = customers.cust_id
  AND sales.channel_id = channels.channel_id
  AND channels.channel_desc IN ('Direct Sales', 'Internet')
  AND times.calendar_month_desc IN ('2000-09', '2000-10')
  AND countries.country_iso_code IN ('GB', 'US')
GROUP BY CUBE(channel_desc, calendar_month_desc, countries.country_iso_code);

CHANNEL_DESC         CALENDAR CO         SALES$
-------------------- -------- -- --------------
                                      5,821,739
                              GB      2,910,870
                              US      2,910,870
                     2000-09          2,892,098
                     2000-09  GB      1,446,049
                     2000-09  US      1,446,049
                     2000-10          2,929,641
                     2000-10  GB      1,464,821
                     2000-10  US      1,464,821
Internet                                934,955
Internet                      GB        467,478
Internet                      US        467,478
Internet             2000-09            456,482
Internet             2000-09  GB        228,241
Internet             2000-09  US        228,241
Internet             2000-10            478,473
Internet             2000-10  GB        239,236
Internet             2000-10  US        239,236
Direct Sales                          4,886,784
Direct Sales                  GB      2,443,392
Direct Sales                  US      2,443,392
Direct Sales         2000-09          2,435,616
Direct Sales         2000-09  GB      1,217,808
Direct Sales         2000-09  US      1,217,808
Direct Sales         2000-10          2,451,169
Direct Sales         2000-10  GB      1,225,584
Direct Sales         2000-10  US      1,225,584
Partial CUBE
Partial CUBE resembles partial ROLLUP in that you can limit it to certain dimensions and precede it with columns outside the CUBE operator. In this case, subtotals of all possible combinations are limited to the dimensions within the cube list (in parentheses), and they are combined with the preceding items in the GROUP BY list. The syntax for partial CUBE is as follows:
GROUP BY expr1, CUBE(expr2, expr3)
Example 20-5
Using the sales database, you can issue the following statement:
SELECT channel_desc, calendar_month_desc, countries.country_iso_code,
       TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$
FROM sales, customers, times, channels, countries
WHERE sales.time_id = times.time_id AND sales.cust_id = customers.cust_id
  AND sales.channel_id = channels.channel_id
  AND channels.channel_desc IN ('Direct Sales', 'Internet')
  AND times.calendar_month_desc IN ('2000-09', '2000-10')
  AND countries.country_iso_code IN ('GB', 'US')
GROUP BY channel_desc, CUBE(calendar_month_desc, countries.country_iso_code);

CHANNEL_DESC         CALENDAR CO         SALES$
-------------------- -------- -- --------------
Internet                                934,955
Internet                      GB        467,478
Internet                      US        467,478
Internet             2000-09            456,482
Internet             2000-09  GB        228,241
Internet             2000-09  US        228,241
Internet             2000-10            478,473
Internet             2000-10  GB        239,236
Internet             2000-10  US        239,236
Direct Sales                          4,886,784
Direct Sales                  GB      2,443,392
Direct Sales                  US      2,443,392
Direct Sales         2000-09          2,435,616
Direct Sales         2000-09  GB      1,217,808
Direct Sales         2000-09  US      1,217,808
Direct Sales         2000-10          2,451,169
Direct Sales         2000-10  GB      1,225,584
Direct Sales         2000-10  US      1,225,584
GROUPING Functions
Two challenges arise with the use of ROLLUP and CUBE. First, how can you programmatically determine which result set rows are subtotals, and how do you find the exact level of aggregation for a given subtotal? You often need to use subtotals in calculations such as percent-of-totals, so you need an easy way to determine which rows are the subtotals. Second, what happens if query results contain both stored NULL values and "NULL" values created by a ROLLUP or CUBE? How can you differentiate between the two? See Oracle Database SQL Reference for syntax and restrictions.
GROUPING Function
GROUPING handles these problems. Using a single column as its argument, GROUPING returns 1 when it encounters a NULL value created by a ROLLUP or CUBE operation. That is, if the NULL indicates the row is a subtotal, GROUPING returns a 1. Any other type of value, including a stored NULL, returns a 0. GROUPING appears in the selection list portion of a SELECT statement. Its form is:
SELECT ...  [GROUPING(dimension_column)] ...
GROUP BY {CUBE | ROLLUP | GROUPING SETS} (dimension_column)

Example 20-6 GROUPING to Mask Columns
This example uses GROUPING to create a set of mask columns for the result set shown in Example 20-3 on page 20-8. The mask columns are easy to analyze programmatically.
SELECT channel_desc, calendar_month_desc, country_iso_code, TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$, GROUPING(channel_desc) AS Ch, GROUPING(calendar_month_desc) AS Mo, GROUPING(country_iso_code) AS Co FROM sales, customers, times, channels, countries WHERE sales.time_id=times.time_id AND sales.cust_id=customers.cust_id AND sales.channel_id= channels.channel_id AND channels.channel_desc IN ('Direct Sales', 'Internet') AND times.calendar_month_desc IN ('2000-09', '2000-10') AND countries.country_iso_code IN ('GB', 'US') GROUP BY ROLLUP(channel_desc, calendar_month_desc, countries.country_iso_code); CHANNEL_DESC -------------------Internet Internet Internet Internet Internet Internet Internet Direct Sales Direct Sales Direct Sales Direct Sales Direct Sales Direct Sales Direct Sales CALENDAR -------2000-09 2000-09 2000-09 2000-10 2000-10 2000-10 2000-09 2000-09 2000-09 2000-10 2000-10 2000-10 CO SALES$ CH MO CO -- -------------- ---------- ---------- ---------GB 228,241 0 0 0 US 228,241 0 0 0 456,482 0 0 1 GB 239,236 0 0 0 US 239,236 0 0 0 478,473 0 0 1 934,955 0 1 1 GB 1,217,808 0 0 0 US 1,217,808 0 0 0 2,435,616 0 0 1 GB 1,225,584 0 0 0 US 1,225,584 0 0 0 2,451,169 0 0 1 4,886,784 0 1 1 5,821,739 1 1 1
A program can easily identify the detail rows by a mask of "0 0 0" on the Ch, Mo, and Co columns. The first-level subtotal rows have a mask of "0 0 1", the second-level subtotal rows have a mask of "0 1 1", and the overall total row has a mask of "1 1 1". You can improve the readability of result sets by using the GROUPING and DECODE functions, as shown in Example 20-7.
Example 20-7 GROUPING For Readability
SELECT DECODE(GROUPING(channel_desc), 1, 'Multi-channel sum', channel_desc) AS Channel, DECODE (GROUPING (country_iso_code), 1, 'Multi-country sum', country_iso_code) AS Country, TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$ FROM sales, customers, times, channels, countries WHERE sales.time_id=times.time_id AND sales.cust_id=customers.cust_id AND sales.channel_id= channels.channel_id AND channels.channel_desc IN ('Direct Sales', 'Internet') AND times.calendar_month_desc= '2000-09' AND country_iso_code IN ('GB', 'US') GROUP BY CUBE(channel_desc, country_iso_code); CHANNEL -------------------Multi-channel sum Multi-channel sum Multi-channel sum Internet Internet Internet Direct Sales Direct Sales Direct Sales COUNTRY SALES$ ----------------- -------------Multi-country sum 2,892,098 GB 1,446,049 US 1,446,049 Multi-country sum 456,482 GB 228,241 US 228,241 Multi-country sum 2,435,616 GB 1,217,808 US 1,217,808
To understand the previous statement, note its first column specification, which handles the channel_desc column. Consider the first line of the previous statement:
SELECT DECODE(GROUPING(channel_desc), 1, 'Multi-channel sum', channel_desc) AS Channel
In this, the channel_desc value is determined with a DECODE function that contains a GROUPING function. The GROUPING function returns a 1 if a row value is an aggregate created by ROLLUP or CUBE, otherwise it returns a 0. The DECODE function then operates on the GROUPING function's results. It returns the text "Multi-channel sum" if it receives a 1 and the channel_desc value from the database if it receives a 0. Values from the database will be either a real value such as "Internet" or a stored NULL. The second column specification, displaying country_iso_code, works the same way.
SELECT channel_desc, calendar_month_desc, country_iso_code, TO_CHAR( SUM(amount_sold), '9,999,999,999') SALES$, GROUPING(channel_desc) CH, GROUPING (calendar_month_desc) MO, GROUPING(country_iso_code) CO FROM sales, customers, times, channels, countries WHERE sales.time_id=times.time_id AND sales.cust_id=customers.cust_id AND sales.channel_id= channels.channel_id AND channels.channel_desc IN ('Direct Sales', 'Internet') AND times.calendar_month_desc IN ('2000-09', '2000-10') AND country_iso_code IN ('GB', 'US') GROUP BY CUBE(channel_desc, calendar_month_desc, country_iso_code) HAVING (GROUPING(channel_desc)=1 AND GROUPING(calendar_month_desc)= 1 AND GROUPING(country_iso_code)=1) OR (GROUPING(channel_desc)=1 AND GROUPING (calendar_month_desc)= 1) OR (GROUPING(country_iso_code)=1 AND GROUPING(calendar_month_desc)= 1); CHANNEL_DESC C CO SALES$ CH MO CO -------------------- - -- -------------- ---------- ---------- ---------GB 2,910,870 1 1 0 US 2,910,870 1 1 0 Direct Sales 4,886,784 0 1 1 Internet 934,955 0 1 1 5,821,739 1 1 1
Compare the result set of Example 20-8 with that in Example 20-3 on page 20-8 to see how Example 20-8 is a precisely specified group: it contains only the country totals, the channel totals aggregated over time and country, and the grand total.
GROUPING_ID Function
To find the GROUP BY level of a particular row, a query must return GROUPING function information for each of the GROUP BY columns. If we do this using the GROUPING function, every GROUP BY column requires another column using the GROUPING function. For instance, a four-column GROUP BY clause needs to be analyzed with four GROUPING functions. This is inconvenient to write in SQL and increases the number of columns required in the query. When you want to store the query result sets in tables, as with materialized views, the extra columns waste storage space. To address these problems, you can use the GROUPING_ID function. GROUPING_ID returns a single number that enables you to determine the exact GROUP BY level. For each row, GROUPING_ID takes the set of 1's and 0's that would be generated if you used the appropriate GROUPING functions and concatenates them, forming a bit vector. The bit vector is treated as a binary number, and the number's base-10 value is returned by the GROUPING_ID function. For instance, if you group with the expression CUBE(a, b), the possible values are as shown in Table 20-2.
Table 20-2  GROUPING_ID Example for CUBE(a, b)

Bit Vector   GROUPING_ID
----------   -----------
00           0
01           1
10           2
11           3
GROUPING_ID clearly distinguishes groupings created by a grouping set specification, and it is very useful during refresh and rewrite of materialized views.
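As a brief illustration (a sketch using the sh sample schema), GROUPING_ID condenses the GROUPING information for a CUBE into a single column:

SELECT channel_desc, country_iso_code,
       TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$,
       GROUPING_ID(channel_desc, country_iso_code) AS gid
FROM   sales, customers, channels, countries
WHERE  sales.cust_id = customers.cust_id
  AND  sales.channel_id = channels.channel_id
  AND  customers.country_id = countries.country_id
GROUP BY CUBE(channel_desc, country_iso_code);

Rows with gid = 0 are detail rows, gid = 1 are per-channel subtotals, gid = 2 are per-country subtotals, and gid = 3 is the grand total.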
GROUP_ID Function
While the extensions to GROUP BY offer power and flexibility, they also allow complex result sets that can include duplicate groupings. The GROUP_ID function lets you distinguish among duplicate groupings. If there are multiple sets of rows calculated for a given level, GROUP_ID assigns the value of 0 to all the rows in the first set. All other sets of duplicate rows for a particular grouping are assigned higher values, starting with 1. For example, consider the following query, which generates a duplicate grouping:
Example 20-9 GROUP_ID
SELECT country_iso_code, SUBSTR(cust_state_province,1,12), SUM(amount_sold),
  GROUPING_ID(country_iso_code, cust_state_province) GROUPING_ID, GROUP_ID()
FROM sales, customers, times, countries
WHERE sales.time_id=times.time_id AND sales.cust_id=customers.cust_id
  AND customers.country_id=countries.country_id AND times.time_id= '30-OCT-00'
  AND country_iso_code IN ('FR', 'ES')
GROUP BY GROUPING SETS (country_iso_code,
  ROLLUP(country_iso_code, cust_state_province));

CO SUBSTR(CUST_ SUM(AMOUNT_SOLD) GROUPING_ID GROUP_ID()
-- ------------ ---------------- ----------- ----------
ES Alicante               135.32           0          0
ES Valencia              4133.56           0          0
ES Barcelona               24.22           0          0
FR Centre                   74.3           0          0
FR Aquitaine              231.97           0          0
FR Rhône-Alpes           1624.69           0          0
FR Ile-de-Franc          1860.59           0          0
FR Languedoc-Ro           4287.4           0          0
                        12372.05           3          0
ES                        4293.1           1          0
FR                       8078.95           1          0
ES                        4293.1           1          1
FR                       8078.95           1          1
This query generates the following groupings: (country_iso_code, cust_state_province), (country_iso_code), (country_iso_code), and (). Note that the grouping (country_iso_code) is repeated twice. The syntax for GROUPING SETS is explained in "GROUPING SETS Expression" on page 20-17. This function helps you filter out duplicate groupings from the result. For example, you can filter out the duplicate (country_iso_code) groupings from the previous example by adding a HAVING clause condition GROUP_ID()=0 to the query.
Note that this statement uses composite columns, described in "Composite Columns" on page 20-20. This statement calculates aggregates over the following three groupings:

(channel_desc, calendar_month_desc, country_iso_code)
(channel_desc, country_iso_code)
(calendar_month_desc, country_iso_code)
Compare the previous statement with the following alternative, which uses the CUBE operation and the GROUPING_ID function to return the desired rows:
SELECT channel_desc, calendar_month_desc, country_iso_code,
  TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$,
  GROUPING_ID(channel_desc, calendar_month_desc, country_iso_code) gid
FROM sales, customers, times, channels, countries
WHERE sales.time_id=times.time_id AND sales.cust_id=customers.cust_id
  AND sales.channel_id= channels.channel_id
  AND channels.channel_desc IN ('Direct Sales', 'Internet')
  AND times.calendar_month_desc IN ('2000-09', '2000-10')
  AND country_iso_code IN ('GB', 'US')
GROUP BY CUBE(channel_desc, calendar_month_desc, country_iso_code)
HAVING GROUPING_ID(channel_desc, calendar_month_desc, country_iso_code)=0
  OR GROUPING_ID(channel_desc, calendar_month_desc, country_iso_code)=2
  OR GROUPING_ID(channel_desc, calendar_month_desc, country_iso_code)=4;
This statement computes all 8 (2 x 2 x 2) groupings, though only the previous three groups are of interest to you. Another alternative is a statement built from several UNION ALL branches; such a statement is lengthy and requires three scans of the base table, making it inefficient. CUBE and ROLLUP can be thought of as grouping sets with very specific semantics. For example, consider the following statement:
CUBE(a, b, c)
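As an illustrative sketch of that equivalence (using the same placeholder columns), CUBE(a, b, c) produces the same aggregation levels as the following explicit grouping set specification:

GROUP BY GROUPING SETS
  ((a, b, c), (a, b), (a, c), (b, c), (a), (b), (c), ())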
Table 20-3 shows grouping sets specifications and the equivalent GROUP BY specification. Note that some examples use composite columns.
Table 20-3 GROUPING SETS Statements and Equivalent GROUP BY

GROUPING SETS Statement                    Equivalent GROUP BY Statement
-----------------------------------------  ------------------------------------------
GROUP BY GROUPING SETS(a, b, c)             GROUP BY a UNION ALL GROUP BY b
                                            UNION ALL GROUP BY c
GROUP BY GROUPING SETS(a, b, (b, c))        GROUP BY a UNION ALL GROUP BY b
                                            UNION ALL GROUP BY b, c
GROUP BY GROUPING SETS((a, b, c))           GROUP BY a, b, c
GROUP BY GROUPING SETS(a, (b), ())          GROUP BY a UNION ALL GROUP BY b
                                            UNION ALL GROUP BY ()
GROUP BY GROUPING SETS(a, ROLLUP(b, c))     GROUP BY a UNION ALL
                                            GROUP BY ROLLUP(b, c)
In the absence of an optimizer that looks across query blocks to generate the execution plan, a query based on UNION would need multiple scans of the base table, sales. This could be very inefficient as fact tables will normally be huge. Using GROUPING SETS statements, all the groupings of interest are available in the same query block.
Composite Columns
A composite column is a collection of columns that are treated as a unit during the computation of groupings. You specify the columns in parentheses as in the following statement:
ROLLUP (year, (quarter, month), day)
In this statement, the data is not rolled up across year and quarter, but is instead equivalent to the following groupings of a UNION ALL:

(year, quarter, month, day)
(year, quarter, month)
(year)
()
Here, (quarter, month) form a composite column and are treated as a unit. In general, composite columns are useful in ROLLUP, CUBE, GROUPING SETS, and concatenated groupings. For example, in CUBE or ROLLUP, composite columns would mean skipping aggregation across certain levels. That is, the following statement:
GROUP BY ROLLUP(a, (b, c))
Here, (b, c) are treated as a unit and rollup will not be applied across (b, c). It is as if you have an alias, for example z, for (b, c) and the GROUP BY expression reduces to GROUP BY ROLLUP(a, z). Compare this with the normal rollup as in the following:
GROUP BY ROLLUP(a, b, c)
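The contrast between the two forms is easiest to see by listing the groupings each one produces; the comments below are illustrative annotations added here, not query output:

GROUP BY ROLLUP(a, (b, c))   -- groupings: (a, b, c), (a), ()
GROUP BY ROLLUP(a, b, c)     -- groupings: (a, b, c), (a, b), (a), ()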
Similarly, a composite column in CUBE reduces the number of groupings. For example, the following statement:

GROUP BY CUBE((a, b), c)

is equivalent to:

GROUP BY a, b, c UNION ALL
GROUP BY a, b UNION ALL
GROUP BY c UNION ALL
GROUP BY ()
In GROUPING SETS, a composite column is used to denote a particular level of GROUP BY. See Table 20-3 for more examples of composite columns.
Example 20-10 Composite Columns
You do not have full control over what aggregation levels you want with CUBE and ROLLUP. For example, the following statement:
SELECT channel_desc, calendar_month_desc, country_iso_code,
  TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$
FROM sales, customers, times, channels, countries
WHERE sales.time_id=times.time_id AND sales.cust_id=customers.cust_id
  AND sales.channel_id= channels.channel_id
  AND channels.channel_desc IN ('Direct Sales', 'Internet')
  AND times.calendar_month_desc IN ('2000-09', '2000-10')
  AND country_iso_code IN ('GB', 'US')
GROUP BY ROLLUP(channel_desc, calendar_month_desc, country_iso_code);

results in Oracle computing the following groupings:

(1) (channel_desc, calendar_month_desc, country_iso_code)
(2) (channel_desc, calendar_month_desc)
(3) (channel_desc)
(4) ()

If you are just interested in groupings (1), (3) and (4) in this example, you cannot limit the calculation to those groupings without using composite columns. With composite columns, this is possible by treating month and country as a single unit while rolling up. Columns enclosed in parentheses are treated as a unit while computing CUBE and ROLLUP. Thus, you would say:
SELECT channel_desc, calendar_month_desc, country_iso_code,
  TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$
FROM sales, customers, times, channels, countries
WHERE sales.time_id=times.time_id AND sales.cust_id=customers.cust_id
  AND sales.channel_id= channels.channel_id
  AND channels.channel_desc IN ('Direct Sales', 'Internet')
  AND times.calendar_month_desc IN ('2000-09', '2000-10')
  AND country_iso_code IN ('GB', 'US')
GROUP BY ROLLUP(channel_desc, (calendar_month_desc, country_iso_code));

CHANNEL_DESC         CALENDAR CO SALES$
-------------------- -------- -- --------------
Internet             2000-09  GB        228,241
Internet             2000-09  US        228,241
Internet             2000-10  GB        239,236
Internet             2000-10  US        239,236
Internet                             934,955
Direct Sales         2000-09  GB      1,217,808
Direct Sales         2000-09  US      1,217,808
Direct Sales         2000-10  GB      1,225,584
Direct Sales         2000-10  US      1,225,584
Direct Sales                         4,886,784
                                     5,821,739
Concatenated Groupings
Concatenated groupings offer a concise way to generate useful combinations of groupings. Groupings specified with concatenated groupings yield the cross-product of groupings from each grouping set. The cross-product operation enables even a small number of concatenated groupings to generate a large number of final groups. The concatenated groupings are specified simply by listing multiple grouping sets, cubes, and rollups, and separating them with commas. Here is an example of concatenated grouping sets:
GROUP BY GROUPING SETS(a, b), GROUPING SETS(c, d)
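As a sketch with the same placeholder columns, this concatenation is equivalent to enumerating the cross-product of the two grouping sets explicitly:

GROUP BY GROUPING SETS(a, b), GROUPING SETS(c, d)
-- is equivalent to:
GROUP BY GROUPING SETS((a, c), (a, d), (b, c), (b, d))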
Concatenation of grouping sets is helpful for these reasons:

Ease of query development: You need not enumerate all groupings manually.

Use by applications: SQL generated by OLAP applications often involves concatenation of grouping sets, with each grouping set defining the groupings needed for a dimension.
Example 20-11 Concatenated Groupings
You can also specify more than one grouping in the GROUP BY clause. For example, if you want aggregated sales values for each channel rolled up across the levels of the time dimension (year and quarter) and across the levels of the geography dimension (country and state), you can issue the following statement:
SELECT channel_desc, calendar_year, calendar_quarter_desc, country_iso_code, cust_state_province, TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$ FROM sales, customers, times, channels, countries WHERE sales.time_id = times.time_id AND sales.cust_id = customers.cust_id AND sales.channel_id = channels.channel_id AND countries.country_id = customers.country_id AND channels.channel_desc IN ('Direct Sales', 'Internet') AND times.calendar_month_desc IN ('2000-09', '2000-10') AND countries.country_iso_code IN ('GB', 'FR') GROUP BY channel_desc, GROUPING SETS (ROLLUP(calendar_year, calendar_quarter_desc), ROLLUP(country_iso_code, cust_state_province));
This statement computes the following groupings:

(channel_desc, calendar_year, calendar_quarter_desc)
(channel_desc, calendar_year)
(channel_desc)
(channel_desc, country_iso_code, cust_state_province)
(channel_desc, country_iso_code)
(channel_desc)
Each of these groupings combines the expression channel_desc with one grouping from either ROLLUP(calendar_year, calendar_quarter_desc), which is equivalent to ((calendar_year, calendar_quarter_desc), (calendar_year), ()), or ROLLUP(country_iso_code, cust_state_province), which is equivalent to ((country_iso_code, cust_state_province), (country_iso_code), ()).
Note that the output contains two occurrences of the (channel_desc) group. To filter out the extra (channel_desc) group, the query could use a GROUP_ID function. Another concatenated grouping example is the following, showing the cross-product of two grouping sets:
Example 20-12
SELECT country_iso_code, cust_state_province, calendar_year, calendar_quarter_ desc, TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$ FROM sales, customers, times, channels, countries WHERE sales.time_id=times.time_id AND sales.cust_id=customers.cust_id AND countries.country_id=customers.country_id AND sales.channel_id= channels.channel_id AND channels.channel_desc IN ('Direct Sales', 'Internet') AND times.calendar_month_desc IN ('2000-09', '2000-10') AND country_iso_code IN ('GB', 'FR') GROUP BY GROUPING SETS (country_iso_code, cust_state_province), GROUPING SETS (calendar_year, calendar_quarter_desc);
This statement results in the computation of the groupings (country_iso_code, calendar_year), (country_iso_code, calendar_quarter_desc), (cust_state_province, calendar_year), and (cust_state_province, calendar_quarter_desc).
Consider a sales data set with three dimensions, each of which has a multilevel hierarchy:

time: year, quarter, month, day (week is in a separate hierarchy)
product: category, subcategory, prod_name
geography: region, subregion, country, state, city
This data is represented using a column for each level of the hierarchies, creating a total of twelve columns for dimensions, plus the columns holding sales figures. For our business intelligence needs, we would like to calculate and store certain aggregates of the various combinations of dimensions. In Example 20-13 on page 20-25, we create the aggregates for all levels, except for "day", which would create too many rows. In particular, we want to use ROLLUP within each dimension to generate useful aggregates. Once we have the ROLLUP-based aggregates within each dimension, we want to combine them with the other dimensions. This will generate our hierarchical cube. Note that this is not at all the same as a CUBE using all twelve of the dimension columns: that would create 2 to the 12th power (4,096) aggregation groups, of which we need only a small fraction. Concatenated grouping sets make it easy to generate exactly the aggregations we need. Example 20-13 shows the GROUP BY clause needed.
Example 20-13 Concatenated Groupings and Hierarchical Cubes
SELECT calendar_year, calendar_quarter_desc, calendar_month_desc,
  country_region, country_subregion, countries.country_iso_code,
  cust_state_province, cust_city, prod_category_desc, prod_subcategory_desc,
  prod_name, TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$
FROM sales, customers, times, channels, countries, products
WHERE sales.time_id=times.time_id AND sales.cust_id=customers.cust_id
  AND sales.channel_id= channels.channel_id AND sales.prod_id=products.prod_id
  AND customers.country_id=countries.country_id
  AND channels.channel_desc IN ('Direct Sales', 'Internet')
  AND times.calendar_month_desc IN ('2000-09', '2000-10')
  AND prod_name IN ('Envoy Ambassador', 'Mouse Pad')
  AND countries.country_iso_code IN ('GB', 'US')
GROUP BY ROLLUP(calendar_year, calendar_quarter_desc, calendar_month_desc),
  ROLLUP(country_region, country_subregion, countries.country_iso_code,
    cust_state_province, cust_city),
  ROLLUP(prod_category_desc, prod_subcategory_desc, prod_name);
The ROLLUPs in the GROUP BY specification generate the groups shown in Table 20-4: four for the time dimension, four for product, and six for geography.
Table 20-4 Hierarchical CUBE Example
ROLLUP By Time:      year, quarter, month; year, quarter; year; all times
ROLLUP By Product:   category, subcategory, name; category, subcategory; category; all products
ROLLUP By Geography: region, subregion, country, state, city; region, subregion, country, state; region, subregion, country; region, subregion; region; all geographies
The concatenated grouping sets specied in the previous SQL will take the ROLLUP aggregations listed in the table and perform a cross-product on them. The cross-product will create the 96 (4x4x6) aggregate groups needed for a hierarchical cube of the data. There are major advantages in using three ROLLUP expressions to replace what would otherwise require 96 grouping set expressions: the concise SQL is far less error-prone to develop and far easier to maintain, and it enables much better query optimization. You can picture how a cube with more dimensions and more levels would make the use of concatenated groupings even more advantageous. See "Working with Hierarchical Cubes in SQL" on page 20-29 for more information regarding hierarchical cubes.
Hierarchy Handling in ROLLUP and CUBE
Column Capacity in ROLLUP and CUBE
HAVING Clause Used with GROUP BY Extensions
ORDER BY Clause Used with GROUP BY Extensions
Using Other Aggregate Functions with ROLLUP and CUBE
SELECT calendar_year, calendar_quarter_number, calendar_month_number, SUM(amount_sold) FROM sales, times, products, customers, countries WHERE sales.time_id=times.time_id AND sales.prod_id=products.prod_id AND sales.cust_id=customers.cust_id AND prod_name IN ('Envoy Ambassador',
'Mouse Pad') AND country_iso_code = 'GB' AND calendar_year=1999 GROUP BY ROLLUP(calendar_year, calendar_quarter_number, calendar_month_number); CALENDAR_YEAR CALENDAR_QUARTER_NUMBER CALENDAR_MONTH_NUMBER SUM(AMOUNT_SOLD) ------------- ----------------------- --------------------- ---------------1999 1 1 168419.74 1999 1 2 332348.02 1999 1 3 169511.52 1999 1 670279.28 1999 2 4 247291.88 1999 2 5 182338.86 1999 2 6 264493.91 1999 2 694124.65 1999 3 7 192268.11 1999 3 8 182550.88 1999 3 9 270309.26 1999 3 645128.25 1999 4 10 180400.32 1999 4 11 232830.87 1999 4 12 168008.07 1999 4 581239.26 1999 2590771.44
Example 20-15 WITH Clause
WITH channel_summary AS (SELECT channels.channel_desc, SUM(amount_sold) AS channel_total FROM sales, channels WHERE sales.channel_id = channels.channel_id GROUP BY channels.channel_desc) SELECT channel_desc, channel_total FROM channel_summary WHERE channel_total > (SELECT SUM(channel_total) * 1/3 FROM channel_summary); CHANNEL_DESC CHANNEL_TOTAL -------------------- ------------Direct Sales 57875260.6
Note that this example could also be performed efficiently using the reporting aggregate functions described in Chapter 21, "SQL for Analysis and Reporting".
The following shows the GROUP BY clause needed to create a hierarchical cube for a 2-dimensional example similar to Example 20-13. The following simple syntax performs a concatenated rollup:
GROUP BY ROLLUP(year, quarter, month), ROLLUP(Division, brand, item)
This concatenated rollup takes the ROLLUP aggregations similar to those listed in Table 20-4, "Hierarchical CUBE Example" in the prior section and performs a cross-product on them. The cross-product will create the 16 (4x4) aggregate groups needed for a hierarchical cube of the data.
Consider the following analytic query. It consists of a hierarchical cube query nested in a slicing query.
SELECT month, division, sum_sales FROM (SELECT year, quarter, month, division, brand, item, SUM(sales) sum_sales, GROUPING_ID(grouping-columns) gid FROM sales, products, time WHERE join-condition GROUP BY ROLLUP(year, quarter, month), ROLLUP(division, brand, item)) WHERE division = 25 AND month = 200201 AND gid = gid-for-Division-Month;
The inner hierarchical cube specified defines a simple cube, with two dimensions and four levels in each dimension. It would generate 16 groups (4 Time levels * 4 Product levels). The GROUPING_ID function in the query identifies the specific group each row belongs to, based on the aggregation level of the grouping-columns in its argument. The outer query applies the constraints needed for our specific query, limiting Division to a value of 25 and Month to a value of 200201 (representing January 2002 in this case). In conceptual terms, it slices a small chunk of data from the cube. The outer query's constraint on the GID column, indicated in the query by gid-for-division-month would be the value of a key indicating that the data is grouped as a combination of division and month. The GID constraint selects only those rows that are aggregated at the level of a GROUP BY month, division clause. Oracle Database removes unneeded aggregation groups from query processing based on the outer query conditions. The outer conditions of the previous query limit the result set to a single group aggregating division and month. Any other
groups involving year, quarter, brand, and item are unnecessary here. The group pruning optimization recognizes this and transforms the query into:
SELECT month, division, sum_sales FROM (SELECT null, null, month, division, null, null, SUM(sales) sum_sales, GROUPING_ID(grouping-columns) gid FROM sales, products, time WHERE join-condition GROUP BY month, division) WHERE division = 25 AND month = 200201 AND gid = gid-for-Division-Month;
The changed SQL is the simplified inner query: it now has a simple GROUP BY clause of month, division, and the columns year, quarter, brand, and item have been converted to null to match the simplified GROUP BY clause. Because the query now requests just one group, fifteen out of sixteen groups are removed from the processing, greatly reducing the work. For a cube with more dimensions and more levels, the savings possible through group pruning can be far greater. Note that the group pruning transformation works with all the GROUP BY extensions: ROLLUP, CUBE, and GROUPING SETS. While the optimizer has simplified the previous query to a simple GROUP BY, faster response times can be achieved if the group is precomputed and stored in a materialized view. Because OLAP queries can ask for any slice of the cube, many groups may need to be precomputed and stored in a materialized view. This is discussed in the next section.
A materialized view containing the full hierarchical cube requires much more storage than a small set of aggregate groups. The trade-off in processing time and disk space versus query performance needs to be considered before deciding to create it. An additional possibility you could consider is to use data compression to lessen your disk space requirements. See Oracle Database SQL Reference for compression syntax and restrictions and "Storage And Table Compression" on page 8-22 for details regarding compression.
CREATE MATERIALIZED VIEW sales_hierarchical_mon_cube_mv PARTITION BY RANGE (mon) SUBPARTITION BY LIST (gid) REFRESH FAST ON DEMAND ENABLE QUERY REWRITE AS SELECT calendar_year yr, calendar_quarter_desc qtr, calendar_month_desc mon, country_id, cust_state_province, cust_city, prod_category, prod_subcategory, prod_name, GROUPING_ID(calendar_year, calendar_quarter_desc, calendar_month_desc, country_id, cust_state_province, cust_city, prod_category, prod_subcategory, prod_name) gid, SUM(amount_sold) s_sales, COUNT(amount_sold) c_sales, COUNT(*) c_star FROM sales s, products p, customers c, times t WHERE s.cust_id = c.cust_id AND s.prod_id = p.prod_id AND s.time_id = t.time_id GROUP BY calendar_year, calendar_quarter_desc, calendar_month_desc, ROLLUP(country_id, cust_state_province, cust_city), ROLLUP(prod_category, prod_subcategory, prod_name), ...;
CREATE MATERIALIZED VIEW sales_hierarchical_qtr_cube_mv
PARTITION BY RANGE (qtr) SUBPARTITION BY LIST (gid) ...
REFRESH FAST ON DEMAND
ENABLE QUERY REWRITE AS
SELECT calendar_year yr, calendar_quarter_desc qtr, country_id,
  cust_state_province, cust_city, prod_category, prod_subcategory, prod_name,
  GROUPING_ID(calendar_year, calendar_quarter_desc, country_id,
    cust_state_province, cust_city, prod_category, prod_subcategory,
    prod_name) gid,
  SUM(amount_sold) s_sales, COUNT(amount_sold) c_sales, COUNT(*) c_star
FROM sales s, products p, customers c, times t
WHERE s.cust_id = c.cust_id AND s.prod_id = p.prod_id AND s.time_id = t.time_id
GROUP BY calendar_year, calendar_quarter_desc,
  ROLLUP(country_id, cust_state_province, cust_city),
  ROLLUP(prod_category, prod_subcategory, prod_name);

CREATE MATERIALIZED VIEW sales_hierarchical_yr_cube_mv
PARTITION BY RANGE (year) SUBPARTITION BY LIST (gid)
REFRESH FAST ON DEMAND
ENABLE QUERY REWRITE AS
SELECT calendar_year yr, country_id, cust_state_province, cust_city,
  prod_category, prod_subcategory, prod_name,
  GROUPING_ID(calendar_year, country_id, cust_state_province, cust_city,
    prod_category, prod_subcategory, prod_name) gid,
  SUM(amount_sold) s_sales, COUNT(amount_sold) c_sales, COUNT(*) c_star
FROM sales s, products p, customers c, times t
WHERE s.cust_id = c.cust_id AND s.prod_id = p.prod_id AND s.time_id = t.time_id
GROUP BY calendar_year,
  ROLLUP(country_id, cust_state_province, cust_city),
  ROLLUP(prod_category, prod_subcategory, prod_name), ...;

CREATE MATERIALIZED VIEW sales_hierarchical_all_cube_mv
PARTITION BY LIST (gid) ...
REFRESH FAST ON DEMAND
ENABLE QUERY REWRITE AS
SELECT country_id, cust_state_province, cust_city, prod_category,
  prod_subcategory, prod_name,
  GROUPING_ID(country_id, cust_state_province, cust_city, prod_category,
    prod_subcategory, prod_name) gid,
  SUM(amount_sold) s_sales, COUNT(amount_sold) c_sales, COUNT(*) c_star
FROM sales s, products p, customers c, times t
WHERE s.cust_id = c.cust_id AND s.prod_id = p.prod_id AND s.time_id = t.time_id
GROUP BY ROLLUP(country_id, cust_state_province, cust_city),
  ROLLUP(prod_category, prod_subcategory, prod_name);
This allows use of PCT refresh on the materialized views sales_hierarchical_mon_cube_mv, sales_hierarchical_qtr_cube_mv, and sales_hierarchical_yr_cube_mv on partition maintenance operations to the sales table. PCT refresh can also be used when there have been significant changes to the base table and log-based fast refresh is estimated to be slower than PCT refresh. You can simply specify the method as force (method => '?') in the refresh subprograms of the DBMS_MVIEW package and Oracle Database will pick the best method of refresh. See "Partition Change Tracking (PCT) Refresh" on page 15-15 for more information regarding PCT refresh. Because sales_hierarchical_all_cube_mv does not contain any column from the times table, PCT refresh is not enabled on it. But you can still call the refresh subprograms in the DBMS_MVIEW package with the method as force (method => '?') and Oracle Database will pick the best method of refresh. If you are interested in a partial cube (that is, a subset of groupings from the complete cube), then Oracle Corporation recommends storing the cube as a "federated cube". A federated cube stores each grouping of interest in a separate materialized view.
CREATE MATERIALIZED VIEW sales_mon_city_prod_mv
PARTITION BY RANGE (mon) ...
BUILD DEFERRED
REFRESH FAST ON DEMAND
USING TRUSTED CONSTRAINTS
ENABLE QUERY REWRITE AS
SELECT calendar_month_desc mon, cust_city, prod_name,
  SUM(amount_sold) s_sales, COUNT(amount_sold) c_sales, COUNT(*) c_star
FROM sales s, products p, customers c, times t
WHERE s.cust_id = c.cust_id AND s.prod_id = p.prod_id AND s.time_id = t.time_id
GROUP BY calendar_month_desc, cust_city, prod_name;
CREATE MATERIALIZED VIEW sales_qtr_city_prod_mv PARTITION BY RANGE (qtr) ... BUILD DEFERRED REFRESH FAST ON DEMAND USING TRUSTED CONSTRAINTS ENABLE QUERY REWRITE AS SELECT calendar_quarter_desc qtr, cust_city, prod_name,SUM(amount_sold) s_sales, COUNT(amount_sold) c_sales, COUNT(*) c_star FROM sales s, products p, customers c, times t WHERE s.cust_id = c.cust_id AND s.prod_id =p.prod_id AND s.time_id = t.time_id GROUP BY calendar_quarter_desc, cust_city, prod_name; CREATE MATERIALIZED VIEW sales_yr_city_prod_mv PARTITION BY RANGE (yr) ... BUILD DEFERRED REFRESH FAST ON DEMAND USING TRUSTED CONSTRAINTS ENABLE QUERY REWRITE AS SELECT calendar_year yr, cust_city, prod_name, SUM(amount_sold) s_sales, COUNT(amount_sold) c_sales, COUNT(*) c_star FROM sales s, products p, customers c, times t WHERE s.cust_id = c.cust_id AND s.prod_id =p.prod_id AND s.time_id = t.time_id GROUP BY calendar_year, cust_city, prod_name; CREATE MATERIALIZED VIEW sales_mon_city_scat_mv PARTITION BY RANGE (mon) ... BUILD DEFERRED REFRESH FAST ON DEMAND USING TRUSTED CONSTRAINTS ENABLE QUERY REWRITE AS SELECT calendar_month_desc mon, cust_city, prod_subcategory, SUM(amount_sold) s_sales, COUNT(amount_sold) c_sales, COUNT(*) c_star FROM sales s, products p, customers c, times t WHERE s.cust_id = c.cust_id AND s.prod_id =p.prod_id AND s.time_id =t.time_id GROUP BY calendar_month_desc, cust_city, prod_subcategory; CREATE MATERIALIZED VIEW sales_qtr_city_cat_mv PARTITION BY RANGE (qtr) ... BUILD DEFERRED REFRESH FAST ON DEMAND
USING TRUSTED CONSTRAINTS ENABLE QUERY REWRITE AS SELECT calendar_quarter_desc qtr, cust_city, prod_category cat, SUM(amount_sold) s_sales, COUNT(amount_sold) c_sales, COUNT(*) c_star FROM sales s, products p, customers c, times t WHERE s.cust_id = c.cust_id AND s.prod_id =p.prod_id AND s.time_id =t.time_id GROUP BY calendar_quarter_desc, cust_city, prod_category; CREATE MATERIALIZED VIEW sales_yr_city_all_mv PARTITION BY RANGE (yr) ... BUILD DEFERRED REFRESH FAST ON DEMAND USING TRUSTED CONSTRAINTS ENABLE QUERY REWRITE AS SELECT calendar_year yr, cust_city, SUM(amount_sold) s_sales, COUNT(amount_sold) c_sales, COUNT(*) c_star FROM sales s, products p, customers c, times t WHERE s.cust_id = c.cust_id AND s.prod_id = p.prod_id AND s.time_id = t.time_id GROUP BY calendar_year, cust_city;
These materialized views can be created as BUILD DEFERRED and then, you can execute DBMS_MVIEW.REFRESH_DEPENDENT(number_of_failures, 'SALES', 'C' ...) so that the complete refresh of each of the materialized views defined on the detail table SALES is scheduled in the most efficient order. See "Scheduling Refresh" on page 15-22. Because each of these materialized views is partitioned on the time level (month, quarter, or year) present in its SELECT list, PCT is enabled on the SALES table for each one of them, thus providing an opportunity to apply the PCT refresh method in addition to the FAST and COMPLETE refresh methods.
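For illustration only, a minimal sketch of the two refresh calls mentioned in this section (the materialized view name and parameter values are examples; the DBMS_MVIEW subprograms accept further optional arguments not shown here):

DECLARE
   failures BINARY_INTEGER;
BEGIN
   -- Complete refresh of every materialized view defined on SALES,
   -- scheduled by Oracle Database in the most efficient order
   DBMS_MVIEW.REFRESH_DEPENDENT(failures, 'SALES', 'C');

   -- Force refresh of a single cube materialized view: Oracle Database
   -- picks the best method (PCT, fast, or complete)
   DBMS_MVIEW.REFRESH('SALES_HIERARCHICAL_MON_CUBE_MV', method => '?');
END;
/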
21
SQL for Analysis and Reporting
The following topics provide information about how to improve analytical SQL queries in a data warehouse:
Overview of SQL for Analysis and Reporting
Ranking Functions
Windowing Aggregate Functions
Reporting Aggregate Functions
LAG/LEAD Functions
FIRST/LAST Functions
Inverse Percentile Functions
Hypothetical Rank and Distribution Functions
Linear Regression Functions
Frequent Itemsets
Other Statistical Functions
WIDTH_BUCKET Function
User-Defined Aggregate Functions
CASE Expressions
Data Densification for Reporting
Time Series Calculations on Densified Data
Analytic functions enable you to calculate:

Rankings and percentiles
Moving window calculations
Lag/lead analysis
First/last analysis
Linear regression statistics
Ranking functions include cumulative distributions, percent rank, and N-tiles. Moving window calculations allow you to find moving and cumulative aggregations, such as sums and averages. Lag/lead analysis enables direct inter-row references so you can calculate period-to-period changes. First/last analysis enables you to find the first or last value in an ordered group. Other enhancements to SQL include the CASE expression. CASE expressions provide if-then logic useful in many situations. In Oracle Database 10g, the SQL reporting capability was further enhanced by the introduction of partitioned outer join. Partitioned outer join is an extension to ANSI outer join syntax that allows users to selectively densify certain dimensions while keeping others sparse. This allows reporting tools to selectively densify dimensions, for example, the ones that appear in their cross-tabular reports, while keeping others sparse. To enhance performance, analytic functions can be parallelized: multiple processes can simultaneously execute all of these statements. These capabilities make calculations easier and more efficient, thereby enhancing database performance, scalability, and simplicity. Analytic functions are classified as described in Table 21-1.
Table 21-1 Analytic Functions and Their Uses

Type                                 Used For
-----------------------------------  --------------------------------------------------
Ranking                              Calculating ranks, percentiles, and n-tiles of the
                                     values in a result set.
Windowing                            Calculating cumulative and moving aggregates.
Reporting                            Calculating aggregates, such as sums and averages,
                                     reported against every row of a group.
LAG/LEAD                             Finding a value in a row a specified number of rows
                                     from the current row.
FIRST/LAST                           Finding the first or last value in an ordered group.
Hypothetical Rank and Distribution   The rank or percentile that a row would have if
                                     inserted into a specified data set.
To perform these operations, the analytic functions add several new elements to SQL processing. These elements build on existing SQL to allow flexible and powerful calculation expressions. With just a few exceptions, the analytic functions have these new elements. The processing flow is represented in Figure 21-1.
Figure 21-1 Processing Order
Processing order: Query processing using analytic functions takes place in three stages. First, all joins, WHERE, GROUP BY and HAVING clauses are performed. Second, the result set is made available to the analytic functions, and all their calculations take place. Third, if the query has an ORDER BY clause at its end, the ORDER BY is processed to allow for precise output ordering. The processing order is shown in Figure 21-1.
Result set partitions: The analytic functions allow users to divide query result sets into groups of rows called partitions. Note that the term partitions used with analytic functions is unrelated to the table partitions feature. Throughout this chapter, the term partitions refers to only the meaning related to analytic functions. Partitions are created after the groups defined with GROUP BY clauses, so they are available to any aggregate results such as sums and averages. Partition divisions may be based upon any desired columns or expressions. A query result set may be partitioned into just one partition holding all the rows, a few large partitions, or many small partitions holding just a few rows each.
Window: For each row in a partition, you can define a sliding window of data. This window determines the range of rows used to perform the calculations for the current row. Window sizes can be based on either a physical number of rows or a logical interval such as time. The window has a starting row and an ending row. Depending on its definition, the window may move at one or both ends. For instance, a window defined for a cumulative sum function would have its starting row fixed at the first row of its partition, and its ending row would slide from the starting point all the way to the last row of the partition. In contrast, a window defined for a moving average would have both its starting and end points slide so that they maintain a constant physical or logical range. A window can be set as large as all the rows in a partition or just a sliding window of one row within a partition. When a window is near a border, the function returns results for only the available rows, rather than warning you that the results are not what you want. When using window functions, the current row is included during calculations, so you should only specify (n-1) when you are dealing with n items.
Current row: Each calculation performed with an analytic function is based on a current row within a partition. The current row serves as the reference point determining the start and end of the window. For instance, a centered moving average calculation could be defined with a window that holds the current row, the six preceding rows, and the following six rows. This would create a sliding window of 13 rows, as shown in Figure 21-2.
Figure 21-2 Sliding Window Example (showing the window start and finish around the current row)
Ranking Functions
A ranking function computes the rank of a record compared to other records in the data set based on the values of a set of measures. The types of ranking function are:
RANK and DENSE_RANK Functions
CUME_DIST Function
PERCENT_RANK Function
NTILE Function
ROW_NUMBER Function
The difference between RANK and DENSE_RANK is that DENSE_RANK leaves no gaps in ranking sequence when there are ties. That is, if you were ranking a competition using DENSE_RANK and had three people tie for second place, you would say that all three were in second place and that the next person came in third. The RANK function would also give three people in second place, but the next person would be in fifth place. The following are some relevant points about RANK:
Ascending is the default sort order, which you may want to change to descending.

The expressions in the optional PARTITION BY clause divide the query result set into groups within which the RANK function operates. That is, RANK gets reset whenever the group changes. In effect, the value expressions of the PARTITION BY clause define the reset boundaries. If the PARTITION BY clause is missing, then ranks are computed over the entire query result set.

The ORDER BY clause specifies the measures (<value expression>) on which ranking is done and defines the order in which rows are sorted in each group (or partition). Once the data is sorted within each partition, ranks are given to each row starting from 1.

The NULLS FIRST | NULLS LAST clause indicates the position of NULLs in the ordered sequence, either first or last in the sequence. The order of the sequence would make NULLs compare either high or low with respect to non-NULL values. If the sequence were in ascending order, then NULLS FIRST implies that NULLs are smaller than all other non-NULL values and NULLS LAST implies they are larger than non-NULL values. It is the opposite for descending order. See the example in "Treatment of NULLs" on page 21-10.

If the NULLS FIRST | NULLS LAST clause is omitted, then the ordering of the null values depends on the ASC or DESC arguments. Null values are considered larger than any other values. If the ordering sequence is ASC, then nulls will appear last; nulls will appear first otherwise. Nulls are considered equal to other nulls and, therefore, the order in which nulls are presented is non-deterministic.
Ranking Order
The following example shows how the [ASC | DESC] option changes the ranking order.
Example 21-1 Ranking Order
SELECT channel_desc, TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$,
  RANK() OVER (ORDER BY SUM(amount_sold)) AS default_rank,
  RANK() OVER (ORDER BY SUM(amount_sold) DESC NULLS LAST) AS custom_rank
FROM sales, products, customers, times, channels, countries
WHERE sales.prod_id=products.prod_id AND sales.cust_id=customers.cust_id
  AND sales.time_id=times.time_id AND sales.channel_id=channels.channel_id
  AND times.calendar_month_desc IN ('2000-09', '2000-10')
  AND country_iso_code='US'
GROUP BY channel_desc;

CHANNEL_DESC         SALES$         DEFAULT_RANK CUSTOM_RANK
-------------------- -------------- ------------ -----------
Direct Sales              2,443,392            3           1
Partners                  1,365,963            2           2
Internet                    467,478            1           3
While the data in this result is ordered on the measure SALES$, in general, it is not guaranteed by the RANK function that the data will be sorted on the measures. If you want the data to be sorted on SALES$ in your result, you must specify it explicitly with an ORDER BY clause, at the end of the SELECT statement.
SELECT channel_desc, calendar_month_desc,
  TO_CHAR(TRUNC(SUM(amount_sold),-5), '9,999,999,999') SALES$,
  TO_CHAR(SUM(quantity_sold), '9,999,999,999') SALES_Count,
  RANK() OVER (ORDER BY TRUNC(SUM(amount_sold), -5) DESC,
    SUM(quantity_sold) DESC) AS col_rank
FROM sales, products, customers, times, channels
WHERE sales.prod_id=products.prod_id AND sales.cust_id=customers.cust_id
  AND sales.time_id=times.time_id AND sales.channel_id=channels.channel_id
  AND times.calendar_month_desc IN ('2000-09', '2000-10')
  AND channels.channel_desc<>'Tele Sales'
GROUP BY channel_desc, calendar_month_desc;

CHANNEL_DESC         CALENDAR SALES$         SALES_COUNT    COL_RANK
-------------------- -------- -------------- -------------- --------
Direct Sales         2000-10       1,200,000         12,584        1
Direct Sales         2000-09       1,200,000         11,995        2
Partners             2000-10         600,000          7,508        3
(three additional rows, with COL_RANK 4 through 6, are not reproduced here)
The sales_count column breaks the ties for three pairs of values.
SELECT channel_desc, calendar_month_desc,
  TO_CHAR(TRUNC(SUM(amount_sold),-4), '9,999,999,999') SALES$,
  RANK() OVER (ORDER BY TRUNC(SUM(amount_sold),-4) DESC) AS RANK,
  DENSE_RANK() OVER (ORDER BY TRUNC(SUM(amount_sold),-4) DESC) AS DENSE_RANK
FROM sales, products, customers, times, channels
WHERE sales.prod_id=products.prod_id AND sales.cust_id=customers.cust_id
  AND sales.time_id=times.time_id AND sales.channel_id=channels.channel_id
  AND times.calendar_month_desc IN ('2000-09', '2000-10')
  AND channels.channel_desc<>'Tele Sales'
GROUP BY channel_desc, calendar_month_desc;

CHANNEL_DESC         CALENDAR SALES$         RANK DENSE_RANK
-------------------- -------- -------------- ---- ----------
Direct Sales         2000-09       1,200,000    1          1
Direct Sales         2000-10       1,200,000    1          1
Partners             2000-09         600,000    3          2
Partners             2000-10         600,000    3          2
Internet             2000-09         200,000    5          3
Internet             2000-10         200,000    5          3
Note that, in the case of DENSE_RANK, the largest rank value gives the number of distinct values in the data set.
Example 21-4 Per Group Ranking Example 1
SELECT channel_desc, calendar_month_desc, TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$, RANK() OVER (PARTITION BY channel_desc ORDER BY SUM(amount_sold) DESC) AS RANK_BY_CHANNEL FROM sales, products, customers, times, channels WHERE sales.prod_id=products.prod_id AND sales.cust_id=customers.cust_id AND sales.time_id=times.time_id AND sales.channel_id=channels.channel_id AND times.calendar_month_desc IN ('2000-08', '2000-09', '2000-10', '2000-11') AND channels.channel_desc IN ('Direct Sales', 'Internet') GROUP BY channel_desc, calendar_month_desc;
A single query block can contain more than one ranking function, each partitioning the data into different groups (that is, reset on different boundaries). The groups can be mutually exclusive. The following query ranks the sales of each channel within each month (rank_within_month) and within each channel across months (rank_within_channel).
Example 21-5 Per Group Ranking Example 2
SELECT channel_desc, calendar_month_desc,
  TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$,
  RANK() OVER (PARTITION BY calendar_month_desc
    ORDER BY SUM(amount_sold) DESC) AS RANK_WITHIN_MONTH,
  RANK() OVER (PARTITION BY channel_desc
    ORDER BY SUM(amount_sold) DESC) AS RANK_WITHIN_CHANNEL
FROM sales, products, customers, times, channels
WHERE sales.prod_id=products.prod_id AND sales.cust_id=customers.cust_id
  AND sales.time_id=times.time_id AND sales.channel_id=channels.channel_id
  AND times.calendar_month_desc IN ('2000-08', '2000-09', '2000-10', '2000-11')
  AND channels.channel_desc IN ('Direct Sales', 'Internet')
GROUP BY channel_desc, calendar_month_desc;

CHANNEL_DESC         CALENDAR SALES$         RANK_WITHIN_MONTH RANK_WITHIN_CHANNEL
-------------------- -------- -------------- ----------------- -------------------
Direct Sales         2000-08       1,236,104                 1                   1
Internet             2000-08         215,107                 2                   4
Direct Sales         2000-09       1,217,808                 1                   3
Internet             2000-09         228,241                 2                   3
Direct Sales         2000-10       1,225,584                 1                   2
Internet             2000-10         239,236                 2                   2
Direct Sales         2000-11       1,115,239                 1                   4
Internet             2000-11         284,742                 2                   1
Treatment of NULLs
NULLs are treated like normal values. Also, for rank computation, a NULL value is assumed to be equal to another NULL value. Depending on the ASC | DESC options provided for measures and the NULLS FIRST | NULLS LAST clause, NULLs will either sort low or high and hence, are given ranks appropriately. The following example shows how NULLs are ranked in different cases:
SELECT times.time_id time, sold,
  RANK() OVER (ORDER BY (sold) DESC NULLS LAST) AS NLAST_DESC,
  RANK() OVER (ORDER BY (sold) DESC NULLS FIRST) AS NFIRST_DESC,
  RANK() OVER (ORDER BY (sold) ASC NULLS FIRST) AS NFIRST,
  RANK() OVER (ORDER BY (sold) ASC NULLS LAST) AS NLAST
FROM
  (SELECT time_id, SUM(sales.amount_sold) sold
   FROM sales, products, customers, countries
   WHERE sales.prod_id=products.prod_id AND sales.cust_id=customers.cust_id
     AND prod_name IN ('Envoy Ambassador', 'Mouse Pad') AND country_iso_code ='GB'
   GROUP BY time_id) v, times
WHERE v.time_id (+) =times.time_id AND calendar_year=1999
  AND calendar_month_number=1
ORDER BY sold DESC NULLS LAST;

TIME      SOLD       NLAST_DESC NFIRST_DESC NFIRST NLAST
--------- ---------- ---------- ----------- ------ -----
14-JAN-99   25241.48          1          13     31    19
21-JAN-99   24365.05          2          14     30    18
10-JAN-99   22901.24          3          15     29    17
20-JAN-99   16578.19          4          16     28    16
16-JAN-99   15881.12          5          17     27    15
30-JAN-99   15637.49          6          18     26    14
17-JAN-99   13262.87          7          19     25    13
25-JAN-99   13227.08          8          20     24    12
03-JAN-99    9885.74          9          21     23    11
28-JAN-99    4471.08         10          22     22    10
27-JAN-99    3453.66         11          23     21     9
23-JAN-99     925.45         12          24     20     8
07-JAN-99     756.87         13          25     19     7
08-JAN-99      571.8         14          26     18     6
13-JAN-99     569.21         15          27     17     5
02-JAN-99     316.87         16          28     16     4
12-JAN-99     195.54         17          29     15     3
26-JAN-99      92.96         18          30     14     2
19-JAN-99      86.04         19          31     13     1
05-JAN-99                    20           1      1    20
01-JAN-99                    20           1      1    20
31-JAN-99                    20           1      1    20
11-JAN-99                    20           1      1    20
06-JAN-99                    20           1      1    20
18-JAN-99                    20           1      1    20
09-JAN-99                    20           1      1    20
29-JAN-99                    20           1      1    20
04-JAN-99                    20           1      1    20
15-JAN-99                    20           1      1    20
22-JAN-99                    20           1      1    20
24-JAN-99                    20           1      1    20
Bottom N Ranking
Bottom N is similar to top N except for the ordering sequence within the rank expression. Using the previous example, you can order SUM(s_amount) ascending instead of descending.
CUME_DIST Function
The CUME_DIST function (defined as the inverse of percentile in some statistical books) computes the position of a specified value relative to a set of values. The order can be ascending or descending. Ascending is the default. The range of values for CUME_DIST is from greater than 0 to 1. To compute the CUME_DIST of a value x in a set S of size N, you use the formula:
CUME_DIST(x) = number of values in S coming before and including x in the specified order/ N
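As a quick worked illustration of the formula (the literal values are made up for this sketch): in the set {10, 20, 20, 30} ordered ascending, CUME_DIST of 20 is 3/4 = .75, because three of the four values come before or are equal to 20. You can verify this directly:

SELECT val, CUME_DIST() OVER (ORDER BY val) AS cume_dist
FROM (SELECT 10 val FROM DUAL UNION ALL
      SELECT 20 FROM DUAL UNION ALL
      SELECT 20 FROM DUAL UNION ALL
      SELECT 30 FROM DUAL);

The query returns .25 for 10, .75 for both occurrences of 20, and 1 for 30.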
The semantics of various options in the CUME_DIST function are similar to those in the RANK function. The default order is ascending, implying that the lowest value gets the lowest CUME_DIST (as all other values come later than this value in the order). NULLs are treated the same as they are in the RANK function. They are counted toward both the numerator and the denominator as they are treated like non-NULL values. The following example finds the cumulative distribution of sales by channel within each month:
SELECT calendar_month_desc AS MONTH, channel_desc,
  TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$,
  CUME_DIST() OVER (PARTITION BY calendar_month_desc
    ORDER BY SUM(amount_sold)) AS CUME_DIST_BY_CHANNEL
FROM sales, products, customers, times, channels
WHERE sales.prod_id=products.prod_id AND sales.cust_id=customers.cust_id
  AND sales.time_id=times.time_id AND sales.channel_id=channels.channel_id
  AND times.calendar_month_desc IN ('2000-09', '2000-07', '2000-08')
GROUP BY calendar_month_desc, channel_desc;

MONTH    CHANNEL_DESC         SALES$         CUME_DIST_BY_CHANNEL
-------- -------------------- -------------- --------------------
2000-07  Internet                    140,423           .333333333
2000-07  Partners                    611,064           .666666667
2000-07  Direct Sales              1,145,275                    1
2000-08  Internet                    215,107           .333333333
2000-08  Partners                    661,045           .666666667
2000-08  Direct Sales              1,236,104                    1
2000-09  Internet                    228,241           .333333333
2000-09  Partners                    666,172           .666666667
2000-09  Direct Sales              1,217,808                    1
PERCENT_RANK Function
PERCENT_RANK is similar to CUME_DIST, but it uses rank values rather than row counts in its numerator. Therefore, it returns the percent rank of a value relative to a group of values. The function is available in many popular spreadsheets. PERCENT_RANK of a row is calculated as:
(rank of row in its partition - 1) / (number of rows in the partition - 1)
PERCENT_RANK returns values in the range zero to one. The row(s) with a rank of 1 will have a PERCENT_RANK of zero. Its syntax is:
PERCENT_RANK () OVER ([query_partition_clause] order_by_clause)
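No worked PERCENT_RANK query appears at this point in the text, so the following is only an illustrative sketch modeled on the earlier CUME_DIST example (the same sample-schema tables and filter values are reused):

SELECT calendar_month_desc AS MONTH, channel_desc,
  TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$,
  PERCENT_RANK() OVER (PARTITION BY calendar_month_desc
    ORDER BY SUM(amount_sold)) AS PERCENT_RANK_BY_CHANNEL
FROM sales, products, customers, times, channels
WHERE sales.prod_id=products.prod_id AND sales.cust_id=customers.cust_id
  AND sales.time_id=times.time_id AND sales.channel_id=channels.channel_id
  AND times.calendar_month_desc IN ('2000-07', '2000-08', '2000-09')
GROUP BY calendar_month_desc, channel_desc;

With three channels in each month, as in the CUME_DIST example, the values returned within each month would be 0, .5, and 1.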
NTILE Function
NTILE allows easy calculation of tertiles, quartiles, deciles and other common summary statistics. This function divides an ordered partition into a specied number of groups called buckets and assigns a bucket number to each row in the partition. NTILE is a very useful calculation because it lets users divide a data set into fourths, thirds, and other groupings. The buckets are calculated so that each bucket has exactly the same number of rows assigned to it or at most 1 row more than the others. For instance, if you have 100 rows in a partition and ask for an NTILE function with four buckets, 25 rows will be assigned a value of 1, 25 rows will have value 2, and so on. These buckets are referred to as equiheight buckets. If the number of rows in the partition does not divide evenly (without a remainder) into the number of buckets, then the number of rows assigned for each bucket will differ by one at most. The extra rows will be distributed one for each bucket starting from the lowest bucket number. For instance, if there are 103 rows in a partition
which has an NTILE(5) function, the first 21 rows will be in the first bucket, the next 21 in the second bucket, the next 21 in the third bucket, the next 20 in the fourth bucket and the final 20 in the fifth bucket. The NTILE function has the following syntax:
NTILE (expr) OVER ([query_partition_clause] order_by_clause)
In this, the N in NTILE(N) can be a constant (for example, 5) or an expression. This function, like RANK and CUME_DIST, has a PARTITION BY clause for per group computation, an ORDER BY clause for specifying the measures and their sort order, and a NULLS FIRST | NULLS LAST clause for the specific treatment of NULLs. The following example assigns each month's sales total to one of four buckets:
SELECT calendar_month_desc AS MONTH , TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$, NTILE(4) OVER (ORDER BY SUM(amount_sold)) AS TILE4 FROM sales, products, customers, times, channels WHERE sales.prod_id=products.prod_id AND sales.cust_id=customers.cust_id AND sales.time_id=times.time_id AND sales.channel_id=channels.channel_id AND times.calendar_year=2000 AND prod_category= 'Electronics' GROUP BY calendar_month_desc; MONTH SALES$ TILE4 -------- -------------- ---------2000-02 242,416 1 2000-01 257,286 1 2000-03 280,011 1 2000-06 315,951 2 2000-05 316,824 2 2000-04 318,106 2 2000-07 433,824 3 2000-08 477,833 3 2000-12 553,534 3 2000-10 652,225 4 2000-11 661,147 4 2000-09 691,449 4
NTILE ORDER BY statements must be fully specified to yield reproducible results. Equal values can get distributed across adjacent buckets. To ensure deterministic results, you must order on a unique key.
ROW_NUMBER Function
The ROW_NUMBER function assigns a unique number (sequentially, starting from 1, as defined by ORDER BY) to each row within the partition. It has the following syntax:
ROW_NUMBER ( ) OVER ( [query_partition_clause] order_by_clause )

Example 21-6 ROW_NUMBER
SELECT channel_desc, calendar_month_desc,
  TO_CHAR(TRUNC(SUM(amount_sold), -5), '9,999,999,999') SALES$,
  ROW_NUMBER() OVER (ORDER BY TRUNC(SUM(amount_sold), -6) DESC) AS ROW_NUMBER
FROM sales, products, customers, times, channels
WHERE sales.prod_id=products.prod_id AND sales.cust_id=customers.cust_id
  AND sales.time_id=times.time_id AND sales.channel_id=channels.channel_id
  AND times.calendar_month_desc IN ('2001-09', '2001-10')
GROUP BY channel_desc, calendar_month_desc;

CHANNEL_DESC         CALENDAR SALES$         ROW_NUMBER
-------------------- -------- -------------- ----------
Direct Sales         2001-09       1,100,000          1
Direct Sales         2001-10       1,000,000          2
Internet             2001-09         500,000          3
Internet             2001-10         700,000          4
Partners             2001-09         600,000          5
Partners             2001-10         600,000          6
Note that there are three pairs of tie values in these results. Like NTILE, ROW_NUMBER is a non-deterministic function, so each tied value could have its row number switched. To ensure deterministic results, you must order on a unique key. In most cases, that will require adding a new tie-breaker column to the query and using it in the ORDER BY specification.
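As a sketch of that advice applied to Example 21-6: the grouping columns channel_desc and calendar_month_desc together form a unique key for each result row, so appending them to the analytic ORDER BY makes the numbering deterministic.

SELECT channel_desc, calendar_month_desc,
  TO_CHAR(TRUNC(SUM(amount_sold), -5), '9,999,999,999') SALES$,
  ROW_NUMBER() OVER (ORDER BY TRUNC(SUM(amount_sold), -6) DESC,
    channel_desc, calendar_month_desc) AS ROW_NUMBER
FROM sales, products, customers, times, channels
WHERE sales.prod_id=products.prod_id AND sales.cust_id=customers.cust_id
  AND sales.time_id=times.time_id AND sales.channel_id=channels.channel_id
  AND times.calendar_month_desc IN ('2001-09', '2001-10')
GROUP BY channel_desc, calendar_month_desc;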
Windowing Aggregate Functions

Windowing functions can be used to compute cumulative, moving, and centered aggregates. They provide access to more than one row of a table without a self-join. The syntax of the windowing functions is:
{SUM|AVG|MAX|MIN|COUNT|STDDEV|VARIANCE|FIRST_VALUE|LAST_VALUE}
   ({value expression1 | *}) OVER
   ([PARTITION BY value expression2[,...]]
    ORDER BY value expression3 [collate clause] [ASC | DESC]
      [NULLS FIRST | NULLS LAST] [,...]
    {ROWS | RANGE}
      {BETWEEN
         {UNBOUNDED PRECEDING | CURRENT ROW | value_expr {PRECEDING | FOLLOWING}}
       AND
         {UNBOUNDED FOLLOWING | CURRENT ROW | value_expr {PRECEDING | FOLLOWING}}
       | {UNBOUNDED PRECEDING | CURRENT ROW | value_expr PRECEDING}})
See Also: Oracle Database SQL Reference for further information regarding syntax and restrictions
AND c.cust_id IN (2595, 9646, 11111)
GROUP BY c.cust_id, t.calendar_quarter_desc
ORDER BY c.cust_id, t.calendar_quarter_desc;

CUST_ID    CALENDA Q_SALES           CUM_SALES
---------- ------- ----------------- -----------------
      2595 2000-01            659.92            659.92
      2595 2000-02            224.79            884.71
      2595 2000-03            313.90          1,198.61
      2595 2000-04          6,015.08          7,213.69
      9646 2000-01          1,337.09          1,337.09
      9646 2000-02            185.67          1,522.76
      9646 2000-03            203.86          1,726.62
      9646 2000-04            458.29          2,184.91
     11111 2000-01             43.18             43.18
     11111 2000-02             33.33             76.51
     11111 2000-03            579.73            656.24
     11111 2000-04            307.58            963.82
In this example, the analytic function SUM defines, for each row, a window that starts at the beginning of the partition (UNBOUNDED PRECEDING) and ends, by default, at the current row. Nested SUMs are needed in this example since we are performing a SUM over a value that is itself a SUM. Nested aggregations are used very often in analytic aggregate functions.
Example 21-8 Moving Aggregate Function
This example of a time-based window shows, for one customer, the moving average of sales for the current month and preceding two months:
SELECT c.cust_id, t.calendar_month_desc, TO_CHAR (SUM(amount_sold), '9,999,999,999') AS SALES, TO_CHAR(AVG(SUM(amount_sold)) OVER (ORDER BY c.cust_id, t.calendar_month_desc ROWS 2 PRECEDING), '9,999,999,999') AS MOVING_3_MONTH_AVG FROM sales s, times t, customers c WHERE s.time_id=t.time_id AND s.cust_id=c.cust_id AND t.calendar_year=1999 AND c.cust_id IN (6510) GROUP BY c.cust_id, t.calendar_month_desc ORDER BY c.cust_id, t.calendar_month_desc; CUST_ID CALENDAR SALES MOVING_3_MONTH ---------- -------- -------------- -------------6510 1999-04 125 125
Note that the first two rows for the three month moving average calculation in the output data are based on a smaller interval size than specified because the window calculation cannot reach past the data retrieved by the query. You need to consider the different window sizes found at the borders of result sets. In other words, you may need to modify the query to include exactly what you want.
SELECT t.time_id, TO_CHAR (SUM(amount_sold), '9,999,999,999') AS SALES, TO_CHAR(AVG(SUM(amount_sold)) OVER (ORDER BY t.time_id RANGE BETWEEN INTERVAL '1' DAY PRECEDING AND INTERVAL '1' DAY FOLLOWING), '9,999,999,999') AS CENTERED_3_DAY_AVG FROM sales s, times t WHERE s.time_id=t.time_id AND t.calendar_week_number IN (51) AND calendar_year=1999 GROUP BY t.time_id ORDER BY t.time_id; TIME_ID SALES CENTERED_3_DAY --------- -------------- -------------20-DEC-99 134,337 106,676 21-DEC-99 79,015 102,539 22-DEC-99 94,264 85,342 23-DEC-99 82,746 93,322 24-DEC-99 102,957 82,937 25-DEC-99 63,107 87,062 26-DEC-99 95,123 79,115
The starting and ending rows for each product's centered moving average calculation in the output data are based on just two days, since the window calculation cannot reach past the data retrieved by the query. Users need to consider the different window sizes found at the borders of result sets: the query may need to be adjusted.
SELECT time_id, daily_sum, SUM(daily_sum) OVER (ORDER BY time_id RANGE BETWEEN INTERVAL '10' DAY PRECEDING AND CURRENT ROW) AS current_group_sum FROM (SELECT time_id, channel_id, SUM(s.quantity_sold) AS daily_sum FROM customers c, sales s, countries WHERE c.cust_id=s.cust_id AND s.cust_id IN (638, 634, 753, 440 ) AND s.time_id BETWEEN '01-MAY-00' AND '13-MAY-00' GROUP BY time_id, channel_id); TIME_ID DAILY_SUM CURRENT_GROUP_SUM --------- ---------- ----------------06-MAY-00 161 161 10-MAY-00 23 207 10-MAY-00 23 207 11-MAY-00 46 345 11-MAY-00 92 345 12-MAY-00 23 368 13-MAY-00 46 529 13-MAY-00 115 529
In the output of this example, all dates except May 6 and May 12 return two rows with duplicate dates. The CURRENT_GROUP_SUM for a row adds every DAILY_SUM value whose date falls within the ten days preceding that row's date through the date itself, so both rows returned for a single day show the same value. Note that this example applies only when you use the RANGE keyword rather than the ROWS keyword. It is also important to remember that with RANGE, you can only use 1 ORDER BY expression in the analytic function's ORDER BY clause. With the ROWS keyword, you can use multiple order by expressions in the analytic function's ORDER BY clause.
In this statement, t_timekey is a date field. Here, fn could be a PL/SQL function with the following specification:

fn(t_timekey) returns
4 if t_timekey is Monday, Tuesday
2 otherwise
If any of the previous days are holidays, it adjusts the count appropriately.
Note that, when a window is specified using a number in a window function with ORDER BY on a date column, then it is converted to mean the number of days. You
could have also used the interval literal conversion function, as NUMTODSINTERVAL(fn(t_timekey), 'DAY') instead of just fn(t_timekey) to mean the same thing. You can also write a PL/SQL function that returns an INTERVAL datatype value.
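For illustration, a sketch of such a variable-width logical window; the table v, the measure s_amount, and the function fn are placeholders carried over from the text and are assumed to exist:

SELECT t_timekey,
   SUM(s_amount) OVER (ORDER BY t_timekey
      RANGE NUMTODSINTERVAL(fn(t_timekey), 'DAY') PRECEDING) AS var_window_sum
FROM v;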
SELECT t.time_id, TO_CHAR(amount_sold, '9,999,999,999') AS INDIV_SALE,
  TO_CHAR(SUM(amount_sold) OVER (PARTITION BY t.time_id ORDER BY t.time_id
    ROWS UNBOUNDED PRECEDING), '9,999,999,999') AS CUM_SALES
FROM sales s, times t, customers c
WHERE s.time_id=t.time_id AND s.cust_id=c.cust_id
  AND t.time_id IN (TO_DATE('11-DEC-1999'), TO_DATE('12-DEC-1999'))
  AND c.cust_id BETWEEN 6500 AND 6600
ORDER BY t.time_id;

TIME_ID   INDIV_SALE CUM_SALES
--------- ---------- ---------
12-DEC-99         23        23
12-DEC-99          9        32
12-DEC-99         14        46
12-DEC-99         24        70
12-DEC-99         19        89
One way to handle this problem would be to add the prod_id column to the result set and order on both time_id and prod_id.
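A sketch of that adjustment follows; only the analytic ORDER BY changes (plus the extra s.prod_id column in the select list), so this is a variation on the preceding query rather than new functionality:

SELECT t.time_id, s.prod_id, TO_CHAR(amount_sold, '9,999,999,999') AS INDIV_SALE,
  TO_CHAR(SUM(amount_sold) OVER (PARTITION BY t.time_id
    ORDER BY t.time_id, s.prod_id ROWS UNBOUNDED PRECEDING),
    '9,999,999,999') AS CUM_SALES
FROM sales s, times t, customers c
WHERE s.time_id=t.time_id AND s.cust_id=c.cust_id
  AND t.time_id IN (TO_DATE('11-DEC-1999'), TO_DATE('12-DEC-1999'))
  AND c.cust_id BETWEEN 6500 AND 6600
ORDER BY t.time_id, s.prod_id;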
If the IGNORE NULLS option is used with FIRST_VALUE, it will return the rst non-null value in the set, or NULL if all values are NULL. If IGNORE NULLS is used with LAST_VALUE, it will return the last non-null value in the set, or NULL if all values are NULL. The IGNORE NULLS option is particularly useful in populating an inventory table properly.
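As a sketch of the option (the table inventory and the column qty are invented for this illustration), the following carries the most recent non-null quantity forward within each product:

SELECT prod_id, time_id, qty,
   LAST_VALUE(qty IGNORE NULLS) OVER (PARTITION BY prod_id ORDER BY time_id
      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS last_known_qty
FROM inventory;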
Reporting Aggregate Functions

An asterisk (*) is only allowed in COUNT(*).
DISTINCT is supported only if the corresponding aggregate functions allow it.
value expression1 and value expression2 can be any valid expression involving column references or aggregates.
The PARTITION BY clause defines the groups on which the reporting functions would be computed. If the PARTITION BY clause is absent, then the function is computed over the whole query result set.
Reporting functions can appear only in the SELECT clause or the ORDER BY clause. The major benefit of reporting functions is their ability to do multiple passes of data in a single query block and speed up query performance. Queries such as "Count the number of salesmen with sales more than 10% of city sales" do not require joins between separate query blocks. For example, consider the question "For each product category, find the region in which it had maximum sales". The equivalent SQL query using the MAX reporting aggregate function would be:
SELECT prod_category, country_region, sales FROM (SELECT SUBSTR(p.prod_category,1,8) AS prod_category, co.country_region, SUM(amount_sold) AS sales, MAX(SUM(amount_sold)) OVER (PARTITION BY prod_category) AS MAX_REG_SALES FROM sales s, customers c, countries co, products p
WHERE s.cust_id=c.cust_id AND c.country_id=co.country_id AND s.prod_id =p.prod_id AND s.time_id = TO_DATE('11-OCT-2001') GROUP BY prod_category, country_region) WHERE sales = MAX_REG_SALES;
The inner query with the reporting aggregate function MAX(SUM(amount_sold)) returns:
PROD_CAT COUNTRY_REGION       SALES      MAX_REG_SALES
-------- -------------------- ---------- -------------
Electron Americas                 581.92        581.92
Hardware Americas                 925.93        925.93
Peripher Americas                3084.48       4290.38
Peripher Asia                    2616.51       4290.38
Peripher Europe                  4290.38       4290.38
Peripher Oceania                  940.43       4290.38
Software Americas                 4445.7        4445.7
Software Asia                    1408.19        4445.7
Software Europe                  3288.83        4445.7
Software Oceania                  890.25        4445.7
Example 21-12
Reporting aggregates combined with nested queries enable you to answer complex queries efficiently. For example, what if you want to know the best-selling products in your most significant product subcategories? The following is a query which finds the 5 top-selling products for each product subcategory that contributes more than 20% of the sales within its product category:
SELECT SUBSTR(prod_category,1,8) AS CATEG, prod_subcategory, prod_id, SALES FROM (SELECT p.prod_category, p.prod_subcategory, p.prod_id, SUM(amount_sold) AS SALES, SUM(SUM(amount_sold)) OVER (PARTITION BY p.prod_category) AS CAT_SALES, SUM(SUM(amount_sold)) OVER (PARTITION BY p.prod_subcategory) AS SUBCAT_SALES, RANK() OVER (PARTITION BY p.prod_subcategory ORDER BY SUM(amount_sold) ) AS RANK_IN_LINE
FROM sales s, customers c, countries co, products p WHERE s.cust_id=c.cust_id AND c.country_id=co.country_id AND s.prod_id=p.prod_id AND s.time_id=to_DATE('11-OCT-2000') GROUP BY p.prod_category, p.prod_subcategory, p.prod_id ORDER BY prod_category, prod_subcategory) WHERE SUBCAT_SALES>0.2*CAT_SALES AND RANK_IN_LINE<=5;
RATIO_TO_REPORT Function
The RATIO_TO_REPORT function computes the ratio of a value to the sum of a set of values. If the expression value expression evaluates to NULL, RATIO_TO_REPORT also evaluates to NULL, but it is treated as zero for computing the sum of values for the denominator. Its syntax is:
RATIO_TO_REPORT ( expr ) OVER ( [query_partition_clause] )
expr can be any valid expression involving column references or aggregates. The PARTITION BY clause defines the groups on which the RATIO_TO_REPORT function is to be computed. If the PARTITION BY clause is absent, then the function is computed over the whole query result set.
Example 21-13 RATIO_TO_REPORT
To calculate RATIO_TO_REPORT of sales for each channel, you might use the following syntax:
SELECT ch.channel_desc, TO_CHAR(SUM(amount_sold),'9,999,999') AS SALES,
  TO_CHAR(SUM(SUM(amount_sold)) OVER (), '9,999,999') AS TOTAL_SALES,
  TO_CHAR(RATIO_TO_REPORT(SUM(amount_sold)) OVER (), '9.999') AS RATIO_TO_REPORT
FROM sales s, channels ch
WHERE s.channel_id=ch.channel_id AND s.time_id=to_DATE('11-OCT-2000')
GROUP BY ch.channel_desc;

CHANNEL_DESC              SALES TOTAL_SALE RATIO_
-------------------- ---------- ---------- ------
Direct Sales             14,447     23,183   .623
Internet                    345     23,183   .015
Partners                  8,391     23,183   .362
LAG/LEAD Functions
The LAG and LEAD functions are useful for comparing values when the relative positions of rows can be known reliably. They work by specifying the count of rows which separate the target row from the current row. Because the functions provide access to more than one row of a table at the same time without a self-join, they can enhance processing speed. The LAG function provides access to a row at a given offset prior to the current position, and the LEAD function provides access to a row at a given offset after the current position.
LAG/LEAD Syntax
These functions have the following syntax:
{LAG | LEAD} ( value_expr [, offset] [, default] ) OVER ( [query_partition_clause] order_by_clause )
offset is an optional parameter and defaults to 1. default is an optional parameter and is the value returned if offset falls outside the bounds of the table or partition.
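For instance, a small sketch of the default argument (the 0 default is illustrative): passing a third argument replaces the NULL that LAG would otherwise return for the first row in the ordering.

SELECT time_id, SUM(amount_sold) AS sales,
  LAG(SUM(amount_sold), 1, 0) OVER (ORDER BY time_id) AS prior_day_sales
FROM sales
GROUP BY time_id;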
Example 21-14 LAG/LEAD
SELECT time_id, TO_CHAR(SUM(amount_sold),'9,999,999') AS SALES,
  TO_CHAR(LAG(SUM(amount_sold),1) OVER (ORDER BY time_id),'9,999,999') AS LAG1,
  TO_CHAR(LEAD(SUM(amount_sold),1) OVER (ORDER BY time_id),'9,999,999') AS LEAD1
FROM sales
WHERE time_id>=TO_DATE('10-OCT-2000') AND time_id<=TO_DATE('14-OCT-2000')
GROUP BY time_id;

TIME_ID   SALES      LAG1       LEAD1
--------- ---------- ---------- ----------
10-OCT-00    238,479                23,183
11-OCT-00     23,183    238,479     24,616
12-OCT-00     24,616     23,183     76,516
13-OCT-00     76,516     24,616     29,795
14-OCT-00     29,795     76,516
See "Data Densication for Reporting" for information showing how to use the LAG/LEAD functions for doing period-to-period comparison queries on sparse data.
FIRST/LAST Functions
The FIRST/LAST aggregate functions allow you to rank a data set and work with its top-ranked or bottom-ranked rows. After finding the top or bottom ranked rows, an aggregate function is applied to any desired column. That is, FIRST/LAST lets you rank on column A but return the result of an aggregate applied on the first-ranked or last-ranked rows of column B. This is valuable because it avoids the need for a self-join or subquery, thus improving performance. These functions' syntax begins with a regular aggregate function (MIN, MAX, SUM, AVG, COUNT, VARIANCE, STDDEV) that produces a single return value per group. To specify the ranking used, the FIRST/LAST functions add a new clause starting with the word KEEP.
FIRST/LAST Syntax
These functions have the following syntax:
aggregate_function KEEP
 ( DENSE_RANK { FIRST | LAST } ORDER BY
   expr [ DESC | ASC ] [NULLS { FIRST | LAST }]
   [, expr [ DESC | ASC ] [NULLS { FIRST | LAST }]]...
 )
[OVER query_partitioning_clause]
The following query lets us compare minimum price and list price of our products. For each product subcategory within the Electronics category, it returns the following:
- List price of the product with the lowest minimum price
- Lowest minimum price
- List price of the product with the highest minimum price
- Highest minimum price
SELECT prod_subcategory, MIN(prod_list_price)
    KEEP (DENSE_RANK FIRST ORDER BY (prod_min_price)) AS LP_OF_LO_MINP,
  MIN(prod_min_price) AS LO_MINP,
  MAX(prod_list_price) KEEP (DENSE_RANK LAST ORDER BY (prod_min_price))
    AS LP_OF_HI_MINP,
  MAX(prod_min_price) AS HI_MINP
FROM products
WHERE prod_category='Electronics'
GROUP BY prod_subcategory;

PROD_SUBCATEGORY  LP_OF_LO_MINP  LO_MINP LP_OF_HI_MINP  HI_MINP
----------------- ------------- -------- ------------- --------
Game Consoles            299.99   299.99        299.99   299.99
Home Audio               499.99   499.99        599.99   599.99
Y Box Accessories          7.99     7.99         20.99    20.99
Y Box Games                7.99     7.99         29.99    29.99
SELECT prod_id, prod_list_price,
  MIN(prod_list_price) KEEP (DENSE_RANK FIRST ORDER BY (prod_min_price))
    OVER(PARTITION BY (prod_subcategory)) AS LP_OF_LO_MINP,
  MAX(prod_list_price) KEEP (DENSE_RANK LAST ORDER BY (prod_min_price))
    OVER(PARTITION BY (prod_subcategory)) AS LP_OF_HI_MINP
FROM products WHERE prod_subcategory = 'Documentation';

   PROD_ID PROD_LIST_PRICE LP_OF_LO_MINP LP_OF_HI_MINP
---------- --------------- ------------- -------------
        40           44.99         44.99         44.99
        41           44.99         44.99         44.99
        42           44.99         44.99         44.99
        43           44.99         44.99         44.99
        44           44.99         44.99         44.99
        45           44.99         44.99         44.99
Using the FIRST and LAST functions as reporting aggregates makes it easy to include the results in calculations such as "Salary as a percent of the highest salary."
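A rough sketch of that kind of calculation against the products data used above (the column name pct_of_top is illustrative): each list price is shown as a percentage of the list price of the product with the highest minimum price in its subcategory.

SELECT prod_id, prod_list_price,
  ROUND(100 * prod_list_price /
    MAX(prod_list_price) KEEP (DENSE_RANK LAST ORDER BY prod_min_price)
      OVER (PARTITION BY prod_subcategory), 1) AS pct_of_top
FROM products;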
They require a parameter between 0 and 1 (inclusive). A parameter specified outside of this range results in an error. This parameter should be specified as an expression that evaluates to a constant. They also require a sort specification: an ORDER BY clause with a single expression. Multiple expressions are not allowed.
   CUST_ID CUST_CREDIT_LIMIT  CUME_DIST
---------- ----------------- ----------
     28344              1500 .173913043
      8962              1500 .173913043
     36651              1500 .173913043
     32497              1500 .173913043
     15192              3000 .347826087
    102077              3000 .347826087
    102343              3000 .347826087
      8270              3000 .347826087
     21380              5000  .52173913
     13808              5000  .52173913
    101784              5000  .52173913
     30420              5000  .52173913
     10346              7000 .652173913
     31112              7000 .652173913
     35266              7000 .652173913
      3424              9000 .739130435
    100977              9000 .739130435
    103066             10000 .782608696
     35225             11000 .956521739
     14459             11000 .956521739
     17268             11000 .956521739
    100421             11000 .956521739
     41496             15000          1
PERCENTILE_DISC(x) is computed by scanning up the CUME_DIST values in each group until you find the first one greater than or equal to x, where x is the specified percentile value. In the example query, where PERCENTILE_DISC(0.5) is computed, the result is 5,000, as the following illustrates:
SELECT PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY cust_credit_limit) AS perc_disc,
  PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY cust_credit_limit) AS perc_cont
FROM customers WHERE cust_city='Marshal';

 PERC_DISC  PERC_CONT
---------- ----------
      5000       5000
The result of PERCENTILE_CONT is computed by linear interpolation between rows after ordering them. To compute PERCENTILE_CONT(x), we first compute the row number RN = (1 + x*(n-1)), where n is the number of rows in the group and x is the specified percentile value. The final result of the aggregate function is computed by linear interpolation between the values from the rows at row numbers CRN = CEIL(RN) and FRN = FLOOR(RN). The final result is:

PERCENTILE_CONT(x) =
  if (CRN = FRN = RN) then (value of expression from row at RN)
  else (CRN - RN) * (value of expression for row at FRN) +
       (RN - FRN) * (value of expression for row at CRN)

Consider the previous example query, where we compute PERCENTILE_CONT(0.5). Here n is 23. The row number RN = (1 + 0.5*(n-1)) = (1 + 0.5*22) = 12. Putting this into the formula (FRN = CRN = 12), we return the value from row 12 as the result.

Another example is, if you want to compute PERCENTILE_CONT(0.66), the computed row number RN = (1 + 0.66*(n-1)) = (1 + 0.66*22) = 15.52. PERCENTILE_CONT(0.66) = (16 - 15.52)*(value of row 15) + (15.52 - 15)*(value of row 16) = (0.48*7000) + (0.52*9000) = 8040. These results are:
SELECT PERCENTILE_DISC(0.66) WITHIN GROUP (ORDER BY cust_credit_limit) AS perc_disc,
  PERCENTILE_CONT(0.66) WITHIN GROUP (ORDER BY cust_credit_limit) AS perc_cont
FROM customers WHERE cust_city='Marshal';

 PERC_DISC  PERC_CONT
---------- ----------
      9000       8040
Inverse percentile aggregate functions can appear in the HAVING clause of a query like other existing aggregate functions.
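For example, a minimal sketch of that usage (the 5,000 threshold and the grouping by city are illustrative, not taken from the examples above):

SELECT cust_city
FROM customers
GROUP BY cust_city
HAVING PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY cust_credit_limit) > 5000;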
As Reporting Aggregates
You can also use the aggregate functions PERCENTILE_CONT and PERCENTILE_DISC as reporting aggregate functions. When used as reporting aggregate functions, the syntax is similar to that of other reporting aggregates.
[PERCENTILE_CONT | PERCENTILE_DISC](constant expression) WITHIN GROUP ( ORDER BY single order by expression [ASC|DESC] [NULLS FIRST| NULLS LAST]) OVER ( [PARTITION BY value expression [,...]] )
This query computes the same thing (the median credit limit for customers in this result set), but reports the result for every row in the result set, as shown in the following output:
SELECT cust_id, cust_credit_limit, PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY cust_credit_limit) OVER () AS perc_disc,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY cust_credit_limit) OVER () AS perc_cont
FROM customers WHERE cust_city='Marshal';

   CUST_ID CUST_CREDIT_LIMIT  PERC_DISC  PERC_CONT
---------- ----------------- ---------- ----------
     28344              1500       5000       5000
      8962              1500       5000       5000
     36651              1500       5000       5000
     32497              1500       5000       5000
     15192              3000       5000       5000
    102077              3000       5000       5000
    102343              3000       5000       5000
      8270              3000       5000       5000
     21380              5000       5000       5000
     13808              5000       5000       5000
    101784              5000       5000       5000
     30420              5000       5000       5000
     10346              7000       5000       5000
     31112              7000       5000       5000
     35266              7000       5000       5000
      3424              9000       5000       5000
    100977              9000       5000       5000
    103066             10000       5000       5000
     35225             11000       5000       5000
     14459             11000       5000       5000
     17268             11000       5000       5000
    100421             11000       5000       5000
     41496             15000       5000       5000
Inverse percentile functions can use the NULLS FIRST/NULLS LAST option in the ORDER BY clause, but it will be ignored because NULLs are ignored.
Here, constant expression refers to an expression that evaluates to a constant, and there may be more than one such expression passed as arguments to the function. The ORDER BY clause can contain one or more expressions that define the sorting order on which the ranking will be based. ASC, DESC, NULLS FIRST, and NULLS LAST options are available for each expression in the ORDER BY.
Example 21-17 Hypothetical Rank and Distribution Example 1
Using the credit limit data from the customers table, you can calculate the RANK, PERCENT_RANK and CUME_DIST that a hypothetical customer with a credit limit of $6,000 would have within each city whose name begins with 'Fo'. The query and results are:
SELECT cust_city,
  RANK(6000) WITHIN GROUP (ORDER BY CUST_CREDIT_LIMIT DESC) AS HRANK,
  TO_CHAR(PERCENT_RANK(6000) WITHIN GROUP
    (ORDER BY cust_credit_limit),'9.999') AS HPERC_RANK,
  TO_CHAR(CUME_DIST (6000) WITHIN GROUP
    (ORDER BY cust_credit_limit),'9.999') AS HCUME_DIST
FROM customers
WHERE cust_city LIKE 'Fo%'
GROUP BY cust_city;

CUST_CITY                           HRANK HPERC_ HCUME_
------------------------------ ---------- ------ ------
Fondettes                              13
Fords Prairie                          18
Forest City                            47
Forest Heights                         38
Forestville                            58
Forrestcity                            51
Fort Klamath                           59
Fort William                           30
Foxborough                             52
Unlike the inverse percentile aggregates, the ORDER BY clause in the sort specification for hypothetical rank and distribution functions may take multiple expressions. The number of arguments and the expressions in the ORDER BY clause should be the same, and the arguments must be constant expressions of the same or compatible type to the corresponding ORDER BY expression. The following is an example using two arguments in several hypothetical ranking functions.
Example 21-18 Hypothetical Rank and Distribution Example 2
SELECT prod_subcategory,
  RANK(10,8) WITHIN GROUP (ORDER BY prod_list_price DESC, prod_min_price) AS HRANK,
  TO_CHAR(PERCENT_RANK(10,8) WITHIN GROUP
    (ORDER BY prod_list_price, prod_min_price),'9.999') AS HPERC_RANK,
  TO_CHAR(CUME_DIST (10,8) WITHIN GROUP
    (ORDER BY prod_list_price, prod_min_price),'9.999') AS HCUME_DIST
FROM products
WHERE prod_subcategory LIKE 'Recordable%'
GROUP BY prod_subcategory;

PROD_SUBCATEGORY     HRANK HPERC_ HCUME_
-------------------- ----- ------ ------
Recordable CDs           4   .571   .625
Recordable DVD Discs     5   .200   .333
These functions can appear in the HAVING clause of a query just like other aggregate functions. They cannot be used as either reporting aggregate functions or windowing aggregate functions.
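A brief sketch of the HAVING usage with a hypothetical ranking function (the $50 list price and the rank cutoff of 10 are illustrative):

SELECT prod_subcategory
FROM products
GROUP BY prod_subcategory
HAVING RANK(50) WITHIN GROUP (ORDER BY prod_list_price DESC) <= 10;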
REGR_COUNT Function
REGR_AVGY and REGR_AVGX Functions
REGR_SLOPE and REGR_INTERCEPT Functions
REGR_R2 Function
REGR_SXX, REGR_SYY, and REGR_SXY Functions
Oracle applies the function to the set of (e1, e2) pairs after eliminating all pairs for which either of e1 or e2 is null. e1 is interpreted as a value of the dependent variable (a "y value"), and e2 is interpreted as a value of the independent variable (an "x value"). Both expressions must be numbers. The regression functions are all computed simultaneously during a single pass through the data. They are frequently combined with the COVAR_POP, COVAR_SAMP, and CORR functions.
REGR_COUNT Function
REGR_COUNT returns the number of non-null number pairs used to fit the regression line. If applied to an empty set (or if there are no (e1, e2) pairs where neither of e1 or e2 is null), the function returns 0.
REGR_R2 Function
The REGR_R2 function computes the coefficient of determination (usually called "R-squared" or "goodness of fit") for the regression line. REGR_R2 returns values between 0 and 1 when the regression line is defined (slope of the line is not null), and it returns NULL otherwise. The closer the value is to 1, the better the regression line fits the data.
Type of Statistic
  Adjusted R2
  Standard error
  Total sum of squares
  Regression sum of squares
  Residual sum of squares
  t statistic for slope
  t statistic for y-intercept
SLOPE, INTCPT, and RSQR are the slope, intercept, and coefficient of determination of the regression line, respectively. The (integer) value COUNT is the number of products in each channel for whom both quantity sold and list price data are available.
SELECT s.channel_id,
  REGR_SLOPE(s.quantity_sold, p.prod_list_price) SLOPE,
  REGR_INTERCEPT(s.quantity_sold, p.prod_list_price) INTCPT,
  REGR_R2(s.quantity_sold, p.prod_list_price) RSQR,
  REGR_COUNT(s.quantity_sold, p.prod_list_price) COUNT,
  REGR_AVGX(s.quantity_sold, p.prod_list_price) AVGLISTP,
  REGR_AVGY(s.quantity_sold, p.prod_list_price) AVGQSOLD
FROM sales s, products p
WHERE s.prod_id=p.prod_id AND p.prod_category='Electronics'
  AND s.time_id=to_DATE('10-OCT-2000')
GROUP BY s.channel_id;

CHANNEL_ID      SLOPE     INTCPT       RSQR      COUNT   AVGLISTP   AVGQSOLD
---------- ---------- ---------- ---------- ---------- ---------- ----------
         2          0          1          1         39 466.656667          1
         3          0          1          1         60     459.99          1
         4          0          1          1         19 526.305789          1
Frequent Itemsets
Instead of counting how often a given event occurs (for example, how often someone has purchased milk at the grocery), frequent itemsets provides a mechanism for counting how often multiple events occur together (for example, how often someone has purchased both milk and cereal together at the grocery store).

The input to the frequent-itemsets operation is a set of data that represents collections of items (itemsets). Some examples of itemsets could be all of the products that a given customer purchased in a single trip to the grocery store (commonly called a market basket), the web pages that a user accessed in a single session, or the financial services that a given customer utilizes. The notion of a frequent itemset is to find those itemsets that occur most often. If you apply the frequent-itemset operator to a grocery store's point-of-sale data, you might, for example, discover that milk and bananas are the most commonly bought pair of items.

Frequent itemsets have thus been used in business intelligence environments for many years, with the most common one being for market basket analysis in the retail industry. Frequent itemsets are integrated with the database, operating on top of relational tables and accessed through SQL. This integration provides a couple of key benefits:

- Applications that previously relied on frequent itemset operations now benefit from significantly improved performance as well as simpler implementation.
- SQL-based applications that did not previously use frequent itemsets can now be easily extended to take advantage of this functionality.
Frequent itemsets analysis is performed with the PL/SQL package DBMS_FREQUENT_ITEMSETS. See PL/SQL Packages and Types Reference for more information.
Descriptive Statistics
You can calculate the following descriptive statistics:
One-Sample T-Test
STATS_T_TEST_ONE (expr1, expr2 (a constant) [, return_value])
Paired-Samples T-Test
STATS_T_TEST_PAIRED (expr1, expr2 [, return_value])
The F-Test
STATS_F_TEST (expr1, expr2 [, return_value])
One-Way ANOVA
STATS_ONE_WAY_ANOVA (expr1, expr2 [, return_value])
Crosstab Statistics
You can calculate crosstab statistics using the following syntax:
STATS_CROSSTAB (expr1, expr2 [, return_value])
The return_value argument can specify any of the following statistics:
- Observed value of chi-squared
- Significance of observed chi-squared
- Degree of freedom for chi-squared
- Phi coefficient, Cramer's V statistic
- Contingency coefficient
- Cohen's Kappa
Mann-Whitney Test
STATS_MW_TEST (expr1, expr2 [, return_value])
Kolmogorov-Smirnov Test
STATS_KS_TEST (expr1, expr2 [, return_value])
Non-Parametric Correlation
You can calculate the following parametric statistics:
In addition to the functions, this release has a new PL/SQL package, DBMS_STAT_FUNCS. It contains the descriptive statistical function SUMMARY along with functions to support distribution fitting. The SUMMARY function summarizes a numerical column of a table with a variety of descriptive statistics. The five distribution fitting functions support normal, uniform, Weibull, Poisson, and exponential distributions.
WIDTH_BUCKET Function
For a given expression, the WIDTH_BUCKET function returns the bucket number that the result of this expression will be assigned after it is evaluated. You can generate equiwidth histograms with this function. Equiwidth histograms divide data sets into buckets whose interval size (highest value to lowest value) is equal. The number of rows held by each bucket will vary. A related function, NTILE, creates equiheight buckets.

Equiwidth histograms can be generated only for numeric, date, or datetime types, so the first three parameters should be all numeric expressions or all date expressions. Other types of expressions are not allowed. If the first parameter is NULL, the result is NULL. If the second or the third parameter is NULL, an error message is returned, as a NULL value cannot denote any end point (or any point) for a range in a date or numeric value dimension. The last parameter (number of buckets) should be a numeric expression that evaluates to a positive integer value; 0, NULL, or a negative value will result in an error.

Buckets are numbered from 0 to (n+1). Bucket 0 receives values less than the minimum, and bucket (n+1) receives values greater than or equal to the maximum specified value.
WIDTH_BUCKET Syntax
The WIDTH_BUCKET function takes four expressions as parameters. The first parameter is the expression that the equiwidth histogram is for. The second and third parameters are expressions that denote the end points of the acceptable range for the first parameter. The fourth parameter denotes the number of buckets.
WIDTH_BUCKET(expression, minval expression, maxval expression, num buckets)
Consider the following data from the table customers, which shows the credit limits of 23 customers. This data is gathered in the query shown in Example 21-19 on page 21-41.
CUST_ID --------10346 35266 41496 35225 3424 28344 31112 8962 15192 21380 36651 30420 8270 17268 14459 13808 32497 100977 102077 103066 101784 100421 102343 CUST_CREDIT_LIMIT ----------------7000 7000 15000 11000 9000 1500 7000 1500 3000 5000 1500 5000 3000 11000 11000 5000 1500 9000 3000 10000 5000 11000 3000
In the table customers, the column cust_credit_limit contains values between 1500 and 15000, and we can assign the values to four equiwidth buckets, numbered from 1 to 4, by using WIDTH_BUCKET (cust_credit_limit, 0, 20000, 4). Ideally each bucket is a closed-open interval of the real number line, for example, bucket number 2 is assigned to scores between 5000.0000 and 9999.9999..., sometimes denoted [5000, 10000) to indicate that 5,000 is included in the interval and 10,000 is excluded. To accommodate values outside the range [0, 20,000), values less than 0 are assigned to a designated underflow bucket which is numbered 0, and values greater than or equal to 20,000 are assigned to a designated overflow bucket which is numbered 5 (num buckets + 1 in general). See Figure 21-3 for a graphical illustration of how the buckets are assigned.
Figure 21-3 Bucket Assignments

Credit limits from 0 to 20,000 are divided into four equal-width intervals: bucket 1 covers [0, 5000), bucket 2 covers [5000, 10000), bucket 3 covers [10000, 15000), and bucket 4 covers [15000, 20000). Bucket 0 is the underflow bucket (values below 0) and bucket 5 is the overflow bucket (values of 20,000 or more).
You can specify the bounds in the reverse order, for example, WIDTH_BUCKET (cust_credit_limit, 20000, 0, 4). When the bounds are reversed, the buckets will be open-closed intervals. In this example, bucket number 1 is (15000,20000], bucket number 2 is (10000,15000], and bucket number 4 is (0,5000]. The overflow bucket will be numbered 0 (20000, +infinity), and the underflow bucket will be numbered 5 (-infinity, 0]. It is an error if the bucket count parameter is 0 or negative.
Example 21-19 WIDTH_BUCKET
The following query shows the bucket numbers for the credit limits in the customers table for both cases where the boundaries are specied in regular or reverse order. We use a range of 0 to 20,000.
SELECT cust_id, cust_credit_limit,
  WIDTH_BUCKET(cust_credit_limit, 0, 20000, 4) AS WIDTH_BUCKET_UP,
  WIDTH_BUCKET(cust_credit_limit, 20000, 0, 4) AS WIDTH_BUCKET_DOWN
FROM customers
WHERE cust_city = 'Marshal';

   CUST_ID CUST_CREDIT_LIMIT WIDTH_BUCKET_UP WIDTH_BUCKET_DOWN
---------- ----------------- --------------- -----------------
     10346              7000               2                 3
     35266              7000               2                 3
     41496             15000               4                 2
     35225             11000               3                 2
      3424              9000               2                 3
     28344              1500               1                 4
     31112              7000               2                 3
      8962              1500               1                 4
     15192              3000               1                 4
     21380              5000               2                 4
     36651              1500               1                 4
     30420              5000               2                 4
      8270              3000               1                 4
     17268             11000               3                 2
     14459             11000               3                 2
     13808              5000               2                 4
     32497              1500               1                 4
    100977              9000               2                 3
    102077              3000               1                 4
    103066             10000               3                 3
    101784              5000               2                 4
    100421             11000               3                 2
    102343              3000               1                 4
- Highly complex functions can be programmed using a fully procedural language.
- Higher scalability than other techniques when user-defined functions are programmed for parallel processing.
- Object datatypes can be processed.
As a simple example of a user-defined aggregate function, consider the skew statistic. This calculation measures if a data set has a lopsided distribution about its mean. It will tell you if one tail of the distribution is significantly larger than the other. If you created a user-defined aggregate called USERDEF_SKEW and applied it to the credit limit data in the prior example, the SQL statement and results might look like this:
SELECT USERDEF_SKEW(cust_credit_limit) FROM customers WHERE cust_city='Marshal'; USERDEF_SKEW ============ 0.583891
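The implementation behind USERDEF_SKEW is not shown in this guide. The following is a minimal sketch of how such an aggregate could be built with the ODCIAggregate interface; the type name SkewImpl, the accumulated moments, and the population-skewness formula used at termination are illustrative assumptions, not the definition that produced the output shown above.

CREATE OR REPLACE TYPE SkewImpl AS OBJECT (
  cnt NUMBER,                            -- number of inputs seen so far
  s1  NUMBER,                            -- running sum of x
  s2  NUMBER,                            -- running sum of x*x
  s3  NUMBER,                            -- running sum of x*x*x
  STATIC FUNCTION ODCIAggregateInitialize(sctx IN OUT SkewImpl) RETURN NUMBER,
  MEMBER FUNCTION ODCIAggregateIterate(self IN OUT SkewImpl, value IN NUMBER)
    RETURN NUMBER,
  MEMBER FUNCTION ODCIAggregateMerge(self IN OUT SkewImpl, ctx2 IN SkewImpl)
    RETURN NUMBER,
  MEMBER FUNCTION ODCIAggregateTerminate(self IN SkewImpl,
    returnValue OUT NUMBER, flags IN NUMBER) RETURN NUMBER
);
/
CREATE OR REPLACE TYPE BODY SkewImpl IS
  STATIC FUNCTION ODCIAggregateInitialize(sctx IN OUT SkewImpl) RETURN NUMBER IS
  BEGIN
    sctx := SkewImpl(0, 0, 0, 0);
    RETURN ODCIConst.Success;
  END;
  MEMBER FUNCTION ODCIAggregateIterate(self IN OUT SkewImpl, value IN NUMBER)
    RETURN NUMBER IS
  BEGIN
    self.cnt := self.cnt + 1;
    self.s1  := self.s1 + value;
    self.s2  := self.s2 + value * value;
    self.s3  := self.s3 + value * value * value;
    RETURN ODCIConst.Success;
  END;
  MEMBER FUNCTION ODCIAggregateMerge(self IN OUT SkewImpl, ctx2 IN SkewImpl)
    RETURN NUMBER IS
  BEGIN
    -- combine partial results computed by parallel query slaves
    self.cnt := self.cnt + ctx2.cnt;
    self.s1  := self.s1 + ctx2.s1;
    self.s2  := self.s2 + ctx2.s2;
    self.s3  := self.s3 + ctx2.s3;
    RETURN ODCIConst.Success;
  END;
  MEMBER FUNCTION ODCIAggregateTerminate(self IN SkewImpl,
    returnValue OUT NUMBER, flags IN NUMBER) RETURN NUMBER IS
    v_mean NUMBER;
    v_var  NUMBER;
  BEGIN
    IF self.cnt = 0 THEN
      returnValue := NULL;
      RETURN ODCIConst.Success;
    END IF;
    v_mean := self.s1 / self.cnt;
    v_var  := self.s2 / self.cnt - v_mean * v_mean;   -- population variance
    IF v_var <= 0 THEN
      returnValue := 0;
    ELSE
      -- population skewness: E[(x - mean)^3] / stddev^3
      returnValue := (self.s3 / self.cnt - 3 * v_mean * self.s2 / self.cnt
                      + 2 * v_mean * v_mean * v_mean) / POWER(v_var, 1.5);
    END IF;
    RETURN ODCIConst.Success;
  END;
END;
/
CREATE OR REPLACE FUNCTION userdef_skew(input NUMBER) RETURN NUMBER
  PARALLEL_ENABLE AGGREGATE USING SkewImpl;
/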
Before building user-dened aggregate functions, you should consider if your needs can be met in regular SQL. Many complex calculations are possible directly in SQL, particularly by using the CASE expression. Staying with regular SQL will enable simpler development, and many query operations are already well-parallelized in SQL. Even the earlier example, the skew statistic, can be created using standard, albeit lengthy, SQL.
CASE Expressions
Oracle now supports simple and searched CASE statements. CASE statements are similar in purpose to the DECODE function, but they offer more flexibility and logical power. They are also easier to read than traditional DECODE expressions, and offer better performance as well. They are commonly used when breaking categories into buckets like age (for example, 20-29, 30-39, and so on). The syntax for simple statements is:
CASE expr WHEN comparison_expr THEN return_expr
  [WHEN comparison_expr THEN return_expr]... [ELSE else_expr]
END
You can specify only 255 arguments and each WHEN ... THEN pair counts as two arguments. For a workaround to this limit, see Oracle Database SQL Reference.
Example 21-20 CASE
Suppose you wanted to find the average salary of all employees in the company. If an employee's salary is less than $2000, you want the query to use $2000 instead. Without a CASE statement, you would have to write this query as follows:
SELECT AVG(foo(e.sal)) FROM emps e;
In this, foo is a function that returns its input if the input is greater than 2000, and returns 2000 otherwise. The query has performance implications because it needs to invoke a function for each row. Writing custom functions can also add to the development load. Using CASE expressions in the database without PL/SQL, this query can be rewritten as:
SELECT AVG(CASE when e.sal > 2000 THEN e.sal ELSE 2000 end) FROM emps e;
Using a CASE expression lets you avoid developing custom functions and can also perform faster.
Example 21-21 Histogram Example 1

SELECT SUM(CASE WHEN cust_credit_limit BETWEEN 0 AND 3999 THEN 1 ELSE 0 END)
    AS "0-3999",
  SUM(CASE WHEN cust_credit_limit BETWEEN 4000 AND 7999 THEN 1 ELSE 0 END)
    AS "4000-7999",
  SUM(CASE WHEN cust_credit_limit BETWEEN 8000 AND 11999 THEN 1 ELSE 0 END)
    AS "8000-11999",
  SUM(CASE WHEN cust_credit_limit BETWEEN 12000 AND 16000 THEN 1 ELSE 0 END)
    AS "12000-16000"
FROM customers WHERE cust_city = 'Marshal';

    0-3999  4000-7999 8000-11999 12000-16000
---------- ---------- ---------- -----------
         8          7          7           1

Example 21-22 Histogram Example 2
SELECT (CASE WHEN cust_credit_limit BETWEEN 0 AND 3999 THEN ' 0 - 3999'
    WHEN cust_credit_limit BETWEEN 4000 AND 7999 THEN ' 4000 - 7999'
    WHEN cust_credit_limit BETWEEN 8000 AND 11999 THEN ' 8000 - 11999'
    WHEN cust_credit_limit BETWEEN 12000 AND 16000 THEN '12000 - 16000' END)
  AS BUCKET, COUNT(*) AS Count_in_Group
FROM customers WHERE cust_city = 'Marshal'
GROUP BY (CASE WHEN cust_credit_limit BETWEEN 0 AND 3999 THEN ' 0 - 3999'
    WHEN cust_credit_limit BETWEEN 4000 AND 7999 THEN ' 4000 - 7999'
    WHEN cust_credit_limit BETWEEN 8000 AND 11999 THEN ' 8000 - 11999'
    WHEN cust_credit_limit BETWEEN 12000 AND 16000 THEN '12000 - 16000' END);

BUCKET        COUNT_IN_GROUP
------------- --------------
 0 - 3999                  8
 4000 - 7999               7
 8000 - 11999              7
12000 - 16000              1
FROM table_reference
LEFT OUTER JOIN table_reference
PARTITION BY (expr [, expr ]... )
Note that FULL OUTER JOIN is not supported with a partitioned outer join.
In this example, we would expect 22 rows of data (11 weeks each from 2 years) if the data were dense. However, we get only 18 rows because weeks 25 and 26 are missing in 2000, and weeks 26 and 28 in 2001.
Note that in this query, a WHERE condition was placed for weeks between 20 and 30 in the inline view for the time dimension. This was introduced to keep the result set small.
In this query, the WITH subquery factoring clause v1 summarizes sales data at the product, country, and year level. This result is sparse but users may want to see all the country, year combinations for each product. To achieve this, we take each partition of v1 based on product values and outer join it on the country dimension first. This will give us all values of country for each product. We then take that result and partition it on product and country values and then outer join it on the time dimension. This will give us all time values for each product and country combination.
   PROD_ID COUNTRY_ID CALENDAR_YEAR      UNITS      SALES
---------- ---------- ------------- ---------- ----------
       147      52782          1998
       147      52782          1999         29     209.82
       147      52782          2000         71     594.36
       147      52782          2001        345    2754.42
       147      52782          2002
       147      52785          1998          1       7.99
       147      52785          1999
       147      52785          2000
       147      52785          2001
       147      52785          2002
       147      52786          1998          1       7.99
       147      52786          1999
       147      52786          2000          2      15.98
       147      52786          2001
       147      52786          2002
       147      52787          1998
       147      52787          1999
       147      52787          2000
       147      52787          2001
       147      52787          2002
       147      52788          1998
       147      52788          1999
       147      52788          2000          1       7.99
       147      52788          2001
       147      52788          2002
       148      52782          1998        139    4046.67
       148      52782          1999        228    5362.57
       148      52782          2000        251    5629.47
       148      52782          2001        308    7138.98
       148      52782          2002
       148      52785          1998
       148      52785          1999
       148      52785          2000
       148      52785          2001
       148      52785          2002
       148      52786          1998
       148      52786          1999
...
For reporting purposes, users may want to see this inventory data differently. For example, they may want to see all values of time for each product. This can be accomplished using a partitioned outer join. In addition, for the newly inserted rows of missing time periods, users may want to see the values for the quantity of units column carried over from the most recent existing time period. The latter can be accomplished using the analytic window function LAST_VALUE. Here is the query and the desired output:
WITH v1 AS (SELECT time_id FROM times WHERE times.time_id BETWEEN TO_DATE('01/04/01', 'DD/MM/YY') AND TO_DATE('07/04/01', 'DD/MM/YY')) SELECT product, time_id, quant quantity, LAST_VALUE(quant IGNORE NULLS) OVER (PARTITION BY product ORDER BY time_id) repeated_quantity FROM (SELECT product, v1.time_id, quant FROM invent_table PARTITION BY (product) RIGHT OUTER JOIN v1 ON (v1.time_id = invent_table.time_id)) ORDER BY 1, 2;
The inner query computes a partitioned outer join on time within each product. The inner query densifies the data on the time dimension (meaning the time dimension will now have a row for each day of the week). However, the measure column quantity will have nulls for the newly added rows (see the output in the column quantity in the following results). The outer query uses the analytic function LAST_VALUE. Applying this function partitions the data by product and orders the data on the time dimension column (time_id). For each row, the function finds the last non-null value in the window due to the option IGNORE NULLS, which you can use with both LAST_VALUE and FIRST_VALUE. We see the desired output in the column repeated_quantity in the following output:
PRODUCT    TIME_ID   QUANTITY REPEATED_QUANTITY
---------- --------- -------- -----------------
bottle     01-APR-01       10                10
bottle     02-APR-01                         10
bottle     03-APR-01                         10
bottle     04-APR-01                         10
bottle     05-APR-01                         10
bottle     06-APR-01        8                 8
bottle     07-APR-01                          8
can        01-APR-01       15                15
can        02-APR-01                         15
can        03-APR-01                         15
can        04-APR-01       11                11
can        05-APR-01                         11
can        06-APR-01                         11
can        07-APR-01                         11
WITH V AS (SELECT substr(p.prod_name,1,12) prod_name, calendar_month_desc, SUM(quantity_sold) units, SUM(amount_sold) sales FROM sales s, products p, times t WHERE s.prod_id in (122,136) AND calendar_year = 2000 AND t.time_id = s.time_id AND p.prod_id = s.prod_id GROUP BY p.prod_name, calendar_month_desc) SELECT v.prod_name, calendar_month_desc, units, sales, NVL(units, AVG(units) OVER (partition by v.prod_name)) computed_units, NVL(sales, AVG(sales) OVER (partition by v.prod_name)) computed_sales FROM (SELECT DISTINCT calendar_month_desc FROM times WHERE calendar_year = 2000) t
LEFT OUTER JOIN V
PARTITION BY (prod_name)
USING (calendar_month_desc);

PROD_NAME    CALENDAR      UNITS      SALES computed_units computed_sales
------------ -------- ---------- ---------- -------------- --------------
64MB Memory  2000-01         112    4129.72            112        4129.72
64MB Memory  2000-02         190       7049            190           7049
64MB Memory  2000-03          47    1724.98             47        1724.98
64MB Memory  2000-04          20      739.4             20          739.4
64MB Memory  2000-05          47    1738.24             47        1738.24
64MB Memory  2000-06          20      739.4             20          739.4
64MB Memory  2000-07                            72.6666667        2686.79
64MB Memory  2000-08                            72.6666667        2686.79
64MB Memory  2000-09                            72.6666667        2686.79
64MB Memory  2000-10                            72.6666667        2686.79
64MB Memory  2000-11                            72.6666667        2686.79
64MB Memory  2000-12                            72.6666667        2686.79
DVD-R Discs, 2000-01         167     3683.5            167         3683.5
DVD-R Discs, 2000-02         152    3362.24            152        3362.24
DVD-R Discs, 2000-03         188    4148.02            188        4148.02
DVD-R Discs, 2000-04         144    3170.09            144        3170.09
DVD-R Discs, 2000-05         189    4164.87            189        4164.87
DVD-R Discs, 2000-06         145    3192.21            145        3192.21
DVD-R Discs, 2000-07                                124.25        2737.71
DVD-R Discs, 2000-08                                124.25        2737.71
DVD-R Discs, 2000-09           1      18.91              1          18.91
DVD-R Discs, 2000-10                                124.25        2737.71
DVD-R Discs, 2000-11                                124.25        2737.71
DVD-R Discs, 2000-12           8     161.84              8         161.84
SELECT Product_Name, t.Year, t.Week, NVL(Sales,0) Current_sales,
  SUM(Sales) OVER
    (PARTITION BY Product_Name, t.year ORDER BY t.week) Cumulative_sales
FROM
 (SELECT SUBSTR(p.Prod_Name,1,15) Product_Name, t.Calendar_Year Year,
    t.Calendar_Week_Number Week, SUM(Amount_Sold) Sales
  FROM Sales s, Times t, Products p
  WHERE s.Time_id = t.Time_id AND s.Prod_id = p.Prod_id AND
    p.Prod_name IN ('Bounce') AND t.Calendar_Year IN (2000,2001) AND
    t.Calendar_Week_Number BETWEEN 20 AND 30
  GROUP BY p.Prod_Name, t.Calendar_Year, t.Calendar_Week_Number) v
PARTITION BY (v.Product_Name)
RIGHT OUTER JOIN
 (SELECT DISTINCT Calendar_Week_Number Week, Calendar_Year Year
  FROM Times
  WHERE Calendar_Year in (2000, 2001)
    AND Calendar_Week_Number BETWEEN 20 AND 30) t
ON (v.week = t.week AND v.Year = t.Year)
ORDER BY t.year, t.week;

PRODUCT_NAME          YEAR       WEEK CURRENT_SALES CUMULATIVE_SALES
--------------- ---------- ---------- ------------- ----------------
Bounce                2000         20           801              801
Bounce                2000         21       4062.24          4863.24
Bounce                2000         22       2043.16           6906.4
Bounce                2000         23       2731.14          9637.54
Bounce                2000         24       4419.36          14056.9
Bounce                2000         25             0          14056.9
Bounce                2000         26             0          14056.9
Bounce                2000         27       2297.29         16354.19
Bounce                2000         28       1443.13         17797.32
Bounce                2000         29       1927.38          19724.7
Bounce                2000         30       1927.38         21652.08
Bounce                2001         20        1483.3           1483.3
Bounce                2001         21       4184.49          5667.79
Bounce                2001         22       2609.19          8276.98
Bounce                2001         23       1416.95          9693.93
Bounce                2001         24       3149.62         12843.55
Bounce                2001         25       2645.98         15489.53
Bounce                2001         26             0         15489.53
Bounce                2001         27       2125.12         17614.65
Bounce                2001         28             0         17614.65
Bounce                2001         29       2467.92         20082.57
Bounce                2001         30       2620.17         22702.74
WITH v AS (SELECT SUBSTR(p.Prod_Name,1,6) Prod, t.Calendar_Year Year, t.Calendar_Week_Number Week, SUM(Amount_Sold) Sales FROM Sales s, Times t, Products p WHERE s.Time_id = t.Time_id AND s.Prod_id = p.Prod_id AND p.Prod_name in ('Y Box') AND t.Calendar_Year in (2000,2001) AND t.Calendar_Week_Number BETWEEN 30 AND 40 GROUP BY p.Prod_Name, t.Calendar_Year, t.Calendar_Week_Number) SELECT Prod , Year, Week, Sales, Weekly_ytd_sales, Weekly_ytd_sales_prior_year FROM (SELECT Prod, Year, Week, Sales, Weekly_ytd_sales, LAG(Weekly_ytd_sales, 1) OVER (PARTITION BY Prod , Week ORDER BY Year) Weekly_ytd_sales_prior_year FROM (SELECT v.Prod Prod , t.Year Year, t.Week Week, NVL(v.Sales,0) Sales, SUM(NVL(v.Sales,0)) OVER (PARTITION BY v.Prod , t.Year ORDER BY t.week) weekly_ytd_sales FROM v PARTITION BY (v.Prod ) RIGHT OUTER JOIN (SELECT DISTINCT Calendar_Week_Number Week, Calendar_Year Year FROM Times WHERE Calendar_Year IN (2000, 2001)) t ON (v.week = t.week AND v.Year = t.Year) ) dense_sales ) year_over_year_sales WHERE Year = 2001 AND Week BETWEEN 30 AND 40
ORDER BY 1, 2, 3;

PROD    YEAR WEEK      SALES WEEKLY_YTD_SALES WEEKLY_YTD_SALES_PRIOR_YEAR
------ ----- ---- ---------- ---------------- ---------------------------
Y Box   2001   30    7877.45          7877.45                           0
Y Box   2001   31   13082.46         20959.91                     1537.35
Y Box   2001   32   11569.02         32528.93                     9531.57
Y Box   2001   33   38081.97          70610.9                    39048.69
Y Box   2001   34   33109.65        103720.55                    69100.79
Y Box   2001   35          0        103720.55                    71265.35
Y Box   2001   36     4169.3        107889.85                    81156.29
Y Box   2001   37   24616.85         132506.7                    95433.09
Y Box   2001   38   37739.65        170246.35                   107726.96
Y Box   2001   39     284.95         170531.3                    118817.4
Y Box   2001   40   10868.44        181399.74                   120969.69
In the FROM clause of the in-line view dense_sales, we use a partitioned outer join of aggregate view v and time view t to fill gaps in the sales data along the time dimension. The output of the partitioned outer join is then processed by the analytic function SUM ... OVER to compute the weekly year-to-date sales (the weekly_ytd_sales column). Thus, the view dense_sales computes the year-to-date sales data for each week, including those missing in the aggregate view v. The in-line view year_over_year_sales then computes the year ago weekly year-to-date sales using the LAG function. The LAG function labeled weekly_ytd_sales_prior_year specifies a PARTITION BY clause that pairs rows for the same week of years 2000 and 2001 into a single partition. We then pass an offset of 1 to the LAG function to get the weekly year-to-date sales for the prior year. The outermost query block selects data from year_over_year_sales with the condition Year = 2001, and thus the query returns, for each product, its weekly year-to-date sales in the specified weeks of years 2001 and 2000.
1. We will create a view called cube_prod_time, which holds a hierarchical cube of sales aggregated across times and products.
2. Then we will create a view of the time dimension to use as an edge of the cube. The time edge, which holds a complete set of dates, will be partitioned outer joined to the sparse data in the view cube_prod_time.
3. Finally, for maximum performance, we will create a materialized view, mv_prod_time, built using the same definition as cube_prod_time.
For more information regarding hierarchical cubes, see Chapter 20, "SQL for Aggregation in Data Warehouses".

Step 1 Create the hierarchical cube view
The view is defined using the following statement. It may already exist in your system; if not, create it now. If you must generate it, please note that we limit the query to just two products to keep processing time short:
CREATE OR REPLACE VIEW cube_prod_time AS
SELECT
  (CASE
     WHEN ((GROUPING(calendar_year)=0 ) AND
           (GROUPING(calendar_quarter_desc)=1 ))
       THEN (TO_CHAR(calendar_year) || '_0')
     WHEN ((GROUPING(calendar_quarter_desc)=0 ) AND
           (GROUPING(calendar_month_desc)=1 ))
       THEN (TO_CHAR(calendar_quarter_desc) || '_1')
     WHEN ((GROUPING(calendar_month_desc)=0 ) AND
           (GROUPING(t.time_id)=1 ))
       THEN (TO_CHAR(calendar_month_desc) || '_2')
     ELSE (TO_CHAR(t.time_id) || '_3')
   END) Hierarchical_Time,
  calendar_year year, calendar_quarter_desc quarter,
  calendar_month_desc month, t.time_id day,
  prod_category cat, prod_subcategory subcat, p.prod_id prod,
  GROUPING_ID(prod_category, prod_subcategory, p.prod_id,
    calendar_year, calendar_quarter_desc, calendar_month_desc, t.time_id) gid,
  GROUPING_ID(prod_category, prod_subcategory, p.prod_id) gid_p,
  GROUPING_ID(calendar_year, calendar_quarter_desc,
    calendar_month_desc, t.time_id) gid_t,
  SUM(amount_sold) s_sold, COUNT(amount_sold) c_sold, COUNT(*) cnt
FROM SALES s, TIMES t, PRODUCTS p
WHERE s.time_id = t.time_id AND
  p.prod_name IN ('Bounce', 'Y Box') AND s.prod_id = p.prod_id
GROUP BY ROLLUP(calendar_year, calendar_quarter_desc,
    calendar_month_desc, t.time_id),
  ROLLUP(prod_category, prod_subcategory, p.prod_id);
Because this view is limited to two products, it returns just over 2200 rows. Note that the column Hierarchical_Time contains string representations of time from all levels of the time hierarchy. The CASE expression used for the Hierarchical_Time column appends a marker (_0, _1, ...) to each date string to denote the time level of the value. A _0 represents the year level, _1 is quarters, _2 is months, and _3 is day. Note that the GROUP BY clause is a concatenated ROLLUP which specifies the rollup hierarchy for the time and product dimensions. The GROUP BY clause is what determines the hierarchical cube contents.

Step 2 Create the view edge_time, which is a complete set of date values
edge_time is the source for filling time gaps in the hierarchical cube using a partitioned outer join. The column Hierarchical_Time in edge_time will be used in a partitioned join with the Hierarchical_Time column in the view cube_prod_time. The following statement defines edge_time:
CREATE OR REPLACE VIEW edge_time AS SELECT (CASE WHEN ((GROUPING(calendar_year)=0 ) AND (GROUPING(calendar_quarter_desc)=1 )) THEN (TO_CHAR(calendar_year) || '_0') WHEN ((GROUPING(calendar_quarter_desc)=0 ) AND (GROUPING(calendar_month_desc)=1 )) THEN (TO_CHAR(calendar_quarter_desc) || '_1') WHEN ((GROUPING(calendar_month_desc)=0 ) AND (GROUPING(time_id)=1 )) THEN (TO_CHAR(calendar_month_desc) || '_2') ELSE (TO_CHAR(time_id) || '_3') END) Hierarchical_Time, calendar_year yr, calendar_quarter_number qtr_num, calendar_quarter_desc qtr, calendar_month_number mon_num, calendar_month_desc mon, time_id - TRUNC(time_id, 'YEAR') + 1 day_num, time_id day, GROUPING_ID(calendar_year, calendar_quarter_desc, calendar_month_desc, time_id) gid_t FROM TIMES GROUP BY ROLLUP (calendar_year, (calendar_quarter_desc, calendar_quarter_number), (calendar_month_desc, calendar_month_number), time_id);
Step 3 Create the materialized view mv_prod_time to support faster performance
The materialized view definition is a duplicate of the view cube_prod_time defined earlier. Because it is a duplicate query, references to cube_prod_time will be rewritten to use the mv_prod_time materialized view. The following materialized view may already exist in your system; if not, create it now. If you must generate it, please note that we limit the query to just two products to keep processing time short.
CREATE MATERIALIZED VIEW mv_prod_time REFRESH COMPLETE ON DEMAND AS SELECT (CASE WHEN ((GROUPING(calendar_year)=0 ) AND (GROUPING(calendar_quarter_desc)=1 )) THEN (TO_CHAR(calendar_year) || '_0') WHEN ((GROUPING(calendar_quarter_desc)=0 ) AND (GROUPING(calendar_month_desc)=1 )) THEN (TO_CHAR(calendar_quarter_desc) || '_1') WHEN ((GROUPING(calendar_month_desc)=0 ) AND (GROUPING(t.time_id)=1 )) THEN (TO_CHAR(calendar_month_desc) || '_2') ELSE (TO_CHAR(t.time_id) || '_3') END) Hierarchical_Time, calendar_year year, calendar_quarter_desc quarter, calendar_month_desc month, t.time_id day, prod_category cat, prod_subcategory subcat, p.prod_id prod, GROUPING_ID(prod_category, prod_subcategory, p.prod_id, calendar_year, calendar_quarter_desc, calendar_month_desc,t.time_id) gid, GROUPING_ID(prod_category, prod_subcategory, p.prod_id) gid_p, GROUPING_ID(calendar_year, calendar_quarter_desc, calendar_month_desc, t.time_id) gid_t, SUM(amount_sold) s_sold, COUNT(amount_sold) c_sold, COUNT(*) cnt FROM SALES s, TIMES t, PRODUCTS p WHERE s.time_id = t.time_id AND p.prod_name IN ('Bounce', 'Y Box') AND s.prod_id = p.prod_id GROUP BY ROLLUP(calendar_year, calendar_quarter_desc, calendar_month_desc, t.time_id), ROLLUP(prod_category, prod_subcategory, p.prod_id);
Step 4 Create the comparison query
We have now set the stage for our comparison query. We can obtain period-to-period comparison calculations at all time levels. It requires applying analytic functions to a hierarchical cube with dense data along the time dimension.
Some of the calculations we can achieve for each time level are:
- Sum of sales for prior period at all levels of time.
- Variance in sales over prior period.
- Sum of sales in the same period a year ago at all levels of time.
- Variance in sales over the same period last year.
The following example performs all four of these calculations. It uses a partitioned outer join of the views cube_prod_time and edge_time to create an in-line view of dense data called dense_cube_prod_time. The query then uses the LAG function in the same way as the prior single-level example. The outer WHERE clause specifies time at three levels: the days of August 2001, the entire month, and the entire third quarter of 2001. Note that the last two rows of the results contain the month level and quarter level aggregations. Note: To make the results easier to read if you are using SQL*Plus, the column headings should be adjusted with the following commands. The commands will fold the column headings to reduce line length:
col sales_prior_period heading 'sales_prior|_period'
col variance_prior_period heading 'variance|_prior|_period'
col sales_same_period_prior_year heading 'sales_same|_period_prior|_year'
col variance_same_period_p_year heading 'variance|_same_period|_prior_year'
Here is the query comparing current sales to prior and year ago sales:
SELECT SUBSTR(prod,1,4) prod, SUBSTR(Hierarchical_Time,1,12) ht, sales, sales_prior_period, sales - sales_prior_period variance_prior_period, sales_same_period_prior_year, sales - sales_same_period_prior_year variance_same_period_p_year FROM (SELECT cat, subcat, prod, gid_p, gid_t, Hierarchical_Time, yr, qtr, mon, day, sales, LAG(sales, 1) OVER (PARTITION BY gid_p, cat, subcat, prod, gid_t ORDER BY yr, qtr, mon, day) sales_prior_period, LAG(sales, 1) OVER (PARTITION BY gid_p, cat, subcat, prod, gid_t, qtr_num, mon_num, day_num ORDER BY yr) sales_same_period_prior_year FROM (SELECT c.gid, c.cat, c.subcat, c.prod, c.gid_p, t.gid_t, t.yr, t.qtr, t.qtr_num, t.mon, t.mon_num, t.day, t.day_num, t.Hierarchical_Time, NVL(s_sold,0) sales
   FROM cube_prod_time c
   PARTITION BY (gid_p, cat, subcat, prod)
   RIGHT OUTER JOIN edge_time t
   ON (c.gid_t = t.gid_t AND c.Hierarchical_Time = t.Hierarchical_Time)
  ) dense_cube_prod_time
)                                     -- side by side current and prior year sales
WHERE prod IN (139) AND gid_p=0 AND   -- 1 product and product level data
  ((mon IN ('2001-08') AND gid_t IN (0, 1)) OR   -- day and month data
   (qtr IN ('2001-03') AND gid_t IN (3)))        -- quarter level data
ORDER BY day;

                               sales_prior  variance_prior  sales_same_period  variance_same_period
PROD HT               SALES        _period         _period        _prior_year          _prior_year
---- ------------ ---------- ------------- --------------- ------------------ --------------------
139  01-AUG-01_3           0             0               0                  0                    0
139  02-AUG-01_3     1347.53             0         1347.53                  0              1347.53
139  03-AUG-01_3           0       1347.53        -1347.53              42.36               -42.36
139  04-AUG-01_3       57.83             0           57.83             995.75              -937.92
139  05-AUG-01_3           0         57.83          -57.83                  0                    0
139  06-AUG-01_3           0             0               0                  0                    0
139  07-AUG-01_3      134.81             0          134.81             880.27              -745.46
139  08-AUG-01_3     1289.89        134.81         1155.08                  0              1289.89
139  09-AUG-01_3           0       1289.89        -1289.89                  0                    0
139  10-AUG-01_3           0             0               0                  0                    0
139  11-AUG-01_3           0             0               0                  0                    0
139  12-AUG-01_3           0             0               0                  0                    0
139  13-AUG-01_3           0             0               0                  0                    0
139  14-AUG-01_3           0             0               0                  0                    0
139  15-AUG-01_3       38.49             0           38.49            1104.55             -1066.06
139  16-AUG-01_3           0         38.49          -38.49                  0                    0
139  17-AUG-01_3       77.17             0           77.17            1052.03              -974.86
139  18-AUG-01_3     2467.54         77.17         2390.37                  0              2467.54
139  19-AUG-01_3           0       2467.54        -2467.54             127.08              -127.08
139  20-AUG-01_3           0             0               0                  0                    0
139  21-AUG-01_3           0             0               0                  0                    0
139  22-AUG-01_3           0             0               0                  0                    0
139  23-AUG-01_3     1371.43             0         1371.43                  0              1371.43
139  24-AUG-01_3      153.96       1371.43        -1217.47             2091.3             -1937.34
139  25-AUG-01_3           0        153.96         -153.96                  0                    0
139  26-AUG-01_3           0             0               0                  0                    0
139  27-AUG-01_3     1235.48             0         1235.48                  0              1235.48
139  28-AUG-01_3       173.3       1235.48        -1062.18            2075.64             -1902.34
139  29-AUG-01_3           0         173.3          -173.3                  0                    0
139  30-AUG-01_3           0             0               0                  0                    0
139  31-AUG-01_3           0             0               0                  0                    0
139  2001-08_2       8347.43       7213.21         1134.22            8368.98               -21.55
139  2001-03_1       24356.8      28862.14        -4505.34           24168.99               187.81
The first LAG function (sales_prior_period) partitions the data on gid_p, cat, subcat, prod, gid_t and orders the rows on all the time dimension columns. It gets the sales value of the prior period by passing an offset of 1. The second LAG function (sales_same_period_prior_year) partitions the data on additional columns qtr_num, mon_num, and day_num and orders it on yr so that, with an offset of 1, it can compute the year ago sales for the same period. The outermost SELECT clause computes the variances.
In this statement, the view time_c is defined by performing a UNION ALL of the edge_time view (defined in the prior example) and the user-defined 13th month.
The gid_t value of 8 was chosen to differentiate the custom member from the standard members. The UNION ALL specifies the attributes for a 13th month member by doing a SELECT from the DUAL table. Note that the grouping id, column gid_t, is set to 8, and the quarter number is set to 5. Then, the second step is to use an inline view of the query to perform a partitioned outer join of cube_prod_time with time_c. This step creates sales data for the 13th month at each level of product aggregation. In the main query, the analytic function SUM is used with a CASE expression to compute the 13th month, which is defined as the summation of the first month's sales of each quarter.
SELECT * FROM
(SELECT SUBSTR(cat,1,12) cat, SUBSTR(subcat,1,12) subcat,
   prod, mon, mon_num,
   SUM(CASE WHEN mon_num IN (1, 4, 7, 10) THEN s_sold ELSE NULL END)
     OVER (PARTITION BY gid_p, prod, subcat, cat, yr) sales_month_13
 FROM
 (SELECT c.gid, c.prod, c.subcat, c.cat, gid_p,
    t.gid_t, t.day, t.mon, t.mon_num, t.qtr, t.yr, NVL(s_sold,0) s_sold
  FROM cube_prod_time c
  PARTITION BY (gid_p, prod, subcat, cat)
  RIGHT OUTER JOIN time_c t
  ON (c.gid_t = t.gid_t AND c.Hierarchical_Time = t.Hierarchical_Time)
 )
)
WHERE mon_num = 13;

CAT          SUBCAT             PROD MON      MON_NUM SALES_MONTH_13
------------ ------------ ---------- -------- ------- --------------
Electronics  Game Console         16 2001-13       13      762334.34
Electronics  Y Box Games         139 2001-13       13       75650.22
Electronics  Game Console            2001-13       13      762334.34
Electronics  Y Box Games             2001-13       13       75650.22
Electronics                          2001-13       13      837984.56
                                     2001-13       13      837984.56
The SUM function uses a CASE to limit the data to months 1, 4, 7, and 10 within each year. Due to the tiny data set, with just 2 products, the rollup values of the results are necessarily repetitions of lower level aggregations. For a more realistic set of rollup values, you can include more products from the Game Console and Y Box Games subcategories in the underlying materialized view.
22
SQL for Modeling
This chapter discusses using SQL modeling, and includes:
- Overview of SQL Modeling
- Basic Topics in SQL Modeling
- Advanced Topics in SQL Modeling
- Performance Considerations with SQL Modeling
- Examples of SQL Modeling
- Partition columns define the logical blocks of the result set in a way similar to the partitions of the analytical functions described in Chapter 21, "SQL for Analysis and Reporting". Rules in the MODEL clause are applied to each partition independent of other partitions. Thus, partitions serve as a boundary point for parallelizing the MODEL computation.
- Dimension columns define the multi-dimensional array and are used to identify cells within a partition. By default, a full combination of dimensions should identify just one cell in a partition. In default mode, they can be considered analogous to the key of a relational table.
- Measures are equivalent to the measures of a fact table in a star schema. They typically contain numeric values such as sales units or cost. Each cell is accessed by specifying its full combination of dimensions. Note that each partition may have a cell that matches a given combination of dimensions.
The MODEL clause enables you to specify rules to manipulate the measure values of the cells in the multi-dimensional array defined by partition and dimension columns. Rules access and update measure column values by specifying dimension values symbolically. Such symbolic references used in rules result in a highly readable model. Rules are concise and flexible, and can use wild cards and looping constructs for maximum expressiveness. Oracle evaluates the rules in an efficient way, parallelizes the model computation whenever possible, and provides a seamless integration of the MODEL clause with other SQL clauses. The MODEL clause,
thus, is a scalable and manageable way of computing business models in the database. Figure 22-1 offers a conceptual overview of the modeling feature of SQL. The figure has three parts. The top segment shows the concept of dividing a typical table into partition, dimension, and measure columns. The middle segment shows two rules that calculate the value of Prod1 and Prod2 for the year 2002. Finally, the third part shows the output of a query that applies the rules to such a table with hypothetical data. The unshaded output is the original data as it is retrieved from the database, while the shaded output shows the rows calculated by the rules. Note that results in partition A are calculated independently from results of partition B.
Figure 22-1 Model Elements

Mapping of columns to model entities:
  Country -> Partition
  Product -> Dimension
  Year    -> Dimension
  Sales   -> Measure

Rules:
  Sales(Prod1, 2002) = Sales(Prod1, 2000) + Sales(Prod1, 2001)
  Sales(Prod2, 2002) = Sales(Prod2, 2000) + Sales(Prod2, 2001)

Output of MODEL clause (the last four rows are the rule results; the rest is the original data):

Country  Product  Year  Sales
-------  -------  ----  -----
A        Prod1    2000     10
A        Prod1    2001     15
A        Prod2    2000     12
A        Prod2    2001     16
B        Prod1    2000     21
B        Prod1    2001     23
B        Prod2    2000     28
B        Prod2    2001     29
A        Prod1    2002     25
A        Prod2    2002     28
B        Prod1    2002     44
B        Prod2    2002     57
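The rules in Figure 22-1 correspond to a MODEL query along the following lines. This is a sketch only: the table name sales_fact and its columns (country, product, year, sales) are assumptions standing in for the hypothetical data in the figure.

SELECT country, product, year, sales
FROM sales_fact
MODEL
  PARTITION BY (country)
  DIMENSION BY (product, year)
  MEASURES (sales)
  RULES UPSERT
  ( sales['Prod1', 2002] = sales['Prod1', 2000] + sales['Prod1', 2001],
    sales['Prod2', 2002] = sales['Prod2', 2000] + sales['Prod2', 2001] )
ORDER BY country, product, year;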
clause and rearranged into an array. Once the array is defined, rules are applied one by one to the data. Finally, the data, including both its updated values and newly created values, is rearranged into row form and presented as the results of the query.
Figure 22-2 Model Flow Processing
MODEL
  DIMENSION BY (prod, year)
  MEASURES (sales s)
  RULES UPSERT
   (s[ANY, 2000] = s[CV(prod), CV(year)-1]*2,       --Rule 1
    s[vcr, 2002] = s[vcr, 2001] + s[vcr, 2000],     --Rule 2
    s[dvd, 2002] = AVG(s)[CV(prod), year<2001])     --Rule 3

The figure shows the query results (rows such as prod=vcr, year=2001, sales=9 and prod=dvd, year=2001, sales=0) being rearranged into an array, the three rules being applied to the array one by one, and the updated cells being returned as rows, including the newly created cells vcr/2002 with sales 11 and dvd/2002 with sales 3.
This query partitions the data in sales_view (which is illustrated in "Base Schema" on page 22-11) on country so that the model computation, as defined by the three rules, is performed on each country. This model calculates the sales of Bounce in 2002 as the sum of its sales in 2000 and 2001, and sets the sales for Y Box in 2002 to the same value as they were in 2001. Also, it introduces a new product category All_Products (sales_view does not have the product All_Products) for year 2002 to be the sum of sales of Bounce and Y Box for that year. The output of this query is as follows, with the new values being those for year 2002 and for All_Products:
COUNTRY              PRODUCT               YEAR      SALES
-------------------- --------------- ---------- ----------
Italy                Bounce                1999    2474.78
Italy                Bounce                2000    4333.69
Italy                Bounce                2001     4846.3
Italy                Bounce                2002    9179.99
...
Italy                Y Box                 1999   15215.16
Italy                Y Box                 2000   29322.89
Italy                Y Box                 2001   81207.55
Italy                Y Box                 2002   81207.55
...
Italy                All_Products          2002   90387.54
...
Japan                Bounce                1999     2961.3
...
Note that, while the sales values for Bounce and Y Box exist in the input, the values for All_Products are derived.
Symbolic cell addressing
Measure columns in individual rows are treated like cells in a multi-dimensional array and can be referenced and updated using symbolic references to dimension values. For example, in a fact table ft(country, year, sales), you can designate country and year to be dimension columns and sales to be the measure and reference sales for a given country and year as sales[country='Spain', year=1999]. This gives you the sales value for Spain in 1999. You can also use a shorthand form sales['Spain', 1999] to mean the same thing. There are a few semantic differences between these notations, though. See "Cell Referencing" on page 22-15 for further details.
Symbolic array computation
You can specify a series of formulas, called rules, to operate on the data. Rules can invoke functions on individual cells or on a set or range of cells. An example involving individual cells is the following:
sales[country='Spain',year=2001] = sales['Spain',2000]+ sales['Spain',1999]
This sets the sales in Spain for the year 2001 to the sum of sales in Spain for 1999 and 2000. An example involving a range of cells is the following:
sales[country='Spain',year=2001] = MAX(sales)['Spain',year BETWEEN 1997 AND 2000]
This sets the sales in Spain for the year 2001 equal to the maximum sales in Spain between 1997 and 2000.
The UPSERT and UPDATE options
Using the UPSERT option, which is the default, you can create cell values that do not exist in the input data. If the cell referenced exists in the data, it is updated. Otherwise, it is inserted. The UPDATE option, on the other hand, would not insert any new cells. You can specify these options globally, in which case they apply to all rules, or per each rule. If you specify an option at the rule level, it overrides the global option. Consider the following rules:
UPDATE sales['Spain', 1999] = 3567.99, UPSERT sales['Spain', 2001] = sales['Spain', 2000]+ sales['Spain', 1999]
The first rule updates the cell for sales in Spain for 1999. The second rule updates the cell for sales in Spain for 2001 if it exists; otherwise, it creates a new cell.
Wildcard specification of dimensions
You can use ANY and IS ANY to specify all values in a dimension. As an example, consider the following statement:
sales[ANY, 2001] = sales['Japan', 2000]
This rule sets the 2001 sales of all countries equal to the sales value of Japan for the year 2000. All values for the dimension, including nulls, satisfy the ANY specification. You can specify the same in symbolic form using an IS ANY predicate as in the following:
sales[country IS ANY, 2001] = sales['Japan', 2000]
Accessing dimension values using the CV function You can use the CV function on the right side of a rule to access the value of a dimension column of the cell referenced on the left side of a rule. It enables you to combine multiple rules performing similar computation into a single rule, thus resulting in concise specification. For example, you can combine the following rules:
sales[country='Spain', year=2002] = 1.2 * sales['Spain', 2001], sales[country='Italy', year=2002] = 1.2 * sales['Italy', 2001], sales[country='Japan', year=2002] = 1.2 * sales['Japan', 2001]
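into a single rule along the lines of the following (a sketch; the list of countries shown is illustrative):

sales[country IN ('Spain', 'Italy', 'Japan'), year=2002] = 1.2 * sales[CV(country), 2001]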
Observe that the CV function passes the value for the country dimension from the left to the right side of the rule.
Ordered computation For rules updating a set of cells, the result may depend on the ordering of dimension values. You can force a particular order for the dimension values by specifying an ORDER BY in the rule. An example is the following rule:
sales[country IS ANY, year BETWEEN 2000 AND 2003] ORDER BY year = 1.05 * sales[CV(country), CV(year)-1]
This ensures that the years are referenced in increasing chronological order.
Automatic rule ordering Rules in the MODEL clause can be automatically ordered based on dependencies among the cells using the AUTOMATIC ORDER keywords. For example, in the following assignments, the last two rules will be processed before the first rule because the first depends on the second and third:
RULES AUTOMATIC ORDER
(sales[c='Spain', y=2001] = sales[c='Spain', y=2000] + sales[c='Spain', y=1999],
 sales[c='Spain', y=2000] = 50000,
 sales[c='Spain', y=1999] = 40000)
Iterative rule evaluation You can specify iterative rule evaluation, in which case the rules are evaluated iteratively until the termination condition is satisfied. Consider the following specification:
MODEL DIMENSION BY (x) MEASURES (s) RULES ITERATE (4) (s[x=1] = s[x=1]/2)
This statement specifies that the evaluation of the formula s[x=1] = s[x=1]/2 be repeated four times. The number of iterations is specified in the ITERATE option of the MODEL clause. It is also possible to specify a termination condition by using an UNTIL clause.
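As a minimal sketch of an UNTIL condition, the following hypothetical variant keeps halving s[1] for at most 1000 iterations, stopping early once the value drops below 1:

MODEL DIMENSION BY (x) MEASURES (s) RULES ITERATE (1000) UNTIL (s[1] < 1) (s[1] = s[1]/2)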
Iterative rule evaluation is an important tool for modeling recursive relationships between entities in a business application. For example, a loan amount might depend on the interest rate where the interest rate in turn depends on the amount of the loan.
Reference models Rules can reference cells from different multi-dimensional arrays. All but one of the multi-dimensional arrays used in the model are read-only and are called reference models. Rules can update or insert cells in only one multi-dimensional array, which is called the main model. The use of reference models enables you to relate models with different dimensionality. For example, assume that, in addition to the fact table ft(country, year, sales), you have a table with currency conversion ratios cr(country, ratio) with country as the dimension column and ratio as the measure. Each row in this table gives the conversion ratio of that country's currency to the US dollar. These two tables could be used in rules such as the following:
dollar_sales['Spain',2001] = sales['Spain',2000] * ratio['Spain']
Scalable computation You can partition data and evaluate rules within each partition independent of other partitions. This enables parallelization of model computation based on partitions. For example, consider the following model:
MODEL PARTITION BY (c) DIMENSION BY (y) MEASURES (s) (s[y=2001] = AVG(s)[y BETWEEN 1990 AND 2000])
The data is partitioned by country and, within each partition, you can compute the sales in 2001 to be the average of sales in the years between 1990 and 2000. Partitions can be processed in parallel and this results in a scalable execution of the model.
This section contains the following topics:
- Base Schema
- MODEL Clause Syntax
- Keywords in SQL Modeling
- Rules and Restrictions when Using SQL for Modeling
Base Schema
This chapter's examples are based on the following view sales_view, which is derived from the sh sample schema.
CREATE VIEW sales_view AS
SELECT country_name country, prod_name product, calendar_year year,
       SUM(amount_sold) sales, COUNT(amount_sold) cnt,
       MAX(calendar_year) KEEP (DENSE_RANK FIRST ORDER BY SUM(amount_sold) DESC)
         OVER (PARTITION BY country_name, prod_name) best_year,
       MAX(calendar_year) KEEP (DENSE_RANK LAST ORDER BY SUM(amount_sold) DESC)
         OVER (PARTITION BY country_name, prod_name) worst_year
FROM sales, times, customers, countries, products
WHERE sales.time_id = times.time_id AND sales.prod_id = products.prod_id AND
      sales.cust_id = customers.cust_id AND customers.country_id = countries.country_id
GROUP BY country_name, prod_name, calendar_year;
This query computes SUM and COUNT aggregates on the sales data grouped by country, product, and year. It will report for each product sold in a country, the year when the sales were the highest for that product in that country. This is called the best_year of the product. Similarly, worst_year gives the year when the sales were the lowest.
MODEL Clause Syntax

The general form of the MODEL clause, as used in the examples of this chapter, is the following:

MODEL [<global reference options>] [<reference models>] [MAIN <main-name>]
  [PARTITION BY (<cols>)]
  DIMENSION BY (<cols>)
  MEASURES (<cols>)
  [<reference options>]
  [RULES] <rule options>
  (<rule>, <rule>,..., <rule>)

<global reference options> ::= <reference options> <ret-opt>
<ret-opt> ::= RETURN {ALL|UPDATED} ROWS
<reference options> ::= [IGNORE NAV | KEEP NAV] [UNIQUE DIMENSION | UNIQUE SINGLE REFERENCE]
<rule options> ::= [UPSERT | UPDATE] [AUTOMATIC ORDER | SEQUENTIAL ORDER] [ITERATE (<number>) [UNTIL <condition>]]
<reference models> ::= REFERENCE <ref-name> ON (<query>) DIMENSION BY (<cols>) MEASURES (<cols>) <reference options>
Each rule represents an assignment. Its left side references a cell or a set of cells and the right side can contain expressions involving constants, host variables, individual cells or aggregates over ranges of cells. For example, consider Example 22-1.
Example 22-1 Simple Query with the MODEL Clause
SELECT SUBSTR(country,1,20) country, SUBSTR(product,1,15) product, year, sales
FROM sales_view
WHERE country IN ('Italy', 'Japan')
MODEL RETURN UPDATED ROWS
  MAIN simple_model
  PARTITION BY (country)
  DIMENSION BY (product, year)
  MEASURES (sales)
  RULES (sales['Bounce', 2001] = 1000,
         sales['Bounce', 2002] = sales['Bounce', 2001] + sales['Bounce', 2000],
         sales['Y Box', 2002] = sales['Y Box', 2001])
ORDER BY country, product, year;
This query defines model computation on the rows from sales_view for the countries Italy and Japan. This model has been given the name simple_model. It partitions the data on country and defines, within each partition, a two-dimensional array on product and year. Each cell in this array holds the measure value sales. The first rule of this model sets the sales of Bounce in year 2001 to 1000. The next two rules define that the sales of Bounce in 2002 are the sum of its sales in years 2001 and 2000, and the sales of Y Box in 2002 are the same as those of the previous year, 2001. Specifying RETURN UPDATED ROWS makes the preceding query return only those rows that are updated or inserted by the model computation. By default, or if you use RETURN ALL ROWS, you would get all rows, not just the ones updated or inserted by the MODEL clause. The query produces the following output:
COUNTRY              PRODUCT         YEAR       SALES
-------------------- --------------- ---------- ----------
Italy                Bounce          2001       1000
Italy                Bounce          2002       5333.69
Italy                Y Box           2002       81207.55
Japan                Bounce          2001       1000
Japan                Bounce          2002       6133.53
Japan                Y Box           2002       89634.83
Note that the MODEL clause does not update or insert rows into database tables. The following query illustrates this by showing that sales_view has not been altered:
SELECT SUBSTR(country,1,20) country, SUBSTR(product,1,15) product, year, sales
FROM sales_view
WHERE country IN ('Italy', 'Japan');

COUNTRY              PRODUCT         YEAR       SALES
-------------------- --------------- ---------- ----------
Italy                Bounce          1999       2474.78
Italy                Bounce          2000       4333.69
Italy                Bounce          2001       4846.3
...
Observe that the update of the sales value for Bounce in 2001 done by this MODEL clause is not reflected in the database. If you want to update or insert rows in the database tables, you should use the INSERT, UPDATE, or MERGE statements. In the preceding example, columns are specified in the PARTITION BY, DIMENSION BY, and MEASURES lists. You can also specify constants, host variables, single-row functions, aggregate functions, analytic functions, or expressions involving them as partition and dimension keys and measures. However, you need to alias them in the PARTITION BY, DIMENSION BY, and MEASURES lists, and you need to use the aliases to refer to these expressions in the rules, the SELECT list, and the query ORDER BY. The following example shows how to use expressions and aliases:
SELECT country, p product, year, sales, profits FROM sales_view
WHERE country IN ('Italy', 'Japan')
MODEL RETURN UPDATED ROWS
  PARTITION BY (SUBSTR(country,1,20) AS country)
  DIMENSION BY (product AS p, year)
  MEASURES (sales, 0 AS profits)
  RULES (profits['Bounce', 2001] = sales['Bounce', 2001] * 0.25,
         sales['Bounce', 2002] = sales['Bounce', 2001] + sales['Bounce', 2000],
         profits['Bounce', 2002] = sales['Bounce', 2002] * 0.35)
ORDER BY country, year;

COUNTRY PRODUCT   YEAR  SALES     PROFITS
------- --------- ----- --------- ---------
Italy   Bounce    2001  4846.3    1211.575
Italy   Bounce    2002  9179.99   3212.9965
Japan   Bounce    2001  6303.6    1575.9
Japan   Bounce    2002  11437.13  4002.9955
See Oracle Database SQL Reference for more information regarding MODEL clause syntax.
Keywords in SQL Modeling

This section defines the keywords used in the MODEL clause.

UPDATE This updates existing cell values. If the cell values do not exist, no updates are done.
UPSERT This updates existing cell values. If the cell values do not exist, they are inserted.
IGNORE NAV For numeric cells, this treats nulls and absent values as 0. This means that a cell not supplied to MODEL by the query result set will be treated as a zero for the calculation. This can be used at a global level for all measures in a model.
KEEP NAV
This keeps null and absent cell values unchanged. It is useful for making exceptions when IGNORE NAV is specified at the global level. This is the default, and can be omitted.
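As a sketch of where these options appear, the following hypothetical query sets IGNORE NAV for all measures of the model, so a missing or NULL Bounce cell contributes 0 to the sum rather than making it NULL:

SELECT country, product, year, sales
FROM sales_view
WHERE country = 'Italy'
MODEL PARTITION BY (country) DIMENSION BY (product, year) MEASURES (sales) IGNORE NAV
  RULES (sales['Bounce', 2003] = sales['Bounce', 2002] + sales['Bounce', 2001])
ORDER BY country, product, year;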
Calculation Definition
MEASURES The set of values that are modified or created by the model.
AUTOMATIC ORDER This causes all rules to be evaluated in an order based on their dependencies.
SEQUENTIAL ORDER This causes rules to be evaluated in the order they are written. This is the default.
UNIQUE DIMENSION This is the default, and it means that the PARTITION BY and DIMENSION BY columns in the MODEL clause must uniquely identify each and every cell in the model. This uniqueness is explicitly verified at run time when necessary, in which case it causes some overhead.
UNIQUE SINGLE REFERENCE The PARTITION BY and DIMENSION BY clauses uniquely identify single point references on the right-hand side of the rules. This may reduce processing time by avoiding explicit checks for uniqueness at run time.
RETURN [ALL|UPDATED] ROWS This enables you to specify whether to return all rows selected or only those rows updated by the rules. The default is ALL, while the alternative is UPDATED ROWS.
Cell Referencing
In the MODEL clause, a relation can be viewed as a multi-dimensional array of cells. A cell of this multi-dimensional array contains the measure values and is indexed using DIMENSION BY keys, within each partition defined by the PARTITION BY keys. For example, consider the following:
SELECT country, product, year, sales, best_year FROM sales_view MODEL PARTITION BY (country) DIMENSION BY (product, year) MEASURES (sales, best_year) (<rules> ..) ORDER BY country, product, year;
This partitions the data by country and defines, within each partition, a two-dimensional array on product and year. The cells of this array contain two measures: sales and best_year. Accessing the measure value of a cell by specifying the DIMENSION BY keys constitutes a cell reference. An example of a cell reference is sales[product='Bounce', year=2000]. Here, we are accessing the sales value of a cell referenced by product Bounce and the year 2000. In a cell reference, you can specify DIMENSION BY keys either symbolically as in the preceding cell reference or positionally as in sales['Bounce', 2000].
Based on how they are specified, cell references are either single-cell or multi-cell references. Only rules with positional single-cell references on the left side can insert new cells. For example, consider the reference sales['Bounce', year=2001]. Assuming the DIMENSION BY keys are product and year, in that order, it accesses the sales value for Bounce and 2001. This is a single cell reference in which a single value is specified for the first dimension (product) positionally and a single value for the second dimension (year) is specified symbolically.
Multi-Cell References
Cell references that are not single cell references are called multi-cell references. Examples of multi-cell reference are:
sales[year>=2001], sales[product='Bounce', year < 2001], and sales[product LIKE '%Finding Fido%', year IN (1999, 2000, 2001)]
Rules
Model computation is expressed in rules that manipulate the cells of the multi-dimensional array defined by PARTITION BY, DIMENSION BY, and MEASURES clauses. A rule is an assignment statement whose left side represents a cell or a range of cells and whose right side is an expression involving constants, bind variables, individual cells or an aggregate function on a range of cells. Rules can use wild cards and looping constructs for maximum expressiveness. An example of a rule is the following:
sales['Bounce', 2003] = 1.2 * sales['Bounce', 2002]
This rule says that, for the product Bounce, the sales for 2003 are 20% more than that of 2002.
Note that this rule has single cell references on both left and right side and is relatively simple. Complex rules can be written with multi-cell references, aggregates, and nested cell references. These are discussed in the following sections.
The following example illustrates the usage of inverse percentile function PERCENTILE_DISC. It projects Finding Fido sales for year 2003 to be 30% more than the median sales for products Finding Fido, Standard Mouse Pad, and Boat for all years prior to 2003.
sales[product='Finding Fido', year=2003] = 1.3 * PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY sales) [product IN ('Finding Fido','Standard Mouse Pad','Boat'), year < 2003]
sales['Finding Fido', 2003] = REGR_SLOPE(sales, y)['Finding Fido', year < 2002] * sales['Finding Fido', 2002] + sales['Finding Fido', 2002]

This shows the usage of the regression slope function REGR_SLOPE in rules (here, y is the year measure of the model). This function computes the slope of the change of a measure with respect to a dimension of the measure. In the preceding example, it gives the slope of the changes in the sales value with respect to year. This model projects Finding Fido sales for 2003 to be the sales in 2002 scaled by the growth (or slope) in sales for years less than 2002. Aggregate functions can appear only on the right side of rules. Arguments to the aggregate function can be constants, bind variables, measures of the MODEL clause, or expressions involving them. For example, the rule that computes the sales of Bounce for 2003 to be the weighted average of its sales for years from 1998 to 2002 would be:
sales['Bounce', 2003] = AVG(sales * weight)['Bounce', year BETWEEN 1998 AND 2002]
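Note that for this rule to be valid, weight must itself be declared as a measure of the model. A sketch of the enclosing clause, assuming a hypothetical weight column in the input, might be:

MODEL PARTITION BY (country) DIMENSION BY (product, year) MEASURES (sales, weight)
  RULES (sales['Bounce', 2003] = AVG(sales * weight)['Bounce', year BETWEEN 1998 AND 2002])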
Multi-cell references can also appear on the left side of rules. Consider the following rule:

sales[product='Standard Mouse Pad', year>2000] = 0.2 * sales['Finding Fido', 2000]

This rule accesses a range of cells on the left side (cells for product Standard Mouse Pad and year greater than 2000) and assigns the sales measure of each such cell to the value computed by the right side expression. Computation by the preceding rule is described as "sales of Standard Mouse Pad for years after 2000 is 20% of the sales of Finding Fido for year 2000". This computation is simple in that the right side cell references, and hence the right side expression, are the same for all cells referenced on the left. The right side references can also vary with the cell currently being computed on the left by using the CV function, as in the following rule:

sales[product='Standard Mouse Pad', year>2000] = sales[CV(product), CV(year)] + 0.2 * sales['Finding Fido', 2000]
The CV function provides the value of a DIMENSION BY key of the cell currently referenced on the left side. When the left side references the cell Standard Mouse Pad and 2001, the right side expression would be:
sales['Standard Mouse Pad', 2001] + 0.2 * sales['Finding Fido', 2000]
Similarly, when the left side references the cell Standard Mouse Pad and 2002, the right side expression we would evaluate is:
sales['Standard Mouse Pad', 2002] + 0.2 * sales['Finding Fido', 2000]
The use of the CV function provides the capability of relative indexing, where dimension values of the cell referenced on the left side are used in the right side cell references. The CV function takes a dimension key as its argument. It is also possible to use CV without any argument, as in CV(), in which case positional referencing is implied. CV may be used outside a cell reference, but when used in this way it requires the name of the desired dimension as its argument. You can also write the preceding rule as:
sales[product='Standard Mouse Pad', year>2000] = sales[CV(), CV()] + 0.2 * sales['Finding Fido', 2000]
The first CV() reference corresponds to CV(product) and the latter corresponds to CV(year). The CV function can be used only in right side cell references. Another example of the usage of the CV function is the following:
sales[product IN ('Finding Fido','Standard Mouse Pad','Bounce'), year BETWEEN 2002 AND 2004] = 2 * sales[CV(product), CV(year)-10]
This rule says that, for products Finding Fido, Standard Mouse Pad, and Bounce, the sales for years between 2002 and 2004 will be twice what their sales were 10 years ago.
A cell reference can be nested inside another cell reference, in which case the value of the nested reference supplies a dimension value for the outer reference. For example, in the reference sales[product='Bounce', year=best_year['Bounce', 2003]], the nested cell reference best_year['Bounce', 2003] provides the value for the dimension key year and is used in the symbolic reference for year. Measures best_year and worst_year give, for each year (y) and product (p) combination, the year for which sales of product p were highest or lowest. The following rule computes the sales of Standard Mouse Pad for 2003 to be the average of Standard Mouse Pad sales for the years in which Finding Fido sales were highest and lowest:
sales['Standard Mouse Pad', 2003] = (sales[CV(), best_year['Finding Fido', CV(year)]] + sales[CV(), worst_year['Finding Fido', CV(year)]]) / 2
Oracle allows only one level of nesting, and only single cell references can be used as nested cell references. Aggregates on multi-cell references cannot be used in nested cell references.
Alternatively, the option AUTOMATIC ORDER enables Oracle to determine the order of evaluation of rules automatically. Oracle examines the cell references within rules and constructs a dependency graph based on dependencies among rules. If cells referenced on the left side of rule R1 are referenced on the right side of another rule R2, then R2 is considered to depend on R1. In other words, rule R1 should be evaluated before rule R2. If you specify AUTOMATIC ORDER in the preceding example as in:
RULES AUTOMATIC ORDER (sales['Bounce', 2001] = sales['Bounce', 2000] + sales['Bounce', 1999], sales['Bounce', 2000] = 50000, sales['Bounce', 1999] = 40000)
Rules 2 and 3 are evaluated, in some arbitrary order, before rule 1. This is because rule 1 depends on rules 2 and 3 and hence needs to be evaluated after them. The order of evaluation among the second and third rules can be arbitrary as they do not depend on one another. In general, the order of evaluation among rules independent of one another can be arbitrary. A dependency graph is analyzed to come up with the
rule evaluation order. SQL models with automatic order of evaluation, as in the preceding fragment, are called automatic order models. In an automatic order model, multiple assignments to the same cell are not allowed. In other words, the measure of a cell can be assigned only once. Oracle will return an error in such cases because the results would be non-deterministic. For example, the following rule specification will generate an error because sales['Bounce', 2001] is assigned more than once:
RULES AUTOMATIC ORDER (sales['Bounce', 2001] = sales['Bounce', 2000] + sales['Bounce', 1999], sales['Bounce', 2001] = 50000, sales['Bounce', 2001] = 40000)
The rules assigning the sales of product Bounce for 2001 do not depend on one another, and hence no particular evaluation order can be fixed among them. This leads to non-deterministic results because the evaluation order is arbitrary: sales['Bounce', 2001] can be 40000, 50000, or the sum of Bounce sales for years 1999 and 2000. Oracle prevents this by disallowing multiple assignments when AUTOMATIC ORDER is specified. However, multiple assignments are fine in sequential order models. If SEQUENTIAL ORDER were specified instead of AUTOMATIC ORDER in the preceding example, the result of sales['Bounce', 2001] would be 40000.
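For reference, a sketch of the sequential-order form of the same assignments, in which the last rule written wins and sales['Bounce', 2001] ends up as 40000:

RULES SEQUENTIAL ORDER
(sales['Bounce', 2001] = sales['Bounce', 2000] + sales['Bounce', 1999],
 sales['Bounce', 2001] = 50000,
 sales['Bounce', 2001] = 40000)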
Consider, for example, a rule such as sales['Bounce', 2003] = sales['Bounce', 2001] + sales['Bounce', 2002]. The cell for product Bounce and year 2003, if it exists, gets updated with the sum of Bounce sales for years 2001 and 2002; otherwise, it gets created. An optional UPSERT keyword can be specified in the MODEL clause to make this upsert semantic explicit. Alternatively, the UPDATE option forces strict update mode. In this mode, the rule is ignored if the cell it references on the left side does not exist.
You can specify an UPDATE or UPSERT option at the global level in the RULES clause, in which case all rules operate in the respective mode. These options can also be specified at a local level with each rule, in which case they override the global behavior. For example, in the following specification:
RULES UPDATE (UPDATE s['Bounce',2001] = sales['Bounce',2000] + sales['Bounce',1999], UPSERT s['Y Box', 2001] = sales['Y Box', 2000] + sales['Y Box', 1999], sales['Mouse Pad', 2001] = sales['Mouse Pad', 2000] + sales['Mouse Pad',1999])
The UPDATE option is specified at the global level, so the first and third rules operate in update mode. The second rule operates in upsert mode because an UPSERT keyword is specified with that rule. Note that no option was specified for the third rule, and hence it inherits the update behavior from the global option. UPSERT creates a new cell corresponding to the one referenced on the left side of the rule when the cell is missing and the cell reference contains only positional references qualified by constants. Assuming we do not have cells for years greater than 2003, consider the following rule:
UPSERT sales['Bounce', year = 2004] = 1.1 * sales['Bounce', 2002]
This would not create any new cell because of the symbolic reference year = 2004. However, consider the following:
UPSERT sales['Bounce', 2004] = 1.1 * sales['Bounce', 2002]
This would create a new cell for product Bounce for year 2004. On a related note, new cells will not be created if any of the positional references is ANY. This is because ANY is a predicate that qualifies all dimension values, including NULL. If there is a positional reference ANY for a dimension d, then it can be considered as the predicate (d IS NOT NULL OR d IS NULL).
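For example, a rule along the following lines (a sketch) would update existing 2004 cells for every product but would not insert any new ones, because the ANY reference is treated as a predicate:

UPSERT sales[ANY, 2004] = 1.1 * sales[CV(product), 2002]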
The MODEL clause treats null measure values and missing cells in a default way, and provides options to treat them in other useful ways according to business logic, for example, to treat nulls as zero for arithmetic operations. By default, NULL cell measure values are treated the same way as nulls are treated elsewhere in SQL. For example, in the following rule:
sales['Bounce', 2001] = sales['Bounce', 1999] + sales['Bounce', 2000]
The right side expression would evaluate to NULL if Bounce sales for one of the years 1999 and 2000 is NULL. Similarly, aggregate functions in rules would treat NULL values in the same way as their regular behavior where NULL values are ignored during aggregation. Missing cells are treated as cells with NULL measure values. For example, in the preceding rule, if the cell for Bounce and 2000 is missing, then it is treated as a NULL value and the right side expression would evaluate to NULL.
The functions PRESENTV and PRESENTNNV let you handle missing cells explicitly. PRESENTV(cell, expr1, expr2) returns the first expression expr1 if the cell exists in the input data; otherwise, it returns the second expression expr2. For example, consider the following:

PRESENTV(sales['Bounce', 2000], 1.1*sales['Bounce', 2000], 100)

If the cell for product Bounce and year 2000 exists, it returns the corresponding sales multiplied by 1.1; otherwise, it returns 100. Note that if sales for the product Bounce for year 2000 is NULL, the preceding specification would return NULL. The PRESENTNNV function not only checks for the presence of a cell but also whether it is NULL or not. It returns the first expression expr1 if the cell exists and is not NULL; otherwise, it returns the second expression expr2. For example, consider the following:
PRESENTNNV(sales['Bounce', 2000], 1.1*sales['Bounce', 2000], 100)
This would return 1.1*sales['Bounce', 2000] if sales['Bounce', 2000] exists and is not NULL. Otherwise, it returns 100. Applications can use the IS PRESENT predicate in their models to check the presence of a cell in an explicit fashion. This predicate returns TRUE if the cell exists and FALSE otherwise. The preceding example using PRESENTNNV can be written using IS PRESENT as:
CASE WHEN sales['Bounce', 2000] IS PRESENT AND sales['Bounce', 2000] IS NOT NULL THEN 1.1 * sales['Bounce', 2000] ELSE 100 END
The IS PRESENT predicate, like the PRESENTV and PRESENTNNV functions, checks for cell existence in the input data, that is, the data as it existed before the execution of the MODEL clause. This enables you to initialize multiple measures of a cell newly inserted by an UPSERT rule. For example, if you want to initialize the sales and profit values of a cell for product Bounce and year 2003 to 1000 and 500 respectively when it does not exist in the data, you can do so with the following:
RULES (UPSERT sales['Bounce', 2003] = PRESENTV(sales['Bounce', 2003], sales['Bounce', 2003], 1000), UPSERT profit['Bounce', 2003] = PRESENTV(profit['Bounce', 2003], profit['Bounce', 2003], 500))
The presence checks performed by the PRESENTV functions in this formulation are both based on the input data. If the cell for Bounce and 2003 gets inserted by one of the rules, the PRESENTV check in the other rule still treats the cell as absent, regardless of their evaluation order. You can consider this behavior as a preprocessing step to rule evaluation that evaluates and replaces all PRESENTV and PRESENTNNV functions and IS PRESENT predicates by their respective values.
When the IGNORE NAV option is specified, null and absent measure values are instead treated as follows:
- 0 for numeric data
- Empty string for character/string data
- 01-JAN-2001 for datetime data
For example, consider a rule such as sales['Bounce', 2003] = sales['Bounce', 2002] + sales['Bounce', 2001] evaluated under IGNORE NAV, where the input to the MODEL clause does not have a cell for product Bounce and year 2002. Because of the IGNORE NAV option, the sales['Bounce', 2002] value defaults to 0 (as sales is of numeric type) instead of NULL. Thus, the sales['Bounce', 2003] value would be the same as that of sales['Bounce', 2001].
Cells with null values in their dimension keys can be qualified in the following ways:
- Positional reference using the wild card ANY, as in sales[ANY].
- Symbolic reference using the IS ANY predicate, as in sales[product IS ANY].
- Positional reference of NULL, as in sales[NULL].
- Symbolic reference using the IS NULL predicate, as in sales[product IS NULL].
Note that the symbolic reference sales[product = NULL] would not qualify nulls in the product dimension. This behavior is in conformance with the handling of the predicate "product = NULL" by SQL.
Reference Models
In addition to the multi-dimensional array on which rules operate, which is called the main model, one or more read-only multi-dimensional arrays, called reference models, can be created and referenced in the MODEL clause to act as look-up tables. Like the main model, a reference model is defined over a query block and has DIMENSION BY and MEASURES clauses to indicate its dimensions and measures respectively. A reference model is created by the following subclause:
REFERENCE model_name ON (query) DIMENSION BY (cols) MEASURES (cols) [reference options]
Like the main model, a multi-dimensional array for the reference model is built before evaluating the rules. But, unlike the main model, reference models are read-only in that their cells cannot be updated and no new cells can be inserted after they are built. Thus, the rules in the main model can access cells of a reference model, but they cannot update or insert new cells into the reference model. References to the cells of a reference model can only appear on the right side of rules. You can view reference models as look-up tables on which the rules of the main model perform look-ups to obtain cell values. The following is an example using a currency conversion table as a reference model:
CREATE TABLE dollar_conv_tbl(country VARCHAR2(30), exchange_rate NUMBER); INSERT INTO dollar_conv_tbl VALUES('Poland', 0.25); INSERT INTO dollar_conv_tbl VALUES('France', 0.14); ...
Now, to convert the projected sales of Poland and France for 2003 to the US dollar, you can use the dollar conversion table as a reference model as in the following:
SELECT country, year, sales, dollar_sales FROM sales_view GROUP BY country, year MODEL REFERENCE conv_ref ON (SELECT country, exchange_rate FROM dollar_conv_tbl) DIMENSION BY (country) MEASURES (exchange_rate) IGNORE NAV MAIN conversion DIMENSION BY (country, year) MEASURES (SUM(sales) sales, SUM(sales) dollar_sales) IGNORE NAV RULES (dollar_sales['France', 2003] = sales[CV(country), 2002] * 1.02 * conv_ref.exchange_rate['France'], dollar_sales['Poland', 2003] = sales['Poland', 2002] * 1.05 * exchange_rate['Poland']);
A one-dimensional reference model named conv_ref is created on rows from the table dollar_conv_tbl, and its measure exchange_rate is referenced in the rules of the main model. The main model (called conversion) has two dimensions, country and year, whereas the reference model conv_ref has one dimension, country. Note the different styles of accessing the exchange_rate measure of the reference model. For France, it is rather explicit, with the model_name.measure_name
notation conv_ref.exchange_rate, whereas for Poland, it is a simple measure_name reference exchange_rate. The former notation needs to be used to resolve any ambiguities in column names across main and reference models. Growth rates, in this example, are hard coded in the rules: the growth rate for France is 2% and that of Poland is 5%. But they could come from a separate table on which you define another reference model. Assume that you have a table growth_rate_tbl(country, year, growth_rate) defined as the following:
CREATE TABLE growth_rate_tbl(country VARCHAR2(30), year NUMBER, growth_rate NUMBER);
INSERT INTO growth_rate_tbl VALUES('Poland', 2002, ...);
INSERT INTO growth_rate_tbl VALUES('Poland', 2003, ...);
...
INSERT INTO growth_rate_tbl VALUES('France', 2002, ...);
INSERT INTO growth_rate_tbl VALUES('France', 2003, ...);
Then the following query computes the projected sales in dollars for 2003 for all countries:
SELECT country, year, sales, dollar_sales FROM sales_view GROUP BY country, year MODEL REFERENCE conv_ref ON (SELECT country, exchange_rate FROM dollar_conv_tbl) DIMENSION BY (country c) MEASURES (exchange_rate) IGNORE NAV REFERENCE growth_ref ON (SELECT country, year, growth_rate FROM growth_rate_tbl) DIMENSION BY (country c, year y) MEASURES (growth_rate) IGNORE NAV MAIN projection DIMENSION BY (country, year) MEASURES (SUM(sales) sales, 0 dollar_sales) IGNORE NAV RULES (dollar_sales[ANY, 2003] = sales[CV(country), 2002] * growth_rate[CV(country), CV(year)] * exchange_rate[CV(country)]);
This query shows the capability of the MODEL clause in dealing with and relating objects of different dimensionality. Reference model conv_ref has one dimension while the reference model growth_ref and the main model have two dimensions. Dimensions in the single cell references on reference models are specified using the CV function, thus relating the cells in the main model with the reference model. This specification, in effect, performs a relational join between the main and reference models. Reference models also help you convert keys to sequence numbers, perform computations using the sequence numbers (for example, where a prior period would be used in a subtraction operation), and then convert the sequence numbers back to keys. For example, consider a view that assigns sequence numbers to years:
CREATE or REPLACE VIEW year_2_seq (i, year) AS SELECT ROW_NUMBER() OVER (ORDER BY calendar_year), calendar_year FROM (SELECT DISTINCT calendar_year FROM TIMES);
This view can define two lookup tables: integer-to-year i2y, which maps sequence numbers to years, and year-to-integer y2i, which performs the reverse mapping. The references y2i.i[year] and y2i.i[year] - 1 return the sequence numbers of the current and previous years respectively, and the reference i2y.y[y2i.i[year]-1] returns the year key value of the previous year. The following query demonstrates such a usage of reference models:
SELECT country, product, year, sales, prior_period FROM sales_view MODEL REFERENCE y2i ON (SELECT year, i FROM year_2_seq) DIMENSION BY (year y) MEASURES (i) REFERENCE i2y ON (SELECT year, i FROM year_2_seq) DIMENSION BY (i) MEASURES (year y) MAIN projection2 PARTITION BY (country) DIMENSION BY (product, year) MEASURES (sales, CAST(NULL AS NUMBER) prior_period) (prior_period[ANY, ANY] = sales[CV(product), i2y.y[y2i.i[CV(year)]-1]]) ORDER BY country, product, year;
Nesting of reference model cell references is evident in the preceding example. The cell reference on the reference model y2i is nested inside the cell reference on i2y which, in turn, is nested in the cell reference on the main SQL model. There is no limitation on the levels of nesting you can have on reference model cell references. However, you can only have two levels of nesting on the main SQL model cell references. Finally, the following are restrictions on the specification and usage of reference models:
- The query block on which the reference model is defined cannot be correlated to an outer query.
- Reference models must be named and their names should be unique.
- All references to the cells of a reference model should be single cell references.
The following topics are discussed in this section:
- FOR Loops
- Iterative Models
- Ordered Rules
- Unique Dimensions Versus Unique Single References
FOR Loops
The MODEL clause provides a FOR construct that can be used inside rules to express computations more compactly. It can be used on both the left and right side of a rule. For example, consider the following computation, which estimates the sales of several products for 2004 to be 10% higher than their sales for 2003:
RULES UPSERT (sales['Bounce', 2004] = 1.1 * sales['Bounce', 2003], sales['Standard Mouse Pad', 2004] = 1.1 * sales['Standard Mouse Pad', 2003], ... sales['Y Box', 2004] = 1.1 * sales['Y Box', 2003])
The UPSERT option is used in this computation so that cells for these products and 2004 will be inserted if they are not previously present in the multi-dimensional array. This is rather bulky as you have to have as many rules as there are products. Using the FOR construct, this computation can be represented compactly and with exactly the same semantics as in:
RULES UPSERT (sales[FOR product IN ('Bounce', 'Standard Mouse Pad', ..., 'Y Box'), 2004] = 1.1 * sales[CV(product), 2003])
If you write a specification similar to this, but without the FOR keyword as in the following:
RULES UPSERT
(sales[product IN ('Bounce', 'Standard Mouse Pad', ..., 'Y Box'), 2004] = 1.1 * sales[CV(product), 2003])
You would get UPDATE semantics even though you have specified UPSERT. In other words, existing cells will be updated but no new cells will be created by this specification. This is because the symbolic multi-cell reference on product is treated as a predicate. You can view a FOR construct as a macro that generates multiple rules with positional references from a single rule, thus preserving the UPSERT semantics. Conceptually, the following rule:
sales[FOR product IN ('Bounce', 'Standard Mouse Pad', ..., 'Y Box'), FOR year IN (2004, 2005)] = 1.1 * sales[CV(product), CV(year)-1]
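is treated as a collection of positional rules, one for each combination of values, along the lines of the following sketch (the products elided in the original list are omitted here):

sales['Bounce', 2004] = 1.1 * sales['Bounce', 2003],
sales['Bounce', 2005] = 1.1 * sales['Bounce', 2004],
sales['Standard Mouse Pad', 2004] = 1.1 * sales['Standard Mouse Pad', 2003],
sales['Standard Mouse Pad', 2005] = 1.1 * sales['Standard Mouse Pad', 2004],
...
sales['Y Box', 2004] = 1.1 * sales['Y Box', 2003],
sales['Y Box', 2005] = 1.1 * sales['Y Box', 2004]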
The FOR construct in the preceding examples is of type FOR dimension IN (list of values). Values in the list should either be constants or expressions involving constants. In this example, there are separate FOR constructs on product and year. It is also possible to specify all dimensions using one FOR construct. Consider, for example, that we want to estimate sales only for Bounce in 2004, Standard Mouse Pad in 2005, and Y Box in 2004 and 2005. This can be formulated as the following:
sales[FOR (product, year) IN (('Bounce', 2004), ('Standard Mouse Pad', 2005), ('Y Box', 2004), ('Y Box', 2005))] = 1.1 * sales[CV(product), CV(year)-1]
This FOR construct should be of the form FOR (d1, ..., dn) IN ((d1_val1, ..., dn_val1), ..., (d1_valm, ..., dn_valm)) when there are n dimensions d1, ..., dn and m values in the list. In some cases, the list of values for a dimension in FOR can be stored in a table or they can be the result of a subquery. Oracle Database provides a flavor of the FOR construct, FOR dimension IN (subquery), to handle these cases. For example,
assume that the products of interest are stored in a table interesting_products, then the following rule estimates their sales in 2004 and 2005:
sales[FOR product IN (SELECT product_name FROM interesting_products), FOR year IN (2004, 2005)] = 1.1 * sales[CV(product), CV(year)-1]
As another example, consider the scenario where you want to introduce a new country, called new_country, with sales that mimic those of Poland. This is accomplished by issuing the following statement:
SELECT country, product, year, s FROM sales_view MODEL DIMENSION BY (country, product, year) MEASURES (sales s) IGNORE NAV RULES UPSERT (s[FOR (country, product, year) IN (SELECT DISTINCT 'new_country', product, year FROM sales_view WHERE country = 'Poland')] = s['Poland',CV(),CV()]) ORDER BY country, year, product;
Note the multi-column IN-list produced by evaluating the subquery in this specification. The subquery used to obtain the IN-list cannot be correlated to outer query blocks and it should return fewer than 10,000 rows. Otherwise, Oracle returns an error. The MODEL clause has a 10,000 rule limit, and each combination of values created in a cell reference creates a rule. Therefore, FOR constructs should be designed so that the combination of values they generate does not exceed 10,000. If the FOR construct is used to densify sparse data on multiple dimensions, it is possible to encounter the 10,000 rule limit. In those cases, densification can be done outside the MODEL clause using a partitioned outer join. See Chapter 21, "SQL for Analysis and Reporting" for further information. If you know that the values of interest come from a discrete domain, you can use the FOR construct FOR dimension FROM value1 TO value2 [INCREMENT | DECREMENT] value3. This specification results in values between value1 and value2 by starting from value1 and incrementing (or decrementing) by value3. The values value1, value2, and value3 should be constants or expressions involving constants. For example, the following rule:
sales['Bounce', FOR year FROM 2001 TO 2005 INCREMENT 1] = sales['Bounce', year=CV(year)-1] * 1.2

is treated as the following sequence of rules:

sales['Bounce', 2001] = sales['Bounce', 2000] * 1.2,
sales['Bounce', 2002] = sales['Bounce', 2001] * 1.2,
...
sales['Bounce', 2005] = sales['Bounce', 2004] * 1.2
This kind of FOR construct can be used for dimensions of numeric, date, and datetime datatypes. The increment/decrement expression value3 should be numeric for numeric dimensions and can be numeric or interval for dimensions of date or datetime types. Also, value3 should be positive. Oracle will return an error if you use FOR year FROM 2005 TO 2001 INCREMENT -1. You should use either FOR year FROM 2005 TO 2001 DECREMENT 1 or FOR year FROM 2001 TO 2005 INCREMENT 1. Oracle will also report an error if the domain (or the range) is empty, as in FOR year FROM 2005 TO 2001 INCREMENT 1. To generate string values, you can use the FOR construct FOR dimension LIKE string FROM value1 TO value2 [INCREMENT | DECREMENT] value3. The string string should contain only one % character. This specification generates strings by replacing the % in string with values between value1 and value2, using the appropriate increment/decrement value value3. For example, the following rule:
sales[FOR product LIKE 'product-%' FROM 1 TO 3 INCREMENT 1, 2003] = sales[CV(product), 2002] * 1.2

is treated as the following:

sales['product-1', 2003] = sales['product-1', 2002] * 1.2,
sales['product-2', 2003] = sales['product-2', 2002] * 1.2,
sales['product-3', 2003] = sales['product-3', 2002] * 1.2

For this kind of FOR construct, value1, value2, and value3 should all be of numeric type.
In SEQUENTIAL ORDER models, rules represented by a FOR construct are evaluated in the order they are generated. In contrast, rule evaluation order is dependency based if AUTOMATIC ORDER is specified. For example, the evaluation order for the rules represented by the rule:
sales['Bounce', FOR year FROM 2004 TO 2001 DECREMENT 1] = 1.1 * sales['Bounce', CV(year)-1]
is, in a SEQUENTIAL ORDER model, the order in which they are generated:

sales['Bounce', 2004] = 1.1 * sales['Bounce', 2003],
sales['Bounce', 2003] = 1.1 * sales['Bounce', 2002],
sales['Bounce', 2002] = 1.1 * sales['Bounce', 2001],
sales['Bounce', 2001] = 1.1 * sales['Bounce', 2000]

In an AUTOMATIC ORDER model, the same rules would instead be evaluated according to their dependencies, starting with the rule for 2001 and ending with the rule for 2004.
Iterative Models
Using the ITERATE option of the MODEL clause, you can evaluate rules iteratively for a certain number of times, which you can specify as an argument to the ITERATE clause. ITERATE can be specied only for SEQUENTIAL ORDER models and such models are referred to as iterative models. For example, consider the following:
SELECT x, s FROM DUAL MODEL DIMENSION BY (1 AS x) MEASURES (1024 AS s) RULES UPDATE ITERATE (4) (s[1] = s[1]/2);
In Oracle, the table DUAL has only one row. Hence this model defines a one-dimensional array, dimensioned by x, with a measure s and a single element s[1] = 1024. The evaluation of the rule s[1] = s[1]/2 will be repeated four times. The result of this query is a single row with values 1 and 64 for columns x and s respectively. The number of iterations argument for the ITERATE clause should be a positive integer constant. Optionally, you can specify an early termination condition to stop rule evaluation before reaching the maximum iteration. This condition is specified in the UNTIL subclause of ITERATE and is checked at the end of each iteration. So, you will have at least one iteration when ITERATE is specified. The syntax of the ITERATE clause is:
ITERATE (number_of_iterations) [ UNTIL (condition) ]
Iterative evaluation stops either after finishing the specified number of iterations or when the termination condition evaluates to TRUE, whichever comes first. In some cases, you may want the termination condition to be based on the change, across iterations, in the value of a cell. Oracle provides a mechanism to specify such conditions in that it enables you to access cell values as they existed before and after the current iteration in the UNTIL condition. Oracle's PREVIOUS function takes a
single cell reference as an argument and returns the measure value of the cell as it existed after the previous iteration. You can also access the current iteration number by using the system variable ITERATION_NUMBER, which starts at value 0 and is incremented after each iteration. By using PREVIOUS and ITERATION_NUMBER, you can construct complex termination conditions. Consider the following iterative model that specifies iteration over the rules until the change in the value of s[1] across successive iterations falls below 1, up to a maximum of 1000 times:
SELECT x, s, iterations FROM DUAL MODEL DIMENSION BY (1 AS x) MEASURES (1024 AS s, 0 AS iterations) RULES ITERATE (1000) UNTIL ABS(PREVIOUS(s[1]) - s[1]) < 1 (s[1] = s[1]/2, iterations[1] = ITERATION_NUMBER);
The absolute value function (ABS) can be helpful for termination conditions because you may not know whether the most recent value is positive or negative. The rules in this model are iterated 11 times, because after the 11th iteration the value of s[1] is 0.5 and the change from the previous value (1) falls below 1. This query results in a single row with values 1, 0.5, and 10 for x, s, and iterations respectively. You can use the PREVIOUS function only in the UNTIL condition. However, ITERATION_NUMBER can be used anywhere in the main model. In the following example, ITERATION_NUMBER is used in cell references:
SELECT country, product, year, sales FROM sales_view MODEL PARTITION BY (country) DIMENSION BY (product, year) MEASURES (sales sales) IGNORE NAV RULES ITERATE(3) (sales['Bounce', 2002 + ITERATION_NUMBER] = sales['Bounce', 1999 + ITERATION_NUMBER]);
This statement achieves an array copy of the sales of Bounce from the cells for years 1999 through 2001 to the cells for years 2002 through 2004.
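Unrolled, the three iterations correspond to the following assignments (ITERATION_NUMBER takes the values 0, 1, and 2):

sales['Bounce', 2002] = sales['Bounce', 1999],
sales['Bounce', 2003] = sales['Bounce', 2000],
sales['Bounce', 2004] = sales['Bounce', 2001]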
Rules in a model can have circular (or cyclical) dependencies. A cyclic dependency can be of the form "rule A depends on B and rule B depends on A" or of the self-cyclic "rule depending on itself" form. An example of the former is:
sales['Bounce', 2002] = 1.5 * sales['Y Box', 2002],
sales['Y Box', 2002] = 100000 / sales['Bounce', 2002]
An example of the self-cyclic form is a rule that references the same measure and cell on both sides, such as sales['Bounce', 2002] = 1.2 * sales['Bounce', 2002]. However, there is no self-cycle in the following rule, as different measures are being accessed on the left and right side:
projected_sales['Bounce', 2002] = 25000 / sales['Bounce', 2002]
When the analysis of an AUTOMATIC ORDER model finds that the rules have no circular dependencies (that is, the dependency graph is acyclic), Oracle Database will evaluate the rules in their dependency order. For example, in the following AUTOMATIC ORDER model:
MODEL DIMENSION BY (prod, year) MEASURES (sale sales) IGNORE NAV RULES AUTOMATIC ORDER (sales['SUV', 2001] = 10000, sales['Standard Mouse Pad', 2001] = sales['Finding Fido', 2001] * 0.10 + sales['Boat', 2001] * 0.50, sales['Boat', 2001] = sales['Finding Fido', 2001] * 0.25 + sales['SUV', 2001]* 0.75, sales['Finding Fido', 2001] = 20000)
Rule 2 depends on rules 3 and 4, while rule 3 depends on rules 1 and 4, and rules 1 and 4 do not depend on any rule. Oracle, in this case, will find that the rule dependencies are acyclic and will evaluate the rules in one of the possible evaluation orders, (1, 4, 3, 2) or (4, 1, 3, 2). This type of rule evaluation is called an ACYCLIC algorithm. In some cases, Oracle Database may not be able to ascertain that your model is acyclic even though there is no cyclical dependency among the rules. This can happen if you have complex expressions in your cell references. Oracle Database then assumes that the rules are cyclic and employs a CYCLIC algorithm that evaluates the model iteratively based on the rules and data. Iteration stops as soon as convergence is reached and the results are returned. Convergence is defined as the state in which further executions of the model will not change the values of any of the cells in the model. Convergence is certain to be reached when there are no cyclical dependencies.
If your AUTOMATIC ORDER model has rules with cyclical dependencies, Oracle will employ the CYCLIC algorithm mentioned earlier. Results are produced if convergence can be reached within the number of iterations Oracle uses for the algorithm. Otherwise, Oracle reports a cycle detection error. You can circumvent this problem by manually ordering the rules and specifying SEQUENTIAL ORDER.
Ordered Rules
An ordered rule is one that has ORDER BY specified on the left side. It accesses cells in the order prescribed by ORDER BY and applies the right side computation. When you have a positional ANY or symbolic references on the left side of a rule but without an ORDER BY clause, Oracle might return an error saying that the rule's results depend on the order in which cells are accessed and hence are non-deterministic. Consider the following SEQUENTIAL ORDER model:
SELECT t, s FROM sales, times WHERE sales.time_id = times.time_id GROUP BY calendar_year MODEL DIMENSION BY (calendar_year t) MEASURES (SUM(amount_sold) s) RULES SEQUENTIAL ORDER (s[ANY] = s[CV(t)-1]);
This query attempts to set, for all years t, the sales s value for a year to the sales value of the prior year. Unfortunately, the result of this rule depends on the order in which the cells are accessed. If cells are accessed in ascending order of year, the result is that of the third column in Table 22-1. If they are accessed in descending order, the result is that of the fourth column.
Table 22-1 Ordered Rules

t      s           If ascending    If descending
-----  ----------  --------------  --------------
1998   1210000982  null            null
1999   1473757581  null            1210000982
2000   2376222384  null            1473757581
2001   1267107764  null            2376222384
If you want the cells to be considered in descending order and get the result given in column 4, you should specify:
SELECT t, s FROM sales, times WHERE sales.time_id = times.time_id GROUP BY calendar_year MODEL DIMENSION BY (calendar_year t) MEASURES (SUM(amount_sold) s) RULES SEQUENTIAL ORDER (s[ANY] ORDER BY t DESC = s[CV(t)-1]);
In general, you can use any ORDER BY specication as long as it produces a unique order among cells that qualify the left side cell reference. Expressions in the ORDER BY of a rule can involve constants, measures and dimension keys and you can specify the ordering options [ASC | DESC] [NULLS FIRST | NULLS LAST] to get the order you want. You can also specify ORDER BY for rules in an AUTOMATIC ORDER model to make Oracle consider cells in a particular order during rule evaluation. Rules are never considered self-cyclic if they have ORDER BY. For example, to make the following AUTOMATIC ORDER model with a self-cyclic formula:
MODEL DIMENSION BY (calendar_year t) MEASURES (SUM(amount_sold) s) RULES AUTOMATIC ORDER (s[ANY] = s[CV(t)-1])
acyclic, you need to provide the order in which cells need to be accessed for evaluation using ORDER BY. For example, you can say:
s[ANY] ORDER BY t = s[CV(t) - 1]
Then Oracle will pick an ACYCLIC algorithm (which is certain to produce the result) for formula evaluation.
Unique Dimensions Versus Unique Single References

The MODEL clause, with its default UNIQUE DIMENSION semantics, requires that the PARTITION BY and DIMENSION BY keys uniquely identify each row of the input. Consider a query such as the following, in which (country, product) does not uniquely identify the rows selected from sales_view:

SELECT country, product, sales FROM sales_view
WHERE country IN ('France', 'Poland')
MODEL UNIQUE DIMENSION
  PARTITION BY (country) DIMENSION BY (product) MEASURES (sales sales) IGNORE NAV
  RULES UPSERT
  (sales['Bounce'] = sales['All Products'] * 0.24);
This would return a uniqueness violation error as the rowset input to model is not unique on country and product:
ERROR at line 2: ORA-32638: Non unique addressing in MODEL dimensions
Input to the MODEL clause in this case is unique on country, product, and year as shown in:
COUNTRY   PRODUCT                         YEAR   SALES
--------  ------------------------------  -----  --------
Italy     1.44MB External 3.5" Diskette   1998   3141.84
Italy     1.44MB External 3.5" Diskette   1999   3086.87
Italy     1.44MB External 3.5" Diskette   2000   3440.37
Italy     1.44MB External 3.5" Diskette   2001   855.23
...
If you want to relax this uniqueness checking, you can specify the UNIQUE SINGLE REFERENCE keyword. This can save processing time. In this case, the MODEL clause checks the uniqueness of only the single cell references appearing on the right side of rules. So the query that returned the uniqueness violation error would be successful if you specified UNIQUE SINGLE REFERENCE instead of UNIQUE DIMENSION. Another difference between UNIQUE DIMENSION and UNIQUE SINGLE REFERENCE semantics is the number of cells that can be updated by a rule with a single cell reference on the left side. In the case of UNIQUE DIMENSION, such a rule can update at most one row because only one cell would qualify the single cell reference on the left side. This is because the input rowset would be unique on the PARTITION BY and
DIMENSION BY keys. With UNIQUE SINGLE REFERENCE, all cells that qualify the left side single cell reference would be updated by the rule.
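As a sketch, the earlier query succeeds if UNIQUE SINGLE REFERENCE is specified instead of UNIQUE DIMENSION (the SELECT list shown is illustrative):

SELECT country, product, sales FROM sales_view
WHERE country IN ('France', 'Poland')
MODEL UNIQUE SINGLE REFERENCE
  PARTITION BY (country) DIMENSION BY (product) MEASURES (sales sales) IGNORE NAV
  RULES UPSERT
  (sales['Bounce'] = sales['All Products'] * 0.24);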
Rules and Restrictions when Using SQL for Modeling

The only columns that can be updated are the columns specified in the MEASURES subclause of the main SQL model; measures of reference models cannot be updated. The MODEL clause is evaluated after all other clauses in the query block except SELECT DISTINCT and ORDER BY; those clauses and the expressions in the SELECT list are evaluated after the MODEL clause. If your query has a MODEL clause, then the query's SELECT and ORDER BY lists cannot contain aggregates or analytic functions. If needed, these can be specified in the PARTITION BY, DIMENSION BY, and MEASURES lists and must be aliased; the aliases can then be used in the SELECT or ORDER BY clauses. In the following example, the analytic function RANK is specified and aliased in the MEASURES list of the MODEL clause, and its alias is used in the SELECT list so that the outer query can order the resulting rows based on their ranks.
SELECT country, product, year, s, rnk
FROM (SELECT country, product, year, s, rnk
      FROM sales_view
      MODEL PARTITION BY (country) DIMENSION BY (product, year)
        MEASURES (sales s, year y, RANK() OVER (ORDER BY sales) rnk)
        RULES UPSERT
        (s['Bounce Increase 90-99', 2001] =
           REGR_SLOPE(s, y)['Bounce', year BETWEEN 1990 AND 2000],
         s['Bounce', 2001] = s['Bounce', 2000] *
           (1 + s['Bounce Increase 90-99', 2001])))
WHERE product <> 'Bounce Increase 90-99'
ORDER BY country, year, rnk, product;
When there is a multi-cell reference on the right hand side of a rule, you need to apply a function to aggregate the measure values of multiple cells referenced into a single value. You can use any kind of aggregate function for this purpose: regular, OLAP aggregate (inverse percentile, hypothetical rank and distribution), or user-dened aggregate. You cannot use analytic functions (functions that use the OVER clause) in rules.
Only rules with positional single cell references on the left side have UPSERT semantics. All other rules have UPDATE semantics, even when you specify the UPSERT option for them. Negative increments are not allowed in FOR loops. Also, no empty FOR loops are allowed. FOR d FROM 2005 TO 2001 INCREMENT -1 is illegal. You should use FOR d FROM 2005 TO 2001 DECREMENT 1 instead. FOR d FROM 2005 TO 2001 INCREMENT 1 is illegal as it designates an empty loop. You cannot use nested query expressions (subqueries) in rules except in the FOR construct. For example, it would be illegal to issue the following:
SELECT * FROM sales_view WHERE country = 'Poland' MODEL DIMENSION BY (product, year) MEASURES (sales sales) RULES UPSERT (sales['Bounce', 2003] = sales['Bounce', 2002] + (SELECT SUM(sales) FROM sales_view));
This is because the rule has a subquery on its right side. Instead, you can rewrite the preceding query in the following legal way:
SELECT * FROM sales_view WHERE country = 'Poland' MODEL DIMENSION BY (product, year) MEASURES (sales sales, (SELECT SUM(sales) FROM sales_view) AS grand_total) RULES UPSERT (sales['Bounce', 2003] =sales['Bounce', 2002] + grand_total['Bounce', 2002]);
You can also use subqueries in the FOR construct specified on the left side of a rule. However, they:
- Cannot be correlated
- Must return fewer than 10,000 rows
- Cannot be a query defined in the WITH clause
- Will make the cursor unsharable
Nested cell references must be single cell references. Aggregates on nested cell references are not supported. So, it would be illegal to say s['Bounce', MAX(best_year)['Bounce', ANY]].
Only one level of nesting is supported for nested cell references on the main model. So, for example, s['Bounce', best_year['Bounce', 2001]] is legal, but s['Bounce', best_year['Bounce', best_year['Bounce', 2001]]] is not. Nested cell references appearing on the left side of rules in an AUTOMATIC ORDER model should not be updated in any rule of the model. This restriction ensures that the rule dependency relationships do not arbitrarily change (and hence cause non-deterministic results) due to updates to reference measures. There is no such restriction on nested cell references in a SEQUENTIAL ORDER model. Also, this restriction does not apply to nested references appearing on the right side of rules in either SEQUENTIAL ORDER or AUTOMATIC ORDER models.
- The query defining the reference model cannot be correlated to an outer query. It can, however, be a query with subqueries, views, and so on.
- Reference models cannot have a PARTITION BY clause.
- Reference models cannot be updated.
The following topics are discussed in this section:
- Parallel Execution
- Aggregate Computation
- Using EXPLAIN PLAN to Understand Model Queries
Parallel Execution
MODEL clause computation is scalable in terms of the number of processors you have. Scalability is achieved by performing the MODEL computation in parallel across the partitions defined by the PARTITION BY clause. Data is distributed among processing elements (also called parallel query slaves) based on the PARTITION BY key values such that all rows with the same values for the PARTITION BY keys will go to the same slave. Note that the internal processing of partitions will not create a one-to-one match of logical and internally processed partitions. This way, each slave can finish MODEL clause computation independent
of other slaves. The data partitioning can be hash based or range based. Consider the following MODEL clause:
MODEL PARTITION BY (country) DIMENSION BY (product, time) MEASURES (sales) RULES UPDATE (sales['Bounce', 2002] = 1.2 * sales['Bounce', 2001], sales['Car', 2002] = 0.8 * sales['Car', 2001])
Here, input data will be partitioned among slaves based on the PARTITION BY key country, and this partitioning can be hash or range based. Each slave evaluates the rules on the data it receives. Parallelism of the model computation is governed, or limited, by the way you specify the MODEL clause. If your MODEL clause has no PARTITION BY keys, then the computation cannot be parallelized (with exceptions mentioned in the following). If the PARTITION BY keys have very low cardinality, then the degree of parallelism will be limited. In such cases, Oracle identifies DIMENSION BY keys that can be used for partitioning. For example, consider a MODEL clause equivalent to the preceding one, but without PARTITION BY keys, as in the following:
MODEL
  DIMENSION BY (country, product, time) MEASURES (sales)
  RULES UPDATE
  (sales[ANY, 'Bounce', 2002] = 1.2 * sales[CV(country), 'Bounce', 2001],
   sales[ANY, 'Car', 2002] = 0.8 * sales[CV(country), 'Car', 2001])
In this case, Oracle Database will identify that it can use the DIMENSION BY key country for partitioning and uses country as the basis of internal partitioning. It partitions the data among slaves on country and thus effects parallel execution.
Aggregate Computation
The MODEL clause processes aggregates in two different ways: first, the regular fashion, in which data in the partition is scanned and aggregated, and second, an efficient window-style aggregation. The first type, as illustrated in the following, introduces a new dimension member ALL_2002_products and computes its value to be the sum of year 2002 sales for all products:
MODEL PARTITION BY (country) DIMENSION BY (product, time) MEASURES (sale sales)
  RULES UPSERT
  (sales['ALL_2002_products', 2002] = SUM(sales)[ANY, 2002])
To evaluate the aggregate sum in this case, each partition will be scanned to find the cells for 2002 for all products, and they will be aggregated. If the left side of the rule were to reference multiple cells, then Oracle would have to compute the right side aggregate by scanning the partition for each cell referenced on the left. For example, consider the following:
MODEL PARTITION BY (country) DIMENSION BY (product, time)
  MEASURES (sale sales, 0 avg_exclusive)
  RULES UPDATE
  (avg_exclusive[ANY, 2002] = AVG(sales)[product <> CV(product), CV(time)])
This rule calculates a measure called avg_exclusive for every product in 2002. The measure avg_exclusive is defined as the average sales of all products excluding the current product. In this case, Oracle scans the data in a partition for every product in 2002 to calculate the aggregate, and this may be expensive. Oracle Database will optimize the evaluation of such aggregates in some scenarios with window-style computation as used in analytic functions. These scenarios involve rules with multi-cell references on their left side that compute window computations such as moving averages, cumulative sums, and so on. Consider the following example:
MODEL PARTITION BY (country) DIMENSION BY (product, time)
  MEASURES (sale sales, 0 mavg)
  RULES UPDATE
  (mavg[product IN ('Bounce', 'Y Box', 'Mouse Pad'), ANY] =
     AVG(sales)[CV(product), time BETWEEN CV(time) AND CV(time) - 2])
It computes the moving average of sales for products Bounce, Y Box, and Mouse Pad over a three year period. It would be very inefficient to evaluate the aggregate by scanning the partition for every cell referenced on the left side. Oracle identifies the computation as being in window-style and evaluates it efficiently. It sorts the input on product, time and then scans the data once to compute the moving average. You can view this rule as an analytic function being applied on the sales data for products Bounce, Y Box, and Mouse Pad:
AVG(sales) OVER (PARTITION BY product ORDER BY time RANGE BETWEEN 2 PRECEDING AND CURRENT ROW)
This computation style is called WINDOW (IN MODEL) SORT. This style of aggregation is applicable when the rule has a multi-cell reference on its left side with no ORDER BY, has a simple aggregate (SUM, COUNT, MIN, MAX, STDDEV, and VAR) on its right side, only one dimension on the right side has a boolean predicate (<, <=, >, >=, BETWEEN), and all other dimensions on the right are qualified with CV.
FROM sales_view
  WHERE country IN ('Italy', 'Japan')
  MODEL UNIQUE DIMENSION
    PARTITION BY (country) DIMENSION BY (prod, year) MEASURES (sale sales)
    RULES UPSERT
    (sales['Bounce', 2003] = AVG(sales)[ANY, 2002] * 1.24,
     sales[prod <> 'Bounce', 2003] = sales['Bounce', 2003] * 0.25);
Example 1 Calculating Sales Differences Show the sales for Italy and Spain and the difference between the two for each product. The difference should be placed in a new row with country = 'Diff Italy-Spain'.
SELECT product, country, sales
  FROM sales_view
  WHERE country IN ('Italy', 'Spain')
  GROUP BY product, country
  MODEL
    PARTITION BY (product) DIMENSION BY (country) MEASURES (SUM(sales) AS sales)
Example 2 Calculating Percentage Change If sales for each product in each country grew (or declined) at the same monthly rate from November 2000 to December 2000 as they did from October 2000 to November 2000, what would the fourth quarter's sales be for the whole company and for each country?
SELECT country, SUM(sales)
FROM (SELECT product, country, month, sales
      FROM sales_view2
      WHERE year = 2000 AND month IN ('October', 'November')
      MODEL
        PARTITION BY (product, country) DIMENSION BY (month) MEASURES (sale sales)
        RULES
        (sales['December'] = (sales['November'] / sales['October']) * sales['November']))
GROUP BY GROUPING SETS ((), (country));
Example 3 Calculating Net Present Value You want to calculate the net present value (NPV) of a series of periodic cash flows. Your scenario involves two projects, each of which starts with an initial investment at time 0, represented as a negative cash flow. The initial investment is followed by three years of positive cash flow. First, create a table (cash_flow) and populate it with some data, as in the following statements:
CREATE TABLE cash_flow (year DATE, i INTEGER, prod VARCHAR2(3), amount NUMBER);
INSERT INTO cash_flow VALUES (TO_DATE('1999', 'YYYY'), 0, 'vcr', -100.00);
INSERT INTO cash_flow VALUES (TO_DATE('2000', 'YYYY'), 1, 'vcr', 12.00);
INSERT INTO cash_flow VALUES (TO_DATE('2001', 'YYYY'), 2, 'vcr', 10.00);
INSERT INTO cash_flow VALUES (TO_DATE('2002', 'YYYY'), 3, 'vcr', 20.00);
INSERT INTO cash_flow VALUES (TO_DATE('1999', 'YYYY'), 0, 'dvd', -200.00);
INSERT INTO cash_flow VALUES (TO_DATE('2000', 'YYYY'), 1, 'dvd', 22.00);
INSERT INTO cash_flow VALUES (TO_DATE('2001', 'YYYY'), 2, 'dvd', 12.00);
INSERT INTO cash_flow VALUES (TO_DATE('2002', 'YYYY'), 3, 'dvd', 14.00);
To calculate the NPV using a discount rate of 0.14, issue the following statement:
SELECT year, i, prod, amount, npv
  FROM cash_flow
  MODEL PARTITION BY (prod)
    DIMENSION BY (i)
    MEASURES (amount, 0 npv, year)
    RULES
    (npv[0] = amount[0],
     npv[i != 0] ORDER BY i =
       amount[CV()] / POWER(1.14, CV(i)) + npv[CV(i)-1]);

YEAR          I PRO     AMOUNT        NPV
--------- ----- --- ---------- ----------
01-AUG-99     0 dvd       -200       -200
01-AUG-00     1 dvd         22 -180.70175
01-AUG-01     2 dvd         12 -171.46814
01-AUG-02     3 dvd         14 -162.01854
01-AUG-99     0 vcr       -100       -100
01-AUG-00     1 vcr         12 -89.473684
01-AUG-01     2 vcr         10 -81.779009
01-AUG-02     3 vcr         20 -68.279579
Example 4 Calculating Using Simultaneous Equations You want your interest expenses to equal 30% of your net income (net = pay minus tax minus interest). Interest is tax deductible from gross income, and taxes are 38% of salary and 28% of capital gains. You have a salary of $100,000 and capital gains of $15,000. Net income, taxes, and interest expenses are unknown. Observe that this is a simultaneous equation (net depends on interest, which depends on net), so the ITERATE clause is included. First, create a table called ledger:
CREATE TABLE ledger (account VARCHAR2(20), balance NUMBER(10,2) );
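The remainder of this example is shown only as a hedged sketch: the account names and seed values below are assumptions chosen to match the scenario, not the guide's own listing.

INSERT INTO ledger VALUES ('Salary', 100000);
INSERT INTO ledger VALUES ('Capital_Gains', 15000);
INSERT INTO ledger VALUES ('Net', 0);
INSERT INTO ledger VALUES ('Tax', 0);
INSERT INTO ledger VALUES ('Interest', 0);

SELECT account, balance
  FROM ledger
  MODEL
    DIMENSION BY (account) MEASURES (balance)
    RULES ITERATE (100)
    (balance['Interest'] = balance['Net'] * 0.30,
     balance['Tax'] = (balance['Salary'] - balance['Interest']) * 0.38
                      + balance['Capital_Gains'] * 0.28,
     balance['Net'] = balance['Salary'] + balance['Capital_Gains']
                      - balance['Tax'] - balance['Interest']);

Because the three rules reference each other, ITERATE (100) re-evaluates them repeatedly; since every coefficient is less than 1, the values settle to a consistent solution well before the iteration limit, which a single sequential pass would not produce.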
Example 5 Calculating Using Regression The sales of Bounce in 2001 will increase in comparison to 2000 as they did in the last three years (between 1998 and 2000). Sales of Shaving Cream in 2001 will also increase in comparison to 2000 as they did between 1998 and 2000. To calculate the increase, use the regression function REGR_SLOPE as follows. Because we are calculating the next period's value, it is sufficient to add the slope to the 2000 value.
SELECT *
FROM (SELECT country, product, year, projected_sale, sales
      FROM sales_view
      WHERE country IN ('Italy', 'Japan')
        AND product IN ('Shaving Cream', 'Bounce')
      MODEL
        PARTITION BY (country) DIMENSION BY (product, year)
        MEASURES (sales sales, year y, CAST(NULL AS NUMBER) projected_sale)
        IGNORE NAV
        RULES UPSERT
        (projected_sale[FOR product IN ('Bounce', 'Shaving Cream'), 2001] =
           sales[CV(), 2000] +
           REGR_SLOPE(sales, y)[CV(), year BETWEEN 1998 AND 2000]))
ORDER BY country, product, year;
Example 6 Calculating Mortgage Amortization This example creates mortgage amortization tables for any number of customers, using information about mortgage loans selected from a table of mortgage facts. First, create two tables and insert needed data:
mortgage_facts: Holds information about individual customer loans, including the name of the customer, the fact about the loan that is stored in that row, and the value of that fact. The facts stored for this example are loan (Loan), annual interest rate (Annual_Interest), and number of payments (Payments) for the loan. Also, the values for two customers, Smith and Jones, are inserted.
CREATE TABLE mortgage_facts (customer VARCHAR2(20), fact VARCHAR2(20),
  amount NUMBER(10,2));
INSERT INTO mortgage_facts VALUES ('Smith', 'Loan', 100000);
INSERT INTO mortgage_facts VALUES ('Smith', 'Annual_Interest', 12);
INSERT INTO mortgage_facts VALUES ('Smith', 'Payments', 360);
INSERT INTO mortgage_facts VALUES ('Smith', 'Payment', 0);
INSERT INTO mortgage_facts VALUES ('Jones', 'Loan', 200000);
INSERT INTO mortgage_facts VALUES ('Jones', 'Annual_Interest', 12);
INSERT INTO mortgage_facts VALUES ('Jones', 'Payments', 180);
INSERT INTO mortgage_facts VALUES ('Jones', 'Payment', 0);
mortgage: Holds output information for the calculations. The columns are customer, payment number (pmt_num), principal applied in that payment (principalp), interest applied in that payment (interestp), and remaining loan balance (mort_balance). In order to upsert new cells into a partition, you need to have at least one row pre-existing per partition. Therefore, we seed the mortgage table with the values for the two customers before they have made any payments. This seed information could easily be generated using a SQL INSERT statement based on the mortgage_facts table, as sketched after the table creation below.
CREATE TABLE mortgage (customer VARCHAR2(20), pmt_num NUMBER(4),
  principalp NUMBER(10,2), interestp NUMBER(10,2), mort_balance NUMBER(10,2));

INSERT INTO mortgage VALUES ('Jones', 0, 0, 0, 200000);
INSERT INTO mortgage VALUES ('Smith', 0, 0, 0, 100000);
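As mentioned earlier, the seed rows could instead be derived from mortgage_facts. The following statement is only a sketch of one way to do that; it assumes nothing beyond each customer having a Loan fact, which is true of the sample data above:

INSERT INTO mortgage (customer, pmt_num, principalp, interestp, mort_balance)
  SELECT customer, 0, 0, 0, amount
  FROM mortgage_facts
  WHERE fact = 'Loan';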
The following SQL statement is complex; its major parts are explained in the numbered notes that follow it.
SELECT c, p, m, pp, ip
FROM mortgage
MODEL
  REFERENCE R ON
    (SELECT customer, fact, amt
     FROM mortgage_facts
     MODEL DIMENSION BY (customer, fact) MEASURES (amount amt)
       RULES
       (amt[any, 'PaymentAmt'] =
          (amt[CV(),'Loan'] *
           Power(1 + (amt[CV(),'Annual_Interest']/100/12), amt[CV(),'Payments']) *
           (amt[CV(),'Annual_Interest']/100/12)) /
          (Power(1 + (amt[CV(),'Annual_Interest']/100/12), amt[CV(),'Payments']) - 1)
       )
    )
  DIMENSION BY (customer cust, fact) MEASURES (amt)
  MAIN amortization
    PARTITION BY (customer c)
    DIMENSION BY (0 p)
    MEASURES (principalp pp, interestp ip, mort_balance m, customer mc)
    RULES
      ITERATE(1000) UNTIL (ITERATION_NUMBER+1 = r.amt[mc[0],'Payments'])
      (ip[ITERATION_NUMBER+1] = m[CV()-1] * r.amt[mc[0], 'Annual_Interest']/1200,
       pp[ITERATION_NUMBER+1] = r.amt[mc[0], 'PaymentAmt'] - ip[CV()],
       m[ITERATION_NUMBER+1]  = m[CV()-1] - pp[CV()]
      )
ORDER BY c, p;
The following notes explain the major parts of the statement in order: 1: This is the start of the main model definition.
2 through 4: These lines mark the start and end of the reference model labeled R. This model defines a SELECT statement that calculates the monthly payment amount for each customer's loan. The SELECT statement uses its own MODEL clause, starting at the line labeled 3, with a single rule that defines the amt value based on information from the mortgage_facts table. The measure returned by reference model R is amt, dimensioned by customer name cust and fact value fact as defined in the line labeled 4. The reference model is computed once and the values are then used in the main model for computing other calculations. Reference model R will return a row for each existing row of mortgage_facts, and it will return the newly calculated rows for each customer where the fact type is Payment and the amt is the monthly payment amount. If we wish to use a specific amount from the R output, we address it with the expression r.amt[<customer_name>,<fact_name>].

5: This is the continuation of the main model definition. We will partition the output by customer, aliased as c.

6: The main model is dimensioned with a constant value of 0, aliased as p. This represents the payment number of a row.

7: Four measures are defined: principalp (pp) is the principal amount applied to the loan in the month, interestp (ip) is the interest paid that month, mort_balance (m) is the remaining mortgage value after the payment of the loan, and customer (mc) is used to support the partitioning.

8: This begins the rules block. It will perform the rule calculations up to 1000 times. Because the calculations are performed once for each month for each customer, the maximum number of months that can be specified for a loan is 1000. Iteration is stopped when ITERATION_NUMBER+1 equals the number of payments derived from reference R. Note that the value from reference R is the amt (amount) measure defined in the reference clause. This reference value is addressed as r.amt[<customer_name>,<fact>]. The expression used in the iterate line, "r.amt[mc[0], 'Payments']", is resolved to be the amount from reference R, where the customer name is the value resolved by mc[0]. Since each partition contains only one customer, mc[0] can have only one value. Thus "r.amt[mc[0], 'Payments']" yields the reference clause's value for the number of payments for the current customer. This means that the rules will be performed as many times as there are payments for that customer.

9 through 11: The first two rules in this block use the same type of r.amt reference that was explained in 8. The difference is that the ip rule defines the fact value as Annual_Interest. Note that each rule refers to the value of one of the other measures. The expression used on the left side of each rule, "[ITERATION_NUMBER+1]", will create a new dimension value, so the measure will be upserted into the result set. Thus the result will include a monthly amortization row for all payments for each customer. The final line of the example sorts the results by customer and loan payment number.
23
OLAP and Data Mining
In large data warehouse environments, many different types of analysis can occur. In addition to SQL queries, you may also apply more advanced analytical operations to your data. Two major types of such analysis are OLAP (On-Line Analytic Processing) and data mining. Rather than having a separate OLAP or data mining engine, Oracle has integrated OLAP and data mining capabilities directly into the database server. Oracle OLAP and Oracle Data Mining (ODM) are options to the Oracle Database. This chapter provides a brief introduction to these technologies, and more detail can be found in these products' respective documentation. The following topics provide an introduction to Oracle's OLAP and data mining capabilities:
OLAP Overview
OLAP Overview
Oracle Database OLAP adds the query performance and calculation capability previously found only in multidimensional databases to Oracle's relational platform. In addition, it provides a Java OLAP API that is appropriate for the development of internet-ready analytical applications. Unlike other combinations of OLAP and RDBMS technology, Oracle Database OLAP is not a multidimensional database using bridges to move data from the relational data store to a multidimensional data store. Instead, it is truly an OLAP-enabled relational database. As a result, this release provides the benefits of a multidimensional database along with the scalability, accessibility, security, manageability, and high availability of the Oracle Database. The Java OLAP API, which is specifically designed for internet-based analytical applications, offers productive data access. See Oracle OLAP Application Developer's Guide for further information regarding OLAP.
Scalability
Oracle Database OLAP is highly scalable. In today's environment, there is tremendous growth along three dimensions of analytic applications: number of users, size of data, complexity of analyses. There are more users of analytical applications, and they need access to more data to perform more sophisticated analysis and target marketing. For example, a telephone company might want a customer dimension to include detail such as all telephone numbers as part of an application that is used to analyze customer turnover. This would require support for multi-million row dimension tables and very large volumes of fact data. Oracle Database can handle very large data sets using parallel execution and partitioning, as well as offering support for advanced hardware and clustering.
Availability
Oracle Database includes many features that support high availability. One of the most significant is partitioning, which allows management of precise subsets of tables and indexes, so that management operations affect only small pieces of these data structures. By partitioning tables and indexes, data management processing time is reduced, thus minimizing the time data is unavailable. Another feature supporting high availability is transportable tablespaces. With transportable tablespaces, large data sets, including tables and indexes, can be added with almost no processing to other databases. This enables extremely rapid data loading and updates.
Manageability
Oracle enables you to precisely control resource utilization. The Database Resource Manager, for example, provides a mechanism for allocating the resources of a data warehouse among different sets of end-users. Consider an environment where the marketing department and the sales department share an OLAP system. Using the Database Resource Manager, you could specify that the marketing department receive at least 60 percent of the CPU resources of the machines, while the sales department receive 40 percent of the CPU resources. You can also further specify limits on the total number of active sessions, and the degree of parallelism of individual queries for each department. Another resource management facility is the progress monitor, which gives end users and administrators the status of long-running operations. Oracle Database 10g maintains statistics describing the percent-complete of these operations. Oracle Enterprise Manager enables you to view a bar-graph display of these operations showing what percent complete they are. Moreover, any other tool or any database administrator can also retrieve progress information directly from the Oracle data server, using system views.
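The following PL/SQL sketch shows how such a 60/40 split could be expressed with the DBMS_RESOURCE_MANAGER package; the plan and consumer group names are hypothetical and the directive values are illustrative only:

BEGIN
  DBMS_RESOURCE_MANAGER.CREATE_PENDING_AREA;
  DBMS_RESOURCE_MANAGER.CREATE_PLAN(plan => 'DAYTIME_PLAN', comment => 'daytime CPU split');
  DBMS_RESOURCE_MANAGER.CREATE_CONSUMER_GROUP(consumer_group => 'MARKETING', comment => 'marketing users');
  DBMS_RESOURCE_MANAGER.CREATE_CONSUMER_GROUP(consumer_group => 'SALES', comment => 'sales users');
  -- Marketing gets 60% of level-1 CPU, a parallelism cap, and a session pool; Sales gets 40%.
  DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE(plan => 'DAYTIME_PLAN',
    group_or_subplan => 'MARKETING', comment => 'marketing directive',
    cpu_p1 => 60, parallel_degree_limit_p1 => 8, active_sess_pool_p1 => 20);
  DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE(plan => 'DAYTIME_PLAN',
    group_or_subplan => 'SALES', comment => 'sales directive',
    cpu_p1 => 40, parallel_degree_limit_p1 => 4, active_sess_pool_p1 => 20);
  -- A directive for OTHER_GROUPS is required before the plan can be validated.
  DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE(plan => 'DAYTIME_PLAN',
    group_or_subplan => 'OTHER_GROUPS', comment => 'everyone else', cpu_p2 => 100);
  DBMS_RESOURCE_MANAGER.VALIDATE_PENDING_AREA;
  DBMS_RESOURCE_MANAGER.SUBMIT_PENDING_AREA;
END;
/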
Details related to backup, restore, and recovery operations are maintained by the server in a recovery catalog and automatically used as part of these operations. This reduces administrative burden and minimizes the possibility of human errors.
Backup and recovery operations are fully integrated with partitioning. Individual partitions, when placed in their own tablespaces, can be backed up and restored independently of the other partitions of a table. Oracle includes support for incremental backup and recovery using Recovery Manager, enabling operations to be completed efficiently within times proportional to the amount of changes, rather than the overall size of the database. The backup and recovery technology is highly scalable, and provides tight interfaces to industry-leading media management subsystems. This provides for efficient operations that can scale up to handle very large volumes of data. Oracle also runs on open platforms, offering more hardware options and enterprise-level platforms.
Security
Just as the demands of real-world transaction processing required Oracle to develop robust features for scalability, manageability, and backup and recovery, they led Oracle to create industry-leading security features. The security features in Oracle have reached the highest levels of U.S. government certification for database trustworthiness. Oracle's fine-grained access control feature enables cell-level security for OLAP users. Fine-grained access control works with minimal burden on query processing, and it enables efficient centralized security management.
Which items is a person most likely to buy or like?
What is the likelihood that this product will be returned for repair?
What is the likelihood that this person poses a credit risk?
Oracle Data Mining enables data mining inside the database for performance and scalability. Some of the capabilities are:
Java and PL/SQL interfaces that provide programmatic control and application integration
Several algorithms:
Classification: Naive Bayes, Adaptive Bayes Network, Support Vector Machine
Regression: Support Vector Machine
Clustering: k-Means, O-Cluster
Association: Apriori
Attribute Importance: Predictor Variance
Feature Extraction: Non-Negative Matrix Factorization
Oracle Data Mining also supports sequence similarity search and annotation (BLAST) in the database.
All data preparation occurs in the database
The data that is mined remains in the database
The models produced by mining reside in the database
Scoring occurs in the database, with results immediately available as tables
Data Preparation
Data preparation usually requires the creation of new tables or views based on existing data. Both options perform faster than moving data to an external data mining utility and offer the programmer the option of snapshots or real-time updates. Oracle Data Mining provides utilities for complex, data mining-specific tasks. For example, for some types of models, binning improves model build time and model performance; therefore, ODM provides a utility for user-defined binning. ODM accepts data in either non-transactional (single-record case) format or transactional (multi-record case) format. ODM provides a pivoting utility for converting multiple non-transactional tables into a single transactional table. ODM data exploration and model evaluation features are extended by Oracle's statistical functions and OLAP capabilities. Because these also operate within the database, they can all be incorporated into a seamless application that shares database objects. This allows for more functional and faster applications.
Model Building
Oracle Data Mining supports all the major data mining functions: classification, regression, association rules, clustering, attribute importance, and feature extraction. These algorithms address a broad spectrum of business problems, ranging from predicting the future likelihood of a customer purchasing a given product to understanding which products are likely to be purchased together in a single trip to the grocery store. Since all model building takes place inside the database, the data
never needs to move outside the database, and therefore the entire data-mining process is accelerated.
Model Evaluation
Models are stored in the database and are directly accessible for evaluation, reporting, and further analysis by a wide variety of tools and application functions. ODM provides APIs for calculating confusion matrix and lift charts. ODM stores the models, the underlying data, and the results of model evaluation together in the database to enable further analysis, reporting, and application-specic model management.
oracle.dmt.odm.transformation.Transformation class. In addition, applications can embed external transformation details in the logical data specification as an input for the build operation; the system will then persist the details with the model and perform the embedded transformations in future apply and test operations. The ODM Java API design reflects concepts present in the emerging Java standard (JSR-73) for Data Mining, which is being developed through the Java Community Process.
DBMS_DATA_MINING
DBMS_DATA_MINING_TRANSFORM
DBMS_DATA_MINING provides support for in-database data mining. This package can be used to build and test models and to apply models to new data (scoring). The package provides the basic building blocks for data mining, along with utilities and functions to inspect models and their results. The package also supports export and import of native models from a user's schema or database instance. DBMS_DATA_MINING_TRANSFORM, a complementary package, provides support for popular data transformations such as numerical and categorical binning and linear and z-score normalization. The DBMS_DATA_MINING_TRANSFORM package is open source in nature, in that the package code is distributed with the product, so that users can study the utility routines and learn how to define their own data transformations using Oracle SQL and PL/SQL scripting.
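As a rough sketch of how a model might be built with this package, the following anonymous block calls DBMS_DATA_MINING.CREATE_MODEL; the model name, build table, and column names are hypothetical, and the optional settings table is omitted so that default algorithm settings apply:

BEGIN
  DBMS_DATA_MINING.CREATE_MODEL(
    model_name          => 'CHURN_MODEL',        -- hypothetical model name
    mining_function     => DBMS_DATA_MINING.CLASSIFICATION,
    data_table_name     => 'CUSTOMERS_BUILD',    -- hypothetical build table
    case_id_column_name => 'CUSTOMER_ID',
    target_column_name  => 'CHURNED');
END;
/

The resulting model can then be applied to new data with DBMS_DATA_MINING.APPLY, which writes its scores to a result table inside the database.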
24
Using Parallel Execution
This chapter covers tuning in a parallel execution environment and discusses:
Introduction to Parallel Execution Tuning
How Parallel Execution Works
Types of Parallelism
Initializing and Tuning Parameters for Parallel Execution
Tuning General Parameters for Parallel Execution
Monitoring and Diagnosing Parallel Execution Performance
Affinity and Parallel Operations
Miscellaneous Parallel Execution Tuning Tips
Note: Some features described in this chapter are available only if
you have purchased Oracle Database Enterprise Edition with the Oracle Real Application Clusters Option.
Queries requiring large table scans, joins, or partitioned index scans
Creation of large indexes
Creation of large tables (including materialized views)
Bulk inserts, updates, merges, and deletes
You can also use parallel execution to access object types within an Oracle database. For example, you can use parallel execution to access large objects (LOBs). Parallel execution benefits systems with all of the following characteristics:
Symmetric multiprocessors (SMPs), clusters, or massively parallel systems
Sufficient I/O bandwidth
Underutilized or intermittently used CPUs (for example, systems where CPU usage is typically less than 30%)
Sufficient memory to support additional memory-intensive processes, such as sorts, hashing, and I/O buffers
If your system lacks any of these characteristics, parallel execution might not significantly improve performance. In fact, parallel execution may reduce system performance on overutilized systems or systems with small I/O bandwidth.
Environments in which the typical query or transaction is very short (a few seconds or less). This includes most online transaction systems. Parallel execution is not useful in these environments because there is a cost associated with coordinating the parallel execution servers; for short transactions, the cost of this coordination may outweigh the benefits of parallelism.
Environments in which the CPU, memory, or I/O resources are already heavily utilized. Parallel execution is designed to exploit additional available hardware resources; if no such resources are available, then parallel execution will not yield any benefits and indeed may be detrimental to performance.
Access methods: Some examples are table scans, index full scans, and partitioned index range scans.
Join methods: Some examples are nested loop, sort merge, hash, and star transformation.
DDL statements: Some examples are CREATE TABLE AS SELECT, CREATE INDEX, REBUILD INDEX, REBUILD INDEX PARTITION, and MOVE/SPLIT/COALESCE PARTITION. You can normally use parallel DDL where you use regular DDL. There are, however, some additional details to consider when designing your database. One important restriction is that parallel DDL cannot be used on tables with object or LOB columns. All of these DDL operations can be performed in NOLOGGING mode for either parallel or serial execution. The CREATE TABLE statement for an index-organized table can be parallelized either with or without an AS SELECT clause. Different parallelism is used for different operations. Parallel create (partitioned) table as select and parallel create (partitioned) index run with a degree of parallelism equal to the number of partitions. Parallel operations require accurate statistics to perform optimally.
DML statements
Some examples are INSERT AS SELECT, updates, deletes, and MERGE operations. Parallel DML (parallel insert, update, merge, and delete) uses parallel execution mechanisms to speed up or scale up large DML operations against large database tables and indexes. You can also use INSERT ... SELECT statements to insert rows into multiple tables as part of a single DML statement. You can normally use parallel DML where you use regular DML. Although data manipulation language (DML) normally includes queries, the term parallel DML refers only to inserts, updates, upserts and deletes done in parallel.
Miscellaneous SQL operations: Some examples are GROUP BY, NOT IN, SELECT DISTINCT, UNION, UNION ALL, CUBE, and ROLLUP, as well as aggregate and table functions.
Parallel query: You can parallelize queries and subqueries in SELECT statements, as well as the query portions of DDL statements and DML statements (INSERT, UPDATE, DELETE, and MERGE).
SQL*Loader: You can parallelize the use of SQL*Loader, where large amounts of data are routinely encountered. To speed up your loads, you can use a parallel direct-path load as in the following example:
sqlldr USERID=SCOTT/TIGER CONTROL=LOAD1.CTL DIRECT=TRUE PARALLEL=TRUE
sqlldr USERID=SCOTT/TIGER CONTROL=LOAD2.CTL DIRECT=TRUE PARALLEL=TRUE
sqlldr USERID=SCOTT/TIGER CONTROL=LOAD3.CTL DIRECT=TRUE PARALLEL=TRUE
You can also use a parameter file to achieve the same thing. An important point to remember is that indexes are not maintained during a parallel load.
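For instance, the contents of a hypothetical parameter file named load1.par (the file name is an assumption for this sketch) could be:

USERID=SCOTT/TIGER
CONTROL=LOAD1.CTL
DIRECT=TRUE
PARALLEL=TRUE

The load would then be started with sqlldr PARFILE=load1.par, which is equivalent to the first command shown above.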
parallel execution coordinator or query coordinator. The query coordinator does the following:
Parses the query and determines the degree of parallelism
Allocates one or two sets of slaves (threads or processes)
Controls the query and sends instructions to the PQ slaves
Determines which tables or indexes need to be scanned by the PQ slaves
Produces the final output to the user
Degree of Parallelism
The parallel execution coordinator may enlist two or more of the instance's parallel execution servers to process a SQL statement. The number of parallel execution servers associated with a single operation is known as the degree of parallelism. A single operation is a part of a SQL statement, such as an ORDER BY operation or a full table scan to perform a join on a nonindexed column table. Note that the degree of parallelism applies directly only to intra-operation parallelism. If inter-operation parallelism is possible, the total number of parallel execution servers for a statement can be twice the specified degree of parallelism. No more than two sets of parallel execution servers can run simultaneously. Each set of parallel execution servers may process multiple operations. Only two sets of parallel execution servers need to be active to guarantee optimal inter-operation parallelism. Parallel execution is designed to effectively use multiple CPUs and disks to answer queries quickly. When multiple users use parallel execution at the same time, it is easy to quickly exhaust available CPU, memory, and disk resources. Oracle Database provides several ways to manage resource utilization in conjunction with parallel execution environments, including:
The adaptive multiuser algorithm, which is enabled by default, reduces the degree of parallelism as the load on the system increases.
User resource limits and profiles, which allow you to set limits on the amount of various system resources available to each user as part of a user's security domain.
The Database Resource Manager, which lets you allocate resources to different groups of users.
1. The SQL statement's foreground process becomes a parallel execution coordinator.
2. The parallel execution coordinator obtains as many parallel execution servers as needed (determined by the DOP) from the server pool or creates new parallel execution servers as needed.
3. Oracle executes the statement as a sequence of operations. Each operation is performed in parallel, if possible.
4. When statement processing is completed, the coordinator returns any resulting data to the user process that issued the statement and returns the parallel execution servers to the server pool.
The parallel execution coordinator calls upon the parallel execution servers during the execution of the SQL statement, not during the parsing of the statement. Therefore, when parallel execution is used with the shared server, the server process that processes the EXECUTE call of a user's statement becomes the parallel execution
coordinator for the statement. See "Setting the Degree of Parallelism for Parallel Execution" on page 24-32 for more information.
Each communication channel has at least one, and sometimes up to four, memory buffers. Multiple memory buffers facilitate asynchronous communication among the parallel execution servers. A single-instance environment uses at most three buffers for each communication channel. An Oracle Real Application Clusters environment uses at most four buffers for each channel. Figure 24-1 illustrates message buffers and how producer parallel execution servers connect to consumer parallel execution servers.
Figure 24-1 Parallel Execution Server Connections and Buffers
When a connection is between two processes on the same instance, the servers communicate by passing the buffers back and forth. When the connection is between processes in different instances, the messages are sent using external high-speed network protocols. In Figure 24-1, the DOP is equal to the number of parallel execution servers, which in this case is n. Figure 24-1 does not show the parallel execution coordinator. Each parallel execution server actually has an additional connection to the parallel execution coordinator.
After the optimizer determines the execution plan of a statement, the parallel execution coordinator determines the parallelization method for each operation in the plan. For example, the parallelization method might be to parallelize a full table scan by block range or parallelize an index range scan by partition. The coordinator must decide whether an operation can be performed in parallel and, if so, how many parallel execution servers to enlist. The number of parallel execution servers in one set is the DOP. See "Setting the Degree of Parallelism for Parallel Execution" on page 24-32 for more information.
Note that hints have been used in the query to force the join order and join method, and to specify the DOP of the tables employees and departments. In general, you should let the optimizer determine the order and method. Figure 24-2 illustrates the data flow graph or query plan for this query.
Figure 24-2 (query plan showing a GROUP BY SORT operation above a HASH JOIN over full scans of the employees and departments tables)
query that specifies the DOP. In other words, the DOP will be four because each set of parallel execution servers will have four processes. Slave set SS1 first scans the table employees, while SS2 will fetch rows from SS1 and build a hash table on the rows. In other words, the parent servers in SS2 and the child servers in SS1 work concurrently: one in scanning employees in parallel, the other in consuming rows sent to it from SS1 and building the hash table for the hash join in parallel. This is an example of inter-operation parallelism. After SS1 has finished scanning the entire table employees (that is, all granules or task units for employees are exhausted), it scans the table departments in parallel. It sends its rows to servers in SS2, which then perform the probes to finish the hash-join in parallel. After SS1 is done scanning the table departments in parallel and sending the rows to SS2, it switches to performing the GROUP BY in parallel. This is how two server sets run concurrently to achieve inter-operation parallelism across various operators in the query tree while achieving intra-operation parallelism in executing each operation in parallel. Another important aspect of parallel execution is the re-partitioning of rows as they are sent from servers in one server set to another. For the query plan in Figure 24-2, after a server process in SS1 scans a row of employees, which server process of SS2 should it send it to? The partitioning of rows flowing up the query tree is decided by the operator into which the rows are flowing. In this case, the partitioning of rows flowing up from SS1 performing the parallel scan of employees into SS2 performing the parallel hash-join is done by hash partitioning on the join column value. That is, a server process scanning employees computes a hash function of the value of the column employees.employee_id to decide the number of the server process in SS2 to send it to. The partitioning method used in parallel queries is explicitly shown in the EXPLAIN PLAN of the query. Note that the partitioning of rows being sent between sets of execution servers should not be confused with Oracle's partitioning feature whereby tables can be partitioned using hash, range, and other methods.
Producer Operations
Operations that require the output of other operations are known as consumer operations. In Figure 24-2, the GROUP BY SORT operation is the consumer of the HASH JOIN operation because GROUP BY SORT requires the HASH JOIN output. Consumer operations can begin consuming rows as soon as the producer operations have produced rows. In the previous example, while the parallel execution servers are producing rows in the FULL SCAN departments operation, another set of
parallel execution servers can begin to perform the HASH JOIN operation to consume the rows. Each of the two operations performed concurrently is given its own set of parallel execution servers. Therefore, both query operations and the data flow tree itself have parallelism. The parallelism of an individual operation is called intraoperation parallelism and the parallelism between operations in a data flow tree is called interoperation parallelism. Due to the producer-consumer nature of the Oracle server's operations, only two operations in a given tree need to be performed simultaneously to minimize execution time. To illustrate intraoperation and interoperation parallelism, consider the following statement:
SELECT * FROM employees ORDER BY last_name;
The execution plan implements a full scan of the employees table. This operation is followed by a sorting of the retrieved rows, based on the value of the last_name column. For the sake of this example, assume the last_name column is not indexed. Also assume that the DOP for the query is set to 4, which means that four parallel execution servers can be active for any given operation. Figure 24-3 illustrates the parallel execution of the example query.
Figure 24-3 Interoperation Parallelism and Dynamic Partitioning
(The figure shows four parallel execution servers performing the full table scan of employees and sending rows to four parallel execution servers performing the ORDER BY operation, each responsible for one last_name range: A-G, H-M, N-S, and T-Z. The parallel execution coordinator returns the results to the user process that issued SELECT * FROM employees ORDER BY last_name. Each operation exhibits intraoperation parallelism; the flow between the two server sets is interoperation parallelism.)
As you can see from Figure 24-3, there are actually eight parallel execution servers involved in the query even though the DOP is 4. This is because a parent and child operator can be performed at the same time (interoperation parallelism). Also note that all of the parallel execution servers involved in the scan operation send rows to the appropriate parallel execution server performing the SORT operation. If a row scanned by a parallel execution server contains a value for the last_name column between A and G, that row gets sent to the first ORDER BY parallel execution server. When the scan operation is complete, the sorting processes can return the sorted results to the coordinator, which, in turn, returns the complete query results to the user.
Types of Parallelism
The following types of parallelism are discussed in this section:
Parallel Query
Parallel DDL
Parallel DML
Parallel Execution of Functions
Other Types of Parallelism
Parallel Query
You can parallelize queries and subqueries in SELECT statements. You can also parallelize the query portions of DDL statements and DML statements (INSERT, UPDATE, and DELETE). You can also query external tables in parallel.
See Also:
"Operations That Can Be Parallelized" on page 24-3 for information on the query operations that Oracle can parallelize "Parallelizing SQL Statements" on page 24-8 for an explanation of how the processes perform parallel queries "Distributed Transaction Restrictions" on page 24-27 for examples of queries that reference a remote object "Rules for Parallelizing Queries" on page 24-37 for information on the conditions for parallelizing a query and the factors that determine the DOP
Parallel fast full scan of a nonpartitioned index-organized table
Parallel fast full scan of a partitioned index-organized table
Parallel index range scan of a partitioned index-organized table
These scan methods can be used for index-organized tables with overow areas and for index-organized tables that contain LOBs.
A PARALLEL hint (if present)
An ALTER SESSION FORCE PARALLEL QUERY statement
The parallel degree associated with the table, if the parallel degree is specified in the CREATE TABLE or ALTER TABLE statement
The allocation of work is done by dividing the index segment into a sufficiently large number of block ranges and then assigning the block ranges to parallel execution servers in a demand-driven manner. The overflow blocks corresponding to any row are accessed in a demand-driven manner only by the process which owns that row.
Methods on object types
Attribute access of object types
Constructors to create object type instances
Object views
PL/SQL and OCI queries for object types
There are no limitations on the size of the object types for parallel queries. The following restrictions apply to using parallel query for object types.
A MAP function is needed to parallelize queries involving joins and sorts (through ORDER BY, GROUP BY, or set operations). In the absence of a MAP function, the query will automatically be executed serially.
Parallel DML and parallel DDL are not supported with object types. DML and DDL statements are always performed serially.
In all cases where the query cannot execute in parallel because of any of these restrictions, the whole query executes serially without giving an error message.
Parallel DDL
This section includes the following topics on parallelism for DDL statements:
DDL Statements That Can Be Parallelized
CREATE TABLE ... AS SELECT in Parallel
Recoverability and Parallel DDL
Space Management for Parallel DDL
CREATE INDEX
CREATE TABLE ... AS SELECT
ALTER INDEX ... REBUILD
The parallel DDL statements for partitioned tables and indexes are:
CREATE INDEX
CREATE TABLE ... AS SELECT
ALTER TABLE ... [MOVE|SPLIT|COALESCE] PARTITION
ALTER INDEX ... [REBUILD|SPLIT] PARTITION
This statement can be executed in parallel only if the (global) index partition being split is usable.
All of these DDL operations can be performed in no-logging mode for either parallel or serial execution.
CREATE TABLE for an index-organized table can be parallelized either with or without an AS SELECT clause. Different parallelism is used for different operations (see Table 24-3 on page 24-44).
Parallel CREATE TABLE ... AS SELECT statements on partitioned tables and parallel CREATE INDEX statements on partitioned indexes execute with a DOP equal to the number of partitions.
Partition parallel analyze table is made less necessary by the ANALYZE {TABLE, INDEX} PARTITION statements, since parallel analyze of an entire partitioned table can be constructed with multiple user sessions.
Parallel DDL cannot occur on tables with object columns.
Parallel DDL cannot occur on non-partitioned tables with LOB columns.
If the unused space in each temporary segment is larger than the value of the MINIMUM EXTENT parameter set at the tablespace level, then Oracle trims the unused space when merging rows from all of the temporary segments into the table or index. The unused space is returned to the system free space and can be allocated for new extents, but it cannot be coalesced into a larger segment because it is not contiguous space (external fragmentation). If the unused space in each temporary segment is smaller than the value of the MINIMUM EXTENT parameter, then unused space cannot be trimmed when the rows in the temporary segments are merged. This unused space is not returned to the system free space; it becomes part of the table or index (internal fragmentation) and is available only for subsequent inserts or for updates that require additional space.
For example, if you specify a DOP of 3 for a CREATE TABLE ... AS SELECT statement, but there is only one datafile in the tablespace, then internal fragmentation may occur, as shown in Figure 24-5 on page 24-19. The pockets of
free space within the internal table extents of a datafile cannot be coalesced with other free space and cannot be allocated as extents. See Oracle Database Performance Tuning Guide for more information about creating tables and indexes in parallel.
Figure 24-5 Unusable Free Space (Internal Fragmentation)
(The figure depicts the datafile DATA1.ORA in the USERS tablespace, with unusable free space trapped inside the table extents created by the parallel execution servers.)
Parallel DML
Parallel DML (PARALLEL INSERT, UPDATE, DELETE, and MERGE) uses parallel execution mechanisms to speed up or scale up large DML operations against large database tables and indexes.
the term DML refers only to inserts, updates, merges, and deletes. This section discusses the following parallel DML topics:
Advantages of Parallel DML over Manual Parallelism
When to Use Parallel DML
Enabling Parallel DML
Transaction Restrictions for Parallel DML
Rollback Segments
Recovery for Parallel DML
Space Considerations for Parallel DML
Lock and Enqueue Resources for Parallel DML
Restrictions on Parallel DML
Issuing multiple INSERT statements to multiple instances of an Oracle Real Application Clusters to make use of free space from multiple free list blocks.
Issuing multiple UPDATE and DELETE statements with different key value ranges or rowid ranges.
It is difficult to use. You have to open multiple sessions (possibly on different instances) and issue multiple statements.
There is a lack of transactional properties. The DML statements are issued at different times; and, as a result, the changes are done with inconsistent snapshots of the database. To get atomicity, the commit or rollback of the various statements must be coordinated manually (maybe across instances).
The work division is complex. You may have to query the table in order to find out the rowid or key value ranges to correctly divide the work.
The calculation is complex. The calculation of the degree of parallelism can be complex.
There is a lack of affinity and resource information. You need to know affinity information to issue the right DML statement at the right instance when running an Oracle Real Application Clusters. You also have to find out about current resource usage to balance workload across instances.
Parallel DML removes these disadvantages by performing inserts, updates, and deletes in parallel automatically.
Refreshing Tables in a Data Warehouse System
Creating Intermediate Summary Tables
Using Scoring Tables
Updating Historical Tables
Running Batch Jobs
Refreshing Tables in a Data Warehouse System In a data warehouse system, large tables need to be refreshed (updated) periodically with new or modified data from the production system. You can do this efficiently by using parallel DML combined with updatable join views. You can also use the MERGE statement. The data that needs to be refreshed is generally loaded into a temporary table before starting the refresh process. This table contains either new rows or rows that have been updated since the last refresh of the data warehouse. You can use an updatable join view with parallel UPDATE to refresh the updated rows, and you can use an anti-hash join with parallel INSERT to refresh the new rows.
Creating Intermediate Summary Tables In a DSS environment, many applications require complex computations that involve constructing and manipulating many large intermediate summary tables. These summary tables are often temporary and frequently do not need to be logged. Parallel DML can speed up the operations against these large intermediate tables. One benefit is that you can put incremental results in the intermediate tables and perform parallel update. In addition, the summary tables may contain cumulative or comparison information which has to persist beyond application sessions; thus, temporary tables are not feasible. Parallel DML operations can speed up the changes to these large summary tables.

Using Scoring Tables Many DSS applications score customers periodically based on a set of criteria. The scores are usually stored in large DSS tables. The score information is then used in making a decision, for example, inclusion in a mailing list. This scoring activity queries and updates a large number of rows in the large table. Parallel DML can speed up the operations against these large tables.

Updating Historical Tables Historical tables describe the business transactions of an enterprise over a recent time interval. Periodically, the DBA deletes the set of oldest rows and inserts a set of new rows into the table. Parallel INSERT ... SELECT and parallel DELETE operations can speed up this rollover task. Although you can also use parallel direct loader (SQL*Loader) to insert bulk data from an external source, parallel INSERT ... SELECT is faster for inserting data that already exists in another table in the database. Dropping a partition can also be used to delete old rows. However, to do this, the table has to be partitioned by date and with the appropriate time interval.

Running Batch Jobs Batch jobs executed in an OLTP database during off hours have a fixed time window in which the jobs must complete. A good way to ensure timely job completion is to parallelize their operations. As the work load increases, more machine resources can be added; the scaleup property of parallel operations ensures that the time constraint can be met.
The default mode of a session is DISABLE PARALLEL DML. When parallel DML is disabled, no DML will be executed in parallel even if the PARALLEL hint is used. When parallel DML is enabled in a session, all DML statements in this session will be considered for parallel execution. However, even if parallel DML is enabled, the DML operation may still execute serially if there are no parallel hints or no tables with a parallel attribute or if restrictions on parallel operations are violated. The session's PARALLEL DML mode does not influence the parallelism of SELECT statements, DDL statements, and the query portions of DML statements. Thus, if this mode is not set, the DML operation is not parallelized, but scans or join operations within the DML statement may still be parallelized.
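As a simple sketch of how this looks in practice (the sales and new_sales table names are hypothetical), a session would typically enable parallel DML and then issue a hinted statement:

ALTER SESSION ENABLE PARALLEL DML;

INSERT /*+ PARALLEL(sales, 4) */ INTO sales
  SELECT /*+ PARALLEL(new_sales, 4) */ * FROM new_sales;

COMMIT;

Without the ALTER SESSION ENABLE PARALLEL DML statement, the insert itself would run serially even though the query portion could still be parallelized.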
See Also:
"Space Considerations for Parallel DML" on page 24-24 "Lock and Enqueue Resources for Parallel DML" on page 24-25 "Restrictions on Parallel DML" on page 24-25
Each parallel execution server creates a different parallel process transaction. If you use rollback segments instead of Automatic Undo Management, you may want to reduce contention on the rollback segments. To do this, only a few parallel process transactions should reside in the same rollback segment. See Oracle Database SQL Reference for more information.
The coordinator also has its own coordinator transaction, which can have its own rollback segment. In order to ensure user-level transactional atomicity, the coordinator uses a two-phase commit protocol to commit the changes performed by the parallel process transactions. A session that is enabled for parallel DML may put transactions in the session in a special mode: If any DML statement in a transaction modifies a table in parallel, no subsequent serial or parallel query or DML statement can access the same table again in that transaction. This means that the results of parallel modifications cannot be seen during the transaction. Serial or parallel statements that attempt to access a table that has already been modified in parallel within the same transaction are rejected with an error message.
If a PL/SQL procedure or block is executed in a parallel DML enabled session, then this rule applies to statements in the procedure or block.
Rollback Segments
If you use rollback segments instead of Automatic Undo Management, there are some restrictions when using parallel DML. See Oracle Database SQL Reference for information about restrictions for parallel DML and rollback segments.
Types of Parallelism
Intra-partition parallelism for UPDATE, MERGE, and DELETE operations requires that the COMPATIBLE initialization parameter be set to 9.2 or greater.
INSERT, UPDATE, MERGE, and DELETE operations on nonpartitioned tables are not parallelized if there is a bitmap index on the table. If the table is partitioned and there is a bitmap index on the table, the degree of parallelism will be restricted to at most the number of partitions accessed.
A transaction can contain multiple parallel DML statements that modify different tables, but after a parallel DML statement modifies a table, no subsequent serial or parallel statement (DML or query) can access the same table again in that transaction. This restriction also exists after a serial direct-path INSERT statement: no subsequent SQL statement (DML or query) can access the modified table during that transaction. Queries that access the same table are allowed before a parallel DML or direct-path INSERT statement, but not after. Any serial or parallel statements attempting to access a table that has already been modified by a parallel UPDATE, DELETE, or MERGE, or a direct-path INSERT during the same transaction are rejected with an error message.
Parallel DML operations cannot be done on tables with triggers.
Replication functionality is not supported for parallel DML.
Parallel DML cannot occur in the presence of certain constraints: self-referential integrity, delete cascade, and deferred integrity. In addition, for direct-path INSERT, there is no support for any referential integrity.
Parallel DML can be done on tables with object columns provided you are not touching the object columns.
Parallel DML can be done on tables with LOB columns provided the table is partitioned. However, intra-partition parallelism is not supported.
A transaction involved in a parallel DML operation cannot be or become a distributed transaction.
Clustered tables are not supported.
Violations of these restrictions cause the statement to execute serially without warnings or error messages (except for the restriction on statements accessing the same table in a transaction, which can cause error messages). For example, an update is serialized if it is on a nonpartitioned table.

Partitioning Key Restriction You can only update the partitioning key of a partitioned table to a new value if the update does not cause the row to move to a new partition. The update is possible if the table is defined with the row movement clause enabled (see the sketch following this section).

Function Restrictions The function restrictions for parallel DML are the same as those for parallel DDL and parallel query. See "Parallel Execution of Functions" on page 24-28 for more information.
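Here is a minimal sketch of the row movement clause on a hypothetical partitioned sales table; with row movement enabled, an update to the partitioning key that relocates a row to a different partition is allowed:

ALTER TABLE sales ENABLE ROW MOVEMENT;

UPDATE sales
  SET time_id = time_id + 30
  WHERE prod_id = 117;

The table and column names are assumptions for illustration; the point is only that ENABLE ROW MOVEMENT lifts the partitioning key restriction described above.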
Table 24-1  Referential Integrity Restrictions

DML Statement      Issued on Parent    Issued on Child     Self-Referential
INSERT             (Not applicable)    Not parallelized    Not parallelized
MERGE              (Not applicable)    Not parallelized    Not parallelized
UPDATE No Action   Supported           Supported           Not parallelized
DELETE No Action   Supported           Supported           Not parallelized
DELETE Cascade     Not parallelized    (Not applicable)    Not parallelized
Delete Cascade Delete on tables having a foreign key with delete cascade is not parallelized because parallel execution servers will try to delete rows from multiple partitions (parent and child tables).

Self-Referential Integrity DML on tables with self-referential integrity constraints is not parallelized if the referenced keys (primary keys) are involved. For DML on all other columns, parallelism is possible.

Deferrable Integrity Constraints If any deferrable constraints apply to the table being operated on, the DML operation will not be parallelized.
Trigger Restrictions
A DML operation will not be parallelized if the affected tables contain enabled triggers that may get fired as a result of the statement. This implies that DML statements on tables that are being replicated will not be parallelized. Relevant triggers must be disabled in order to parallelize DML on the table. Note that, if you enable or disable triggers, the dependent shared cursors are invalidated.
Example 24-1
The query operation is executed serially without notification because it references a remote object.
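The kind of statement being described resembles the following sketch (table and database link names are hypothetical):

INSERT /*+ PARALLEL(t3, 2) */ INTO t3
SELECT * FROM t4@remote_db;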
Example 24-2 Distributed Transaction Parallelization
The DELETE operation is not parallelized because it occurs in a distributed transaction (which is started by the SELECT statement).
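A sketch of this pattern (object and database link names are hypothetical):

ALTER SESSION ENABLE PARALLEL DML;
SELECT * FROM t1@remote_db;
DELETE /*+ PARALLEL(t2, 2) */ FROM t2;
COMMIT;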
Other Types of Parallelism
In addition to parallel SQL execution, Oracle can use parallelism for the following types of operations:
- Parallel recovery
- Parallel propagation (replication)
- Parallel load (the SQL*Loader utility)
Like parallel SQL, parallel recovery and propagation are performed by a parallel execution coordinator and multiple parallel execution servers. Parallel load, however, uses a different mechanism. The behavior of the parallel execution coordinator and parallel execution servers may differ, depending on what kind of operation they perform (SQL, recovery, or propagation). For example, if all parallel execution servers in the pool are occupied and the maximum number of parallel execution servers has been started:
- In parallel SQL, the parallel execution coordinator switches to serial processing.
- In parallel propagation, the parallel execution coordinator returns an error.
For a given session, the parallel execution coordinator coordinates only one kind of operation. A parallel execution coordinator cannot coordinate, for example, parallel SQL and parallel recovery or propagation at the same time.
See Also:
- Oracle Database Utilities for information about parallel load and SQL*Loader
- Oracle Database Backup and Recovery Basics for information about parallel media recovery
- Oracle Database Performance Tuning Guide for information about parallel instance recovery
- Oracle Database Advanced Replication for information about parallel propagation
- On systems where parallel execution will never be used, PARALLEL_MAX_SERVERS can be set to zero.
- On large systems with abundant SGA memory, PARALLEL_EXECUTION_MESSAGE_SIZE can be increased to improve throughput.
You can also manually tune parallel execution parameters; however, Oracle recommends using default settings for parallel execution. Manual tuning of parallel execution is more complex than using default settings for two reasons: manual parallel execution tuning requires more attentive administration than automated tuning, and manual tuning is prone to user-load and system-resource miscalculations. Initializing and tuning parallel execution involves the following steps:
- Using Default Parameter Settings
- Setting the Degree of Parallelism for Parallel Execution
- How Oracle Determines the Degree of Parallelism for Operations
- Balancing the Workload
- Parallelization Rules for SQL Statements
- Enabling Parallelism for Tables and Queries
- Degree of Parallelism and Adaptive Multiuser: How They Interact
- Forcing Parallel Execution for a Session
- Controlling Performance with the Degree of Parallelism
Note that some parameter settings constrain the resources available to parallel execution. For example, if you set PROCESSES to 20, you will not be able to get 25 slaves.
The DOP can be specified at the following levels:
- At the statement level, with hints and with the PARALLEL clause
- At the session level, by issuing the ALTER SESSION FORCE PARALLEL statement
- At the table level, in the table's definition
- At the index level, in the index's definition
The following example shows a statement that sets the DOP to 4 on a table:
ALTER TABLE orders PARALLEL 4;
Note that the DOP applies directly only to intraoperation parallelism. If interoperation parallelism is possible, the total number of parallel execution servers for a statement can be twice the specified DOP. No more than two operations can be performed simultaneously. Parallel execution is designed to effectively use multiple CPUs and disks to answer queries quickly. When multiple users employ parallel execution at the same time, available CPU, memory, and disk resources may be quickly exhausted. Oracle provides several ways to deal with resource utilization in conjunction with parallel execution, including:
- The adaptive multiuser algorithm, which reduces the DOP as the load on the system increases. By default, the adaptive multiuser algorithm is enabled, which optimizes the performance of systems with concurrent parallel SQL execution operations.
- User resource limits and profiles, which allow you to set limits on the amount of various system resources available to each user as part of a user's security domain.
- The Database Resource Manager, which enables you to allocate resources to different groups of users.
To determine the DOP for an operation, Oracle checks the following, in order:
1. Checks for hints or a PARALLEL clause specified in the SQL statement itself.
2. Checks for a session value set by the ALTER SESSION FORCE PARALLEL statement.
3. Looks at the table's or index's definition.
After a DOP is found in one of these specifications, it becomes the DOP for the operation. Hints, PARALLEL clauses, table or index definitions, and default values only determine the number of parallel execution servers that the coordinator requests for a given operation. The actual number of parallel execution servers used depends upon how many processes are available in the parallel execution server pool and whether interoperation parallelism is possible.
See Also:
- "The Parallel Execution Server Pool" on page 24-6
- "Parallelism Between Operations" on page 24-10
- "Default Degree of Parallelism" on page 24-34
- "Parallelization Rules for SQL Statements" on page 24-37
The PARALLEL hint is used only for operations on tables. You can use it to parallelize queries and DML statements (INSERT, UPDATE, MERGE, and DELETE). The PARALLEL_INDEX hint parallelizes an index range scan of a partitioned index. (In an index operation, the PARALLEL hint is not valid and is ignored.)
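For example, such hints can look like the following sketches (table name, index name, and degree are illustrative):

SELECT /*+ PARALLEL(sales, 4) */ COUNT(*) FROM sales;

SELECT /*+ PARALLEL_INDEX(sales, sales_time_idx, 4) */ COUNT(*)
FROM sales WHERE time_id > SYSDATE - 30;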
See Oracle Database Performance Tuning Guide for information about using hints in SQL statements and the specific syntax for the PARALLEL, NO_PARALLEL, PARALLEL_INDEX, CACHE, and NOCACHE hints.
The default DOP is determined by the following factors:
- The value of the parameter CPU_COUNT, which is, by default, the number of CPUs on the system, the number of RAC instances, and the value of the parameter PARALLEL_THREADS_PER_CPU.
- For parallelizing by partition, the number of partitions that will be accessed, based on partition pruning.
- For parallel DML operations with global index maintenance, the minimum number of transaction free lists among all the global indexes to be updated. The minimum number of transaction free lists for a partitioned global index is the minimum number across all index partitions. This is a requirement to prevent self-deadlock.
These factors determine the default number of parallel execution servers to use. However, the actual number of processes used is limited by their availability on the requested instances during run time. The initialization parameter PARALLEL_MAX_SERVERS sets an upper limit on the total number of parallel execution servers that an instance can have.
If a minimum fraction of the desired parallel execution servers is not available (specified by the initialization parameter PARALLEL_MIN_PERCENT), a user error is produced. You can retry the query when the system is less busy.
See Also: Oracle Real Application Clusters Deployment and Performance Guide for more information about instance groups.
Parallel query looks at each table and index, in the portion of the query being parallelized, to determine which is the reference table. The basic rule is to pick the table or index with the largest DOP. For parallel DML (INSERT, UPDATE, MERGE, and DELETE), the reference object that determines the DOP is the table being modied by an insert, update, or delete operation. Parallel DML also adds some limits to the DOP to prevent deadlock. If the parallel DML statement includes a subquery, the subquery's DOP is the same as the DML operation. For parallel DDL, the reference object that determines the DOP is the table, index, or partition being created, rebuilt, split, or moved. If the parallel DDL statement includes a subquery, the subquery's DOP is the same as the DDL operation.
A SELECT statement can be parallelized only if the following conditions are satisfied:
- The query includes a parallel hint specification (PARALLEL or PARALLEL_INDEX) or the schema objects referred to in the query have a PARALLEL declaration associated with them.
- At least one of the tables specified in the query requires one of the following: a full table scan, or an index range scan spanning multiple partitions.
Degree of Parallelism
The DOP for a query is determined by the following rules:
- The query uses the maximum DOP taken from all of the table declarations involved in the query and all of the potential indexes that are candidates to satisfy the query (the reference objects). That is, the table or index that has the greatest DOP determines the query's DOP (maximum query directive).
- If a table has both a parallel hint specification in the query and a parallel declaration in its table specification, the hint specification takes precedence over the parallel declaration specification. See Table 24-3 on page 24-44 for precedence rules.
You can parallelize UPDATE, MERGE, and DELETE operations in either of the following ways:
- Use a parallel clause in the definition of the table being updated or deleted (the reference object).
- Use an update, merge, or delete parallel hint in the statement.
Parallel hints are placed immediately after the UPDATE, MERGE, or DELETE keywords in UPDATE, MERGE, and DELETE statements. The hint also applies to the underlying scan of the table being changed. You can use the ALTER SESSION FORCE PARALLEL DML statement to override parallel clauses for subsequent UPDATE, MERGE, and DELETE statements in a session. Parallel hints in UPDATE, MERGE, and DELETE statements override the ALTER SESSION FORCE PARALLEL DML statement.

Decision to Parallelize
The following rule determines whether the UPDATE, MERGE, or DELETE operation should be parallelized: the UPDATE or DELETE operation is parallelized if and only if at least one of the following is true:
- The table being updated or deleted has a PARALLEL specification.
- The PARALLEL hint is specified in the DML statement.
- An ALTER SESSION FORCE PARALLEL DML statement has been issued previously during the session.
If the statement contains subqueries or updatable views, then they may have their own separate parallel hints or clauses. However, these parallel directives do not affect the decision to parallelize the UPDATE, MERGE, or DELETE. Although the parallel hint or clause on the tables is used by both the query and the UPDATE, MERGE, or DELETE portions to determine parallelism, the decision to parallelize the UPDATE, MERGE, or DELETE portion is made independently of the query portion, and vice versa.

Degree of Parallelism
The DOP is determined by the same rules as for the queries. Note that in the case of UPDATE and DELETE operations, only the target table to be modified (the only reference object) is involved. Thus, the UPDATE or DELETE parallel hint specification takes precedence over the parallel declaration specification of the target table. In other words, the precedence order is: MERGE, UPDATE, DELETE hint > Session > Parallel declaration specification of target table. See Table 24-3 on page 24-44 for precedence rules.

A parallel execution server can update or merge into, or delete from, multiple partitions, but each partition can only be updated or deleted by one parallel execution server. If the DOP is less than the number of partitions, then the first process to finish work on one partition continues working on another partition, and so on until the work is finished on all partitions. If the DOP is greater than the number of partitions involved in the operation, then the excess parallel execution servers have no work to do.
Example 24-4 Parallelization: Example 1
If tbl_1 is a partitioned table and its table definition has a parallel clause, then the update operation is parallelized even if the scan on the table is serial (such as an index scan), assuming that the table has more than one partition with c1 greater than 100.
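The statement referred to here is along the following lines (the SET clause is illustrative; only the predicate on c1 is given in the text):

UPDATE tbl_1 SET c1 = c1 + 1 WHERE c1 > 100;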
Example 24-5 Parallelization: Example 2
Both the scan and update operations on tbl_2 will be parallelized with degree four.
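The statement referred to here is along the following lines (the hint supplying the degree of four is an assumption):

UPDATE /*+ PARALLEL(tbl_2, 4) */ tbl_2 SET c1 = c1 + 1;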
For an INSERT ... SELECT statement, the relevant parallel directives are:
- SELECT parallel hints specified at the statement
- Parallel clauses specified in the definition of tables being selected
- INSERT parallel hint specified at the statement
- Parallel clause specified in the definition of tables being inserted into
You can use the ALTER SESSION FORCE PARALLEL DML statement to override parallel clauses for subsequent INSERT operations in a session. Parallel hints in insert operations override the ALTER SESSION FORCE PARALLEL DML statement. Decision to Parallelize The following rule determines whether the INSERT operation should be parallelized in an INSERT ... SELECT statement: The INSERT operation will be parallelized if and only if at least one of the following is true:
- The PARALLEL hint is specified after the INSERT in the DML statement.
- The table being inserted into (the reference object) has a PARALLEL declaration specification.
- An ALTER SESSION FORCE PARALLEL DML statement has been issued previously during the session.
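For example, a parallel INSERT ... SELECT might look like the following sketch (table names, the degree, and the predicate are hypothetical):

ALTER SESSION ENABLE PARALLEL DML;
INSERT /*+ PARALLEL(sales_hist, 4) */ INTO sales_hist
SELECT * FROM sales WHERE time_id < SYSDATE - 365;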
The decision to parallelize the INSERT operation is made independently of the SELECT operation, and vice versa.

Degree of Parallelism
Once the decision to parallelize the SELECT or INSERT operation is made, one parallel directive is picked for deciding the DOP of the whole statement, using the following precedence rule: Insert hint directive > Session > Parallel declaration specification of the inserting table > Maximum query directive. In this context, maximum query directive means that among multiple tables and indexes, the table or index that has the maximum DOP determines the parallelism for the query operation. The chosen parallel directive is applied to both the SELECT and INSERT operations.
Example 24-6 Parallelization: Example 3
Parallel CREATE INDEX or ALTER INDEX ... REBUILD
The CREATE INDEX and ALTER INDEX ... REBUILD statements can be parallelized only by a PARALLEL clause or an ALTER SESSION FORCE PARALLEL DDL statement. ALTER INDEX ... REBUILD can be parallelized only for a nonpartitioned index, but ALTER INDEX ... REBUILD PARTITION can be parallelized by a PARALLEL clause or an ALTER SESSION FORCE PARALLEL DDL statement. The scan operation for ALTER INDEX ... REBUILD (nonpartitioned), ALTER INDEX ... REBUILD PARTITION, and CREATE INDEX has the same parallelism as the REBUILD or CREATE operation and uses the same DOP. If the DOP is not specified for REBUILD or CREATE, the default is the number of CPUs.

Parallel MOVE PARTITION or SPLIT PARTITION
The ALTER INDEX ... MOVE PARTITION and ALTER INDEX ... SPLIT PARTITION statements can be parallelized only by a PARALLEL clause or an ALTER SESSION FORCE PARALLEL DDL statement. Their scan operations have the same parallelism as the corresponding MOVE or SPLIT operations. If the DOP is not specified, the default is the number of CPUs.
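For example, an index can be created or rebuilt in parallel with statements along these lines (index name, table, and degree are hypothetical):

CREATE INDEX ord_customer_ix ON orders (customer_id) PARALLEL 4;

ALTER INDEX ord_customer_ix REBUILD PARALLEL 4;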
Decision to Parallelize (Query Part)
The query part of a CREATE TABLE ... AS SELECT statement can be parallelized only if the following conditions are satisfied:
- The query includes a parallel hint specification (PARALLEL or PARALLEL_INDEX), or the CREATE part of the statement has a PARALLEL clause specification, or the schema objects referred to in the query have a PARALLEL declaration associated with them.
- At least one of the tables specified in the query requires one of the following: a full table scan or an index range scan spanning multiple partitions.
Degree of Parallelism (Query Part)
The DOP for the query part of a CREATE TABLE ... AS SELECT statement is determined by one of the following rules:
- The query part uses the values specified in the PARALLEL clause of the CREATE part.
- If the PARALLEL clause is not specified, the default DOP is the number of CPUs.
- If the CREATE is serial, then the DOP is determined by the query.
Note that any values specified in a hint for parallelism are ignored.

Decision to Parallelize (CREATE Part)
The CREATE operation of CREATE TABLE ... AS SELECT can be parallelized only by a PARALLEL clause or an ALTER SESSION FORCE PARALLEL DDL statement. When the CREATE operation of CREATE TABLE ... AS SELECT is parallelized, Oracle also parallelizes the scan operation if possible. The scan operation cannot be parallelized if, for example:
- The SELECT clause has a NO_PARALLEL hint
- The operation scans an index of a nonpartitioned table
When the CREATE operation is not parallelized, the SELECT can be parallelized if it has a PARALLEL hint or if the selected table (or partitioned index) has a parallel declaration.

Degree of Parallelism (CREATE Part)
The DOP for the CREATE operation, and for the SELECT operation if it is parallelized, is specified by the PARALLEL clause of the CREATE statement, unless it is overridden by an ALTER SESSION FORCE PARALLEL DDL statement. If the PARALLEL clause does not specify the DOP, the default is the number of CPUs.
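A minimal sketch of a parallel CREATE TABLE ... AS SELECT (table names, columns, and the degree are hypothetical):

CREATE TABLE sales_summary NOLOGGING PARALLEL 4 AS
SELECT prod_id, SUM(amount_sold) AS total_sold
FROM sales
GROUP BY prod_id;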
The priority (1) specification overrides priority (2) and priority (3). The priority (2) specification overrides priority (3).
Table 24-3  Parallelization Rules
Parallelized by clause, hint, or underlying table/index declaration (priority order: 1, 2, 3)

Parallel Operation | PARALLEL Hint | PARALLEL Clause | ALTER SESSION | Parallel Declaration
Parallel query table scan (partitioned or nonpartitioned table) | (1) PARALLEL | - | (2) FORCE PARALLEL QUERY | (3) of table
Parallel query index range scan (partitioned index) | (1) PARALLEL_INDEX | - | (2) FORCE PARALLEL QUERY | (2) of index
Parallel UPDATE or DELETE (partitioned table only) | (1) PARALLEL | - | (2) FORCE PARALLEL DML | (3) of table being updated or deleted from
INSERT operation of parallel INSERT ... SELECT (partitioned or nonpartitioned table) | (1) PARALLEL of insert | - | (2) FORCE PARALLEL DML | (3) of table being inserted into
SELECT operation of INSERT ... SELECT when INSERT is parallel | Takes degree from INSERT statement | - | - | -
SELECT operation of INSERT ... SELECT when INSERT is serial | (1) PARALLEL | - | - | (2) of table being selected from
CREATE operation of parallel CREATE TABLE ... AS SELECT (partitioned or nonpartitioned table) | (Note: hint in SELECT clause does not affect the CREATE operation) | (2) | (1) FORCE PARALLEL DDL | -
SELECT operation of CREATE TABLE ... AS SELECT when CREATE is parallel | Takes degree from CREATE statement | - | - | -
SELECT operation of CREATE TABLE ... AS SELECT when CREATE is serial | (1) PARALLEL or PARALLEL_INDEX | - | - | (2) of querying tables or partitioned indexes
Parallel CREATE INDEX (partitioned or nonpartitioned index) | - | (2) | (1) FORCE PARALLEL DDL | -
Parallel REBUILD INDEX (nonpartitioned index) | - | (2) | (1) FORCE PARALLEL DDL | -
REBUILD INDEX (partitioned index) is never parallelized | - | - | - | -
Parallel REBUILD INDEX partition | - | (2) | (1) FORCE PARALLEL DDL | -
Parallel MOVE or SPLIT partition | - | (2) | (1) FORCE PARALLEL DDL | -
Once Oracle determines the DOP for a query, the DOP does not change for the duration of the query. It is best to use the parallel adaptive multiuser feature when users process simultaneous parallel execution operations. By default, PARALLEL_ADAPTIVE_MULTI_USER is set to TRUE, which optimizes the performance of systems with concurrent parallel SQL execution operations. If PARALLEL_ADAPTIVE_MULTI_USER is set to FALSE, each parallel SQL execution operation receives the requested number of parallel execution server processes regardless of the impact to the performance of the system, as long as sufficient resources have been configured.
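You can force parallel execution for a session, for example (the explicit degree clause is optional and the value shown is an assumption):

ALTER SESSION FORCE PARALLEL QUERY PARALLEL 8;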
All subsequent queries will be executed in parallel provided no restrictions are violated. You can also force DML and DDL statements. This clause overrides any parallel clause specified in subsequent statements in the session, but is overridden by a parallel hint. In typical OLTP environments, for example, the tables are not set parallel, but nightly batch scripts may want to collect data from these tables in parallel. By setting the DOP in the session, the user avoids altering each table in parallel and then altering it back to serial when finished.
- Parameters Establishing Resource Limits for Parallel Operations
- Parameters Affecting Resource Consumption
- Parameters Related to I/O
PARALLEL_MAX_SERVERS
The PARALLEL_MAX_SERVERS parameter sets a resource limit on the maximum number of processes available for parallel execution. Most parallel operations need at most twice the number of query server processes as the maximum DOP attributed to any table in the operation. Oracle sets PARALLEL_MAX_SERVERS to a default value that is sufficient for most systems. The default value for PARALLEL_MAX_SERVERS is as follows:
(CPU_COUNT x PARALLEL_THREADS_PER_CPU x (2 if PGA_AGGREGATE_TARGET > 0; otherwise 1) x 5)
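For example, on a system with CPU_COUNT = 8, PARALLEL_THREADS_PER_CPU = 2, and PGA_AGGREGATE_TARGET set to a nonzero value (values are illustrative), the default would be 8 x 2 x 2 x 5 = 160 parallel execution servers.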
This might not be enough for parallel queries on tables with higher DOP attributes. Oracle recommends that users who expect to run queries of higher DOP set PARALLEL_MAX_SERVERS as follows:
2 x DOP x NUMBER_OF_CONCURRENT_USERS
For example, setting PARALLEL_MAX_SERVERS to 64 will allow you to run four parallel queries simultaneously, assuming that each query is using two slave sets with a DOP of eight for each set. If the hardware system is neither CPU bound nor I/O bound, then you can increase the number of concurrent parallel execution users on the system by adding more query server processes. When the system becomes CPU- or I/O-bound, however, adding more concurrent users becomes detrimental to the overall performance. Careful setting of PARALLEL_MAX_SERVERS is an effective method of restricting the number of concurrent parallel operations. If users initiate too many concurrent operations, Oracle might not have enough query server processes. In this case, Oracle executes the operations sequentially or displays an error if PARALLEL_MIN_PERCENT is set to a value other than the default value of 0 (zero). This condition can be verified through the GV$SYSSTAT view by comparing the statistics for parallel operations not downgraded and parallel operations downgraded to serial. For example:
SELECT * FROM GV$SYSSTAT WHERE name LIKE 'Parallel operation%';
When Users Have Too Many Processes
When concurrent users have too many query server processes, memory contention (paging), I/O contention, or excessive context switching can occur. This contention can reduce system throughput to a level lower than if parallel execution were not used. Increase the PARALLEL_MAX_SERVERS
value only if the system has sufficient memory and I/O bandwidth for the resulting load. You can use operating system performance monitoring tools to determine how much memory, swap space, and I/O bandwidth are free. Look at the run queue lengths for both your CPUs and disks, as well as the service time for I/Os on the system. Verify that sufficient swap space exists on the machine to add more processes. Limiting the total number of query server processes might restrict the number of concurrent users who can execute parallel operations, but system throughput tends to remain stable.
PARALLEL_MIN_SERVERS
The recommended value for the PARALLEL_MIN_SERVERS parameter is 0 (zero), which is the default. This parameter lets you specify in a single instance the number of processes to be started and reserved for parallel operations. The syntax is:
PARALLEL_MIN_SERVERS=n
The n variable is the number of processes you want to start and reserve for parallel operations. Setting PARALLEL_MIN_SERVERS balances the startup cost against memory usage. Processes started using PARALLEL_MIN_SERVERS do not exit until the database is shut down. This way, when a query is issued, the processes are likely to be available. It is desirable, however, to recycle query server processes periodically since the memory these processes use can become fragmented and cause the high water mark to slowly increase. When you do not set PARALLEL_MIN_SERVERS, processes exit after they are idle for five minutes.
SHARED_POOL_SIZE
Parallel execution requires memory resources in addition to those required by serial SQL execution. Additional memory is used for communication and passing data between query server processes and the query coordinator. Oracle Database allocates memory for query server processes from the shared pool. Tune the shared pool as follows:
- Allow for other clients of the shared pool, such as shared cursors and stored procedures.
- Remember that larger values improve performance in multiuser systems, but smaller values use less memory.
- Take into account that using parallel execution generates more cursors. Look at statistics in the V$SQLAREA view to determine how often Oracle recompiles cursors. If the cursor hit ratio is poor, increase the size of the pool; this happens only when you have a large number of distinct queries.
- Monitor the number of buffers used by parallel execution and compare the shared pool PX msg pool to the current high water mark reported in output from the view V$PX_PROCESS_SYSSTAT.
Note: If you do not have enough memory available, error message 12853 occurs (insufficient memory for PX buffers: current stringK, max needed stringK). This is caused by having insufficient SGA memory available for PX buffers. You need to reconfigure the SGA to have at least (MAX - CURRENT) bytes of additional memory. By default, Oracle allocates parallel execution buffers from the shared pool.
If the database cannot allocate enough memory at startup, reduce the value for SHARED_POOL_SIZE low enough so that your database starts. After reducing the value of SHARED_POOL_SIZE, you might see the error:
ORA-04031: unable to allocate 16084 bytes of shared memory ("SHARED pool","unknown object","SHARED pool heap","PX msg pool")
If so, execute the following query to determine why Oracle could not allocate the 16,084 bytes:
SELECT NAME, SUM(BYTES) FROM V$SGASTAT WHERE POOL='SHARED POOL' GROUP BY ROLLUP (NAME);
If you specify SHARED_POOL_SIZE and the amount of memory you need to reserve is bigger than the pool, Oracle does not allocate all the memory it can get. Instead, it leaves some space. When the query runs, Oracle tries to get what it needs. Oracle uses the 560 KB and needs another 16 KB when it fails. The error does not report the cumulative amount that is needed. The best way of determining how much more memory is needed is to use the formulas in "Adding Memory for Message Buffers" on page 24-52. To resolve the problem in the current example, increase the value for SHARED_POOL_SIZE. As shown in the sample output, the SHARED_POOL_SIZE is about 2 MB. Depending on the amount of memory available, you could increase the value of SHARED_POOL_SIZE to 4 MB and attempt to start your database. If Oracle continues to display an ORA-04031 message, gradually increase the value for SHARED_POOL_SIZE until startup is successful.
Adding Memory for Message Buffers
You must increase the value for the SHARED_POOL_SIZE parameter to accommodate message buffers. The message buffers allow query server processes to communicate with each other. Oracle uses a fixed number of buffers for each virtual connection between producer query servers and consumer query servers. Connections increase as the square of the DOP increases. For this reason, the maximum amount of memory used by parallel execution is bound by the highest DOP allowed on your system. You can control this value by using either the PARALLEL_MAX_SERVERS parameter or by using policies and profiles. To calculate the amount of memory required, use one of the following formulas:
Each instance uses the memory computed by the formula. The terms are:
- SIZE = PARALLEL_EXECUTION_MESSAGE_SIZE
- USERS = the number of concurrent parallel execution users that you expect to have running with the optimal DOP
- GROUPS = the number of query server process groups used for each query. A simple SQL statement requires only one group. However, if your queries involve subqueries which will be processed in parallel, then Oracle uses an additional group of query server processes.
- CONNECTIONS = (DOP^2 + 2 x DOP). If your system is a cluster or MPP, then you should account for the number of instances because this will increase the DOP. In other words, using a DOP of 4 on a two-instance cluster results in a DOP of 8. A value of PARALLEL_MAX_SERVERS times the number of instances divided by four is a conservative estimate to use as a starting point.
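For example, with a DOP of 8 (an illustrative value), CONNECTIONS = 8^2 + 2 x 8 = 80.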
Add this amount to your original setting for the shared pool. However, before setting a value for either of these memory structures, you must also consider additional memory for cursors, as explained in the following section. Calculating Additional Memory for Cursors Parallel execution plans consume more space in the SQL area than serial execution plans. You should regularly monitor shared pool resource use to ensure that the memory used by both messages and cursors can accommodate your system's processing requirements.
[Sample V$SGASTAT output listing shared pool allocations such as library cache, log_buffer, long op statistics array, sessions, sql area, and transactions, with the number of bytes allocated to each.]
Evaluate the memory used as shown in your output, and alter the setting for SHARED_POOL_SIZE based on your processing needs. To obtain more memory usage statistics, execute the following query:
SELECT * FROM V$PX_PROCESS_SYSSTAT WHERE STATISTIC LIKE 'Buffers%';
The amount of memory used appears in the Buffers Current and Buffers HWM statistics. Calculate a value in bytes by multiplying the number of buffers by the value for PARALLEL_EXECUTION_MESSAGE_SIZE. Compare the high water mark to the parallel execution message pool size to determine if you allocated too much memory. For example, in the first output, the value for large pool as shown in px msg pool is 38,092,812 or 38 MB. The Buffers HWM from the second output is 3,620, which when multiplied by a parallel execution message size of 4,096 is 14,827,520, or approximately 15 MB. In this case, the high water mark has reached approximately 40 percent of its capacity.
PARALLEL_MIN_PERCENT
The recommended value for the PARALLEL_MIN_PERCENT parameter is 0 (zero). This parameter enables users to wait for an acceptable DOP, depending on the application in use. Setting this parameter to values other than 0 (zero) causes Oracle to return an error when the requested DOP cannot be satisfied by the system at a given time. For example, if you set PARALLEL_MIN_PERCENT to 50, which translates to 50 percent, and the DOP is reduced by 50 percent or greater because of the adaptive algorithm or because of a resource limitation, then Oracle returns ORA-12827. For example:
SELECT /*+ PARALLEL(e, 8, 1) */ d.department_id, SUM(SALARY) FROM employees e, departments d WHERE e.department_id = d.department_id GROUP BY d.department_id ORDER BY d.department_id;
The parameters affecting memory and resource consumption for all parallel operations are:
- PGA_AGGREGATE_TARGET
- PARALLEL_EXECUTION_MESSAGE_SIZE
A second subset of parameters discussed in this section affects parallel DML and DDL. To control resource consumption, you should configure memory at two levels:
- At the Oracle level, so the system uses an appropriate amount of memory from the operating system.
- At the operating system level for consistency. On some platforms, you might need to set operating system parameters that control the total amount of virtual memory available, summed across all processes.
The SGA is typically part of real physical memory. The SGA is static and of fixed size; if you want to change its size, shut down the database, make the change, and restart the database. Oracle allocates the shared pool out of the SGA. A large percentage of the memory used in data warehousing operations is more dynamic. This memory comes from process memory (PGA), and both the size of
process memory and the number of processes can vary greatly. Use the PGA_AGGREGATE_TARGET parameter to control both the process memory and the number of processes.
PGA_AGGREGATE_TARGET
You can simplify and improve the way PGA memory is allocated by enabling automatic PGA memory management. In this mode, Oracle dynamically adjusts the size of the portion of the PGA memory dedicated to work areas, based on an overall PGA memory target explicitly set by the DBA. To enable automatic PGA memory management, you have to set the initialization parameter PGA_AGGREGATE_TARGET. See Oracle Database Performance Tuning Guide for descriptions of how to use PGA_AGGREGATE_TARGET in different scenarios.

HASH_AREA_SIZE
HASH_AREA_SIZE has been deprecated and you should use PGA_AGGREGATE_TARGET instead.

SORT_AREA_SIZE
SORT_AREA_SIZE has been deprecated and you should use PGA_AGGREGATE_TARGET instead.
PARALLEL_EXECUTION_MESSAGE_SIZE
The PARALLEL_EXECUTION_MESSAGE_SIZE parameter specifies the size of the buffer used for parallel execution messages. The default value is operating system specific, but is typically 2 KB. This value should be adequate for most applications; however, increasing this value can improve performance. Consider increasing this value if you have adequate free memory in the shared pool or if you have sufficient operating system memory and can increase your shared pool size to accommodate the additional amount of memory required.
Parameters Affecting Resource Consumption for Parallel DML and Parallel DDL
The parameters that affect parallel DML and parallel DDL resource consumption are:
- TRANSACTIONS
- FAST_START_PARALLEL_ROLLBACK
- LOG_BUFFER
- DML_LOCKS
- ENQUEUE_RESOURCES
Parallel inserts, updates, and deletes require more resources than serial DML operations. Similarly, PARALLEL CREATE TABLE ... AS SELECT and PARALLEL CREATE INDEX can require more resources. For this reason, you may need to increase the value of several additional initialization parameters. These parameters do not affect resources for queries.

TRANSACTIONS
For parallel DML and DDL, each query server process starts a transaction. The parallel coordinator uses the two-phase commit protocol to commit transactions; therefore, the number of transactions being processed increases by the DOP. As a result, you might need to increase the value of the TRANSACTIONS initialization parameter. The TRANSACTIONS parameter specifies the maximum number of concurrent transactions. The default assumes no parallelism. For example, if you have a DOP of 20, you will have 20 more new server transactions (or 40, if you have two server sets) and 1 coordinator transaction. In this case, you should increase TRANSACTIONS by 21 (or 41) if the transactions are running in the same instance. If you do not set this parameter, Oracle sets it to a value equal to 1.1 x SESSIONS. This discussion does not apply if you are using server-managed undo.

FAST_START_PARALLEL_ROLLBACK
If a system fails when there are uncommitted parallel DML or DDL transactions, you can speed up transaction recovery during startup by using the FAST_START_PARALLEL_ROLLBACK parameter. This parameter controls the DOP used when recovering terminated transactions. Terminated transactions are transactions that are active before a system failure. By default, the DOP is chosen to be at most two times the value of the CPU_COUNT parameter. If the default DOP is insufficient, set the parameter to HIGH. This gives a maximum DOP of at most four times the value of the CPU_COUNT parameter. This feature is available by default.

LOG_BUFFER
Check the statistic redo buffer allocation retries in the V$SYSSTAT view. If this value is high relative to redo blocks written, try to increase the LOG_BUFFER size. A common LOG_BUFFER size for a system generating numerous logs is 3 MB to 5 MB. If the number of retries is still high after increasing LOG_BUFFER size, a problem might exist with the disk on which the log files reside. In that case, tune the I/O subsystem to increase the I/O rates for redo. One way of doing this is to use fine-grained striping across multiple disks. For example, use a stripe size of 16 KB. A simpler approach is to isolate redo logs on their own disk.
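For instance, the two statistics can be compared with a query such as:

SELECT NAME, VALUE
FROM V$SYSSTAT
WHERE NAME IN ('redo buffer allocation retries', 'redo blocks written');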
DML_LOCKS
This parameter specifies the maximum number of DML locks. Its value should equal the total number of locks on all tables referenced by all users. A parallel DML operation's lock and enqueue resource requirement is very different from serial DML. Parallel DML holds many more locks, so you should increase the value of the ENQUEUE_RESOURCES and DML_LOCKS parameters by equal amounts. Table 24-4 shows the types of locks acquired by coordinator and parallel execution server processes for different types of parallel DML statements. Using this information, you can determine the value required for these parameters.
Table 24-4  Locks Acquired by Parallel DML Statements

Parallel UPDATE or DELETE into partitioned table; WHERE clause pruned to a subset of partitions or subpartitions
  Coordinator process acquires: 1 table lock SX; 1 partition lock X for each pruned (sub)partition.
  Each parallel execution server acquires: 1 table lock SX; 1 partition lock NULL for each pruned (sub)partition owned by the query server process; 1 partition-wait lock S for each pruned (sub)partition owned by the query server process.

Parallel row-migrating UPDATE into partitioned table; WHERE clause pruned to a subset of (sub)partitions
  Coordinator process acquires: 1 table lock SX; 1 partition lock X for each pruned (sub)partition; 1 partition lock SX for all other (sub)partitions.
  Each parallel execution server acquires: 1 table lock SX; 1 partition lock NULL for each pruned (sub)partition owned by the query server process; 1 partition-wait lock S for each pruned partition owned by the query server process; 1 partition lock SX for all other (sub)partitions.

Parallel UPDATE, MERGE, DELETE, or INSERT into partitioned table
  Coordinator process acquires: 1 table lock SX; partition locks X for all (sub)partitions.
  Each parallel execution server acquires: 1 table lock SX; 1 partition lock NULL for each (sub)partition; 1 partition-wait lock S for each (sub)partition.

Parallel INSERT into partitioned table; destination table with partition or subpartition clause
  Coordinator process acquires: 1 table lock SX; 1 partition lock X for each specified (sub)partition.
  Each parallel execution server acquires: 1 table lock SX; 1 partition lock NULL for each specified (sub)partition; 1 partition-wait lock S for each specified (sub)partition.

Parallel INSERT into nonpartitioned table
  Coordinator process acquires: 1 table lock X.
  Each parallel execution server acquires: None.
Note that table, partition, and partition-wait DML locks all appear as TM locks in the V$LOCK view.

Consider a table with 600 partitions running with a DOP of 100. Assume all partitions are involved in a parallel UPDATE or DELETE statement with no row-migrations.

The coordinator acquires:
- 1 table lock SX
- 600 partition locks X

In total, the parallel execution servers acquire:
- 100 table locks SX
- 600 partition locks NULL
- 600 partition-wait locks S
ENQUEUE_RESOURCES
This parameter sets the number of resources that can be locked by the lock manager. Parallel DML operations require many more resources than serial DML. Oracle allocates more enqueue resources as needed.
The parameters that affect I/O are:
- DB_CACHE_SIZE
- DB_BLOCK_SIZE
- DB_FILE_MULTIBLOCK_READ_COUNT
These parameters also affect the optimizer, which ensures optimal performance for parallel execution I/O operations.
DB_CACHE_SIZE
When you perform parallel updates, merges, and deletes, the buffer cache behavior is very similar to any OLTP system running a high volume of updates.
DB_BLOCK_SIZE
The recommended value for this parameter is 8 KB or 16 KB. Set the database block size when you create the database. If you are creating a new database, use a large block size such as 8 KB or 16 KB.
DB_FILE_MULTIBLOCK_READ_COUNT
The recommended value for this parameter is eight for 8 KB block size, or four for 16 KB block size. The default is 8. This parameter determines how many database blocks are read with a single operating system READ call. The upper limit for this parameter is platform-dependent. If you set DB_FILE_MULTIBLOCK_READ_COUNT to an excessively high value, your operating system will lower the value to the highest allowable level when you start your database. In this case, each platform uses the highest value possible. Maximum values generally range from 64 KB to 1 MB.
Figure 24-6 Asynchronous Read
[Figure contrasting synchronous and asynchronous reads: with synchronous I/O, block #2 is read only after block #1 has been processed; with asynchronous I/O, block #2 is read while block #1 is still being processed.]
Asynchronous operations are currently supported for parallel table scans, hash joins, sorts, and serial table scans. However, this feature can require operating system-specific configuration and may not be supported on all platforms. Check your Oracle operating system-specific documentation.
To diagnose parallel execution performance problems, follow these steps:
- Quantify your performance expectations to determine whether there is a problem.
- Determine whether a problem pertains to optimization, such as inefficient plans that might require reanalyzing tables or adding hints, or whether the problem pertains to execution, such as simple operations like scanning, loading, grouping, or indexing running much slower than published guidelines.
- Determine whether the problem occurs when running in parallel, such as load imbalance or resource bottlenecks, or whether the problem is also present for serial operations.
Performance expectations are based on either prior performance metrics (for example, the length of time a given query took last week or on the previous version of Oracle) or scaling and extrapolating from serial execution times (for example, serial execution took 10 minutes while parallel execution took 5 minutes). If the performance does not meet your expectations, consider the following questions:
Did the execution plan change? If so, you should gather statistics and decide whether to use index-only access and a CREATE TABLE AS SELECT statement. You should use index hints if your system is CPU-bound. You should also study the EXPLAIN PLAN output.
Did the data set change? If so, you should gather statistics to evaluate any differences.
Is the hardware overtaxed? If so, you should check CPU, I/O, and swap memory.
After setting your basic goals and answering these questions, you need to consider the following topics:
- Is There Regression?
- Is There a Plan Change?
- Is There a Parallel Plan?
- Is There a Serial Plan?
- Is There Parallel Execution?
- Is the Workload Evenly Distributed?
Is There Regression?
Does parallel execution's actual performance deviate from what you expected? If performance is as you expected, could there be an underlying performance problem? Perhaps you have a desired outcome in mind to which you are comparing the current outcome. Perhaps you have justifiable performance expectations that the system does not achieve. You might have achieved this level of performance or a particular execution plan in the past, but now, with a similar environment and operation, the system is not meeting this goal. If performance is not as you expected, can you quantify the deviation? For data warehousing operations, the execution plan is key. For critical data warehousing operations, save the EXPLAIN PLAN results. Then, as you analyze and reanalyze the data, upgrade Oracle, and load new data, over time you can compare new execution plans with old plans. Take this approach either proactively or reactively. Alternatively, you might find that plan performance improves if you use hints. You might want to understand why hints are necessary and determine how to get the
optimizer to generate the desired plan without hints. Try increasing the statistical sample size: better statistics can give you a better plan. See Oracle Database Performance Tuning Guide for information on preserving plans throughout changes to your system, using plan stability and outlines.
If you suspect a plan problem, consider the following techniques:
- Use an index. Sometimes adding an index can greatly improve performance. Consider adding an extra column to the index. Perhaps your operation could obtain all its data from the index, and not require a table scan. Perhaps you need to use hints in a few cases. Verify that the hint provides better results.
- Compute statistics. If you do not analyze often and you can spare the time, it is a good practice to compute statistics. This is particularly important if you are performing many joins, and it will result in better plans. Alternatively, you can estimate statistics. If you use different sample sizes, the plan may change. Generally, the higher the sample size, the better the plan.
- Use histograms for nonuniform distributions.
- Check initialization parameters to be sure the values are reasonable.
- Replace bind variables with literals unless CURSOR_SHARING is set to force or similar.
- Determine whether execution is I/O- or CPU-bound. Then check the optimizer cost model.
- Convert subqueries to joins.
Use the CREATE TABLE ... AS SELECT statement to break a complex operation into smaller pieces. With a large query referencing five or six tables, it may be difficult to determine which part of the query is taking the most time. You can isolate bottlenecks in the query by breaking it into steps and analyzing each step.
- Is there device contention?
- Is there controller contention?
- Is the system I/O-bound with too little parallelism? If so, consider increasing parallelism up to the number of devices.
- Is the system CPU-bound with too much parallelism? Check the operating system CPU monitor to see whether a lot of time is being spent in system calls. The resource might be overcommitted, and too much parallelism might cause processes to compete with themselves.
- Are there more concurrent users than the system can support?
V$PX_BUFFER_ADVICE
The V$PX_BUFFER_ADVICE view provides statistics on historical and projected maximum buffer usage by all parallel queries. You can consult this view to reconfigure SGA size in response to insufficient memory problems for parallel queries.
V$PX_SESSION
The V$PX_SESSION view shows data about query server sessions, groups, sets, and server numbers. It also displays real-time data about the processes working on behalf of parallel execution. This table includes information about the requested DOP and the actual DOP granted to the operation.
V$PX_SESSTAT
The V$PX_SESSTAT view provides a join of the session information from V$PX_ SESSION and the V$SESSTAT table. Thus, all session statistics available to a normal session are available for all sessions performed using parallel execution.
V$PX_PROCESS
The V$PX_PROCESS view contains information about the parallel processes, including status, session ID, process ID, and other information.
V$PX_PROCESS_SYSSTAT
The V$PX_PROCESS_SYSSTAT view shows the status of query servers and provides buffer allocation statistics.
V$PQ_SESSTAT
The V$PQ_SESSTAT view shows the status of all current server groups in the system such as data about how queries allocate processes and how the multiuser and load balancing algorithms are affecting the default and hinted values. V$PQ_SESSTAT will be obsolete in a future release. You might need to adjust some parameter settings to improve performance after reviewing data from these views. In this case, refer to the discussion of "Tuning General Parameters for Parallel Execution" on page 24-47. Query these views periodically to monitor the progress of long-running parallel operations. For many dynamic performance views, you must set the parameter TIMED_STATISTICS to TRUE in order for Oracle to collect statistics for each view. You can use the ALTER SYSTEM or ALTER SESSION statements to turn TIMED_STATISTICS on and off.
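For example (the system-wide form is shown; a session-level ALTER SESSION works the same way):

ALTER SYSTEM SET TIMED_STATISTICS = TRUE;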
V$FILESTAT
The V$FILESTAT view sums read and write requests, the number of blocks, and service times for every datafile in every tablespace. Use V$FILESTAT to diagnose I/O and workload distribution problems. You can join statistics from V$FILESTAT with statistics in the DBA_DATA_FILES view to group I/O by tablespace or to find the filename for a given file number. Using a ratio analysis, you can determine the percentage of the total tablespace activity used by each file in the tablespace. If you make a practice of putting just one large, heavily accessed object in a tablespace, you can use this technique to identify objects that have a poor physical layout. You can further diagnose disk space allocation problems using the DBA_EXTENTS view. Ensure that space is allocated evenly from all files in the tablespace. Monitoring V$FILESTAT during a long-running operation and then correlating I/O activity to the EXPLAIN PLAN output is a good way to follow progress.
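A sketch of grouping I/O statistics by tablespace with such a join (the column choice is illustrative):

SELECT df.TABLESPACE_NAME, SUM(fs.PHYRDS) AS reads, SUM(fs.PHYWRTS) AS writes
FROM V$FILESTAT fs, DBA_DATA_FILES df
WHERE fs.FILE# = df.FILE_ID
GROUP BY df.TABLESPACE_NAME;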
V$PARAMETER
The V$PARAMETER view lists the name, current value, and default value of all system parameters. In addition, the view shows whether a parameter is a session
parameter that you can modify online with an ALTER SYSTEM or ALTER SESSION statement.
V$PQ_TQSTAT
As a simple example, consider a hash join between two tables, with a join on a column with only two distinct values. At best, this hash function will direct the rows for one value to parallel execution server A and the rows for the other value to parallel execution server B. A DOP of two is fine, but, if it is four, then at least two parallel execution servers have no work. To discover this type of skew, use a query similar to the following example:
SELECT dfo_number, tq_id, server_type, process, num_rows FROM V$PQ_TQSTAT ORDER BY dfo_number DESC, tq_id, server_type, process;
The best way to resolve this problem might be to choose a different join method; a nested loop join might be the best option. Alternatively, if one of the join tables is small relative to the other, a BROADCAST distribution method can be hinted using the PQ_DISTRIBUTE hint. Note that the optimizer considers the BROADCAST distribution method, but requires OPTIMIZER_FEATURES_ENABLE set to 9.0.2 or higher. Now, assume that you have a join key with high cardinality, but one of the values contains most of the data, for example, lava lamp sales by year. The only year that had big sales was 1968, and thus, the parallel execution server for the 1968 records will be overwhelmed. You should use the same corrective actions as described previously. The V$PQ_TQSTAT view provides a detailed report of message traffic at the table queue level. V$PQ_TQSTAT data is valid only when queried from a session that is executing parallel SQL statements. A table queue is the pipeline between query server groups, between the parallel coordinator and a query server group, or between a query server group and the coordinator. The table queues are represented explicitly in the operation column by PX SEND <partitioning type> (for example, PX SEND HASH) and PX RECEIVE. For backward compatibility, the row labels of PARALLEL_TO_PARALLEL, SERIAL_TO_PARALLEL, or PARALLEL_TO_SERIAL will continue to have the same semantics as previous releases and can be used as before to infer the table queue allocation. In addition, the top of the parallel plan is marked by a new node with operation PX COORDINATOR. V$PQ_TQSTAT has a row for each query server process that reads from or writes to in each table queue. A table queue connecting 10 consumer processes to 10 producer processes has 20 rows in the view. Sum the bytes column and group by TQ_ID, the table queue identifier, to obtain the total number of bytes sent through each table
queue. Compare this with the optimizer estimates; large variations might indicate a need to analyze the data using a larger sample. Compute the variance of bytes grouped by TQ_ID. Large variances indicate workload imbalances. You should investigate large variances to determine whether the producers start out with unequal distributions of data, or whether the distribution itself is skewed. If the data itself is skewed, this might indicate a low cardinality, or low number of distinct values. Note that the V$PQ_TQSTAT view will be renamed in a future release to V$PX_TQSTAT.
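For example, the per-queue totals described here can be obtained with a query such as:

SELECT TQ_ID, SERVER_TYPE, SUM(BYTES) AS total_bytes, SUM(NUM_ROWS) AS total_rows
FROM V$PQ_TQSTAT
GROUP BY TQ_ID, SERVER_TYPE
ORDER BY TQ_ID, SERVER_TYPE;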
For a single instance, use SELECT FROM V$PX_SESSION and do not include the column name Instance ID. The processes shown in the output from the previous example using GV$PX_SESSION collaborate to complete the same task. The next example shows the execution of a join query to determine the progress of these processes in terms of physical reads. Use this query to track any specic statistic:
SELECT QCSID, SID, INST_ID "Inst", SERVER_GROUP "Group", SERVER_SET "Set",
       NAME "Stat Name", VALUE
FROM GV$PX_SESSTAT A, V$STATNAME B
WHERE A.STATISTIC# = B.STATISTIC#
AND NAME LIKE 'physical reads'
AND VALUE > 0
ORDER BY QCSID, QCINST_ID, SERVER_GROUP, SERVER_SET;
Use the previous type of query to track statistics in V$STATNAME. Repeat this query as often as required to observe the progress of the query server processes. The next query uses V$PX_PROCESS to check the status of the query servers.
SELECT * FROM V$PX_PROCESS;
The following query shows the current wait state of each slave and QC process on the system:
SELECT px.SID "SID", p.PID, p.SPID "SPID", px.INST_ID "Inst",
       px.SERVER_GROUP "Group", px.SERVER_SET "Set",
       px.DEGREE "Degree", px.REQ_DEGREE "Req Degree", w.event "Wait Event"
FROM GV$SESSION s, GV$PX_SESSION px, GV$PROCESS p, GV$SESSION_WAIT w
WHERE s.sid (+) = px.sid AND s.inst_id (+) = px.inst_id
AND s.sid = w.sid (+) AND s.inst_id = w.inst_id (+)
AND s.paddr = p.addr (+) AND s.inst_id = p.inst_id (+)
ORDER BY DECODE(px.QCINST_ID, NULL, px.INST_ID, px.QCINST_ID), px.QCSID,
         DECODE(px.SERVER_GROUP, NULL, 0, px.SERVER_GROUP), px.SERVER_SET, px.INST_ID;
Oracle considers affinity when allocating work to parallel execution servers. The use of affinity for parallel execution of SQL statements is transparent to users.
For certain MPP architectures, Oracle uses device-to-node affinity information to determine on which nodes to spawn parallel execution servers (parallel process allocation) and which work granules (rowid ranges or partitions) to send to particular nodes (work assignment). Better performance is achieved by having nodes mainly access local devices, giving a better buffer cache hit ratio for every node and reducing the network overhead and I/O latency. For SMP, cluster, and MPP architectures, process-to-device affinity is used to achieve device isolation. This reduces the chances of having multiple parallel execution servers accessing the same device simultaneously. This process-to-device affinity information is also used in implementing stealing between processes. For partitioned tables and indexes, partition-to-node affinity information determines process allocation and work assignment. For shared-nothing MPP systems, Oracle Real Application Clusters tries to assign partitions to instances, taking the disk affinity of the partitions into account. For shared-disk MPP and cluster systems, partitions are assigned to instances in a round-robin manner. Affinity is only available for parallel DML when running in an Oracle Real Application Clusters configuration. Affinity information which persists across statements improves buffer cache hit ratios and reduces block pings between instances.
- Setting Buffer Cache Size for Parallel Operations
- Overriding the Default Degree of Parallelism
- Rewriting SQL Statements
- Creating and Populating Tables in Parallel
- Creating Temporary Tablespaces for Parallel Sort and Hash Join
- Executing Parallel SQL Statements
- Using EXPLAIN PLAN to Show Parallel Operations Plans
- Additional Considerations for Parallel DML
- Creating Indexes in Parallel
- Parallel DML Tips
- Incremental Data Loading in Parallel
- Using Hints with Query Optimization
- FIRST_ROWS(n) Hint
- Enabling Dynamic Sampling
- Modify the default DOP by changing the value for the PARALLEL_THREADS_PER_CPU parameter.
- Adjust the DOP either by using ALTER TABLE, ALTER SESSION, or by using hints.
- To increase the number of concurrent parallel operations, reduce the DOP, or set the parameter PARALLEL_ADAPTIVE_MULTI_USER to TRUE.
You can also use the utlxplp.sql script to present the EXPLAIN PLAN output with all relevant parallel information. You can increase the optimizer's ability to generate parallel plans by converting subqueries, especially correlated subqueries, into joins. Oracle can parallelize joins more efficiently than subqueries. This also applies to updates. See "Updating the Table in Parallel" on page 24-86 for more information.
These tables can also be incrementally loaded with parallel INSERT. You can take advantage of intermediate tables using the following techniques:
- Common subqueries can be computed once and referenced many times. This can allow some queries against star schemas (in particular, queries without selective WHERE-clause predicates) to be parallelized more effectively. Note that star queries with selective WHERE-clause predicates using the star-transformation technique can be effectively parallelized automatically without any modification to the SQL.
- Decompose complex queries into simpler steps in order to provide application-level checkpoint or restart. For example, a complex multitable join on a database 1 terabyte in size could run for dozens of hours. A failure during this query would mean starting over from the beginning. Using CREATE TABLE ... AS SELECT or PARALLEL INSERT AS SELECT, you can rewrite the query as a sequence of simpler queries that run for a few hours each. If a system failure occurs, the query can be restarted from the last completed step.
- Implement manual parallel deletes efficiently by creating a new table that omits the unwanted rows from the original table, and then dropping the original table. Alternatively, you can use the convenient parallel delete feature, which directly deletes rows from the original table.
- Create summary tables for efficient multidimensional drill-down analysis (see the sketch after this list). For example, a summary table might store the sum of revenue grouped by month, brand, region, and salesman.
- Reorganize tables, eliminating chained rows, compressing free space, and so on, by copying the old table to a new table. This is much faster than export/import and easier than reloading.
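A sketch of such an intermediate summary table, created and populated in parallel (table and column names are hypothetical), might be:

CREATE TABLE sales_by_month_brand
  PARALLEL NOLOGGING
  AS
  SELECT sales_month, brand_id, SUM(revenue) AS total_revenue
  FROM   detailed_sales
  GROUP BY sales_month, brand_id;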
Be sure to use the DBMS_STATS package on newly created tables. Also consider creating indexes. To avoid I/O bottlenecks, specify a tablespace with at least as many devices as CPUs. To avoid fragmentation in allocating space, the number of files in a tablespace should be a multiple of the number of CPUs. See Chapter 4, "Hardware and I/O Considerations in Data Warehouses", for more information about bottlenecks.
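For example, a call of the following form gathers statistics on a newly created table in parallel (the schema and table names are illustrative):

BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname => 'DWH',                  -- hypothetical schema
    tabname => 'SALES_BY_MONTH_BRAND', -- hypothetical table
    degree  => 8,                      -- gather statistics in parallel
    cascade => TRUE);                  -- include statistics on any indexes
END;
/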
You can assign a default temporary tablespace to the database by issuing a statement such as:

ALTER DATABASE DEFAULT TEMPORARY TABLESPACE TStemp;
Temporary extents should typically be in the range of 1 MB to 10 MB. Once you allocate an extent, it is available for the duration of an operation. If you allocate a large extent but only need to use a small amount of space, the unused space in the extent is unavailable. At the same time, temporary extents should be large enough that processes do not have to wait for space. Temporary tablespaces use less overhead than permanent tablespaces when allocating and freeing a new extent. However, obtaining a new temporary extent still requires the overhead of acquiring a latch and searching through the SGA structures, as well as SGA space consumption for the sort extent pool. See Oracle Database Performance Tuning Guide for information regarding locally managed temporary tablespaces.
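A locally managed temporary tablespace with uniform extents in this range might be created as follows (the file name and sizes are illustrative):

CREATE TEMPORARY TABLESPACE TStemp
  TEMPFILE '/u02/oradata/dwh/tstemp01.tmp' SIZE 20G
  EXTENT MANAGEMENT LOCAL UNIFORM SIZE 5M;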
- Verify optimizer selectivity estimates. If the optimizer thinks that only one row will be produced from a query, it tends to favor using a nested loop join. This could be an indication that the tables are not analyzed or that the optimizer has made an incorrect estimate about the correlation of multiple predicates on the same table. A hint may be required to force the optimizer to use another join method. Consequently, if the plan says only one row is produced from any particular stage and this is incorrect, consider using hints or gathering statistics.
- Be careful when using hash joins on low cardinality join keys. If a join key has few distinct values, then a hash join may not be optimal. If the number of distinct values is less than the DOP, then some parallel query servers may be unable to work on the particular query.
- Consider data skew. If a join key involves excessive data skew, a hash join may require some parallel query servers to work more than others. Consider using a hint to cause a BROADCAST distribution method if the optimizer did not choose it (see the sketch following this list). Note that the optimizer considers the BROADCAST distribution method only if the OPTIMIZER_FEATURES_ENABLE parameter is set to 9.0.2 or higher. See "V$PQ_TQSTAT" on page 24-67 for further details.
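For example, with the sample sh schema, a hint combination of the following form asks that the small channels table be broadcast to every parallel execution server rather than redistributing the large sales table (this is one reasonable formulation, not the only one):

SELECT /*+ ORDERED USE_HASH(s)
           PQ_DISTRIBUTE(s BROADCAST, NONE)
           PARALLEL(c) PARALLEL(s) */
       c.channel_desc, SUM(s.amount_sold)
FROM   channels c, sales s
WHERE  s.channel_id = c.channel_id
GROUP BY c.channel_desc;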
- PDML and Direct-Path Restrictions
- Limitation on the Degree of Parallelism
- Using Local and Global Striping
- Increasing INITRANS
- Limitation on Available Number of Transaction Free Lists for Segments
- Using Multiple Archivers
- Database Writer Process (DBWn) Workload
- [NO]LOGGING Clause
Increasing INITRANS
If you have global indexes, a global index segment and global index blocks are shared by server processes of the same parallel DML statement. Even if the operations are not performed against the same row, the server processes can share the same index blocks. Each server transaction needs one transaction entry in the index block header before it can make changes to a block. Therefore, in the CREATE INDEX or ALTER INDEX statements, you should set INITRANS, the initial number of transactions allocated within each data block, to a large value, such as the maximum DOP against this index.
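For example, either of the following statements sets a high INITRANS on a global index (the index and column names are illustrative):

CREATE INDEX sales_cust_gix ON sales (cust_id) INITRANS 16 PARALLEL 8;

ALTER INDEX sales_cust_gix INITRANS 16;   -- new blocks use the new setting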
You can work around this limitation the next time you re-create the segment header by decreasing the number of process free lists; this leaves more room for transaction free lists in the segment header. For UPDATE and DELETE operations, each server process can require its own transaction free list. The parallel DML DOP is thus effectively limited by the smallest number of transaction free lists available on the table and on any of the global indexes the DML statement must maintain. For example, if the table has 25 transaction free lists and the table has two global indexes, one with 50 transaction free lists and one with 30 transaction free lists, the DOP is limited to 25. If the table had had 40 transaction free lists, the DOP would have been limited to 30.

The FREELISTS parameter of the STORAGE clause is used to set the number of process free lists. By default, no process free lists are created. The default number of transaction free lists depends on the block size. For example, if the number of process free lists is not set explicitly, a 4 KB block has about 80 transaction free lists by default. The minimum number of transaction free lists is 25.
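A sketch of setting the number of process free lists at creation time (the table definition is hypothetical, and this applies to segments that use manual segment space management):

CREATE TABLE sales_hist (
  sale_id  NUMBER,
  amount   NUMBER)
  TABLESPACE dwh_data
  STORAGE (FREELISTS 8);   -- process free lists; keep this modest to leave room
                           -- for transaction free lists in the segment header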
If the query against V$SYSTEM_EVENT shows a significant number of free buffer waits, you should consider increasing the number of DBWn processes. If there are no waits for free buffers, the query does not return any rows.
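A query of the following general form shows whether sessions have waited for free buffers:

SELECT event, total_waits, time_waited
FROM   v$system_event
WHERE  event = 'free buffer waits';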
[NO]LOGGING Clause
The [NO]LOGGING clause applies to tables, partitions, tablespaces, and indexes. Virtually no log is generated for certain operations (such as direct-path INSERT) if the NOLOGGING clause is used. The NOLOGGING attribute is not specified at the INSERT statement level but is instead specified when using the ALTER or CREATE statement for a table, partition, index, or tablespace.
When a table or index has NOLOGGING set, neither parallel nor serial direct-path INSERT operations generate redo logs. Processes running with the NOLOGGING option set run faster because no redo is generated. However, after a NOLOGGING operation against a table, partition, or index, if a media failure occurs before a backup is taken, then all tables, partitions, and indexes that have been modified might be corrupted. Direct-path INSERT operations (except for dictionary updates) never generate undo logs. The NOLOGGING attribute does not affect undo, only redo. To be precise, NOLOGGING allows the direct-path INSERT operation to generate a negligible amount of redo (range-invalidation redo, as opposed to full image redo).

For backward compatibility, [UN]RECOVERABLE is still supported as an alternate keyword with the CREATE TABLE statement. This alternate keyword might not be supported, however, in future releases.

At the tablespace level, the logging clause specifies the default logging attribute for all tables, indexes, and partitions created in the tablespace. When an existing tablespace logging attribute is changed by the ALTER TABLESPACE statement, then all tables, indexes, and partitions created after the ALTER statement will have the new logging attribute; existing ones will not change their logging attributes. The tablespace-level logging attribute can be overridden by the specifications at the table, index, or partition level.

The default logging attribute is LOGGING. However, if you have put the database in NOARCHIVELOG mode, by issuing ALTER DATABASE NOARCHIVELOG, then all operations that can be done without logging will not generate logs, regardless of the specified logging attribute.
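For example, a NOLOGGING direct-path load might look like the following sketch (the table names are hypothetical; take a backup afterward if you need recoverability):

ALTER TABLE sales_hist NOLOGGING;

INSERT /*+ APPEND */ INTO sales_hist
SELECT * FROM new_sales_data;
COMMIT;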
Parallel local index creation uses a single server set. Each server process in the set is assigned a table partition to scan and for which to build an index partition. Because half as many server processes are used for a given DOP, parallel local index creation can be run with a higher DOP. However, the DOP is restricted to be less than or equal to the number of index partitions you wish to create. To avoid this limitation, you can use the DBMS_PCLXUTIL package.

You can optionally specify that no redo and undo logging should occur during index creation. This can significantly improve performance but temporarily renders the index unrecoverable. Recoverability is restored after the new index is backed up. If your application can tolerate a window where recovery of the index requires it to be re-created, then you should consider using the NOLOGGING clause.

The PARALLEL clause in the CREATE INDEX statement is the only way in which you can specify the DOP for creating the index. If the DOP is not specified in the PARALLEL clause of CREATE INDEX, then the number of CPUs is used as the DOP. If there is no PARALLEL clause, index creation is done serially.

When creating an index in parallel, the STORAGE clause refers to the storage of each of the subindexes created by the query server processes. Therefore, an index created with an INITIAL of 5 MB and a DOP of 12 consumes at least 60 MB of storage during index creation because each process starts with an extent of 5 MB. When the query coordinator process combines the sorted subindexes, some of the extents might be trimmed, and the resulting index might be smaller than the requested 60 MB.

When you add or enable a UNIQUE or PRIMARY KEY constraint on a table, you cannot automatically create the required index in parallel. Instead, manually create an index on the desired columns, using the CREATE INDEX statement and an appropriate PARALLEL clause, and then add or enable the constraint. Oracle then uses the existing index when enabling or adding the constraint. Multiple constraints on the same table can be enabled concurrently and in parallel if all the constraints are already in the ENABLE NOVALIDATE state. In the following example, the ALTER TABLE ... ENABLE CONSTRAINT statement performs the table scan that checks the constraint in parallel:
CREATE TABLE a (a1 NUMBER CONSTRAINT ach CHECK (a1 > 0) ENABLE NOVALIDATE)
  PARALLEL;
INSERT INTO a VALUES (1);
COMMIT;
ALTER TABLE a ENABLE CONSTRAINT ach;
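An index created in parallel and without logging, as described earlier, might look like the following (the index and column names are illustrative):

CREATE INDEX sales_cust_ix ON sales (cust_id)
  PARALLEL 8
  NOLOGGING;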
- Parallel DML Tip 1: INSERT
- Parallel DML Tip 2: Direct-Path INSERT
- Parallel DML Tip 3: Parallelizing INSERT, MERGE, UPDATE, and DELETE
A parallel, direct-path INSERT requires the following:
- ALTER SESSION ENABLE PARALLEL DML
- Table PARALLEL attribute or PARALLEL hint
- APPEND hint (optional)
If parallel DML is enabled and there is a PARALLEL hint or PARALLEL attribute set for the table in the data dictionary, then inserts are parallel and appended, unless a restriction applies. If parallel DML is not enabled, or if neither the PARALLEL hint nor the PARALLEL attribute is present, the insert is performed serially.
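Putting these requirements together, a parallel, appended insert typically looks like the following sketch (the table names are hypothetical):

ALTER SESSION ENABLE PARALLEL DML;

INSERT /*+ PARALLEL(sales_hist) */ INTO sales_hist
SELECT /*+ PARALLEL(new_sales_data) */ * FROM new_sales_data;
COMMIT;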
The NOLOGGING attribute should not be used when recovery is needed for the table or partition. If recovery is needed, be sure to take a backup immediately after the operation. Use the ALTER TABLE [NO]LOGGING statement to set the appropriate value.
Add the new employees who were hired after the acquisition of ACME.
INSERT /*+ PARALLEL(employees) */ INTO employees
SELECT /*+ PARALLEL(ACME_EMP) */ * FROM ACME_EMP;
The APPEND keyword is not required in this example because it is implied by the PARALLEL hint.
Parallelizing UPDATE and DELETE

The PARALLEL hint (placed immediately after the UPDATE or DELETE keyword) applies not only to the underlying scan operation, but also to the UPDATE or DELETE operation. Alternatively, you can specify UPDATE or DELETE parallelism in the PARALLEL clause specified in the definition of the table to be modified. If you have explicitly enabled parallel DML for the session or transaction, UPDATE or DELETE statements that have their query operation parallelized can also have their UPDATE or DELETE operation parallelized. Any subqueries or updatable views in the statement can have their own separate PARALLEL hints or clauses, but these parallel directives do not affect the decision to parallelize the update or delete; whether those nested operations can be performed in parallel has no effect on whether the UPDATE or DELETE portion can be performed in parallel. Tables must be partitioned in order to support parallel UPDATE and DELETE.
Example 24-8 Parallelizing UPDATE and DELETE
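An update of the general shape this example describes (the table, column, and predicate shown here are illustrative) is:

UPDATE /*+ PARALLEL(employees) */ employees
SET    salary = salary * 1.10
WHERE  department_id = 10;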
The PARALLEL hint is applied to the UPDATE operation as well as to the scan.
Example 24-9 Parallelizing UPDATE and DELETE
Remove all products in the grocery category because the grocery business line was recently spun off into a separate company.
DELETE /*+ PARALLEL(PRODUCTS) */ FROM PRODUCTS
WHERE PRODUCT_CATEGORY = 'GROCERY';
Again, the parallelism is applied to the scan as well as to the DELETE operation on the products table.
The data to be loaded contains either new rows or rows that have been updated since the last refresh of the data warehouse. In this example, the updated data is shipped from the production system to the data warehouse system by means of ASCII files. These files must be loaded into a temporary table, named diff_customer, before starting the refresh process. You can use SQL*Loader with both the parallel and direct options to perform this task efficiently. You can use the APPEND hint when loading in parallel as well. Once diff_customer is loaded, the refresh process can be started. It can be performed in two phases or by merging in parallel, as demonstrated in the following:
- Updating the Table in Parallel
- Inserting the New Rows into the Table in Parallel
- Merging in Parallel
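Updating the Table in Parallel

A straightforward way to apply the changed rows is an UPDATE whose SET clause and WHERE clause each use a subquery against diff_customer; a sketch of such a statement (the column names c_key, c_name, and c_addr are assumed) is:

UPDATE customers
SET    (c_name, c_addr) = (SELECT c_name, c_addr
                           FROM   diff_customer
                           WHERE  diff_customer.c_key = customers.c_key)
WHERE  c_key IN (SELECT c_key FROM diff_customer);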
Unfortunately, the two subqueries in this statement affect performance. An alternative is to rewrite this query using updatable join views. To do this, you must first add a primary key constraint to the diff_customer table to ensure that the modified columns map to a key-preserved table:
CREATE UNIQUE INDEX diff_pkey_ind ON diff_customer(c_key) PARALLEL NOLOGGING;
ALTER TABLE diff_customer ADD PRIMARY KEY (c_key);
You can then update the customers table with the following SQL statement:
UPDATE /*+ PARALLEL(cust_joinview) */
  (SELECT /*+ PARALLEL(customers) PARALLEL(diff_customer) */
          customers.c_name AS c_name, customers.c_addr AS c_addr,
          diff_customer.c_name AS c_newname, diff_customer.c_addr AS c_newaddr
   FROM   customers, diff_customer
   WHERE  customers.c_key = diff_customer.c_key) cust_joinview
SET c_name = c_newname, c_addr = c_newaddr;
The base scans feeding the join view cust_joinview are done in parallel. You can then parallelize the update to further improve performance, but only if the customers table is partitioned.
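Inserting the New Rows into the Table in Parallel

Rows in diff_customer that do not yet exist in customers can be added with an INSERT ... SELECT that excludes keys already present; a sketch (using the same assumed column names) is:

INSERT /*+ PARALLEL(customers) */ INTO customers
SELECT /*+ PARALLEL(diff_customer) */ *
FROM   diff_customer
WHERE  diff_customer.c_key NOT IN (SELECT c_key FROM customers);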
However, you can guarantee that the subquery is transformed into an anti-hash join by using the HASH_AJ hint. Doing so enables you to use parallel INSERT to execute the preceding statement efficiently. Parallel INSERT is applicable even if the table is not partitioned.
Merging in Parallel
You can combine updates and inserts into one statement, commonly known as a merge. The following statement achieves the same result as all of the statements in "Updating the Table in Parallel" on page 24-86 and "Inserting the New Rows into the Table in Parallel" on page 24-87:
MERGE INTO customers
USING diff_customer
ON (diff_customer.c_key = customers.c_key)
WHEN MATCHED THEN
  UPDATE SET c_name = diff_customer.c_name, c_addr = diff_customer.c_addr
WHEN NOT MATCHED THEN
  INSERT (c_key, c_name, c_addr)
  VALUES (diff_customer.c_key, diff_customer.c_name, diff_customer.c_addr);
Often, hints offer no advantage over the plan chosen by query optimization. In such cases, begin with the execution plan recommended by query optimization, and go on to test the effect of hints only after you have quantified your performance expectations. Remember that hints are powerful. If you use them and the underlying data changes, you might need to change the hints. Otherwise, the effectiveness of your execution plans might deteriorate.
FIRST_ROWS(n) Hint
The FIRST_ROWS(n) hint instructs the optimizer to use an optimization mode that returns the first n rows in the shortest amount of time. Oracle Corporation recommends that you use this hint in place of the older FIRST_ROWS hint for online queries because the new optimization mode may improve the response time compared to the old optimization mode. Use the FIRST_ROWS(n) hint in cases where you want the first n rows in the shortest possible time. For example, to obtain the first 10 rows in the shortest possible time, use the hint as follows:
SELECT /*+ FIRST_ROWS(10) */ article_id FROM articles_tab WHERE CONTAINS(article, 'Oracle')>0 ORDER BY pub_date DESC;
The OPTIMIZER_DYNAMIC_SAMPLING initialization parameter controls the level of dynamic sampling, in terms of both the type (unanalyzed or analyzed) of tables sampled and the amount of I/O spent on sampling. Oracle also provides the table-specific hint DYNAMIC_SAMPLING. If the table name is omitted, the hint is considered cursor-level. The table-level hint forces dynamic sampling for the table. See Oracle Database Performance Tuning Guide for more information regarding dynamic sampling.
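For example, a table-level hint of the following form (the table name and sampling level are illustrative) forces dynamic sampling of one table at level 4:

SELECT /*+ DYNAMIC_SAMPLING(s 4) */ COUNT(*)
FROM   new_sales_data s
WHERE  load_date > DATE '2003-12-01';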
Glossary
additive
Describes a fact (or measure) that can be summarized through addition. An additive fact is the most common type of fact. Examples include sales, cost, and profit. Contrast with nonadditive and semi-additive.
See Also: fact
advisor
See: SQLAccess Advisor.

aggregate
Summarized data. For example, unit sales of a particular product could be aggregated by day, month, quarter and yearly sales.

aggregation
The process of consolidating data values into a single value. For example, sales data could be collected on a daily basis and then be aggregated to the week level, the week data could be aggregated to the month level, and so on. The data can then be referred to as aggregate data. Aggregation is synonymous with summarization, and aggregate data is synonymous with summary data.

ancestor
A value at any level higher than a given value in a hierarchy. For example, in a Time dimension, the value 1999 might be the ancestor of the values Q1-99 and Jan-99.
See Also: hierarchy and level
attribute
A descriptive characteristic of one or more levels. For example, the product dimension for a clothing manufacturer might contain a level called item, one of whose attributes is color. Attributes represent logical groupings that enable end users to select data based on like characteristics. Note that in relational modeling, an attribute is defined as a characteristic of an entity. In Oracle Database 10g, an attribute is a column in a dimension that characterizes elements of a single level.

cardinality
From an OLTP perspective, this refers to the number of rows in a table. From a data warehousing perspective, this typically refers to the number of distinct values in a column. For most data warehouse DBAs, a more important issue is the degree of cardinality.
See Also: degree of cardinality
change set
A set of logically grouped change data that is transactionally consistent. It contains one or more change tables.

change table
A relational table that contains change data for a single source table. To Change Data Capture subscribers, a change table is known as a publication.

child
A value at the level under a given value in a hierarchy. For example, in a Time dimension, the value Jan-99 might be the child of the value Q1-99. A value can be a child for more than one parent if the child value belongs to multiple hierarchies.
See Also: hierarchy and level
cleansing
The process of resolving inconsistencies and fixing the anomalies in source data, typically as part of the ETL process.
Common Warehouse Metadata (CWM)
A repository standard used by Oracle data warehousing and decision support. The CWM repository schema is a standalone product that other products can share; each product owns only the objects within the CWM repository that it creates.

cross product
A procedure for combining the elements in multiple sets. For example, given two columns, each element of the first column is matched with every element of the second column. A simple example is illustrated as follows:
Col1   Col2   Cross Product
----   ----   -------------
a      c      ac
b      d      ad
              bc
              bd
Cross products are performed when grouping sets are concatenated, as described in Chapter 20, "SQL for Aggregation in Data Warehouses".

data mart
A data warehouse that is designed for a particular line of business, such as sales, marketing, or finance. In a dependent data mart, the data can be derived from an enterprise-wide data warehouse. In an independent data mart, data can be collected directly from sources.
See Also: data warehouse
data source
A database, application, repository, or file that contributes data to a warehouse.

data warehouse
A relational database that is designed for query and analysis rather than transaction processing. A data warehouse usually contains historical data that is derived from transaction data, but it can include data from other sources. It separates analysis workload from transaction workload and enables a business to consolidate data from several sources.
In addition to a relational database, a data warehouse environment often consists of an ETL solution, an OLAP engine, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users.
See Also: ETL and online analytical processing (OLAP)
degree of cardinality
The number of unique values of a column divided by the total number of rows in the table. This is particularly important when deciding which indexes to build. You typically want to use bitmap indexes on low degree of cardinality columns and B-tree indexes on high degree of cardinality columns. As a general rule, a cardinality of under 1% makes a good candidate for a bitmap index.

denormalize
The process of allowing redundancy in a table. Contrast with normalize.

derived fact (or measure)
A fact (or measure) that is generated from existing data using a mathematical operation or a data transformation. Examples include averages, totals, percentages, and differences.

detail
See: fact table.

detail table
See: fact table.

dimension
The term dimension is commonly used in two ways:
- A general term for any characteristic that is used to specify the members of a data set. The three most common dimensions in sales-oriented data warehouses are time, geography, and product. Most dimensions have hierarchies.
- An object defined in a database to enable queries to navigate dimensions. In Oracle Database 10g, a dimension is a database object that defines hierarchical (parent/child) relationships between pairs of column sets. In Oracle Express, a dimension is a database object that consists of a list of values.
dimension table
Dimension tables describe the business entities of an enterprise, represented as hierarchical, categorical information such as time, departments, locations, and products. Dimension tables are sometimes called lookup or reference tables.

dimension value
One element in the list that makes up a dimension. For example, a computer company might have dimension values in the product dimension called LAPPC and DESKPC. Values in the geography dimension might include Boston and Paris. Values in the time dimension might include MAY96 and JAN97.

drill
To navigate from one item to a set of related items. Drilling typically involves navigating up and down through the levels in a hierarchy. When selecting data, you can expand or collapse a hierarchy by drilling down or up in it, respectively.
See Also: drill down and drill up
drill down
To expand the view to include child values that are associated with parent values in the hierarchy.
See Also: drill and drill up
drill up
To collapse the list of descendant values that are associated with a parent value in the hierarchy.

element
An object or process. For example, a dimension is an object, a mapping is a process, and both are elements.

entity
Entity is used in database modeling. In relational databases, it typically maps to a table.
ETL
Extraction, transformation, and loading. ETL refers to the methods involved in accessing and manipulating source data and loading it into a data warehouse. The order in which these processes are performed varies. Note that ETT (extraction, transformation, transportation) and ETM (extraction, transformation, move) are sometimes used instead of ETL.
See Also: extraction, transformation, and transportation
extraction
The process of taking data out of a source as part of an initial phase of ETL.
See Also: ETL
fact
Data, usually numeric and additive, that can be examined and analyzed. Examples include sales, cost, and profit. Fact and measure are synonymous; fact is more commonly used with relational environments, measure is more commonly used with multidimensional environments.
See Also: derived fact (or measure)
fact table
A table in a star schema that contains facts. A fact table typically has two types of columns: those that contain facts and those that are foreign keys to dimension tables. The primary key of a fact table is usually a composite key that is made up of all of its foreign keys. A fact table might contain either detail level facts or facts that have been aggregated (fact tables that contain aggregated facts are often instead called summary tables). A fact table usually contains facts with the same level of aggregation.
fast refresh
An operation that applies only the data changes to a materialized view, thus eliminating the need to rebuild the materialized view from scratch.

file-to-table mapping
Maps data from flat files to tables in the warehouse.

hierarchy
A logical structure that uses ordered levels as a means of organizing data. A hierarchy can be used to define data aggregation; for example, in a time dimension, a hierarchy might be used to aggregate data from the Month level to the Quarter level to the Year level. Hierarchies can be defined in Oracle as part of the dimension object. A hierarchy can also be used to define a navigational drill path, regardless of whether the levels in the hierarchy represent aggregated totals.
See Also: dimension and level
high boundary
The newest row in a subscription window.

level
A position in a hierarchy. For example, a time dimension might have a hierarchy that represents data at the Month, Quarter, and Year levels.
See Also: hierarchy
level value table
A database table that stores the values or data for the levels you created as part of your dimensions and hierarchies.

low boundary
The oldest row in a subscription window.

mapping
The definition of the relationship and data flow between source and target objects.

materialized view
A pre-computed table comprising aggregated or joined data from fact and possibly dimension tables. Also known as a summary or aggregate table.
measure
See: fact.

metadata
Data that describes data and other structures, such as objects, business rules, and processes. For example, the schema design of a data warehouse is typically stored in a repository as metadata, which is used to generate scripts used to build and populate the data warehouse. A repository contains metadata. Examples include: for data, the definition of a source-to-target transformation that is used to generate and populate the data warehouse; for information, definitions of tables, columns, and associations that are stored inside a relational modeling tool; for business rules, discount by 10 percent after selling 1,000 items.

model
An object that represents something to be made. A representative style, plan, or design. Metadata that defines the structure of the data warehouse.

nonadditive
Describes a fact (or measure) that cannot be summarized through addition. An example includes Average. Contrast with additive and semi-additive.

normalize
In a relational database, the process of removing redundancy in data by separating the data into multiple tables. Contrast with denormalize.

OLAP
See: online analytical processing (OLAP).

online analytical processing (OLAP)
OLAP functionality is characterized by dynamic, multidimensional analysis of historical data, which supports activities such as the following:
- Calculating across dimensions and through hierarchies
- Analyzing trends
- Drilling up and down through hierarchies
- Rotating to change the dimensional orientation
OLAP tools can run against a multidimensional database or interact directly with a relational database.

OLTP
See: online transaction processing (OLTP).

online transaction processing (OLTP)
OLTP systems are optimized for fast and reliable transaction handling. Compared to data warehouse systems, most OLTP interactions will involve a relatively small number of rows, but a larger group of tables.

parallelism
Breaking down a task so that several processes do part of the work. When multiple CPUs each do their portion simultaneously, very large performance gains are possible.

parallel execution
Breaking down a task so that several processes do part of the work. When multiple CPUs each do their portion simultaneously, very large performance gains are possible.

parent
A value at the level above a given value in a hierarchy. For example, in a Time dimension, the value Q1-99 might be the parent of the value Jan-99.
See Also: child and hierarchy
partition
Very large tables and indexes can be difficult and time-consuming to work with. To improve manageability, you can break your tables and indexes into smaller pieces called partitions.
pivoting
A transformation where each record in an input stream is converted to many records in the appropriate table in the data warehouse. This is particularly important when taking data from nonrelational databases.

publication
A relational table that contains change data for a single source table. Change Data Capture publishers refer to a publication as a change table.

publication ID
A publication ID is a unique numeric value that Change Data Capture assigns to each change table defined by a publisher.

publisher
Usually a database administrator who is in charge of creating and maintaining schema objects that make up the Change Data Capture system.

refresh
The mechanism whereby materialized views are changed to reflect new data.

schema
A collection of related database objects. Relational schemas are grouped by database user ID and include tables, views, and other objects. The sample schema sh is used throughout this Guide.
See Also: snowflake schema and star schema
semi-additive
Describes a fact (or measure) that can be summarized through addition along some, but not all, dimensions. Examples include headcount and on-hand stock. Contrast with additive and nonadditive.

slice and dice
This is an informal term referring to data retrieval and manipulation. We can picture a data warehouse as a cube of data, where each axis of the cube represents a dimension. To "slice" the data is to retrieve a piece (a slice) of the cube by specifying measures and values for some or all of the dimensions. When we retrieve a data slice, we may also move and reorder its columns and rows as if we had diced the slice into many small pieces. A system with good slicing and dicing makes it easy to navigate through large amounts of data.
snowflake schema
A type of star schema in which the dimension tables are partly or fully normalized.
See Also: schema and star schema
source
A database, application, file, or other storage facility from which the data in a data warehouse is derived.

source system
A database, application, file, or other storage facility from which the data in a data warehouse is derived.

source tables
The tables in a source database.

SQLAccess Advisor
The SQLAccess Advisor helps you achieve your performance goals by recommending the proper set of materialized views, materialized view logs, and indexes for a given workload. It is a GUI in Oracle Enterprise Manager, and has similar capabilities to the DBMS_ADVISOR package.

staging area
A place where data is processed before entering the warehouse.

staging file
A file used when data is processed before entering the warehouse.

star query
A join between a fact table and a number of dimension tables. Each dimension table is joined to the fact table using a primary key to foreign key join, but the dimension tables are not joined to each other.

star schema
A relational schema whose design represents a multidimensional data model. The star schema consists of one or more fact tables and one or more dimension tables that are related through foreign keys.
See Also: schema and snowflake schema
subject area
A classification system that represents or distinguishes parts of an organization or areas of knowledge. A data mart is often developed to support a subject area such as sales, marketing, or geography.
See Also: data mart
subscribers
Consumers of the published change data. These are normally applications.

subscription
A mechanism for Change Data Capture subscribers that controls access to the change data from one or more source tables of interest within a single change set. A subscription contains one or more subscriber views.

subscription window
A mechanism that defines the range of rows in a Change Data Capture publication that the subscriber can currently see in subscriber views.

summary
See: materialized view.

Summary Advisor
Replaced by the SQLAccess Advisor. See: SQLAccess Advisor.

target
Holds the intermediate or final results of any part of the ETL process. The target of the entire ETL process is the data warehouse.
See Also: data warehouse and ETL
third normal form (3NF)
A classical relational database modeling technique that minimizes data redundancy through normalization.

third normal form schema
A schema that uses the same kind of normalization as typically found in an OLTP system. Third normal form schemas are sometimes chosen for large data
warehouses, especially environments with significant data loading requirements that are used to feed data marts and execute long-running queries.
See Also: snowake schema and star schema
transformation
The process of manipulating data. Any manipulation beyond copying is a transformation. Examples include cleansing, aggregating, and integrating data from multiple sources.

transportation
The process of moving copied or transformed data from a source to a data warehouse.
See Also: transformation
unique identifier
An identifier whose purpose is to differentiate between the same item when it appears in more than one place.

update window
The length of time available for updating a warehouse. For example, you might have 8 hours at night to update your warehouse.

update frequency
How often a data warehouse is updated with new information. For example, a warehouse might be updated nightly from an OLTP system.

validation
The process of verifying metadata definitions and configuration parameters.

versioning
The ability to create new versions of a data warehouse project for new requirements and changes.
Index
A
adaptive multiuser algorithm for, 24-46 definition, 24-45 affinity parallel DML, 24-72 partitions, 24-71 aggregates, 8-12, 18-73 computability check, 18-34 ALL_PUBLISHED_COLUMNS view, 16-70 ALTER, 18-5 ALTER INDEX statement partition attributes, 5-42 ALTER MATERIALIZED VIEW statement, 8-21 ALTER SESSION statement ENABLE PARALLEL DML clause, 24-22 FORCE PARALLEL DDL clause, 24-41, 24-44 create or rebuild index, 24-42, 24-44 create table as select, 24-43, 24-44 move or split partition, 24-42, 24-45 FORCE PARALLEL DML clause insert, 24-40, 24-44 update and delete, 24-38, 24-39, 24-44 ALTER TABLE statement NOLOGGING clause, 24-84 altering dimensions, 10-14 amortization calculating, 22-51 analytic functions concepts, 21-3 analyzing data for parallel processing, 24-65 APPEND hint, 24-83 applications data warehouses star queries, 19-3 decision support, 24-2 decision support systems (DSS), 6-2 parallel SQL, 24-16 direct-path INSERT, 24-22 parallel DML, 24-21 ARCH processes multiple, 24-80 architecture data warehouse, 1-5 MPP, 24-72 SMP, 24-72 asychronous AutoLog publishing requirements for, 16-19 asynchronous AutoLog publishing latency for, 16-20 location of staging database, 16-20 setting database initialization parameters for, 16-22 asynchronous Autolog publishing source database performace impact, 16-20 Asynchronous Change Data Capture columns of built-in Oracle datatypes supported by, 16-51 asynchronous Change Data Capture archived redo log files and, 16-48 ARCHIVELOGMODE and, 16-48 supplemental logging, 16-50 supplemental logging and, 16-11 asynchronous change sets disabling, 16-53 enabling, 16-53
exporting, 16-68 importing, 16-68 managing, 16-52 recovering from capture errors, 16-55 example of, 16-56, 16-57 removing DDL, 16-58 specifying ending values for, 16-52 specifying starting values for, 16-52 stopping capture on DDL, 16-54 excluded statements, 16-55 asynchronous change tables exporting, 16-68 importing, 16-68 asynchronous HotLog publishing latency for, 16-20 location of staging database, 16-20 requirements for, 16-19 setting database initialization parameters for, 16-21, 16-22 asynchronous Hotlog publishing source database performace impact, 16-20 asynchronous I/O, 24-60 attributes, 2-3, 10-6 AutoLog change sets, 16-15 Automatic Storage Management, 4-4
B
bandwidth, 5-2, 24-2 bind variables with query rewrite, 18-61 bitmap indexes, 6-2 nulls and, 6-5 on partitioned tables, 6-6 parallel query and DML, 6-3 bitmap join indexes, 6-6 block range granules, 5-3 B-tree indexes, 6-10 bitmap indexes versus, 6-3 build methods, 8-23
C
capture errors recovering from, 16-55
cardinality degree of, 6-3 CASE expressions, 21-43 cell referencing, 22-15 Change Data Capture, 12-5 asynchronous Streams apply process and, 16-25 Streams capture process and, 16-25 benefits for subscribers, 16-9 choosing a mode, 16-20 effects of stopping on DDL, 16-54 latency, 16-20 location of staging database, 16-20 modes of data capture asynchronous AutoLog, 16-13 asynchronous HotLog, 16-12 synchronous, 16-10 Oracle Data Pump and, 16-67 removing from database, 16-71 restriction on direct-path INSERT statement, 16-72 setting up, 16-18 source database performance impact, 16-20 static data dictionary views, 16-16 supported export utility, 16-67 supported import utility, 16-67 systemwide triggers installed by, 16-71 Change Data Capture publisher default tablespace for, 16-19 change sets AutoLog, 16-15 AutoLog change sources and, 16-15 defined, 16-14 effects of disabling, 16-54 HotLog, 16-15 HOTLOG_SOURCE change sources and, 16-15 managing asynchronous, 16-52 synchronous, 16-15 synchronous Change Data Capture and, 16-15 valid combinations with change sources, 16-15 change sources asynchronous AutoLog Change Data Capture and, 16-13 database instance represented, 16-15 defined, 16-6
HOTLOG_SOURCE, 16-12 SYNC_SOURCE, 16-10 valid combinations with change sets, 16-15 change tables adding a column to, 16-70 control columns, 16-60 defined, 16-6 dropping, 16-67 dropping with active subscribers, 16-67 effect of SQL DROP USER CASCADE statement on, 16-67 exporting, 16-67 granting subscribers access to, 16-64 importing, 16-67 importing for Change Data Capture, 16-69 managing, 16-58 purging all in a named change set, 16-66 purging all on staging database, 16-66 purging by name, 16-66 purging of unneeded data, 16-65 source tables referenced by, 16-59 tablespaces created in, 16-59 change-value selection, 16-3 columns cardinality, 6-3 COMMIT_TIMESTAMP$ control column, 16-61 common joins, 18-29 COMPLETE clause, 8-26 complete refresh, 15-15 complex queries snowflake schemas, 19-5 composite columns, 20-20 partitioning, 5-8 partitioning methods, 5-8 performance considerations, 5-12, 5-14 compression See data segment compression, 8-22 concatenated groupings, 20-22 concatenated ROLLUP, 20-29 concurrent users increasing the number of, 24-49 configuration bandwidth, 4-2
CONNECT role, 16-19 constraints, 7-2, 10-12 foreign key, 7-5 parallel create table, 24-42 RELY, 7-6 states, 7-3 unique, 7-4 view, 7-7, 18-46 with partitioning, 7-7 with query rewrite, 18-72 control columns used to indicate changed columns in a row, 16-62 controls columns COMMIT_TIMESTAMP$, 16-61 CSCN$, 16-61 OPERATION$, 16-61 ROW_ID$, 16-62 RSID$, 16-61 SOURCE_COLMAP$, 16-61 interpreting, 16-62 SYS_NC_OID$, 16-62 TARGET_COLMAP$, 16-61 interpreting, 16-62 TIMESTAMP$, 16-61 USERNAME$, 16-62 XIDSEQ$, 16-62 XIDSLT$, 16-62 XIDUSN$, 16-62 cost-based rewrite, 18-3 CPU utilization, 5-2, 24-2 CREATE DIMENSION statement, 10-4 CREATE INDEX statement, 24-82 partition attributes, 5-42 rules of parallelism, 24-42 CREATE MATERIALIZED VIEW statement, 8-21 enabling query rewrite, 18-5 CREATE SESSION privilege, 16-19 CREATE TABLE AS SELECT rules of parallelism index-organized tables, 24-3 CREATE TABLE AS SELECT statement, 24-64, 24-75 rules of parallelism
index-organized tables, 24-16 CREATE TABLE privilege, 16-19 CREATE TABLE statement AS SELECT decision support systems, 24-16 rules of parallelism, 24-42 space fragmentation, 24-18 temporary storage space, 24-18 parallelism, 24-16 index-organized tables, 24-3, 24-16 CREATE TABLESPACE privilege, 16-19 CSCN$ control column, 16-61 CUBE clause, 20-9 partial, 20-11 when to use, 20-9 cubes hierarchical, 9-10 CUME_DIST function, 21-12
D
data integrity of parallel DML restrictions, 24-26 partitioning, 5-4 purging, 15-12 sufficiency check, 18-33 transformation, 14-8 transportation, 13-2 data compression See data segment compression, 8-22 data cubes hierarchical, 20-24 data densification, 21-45 time series calculation, 21-53 with sparse data, 21-46 data dictionary asynchronous change data capture and, 16-38 data extraction with and without Change Data Capture, 16-5 data manipulation language parallel DML, 24-19 transaction model for parallel DML, 24-23 data marts, 1-6
data segment compression, 3-5 bitmap indexes, 5-17 materialized views, 8-22 partitioning, 3-5, 5-16 data transformation multistage, 14-2 pipelined, 14-3 data warehouse, 8-2 architectures, 1-5 dimension tables, 8-7 dimensions, 19-3 fact tables, 8-7 logical design, 2-2 partitioned tables, 5-10 physical design, 3-2 refresh tips, 15-20 refreshing table data, 24-21 star queries, 19-3 database scalability, 24-21 staging, 8-2 database initialization paramters adjusting when Streams values change, 16-25 determining current setting of, 16-25 retaining settings when database is restarted, 16-25 database writer process (DBWn) tuning, 24-80 DATE datatype partition pruning, 5-32 partitioning, 5-32 date folding with query rewrite, 18-49 DB_BLOCK_SIZE initialization parameter, 24-60 and parallel query, 24-60 DB_FILE_MULTIBLOCK_READ_COUNT initialization parameter, 24-60 DBA role, 16-19 DBA_DATA_FILES view, 24-66 DBA_EXTENTS view, 24-66 DBMS_ADVISOR package, 17-2 DBMS_CDC_PUBLISH package, 16-6 privileges required to use, 16-19 DBMS_CDC_PUBLISH.DROP_CHANGE_TABLE PL/SQL procedure, 16-67
DBMS_CDC_PUBLISH.PURGE PL/SQL procedure, 16-65, 16-66 DBMS_CDC_PUBLISH.PURGE_CHANG_SET PL/SQL procedure, 16-66 DBMS_CDC_PUBLISH.PURGE_CHANGE_TABLE PL/SQL procedure, 16-66 DBMS_CDC_SUBSCRIBE package, 16-7 DBMS_CDC_SUBSCRIBE.PURGE_WINDOW PL/SQL procedure, 16-65 DBMS_JOB PL/SQL procedure, 16-65 DBMS_MVIEW package, 15-16, 15-17 EXPLAIN_MVIEW procedure, 8-37 EXPLAIN_REWRITE procedure, 18-66 REFRESH procedure, 15-14, 15-17 REFRESH_ALL_MVIEWS procedure, 15-14 REFRESH_DEPENDENT procedure, 15-14 DBMS_STATS package, 17-4, 18-3 decision support systems (DSS) bitmap indexes, 6-2 disk striping, 24-72 parallel DML, 24-21 parallel SQL, 24-16, 24-21 performance, 24-21 scoring tables, 24-22 default partition, 5-8 degree of cardinality, 6-3 degree of parallelism, 24-5, 24-32, 24-37, 24-39 and adaptive multiuser, 24-45 between query operations, 24-12 parallel SQL, 24-33 DELETE statement parallel DELETE statement, 24-38 DENSE_RANK function, 21-5 design logical, 3-2 physical, 3-2 dimension tables, 2-5, 8-7 normalized, 10-10 dimensional modeling, 2-3 dimensions, 2-6, 10-2, 10-12 altering, 10-14 analyzing, 20-2 creating, 10-4 definition, 10-2 dimension tables, 8-7
dropping, 10-14 hierarchies, 2-6 hierarchies overview, 2-6 multiple, 20-2 star joins, 19-4 star queries, 19-3 validating, 10-12 with query rewrite, 18-73 direct-path INSERT restrictions, 24-25 direct-path INSERT statement Change Data Capture restriction, 16-72 disk affinity parallel DML, 24-72 partitions, 24-71 disk redundancy, 4-3 disk striping affinity, 24-71 DISK_ASYNCH_IO initialization parameter, 24-60 distributed transactions parallel DDL restrictions, 24-4 parallel DML restrictions, 24-4, 24-27 DML access subscribers, 16-65 DML_LOCKS initialization parameter, 24-58 downstream capture, 16-35 drilling down, 10-2 hierarchies, 10-2 DROP MATERIALIZED VIEW statement, 8-21 prebuilt tables, 8-35 dropping dimensions, 10-14 materialized views, 8-37 dropping change tables, 16-67 DSS database partitioning indexes, 5-42
E
ENFORCED mode, 18-7 ENQUEUE_RESOURCES initialization parameter, 24-58 entity, 2-2 equipartitioning examples, 5-35
local indexes, 5-34 errors ORA-31424, 16-67 ORA-31496, 16-67 ETL. See extraction, transformation, and loading (ETL), 11-2 EXCHANGE PARTITION statement, 7-7 EXECUTE_CATALOG_ROLE privilege, 16-19 EXECUTE_TASK procedure, 17-26 execution plans parallel operations, 24-63 star transformations, 19-9 EXPLAIN PLAN statement, 18-65, 24-63 partition pruning, 5-33 query parallelization, 24-77 star transformations, 19-9 EXPLAIN_MVIEW procedure, 17-47 exporting a change table, 16-67 asynchronous change sets, 16-68 asynchronous change tables, 16-68 EXP utility, 12-10 expression matching with query rewrite, 18-49 extents parallel DDL, 24-18 external tables, 14-5 extraction, transformation, and loading (ETL), 11-2 overview, 11-2 process, 7-2 extractions data files, 12-7 distributed operations, 12-11 full, 12-3 incremental, 12-3 OCI, 12-9 online, 12-4 overview, 12-2 physical, 12-4 Pro*C, 12-9 SQL*Plus, 12-8
F
fact tables, 2-5
star joins, 19-4 star queries, 19-3 facts, 10-2 FAST clause, 8-26 fast refresh, 15-15 restrictions, 8-27 with UNION ALL, 15-28 FAST_START_PARALLEL_ROLLBACK initialization parameter, 24-57 features, new, 1-xxxix files ultralarge, 3-4 FIRST_ROWS(n) hint, 24-88 FIRST_VALUE function, 21-21 FIRST/LAST functions, 21-26 FORCE clause, 8-26 foreign key constraints, 7-5 joins snowflake schemas, 19-5 fragmentation parallel DDL, 24-18 FREELISTS parameter, 24-80 full partition-wise joins, 5-20 full table scans parallel execution, 24-2 functions COUNT, 6-5 CUME_DIST, 21-12 DENSE_RANK, 21-5 FIRST_VALUE, 21-21 FIRST/LAST, 21-26 GROUP_ID, 20-16 GROUPING, 20-12 GROUPING_ID, 20-15 LAG/LEAD, 21-25 LAST_VALUE, 21-21 linear regression, 21-33 NTILE, 21-13 parallel execution, 24-28 PERCENT_RANK, 21-13 RANK, 21-5 ranking, 21-5 RATIO_TO_REPORT, 21-24 REGR_AVGX, 21-34
REGR_AVGY, 21-34 REGR_COUNT, 21-34 REGR_INTERCEPT, 21-34 REGR_SLOPE, 21-34 REGR_SXX, 21-35 REGR_SXY, 21-35 REGR_SYY, 21-35 reporting, 21-22 ROW_NUMBER, 21-15 WIDTH_BUCKET, 21-39, 21-41 windowing, 21-15
G
global indexes, 24-79 global indexes partitioning, 5-37 managing partitions, 5-38 summary of index types, 5-39 granules, 5-3 block range, 5-3 partition, 5-4 GROUP_ID function, 20-16 grouping compatibility check, 18-34 conditions, 18-74 GROUPING function, 20-12 when to use, 20-15 GROUPING_ID function, 20-15 GROUPING_SETS expression, 20-17 groups, instance, 24-35 GV$FILESTAT view, 24-65
overview, 2-6 rolling up and drilling down, 10-2 high boundary defined, 16-8 hints FIRST_ROWS(n), 24-88 PARALLEL, 24-34 PARALLEL_INDEX, 24-34 query rewrite, 18-5, 18-8 histograms creating with user-defined buckets, 21-44 HotLog change sets, 16-15 HOTLOG_SOURCE change sources, 16-12 change sets and, 16-15 hypothetical rank, 21-32
I
importing a change table, 16-67, 16-69 asynchronous change sets, 16-68 asynchronous change tables, 16-68 data into a source table, 16-69 indexes bitmap indexes, 6-6 bitmap join, 6-6 B-tree, 6-10 cardinality, 6-3 creating in parallel, 24-81 global, 24-79 global partitioned indexes, 5-37 managing partitions, 5-38 local, 24-79 local indexes, 5-34 nulls and, 6-5 parallel creation, 24-81, 24-82 parallel DDL storage, 24-18 parallel local, 24-82 partitioned tables, 6-6 partitioning, 5-9 partitioning guidelines, 5-41 partitions, 5-33 index-organized tables parallel CREATE, 24-3, 24-16 parallel queries, 24-14
H
hash partitioning, 5-7 HASH_AREA_SIZE initialization parameter and parallel execution, 24-56 hierarchical cubes, 9-10, 20-29 in SQL, 20-29 hierarchies, 10-2 how used, 2-6 multiple, 10-9
initcdc.sql script, 16-72 initialization parameters DB_BLOCK_SIZE, 24-60 DB_FILE_MULTIBLOCK_READ_ COUNT, 24-60 DISK_ASYNCH_IO, 24-60 DML_LOCKS, 24-58 ENQUEUE_RESOURCES, 24-58 FAST_START_PARALLEL_ROLLBACK, 24-57 HASH_AREA_SIZE, 24-56 JOB_QUEUE_PROCESSES, 15-20 LARGE_POOL_SIZE, 24-50 LOG_BUFFER, 24-57 NLS_LANGUAGE, 5-31 NLS_SORT, 5-31 OPTIMIZER_MODE, 15-21, 18-6 PARALLEL_ADAPTIVE_MULTI_USER, 24-46 PARALLEL_EXECUTION_MESSAGE_ SIZE, 24-56 PARALLEL_MAX_SERVERS, 15-21, 24-7, 24-48 PARALLEL_MIN_PERCENT, 24-35, 24-48, 24-55 PARALLEL_MIN_SERVERS, 24-6, 24-7, 24-49 PGA_AGGREGATE_TARGET, 15-21 QUERY_REWRITE_ENABLED, 18-5, 18-6 QUERY_REWRITE_INTEGRITY, 18-6 SHARED_POOL_SIZE, 24-50 STAR_TRANSFORMATION_ENABLED, 19-6 TAPE_ASYNCH_IO, 24-60 TIMED_STATISTICS, 24-66 TRANSACTIONS, 24-57 INSERT statement functionality, 24-83 parallelizing INSERT ... SELECT, 24-40 instance groups for parallel operations, 24-35 instances instance groups, 24-35 integrity constraints, 7-2 integrity rules parallel DML restrictions, 24-26 invalidating materialized views, 9-14 I/O asynchronous, 24-60 parallel execution, 5-2, 24-2
J
Java used by Change Data Capture, 16-71 JOB_QUEUE_PROCESSES initialization parameter, 15-20 join compatibility, 18-28 joins full partition-wise, 5-20 partial partition-wise, 5-26 partition-wise, 5-20 star joins, 19-4 star queries, 19-4
K
key lookups, 14-21 keys, 8-7, 19-4
L
LAG/LEAD functions, 21-25 LARGE_POOL_SIZE initialization parameter, 24-50 LAST_VALUE function, 21-21 level relationships, 2-6 purpose, 2-6 levels, 2-6 linear regression functions, 21-33 list partitioning, 5-7 LOB datatypes restrictions parallel DDL, 24-3, 24-16 parallel DML, 24-25, 24-26 local indexes, 5-34, 5-39, 6-3, 6-6, 24-79 equipartitioning, 5-34 locks parallel DML, 24-25 LOG_BUFFER initialization parameter and parallel execution, 24-57 LOGGING clause, 24-80 logging mode parallel DDL, 24-3, 24-16, 24-17 logical design, 3-2 logs materialized views, 8-31
M
manual refresh, 15-17 manual refresh with DBMS_MVIEW package, 15-16 massively parallel processing (MPP) affinity, 24-71, 24-72 massively parallel systems, 5-2, 24-2 materialized view logs, 8-31 materialized views aggregates, 8-12 altering, 9-17 build methods, 8-23 checking status, 15-22 containing only joins, 8-15 creating, 8-20 data segment compression, 8-22 delta joins, 18-31 dropping, 8-35, 8-37 invalidating, 9-14 logs, 12-7 naming, 8-22 nested, 8-17 OLAP, 9-9 OLAP cubes, 9-9 Partition Change Tracking (PCT), 9-2 partitioned tables, 15-29 partitioning, 9-2 prebuilt, 8-20 query rewrite hints, 18-5, 18-8 matching join graphs, 8-24 parameters, 18-5 privileges, 18-9 refresh dependent, 15-19 refreshing, 8-26, 15-14 refreshing all, 15-18 registration, 8-34 restrictions, 8-24
rewrites enabling, 18-5 schema design, 8-8 schema design guidelines, 8-8 security, 9-14 set operators, 9-11 storage characteristics, 8-22 tuning, 17-47 types of, 8-12 uses for, 8-2 with VPD, 9-15 MAXVALUE partitioned tables and indexes, 5-31 measures, 8-7, 19-4 memory configure at 2 levels, 24-55 MERGE statement, 15-8 Change Data Capture restriction, 16-72 MINIMUM EXTENT parameter, 24-18 MODEL clause, 22-2 cell referencing, 22-15 data flow, 22-4 keywords, 22-14 parallel execution, 22-42 rules, 22-17 monitoring parallel processing, 24-65 refresh, 15-21 mortgage calculation, 22-51 MOVE PARTITION statement rules of parallelism, 24-42 multiple archiver processes, 24-80 multiple hierarchies, 10-9 MV_CAPABILITIES_TABLE table, 8-38
N
National Language Support (NLS) DATE datatype and partitions, 5-32 nested materialized views, 8-17 refreshing, 15-27 restrictions, 8-20 net present value calculating, 22-48 NEVER clause, 8-26
new features, 1-xxxix NLS_LANG environment variable, 5-31 NLS_LANGUAGE parameter, 5-31 NLS_SORT parameter no effect on partitioning keys, 5-31 NOAPPEND hint, 24-83 NOARCHIVELOG mode, 24-81 nodes disk affinity in Real Application Clusters, 24-71 NOLOGGING clause, 24-75, 24-80, 24-82 with APPEND hint, 24-83 NOLOGGING mode parallel DDL, 24-3, 24-16, 24-17 nonprefixed indexes, 5-36, 5-40 global partitioned indexes, 5-38 nonvolatile data, 1-3 NOPARALLEL attribute, 24-74 NOREWRITE hint, 18-5, 18-8 NTILE function, 21-13 nulls indexes and, 6-5 partitioned tables and indexes, 5-32
O
object types parallel query, 24-15 restrictions, 24-15 restrictions parallel DDL, 24-3, 24-16 parallel DML, 24-25, 24-26 OLAP, 23-2 materialized views, 9-9 OLAP cubes materialized views, 9-9 OLTP database batch jobs, 24-22 parallel DML, 24-21 partitioning indexes, 5-41 ON COMMIT clause, 8-25 ON DEMAND clause, 8-25 OPERATION$ control column, 16-61 optimization partition pruning
indexes, 5-40 partitioned indexes, 5-40 optimizations parallel SQL, 24-8 query rewrite enabling, 18-5 hints, 18-5, 18-8 matching join graphs, 8-24 query rewrites privileges, 18-9 optimizer with rewrite, 18-2 OPTIMIZER_MODE initialization parameter, 15-21, 18-6 ORA-31424 error, 16-67 ORA-31496 error, 16-67 Oracle Data Pump using with Change Data Capture, 16-67 Oracle Real Application Clusters disk affinity, 24-71 instance groups, 24-35 ORDER BY clause, 8-31 outer joins with query rewrite, 18-73
P
packages DBMS_ADVISOR, 17-2 PARALLEL clause, 24-83, 24-84 parallelization rules, 24-37 PARALLEL CREATE INDEX statement, 24-57 PARALLEL CREATE TABLE AS SELECT statement resources required, 24-57 parallel DDL, 24-15 extent allocation, 24-18 parallelization rules, 24-37 partitioned tables and indexes, 24-16 restrictions LOBs, 24-3, 24-16 object types, 24-3, 24-15, 24-16 parallel delete, 24-38 parallel DELETE statement, 24-38
parallel DML, 24-19 applications, 24-21 bitmap indexes, 6-3 degree of parallelism, 24-37, 24-39 enabling PARALLEL DML, 24-22 lock and enqueue resources, 24-25 parallelization rules, 24-37 recovery, 24-24 restrictions, 24-25 object types, 24-15, 24-25, 24-26 remote transactions, 24-27 transaction model, 24-23 parallel execution full table scans, 24-2 index creation, 24-81 interoperator parallelism, 24-12 intraoperator parallelism, 24-12 introduction, 5-2 I/O parameters, 24-60 plans, 24-63 query optimization, 24-87 resource parameters, 24-55 rewriting SQL, 24-74 solving problems, 24-74 tuning, 5-2, 24-2 PARALLEL hint, 24-34, 24-74, 24-83 parallelization rules, 24-37 UPDATE and DELETE, 24-38 parallel partition-wise joins performance considerations, 5-29 parallel query, 24-13 bitmap indexes, 6-3 index-organized tables, 24-14 object types, 24-15 restrictions, 24-15 parallelization rules, 24-37 parallel SQL allocating rows to parallel execution servers, 24-9 degree of parallelism, 24-33 instance groups, 24-35 number of parallel execution servers, 24-6 optimizer, 24-8 parallelization rules, 24-37 shared server, 24-6
parallel update, 24-38 parallel UPDATE statement, 24-38 PARALLEL_ADAPTIVE_MULTI_USER initialization parameter, 24-46 PARALLEL_EXECUTION_MESSAGE_SIZE initialization parameter, 24-56 PARALLEL_INDEX hint, 24-34 PARALLEL_MAX_SERVERS initialization parameter, 15-21, 24-7, 24-48 and parallel execution, 24-48 PARALLEL_MIN_PERCENT initialization parameter, 24-35, 24-48, 24-55 PARALLEL_MIN_SERVERS initialization parameter, 24-6, 24-7, 24-49 PARALLEL_THREADS_PER_CPU initialization parameter, 24-47 parallelism, 5-2 degree, 24-5, 24-32 degree, overriding, 24-74 enabling for tables and queries, 24-45 interoperator, 24-12 intraoperator, 24-12 parameters FREELISTS, 24-80 partition default, 5-8 granules, 5-4 Partition Change Tracking (PCT), 9-2, 15-29, 18-52 with Pmarkers, 18-55 partitioned outer join, 21-45 partitioned tables data warehouses, 5-10 materialized views, 15-29 partitioning, 12-6 composite, 5-8 data, 5-4 data segment compression, 5-16 bitmap indexes, 5-17 hash, 5-7 indexes, 5-9 list, 5-7 materialized views, 9-2 prebuilt tables, 9-7 range, 5-5 range-list, 5-14
partitions affinity, 24-71 bitmap indexes, 6-6 DATE datatype, 5-32 equipartitioning examples, 5-35 local indexes, 5-34 global indexes, 5-37 local indexes, 5-34 multicolumn keys, 5-33 nonprefixed indexes, 5-36, 5-40 parallel DDL, 24-16 partition bounds, 5-31 partition pruning DATE datatype, 5-32 disk striping and, 24-72 indexes, 5-40 partitioning indexes, 5-33, 5-41 partitioning keys, 5-30 physical attributes, 5-42 prefixed indexes, 5-35 pruning, 5-19 range partitioning disk striping and, 24-72 restrictions datatypes, 5-32 rules of parallelism, 24-42 partition-wise joins, 5-20 benefits of, 5-28 full, 5-20 partial, 5-26 PERCENT_RANK function, 21-13 performance DSS database, 24-21 prefixed and nonprefixed indexes, 5-40 PGA_AGGREGATE_TARGET initialization parameter, 15-21 physical design, 3-2 structures, 3-3 pivoting, 14-23 plans star transformations, 19-9 PL/SQL procedures DBMS_CDC_PUBLISH_DROP_CHANGE_ TABLE, 16-67
DBMS_CDC_PUBLISH.PURGE, 16-65, 16-66 DBMS_CDC_PUBLISH.PURGE_CHANGE_SET, 16-66 DBMS_CDC_PUBLISH.PURGE_CHANGE_TABLE, 16-66 DBMS_CDC_SUBSCRIBE.PURGE_WINDOW, 16-65 DBMS_JOB, 16-65 Pmarkers with PCT, 18-55 prebuilt materialized views, 8-20 predicates partition pruning indexes, 5-40 prefixed indexes, 5-35, 5-39 PRIMARY KEY constraints, 24-82 privileges SQLAccess Advisor, 17-9 privileges required to publish change data, 16-19 procedures EXPLAIN_MVIEW, 17-47 TUNE_MVIEW, 17-47 process monitor process (PMON) parallel DML process recovery, 24-24 processes and memory contention in parallel processing, 24-48 pruning partitions, 5-19, 24-72 using DATE columns, 5-20 pruning partitions DATE datatype, 5-32 EXPLAIN PLAN, 5-33 indexes, 5-40 publication defined, 16-6 publishers components associated with, 16-7 defined, 16-5 determining the source tables, 16-6 privileges for reading views, 16-16 purpose, 16-6 table partitioning properties and, 16-59 tasks, 16-6
publishing asynchronous AutoLog mode step-by-step example, 16-35 asynchronous HotLog mode step-by-step example, 16-30 synchronous mode step-by-step example, 16-27 publishing change data preparations for, 16-18 privileges required, 16-19 purging change tables automatically, 16-65 by name, 16-66 in a named change set, 16-66 on the staging database, 16-66 publishers, 16-66 subscribers, 16-65 purging data, 15-12
Q
queries ad hoc, 24-16 enabling parallelism for, 24-45 star queries, 19-3 query delta joins, 18-31 query optimization, 24-87 parallel execution, 24-87 query rewrite advanced, 18-75 checks made by, 18-28 controlling, 18-6 correctness, 18-7 date folding, 18-49 enabling, 18-5 hints, 18-5, 18-8 matching join graphs, 8-24 methods, 18-11 parameters, 18-5 privileges, 18-9 restrictions, 8-25 using equivalences, 18-75 using GROUP BY extensions, 18-39 using nested materialized views, 18-38 using PCT, 18-52 VPD, 9-16 when it occurs, 18-4 with bind variables, 18-61 with DBMS_MVIEW package, 18-66 with expression matching, 18-49 with inline views, 18-44 with partially stale materialized views, 18-35 with self joins, 18-45 with set operator materialized views, 18-62 with view constraints, 18-46 QUERY_REWRITE_ENABLED initialization parameter, 18-5, 18-6 QUERY_REWRITE_INTEGRITY initialization parameter, 18-6
R
range partitioning, 5-5 key comparison, 5-31, 5-33 partition bounds, 5-31 performance considerations, 5-9 range-list partitioning, 5-14 RANK function, 21-5 ranking functions, 21-5 RATIO_TO_REPORT function, 21-24 REBUILD INDEX PARTITION statement rules of parallelism, 24-42 REBUILD INDEX statement rules of parallelism, 24-42 recovery from asynchronous change set capture errors, 16-55 parallel DML, 24-24 redo buffer allocation retries, 24-57 redo log files archived asynchronous Change Data Capture and, 16-48 determining which are no longer needed by Change Data Capture, 16-48 reference tables See dimension tables, 8-7 refresh monitoring, 15-21 options, 8-25
scheduling, 15-22 with UNION ALL, 15-28 refreshing materialized views, 15-14 nested materialized views, 15-27 partitioning, 15-2 REGR_AVGX function, 21-34 REGR_AVGY function, 21-34 REGR_COUNT function, 21-34 REGR_INTERCEPT function, 21-34 REGR_R2 function, 21-35 REGR_SLOPE function, 21-34 REGR_SXX function, 21-35 REGR_SXY function, 21-35 REGR_SYY function, 21-35 regression detecting, 24-62 RELY constraints, 7-6 remote transactions parallel DML and DDL restrictions, 24-4 removing Change Data Capture from source database, 16-71 replication restrictions parallel DML, 24-25 reporting functions, 21-22 RESOURCE role, 16-19 resources consumption, parameters affecting, 24-55, 24-57 limiting for users, 24-49 limits, 24-48 parallel query usage, 24-55 restrictions direct-path INSERT, 24-25 fast refresh, 8-27 nested materialized views, 8-20 parallel DDL, 24-3, 24-16 parallel DML, 24-25 remote transactions, 24-27 partitions datatypes, 5-32 query rewrite, 8-25 result set, 19-6 REWRITE hint, 18-5, 18-8
rewrites hints, 18-8 parameters, 18-5 privileges, 18-9 query optimizations hints, 18-5, 18-8 matching join graphs, 8-24 rmcdc.sql script, 16-71 rolling up hierarchies, 10-2 ROLLUP, 20-6 concatenated, 20-29 partial, 20-8 when to use, 20-6 root level, 2-6 ROW_ID$ control column, 16-62 ROW_NUMBER function, 21-15 RSID$ control column, 16-61 rules in MODEL clause, 22-17 in SQL modeling, 22-17 order of evaluation, 22-21
S
sar UNIX command, 24-71 scalability batch jobs, 24-22 parallel DML, 24-21 scalable operations, 24-77 scans full table parallel query, 24-2 schema-level export operations, 16-68 schema-level import operations, 16-68 schemas, 19-2 design guidelines for materialized views, 8-8 snowflake, 2-3 star, 2-3 third normal form, 19-2 scripts initcdc.sql for Change Data Capture, 16-72 rmcdc.sql for Change Data Capture, 16-71 SELECT_CATALOG_ROLE privilege, 16-17, 16-19
sessions enabling parallel DML, 24-22 set operators materialized views, 9-11 shared server parallel SQL execution, 24-6 SHARED_POOL_SIZE initialization parameter, 24-50 SHARED_POOL_SIZE parameter, 24-50 simultaneous equations, 22-49 single table aggregate requirements, 8-15 skewing parallel DML workload, 24-36 SMP architecture disk affinity, 24-72 snowflake schemas, 19-5 complex queries, 19-5 SORT_AREA_SIZE initialization parameter and parallel execution, 24-56 source database defined, 16-6 source systems, 12-2 source tables importing for Change Data Capture, 16-69 referenced by change tables, 16-59 SOURCE_COLMAP$ control column, 16-61 interpreting, 16-62 space management MINIMUM EXTENT parameter, 24-18 parallel DDL, 24-17 sparse data data densification, 21-46 SPLIT PARTITION clause rules of parallelism, 24-42 SQL GRANT statement, 16-64 SQL modeling, 22-2 cell referencing, 22-15 keywords, 22-14 order of evaluation, 22-21 performance, 22-42 rules, 22-17 rules and restrictions, 22-40 SQL REVOKE statement, 16-64 SQL statements parallelizing, 24-3, 24-8
SQL Workload Journal, 17-20 SQL*Loader, 24-4 SQLAccess Advisor, 17-2, 17-10 constants, 17-38 creating a task, 17-4 defining the workload, 17-4 EXECUTE_TASK procedure, 17-26 generating the recommendations, 17-6 implementing the recommendations, 17-6 maintaining workloads, 17-22 privileges, 17-9 quick tune, 17-36 recommendation process, 17-32 steps in using, 17-4 workload objects, 17-12 SQLAccess Advisor workloads maintaining, 17-22 staging areas, 1-6 databases, 8-2 files, 8-2 staging database defined, 16-6 STALE_TOLERATED mode, 18-7 star joins, 19-4 star queries, 19-3 star transformation, 19-6 star schemas advantages, 2-4 defining fact tables, 2-5 dimensional model, 2-4, 19-3 star transformations, 19-6 restrictions, 19-11 STAR_TRANSFORMATION_ENABLED initialization parameter, 19-6 statistics, 18-74 estimating, 24-63 operating system, 24-71 storage fragmentation in parallel DDL, 24-18 index partitions, 5-42 STORAGE clause parallel execution, 24-18 Streams apply parallelism value determining, 16-26
Streams apply process asynchronous Change Data Capture and, 16-25 Streams capture parallelism value determining, 16-26 Streams capture process asynchronous Change Data Capture and, 16-25 striping, 4-3 subpartition mapping, 5-14 template, 5-13 subqueries in DDL statements, 24-16 subscriber view defined, 16-8 returning DML changes in order, 16-60 subscribers access to change tables, 16-64 ALL_PUBLISHED_COLUMNS view, 16-70 components associated with, 16-8 controlling access to tables, 16-64 defined, 16-5 DML access, 16-65 privileges, 16-8 purpose, 16-7 retrieve change data from the subscriber views, 16-8 tasks, 16-7 subscribing step-by-step example, 16-42 subscription windows defined, 16-8 subscriptions changes to change tables and, 16-70 defined, 16-7 effect of SQL DROP USER CASCADE statement on, 16-67 summary management components, 8-5 summary tables, 2-5 supplemental logging asynchronous Change Data Capture, 16-50 asynchronous Change Data Capture and, 16-11 symmetric multiprocessors, 5-2, 24-2 SYNC_SET predefined change set, 16-15 SYNC_SOURCE change source, 16-10
synchronous Change Data Capture change sets and, 16-15 synchronous change sets defined, 16-15 disabling, 16-53 enabling, 16-53 synchronous publishing latency for, 16-20 location of staging database, 16-20 requirements for, 16-27 setting database initialization parameters for, 16-21 source database performance impact, 16-20 SYS_NC_OID$ control column, 16-62 system monitor process (SMON) parallel DML system recovery, 24-24
T
table differencing, 16-2 table partitioning publisher and, 16-59 table queues, 24-67 tables detail tables, 8-7 dimension tables (lookup tables), 8-7 dimensions star queries, 19-3 enabling parallelism for, 24-45 external, 14-5 fact tables, 8-7 star queries, 19-3 historical, 24-22 lookup, 19-3 parallel creation, 24-16 parallel DDL storage, 24-18 refreshing in data warehouse, 24-21 STORAGE clause with parallel execution, 24-18 summary or rollup, 24-16 tablespace specifying default for Change Data Capture publisher, 16-19 tablespaces change tables and, 16-59
transportable, 12-5, 13-3, 13-6 TAPE_ASYNCH_IO initialization parameter, 24-60 TARGET_COLMAP$ control column, 16-61 interpreting, 16-62 templates SQLAccess Advisor, 17-10 temporary segments parallel DDL, 24-18 text match, 18-11 with query rewrite, 18-73 third normal form queries, 19-3 schemas, 19-2 time series calculations, 21-53 TIMED_STATISTICS initialization parameter, 24-66 TIMESTAMP$ control column, 16-61 timestamps, 12-6 TO_DATE function partitions, 5-32 transactions distributed parallel DDL restrictions, 24-4 parallel DML restrictions, 24-4, 24-27 TRANSACTIONS initialization parameter, 24-57 transformations, 14-2 scenarios, 14-21 SQL and PL/SQL, 14-8 SQL*Loader, 14-5 transportable tablespaces, 12-5, 13-3, 13-6 transportation definition, 13-2 distributed operations, 13-2 flat files, 13-2 triggers, 12-6 installed by Change Data Capture, 16-71 restrictions, 24-27 parallel DML, 24-25 TRUSTED mode, 18-7 TUNE_MVIEW procedure, 17-47 two-phase commit, 24-57
U
ultralarge files, 3-4 unique constraints, 7-4, 24-82 identifier, 2-3, 3-2 UNLIMITED TABLESPACE privilege, 16-19 update frequencies, 8-11 UPDATE statement parallel UPDATE statement, 24-38 update windows, 8-11 user resources limiting, 24-49 USERNAME$ control column, 16-62
V
V$FILESTAT view and parallel query, 24-66 V$PARAMETER view, 24-66 V$PQ_SESSTAT view, 24-64, 24-66 V$PQ_SYSSTAT view, 24-64 V$PQ_TQSTAT view, 24-64, 24-67 V$PX_PROCESS view, 24-65, 24-66 V$PX_SESSION view, 24-65 V$PX_SESSTAT view, 24-65 V$SESSTAT view, 24-68, 24-71 V$SYSSTAT view, 24-57, 24-68, 24-80 validating dimensions, 10-12 VALUES LESS THAN clause, 5-31 MAXVALUE, 5-32 view constraints, 7-7, 18-46 views DBA_DATA_FILES, 24-66 DBA_EXTENTS, 24-66 V$FILESTAT, 24-66 V$PARAMETER, 24-66 V$PQ_SESSTAT, 24-66 V$PQ_TQSTAT, 24-67 V$PX_PROCESS, 24-66 V$SESSTAT, 24-68, 24-71 V$SYSSTAT, 24-68 vmstat UNIX command, 24-71 VPD, 9-16
W
WIDTH_BUCKET function, 21-39, 21-41 windowing functions, 21-15 workload objects, 17-12 workloads deleting, 17-14 distribution, 24-64 skewing, 24-36
X
XIDSEQ$ control column, 16-62 XIDSLT$ control column, 16-62 XIDUSN$ control column, 16-62