Chapter_4 - Algorithms for Query Processing and Optimization
1. Introduction to Query Processing
Typical steps when processing a high-level query
COMPANY Database Schema
2. Translating SQL Queries into Relational Algebra (1)
◼ Query block: the basic unit that can be translated into the algebraic
operators and optimized.
◼ A query block contains a single SELECT-FROM-WHERE expression,
as well as GROUP BY and HAVING clauses if these are part of the
block.
◼ Nested queries within a query are identified as separate query blocks.
◼ Aggregate operators in SQL must be included in the extended algebra.
Translating SQL Queries into Relational Algebra (2)
3. Algorithms for External Sorting (1)
◼ External sorting : refers to sorting algorithms that are suitable for large
files of records stored on disk that do not fit entirely in main memory, such
as most database files.
◼ Sort-Merge strategy : starts by sorting small subfiles (runs ) of the main
file and then merges the sorted runs, creating larger sorted subfiles that
are merged in turn.
– Sorting phase: nR = ⌈b / nB⌉
– Merging phase: dM = Min(nB − 1, nR);
nP = ⌈log_dM(nR)⌉
nR: number of initial runs; b: number of file blocks;
nB: available buffer space (in blocks); dM: degree of merging;
nP: number of passes.
Algorithms for External Sorting (2)
Algorithms for External Sorting (3)
/* Merge phase: merge subfiles until only 1 remains */
set i ← 1;
p ← ⌈log_(k-1) m⌉; /* p is the number of passes for the merging phase */
j ← m; /* the number of runs */
while (i <= p) do
{
    n ← 1;
    q ← ⌈j/(k-1)⌉; /* the number of runs to write in this pass */
    while (n <= q) do
    {
        read next k-1 subfiles or remaining subfiles (from previous pass) one block at a time;
        merge and write as new subfile one block at a time;
        n ← n+1;
    }
    j ← q;
    i ← i+1;
}
The number of block accesses for the merge phase = 2 * (b * ⌈log_dM(nR)⌉)
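The sort and merge phases can be sketched in Python as follows. This is a minimal in-memory simulation, not a disk-based implementation: the "file" is assumed to be a list of blocks (each a list of records), and the names `external_sort` and `n_buffers` are illustrative choices that do not appear in the slides.

```python
import heapq


def external_sort(blocks, n_buffers):
    """Sort a 'file' given as a list of blocks (each block is a list of records).

    Returns (sorted_records, number_of_merge_passes).
    """
    if n_buffers < 3:
        raise ValueError("need at least 3 buffer blocks to merge 2 runs at a time")

    # Sort phase: read n_buffers blocks at a time, sort them, write them as one run.
    runs = []
    for start in range(0, len(blocks), n_buffers):
        chunk = [rec for block in blocks[start:start + n_buffers] for rec in block]
        runs.append(sorted(chunk))

    # Merge phase: at most (n_buffers - 1) runs are merged per step,
    # because one buffer block is reserved for the output.
    passes = 0
    while len(runs) > 1:
        degree = min(n_buffers - 1, len(runs))
        runs = [list(heapq.merge(*runs[i:i + degree]))
                for i in range(0, len(runs), degree)]
        passes += 1
    return (runs[0] if runs else []), passes


# 24 records, 2 records per block (12 blocks), 3 buffer blocks:
# the sort phase produces 4 runs, which are merged 2 at a time in 2 passes.
records = list(range(24, 0, -1))
blocks = [records[i:i + 2] for i in range(0, len(records), 2)]
sorted_records, passes = external_sort(blocks, n_buffers=3)
print(sorted_records, passes)
```

With b = 12 and nB = 3 the formulas give nR = ⌈12/3⌉ = 4, dM = Min(2, 4) = 2 and nP = ⌈log_2 4⌉ = 2, which is the pass count the sketch reports.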
Example of External Sorting
1 block = 2 records; nB = 3 buffer blocks.
File (12 blocks):
15 22 | 2 27 | 14 6 | 51 18 | 35 16 | 50 36 | 9 8 | 32 12 | 11 33 | 30 20 | 23 21 | 24 29
Sort phase:
Read 3 blocks of the file → sort → write 1 run of 3 blocks.
Repeating this for the whole file gives 4 runs:
Run 1: 2 6 14 15 22 27
Run 2: 16 18 35 36 50 51
Run 3: 8 9 11 12 32 33
Run 4: 20 21 23 24 29 30
Merge phase:
Each step:
- Read 1 block from each of (nB - 1) runs into the buffer
- Merge → temp block
- If the temp block is full: write it to the file
- If any input block is empty: read the next block from the
corresponding run
Merge phase: Pass 1
Runs 1 and 2 are merged block by block (temp blocks written: 2 6, 14 15, 16 18, 22 27, 35 36, 50 51) into 1 new run:
2 6 14 15 16 18 22 27 35 36 50 51
Runs 3 and 4 are merged in the same way:
8 9 11 12 20 21 23 24 29 30 32 33
Merge phase: Pass 2
The two runs produced by pass 1 are merged (first temp block written: 2 6).
Result:
2 6 8 9 11 12 14 15 16 18 20 21 22 23 24 27 29 30 32 33 35 36 50 51
4. Algorithms for SELECT and JOIN Operations (1)
Implementing the SELECT Operation:
Examples:
◼ (OP1): σSSN='123456789'(EMPLOYEE)
◼ (OP2): σDNUMBER>5(DEPARTMENT)
◼ (OP3): σDNO=5(EMPLOYEE)
◼ (OP4): σDNO=5 AND SALARY>30000 AND SEX='F'(EMPLOYEE)
◼ (OP5): σESSN='123456789' AND PNO=10(WORKS_ON)
◼ (OP6): σDNO IN (3, 27, 49) (EMPLOYEE)
Algorithms for SELECT and JOIN Operations (3)
Algorithms for SELECT and JOIN Operations (4)
Algorithms for SELECT and JOIN (5)
Algorithms for SELECT and JOIN (6)
Algorithms for SELECT and JOIN Operations (7)
Implementing the SELECT Operation (cont.):
◼ S11. Disjunctive (OR) selection conditions:
❑ Records satisfying the disjunctive condition are the union of the records
satisfying the individual conditions.
❑ If any one of the conditions does not have an access path, we are
compelled to use the brute-force linear search approach (S1).
❑ Only if an access path exists on every simple condition in the disjunction
can we optimize the selection by retrieving the records satisfying each
condition (or their record ids) and then applying the union operation to
eliminate duplicates.
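The rule for disjunctive conditions can be illustrated with a small Python sketch. It assumes equality conditions only, records held in a list of dicts, and an "index" modelled as a dict from attribute value to a set of record positions; `build_index` and `select_disjunction` are hypothetical helper names, not part of any DBMS API.

```python
def build_index(records, attr):
    """Build a toy index: attribute value -> set of record positions."""
    index = {}
    for rid, rec in enumerate(records):
        index.setdefault(rec[attr], set()).add(rid)
    return index


def select_disjunction(records, conditions, indexes):
    """Evaluate (a1 = v1) OR (a2 = v2) OR ... over records.

    conditions: list of (attribute, value) equality tests.
    indexes: dict mapping an attribute name to a toy index from build_index.
    """
    if all(attr in indexes for attr, _ in conditions):
        # Every simple condition has an access path: collect record ids per
        # condition and take their union, which also removes duplicates.
        rids = set()
        for attr, value in conditions:
            rids |= indexes[attr].get(value, set())
        return [records[rid] for rid in sorted(rids)]
    # At least one condition has no access path: fall back to linear search (S1).
    return [rec for rec in records
            if any(rec[attr] == value for attr, value in conditions)]


employees = [
    {"SSN": "1", "DNO": 5, "SEX": "F"},
    {"SSN": "2", "DNO": 4, "SEX": "M"},
    {"SSN": "3", "DNO": 5, "SEX": "M"},
    {"SSN": "4", "DNO": 1, "SEX": "F"},
]
conditions = [("DNO", 5), ("SEX", "F")]
both = {"DNO": build_index(employees, "DNO"), "SEX": build_index(employees, "SEX")}
print([e["SSN"] for e in select_disjunction(employees, conditions, both)])
```

If the index on SEX is dropped from `indexes`, the same call still returns the same records, but only by scanning every record.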
Which search method should be used? (1)
◼ (OP1): σSSN='123456789'(EMPLOYEE)
◼ (OP2): σDNUMBER>5(DEPARTMENT)
◼ (OP3): σDNO=5(EMPLOYEE)
◼ (OP4): σDNO=5 AND SALARY>30000 AND SEX='F'(EMPLOYEE)
◼ (OP5): σESSN='123456789' AND PNO=10(WORKS_ON)
◼ (OP6): σDNO IN (3, 27, 49) (EMPLOYEE)
Which search method should be used? (2)
◼ EMPLOYEE
❑ A clustering index on SALARY
❑ A secondary index on the key attribute SSN
❑ A secondary index on the nonkey attribute DNO
❑ A secondary index on SEX
(OP1): σSSN='123456789'(EMPLOYEE)
◼ DEPARTMENT
❑ A primary index on DNUMBER
❑ A secondary index on MGRSSN
Which search method should be used? (7)
◼ WORKS_ON
❑ A composite primary index on (ESSN, PNO)
◼ Examples
(OP8): EMPLOYEE ⋈DNO=DNUMBER DEPARTMENT
(OP9): DEPARTMENT ⋈MGRSSN=SSN EMPLOYEE
Algorithms for SELECT and JOIN Operations (9)
Algorithms for SELECT and JOIN Operations (10)
Algorithms for SELECT and JOIN Operations (12)
Implementing Sort-Merge Join (J3): T ← R ⋈A=B S
sort the tuples in R on attribute A; /* assume R has n tuples */
sort the tuples in S on attribute B; /* assume S has m tuples */
set i ← 1, j ← 1;
while (i ≤ n) and (j ≤ m)
do {
    if R(i)[A] > S(j)[B]
        then set j ← j + 1
    elseif R(i)[A] < S(j)[B]
        then set i ← i + 1
    else { /* R(i)[A] = S(j)[B]: output a matched tuple */
        output the combined tuple <R(i), S(j)> to T;
        /* output other tuples that match R(i), if any */
        set l ← j + 1;
        while (l ≤ m) and (R(i)[A] = S(l)[B])
        do { output the combined tuple <R(i), S(l)> to T;
             set l ← l + 1
        }
        /* output other tuples that match S(j), if any */
        set k ← i + 1;
        while (k ≤ n) and (R(k)[A] = S(j)[B])
        do { output the combined tuple <R(k), S(j)> to T;
             set k ← k + 1
        }
        set i ← i + 1, j ← j + 1;
    }
}
Example: R(C, A) and S(B, D), sorted on the join attributes A and B (values of C and D not shown)
Position   R.A   S.B
1          5     4
2          6     6
3          9     6
4          10    10
5          17    17
6          20    18
Steps (comparison made at each step):
1. R(i)[A] > S(j)[B]
2. R(i)[A] < S(j)[B]
3. R(i)[A] = S(j)[B] → R(2), S(2)
4. R(i)[A] = S(j)[B] → R(2), S(3)
5. R(i)[A] < S(j)[B]
6. R(i)[A] < S(j)[B]
7. R(i)[A] = S(j)[B] → R(4), S(4)
8. R(i)[A] < S(j)[B]
9. R(i)[A] = S(j)[B] → R(5), S(5)
10. R(i)[A] < S(j)[B]
11. R(i)[A] > S(j)[B]
12. j > m → end.
Result:
Tuples        A    B
R(2), S(2)    6    6
R(2), S(3)    6    6
R(4), S(4)    10   10
R(5), S(5)    17   17
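The sort-merge join pseudocode translates almost line by line into Python. The sketch below is one possible rendering under simple assumptions: relations are lists of dicts, indexes are 0-based rather than 1-based, and the `rid` and `sid` fields are added here only to label the tuples in the output.

```python
def sort_merge_join(r, s, a, b):
    """Equijoin r and s (lists of dicts) on r[a] = s[b].

    Returns a list of (r_tuple, s_tuple) pairs.
    """
    r = sorted(r, key=lambda t: t[a])
    s = sorted(s, key=lambda t: t[b])
    i = j = 0
    result = []
    while i < len(r) and j < len(s):
        if r[i][a] > s[j][b]:
            j += 1
        elif r[i][a] < s[j][b]:
            i += 1
        else:
            # Matched pair.
            result.append((r[i], s[j]))
            # Other tuples of s that match r[i], if any.
            l = j + 1
            while l < len(s) and r[i][a] == s[l][b]:
                result.append((r[i], s[l]))
                l += 1
            # Other tuples of r that match s[j], if any.
            k = i + 1
            while k < len(r) and r[k][a] == s[j][b]:
                result.append((r[k], s[j]))
                k += 1
            i += 1
            j += 1
    return result


# Join attribute values taken from the example tables.
R = [{"rid": n, "A": v} for n, v in enumerate([5, 6, 9, 10, 17, 20], start=1)]
S = [{"sid": n, "B": v} for n, v in enumerate([4, 6, 6, 10, 17, 18], start=1)]
pairs = [(x["rid"], y["sid"]) for x, y in sort_merge_join(R, S, "A", "B")]
print(pairs)  # [(2, 2), (2, 3), (4, 4), (5, 5)]
```

Because both pointers advance by one after a match, the same loop also handles the case where the join value is repeated on both sides.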
Algorithms for SELECT and JOIN Operations (11)
Algorithms for SELECT and JOIN Operations (12)
5. Algorithms for PROJECT and SET Operations (1)
◼ Algorithm for PROJECT operations π<attribute list>(R)
(Figure 19.3b)
◼ If <attribute list> includes a key of relation R, extract all tuples from R with only the values for
the attributes in <attribute list>.
◼ If <attribute list> does NOT include a key of relation R, duplicate tuples must be
removed from the result.
◼ Methods to remove duplicate tuples:
◼ Sorting
◼ Hashing
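As a rough illustration of the hashing method, the Python sketch below projects a list of dicts onto an attribute list and removes duplicates with a hash set. The function name `project` and the `key_attrs` parameter are assumptions made for this example; a sorting-based variant would instead sort the projected tuples and drop adjacent repeats.

```python
def project(relation, attrs, key_attrs=()):
    """Compute π<attrs>(relation) for a relation given as a list of dicts.

    key_attrs names a key of the relation (empty if unknown).
    Returns a list of tuples in first-seen order.
    """
    projected = (tuple(t[a] for a in attrs) for t in relation)
    if key_attrs and set(key_attrs) <= set(attrs):
        # The attribute list includes a key, so no two projected tuples
        # can be equal and no duplicate elimination is needed.
        return list(projected)
    seen = set()      # hash table of tuples already emitted
    result = []
    for t in projected:
        if t not in seen:
            seen.add(t)
            result.append(t)
    return result


employees = [
    {"SSN": "1", "DNO": 5, "SEX": "F"},
    {"SSN": "2", "DNO": 5, "SEX": "M"},
    {"SSN": "3", "DNO": 5, "SEX": "F"},
]
print(project(employees, ["DNO", "SEX"]))                     # duplicates removed
print(project(employees, ["SSN", "DNO"], key_attrs=["SSN"]))  # key included
```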
Implementing T ← π<attribute list>(R)
Algorithms for PROJECT and SET Operations (2)
Algorithms for PROJECT and SET Operations (3)
❑ 2. Scan and merge both sorted files concurrently, keep in the merged results only those
tuples that appear in both relations.
◼ SET DIFFERENCE R − S (see Figure 19.3e): keep in the merged results only those
tuples that appear in relation R but not in relation S.
Union: T ← R ∪ S
sort the tuples in R and S using the same unique sort attributes;
set i ← 1, j ← 1;
while (i ≤ n) and (j ≤ m) do
{
    if R(i) > S(j)
    then
    {   output S(j) to T;
        set j ← j+1
    }
    else if R(i) < S(j)
    then
    {   output R(i) to T;
        set i ← i+1
    }
    else set j ← j+1 /* R(i) = S(j), so we skip one of the duplicate tuples */
}
if (i ≤ n) then add tuples R(i) to R(n) to T;
if (j ≤ m) then add tuples S(j) to S(m) to T;
Intersection: T ← R ∩ S
sort the tuples in R and S using the same unique sort attributes;
set i ← 1, j ← 1;
while (i ≤ n) and (j ≤ m) do
{
    if R(i) > S(j)
    then
        set j ← j+1
    else if R(i) < S(j)
    then
        set i ← i+1
    else
    {   output R(i) to T; /* R(i) = S(j), so we output the tuple */
        set i ← i+1, j ← j+1
    }
}
Difference: T ← R − S
sort the tuples in R and S using the same unique sort attributes;
set i ← 1, j ← 1;
while (i ≤ n) and (j ≤ m) do
{
    if R(i) > S(j)
    then
        set j ← j+1
    else if R(i) < S(j)
    then
    {   output R(i) to T; /* R(i) has no matching S(j), so output R(i) */
        set i ← i+1
    }
    else
        set i ← i+1, j ← j+1
}
if (i ≤ n) then add tuples R(i) to R(n) to T;
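The three merge-based set operations can be sketched in Python as below. The sketch assumes each input is a duplicate-free list of comparable tuples (or plain values), matching the "unique sort attributes" assumption of the pseudocode; the function names are illustrative.

```python
def merge_union(r, s):
    """T = R ∪ S by merging two sorted, duplicate-free lists."""
    r, s = sorted(r), sorted(s)
    i = j = 0
    t = []
    while i < len(r) and j < len(s):
        if r[i] > s[j]:
            t.append(s[j])
            j += 1
        elif r[i] < s[j]:
            t.append(r[i])
            i += 1
        else:
            j += 1  # equal: skip S's copy; R's copy is output later
    t.extend(r[i:])  # remaining tuples of R
    t.extend(s[j:])  # remaining tuples of S
    return t


def merge_intersection(r, s):
    """T = R ∩ S: keep only tuples that appear in both inputs."""
    r, s = sorted(r), sorted(s)
    i = j = 0
    t = []
    while i < len(r) and j < len(s):
        if r[i] > s[j]:
            j += 1
        elif r[i] < s[j]:
            i += 1
        else:
            t.append(r[i])
            i += 1
            j += 1
    return t


def merge_difference(r, s):
    """T = R − S: keep tuples of R that have no equal tuple in S."""
    r, s = sorted(r), sorted(s)
    i = j = 0
    t = []
    while i < len(r) and j < len(s):
        if r[i] > s[j]:
            j += 1
        elif r[i] < s[j]:
            t.append(r[i])
            i += 1
        else:
            i += 1
            j += 1
    t.extend(r[i:])  # whatever is left in R cannot match anything in S
    return t


R = [3, 1, 5, 7]
S = [5, 2, 3, 8]
print(merge_union(R, S), merge_intersection(R, S), merge_difference(R, S))
```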
6. Implementing Aggregate Operations and Outer Joins (1)
Implementing Aggregate Operations:
◼ Aggregate operators : MIN, MAX, SUM, COUNT and AVG
❑ Table Scan
❑ Index
◼ Example
◼ If an (ascending) index on SALARY exists for the EMPLOYEE relation, then the optimizer could
decide to compute MAX(SALARY) by traversing the index for the largest value, which would entail
following the rightmost pointer in each index node from the root to a leaf.
Implementing Aggregate Operations and Outer Joins (2)
◼ Implementing Aggregate Operations (cont.):
◼ SUM, COUNT and AVG
❑ For a dense index (each record has one index entry):
apply the associated computation to the values in the index.
❑ For a non-dense index: actual number of records associated with each index entry must
be accounted for
◼ With GROUP BY: the aggregate operator must be applied separately to each group of
tuples.
❑ Use sorting or hashing on the group attributes to partition the file into the appropriate
groups;
❑ Compute the aggregate function for the tuples in each group.
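A hash-based GROUP BY can be sketched in Python as follows. The example assumes the relation fits in memory as a list of dicts and that `group_aggregate` (a name chosen for this sketch) computes all five aggregates for one numeric attribute; a sorting-based version would sort on the grouping attribute and aggregate each consecutive group instead.

```python
from collections import defaultdict


def group_aggregate(relation, group_attr, agg_attr):
    """Partition tuples by group_attr with a hash table, then aggregate agg_attr."""
    groups = defaultdict(list)
    for t in relation:                      # partitioning step
        groups[t[group_attr]].append(t[agg_attr])
    return {
        key: {
            "COUNT": len(values),
            "SUM": sum(values),
            "AVG": sum(values) / len(values),
            "MIN": min(values),
            "MAX": max(values),
        }
        for key, values in groups.items()   # aggregation step, one group at a time
    }


employees = [
    {"DNO": 5, "SALARY": 30000},
    {"DNO": 5, "SALARY": 40000},
    {"DNO": 4, "SALARY": 25000},
]
print(group_aggregate(employees, "DNO", "SALARY"))
```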
Implementing Aggregate Operations and Outer Joins (3)
◼ Implementing Outer Join:
◼ Outer Join Operators : LEFT OUTER JOIN, RIGHT OUTER JOIN and FULL OUTER
JOIN.
◼ The full outer join produces a result which is equivalent to the union of the results of the
left and right outer joins.
◼ Example:
SELECT FNAME, DNAME
FROM ( EMPLOYEE LEFT OUTER JOIN DEPARTMENT ON DNO = DNUMBER);
◼ Note: The result of this query is a table of employee names and their associated
departments. It is similar to a regular join result, with the exception that if an employee
does not have an associated department, the employee's name will still appear in the
resulting table, although the department name would be indicated as null.
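One way to obtain this behavior is to modify a nested-loop join so that left tuples without a match are padded with nulls. The Python sketch below assumes in-memory lists of dicts and uses `None` for null; `left_outer_join` is an illustrative name and the sample rows are made up for the example.

```python
def left_outer_join(left, right, left_attr, right_attr):
    """Nested-loop LEFT OUTER JOIN of two lists of dicts on left_attr = right_attr."""
    right_attrs = list(right[0].keys()) if right else []
    result = []
    for l in left:
        matched = False
        for r in right:
            if l[left_attr] == r[right_attr]:
                result.append({**l, **r})
                matched = True
        if not matched:
            # Keep the left tuple and pad the right-hand attributes with nulls.
            result.append({**l, **{a: None for a in right_attrs}})
    return result


employees = [
    {"FNAME": "John", "DNO": 5},
    {"FNAME": "Alicia", "DNO": None},   # no associated department
]
departments = [{"DNUMBER": 5, "DNAME": "Research"}]
rows = left_outer_join(employees, departments, "DNO", "DNUMBER")
print([(row["FNAME"], row["DNAME"]) for row in rows])
```

A right outer join follows by swapping the roles of the two inputs, and a full outer join additionally emits the unmatched right tuples padded on the left.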
Implementing Aggregate Operations and Outer Joins (4)
◼ Implementing Outer Join (cont.):
◼ Modifying Join Algorithms:
Implementing Aggregate Operations and Outer Joins (5)
◼ Implementing Outer Join (cont.):
7. Combining Operations using Pipelining (1)
◼ Motivation
❑ A query is mapped into a sequence of operations.
Combining Operations using Pipelining (2)
◼ Example: 2-way join, 2 selections on the input files and one final
projection on the resulting file.
◼ Dynamic generation of code to allow for multiple operations to be
pipelined.
◼ Results of a select operation are fed in a "pipeline" to the join
algorithm.
◼ Also known as stream-based processing.
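Python generators give a compact way to picture pipelining: each operator pulls tuples from its input one at a time, so the selection results are never written to a temporary file. The sketch below assumes in-memory lists of dicts and a simple hash join whose build side is materialized; the operator names `select_stream`, `hash_join` and `project_stream` are invented for this illustration.

```python
def select_stream(rows, predicate):
    """σ: pass on only the rows that satisfy the predicate."""
    for row in rows:
        if predicate(row):
            yield row


def hash_join(left_rows, right_rows, left_attr, right_attr):
    """⋈: build a hash table on the right input, then stream the left input."""
    table = {}
    for r in right_rows:                    # build side is consumed first
        table.setdefault(r[right_attr], []).append(r)
    for l in left_rows:                     # probe side is pipelined
        for r in table.get(l[left_attr], []):
            yield {**l, **r}


def project_stream(rows, attrs):
    """π: keep only the requested attributes of each row."""
    for row in rows:
        yield tuple(row[a] for a in attrs)


employees = [
    {"LNAME": "Smith", "SALARY": 30000, "DNO": 5},
    {"LNAME": "Wong", "SALARY": 40000, "DNO": 5},
    {"LNAME": "Borg", "SALARY": 55000, "DNO": 1},
]
departments = [
    {"DNUMBER": 5, "DNAME": "Research"},
    {"DNUMBER": 1, "DNAME": "Headquarters"},
]

# Two selections feed a 2-way join, followed by one final projection.
pipeline = project_stream(
    hash_join(
        select_stream(employees, lambda e: e["SALARY"] > 35000),
        select_stream(departments, lambda d: d["DNUMBER"] == 5),
        "DNO", "DNUMBER",
    ),
    ["LNAME", "DNAME"],
)
print(list(pipeline))
```

Nothing is computed until the final `list(...)` call pulls tuples through the chain.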
8. Using Heuristics in Query Optimization (1)
Using Heuristics in Query Optimization (2)
Using Heuristics in Query Optimization (3)
◼ Example:
For every project located in ‘Stafford’, retrieve the project number, the
controlling department number and the department manager’s last name,
address and birthdate.
◼ Relational algebra:
πPNUMBER, DNUM, LNAME, ADDRESS, BDATE(((σPLOCATION='STAFFORD'(PROJECT))
⋈DNUM=DNUMBER (DEPARTMENT)) ⋈MGRSSN=SSN (EMPLOYEE))
◼ SQL query:
Q2: SELECT P.PNUMBER, P.DNUM, E.LNAME, E.ADDRESS, E.BDATE
FROM PROJECT AS P, DEPARTMENT AS D, EMPLOYEE AS E
WHERE P.DNUM=D.DNUMBER AND D.MGRSSN=E.SSN AND
P.PLOCATION='STAFFORD';
Two query trees for the query Q2
Query graph for Q2
Using Heuristics in Query Optimization (6)
◼ The same query could correspond to many different relational algebra expressions —
and hence many different query trees.
◼ The task of heuristic optimization of query trees is to find a final query tree that is
efficient to execute.
◼ Example :
Q: SELECT LNAME
Step 2: Moving SELECT operations down the query tree.
Step 3: Apply more restrictive SELECT operation first
Step 4: Replacing Cartesian Product and Select with Join operation.
Step 5: Moving Project operations down the query tree
Using Heuristics in Query Optimization (10)
General Transformation Rules for Relational Algebra Operations:
◼ 3. Cascade of π: In a cascade (sequence) of π operations, all but the last one can be
ignored:
πList1(πList2(...(πListn(R))...)) = πList1(R)
◼ 4. Commuting σ with π: If the selection condition c involves only the attributes A1, ..., An in
the projection list, the two operations can be commuted:
πA1, A2, ..., An(σc(R)) = σc(πA1, A2, ..., An(R))
Using Heuristics in Query Optimization (11)
◼ Alternatively, if the selection condition c can be written as (c1 AND c2), where condition c1
involves only the attributes of R and condition c2 involves only the attributes of S, the
operations commute as follows:
σc(R ⋈ S) = (σc1(R)) ⋈ (σc2(S))
Using Heuristics in Query Optimization (13)
Using Heuristics in Query Optimization (14)
Using Heuristics in Query Optimization (15)
Using Heuristics in Query Optimization (16)
Using Heuristics in Query Optimization (17)
Using Heuristics in Query Optimization (18)
9. Using Selectivity and Cost Estimates in Query Optimization (1)
◼ Cost-based query optimization: Estimate and compare the costs of
executing a query using different execution strategies and choose the
strategy with the lowest cost estimate.
◼ Issues
❑ Cost function
❑ Number of execution strategies to be considered
Using Selectivity and Cost Estimates in Query Optimization (2)
◼ 3. Computation cost
Using Selectivity and Cost Estimates in Query Optimization (3)
Catalog Information Used in Cost Functions
◼ Information about the size of a file
Example
◼ rE = 10,000, bE = 2000, bfrE = 5
◼ Access paths:
❑ 1. A clustering index on SALARY, with levels xSALARY = 3 and average
selection cardinality sSALARY = 20.
❑ 2. A secondary index on the key attribute SSN, with xSSN = 4 (sSSN = 1).
❑ 3. A secondary index on the nonkey attribute DNO, with xDNO = 2 and
first-level index blocks bI1DNO = 4. There are dDNO = 125 distinct values for DNO,
so the selection cardinality of DNO is sDNO = (rE/dDNO) = 80.
❑ 4. A secondary index on SEX, with xSEX = 1. There are dSEX = 2 values for
the sex attribute, so the average selection cardinality is sSEX = (rE/dSEX) =
5000.
Example
❑ CS1b = 1000
❑ CS6a = xSSN + 1 = 4+1 = 5
❑ CS1a = 2000
❑ CS6b = xDNO + (bI1DNO/2) + (rE/2) = 2 + 4/2 + 10,000/2 = 5004
Example
◼ (op3): σDNO=5 (EMPLOYEE)
❑ CS1a = 2000
❑ CS6a = xDNO + sDNO = 2 + 80 = 82 (option 1 & 2)
❑ CS6a = xDNO + sDNO + 1 = 2 + 80 + 1= 83 (option 3)
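The selection cost figures above can be re-derived with a few lines of Python. The function names below are made up for this sketch; the formulas are the ones used in the example (S1 linear search, S6 secondary index), and the parameter values are those of the EMPLOYEE file.

```python
# Catalog values for EMPLOYEE, as given in the example.
r_E, b_E = 10_000, 2_000
x_SSN = 4                       # levels of the secondary index on SSN (key)
x_DNO, b_I1_DNO = 2, 4          # levels and first-level blocks of the index on DNO
s_DNO = 80                      # selection cardinality of DNO


def cost_linear_search(b):
    """CS1a: scan all b file blocks."""
    return b


def cost_linear_search_key(b):
    """CS1b: equality on a key, record found after b/2 blocks on average."""
    return b / 2


def cost_secondary_index_equality(x, s=1):
    """CS6a: x index levels plus s record accesses (s = 1 for a key attribute)."""
    return x + s


def cost_secondary_index_range(x, b_i1, r):
    """CS6b: half of the first-level index blocks and half of the records."""
    return x + b_i1 / 2 + r / 2


print(cost_linear_search_key(b_E))                        # CS1b
print(cost_secondary_index_equality(x_SSN))               # CS6a on SSN
print(cost_linear_search(b_E))                            # CS1a
print(cost_secondary_index_range(x_DNO, b_I1_DNO, r_E))   # CS6b on DNO
print(cost_secondary_index_equality(x_DNO, s_DNO))        # CS6a on DNO
```

The optimizer would pick the cheapest estimate for each operation, for example the SSN index (5 block accesses) rather than a linear search (1000 on average) for an equality condition on SSN.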
Using Selectivity and Cost Estimates in Query Optimization (7)
Examples of Cost Functions for JOIN
Using Selectivity and Cost Estimates in Query Optimization (8)
Examples of Cost Functions for JOIN (cont.)
Using Selectivity and Cost Estimates in Query Optimization (9)
Examples of Cost Functions for JOIN (cont.)
◼ J2. Single-loop join (cont.)
Example
DEPARTMENT: rD = 125 and bD = 13, primary index on DNUMBER of DEPARTMENT, xDNUMBER = 1.
EMPLOYEE: rE = 10,000, bE = 2000, secondary index on the nonkey attribute DNO, xDNO = 2, sDNO = 80.
jsOP6 = (1/|DEPARTMENT|) = 1/rD = 1/125, bfrED = 4
◼ (op8): EMPLOYEE ⋈DNO=DNUMBER DEPARTMENT
❑ Method J1 with EMPLOYEE as outer loop:
◼ CJ1 = bE + (bE * bD) + ((jsOP6 * rE * rD)/bfrED)
◼ = 2000 + (2000 * 13) + (((1/125) * 10,000 * 125)/4) = 30,500
❑ Method J1 with DEPARTMENT as outer loop:
◼ CJ1 = bD + (bE * bD) + ((jsOP6 * rE * rD)/bfrED)
◼ = 13 + (13 * 2000) + (((1/125) * 10,000 * 125)/4) = 28,513
❑ Method J2 with EMPLOYEE as outer loop:
◼ CJ2c = bE + (rE * (xDNUMBER + 1)) + ((jsOP6 * rE * rD)/bfrED)
◼ = 2000 + (10,000 * 2) + (((1/125) * 10,000 * 125)/4) = 24,500
❑ Method J2 with DEPARTMENT as outer loop:
◼ CJ2a = bD + (rD * (xDNO + sDNO)) + ((jsOP6 * rE * rD)/bfrED) (option 1 & 2)
◼ = 13 + (125 * (2 + 80)) + (((1/125) * 10,000 * 125)/4) = 12,763
◼ CJ2a = bD + (rD * (xDNO + sDNO + 1)) + ((jsOP6 * rE * rD)/bfrED) (option 3)
◼ = 13 + (125 * (2 + 80 + 1)) + (((1/125) * 10,000 * 125)/4) = 12,888
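The join cost arithmetic can be checked with a short Python sketch. The helper names are invented for this example, `Fraction` is used only to keep the join selectivity 1/125 exact, and the parameter values are those listed for DEPARTMENT and EMPLOYEE.

```python
from fractions import Fraction

# Catalog values from the example.
r_E, b_E = 10_000, 2_000
r_D, b_D = 125, 13
x_DNUMBER = 1                 # primary index on DEPARTMENT.DNUMBER
x_DNO, s_DNO = 2, 80          # secondary index on EMPLOYEE.DNO
js = Fraction(1, r_D)         # join selectivity 1/|DEPARTMENT|
bfr_ED = 4                    # blocking factor of the result file


def result_blocks(js, r_left, r_right, bfr):
    """Blocks needed to write the join result: (js * |R| * |S|) / bfr."""
    return js * r_left * r_right / bfr


def cost_nested_loop(b_outer, b_inner, out_blocks):
    """CJ1: read the outer file once and the inner file once per outer block."""
    return b_outer + b_outer * b_inner + out_blocks


def cost_single_loop(b_outer, r_outer, lookup_cost, out_blocks):
    """CJ2: read the outer file and do one index lookup per outer record."""
    return b_outer + r_outer * lookup_cost + out_blocks


out = result_blocks(js, r_E, r_D, bfr_ED)                     # 2500 blocks
print(cost_nested_loop(b_E, b_D, out))                        # J1, EMPLOYEE outer
print(cost_nested_loop(b_D, b_E, out))                        # J1, DEPARTMENT outer
print(cost_single_loop(b_E, r_E, x_DNUMBER + 1, out))         # J2c, EMPLOYEE outer
print(cost_single_loop(b_D, r_D, x_DNO + s_DNO, out))         # J2a, DEPARTMENT outer
```

Under these estimates the single-loop join with DEPARTMENT as the outer file is the cheapest of the four strategies.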
Example
DEPARTMENT: rD = 125 and bD = 13, xDNUMBER = 1, secondary index on MGRSSN of DEPARTMENT, sMGRSSN = 1, xMGRSSN = 2.
EMPLOYEE: rE = 10,000, bE = 2000, secondary index on the key attribute SSN, with xSSN = 4 (sSSN = 1).
jsOP7 = (1/|EMPLOYEE|) = 1/rE = 1/10,000, bfrED = 4
Using Selectivity and Cost Estimates in Query Optimization (11)
◼ Example: 2 left-deep trees
10. Overview of Query Optimization in Oracle
◼ Oracle DBMS V8
❑ Rule-based query optimization: the optimizer chooses execution plans based
on heuristically ranked operations.
◼ (This approach is being phased out.)
❑ Cost-based query optimization: the optimizer examines alternative access
paths and operator algorithms and chooses the execution plan with the lowest
estimated cost.
◼ The query cost is calculated based on the estimated usage of resources such as I/O,
CPU and memory needed.
❑ Application developers could specify hints to the ORACLE query optimizer.
❑ The idea is that an application developer might know more information about the
data.
11. Semantic Query Optimization
◼ Semantic Query Optimization:
❑ Uses constraints specified on the database schema in order to modify one query into
another query that is more efficient to execute.
◼ Explanation:
❑ Suppose that we had a constraint on the database schema stating that no employee
can earn more than his or her direct supervisor, and consider a query that retrieves the
employees who earn more than their supervisors. If the semantic query optimizer checks for
the existence of this constraint, it need not execute the query at all because it knows that
the result of the query will be empty. Techniques known as theorem proving can be used
for this purpose.