Advancedchapter 2 2013
Query processing: The activities involved in retrieving data from the database are called query
processing. They include parsing, validating, optimizing, and executing a query. The aim of
query processing is to transform a query written in a high-level language (typically SQL) into an
equivalent form in a low-level language (one that implements the relational algebra) and to execute it.
Query optimization: The activity of choosing an efficient execution strategy for processing a
query is called query optimization. Its aim is to choose, from the equivalent strategies available,
the one that minimizes resource usage.
A DBMS uses different techniques to process, optimize, and execute high-level queries. A query
expressed in a high-level query language must first be scanned, parsed, and validated.
The scanner identifies the language components (tokens) in the text of the query, while the parser
checks the correctness of the query syntax. The query is also validated, by consulting the system
catalog, to confirm that the attribute names and relation names it uses are valid. An internal
representation (tree or graph) of the query is then created. The optimizer generates alternative
plans and chooses the plan with the least estimated cost.
2.2. Query Processing
The aim of query processing is to find information in one or more databases and deliver it to the
user quickly and efficiently. Traditional techniques work well for databases with standard, single-
site relational structures, but databases containing more complex and diverse types of data demand
new query processing and optimization techniques.
2.2.1. Query Processing Phases
Query processing can be divided into four main phases: decomposition (consisting of parsing and
validation), optimization, code generation, and execution, as illustrated in Figure 2.1.
Step 1: Parsing and translation: The system checks the syntax of the query.
• Creates a parse-tree representation of the query.
• Translates the query into a relational-algebra expression.
• The parser checks the syntax and verifies that the relations named in the query exist.
Step 2: Optimization: Finding the cheapest evaluation plan for a query.
• A query optimizer must know the cost of each operation.
• Each relational-algebra operation can be executed by one of several different algorithms.
Step 3: Evaluation: The query-execution engine takes a query-evaluation plan, executes that plan,
and returns the answers to the query.
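[Figure 2.1: Phases of query processing. A query in a high-level language (SQL) passes through
query decomposition (using the database catalog) to give a relational algebra expression, then
query optimization (using database statistics) to give an execution plan, then code generation to
give generated code, and finally runtime query execution (against the main database) to give the
query output.]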
• Conjunctive normal form: A sequence of conjuncts that are connected with the ∧ (AND)
operator. Each conjunct contains one or more terms connected by the ∨ (OR) operator. For
example: (position = ‘Manager’ ∨ salary > 20000) ∧ branchNo = ‘B003’. A conjunctive
selection contains only those tuples that satisfy all conjuncts.
• Disjunctive normal form: A sequence of disjuncts that are connected with the ∨ (OR)
operator. Each disjunct contains one or more terms connected by the ∧ (AND) operator. For
example, we could rewrite the above conjunctive normal form as: (position = ‘Manager’ ∧
branchNo = ‘B003’) ∨ (salary > 20000 ∧ branchNo = ‘B003’). A disjunctive selection contains
those tuples formed by the union of all tuples that satisfy the disjuncts.
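The equivalence of the two normal forms in this example can be checked mechanically. The following is a minimal Python sketch, not part of the original notes; the Staff tuples are invented purely for illustration, and the function names cnf and dnf are hypothetical.

```python
# Hypothetical Staff tuples, invented for illustration only.
staff = [
    {"staffNo": "S1", "position": "Manager", "salary": 30000, "branchNo": "B003"},
    {"staffNo": "S2", "position": "Assistant", "salary": 25000, "branchNo": "B003"},
    {"staffNo": "S3", "position": "Assistant", "salary": 12000, "branchNo": "B003"},
    {"staffNo": "S4", "position": "Manager", "salary": 24000, "branchNo": "B005"},
]


def cnf(t):
    # (position = 'Manager' OR salary > 20000) AND branchNo = 'B003'
    return (t["position"] == "Manager" or t["salary"] > 20000) and t["branchNo"] == "B003"


def dnf(t):
    # (position = 'Manager' AND branchNo = 'B003') OR (salary > 20000 AND branchNo = 'B003')
    return (t["position"] == "Manager" and t["branchNo"] == "B003") or (
        t["salary"] > 20000 and t["branchNo"] == "B003"
    )


# Apply each predicate as a selection and keep only the key attribute.
cnf_result = [t["staffNo"] for t in staff if cnf(t)]
dnf_result = [t["staffNo"] for t in staff if dnf(t)]
print(cnf_result, dnf_result)
```

Both selections return the same tuples for this sample, which is what the distributive law of ∧ over ∨ predicts for any input.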
3. Semantic Analysis: The objective of semantic analysis is to reject normalized queries that are
incorrectly formulated or contradictory. A query is incorrectly formulated if components do
not contribute to the generation of the result, which may happen if some join specifications are
missing. A query is contradictory if its predicate cannot be satisfied by any tuple. For example,
the predicate (position = ‘Manager’ ∧ position = ‘Assistant’) on the Staff relation is
contradictory, as a member of staff cannot be both a Manager and an Assistant simultaneously.
However, the predicate ((position = ‘Manager’ ∧ position = ‘Assistant’) ∨ salary > 20000)
could be simplified to (salary > 20000) by interpreting the contradictory clause as the boolean
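value FALSE. Unfortunately, the handling of contradictory clauses is not consistent between
DBMSs. Two checks that can be used to detect incorrectly formulated and contradictory queries are:
• Construct a relation connection graph: Create a node for each relation and a node for the
result, then add an edge between two nodes that represent a join and an edge between nodes
that represent the source of projection operations. If the graph is not connected, the query is
incorrectly formulated.
• Construct a normalized attribute connection graph: Create a node for each reference to an
attribute (or the constant 0), then add a directed edge between nodes that represent a join and
a directed edge between an attribute node and a constant 0 node that represents a selection
operation. If the graph has a cycle for which the valuation sum is negative, the query is
contradictory.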
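The first of these checks can be sketched in a few lines of Python. This is an illustrative sketch only, assuming the graph has already been built; the helper name is_connected is hypothetical, and the relation names are borrowed from the Client/Viewing/PropertyForRent query that appears later in these notes. The second check (negative cycles in the attribute connection graph) is not covered here.

```python
from collections import deque


def is_connected(nodes, edges):
    """Return True if the undirected graph (nodes, edges) is connected."""
    adjacency = {n: set() for n in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    # Breadth-first search from an arbitrary starting node.
    start = next(iter(nodes))
    seen = {start}
    queue = deque([start])
    while queue:
        for neighbour in adjacency[queue.popleft()]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen == set(nodes)


# One node per relation plus one node for the result.
nodes = ["Client", "Viewing", "PropertyForRent", "result"]

# Join edges plus an edge from the relation that supplies the projected attributes.
well_formed = [("Client", "Viewing"), ("Viewing", "PropertyForRent"),
               ("PropertyForRent", "result")]

# Same query with the Viewing-PropertyForRent join specification omitted.
missing_join = [("Client", "Viewing"), ("PropertyForRent", "result")]

print(is_connected(nodes, well_formed))
print(is_connected(nodes, missing_join))
```

With the join specification missing, Client and Viewing are cut off from the result node, so the query would be rejected as incorrectly formulated.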
4. Simplification: The objectives of the simplification stage are to detect redundant
qualifications, eliminate common subexpressions, and transform the query to a semantically
equivalent but more easily and efficiently computed form. Typically, access restrictions, view
definitions, and integrity constraints are considered at this stage. If the user does not have the
appropriate access to all the components of the query, the query must be rejected. For example:
CREATE VIEW Staff3 AS SELECT staffNo, fName, lName, salary, branchNo
FROM Staff WHERE branchNo = ‘B003’ AND salary > 20000;
5. Query Restructuring: In the final stage of query decomposition, the query is restructured to
provide a more efficient implementation. Because more than one translation of a query is
possible, transformation rules are used to convert one form into an equivalent, more efficient one.
Most real-world data is not well structured. Today's databases typically contain much non-
structured data such as text, images, video, and audio, often distributed across computer networks.
In this complex environment, efficient and accurate query processing becomes quite challenging.
Many techniques are needed, not only in storage and query processing but also in concurrency
control, recovery, and other areas.
2.3. Query Optimization
The activity of choosing an efficient execution strategy for processing a query is called query
optimization. Everyone wants the performance of their database to be optimal; in particular, there
is often a requirement for a specific query, or an object based on a query, to run faster. The
problem of query optimization is to find the sequence of steps that produces the answer to the
user's request in the most efficient manner, given the database structure. The performance of a
query is affected by the tables or queries that underlie it and by the complexity of the query
itself. When data or workload characteristics change:
• The best navigation strategy changes
• The best way of organizing the data changes
Query optimizers are one of the main means by which modern database systems achieve their
performance advantages. Given a request for data manipulation or retrieval, an optimizer
chooses an efficient plan for evaluating the request from among the many alternative
strategies. That is, there are many ways (access paths) of reaching the desired file or record, and
the optimizer tries to select the most efficient (cheapest) one. The DBMS is responsible for
picking the best execution strategy based on various considerations. Query optimizers
are among the largest and most complex modules of database systems.
Most efficient processing: Least amount of I/O and CPU resources.
Selection of the best method: In a non-procedural language the system does the optimization at
the time of execution. On the other hand, in a procedural language, programmers have some
flexibility in selecting the best method. For optimizing the execution of a query the programmer
must know:
• File organization.
• Record access mechanism and primary or secondary key.
• Data location on disk.
• Data access limitations.
To write correct code, application programmers need to know how the data is organized physically
(e.g., which indexes exist). To write efficient code, they also need to consider data and
workload characteristics.
Query optimization is the process of choosing a suitable execution strategy for processing a
query. There are two internal representations of a query: the query tree and the query graph.
Query tree: A tree data structure that corresponds to a relational algebra expression. It
represents the input relations of the query as leaf nodes of the tree and the relational algebra
operations as internal nodes.
Query graph: A graph data structure that corresponds to a relational calculus expression.
Relations in the query are represented by relation nodes, which are displayed as single circles.
The graph does not indicate the order in which operations are performed, so there is only a
single graph corresponding to each query.
Query optimization can be achieved through two techniques:
Using heuristic rules: Reorder the operations in the internal representation of a query (tree or
graph) to improve performance. A heuristic rule works well in most cases, but it is not
guaranteed to work in all possible cases. For example, performing selections before joins usually
improves efficiency.
Using cost estimations: Estimate the costs of the different execution strategies and choose the one
with the lowest cost. This is computationally intensive. Most DBMSs combine both techniques.
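The "selections before joins" heuristic can be illustrated with a small Python sketch. The relations, attribute names, and the two function names below are invented for illustration; the point is only to compare the size of the intermediate result under the two orderings.

```python
from itertools import product

# Hypothetical relations, invented for illustration: 20 staff tuples, 5 branches.
staff = [{"staffNo": f"S{i}", "branchNo": f"B00{i % 5 + 1}"} for i in range(20)]
branch = [
    {"bNo": f"B00{i}", "city": city}
    for i, city in enumerate(["London", "Glasgow", "Bristol", "Aberdeen", "Leeds"], start=1)
]


def late_selection():
    # Cartesian product first, then the join condition and the selection.
    intermediate = [{**s, **b} for s, b in product(staff, branch)]
    result = [t for t in intermediate
              if t["branchNo"] == t["bNo"] and t["city"] == "Glasgow"]
    return result, len(intermediate)


def early_selection():
    # Selection on Branch first, then the product and the join condition.
    selected = [b for b in branch if b["city"] == "Glasgow"]
    intermediate = [{**s, **b} for s, b in product(staff, selected)]
    result = [t for t in intermediate if t["branchNo"] == t["bNo"]]
    return result, len(intermediate)


late_result, late_size = late_selection()
early_result, early_size = early_selection()
print(late_size, early_size, late_result == early_result)
```

Both orderings return the same tuples, but the early selection builds an intermediate result of 20 tuples instead of 100. The saving depends on how selective the condition is, which is why this is a heuristic and not a guarantee.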
2.3.1. Approaches to Query Optimization
2.3.1.1. Heuristics Approach
The heuristic approach to query optimization uses transformation rules to convert one
relational algebra expression into an equivalent form that is known to be more efficient. It
uses knowledge of the characteristics of the relational algebra operations and of the
relationships between the operators to optimize the query. Thus, the heuristic approach
to optimization makes use of:
• Properties of individual operators
• Associations between operators
• Query Tree: a graphical representation of the operators, relations, attributes and predicates
and processing sequence during query processing. Query tree has three main parts:
o The Leaves: the base relations used for processing the query/ extracting the required
information
o The Root: the final result/relation as an output based on the operation on the relations
used for query processing
o Nodes: intermediate results or relations before reaching the final result.
Execution of the operations in a query tree starts at the leaves, continues through the
intermediate nodes, and ends at the root. The properties of each operation and the associations
between operators are analyzed using a set of rules called transformation rules. Applying the
transformation rules transforms the query into a relatively good execution strategy.
Process for heuristic optimization:
1. The parser of a high-level query generates an initial internal representation.
2. Heuristic rules are applied to optimize the internal representation.
3. A query execution plan is generated to execute groups of operations based on the access paths
available on the files involved in the query.
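The leaves-to-root execution order of a query tree corresponds to a post-order traversal. The following Python sketch shows one possible representation; the Node class, the execution_order function, and the example tree are assumptions made for illustration and do not describe how any particular DBMS stores its plans.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A query tree node: a base relation (leaf) or an operation (internal node)."""
    label: str
    children: List["Node"] = field(default_factory=list)


def execution_order(node: Node) -> List[str]:
    """Post-order traversal: every input of a node is listed before the node itself."""
    order: List[str] = []
    for child in node.children:
        order.extend(execution_order(child))
    order.append(node.label)
    return order


# Hypothetical tree: project over a join of a selected Staff relation with Branch.
tree = Node("project lName, city", [
    Node("join on branchNo", [
        Node("select salary > 15000", [Node("Staff")]),
        Node("Branch"),
    ]),
])

order = execution_order(tree)
print(order)
```

The base relations appear first, the root projection last, and each operation appears only after all of its inputs.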
2.3.2. Transformation Rules for the Relational Algebra Operations
By applying transformation rules, the optimizer can transform one relational algebra expression
into an equivalent expression that is known to be more efficient. Use these rules to restructure the
(canonical) relational algebra tree generated during query decomposition. In listing these rules, we
use three relations R, S, and T, with R defined over the attributes A ={A1, A2, . . . , An}, and S
defined over B ={B1, B2, . . . , Bn}; p, q, and r denote predicates, and L, L1, L2, M, M1, M2, and
N denote sets of attributes.
1. Conjunctive selection operations can cascade into individual selection operations (and vice
versa). This transformation is sometimes referred to as cascade of selection.
σ_{p ∧ q ∧ r}(R) = σ_p(σ_q(σ_r(R)))
Example: σ_{branchNo=‘B003’ ∧ salary>15000}(Staff) = σ_{branchNo=‘B003’}(σ_{salary>15000}(Staff))
2. Commutativity of Selection operations.
σ_p(σ_q(R)) = σ_q(σ_p(R)), where p and q are predicates
Example: σ_{branchNo=‘B003’}(σ_{salary>15000}(Staff)) = σ_{salary>15000}(σ_{branchNo=‘B003’}(Staff))
3. In a sequence of Projection operations, only the last in the sequence is required. This is
also called cascade of projection:
Π_L(Π_M( ... Π_N(R))) = Π_L(R)
Example: Π_{lName}(Π_{branchNo, lName}(Staff)) = Π_{lName}(Staff)
4. Commutativity of Selection and Projection. If the predicate p involves only the attributes in
the projection list, then the Selection and Projection operations commute:
Π_{A1, ..., Am}(σ_p(R)) = σ_p(Π_{A1, ..., Am}(R)), where p involves only attributes in {A1, A2, ..., Am}
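Rules 1 to 4 can be checked on a small relation with a short Python sketch. The helper functions select and project and the sample Staff tuples are hypothetical, written only for this illustration; project removes duplicates, as the relational Projection operation does.

```python
def select(pred, rel):
    """Selection: keep the tuples that satisfy pred."""
    return [t for t in rel if pred(t)]


def project(attrs, rel):
    """Projection: keep only attrs and remove duplicate tuples."""
    seen, out = set(), []
    for t in rel:
        key = tuple(t[a] for a in attrs)
        if key not in seen:
            seen.add(key)
            out.append(dict(zip(attrs, key)))
    return out


# Hypothetical Staff tuples, invented for illustration only.
staff = [
    {"staffNo": "S1", "lName": "Brand", "salary": 24000, "branchNo": "B003"},
    {"staffNo": "S2", "lName": "Ford", "salary": 18000, "branchNo": "B003"},
    {"staffNo": "S3", "lName": "Beech", "salary": 12000, "branchNo": "B003"},
    {"staffNo": "S4", "lName": "White", "salary": 30000, "branchNo": "B005"},
    {"staffNo": "S5", "lName": "Ford", "salary": 16000, "branchNo": "B005"},
]


def p(t):
    return t["branchNo"] == "B003"


def q(t):
    return t["salary"] > 15000


# Rule 1: cascade of selection.
rule1 = select(lambda t: p(t) and q(t), staff) == select(p, select(q, staff))
# Rule 2: commutativity of selection.
rule2 = select(p, select(q, staff)) == select(q, select(p, staff))
# Rule 3: cascade of projection.
rule3 = project(["lName"], project(["branchNo", "lName"], staff)) == project(["lName"], staff)
# Rule 4: selection and projection commute when p uses only projected attributes.
attrs = ["lName", "branchNo"]
rule4 = project(attrs, select(p, staff)) == select(p, project(attrs, staff))

print(rule1, rule2, rule3, rule4)
```

A check on one sample relation is an illustration of the rules, not a proof of them.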
branchNo(Staff)) ⋈Staff.branchNo=Branch.branchNo(Π city, branchNo(Branch))
b. If the join condition contains additional attributes not in L, say attributes M = M1 ∪ M2
where M1 involves only attributes of R, and M2 involves only attributes of S, then a final
Projection operation is required:
Π_{L1 ∪ L2}(R ⋈_r S) = Π_{L1 ∪ L2}((Π_{L1 ∪ M1}(R)) ⋈_r (Π_{L2 ∪ M2}(S)))
Example: Πposition, city(Staff⋈Staff.branchNo=Branch.branchNo Branch)=Πposition, city((Πposition,
SELECT p.propertyNo, p.street FROM Client c, Viewing v, PropertyForRent p WHERE
c.prefType = ‘Flat’ AND c.clientNo = v.clientNo AND v.propertyNo = p.propertyNo AND
c.maxRent >= p.rent AND c.prefType = p.type AND p.ownerNo = ‘CO93’;
Converting the SQL to relational algebra, we have: Πp.propertyNo, p.street(σ c.prefType=‘Flat’ ∧
The SQL equivalent of the above query is: SELECT FName, LName FROM EMPLOYEE,
WORKS_ON, PROJECT WHERE DoB < ‘Jan 1 1965’ AND EEmpID = WEmpID AND WProjID = PProjID
AND PName = ‘Ring Road’;
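The initial query tree will be (each node's inputs are indented beneath it):
Π_{FName, LName}
  σ_{DoB<Jan1 1965 ∧ EEmpID=WEmpID ∧ WProjID=PProjID ∧ PName=‘Ring Road’}
    X
      X
        EMPLOYEE
        WORKS_ON
      PROJECT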
By applying the first step (cascading the selection), we obtain the following structure:
σ_{DoB<Jan1 1965}(σ_{EEmpID=WEmpID}(σ_{WProjID=PProjID}(σ_{PName=‘Ring Road’}(EMPLOYEE X WORKS_ON X PROJECT))))
By applying the second step, it can be seen that some conditions involve attributes that belong to
a single relation (DoB belongs to EMPLOYEE and PName belongs to PROJECT), so these selection
operations can be commuted with the Cartesian product. Then, since the condition WEmpID=EEmpID
involves only the EMPLOYEE and WORKS_ON relations, the selection with this condition can be
applied directly to the product of these two relations:
σ_{PProjID=WProjID}(σ_{PName=‘Ring Road’}(PROJECT) X σ_{WEmpID=EEmpID}(WORKS_ON X σ_{DoB<Jan1 1965}(EMPLOYEE)))
The query tree after this modification will be:
Π_{FName, LName}
  σ_{PProjID=WProjID}
    X
      σ_{PName=‘Ring Road’}
        PROJECT
      σ_{WEmpID=EEmpID}
        X
          WORKS_ON
          σ_{DoB<Jan1 1965}
            EMPLOYEE
Using the third step, perform the most restrictive operations first. From the query given, we can
see that the selection on PROJECT is more restrictive than the selection on EMPLOYEE. Thus, it is
better to perform the selection on PROJECT before the selection on EMPLOYEE. Rearranging the
nodes to achieve this gives:
Π_{FName, LName}
  σ_{WEmpID=EEmpID}
    X
      σ_{PProjID=WProjID}
        X
          σ_{PName=‘Ring Road’}
            PROJECT
          WORKS_ON
      σ_{DoB<Jan1 1965}
        EMPLOYEE
Using the fourth step, combine each Cartesian product with the subsequent selection operation to
form a join.
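Π_{FName, LName}
  ⋈_{WEmpID=EEmpID}
    ⋈_{PProjID=WProjID}
      σ_{PName=‘Ring Road’}
        PROJECT
      WORKS_ON
    σ_{DoB<Jan1 1965}
      EMPLOYEE
Finally, moving projections down the tree, so that each branch keeps only the attributes needed
later, gives:
Π_{FName, LName}
  ⋈_{WEmpID=EEmpID}
    Π_{WEmpID}
      ⋈_{PProjID=WProjID}
        Π_{PProjID}
          σ_{PName=‘Ring Road’}
            PROJECT
        WORKS_ON
    Π_{FName, LName, EEmpID}
      σ_{DoB<Jan1 1965}
        EMPLOYEE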
• Data transportation
• Storage space in the Primary Memory
• Writing on Disk
The statistics in the system catalog used for cost estimation are:
• Cardinality of a relation (r): the number of tuples currently contained in the relation
• Degree of a relation: the number of attributes of the relation
• Number of tuples of a relation that can be stored in one block of memory
• Total number of blocks used by a relation
• Number of distinct values of an attribute (d)
• Selection cardinality of an attribute (S): the average number of records that will satisfy
an equality condition on that attribute, S = r/d
By using the above information, one can calculate the cost of executing a query under each
strategy and select the best strategy, which is the one with the minimum processing cost.
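As a worked illustration of these statistics, the following Python sketch computes the selection cardinality S = r/d and the total number of blocks for a relation. All numbers are invented for the example, and the function names and the name tuples_per_block are assumptions introduced here.

```python
import math


def selection_cardinality(r, d):
    """S = r / d: average number of records matching an equality condition."""
    return r / d


def total_blocks(r, tuples_per_block):
    """Blocks needed to hold r tuples when each block holds tuples_per_block."""
    return math.ceil(r / tuples_per_block)


# Hypothetical catalog statistics for one relation.
r = 10000              # cardinality: tuples in the relation
d = 50                 # distinct values of the attribute in the condition
tuples_per_block = 40  # tuples that fit in one block

S = selection_cardinality(r, d)
blocks = total_blocks(r, tuples_per_block)
print(S, blocks)
```

Under these assumed figures an equality condition is expected to match about 200 tuples, and a full scan of the relation would read 250 blocks. The estimate S = r/d assumes the attribute values are evenly distributed.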
4. Communication Cost: In most database systems the database resides at one station and
queries originate from different terminals. This affects the performance of the system by
adding to the cost of query processing. Thus, the cost of transporting data between
the database site and the terminal from which the query originates should be analyzed.
2.4. Pipelining
Pipelining is another method used for query optimization. It is used to improve the performance
of queries and is sometimes known as stream-based processing or on-the-fly processing. As
query optimization tries to reduce the size of intermediate results, pipelining goes further by
applying several operations to the tuples of a single intermediate result in one continuous pass,
without storing the output of each operation separately. The technique is thus said to reduce the
number of intermediate relations in query execution. Pipelining performs multiple operations on a
single relation in a pipeline.
Generally, a pipeline is implemented as a separate process or thread within the DBMS. Each
pipeline takes a stream of tuples from its inputs and creates a stream of tuples as its output. A
buffer is created for each pair of adjacent operations to hold the tuples being passed from the first
operation to the second one. One drawback with pipelining is that the inputs to operations are not
necessarily available all at once for processing. This can restrict the choice of algorithms.
Example: Suppose we have an employee relation with the schema Employee(ID,
FName, LName, DoB, Salary, Position, Dept). If a query extracts supervisors with a
salary greater than 2000, the relational algebra representation of the query is:
σ_{(Salary>2000) ∧ (Position=‘Supervisor’)}(Employee)
After reading the relation from memory, the system can perform the operation by cascading
the SELECT operations, passing each tuple that satisfies one condition directly to the test for
the other.
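Python generators give a compact way to sketch this kind of pipelined evaluation. The code below is an illustration only: the functions scan and select and the sample Employee tuples are hypothetical, and only three of the schema's attributes are included to keep the example short.

```python
def scan(relation):
    """Produce the tuples of a stored relation one at a time."""
    for t in relation:
        yield t


def select(pred, tuples):
    """Pass on each incoming tuple that satisfies pred, without storing the rest."""
    for t in tuples:
        if pred(t):
            yield t


# Hypothetical Employee tuples, invented for illustration only.
employee = [
    {"ID": 1, "Position": "Supervisor", "Salary": 2500},
    {"ID": 2, "Position": "Clerk", "Salary": 3000},
    {"ID": 3, "Position": "Supervisor", "Salary": 1800},
    {"ID": 4, "Position": "Supervisor", "Salary": 4000},
]

# Two cascaded selections; no intermediate relation is built between them.
pipeline = select(
    lambda t: t["Salary"] > 2000,
    select(lambda t: t["Position"] == "Supervisor", scan(employee)),
)

result_ids = [t["ID"] for t in pipeline]
print(result_ids)
```

Each tuple flows through both conditions before the next tuple is read, so the output of the first selection is never stored as a temporary relation.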