Choice of Evaluation Plans
Choice of Evaluation Plans
Generation of expressions is only part of the query-optimization process, since each operation in the expression can be implemented with different algorithms. An evaluation plan is therefore needed to define exactly what algorithm should be used for each operation, and how the execution of the operations should be coordinated. Figure 14.4 illustrates one possible evaluation plan for the expression from Figure 14.3. As we have seen, several different algorithms can be used for each relational operation, giving rise to alternative evaluation plans. Further, decisions about pipelining have to be made. In the figure, the edges from the selection operations to the merge join operation are marked as pipelined; pipelining is feasible if the selection operations generate their output sorted on the join attributes. They would do so if the indices on branch and account store records with equal values for the index attributes sorted by branch-name.
Interaction of Evaluation Techniques
One way to choose an evaluation plan for a query expression is simply to choose for each operation the cheapest algorithm for evaluating it. We can choose any ordering of the operations that ensures that operations lower in the tree are executed before operations higher in the tree.
However, choosing the cheapest algorithm for each operation independently is not necessarily a good idea. Although a merge join at a given level may be costlier than
a hash join, it may provide a sorted output that makes evaluating a later operation (such as duplicate elimination, intersection, or another merge join) cheaper. Similarly, a nested-loop join with indexing may provide opportunities for pipelining the results to the next operation, and thus may be useful even if it is not the cheapest way of
performing the join. To choose the best overall algorithm, we must consider even nonoptimal algorithms for individual operations.
Thus, in addition to considering alternative expressions for a query, we must also consider alternative algorithms for each operation in an expression. We can use rules much like the equivalence rules to define what algorithms can be used for each operation, and whether its result can be pipelined or must be materialized. We can use these rules to generate all the query-evaluation plans for a given expression.
Given an evaluation plan, we can estimate its cost using statistics estimated by the techniques in Section 14.2 coupled with cost estimates for various algorithms and evaluation methods described in Chapter 13. That still leaves the problem of choosing the best evaluation plan for a query. There are two broad approaches: The first searches all the plans, and chooses the best plan in a cost-based fashion. The second uses heuristics to choose a plan. We discuss these approaches next. Practical query optimizers incorporate elements of both approaches.
Cost-Based Optimization
A cost-based optimizer generates a range of query-evaluation plans from the given query by using the equivalence rules, and chooses the one with the least cost. For a complex query, the number of different query plans that are equivalent to a given plan can be large. As an illustration, consider the expression
In general, with n relations, there are (2(n − 1))!/(n − 1)! different join orders. (We leave the computation of this expression for you to do in Exercise 14.10.) For joins involving small numbers of relations, this number is acceptable; for example, with n = 5, the number is 1680. However, as n increases, this number rises quickly. With n = 7, the number is 665280; with n = 10, the number is greater than 17.6 billion!
Luckily, it is not necessary to generate all the expressions equivalent to a given expression. For example, suppose we want to find the best join order of the form
which represents all join orders where r1, r2, and r3 are joined first (in some order), and the result is joined (in some order) with r4 and r5. There are 12 different join orders for computing r1 r2 r3, and 12 orders for computing the join of this result with r4 and r5. Thus, there appear to be 144 join orders to examine. However, once we have found the best join order for the subset of relations {r1, r2, r3}, we can use that order for further joins with r4 and r5, and can ignore all costlier join orders of r1 r2 r3. Thus, instead of 144 choices to examine, we need to examine only 12 + 12 choices.
Using this idea, we can develop a dynamic-programming algorithm for finding optimal join orders. Dynamic programming algorithms store results of computations and reuse them, a procedure that can reduce execution time greatly. A recursive pro- cedure implementing the dynamic programming algorithm appears in Figure 14.5.
The procedure stores the evaluation plans it computes in an associative array bestplan, which is indexed by sets of relations. Each element of the associative ar- ray contains two components: the cost of the best plan of S, and the plan itself. The value of bestplan[S].cost is assumed to be initialized to ∞ if bestplan[S] has not yet been computed.
The procedure first checks if the best plan for computing the join of the given set of relations S has been computed already (and stored in the associative array bestplan); if so it returns the already computed plan. Otherwise, the procedure tries every way of dividing S into two disjoint subsets. For each division, the procedure recursively finds the best plans for each of the two subsets, and then computes the cost of the overall plan by using that division. The procedure picks the cheapest plan from among all the alternatives for dividing S into two sets. The cheapest plan and its cost are stored in the array bestplan, and returned by the procedure. The time complexity of the procedure can be shown to be O(3n) (see Exercise 14.11).
Actually, the order in which tuples are generated by the join of a set of relations is also important for finding the best overall join order, since it can affect the cost of further joins (for instance, if merge join is used). A particular sort order of the tuples is said to be an interesting sort order if it could be useful for a later operation. For instance, generating the result of r1 r2 r3 sorted on the attributes common with r4 or r5 may be useful, but generating it sorted on the attributes common to only r1 and r2 is not useful. Using merge join for computing r1 r2 r3 may be costlier than using some other join technique, but may provide an output sorted in an interesting sort order.
Hence, it is not sufficient to find the best join order for each subset of the set of n given relations. Instead, we have to find the best join order for each subset, for each interesting sort order of the join result for that subset. The number of subsets of n relations is 2n. The number of interesting sort orders is generally not large. Thus, about 2n join expressions need to be stored. The dynamic-programming algorithm for finding the best join order can be easily extended to handle sort orders. The cost of the extended algorithm depends on the number of interesting orders for each subset of relations; since this number has been found to be small in practice, the cost remains at O(3n).
With n = 10, this number is around 59000, which is much better than the 17.6 billion different join orders. More important, the storage required is much less than before, since we need to store only one join order for each interesting sort order of each of 1024 subsets of r1,..., r10. Although both numbers still increase rapidly with n, commonly occurring joins usually have less than 10 relations, and can be handled easily.
We can use several techniques to reduce further the cost of searching through a large number of plans. For instance, when examining the plans for an expression, we can terminate after we examine only a part of the expression, if we determine that the cheapest plan for that part is already costlier than the cheapest evaluation plan for a full expression examined earlier. Similarly, suppose that we determine that the cheapest way of evaluating a subexpression is costlier than the cheapest evaluation plan for a full expression examined earlier. Then, no full expression involving that subexpression needs to be examined. We can further reduce the number of evaluation plans that need to be considered fully by first making a heuristic guess of a good plan, and estimating that plan’s cost. Then, only a few competing plans will require a full analysis of cost. These optimizations can reduce the overhead of query optimizationsignificantly.
Heuristic Optimization
A drawback of cost-based optimization is the cost of optimization itself. Although the cost of query processing can be reduced by clever optimizations, cost-based optimization is still expensive. Hence, many systems use heuristics to reduce the number of choices that must be made in a cost-based fashion. Some systems even choose to use only heuristics, and do not use cost-based optimization at all.
An example of a heuristic rule is the following rule for transforming relational- algebra queries:
• Perform selection operations as early as possible.
A heuristic optimizer would use this rule without finding out whether the cost is reduced by this transformation. In the first transformation example in Section 14.3, the selection operation was pushed into a join.
We say that the preceding rule is a heuristic because it usually, but not always, helps to reduce the cost. For an example of where it can result in an increase in cost, consider an expression σθ (r s), where the condition θ refers to only attributes in s.
The selection can certainly be performed before the join. However, if r is extremely small compared to s, and if there is an index on the join attributes of s, but no index on the attributes used by θ, then it is probably a bad idea to perform the selection early. Performing the selection early — that is, directly on s — would require doing a scan of all tuples in s. It is probably cheaper, in this case, to compute the join by using the index, and then to reject tuples that fail the selection.
The projection operation, like the selection operation, reduces the size of relations. Thus, whenever we need to generate a temporary relation, it is advantageous to ap- ply immediately any projections that are possible. This advantage suggests a companion to the “perform selections early” heuristic:
• Perform projections early.
It is usually better to perform selections earlier than projections, since selections have the potential to reduce the sizes of relations greatly, and selections enable the use of indices to access tuples. An example similar to the one used for the selection heuristic should convince you that this heuristic does not always reduce the cost.
Drawing on the equivalences discussed in Section 14.3.1, a heuristic optimization algorithm will reorder the components of an initial query tree to achieve improved query execution. We now present an overview of the steps in a typical heuristic optimization algorithm. You can understand the heuristics by visualizing a query expression as a tree, as illustrated in Figure 14.3
1. Deconstruct conjunctive selections into a sequence of single selection operations. This step, based on equivalence rule 1, facilitates moving selection operations down the query tree.
2. Move selection operations down the query tree for the earliest possible execution. This step uses the commutativity and distributivity properties of the selection operation noted in equivalence rules 2, 7.a, 7.b, and 11.
For instance, this step transforms σθ (r s) into either σθ (r) s or r σθ (s) whenever possible. Performing value-based selections as early as possible reduces the cost of sorting and merging intermediate results. The degree of reordering permitted for a particular selection is determined by the attributes involved in that selection condition.
3. Determine which selection operations and join operations will produce the smallest relations — that is, will produce the relations with the least number of tuples. Using associativity of the operation, rearrange the tree so that the leaf-node relations with these restrictive selections are executed first.
This step considers the selectivity of a selection or join condition. Recall that the most restrictive selection — that is, the condition with the smallest selectivity — retrieves the fewest records. This step relies on the associativity of binary operations given in equivalence rule 6.
4. Replace with join operations those Cartesian product operations that are followed by a selection condition (rule 4.a). The Cartesian product operation is often expensive to implement since r1 × r2 includes a record for each combination of records from r1 and r2. The selection may significantly reduce the number of records, making the join much less expensive than the Cartesian product.
5. Deconstruct and move as far down the tree as possible lists of projection at- tributes, creating new projections where needed. This step draws on the properties of the projection operation given in equivalence rules 3, 8.a, 8.b, and 12.
6. Identify those subtrees whose operations can be pipelined, and execute them using pipelining.
In summary, the heuristics listed here reorder an initial query-tree representation in such a way that the operations that reduce the size of intermediate results are applied first; early selection reduces the number of tuples, and early projection reduces the number of attributes. The heuristic transformations also restructure the tree so that the system performs the most restrictive selection and join operations before other similar operations.
Heuristic optimization further maps the heuristically transformed query expression into alternative sequences of operations to produce a set of candidate evaluation plans. An evaluation plan includes not only the relational operations to be performed, but also the indices to be used, the order in which tuples are to be accessed, and the order in which the operations are to be performed. The access-plan – selection phase of a heuristic optimizer chooses the most efficient strategy for each operation.
Structure of Query Optimizers∗∗
So far, we have described the two basic approaches to choosing an evaluation plan; as noted, most practical query optimizers combine elements of both approaches. For example, certain query optimizers, such as the System R optimizer, do not consider all join orders, but rather restrict the search to particular kinds of join orders. The System R optimizer considers only those join orders where the right operand of each join is one of the initial relations r1,..., rn. Such join orders are called left-deep join orders. Left-deep join orders are particularly convenient for pipelined evaluation, since the right operand is a stored relation, and thus only one input to each join is pipelined.
Figure 14.6 illustrates the difference between left-deep join trees and non-left-deep join trees. The time it takes to consider all left-deep join orders is O(n!), which is much less than the time to consider all join orders. With the use of dynamic programming optimizations, the System R optimizer can find the best join order in time O(n2n). Contrast this cost with the O(3n) time required to find the best overall join order. The System R optimizer uses heuristics to push selections and projections down the query tree.
The cost estimate that we presented for scanning by secondary indices assumed that every tuple access results in an I/O operation. The estimate is likely to be ac- curate with small buffers; with large buffers, however, the page containing the tuple may already be in the buffer. Some optimizers incorporate a better cost-estimation technique for such scans: They take into account the probability that the page containing the tuple is in the buffer.
Query optimization approaches that integrate heuristic selection and the generation of alternative access plans have been adopted in several systems. The approach used in System R and in its successor, the Starburst project, is a hierarchical procedure based on the nested-block concept of SQL. The cost-based optimization techniques described here are used for each block of the query separately.
The heuristic approach in some versions of Oracle works roughly this way: For an n-way join, it considers n evaluation plans. Each plan uses a left-deep join order, starting with a different one of the n relations. The heuristic constructs the join or- der for each of the n evaluation plans by repeatedly selecting the “best” relation to join next, on the basis of a ranking of the available access paths. Either nested-loop or sort – merge join is chosen for each of the joins, depending on the available access paths. Finally, the heuristic chooses one of the n evaluation plans in a heuristic manner, based on minimizing the number of nested-loop joins that do not have an index available on the inner relation, and on the number of sort – merge joins.
The intricacies of SQL introduce a good deal of complexity into query optimizers.
In particular, it is hard to translate nested subqueries in SQL into relational algebra.
We briefly outline how to handle nested subqueries in Section 14.4.5. For compound SQL queries (using the ∪, ∩, or − operation), the optimizer processes each component separately, and combines the evaluation plans to form the overall evaluation plan.
Even with the use of heuristics, cost-based query optimization imposes a substantial overhead on query processing. However, the added cost of cost-based query optimization is usually more than offset by the saving at query-execution time, which is dominated by slow disk accesses. The difference in execution time between a good plan and a bad one may be huge, making query optimization essential. The achieved saving is magnified in those applications that run on a regular basis, where the query can be optimized once, and the selected query plan can be used on each run. Therefore, most commercial systems include relatively sophisticated optimizers. The bibliographical notes give references to descriptions of the query optimizers of actual
database systems. Choice of Evaluation Plans 551
SQL conceptually treats nested subqueries in the where clause as functions that take parameters and return either a single value or a set of values (possibly an empty set). The parameters are the variables from outer level query that are used in the nested subquery (these variables are called correlation variables). For instance, suppose we have the following query.
Conceptually, the subquery can be viewed as a function that takes a parameter (here, borrower.customer-name) and returns the set of all depositors with the same name.
SQL evaluates the overall query (conceptually) by computing the Cartesian product of the relations in the outer from clause and then testing the predicates in the where clause for each tuple in the product. In the preceding example, the predicate tests if the result of the subquery evaluation is empty.
This technique for evaluating a query with a nested subquery is called correlated evaluation. Correlated evaluation is not very efficient, since the subquery is separately evaluated for each tuple in the outer level query. A large number of random disk I/O operations may result.
SQL optimizers therefore attempt to transform nested subqueries into joins, where possible. Efficient join algorithms help avoid expensive random I/O. Where the transformation is not possible, the optimizer keeps the subqueries as separate expressions, optimizes them separately, and then evaluates them by correlated evaluation.
As an example of transforming a nested subquery into a join, the query in the preceding example can be rewritten as
(To properly reflect SQL semantics, the number of duplicate derivations should not change because of the rewriting; the rewritten query can be modified to ensure this property, as we will see shortly.)
In the example, the nested subquery was very simple. In general, it may not be possible to directly move the nested subquery relations into the from clause of the outer query. Instead, we create a temporary relation that contains the results of the nested query without the selections using correlation variables from the outer query, and join the temporary table with the outer level query. For instance, a query of the form
where P 1 contains predicates in P2 without selections involving correlation variables, and P 2 reintroduces the selections involving correlation variables (with relations referenced in the predicate appropriately renamed). Here, V contains all attributes that are used in selections with correlation variables in the nested subquery.
In our example, the original query would have been transformed to
The query we rewrote to illustrate creation of a temporary relation can be obtained by simplifying the above transformed query, assuming the number of duplicates of each tuple does not matter.
The process of replacing a nested query by a query with a join (possibly with a temporary relation) is called decorrelation.
Decorrelation is more complicated when the nested subquery uses aggregation, or when the result of the nested subquery is used to test for equality, or when the condition linking the nested subquery to the outer query is not exists, and so on. We do not attempt to give algorithms for the general case, and instead refer you to relevant items in the bibliographical notes.
Optimization of complex nested subqueries is a difficult task, as you can infer from the above discussion, and many optimizers do only a limited amount of decorrelation. It is best to avoid using complex nested subqueries, where possible, since we cannot be sure that the query optimizer will succeed in converting them to a form that can be evaluated efficiently.
I Love your articles guys keep it up. challenger coaching
ReplyDelete