Summary of Query Optimization

Summary

• Given a query, there are generally a variety of methods for computing the answer. It is the responsibility of the system to transform the query as entered by the user into an equivalent query that can be computed more efﬁciently. The process of ﬁnding a good strategy for processing a query, is called query optimization.

• The evaluation of complex queries involves many accesses to disk. Since the transfer of data from disk is slow relative to the speed of main memory and the CPU of the computer system, it is worthwhile to allocate a considerable amount of processing to choose a method that minimizes disk accesses.

• The strategy that the database system chooses for evaluating an operation de- pends on the size of each relation and on the distribution of values within columns. So that they can base the strategy choice on reliable information, database systems may store statistics for each relation r. These statistics include The number of tuples in the relation r

The size of a record (tuple) of relation r in bytes The number of distinct values that appear in the relation r for a particular attribute

• These statistics allow us to estimate the sizes of the results of various operations, as well as the cost of executing the operations. Statistical information about relations is particularly useful when several indices are available to assist in the processing of a query. The presence of these structures has a significant inﬂuence on the choice of a query-processing strategy.

• Each relational-algebra expression represents a particular sequence of operations. The ﬁrst step in selecting a query-processing strategy is to ﬁnd a relation- al-algebra expression that is equivalent to the given expression and is estimated to cost less to execute.

• There are a number of equivalence rules that we can use to transform an expression into an equivalent one. We use these rules to generate systematically all expressions equivalent to the given query.

• Alternative evaluation plans for each expression can be generated by similar rules, and the cheapest plan across all expressions can be chosen. Several optimization techniques are available to reduce the number of alternative expressions and plans that need to be generated.

• We use heuristics to reduce the number of plans considered, and thereby to reduce the cost of optimization. Heuristic rules for transforming relational- algebra queries include “Perform selection operations as early as possible,” “Perform projections early,” and “Avoid Cartesian products.”

• Materialized views can be used to speed up query processing. Incremental view maintenance is needed to efﬁciently update materialized views when the underlying relations are modiﬁed. The differential of an operation can be computed by means of algebraic expressions involving differentials of the in- puts of the operation. Other issues related to materialized views include how to optimize queries by making use of available materialized views, and how to select views to be materialized.

Review Terms

Exercises

Clustering indices may allow faster access to data than a nonclustering index affords. When must we create a nonclustering index, despite the advantages of a clustering index? Explain your answer.

SQL allows relations with duplicates (Chapter 4).

a. Deﬁne versions of the basic relational-algebra operations σ, Π, ×, , −, ∪, and ∩ that work on relations with duplicates, in a way consistent with SQL.

b. Check which of the equivalence rules 1 through 7.b hold for the multiset version of the relational-algebra deﬁned in part a.

∗∗ Show that, with n relations, there are (2(n−1))!/(n−1)! different join orders.

Hint: A complete binary tree is one where every internal node has exactly two children. Use the fact that the number of different complete binary trees with n leaf nodes is 1 2(n−1) .

If you wish, you can derive the formula for the number of complete binary trees with n nodes from the formula for the number of binary trees with n nodes. The number of binary trees with n nodes is 1 2n ; this number is n+1 ( n )

known as the Catalan number, and its derivation can be found in any standard textbook on data structures or algorithms.

∗∗ Show that the lowest-cost join order can be computed in time O(3n). As- sume that you can store and look up information about a set of relations (such as the optimal join order for the set, and the cost of that join order) in constant time. (If you ﬁnd this exercise difﬁcult, at least show the looser time bound of O(22n).)

Show that, if only left-deep join trees are considered, as in the System R opti- mizer, the time taken to ﬁnd the most efﬁcient join order is around n2n. Assume that there is only one interesting sort order.

A set of equivalence rules is said to be complete if, whenever two expressions are equivalent, one can be derived from the other by a sequence of uses of the equivalence rules. Is the set of equivalence rules that we considered in Sec- tion 14.3.1 complete? Hint: Consider the equivalence σ3=5(r)= { }.

Decorrelation:

a. Write a nested query on the relation account to ﬁnd for each branch with name starting with “B”, all accounts with the maximum balance at the branch.

b. Rewrite the preceding query, without using a nested subquery; in other words, decorrelate the query.

c. Give a procedure (similar that that described in Section 14.4.5) for decorre- lating such queries.

Describe how to incrementally maintain the results of the following operations, on both insertions and deletions.

a. Union and set difference

b. Left outer join

Give an example of an expression deﬁning a materialized view and two situations (sets of statistics for the input relations and the differentials) such that incremental view maintenance is better than recomputation in one situation, and recomputation is better in the other situation.

Bibliographical Notes

The seminal work of Selinger et al. [1979] describes access-path selection in the Sys- tem R optimizer, which was one of the earliest relational-query optimizers. Graefe and McKenna [1993] describe Volcano, an equivalence-rule based query optimizer. Query processing in Starburst is described in Haas et al. [1989]. Query optimization in Oracle is brieﬂy outlined in Oracle [1997].

Estimation of statistics of query results, such as result size, is addressed by Ioanni- dis and Poosala [1995], Poosala et al. [1996], and Ganguly et al. [1996], among others.

Nonuniform distributions of values causes problems for estimation of query size and cost. Cost-estimation techniques that use histograms of value distributions have been proposed to tackle the problem. Ioannidis and Christodoulakis [1993], Ioannidis and Poosala [1995], and Poosala et al. [1996] present results in this area.

Exhaustive searching of all query plans is impractical for optimization of joins involving many relations, and techniques based on randomized searching, which do not examine all alternatives, have been proposed. Ioannidis and Wong [1987], Swami and Gupta [1988], and Ioannidis and Kang [1990] present results in this area.

Parametric query-optimization techniques have been proposed by Ioannidis et al. [1992] and Ganguly [1998], to handle query processing when the selectivity of query parameters is not known at optimization time. A set of plans — one for each of several different query selectivities — is computed, and is stored by the optimizer, at compile time. One of these plans is chosen at run time, on the basis of the actual selectivities, avoiding the cost of full optimization at run time.

Klug [1982] was an early work on optimization of relational-algebra expressions with aggregate functions. More recent work in this area includes Yan and Larson [1995] and Chaudhuri and Shim [1994]. Optimization of queries containing outer joins is described in Rosenthal and Reiner [1984], Galindo-Legaria and Rosenthal [1992], and Galindo-Legaria [1994].

The SQL language poses several challenges for query optimization, including the presence of duplicates and nulls, and the semantics of nested subqueries. Extension of relational algebra to duplicates is described in Dayal et al. [1982]. Optimization of nested subqueries is discussed in Kim [1982], Ganski and Wong [1987], Dayal [1987],

and more recently, in Seshadri et al. [1996].

When queries are generated through views, more relations often are joined than is necessary for computation of the query. A collection of techniques for join minimization has been grouped under the name tableau optimization. The notion of a tableau was introduced by Aho et al. [1979b] and Aho et al. [1979a], and was further extended by Sagiv and Yannakakis [1981]. Ullman [1988] andMaier [1983] provide a textbook coverage of tableaux.

Sellis [1988] and Roy et al. [2000] describe multiquery optimization, which is the problem of optimizing the execution of several queries as a group. If an entire group of queries is considered, it is possible to discover common subexpressions that can be evaluated once for the entire group. Finkelstein [1982] and Hall [1976] consider optimization of a group of queries and the use of common subexpressions. Dalvi et al.

[2001] discuss optimization issues in pipelining with limited buffer space combined with sharing of common subexpressions.

Query optimization can make use of semantic information, such as functional dependencies and other integrity constraints. Semantic query-optimization in relational databases is covered by King [1981], Chakravarthy et al. [1990], and in the context of aggregation, by Sudarshan and Ramakrishnan [1991].

Query-processing and optimization techniques for Datalog, in particular techniques to handle queries on recursive views, are described in Bancilhon and Ramakrishnan [1986], Beeri and Ramakrishnan [1991], Ramakrishnan et al. [1992c], Srivastava et al. [1995] and Mumick et al. [1996]. Query processing and optimization techniques for object-oriented databases are discussed in Maier and Stein [1986], Beech [1988], Bertino and Kim [1989], and Blakeley et al. [1993].

Blakeley et al. [1986], Blakeley et al. [1989], and Grifﬁn and Libkin [1995] describe techniques for maintenance of materialized views. Gupta and Mumick [1995] pro-vides a survey of materialized view maintenance. Optimization of materialized view maintenance plans is described by Vista [1998] and Mistry et al. [2001]. Query optimization in the presence of materialized views is addressed by Larson and Yang [1985], Chaudhuri et al. [1995], Dar et al. [1996], and Roy et al. [2000]. Index selection and materialized view selection are addressed by Ross et al. [1996], Labio et al.

[1997], Gupta [1997], Chaudhuri and Narasayya [1997], and Roy et al. [2000].

Search This Blog

Database Management System course