Summary of Query Processing

Summary

• The ﬁrst action that the system must perform on a query is to translate the query into its internal form, which (for relational database systems) is usually based on the relational algebra. In the process of generating the internal form of the query, the parser checks the syntax of the user’s query, veriﬁes that the relation names appearing in the query are names of relations in the database, and so on. If the query was expressed in terms of a view, the parser replaces all references to the view name with the relational-algebra expression to compute the view.

• Given a query, there are generally a variety of methods for computing the answer. It is the responsibility of the query optimizer to transform the query as entered by the user into an equivalent query that can be computed more efﬁciently. Chapter 14 covers query optimization.

• We can process simple selection operations by performing a linear scan, by doing a binary search, or by making use of indices. We can handle complex selections by computing unions and intersections of the results of simple se- lections.

• We can sort relations larger than memory by the external merge – sort algorithm.

• Queries involving a natural join may be processed in several ways, depending on the availability of indices and the form of physical storage for the relations.

If the join result is almost as large as the Cartesian product of the two relations, a block nested-loop join strategy may be advantageous.

If indices are available, the indexed nested-loop join can be used.

If the relations are sorted, a merge join may be desirable. It may be advantageous to sort a relation prior to join computation (so as to allow use of the merge join strategy).

The hash join algorithm partitions the relations into several pieces, such that each piece of one of the relations ﬁts in memory. The partitioning is carried out with a hash function on the join attributes, so that correspond- ing pairs of partitions can be joined independently.

• Duplicate elimination, projection, set operations (union, intersection and difference), and aggregation can be done by sorting or by hashing.

• Outer join operations can be implemented by simple extensions of join algorithms.

• Hashing and sorting are dual, in the sense that any operation such as du- plicate elimination, projection, aggregation, join, and outer join that can be implemented by hashing can also be implemented by sorting, and vice versa; that is, any operation that can be implemented by sorting can also be implemented by hashing.

• An expression can be evaluated by means of materialization, where the sys- tem computes the result of each subexpression and stores it on disk, and then uses it to compute the result of the parent expression.

• Pipelining helps to avoid writing the results of many subexpressions to disk, by using the results in the parent expression even as they are being generated.

Review Terms

Exercises

Why is it not desirable to force users to make an explicit choice of a query- processing strategy? Are there cases in which it is desirable for users to be aware of the costs of competing query-processing strategies? Explain your answer.

Consider the following SQL query for our bank database:

Write an efﬁcient relational-algebra expression that is equivalent to this query. Justify your choice.

What are the advantages and disadvantages of hash indices relative to B+-tree indices? How might the type of index available inﬂuence the choice of a query- processing strategy?

Assume (for simplicity in this exercise) that only one tuple ﬁts in a block and memory holds at most 3 page frames. Show the runs created on each pass of the sort-merge algorithm, when applied to sort the following tuples on the ﬁrst attribute: (kangaroo, 17), (wallaby, 21), (emu, 1), (wombat, 13), (platypus, 3), (lion, 8), (warthog, 4), (zebra, 11), (meerkat, 6), (hyena, 9), (hornbill, 2), (baboon, 12).

Let relations r1(A, B, C) and r2(C, D, E) have the following properties: r1 has 20,000 tuples, r2 has 45,000 tuples, 25 tuples of r1 ﬁt on one block, and 30 tuples of r2 ﬁt on one block. Estimate the number of block accesses required, using each of the following join strategies for r1 r2:

a. Nested-loop join

b. Block nested-loop join

c. Merge join

d. Hash join

Design a variant of the hybrid merge – join algorithm for the case where both relations are not physically sorted, but both have a sorted secondary index on the join attributes.

The indexed nested-loop join algorithm described in Section 13.5.3 can be inef- ﬁcient if the index is a secondary index, and there are multiple tuples with the same value for the join attributes. Why is it inefﬁcient? Describe a way, using sorting, to reduce the cost of retrieving tuples of the inner relation. Under what conditions would this algorithm be more efﬁcient than hybrid merge – join?

Estimate the number of block accesses required by your solution to Exercise 13.6 for r1 r2, where r1 and r2 are as deﬁned in Exercise

13.5. Let r and s be relations with no indices, and assume that the relations are not sorted. Assuming inﬁnite memory, what is the lowest cost way (in terms of I/O operations) to compute r algorithm?

s? What is the amount of memory required for this Suppose that a B+-tree index on branch-city is available on relation branch, and that no other index is available. List different ways to handle the following selections that involve negation?

The hash join algorithm as described in Section 13.5.5 computes the natural join of two relations. Describe how to extend the hash join algorithm to compute the natural left outer join, the natural right outer join and the natural full outer join. (Hint: Keep extra information with each tuple in the hash index, to detect whether any tuple in the probe relation matches the tuple in the hash index.) Try out your algorithm on the customer and depositor relations.

Write pseudocode for an iterator that implements indexed nested-loop join, where the outer relation is pipelined. Use the standard iterator functions in your pseudocode. Show what state information the iterator must maintain be- tween calls.

Design sorting based and hashing algorithms for computing the division op- eration.

Bibliographical Notes

A query processor must parse statements in the query language, and must translate them into an internal form. Parsing of query languages differs little from parsing of traditional programming languages. Most compiler texts, such as Aho et al. [1986], cover the main parsing techniques, and present optimization from a programming- language point of view.

Knuth [1973] presents an excellent description of external sorting algorithms, including an optimization that can create initial runs that are (on the average) twice the size of memory. Based on performance studies conducted in the mid-1970s, database systems of that period used only nested-loop join and merge join. These stud- ies, which were related to the development of System R, determined that either the nested-loop join or merge join nearly always provided the optimal join method (Blas- gen and Eswaran [1976]); hence, these two were the only join algorithms implemented in System R. The System R study, however, did not include an analysis of hash join algorithms. Today, hash joins are considered to be highly efﬁcient.

Hash join algorithms were initially developed for parallel database systems. Hash join techniques are described in Kitsuregawa et al. [1983], and extensions including hybrid hash join are described in Shapiro [1986]. Zeller and Gray [1990] and Davison and Graefe [1994] describe hash join techniques that can adapt to the available memory, which is important in systems where multiple queries may be running at the same time. Graefe et al. [1998] describes the use of hash joins and hash teams, which allow pipelining of hash-joins by using the same partitioning for all hash-joins in a pipeline sequence, in the Microsoft SQL Server.

Graefe [1993] presents an excellent survey of query-evaluation techniques. An earlier survey of query-processing techniques appears in Jarke and Koch [1984].

Query processing in main memory database is covered by DeWitt et al. [1984] and Whang and Krishnamurthy [1990]. Kim [1982] and Kim [1984] describe join strategies and the optimal use of available main memory.

Search This Blog

Database Management System course