Query Optimization:Estimating Statistics of Expression Results

Estimating Statistics of Expression Results

The cost of an operation depends on the size and other statistics of its inputs. Given an expression such as a (b c) to estimate the cost of joining a with (b c), we need to have estimates of statistics such as the size of b c.

In this section we first list some statistics about database relations that are stored in database system catalogs, and then show how to use the statistics to estimate statistics on the results of various elational operations.

One thing that will become clear later in this section is that the estimates are not very accurate, since they are based on assumptions that may not hold exactly. A query evaluation plan that has the lowest estimated execution cost may therefore not actually have the lowest actual execution cost. However, real-world experience has shown that even if estimates are not precise, the plans with the lowest estimated costs usually have actual execution costs that are either the lowest actual execution costs, or are close to the lowest actual execution costs.

Catalog Information

The DBMS catalog stores the following statistical information about database relations:

nr , the number of tuples in the relation r.

br , the number of blocks containing tuples of relation r.

lr , the size of a tuple of relation r in bytes.

fr , the blocking factor of relation r — that is, the number of tuples of relation r that fit into one block.

V (A, r), the number of distinct values that appear in the relation r for attribute A. This value is the same as the size of ΠA(r). If A is a key for relation r, V (A, r) is nr .

The last statistic, V (A, r), can also be maintained for sets of attributes, if desired, instead of just for individual attributes. Thus, given a set of attributes, A, V (A, r) is the size of ΠA(r).

If we assume that the tuples of relation r are stored together physically in a file, the following equation holds:

Statistics about indices, such as the heights of B+-tree indices and number of leaf pages in the indices, are also maintained in the catalog.

If we wish to maintain accurate statistics, then, every time a relation is modified, we must also update the statistics. This update incurs a substantial amount of over- head. Therefore, most systems do not update the statistics on every modification. In- stead, they update the statistics during periods of light system load. As a result, the statistics used for choosing a query-processing strategy may not be completely accurate. However, if not too many updates occur in the intervals between the updates of the statistics, the statistics will be sufficiently accurate to provide a good estimation of the relative costs of the different plans.

The statistical information noted here is simplified. Real-world optimizers often maintain further statistical information to improve the accuracy of their cost estimates of evaluation plans. For instance, some databases store the distribution of values for each attribute as a histogram: in a histogram the values for the attribute are divided into a number of ranges, and with each range the histogram associates the number of tuples whose attribute value lies in that range. As an example of a histogram, the range of values for an attribute age of a relation person could be divided into 0 – 9, 10 – 19, ..., 90 – 99 (assuming a maximum age of 99). With each range we store a count of the number of person tuples whose age values lie in that range. With- out such histogram information, an optimizer would have to assume that the distribution of values is uniform; that is, each range has the same count.

Selection Size Estimation

The size estimate of the result of a selection operation depends on the selection predicate. We first consider a single equality predicate, then a single comparison predicate, and finally combinations of predicates.

σA = a(r): If we assume uniform distribution of values (that is, each value appears with equal probability), the selection result can be estimated to have nr /V (A, r) tuples, assuming that the value a appears in attribute A of some record of r. The assumption that the value a in the selection appears in some record is generally true, and cost estimates often make it implicitly. However, it is often not realistic to assume that each value appears with equal probability. The branch-name attribute in the account relation is an example where the assumption is not valid. There is one tuple in the account relation for each account. It is reasonable to expect that the large branches have more ac- counts than smaller branches. Therefore, certain branch-name values appear with greater probability than do others. Despite the fact that the uniform distribution assumption is often not correct, it is a reasonable approximation of reality in many cases, and it helps us to keep our presentation relatively simple.

σAv (r): Consider a selection of the form σAv (r). If the actual value used in the comparison (v) is available at the time of cost estimation, a more ac- curate estimate can be made. The lowest and highest values (min(A, r) and max(A, r)) for the attribute can be stored in the catalog. Assuming that values are uniformly distributed, we can estimate the number of records that will

satisfy the condition A v as 0 if v < min(A, r), as nr if v max(A, r), and

image

otherwise.

In some cases, such as when the query is part of a stored procedure, the value v may not be available when the query is optimized. In such cases, we will assume that approximately one-half the records will satisfy the comparison condition. That is, we assume the result has nr /2 tuples; the estimate maybe very inaccurate, but is the best we can do without any further information.

• Complex selections:

Conjunction: A conjunctive selection is a selection of the form

image

We can estimate the result size of such a selection: For each θi, we estimate the size of the selection σθi (r), denoted by si, as described previously. Thus, the probability that a tuple in the relation satisfies selection condition θi is si/nr .

The preceding probability is called the selectivity of the selection σθi (r). Assuming that the conditions are independent of each other, the probability that a tuple satisfies all the conditions is simply the product of all these probabilities. Thus, we estimate the number of tuples in the full selection as

image

Multiplying this value by nr gives us the estimated number of tuples that satisfy the selection.

D Negation: In the absence of nulls, the result of a selection σ¬θ (r) is simply the tuples of r that are not in σθ (r). We already know how to estimate the number of tuples in σθ (r). The number of tuples in σ¬θ (r) is therefore estimated to be n(r) minus the estimated number of tuples in σθ (r).

We can account for nulls by estimating the number of tuples for which the condition θ would evaluate to unknown, and subtracting that number from the above estimate ignoring nulls. Estimating that number would require extra statistics to be maintained in the catalog.

Join Size Estimation

In this section, we see how to estimate the size of the result of a join.

The Cartesian product r × s contains nr ns tuples. Each tuple of r × s occupies lr + ls bytes, from which we can calculate the size of the Cartesian product.

Estimating the size of a natural join is somewhat more complicated than estimating the size of a selection or of a Cartesian product. Let r(R) and s(S) be relations.

• If R S = — that is, the relations have no attribute in common — then r s is the same as r × s, and we can use our estimation technique for Cartesian products.

• If R S is a key for R, then we know that a tuple of s will join with at most one tuple from r. Therefore, the number of tuples in rs I s no greater than the number of tuples in s. The case where R S is a key for S is symmetric to the case just described. If R S forms a foreign key of S, referencing R, the number of tuples in r s is exactly the same as the number of tuples in s.

• The most difficult case is when R S is a key for neither R nor S. In this case, we assume, as we did for selections, that each value appears with equal probability. Consider a tuple t of r, and assume R S = {A}. We estimate that tuple t produces

image

tuples in r s. These two estimates differ if V (A, r) j= V (A, s). If this situation occurs, there are likely to be dangling tuples that do not participate in the join.

Thus, the lower of the two estimates is probably the more accurate one.

The preceding estimate of join size may be too high if the V (A, r) values for attribute A in r have few values in common with the V (A, s) values for attribute A in s. However, this situation is unlikely to happen in the real world, since dangling tuples either do not exist, or constitute only a small fraction of the tuples, in most real-world relations. More important, the preceding estimate depends on the assumption that each value appears with equal probability. More sophisticated techniques for size estimation have to be used if this assumption does not hold.

We can estimate the size of a theta join r s by rewriting the join as σθ (r × s), and using the size estimates for Cartesian products along with the size estimates for selections, which we saw in Section 14.2.2.

To illustrate all these ways of estimating join sizes, consider the expression

image

 

Also assume that customer-name in depositor is a foreign key on customer.

In our example of depositor customer, customer-name in depositor is a foreign key referencing customer; hence, the size of the result is exactly ndepositor , which is 5000.

Let us now compute the size estimates for depositor customer without using information about foreign keys. Since V (customer-name, depositor) = 2500 and V (customer- name, customer) = 10000, the two estimates we get are 5000 10000/2500 = 20, 000 and 5000 10000/10000 = 5000, and we choose the lower one. In this case, the lower of these estimates is the same as that which we computed earlier from information about foreign keys.

Size Estimation for Other Operations

We outline below how to estimate the sizes of the results of other relational algebra operations.

Projection: The estimated size (number of records or number of tuples) of a projection of the form ΠA(r) is V (A, r), since projection eliminates duplicates. Aggregation: The size of AGF (r) is simply V (A, r), since there is one tuple in AGF (r) for each distinct value of A.

Set operations: If the two inputs to a set operation are selections on the same relation, we can rewrite the set operation as disjunctions, conjunctions, or negations. For example, σθ1 (r) σθ2 (r) can be rewritten as σθ1 θ2 (r). Similarly, we can rewrite intersections as conjunctions, and we can rewrite set difference by using negation, so long as the two relations participating in the set operations are selections on the same relation. We can then use the estimates for selections involving conjunctions, disjunctions, and negation in Section 14.2.2.

If the inputs are not selections on the same relation, we estimate the sizes this way: The estimated size of r s is the sum of the sizes of r and s. The estimated size of r s is the minimum of the sizes of r and s. The estimated size of r s is the same size as r. All three estimates may be inaccurate, but provide upper bounds on the sizes.

image

Estimation of Number of Distinct Values

For selections, the number of distinct values of an attribute (or set of attributes) A in the result of a selection, V (A, σθ (r)), can be estimated in these ways:

image

in A that are only from s. Again, more accurate estimates can be derived by using probability theory, but the above approximations work fairly well.

The estimates of distinct values are straightforward for projections: They are the same in ΠA(r) as in r. The same holds for grouping attributes of aggregation. For results of sum, count, and average, we can assume, for simplicity, that all aggregate values are distinct. For min(A) and max(A), the number of distinct values can be estimated as min(V (A, r),V (G, r)), where G denotes the grouping attributes. We omit details of estimating distinct values for other operations.

Comments

Popular posts from this blog

XML Document Schema

Extended Relational-Algebra Operations.

Distributed Databases:Concurrency Control in Distributed Databases