Indexing and Hashing:Multiple-Key Access

Multiple-Key Access

Until now, we have assumed implicitly that only one index (or hash table) is used to process a query on a relation. However, for certain types of queries, it is advantageous to use multiple indices if they exist.

Using Multiple Single-Key Indices

Assume that the account file has two indices: one for branch-name and one for balance. Consider the following query: “Find all account numbers at the Perryridge branch with balances equal to $1000.” We write

select loan-number

from account

where branch-name = “Perryridge” and balance = 1000

There are three strategies possible for processing this query:

1. Use the index on branch-name to find all records pertaining to the Perryridge branch. Examine each such record to see whether balance = 1000.

2. Use the index on balance to find all records pertaining to accounts with balances of $1000. Examine each such record to see whether branch-name = “Per- ryridge.”

3. Use the index on branch-name to find pointers to all records pertaining to the Perryridge branch. Also, use the index on balance to find pointers to all records pertaining to accounts with a balance of $1000. Take the intersection of these two sets of pointers. Those pointers that are in the intersection point to records pertaining to both Perryridge and accounts with a balance of $1000.

The third strategy is the only one of the three that takes advantage of the existence of multiple indices. However, even this strategy may be a poor choice if all of the following hold:

• There are many records pertaining to the Perryridge branch.

• There are many records pertaining to accounts with a balance of $1000.

• There are only a few records pertaining to both the Perryridge branch and accounts with a balance of $1000.

If these conditions hold, we must scan a large number of pointers to produce a small result. An index structure called a “bitmap index” greatly speeds up the intersection operation used in the third strategy. Bitmap indices are outlined in Section 12.9.4.

Indices on Multiple Keys

An alternative strategy for this case is to create and use an index on a search key (branch-name, balance)—that is, the search key consisting of the branch name concatenated with the account balance. The structure of the index is the same as that of any other index, the only difference being that the search key is not a single attribute, but rather is a list of attributes. The search key can be represented as a tuple of values, of the form (a1,..., an), where the indexed attributes are A1,... , An. The ordering of search-key values is the lexicographic ordering. For example, for the case of two attribute search keys, (a1, a2) < (b1, b2) if either a1 < b1 or a1 = b1 and a2 < b2. Lexicographic ordering is basically the same as alphabetic ordering of words.

The use of an ordered-index structure on multiple attributes has a few short- comings. As an illustration, consider the query

select loan-number

from account

where branch-name < “Perryridge” and balance = 1000

We can answer this query by using an ordered index on the search key (branch-name, balance): For each value of branch-name that is less than “Perryridge” in alphabetic order, the system locates records with a balance value of 1000. However, each record is likely to be in a different disk block, because of the ordering of records in the file, leading to many I/O operations.

The difference between this query and the previous one is that the condition on branch-name is a comparison condition, rather than an equality condition.

To speed the processing of general multiple search-key queries (which can involve one or more comparison operations), we can use several special structures. We shall consider the grid file in Section 12.9.3. There is another structure, called the R-tree, that can be used for this purpose. The R-tree is an extension of the B+-tree to handle in- dexing on multiple dimensions. Since the R-tree is used primarily with geographical data types, we describe the structure in Chapter 23.

Grid Files

Figure 12.32 shows part of a grid file for the search keys branch-name and balance on the account file. The two-dimensional array in the figure is called the grid array, and the one-dimensional arrays are called linear scales. The grid file has a single grid array, and one linear scale for each search-key attribute.

Search keys are mapped to cells in this way. Each cell in the grid array has a pointer to a bucket that contains the search-key values and pointers to records. Only some of the buckets and pointers from the cells are shown in the figure. To conserve space, multiple elements of the array can point to the same bucket. The dotted boxes in the figure indicate which cells point to the same bucket.

Suppose that we want to insert in the grid-file index a record whose search-key value is (“Brighton”, 500000). To find the cell to which the key is mapped, we independently locate the row and column to which the cell belongs.

We first use the linear scales on branch-name to locate the row of the cell to which the search key maps. To do so, we search the array to find the least element that is greater than “Brighton”. In this case, it is the first element, so the search key maps to the row marked 0. If it were the ith element, the search key would map to row i 1.

If the search key is greater than or equal to all elements in the linear scale, it maps to

image

the final row. Next, we use the linear scale on balance to find out similarly to which column the search key maps. In this case, the balance 500000 maps to column 6.

Thus, the search-key value (“Brighton”, 500000) maps to the cell in row 0, column 6. Similarly, (“Downtown”, 60000) would map to the cell in row 1 column 5. Both cells point to the same bucket (as indicated by the dotted box), so, in both cases, the system stores the search-key values and the pointer to the record in the bucket labeled Bj in the figure.

To perform a lookup to answer our example query, with the search condition of

branch-name < “Perryridge” and balance = 1000

we find all rows that can contain branch names less than “Perryridge”, using the linear scale on branch-name. In this case, these rows are 0, 1, and 2. Rows 3 and beyond contain branch names greater than or equal to “Perryridge”. Similarly, we find that only column 1 can contain a balance value of 1000. In this case, only column 1 satisfies this condition. Thus, only the cells in column 1, rows 0, 1, and 2, can contain entries that satisfy the search condition.

We therefore look up all entries in the buckets pointed to from these three cells. In this case, there are only two buckets, since two of the cells point to the same bucket, as indicated by the dotted boxes in the figure. The buckets may contain some search keys that do not satisfy the required condition, so each search key in the buckets must be tested again to see whether it satisfies the search condition. We have to examine only a small number of buckets, however, to answer this query.

We must choose the linear scales in such a way that the records are uniformly distributed across the cells. When a bucket—call it A—becomes full and an entry has to be inserted in it, the system allocates an extra bucket, B. If more than one cell points to A, the system changes the cell pointers so that some point to A and others to B.

The entries in bucket A and the new entry are then redistributed between A and B according to the cells to which they map. If only one cell points to bucket A, B becomes an overflow bucket for A. To improve performance in such a situation, we must reorganize the grid file, with an expanded grid array and expanded linear scales. The process is much like the expansion of the bucket address table in extensible hashing, and is left for you to do as an exercise.

It is conceptually simple to extend the grid-file approach to any number of search keys. If we want our structure to be used for queries on n keys, we construct an n-dimensional grid array with n linear scales.

The grid structure is suitable also for queries involving one search key. Consider this query:

select *

from account

where branch-name = “Perryridge”

The linear scale on branch-name tells us that only cells in row 3 can satisfy this condition. Since there is no condition on balance, we examine all buckets pointed to by cells in row 3 to find entries pertaining to Perryridge. Thus, we can use a grid-file index on two search keys to answer queries on either search key by itself, as well as to answer queries on both search keys. Thus, a single grid-file index can serve the role of three separate indices. If each index were maintained separately, the three together would occupy more space, and the cost of updating them would be high.

Grid files provide a significant decrease in processing time for multiple-key queries. However, they impose a space overhead (the grid directory can become large), as well as a performance overhead on record insertion and deletion. Further, it is hard to choose partitioning ranges for the keys such that the distribution of records is uniform. If insertions to the file are frequent, reorganization will have to be carried out periodically, and that can have a high cost.

Bitmap Indices

Bitmap indices are a specialized type of index designed for easy querying on multiple keys, although each bitmap index is built on a single key.

For bitmap indices to be used, records in a relation must be numbered sequentially, starting from, say, 0. Given a number n, it must be easy to retrieve the record numbered n. This is particularly easy to achieve if records are fixed in size, and allocated on consecutive blocks of a file. The record number can then be translated easily into a block number and a number that identifies the record within the block.

Consider a relation r, with an attribute A that can take on only one of a small number (for example, 2 to 20) values. For instance, a relation customer-info may have an attribute gender, which can take only values m (male) or f (female). Another example would be an attribute income-level, where income has been broken up into 5 levels:

L1: $0 9999, L2: $10, 000 19, 999, L3: 20, 000 39, 999, L4: 40, 000 74, 999, and L5: 75, 000 − ∞. Here, the raw data can take on many values, but a data analyst has split the values into a small number of ranges to simplify analysis of the data.

Bitmap Index Structure

A bitmap is simply an array of bits. In its simplest form, a bitmap index on the attribute A of relation r consists of one bitmap for each value that A can take. Each bitmap has as many bits as the number of records in the relation. The ith bit of the bitmap for value vj is set to 1 if the record numbered i has the value vj for attribute A. All other bits of the bitmap are set to 0.

In our example, there is one bitmap for the value m and one for f. The ith bit of the bitmap for m is set to 1 if the gender value of the record numbered i is m. All other bits of the bitmap for m are set to 0. Similarly, the bitmap for f has the value 1 for bits corresponding to records with the value f for the gender attribute; all other bits have the value 0. Figure 12.33 shows an example of bitmap indices on a relation customer-info.

We now consider when bitmaps are useful. The simplest way of retrieving all records with value m (or value f) would be to simply read all records of the relation and select those records with value m (or f, respectively). The bitmap index doesn’t really help to speed up such a selection.

image

In fact, bitmap indices are useful for selections mainly when there are selections on multiple keys. Suppose we create a bitmap index on attribute income-level, which we described earlier, in addition to the bitmap index on gender.

Consider now a query that selects women with income in the range 10, 000 19, 999. This query can be expressed as σgender=fincome-level=L2(r). To evaluate this selection, we fetch the bitmaps for gender value f and the bitmap for income-level value L2, and perform an intersection (logical-and) of the two bitmaps. In other words, we compute a new bitmap where bit i has value 1 if the ith bit of the two bitmaps are both 1, and has a value 0 otherwise. In the example in Figure 12.33, the intersection of the bitmap for gender = f (01101) and the bitmap for income-level = L1 (10100) gives the bitmap 00100.

Since the first attribute can take 2 values, and the second can take 5 values, we would expect only about 1 in 10 records, on an average, to satisfy a combined condition on the two attributes. If there are further conditions, the fraction of records satisfying all the conditions is likely to be quite small. The system can then compute the query result by finding all bits with value 1 in the intersection bitmap, and retrieving the corresponding records. If the fraction is large, scanning the entire relation would remain the cheaper alternative.

Another important use of bitmaps is to count the number of tuples satisfying a given selection. Such queries are important for data analysis. For instance, if we wish to find out how many women have an income level L2, we compute the intersection of the two bitmaps, and then count the number of bits that are 1 in the intersection bitmap. We can thus get the desired result from the bitmap index, without even accessing the relation.

Bitmap indices are generally quite small compared to the actual relation size. Records are typically at least tens of bytes to hundreds of bytes long, whereas a single bit represents the record in a bitmap. Thus the space occupied by a single bitmap is usually less than 1 percent of the space occupied by the relation. For instance, if the record size for a given relation is 100 bytes, then the space occupied by a single bitmap would be 1 of 1 percent of the space occupied by the relation. If an attribute A of the relation can take on only one of 8 values, a bitmap index on attribute A would consist of 8 bitmaps, which together occupy only 1 percent of the size of the relation.

Deletion of records creates gaps in the sequence of records, since shifting records (or record numbers) to fill gaps would be extremely expensive. To recognize deleted records, we can store an existence bitmap, in which bit i is 0 if record i does not exist and 1 otherwise. We will see the need for existence bitmaps in Section 12.9.4.2. Insertion of records should not affect the sequence numbering of other records. Therefore, we can do insertion either by appending records to the end of the file or by replacing deleted records.

Efficient Implementation of Bitmap Operations

We can compute the intersection of two bitmaps easily by using a for loop: the ith iteration of the loop computes the and of the ith bits of the two bitmaps. We can speed up computation of the intersection greatly by using bit-wise and instructions supported by most computer instruction sets. A word usually consists of 32 or 64 bits, depending on the architecture of the computer. A bit-wise and instruction takes two words as input and outputs a word where each bit is the logical and of the bits in corresponding positions of the input words. What is important to note is that a single bit-wise and instruction can compute the intersection of 32 or 64 bits at once.

If a relation had 1 million records, each bitmap would contain 1 million bits, or equivalently 128 Kbytes. Only 31,250 instructions are needed to compute the intersection of two bitmaps for our relation, assuming a 32-bit word length. Thus, computing bitmap intersections is an extremely fast operation.

Just like bitmap intersection is useful for computing the and of two conditions, bitmap union is useful for computing the or of two conditions. The procedure for bitmap union is exactly the same as for intersection, except we use bit-wise or instructions instead of bit-wise and instructions.

The complement operation can be used to compute a predicate involving the negation of a condition, such as not (income-level = L1). The complement of a bitmap is generated by complementing every bit of the bitmap (the complement of 1 is 0 and the complement of 0 is 1). It may appear that not (income-level = L1) can be implemented by just computing the complement of the bitmap for income level L1. If some records have been deleted, however, just computing the complement of a bitmap is not sufficient. Bits corresponding to such records would be 0 in the original bitmap, but would become 1 in the complement, although the records don’t exist. A similar problem also arises when the value of an attribute is null. For instance, if the value of income-level is null, the bit would be 0 in the original bitmap for value L1, and 1 in the complement bitmap.

To make sure that the bits corresponding to deleted records are set to 0 in the result, the complement bitmap must be intersected with the existence bitmap to turn off the bits for deleted records. Similarly, to handle null values, the complement bitmap must also be intersected with the complement of the bitmap for the value null.1

Counting the number of bits that are 1 in a bitmap can be done fast by a clever technique. We can maintain an array with 256 entries, where the ith entry stores the

1. Handling predicates such as is unknown would cause further complications, which would in general require use of an extra bitmap to to track which operation results are unknown.

clip_image008number of bits that are 1 in the binary representation of i. Set the total count initially to 0. We take each byte of the bitmap, use it to index into this array, and add the stored count to the total count. The number of addition operations would be 1 of the number of tuples, and thus the counting process is very efficient. A large array (using 216 = 65536 entries), indexed by pairs of bytes, would give even higher speedup, but at a higher storage cost.

Bitmaps and B+-Trees

Bitmaps can be combined with regular B+-tree indices for relations where a few at- tribute values are extremely common, and other values also occur, but much less frequently. In a B+-tree index leaf, for each value we would normally maintain a list of all records with that value for the indexed attribute. Each element of the list would be a record identifier, consisting of at least 32 bits, and usually more. For a value that occurs in many records, we store a bitmap instead of a list of records.

Suppose a particular value vi occurs in 1 of the records of a relation. Let N be the number of records in the relation, and assume that a record has a 64-bit number identifying it. The bitmap needs only 1 bit per record, or N bits in total. In contrast, the list representation requires 64 bits per record where the value occurs, or 64

N/16 = 4N bits. Thus, a bitmap is preferable for representing the list of records for value vi. In our example (with a 64-bit record identifier), if fewer than 1 in 64 records have a particular value, the list representation is preferable for identifying records with that value, since it uses fewer bits than the bitmap representation. If more than 1 in 64 records have that value, the bitmap representation is preferable.

Thus, bitmaps can be used as a compressed storage mechanism at the leaf nodes of B+-trees, for those values that occur very frequently.

Comments

Popular posts from this blog

XML Document Schema

Extended Relational-Algebra Operations.

Distributed Databases:Concurrency Control in Distributed Databases