Indexing and Hashing:Basic Concepts

By Ahmed Farahat - May 07, 2015

Indexing and Hashing

Many queries reference only a small proportion of the records in a ﬁle. For example, a query like “Find all accounts at the Perryridge branch” or “Find the balance of account number A-101” references only a fraction of the account records. It is in- efﬁcient for the system to read every record and to check the branch-name ﬁeld for the name “Perryridge,” or the account-number ﬁeld for the value A-101. Ideally, the system should be able to locate these records directly. To allow these forms of access, we design additional structures that we associate with ﬁles.

Basic Concepts

An index for a ﬁle in a database system works in much the same way as the index in this textbook. If we want to learn about a particular topic (speciﬁed by a word or a phrase) in this textbook, we can search for the topic in the index at the back of the book, ﬁnd the pages where it occurs, and then read the pages to ﬁnd the information we are looking for. The words in the index are in sorted order, making it easy to ﬁnd the word we are looking for. Moreover, the index is much smaller than the book, further reducing the effort needed to ﬁnd the words we are looking for.

Card catalogs in libraries worked in a similar manner (although they are rarely used any longer). To ﬁnd a book by a particular author, we would search in the author catalog, and a card in the catalog tells us where to ﬁnd the book. To assist us in searching the catalog, the library would keep the cards in alphabetic order by authors, with one card for each author of each book.

Database system indices play the same role as book indices or card catalogs in libraries. For example, to retrieve an account record given the account number, the database system would look up an index to ﬁnd on which disk block the corresponding record resides, and then fetch the disk block, to get the account record.

Keeping a sorted list of account numbers would not work well on very large databases with millions of accounts, since the index would itself be very big; further, even though keeping the index sorted reduces the search time, ﬁnding an account can still be rather time-consuming. Instead, more sophisticated indexing techniques may be used. We shall discuss several of these techniques in this chapter.

There are two basic kinds of indices:

• Ordered indices. Based on a sorted ordering of the values.

• Hash indices. Based on a uniform distribution of values across a range of buckets. The bucket to which a value is assigned is determined by a function, called a hash function.

We shall consider several techniques for both ordered indexing and hashing. No one technique is the best. Rather, each technique is best suited to particular database applications. Each technique must be evaluated on the basis of these factors:

• Access types: The types of access that are supported efﬁciently. Access types can include ﬁnding records with a speciﬁed attribute value and ﬁnding records whose attribute values fall in a speciﬁed range.

• Access time: The time it takes to ﬁnd a particular data item, or set of items, using the technique in question.

• Insertion time: The time it takes to insert a new data item. This value includes the time it takes to ﬁnd the correct place to insert the new data item, as well as the time it takes to update the index structure.

• Deletion time: The time it takes to delete a data item. This value includes the time it takes to ﬁnd the item to be deleted, as well as the time it takes to update the index structure.

• Space overhead: The additional space occupied by an index structure. Pro- vided that the amount of additional space is moderate, it is usually worth- while to sacriﬁce the space to achieve improved performance.

We often want to have more than one index for a ﬁle. For example, libraries maintained several card catalogs: for author, for subject, and for title.

An attribute or set of attributes used to look up records in a ﬁle is called a search key. Note that this deﬁnition of key differs from that used in primary key, candidate key, and superkey. This duplicate meaning for key is (unfortunately) well established in practice. Using our notion of a search key, we see that if there are several indices on a ﬁle, there are several search keys.

Search This Blog

Database Management System course

Indexing and Hashing:Basic Concepts

Basic Concepts

Comments

Post a Comment

Popular posts from this blog

Database System Architectures:Parallel Systems.

DATABASE DESIGN -2 part2

Query Optimization:Estimating Statistics of Expression Results