Relational-Database Design:Overall Database Design Process

Overall Database Design Process

So far we have looked at detailed issues about normal forms and normalization. In this section we study how normalization fits into the overall database design process.

Earlier in the chapter, starting in Section 7.4, we assumed that a relation schema R is given, and proceeded to normalize it. There are several ways in which we could have come up with the schema R:

1. R could have been generated when converting E-R diagram to a set of tables.

2. R could have been a single relation containing all attributes that are of interest.

The normalization process then breaks up R into smaller relations.

3. R could have been the result of some ad hoc design of relations, which we then test to verify that it satisfies a desired normal form.

In the rest of this section we examine the implications of these approaches. We also examine some practical issues in database design, including denormalization for performance and examples of bad design that are not detected by normalization.

E-R Model and Normalization

When we carefully define an E-R diagram, identifying all entities correctly, the tables generated from the E-R diagram should not need further normalization. However, there can be functional dependencies between attributes of an entity. For instance, suppose an employee entity had attributes department-number and department-address, and there is a functional dependency department-number department-address. We would then need to normalize the relation generated from employee.

Most examples of such dependencies arise out of poor E-R diagram design. In the above example, if we did the E-R diagram correctly, we would have created a department entity with attribute department-address and a relationship between employee and department. Similarly, a relationship involving more than two entities may not be in a desirable normal form. Since most relationships are binary, such cases are relatively rare. (In fact, some E-R diagram variants actually make it difficult or impossible to

specify nonbinary relations.)

Functional dependencies can help us detect poor E-R design. If the generated relations are not in desired normal form, the problem can be fixed in the E-R diagram.

That is, normalization can be done formally as part of data modeling. Alternatively, normalization can be left to the designer’s intuition during E-R modeling, and can be done formally on the relations generated from the E-R model.

The Universal Relation Approach

The second approach to database design is to start with a single relation schema containing all attributes of interest, and decompose it. One of our goals in choosing a decomposition was that it be a lossless-join decomposition. To consider losslessness, we assumed that it is valid to talk about the join of all the relations of the decomposed database.

Consider the database of Figure 7.20, showing a decomposition of the loan-info relation. The figure depicts a situation in which we have not yet determined the amount of loan L-58, but wish to record the remainder of the data on the loan. If we compute the natural join of these relations, we discover that all tuples referring to loan L-58 disappear. In other words, there is no loan-info relation corresponding to the relations of Figure 7.20. Tuples that disappear when we compute the join are dangling tuples (see Section 6.2.1). Formally, let r1(R1), r2(R2),... , rn(Rn) be a set of relations. A

image

Dangling tuples may occur in practical database applications. They represent in- complete information, as they do in our example, where we wish to store data about a loan that is still in the process of being negotiated. The relation r1 r2 ··· rn is called a universal relation, since it involves all the attributes in the universe defined by R1 R2 ∪ ··· ∪ Rn.

The only way that we can write a universal relation for the example of Figure 7.20 is to include null values in the universal relation. We saw in Chapter 3 that null values present several difficulties. Because of them, it may be better to view the relations of the decomposed design as representing the database, rather than as the univer- sal relation whose schema we decomposed during the normalization process. (The bibliographical notes discuss research on null values and universal relations.)

Note that we cannot enter all incomplete information into the database of Figure 7.20 without resorting to null values. For example, we cannot enter a loan number unless we know at least one of the following:

• The customer name

• The branch name

• The amount of the loan

Thus, a particular decomposition defines a restricted form of incomplete information that is acceptable in our database.

The normal forms that we have defined generate good database designs from the point of view of representation of incomplete information. Returning again to the example of Figure 7.20, we would not want to allow storage of the following fact: “There is a loan (whose number is unknown) to Jones in the amount of $100.” This is because

loan-number customer-name amount

and therefore the only way that we can relate customer-name and amount is through loan-number. If we do not know the loan number, we cannot distinguish this loan from other loans with unknown numbers.

In other words, we do not want to store data for which the key attributes are un- known. Observe that the normal forms that we have defined do not allow us to store that type of information unless we use null values. Thus, our normal forms allow representation of acceptable incomplete information via dangling tuples, while prohibiting the storage of undesirable incomplete information.

Another consequence of the universal relation approach to database design is that attribute names must be unique in the universal relation. We cannot use name to refer to both customer-name and to branch-name. It is generally preferable to use unique names, as we have done. Nevertheless, if we defined our relation schemas directly, rather than in terms of a universal relation, we could obtain relations on schemas such as the following for our banking example:

branch-loan (name, number) loan-customer (number, name) amt (number, amount) Observe that, with the preceding relations, expressions such as branch-loan loan- customer are meaningless. Indeed, the expression branch-loan loan-customer finds loans made by branches to customers who have the same name as the name of the branch.

In a language such as SQL, however, a query involving branch-loan and loan-customer must remove ambiguity in references to name by prefixing the relation name. In such environments, the multiple roles for name (as branch name and as customer name) are less troublesome and may be simpler to use.

We believe that using the unique-role assumption — that each attribute name has a unique meaning in the database — is generally preferable to reusing of the same name in multiple roles. When the unique-role assumption is not made, the database designer must be especially careful when constructing a normalized relational-data-base design.

Denormalization for Performance

Occasionally database designers choose a schema that has redundant information; that is, it is not normalized. They use the redundancy to improve performance for specific applications. The penalty paid for not using a normalized schema is the extra work (in terms of coding time and execution time) to keep redundant data consistent.

For instance, suppose that the name of an account holder has to be displayed along with the account number and balance, every time the account is accessed. In our normalized schema, this requires a join of account with depositor.

One alternative to computing the join on the fly is to store a relation containing all the attributes of account and depositor. This makes displaying the account information faster. However, the balance information for an account is repeated for every person who owns the account, and all copies must be updated by the application, when- ever the account balance is updated. The process of taking a normalized schema and making it non-normalized is called denormalization, and designers use it to tune performance of systems to support time-critical operations.

A better alternative, supported by many database systems today, is to use the normalized schema, and additionally store the join or account and depositor as a materialized view. (Recall that a materialized view is a view whose result is stored in the database, and brought up to date when the relations used in the view are updated.) Like denormalization, using materialized view does have space and time overheads; however, it has the advantage that keeping the view up to date is the job of the database system, not the application programmer.

Other Design Issues

There are some aspects of database design that are not addressed by normalization, and can thus lead to bad database design. We give examples here; obviously, such designs should be avoided.

Consider a company database, where we want to store earnings of companies in different years. A relation earnings(company-id, year, amount) could be used to store the earnings information. The only functional dependency on this relation is company-id, year amount, and the relation is in BCNF.

An alternative design is to use multiple relations, each storing the earnings for a different year. Let us say the years of interest are 2000, 2001, and 2002; we would then have relations of the form earnings-2000, earnings-2001, earnings-2002, all of which are on the schema (company-id, earnings). The only functional dependency here on each relation would be company-id earnings, so these relations are also in BCNF.

However, this alternative design is clearly a bad idea — we would have to create a new relation every year, and would also have to write new queries every year, to take each new relation into account. Queries would also be more complicated since they may have to refer to many relations.

Yet another way of representing the same data is to have a single relation company- year(company-id, earnings-2000, earnings-2001, earnings-2002). Here the only functional dependencies are from company-id to the other attributes, and again the relation is in BCNF. This design is also a bad idea since it has problems similar to the previous design — namely we would have to modify the relation schema and write new queries, every year. Queries would also be more complicated, since they may have to refer to many attributes.

Representations such as those in the company-year relation, with one column for each value of an attribute, are called crosstabs; they are widely used in spreadsheets and reports and in data analysis tools. While such representations are useful for display to users, for the reasons just given, they are not desirable in a database design. SQL extensions have been proposed to convert data from a normal relational representation to a crosstab, for display.

Comments

Popular posts from this blog

Database System Architectures:Parallel Systems.

DATABASE DESIGN -2 part2

Database System Architectures:Network Types