Data log

Data log

Data log is a nonprocedural query language based on the logic-programming language Prolog. As in the relational calculus, a user describes the information desired without giving a specific procedure for obtaining that information. The syntax of Data log resembles that of Prolog. However, the meaning of Data log programs is defined in a purely declarative manner, unlike the more procedural semantics of Prolog, so Data log simplifies writing simple queries and makes query optimization easier.

Basic Structure

A Data log program consists of a set of rules. Before presenting a formal definition of Data log rules and their formal meaning, we consider examples. Consider a Data log rule to define a view relation v1 containing account numbers and balances for accounts at the Perryridge branch with a balance of over $700:

image

Datalog rules define views; the preceding rule uses the relation account, and de- fines the view relation v1. The symbol :– is read as “if,” and the comma separating the “account(A, “Perryridge”, B)” from “B > 700” is read as “and.” Intuitively, the rule is understood as follows:

image

Suppose that the relation account is as shown in Figure 5.4. Then, the view relationv1 contains the tuples in Figure 5.5.

To retrieve the balance of account number A-217 in the view relation v1, we can write the following query:

image

image

In general, we need more than one rule to define a view relation. Each rule defines a set of tuples that the view relation must contain. The set of tuples in the view relation is then defined as the union of all these sets of tuples. The following Data log program specifies the interest rates for accounts:

interest-rate(A, 5) :– account(A, N , B), B < 10000

interest-rate(A, 6) :– account(A, N , B), B >= 10000

The program has two rules defining a view relation interest-rate, whose attributes are the account number and the interest rate. The rules say that, if the balance is less than $10000, then the interest rate is 5 percent, and if the balance is greater than or equal to $10000, the interest rate is 6 percent.

Datalog rules can also use negation. The following rules define a view relation c that contains the names of all customers who have a deposit, but have no loan, at the bank:

c(N ) :– depositor(N ,A), not is-borrower(N )

is-borrower(N ) :– borrower(N , L),

Prolog and most Data log implementations recognize attributes of a relation by position and omit attribute names. Thus, Data log rules are compact, compared to SQL

image

queries. However, when relations have a large number of attributes, or the order or number of attributes of relations may change, the positional notation can be cumbersome and error prone. It is not hard to create a variant of Data log syntax using named attributes, rather than positional attributes. In such a system, the Datalog rule defining v1 can be written as

image

Translation between the two forms can be done without significant effort, given the relation schema.

Syntax of Data log Rules

Now that we have informally explained rules and queries, we can formally define their syntax; we discuss their meaning in Section 5.2.3. We use the same conventions as in the relational algebra for denoting relation names, attribute names, and constants (such as numbers or quoted strings). We use uppercase (capital) letters and words starting with uppercase letters to denote variable names, and lowercase let- ters and words starting with lowercase letters to denote relation names and attribute names. Examples of constants are 4, which is a number, and “John,” which is a string;

X and Name are variables. A positive literal has the form

image

Literals involving arithmetic operations are treated specially. For example, the lit- eral B > 700, although not in the syntax just described, can be conceptually un- derstood to stand for > (B, 700), which is in the required syntax, and where > is a relation.

But what does this notation mean for arithmetic operations such as “>”? The re- lation > (conceptually) contains tuples of the form (x, y) for every possible pair of values x, y such that x > y. Thus, (2, 1) and (5, 33) are both tuples in >. Clearly, the (conceptual) relation > is infinite. Other arithmetic operations (such as >, =, + or ) are also treated conceptually as relations. For example, A = B + C stands conceptually for +(B, C, A), where the relation + contains every tuple (x, y, z) such that z = x + y.

A fact is written in the form

image

and denotes that the tuple (v1, v2,... , vn) is in relation p. A set of facts for a relation can also be written in the usual tabular notation. A set of facts for the relations in a database schema is equivalent to an instance of the database schema. Rules are built out of literals and have the form

image

where each Li is a (positive or negative) literal. The literal p(t1, t2,..., tn) is referred to as the head of the rule, and the rest of the literals in the rule constitute the body of the rule.

A Datalog program consists of a set of rules; the order in which the rules are writ- ten has no significance. As mentioned earlier, there may be several rules defining a relation.

Figure 5.6 shows a Datalog program that defines the interest on each account in the Perryridge branch. The first rule of the program defines a view relation interest, whose attributes are the account number and the interest earned on the account. It uses the relation account and the view relation interest-rate. The last two rules of the program are rules that we saw earlier.

A view relation v1 is said to depend directly on a view relation v2 if v2 is used in the expression defining v1. In the above program, view relation interest depends directly on relations interest-rate and account. Relation interest-rate in turn depends directly on account.

A view relation v1 is said to depend indirectly on view relation v2 if there is a sequence of intermediate relations i1, i2,... , in, for some n, such that v1 depends directly on i1, i1 depends directly on i2, and so on till in1 depends on in.

In the example in Figure 5.6, since we have a chain of dependencies from interest to interest-rate to account, relation interest also depends indirectly on account.

Finally, a view relation v1 is said to depend on view relation v2 if v1 either depends directly or indirectly on v2.

A view relation v is said to be recursive if it depends on itself. A view relation that is not recursive is said to be nonrecursive.

Consider the program in Figure 5.7. Here, the view relation empl depends on itself (becasue of the second rule), and is therefore recursive. In contrast, the program in Figure 5.6 is nonrecursive.

image

image

Semantics of Nonrecursive Datalog

We consider the formal semantics of Datalog programs. For now, we consider only programs that are nonrecursive. The semantics of recursive programs is somewhat more complicated; it is discussed in Section 5.2.6. We define the semantics of a pro- gram by starting with the semantics of a single rule.

Semantics of a Rule

A ground instantiation of a rule is the result of replacing each variable in the rule by some constant. If a variable occurs multiple times in a rule, all occurrences of the variable must be replaced by the same constant. Ground instantiations are often simply called instantiations.

Our example rule defining v1, and an instantiation of the rule, are:

image

Here, variable A was replaced by “A-217,” and variable B by 750.

A rule usually has many possible instantiations. These instantiations correspond to the various ways of assigning values to each variable in the rule.

Suppose that we are given a rule R,

image

and a set of facts I for the relations used in the rule (I can also be thought of as a database instance). Consider any instantiation Rl of rule R:

image

where each literal li is either of the form qi(vi,1, v1,2,... , vi,ni ) or of the form not qi(vi,1, v1,2, ..., vi,ni ), and where each vi and each vi,j is a constant.

We say that the body of rule instantiation Rl is satisfied in I if

1. For each positive literal qi(vi,1,... , vi,ni ) in the body of Rl, the set of facts I contains the fact q(vi,1,... , vi,ni ).

2. For each negative literal not qj (vj,1,..., vj,nj ) in the body of Rl, the set of facts I does not contain the fact qj (vj,1,..., vj,nj ).

image

The fact account(“A-217”, “Perryridge”, 750) is in the set of facts I. Further, 750 is greater than 700, and hence conceptually (750, 700) is in the relation “>”. Hence, the body of the rule instantiation is satisfied in I. There are other possible instantiations of R, and using them we find that infer(R, I) has exactly the set of facts for v1 that appears in Figure 5.8.

Semantics of a Program

When a view relation is defined in terms of another view relation, the set of facts in the first view depends on the set of facts in the second one. We have assumed, in this section, that the definition is nonrecursive; that is, no view relation depends (directly or indirectly) on itself. Hence, we can layer the view relations in the following way, and can use the layering to define the semantics of the program:

• A relation is in layer 1 if all relations used in the bodies of rules defining it are stored in the database.

• A relation is in layer 2 if all relations used in the bodies of rules defining it either are stored in the database or are in layer 1.

• In general, a relation p is in layer i +1 if (1) it is not in layers 1, 2,... , i, and (2) all relations used in the bodies of rules defining p either are stored in the database or are in layers 1, 2,... , i.

Consider the program in Figure 5.6. The layering of view relations in the program appears in Figure 5.9. The relation account is in the database. Relation interest-rate is

image

in level 1, since all the relations used in the two rules defining it are in the database. Relation perryridge-account is similarly in layer 1. Finally, relation interest is in layer 2, since it is not in layer 1 and all the relations used in the rule defining it are in the database or in layers lower than 2.

We can now define the semantics of a Datalog program in terms of the layering of view relations. Let the layers in a given program be 1, 2,... , n. Let Ri denote the set of all rules defining view relations in layer i.

• We define I0 to be the set of facts stored in the database, and define I1 as

I1 = I0 infer (R1, I0)

• We proceed in a similar fashion, defining I2 in terms of I1 and R2, and so on, using the following definition:

Ii+1 = Ii infer (Ri+1, Ii)

• Finally, the set of facts in the view relations defined by the program (also called the semantics of the program) is given by the set of facts In corresponding to the highest layer n.

For the program in Figure 5.6, I0 is the set of facts in the database, and I1 is the set of facts in the database along with all facts that we can infer from I0 using the rules for relations interest-rate and perryridge-account. Finally, I2 contains the facts in I1 along with the facts for relation interest that we can infer from the facts in I1 by the rule defining interest. The semantics of the program — that is, the set of those facts that are in each of the view relations — is defined as the set of facts I2.

Recall that, in Section 3.5.3, we saw how to define the meaning of nonrecursive relational-algebra views by a technique known as view expansion. View expansion

can be used with nonrecursive Datalog views as well; conversely, the layering technique described here can also be used with relational-algebra views.

Safety

It is possible to write rules that generate an infinite number of answers. Consider the rule

image

Since the relation defining > is infinite, this rule would generate an infinite number of facts for the relation gt, which calculation would, correspondingly, take an infinite amount of time and space.

The use of negation can also cause similar problems. Consider the rule:

not-in-loan(L, B, A) :– not loan(L, B, A)

The idea is that a tuple (loan-number, branch-name, amount) is in view relation not-in- loan if the tuple is not present in the loan relation. However, if the set of possible ac- count numbers, branch-names, and balances is infinite, the relation not-in-loan would be infinite as well.

Finally, if we have a variable in the head that does not appear in the body, we may get an infinite number of facts where the variable is instantiated to different values.

So that these possibilities are avoided, Data log rules are required to satisfy the following safety conditions:

1. Every variable that appears in the head of the rule also appears in a nonarithmetic positive literal in the body of the rule.

2. Every variable appearing in a negative literal in the body of the rule also appears in some positive literal in the body of the rule.

If all the rules in a nonrecursive Data log program satisfy the preceding safety conditions, then all the view relations defined in the program can be shown to be finite, as long as all the database relations are finite. The conditions can be weakened some- what to allow variables in the head to appear only in an arithmetic literal in the body in some cases. For example, in the rule

image

we can see that if relation q is finite, then so is p, according to the properties of addition, even though variable A appears in only an arithmetic literal.

Relational Operations in Datalog

Nonrecursive Datalog expressions without arithmetic operations are equivalent in expressive power to expressions using the basic operations in relational algebra (, , ×, σ, Π and ρ). We shall not formally prove this assertion here. Rather, we shall show through examples how the various relational-algebra operations can be expressed in Datalog. In all cases, we define a view relation called query to illustrate the operations.

We have already seen how to do selection by using Datalog rules. We perform projections simply by using only the required attributes in the head of the rule. To project attribute account-name from account, we use

image

Finally, we note that with the positional notation used in Datalog, the renaming operator ρ is not needed. A relation can occur more than once in the rule body, but instead of renaming to give distinct names to the relation occurrences, we can use different variable names in the different occurrences.

It is possible to show that we can express any nonrecursive Datalog query without arithmetic by using the relational-algebra operations. We leave this demonstration as an exercise for you to carry out. You can thus establish the equivalence of the basic operations of relational algebra and nonrecursive Datalog without arithmetic operations.

Certain extensions to Data log support the extended relational update operations of insertion, deletion, and update. The syntax for such operations varies from implementation to implementation. Some systems allow the use of + or in rule heads to denote relational insertion and deletion. For example, we can move all accounts at the Perryridge branch to the Johnstown branch by executing

+ account(A, “Johnstown”, B) :– account(A, “Perryridge”, B)

account(A, “Perryridge”, B) :– account(A, “Perryridge”, B)

Some implementations of Datalog also support the aggregation operation of ex- tended relational algebra. Again, there is no standard syntax for this operation.

Recursion in Datalog

Several database applications deal with structures that are similar to tree data structures. For example, consider employees in an organization. Some of the employees are managers. Each manager manages a set of people who report to him or her. But

image

each of these people may in turn be managers, and they in turn may have other people who report to them. Thus employees may be organized in a structure similar to a tree.

Suppose that we have a relation schema

Manager -schema = (employee-name, manager -name)

Let manager be a relation on the preceding schema.

Suppose now that we want to find out which employees are supervised, directly or indirectly by a given manager — say, Jones. Thus, if the manager of Alon is Barinsky, and the manager of Barinsky is Estovar, and the manager of Estovar is Jones, then Alon, Barinsky, and Estovar are the employees controlled by Jones. People often write programs to manipulate tree data structures by recursion. Using the idea of recursion, we can define the set of employees controlled by Jones as follows. The people supervised by Jones are (1) people whose manager is Jones and (2) people whose manager is supervised by Jones. Note that case (2) is recursive.

We can encode the preceding recursive definition as a recursive Datalog view, called empl-jones:

empl-jones(X) :– manager(X, “Jones” )

empl-jones(X) :– manager(X, Y ), empl-jones(Y )

The first rule corresponds to case (1); the second rule corresponds to case (2). The view empl-jones depends on itself because of the second rule; hence, the preceding Datalog program is recursive. We assume that recursive Datalog programs contain no rules with negative literals. The reason will become clear later. The bibliographical

image

image

notes refer to papers that describe where negation can be used in recursive Datalog programs.

The view relations of a recursive program that contains a set of rules R are defined to contain exactly the set of facts I computed by the iterative procedure Datalog Fixpoint in Figure 5.10. The recursion in the Datalog program has been turned into an iteration in the procedure. At the end of the procedure, infer(R,I) = I, and I is called a fixed point of the program.

Consider the program defining empl-jones, with the relation manager, as in Figure 5.11. The set of facts computed for the view relation empl-jones in each iteration appears in Figure 5.12. In each iteration, the program computes one more level of employees under Jones and adds it to the set empl-jones. The procedure terminates when there is no change to the set empl-jones, which the system detects by finding I = Old I. Such a termination point must be reached, since the set of managers and employees is finite. On the given manager relation, the procedure Datalog-Fixpoint terminates after iteration 4, when it detects that no new facts have been inferred.

You should verify that, at the end of the iteration, the view relation empl-jones contains exactly those employees who work under Jones. To print out the names of the employees supervised by Jones defined by the view, you can use the query ? empl-jones(N )

To understand procedure Datalog-Fixpoint, we recall that a rule infers new facts from a given set of facts. Iteration starts with a set of facts I set to the facts in the database. These facts are all known to be true, but there may be other facts that are true as well.1 Next, the set of rules R in the given Datalog program is used to infer what facts are true, given that facts in I are true. The inferred facts are added to I, and the rules are used again to make further inferences. This process is repeated until no new facts can be inferred.

For safe Datalog programs, we can show that there will be some point where no more new facts can be derived; that is, for some k, Ik+1 = Ik . At this point, then, we have the final set of true facts. Further, given a Datalog program and a database, the fixed-point procedure infers all the facts that can be inferred to be true.

1. The word “fact” is used in a technical sense to note membership of a tuple in a relation. Thus, in the Datalog sense of “fact,” a fact may be true (the tuple is indeed in the relation) or false (the tuple is not in the relation).

If a recursive program contains a rule with a negative literal, the following prolem can arise. Recall that when we make an inference by using a ground instantiation of a rule, for each negative literal notq in the rule body we check that q is not present in the set of facts I. This test assumes that q cannot be inferred later. However, in the fixed-point iteration, the set of facts I grows in each iteration, and even if q is not present in I at one iteration, it may appear in I later. Thus, we may have made an inference in one iteration that can no longer be made at an earlier iteration, and the inference was incorrect. We require that a recursive program should not contain negative literals, in order to avoid such problems.

Instead of creating a view for the employees supervised by a specific manager Jones, we can create a more general view relation empl that contains every tuple (X, Y ) such that X is directly or indirectly managed by Y , using the following program (also shown in Figure 5.7):

image

which gives the same set of values for X as the view empl-jones. Most Datalog implementations have sophisticated query optimizers and evaluation engines that can run the preceding query at about the same speed they could evaluate the view empl-jones.

The view empl defined previously is called the transitive closure of the relation manager. If the relation manager were replaced by any other binary relation R, the preceding program would define the transitive closure of R.

The Power of Recursion

Datalog with recursion has more expressive power than Datalog without recursion. In other words, there are queries on the database that we can answer by using recur- sion, but cannot answer without using it. For example, we cannot express transitive closure in Datalog without using recursion (or for that matter, in SQL or QBE without recursion). Consider the transitive closure of the relation manager. Intuitively, a fixed number of joins can find only those employees that are some (other) fixed number of levels down from any manager (we will not attempt to prove this result here). Since any given nonrecursive query has a fixed number of joins, there is a limit on how many levels of employees the query can find. If the number of levels of employees in the manager relation is more than the limit of the query, the query will miss some levels of employees. Thus, a nonrecursive Datalog program cannot express transitive closure.

An alternative to recursion is to use an external mechanism, such as embedded SQL, to iterate on a nonrecursive query. The iteration in effect implements the fixed point loop of Figure 5.10. In fact, that is how such queries are implemented on data base systems that do not support recursion. However, writing such queries by iteration is more complicated than using recursion, and evaluation by recursion can be optimized to run faster than evaluation by iteration.

The expressive power provided by recursion must be used with care. It is relatively easy to write recursive programs that will generate an infinite number of facts, as this program illustrates:

image

The program generates number(n) for all positive integers n, which is clearly infinite, and will not terminate. The second rule of the program does not satisfy the safety condition in Section 5.2.4. Programs that satisfy the safety condition will terminate, even if they are recursive, provided that all database relations are finite. For such programs, tuples in view relations can contain only constants from the database, and hence the view relations must be finite. The converse is not true; that is, there are programs that do not satisfy the safety conditions, but that do terminate.

Recursion in Other Languages

The SQL:1999 standard supports a limited form of recursion, using the with recursive clause. Suppose the relation manager has attributes emp and mgr. We can find every pair (X, Y ) such that X is directly or indirectly managed by Y , using this SQL:1999 query:

image

Recall that the with clause is used to define a temporary view whose definition is available only to the query where it is defined. The additional keyword recursive specifies that the view is recursive. The SQL definition of the view empl above is equivalent to the Datalog version we saw in Section 5.2.6.

The procedure Datalog-Fixpoint iteratively uses the function infer(R,I) to compute what facts are true, given a recursive Datalog program. Although we considered only the case of Datalog programs without negative literals, the procedure can also be used on views defined in other languages, such as SQL or relational algebra, provided that the views satisfy the conditions described next. Regardless of the language used to define a view V, the view can be thought of as being defined by an expression EV that, given a set of facts I, returns a set of facts EV (I) for the view relation V. Given a set of view definitions R (in any language), we can define a function infer(R,I) that returns I lV R EV (I). The preceding function has the same form as the infer function for Datalog.

A view V is said to be monotonic if, given any two sets of facts I1 and I2 such that I1 I2, then EV (I1) EV (I2), where EV is the expression used to define V .

Similarly, the function infer is said to be monotonic if  I1 I2 infer(R, I1) inf er(R, I2) Thus, if infer is monotonic, given a set of facts I0 that is a subset of the true facts, we can be sure that all facts in infer(R, I0) are also true. Using the same reasoning as in Section 5.2.6, we can then show that procedure Datalog-Fixpoint is sound (that is, it computes only true facts), provided that the function infer is monotonic.

Relational-algebra expressions that use only the operators Π, σ, ×, , , or ρ are monotonic. Recursive views can be defined by using such expressions.

However, relational expressions that use the operator are not monotonic. For example, let manager 1 and manager 2 be relations with the same schema as the manager relation. Let

image

Consider the expression manager 1 manager 2. Now the result of the preceding ex- pression on I1 is (“Barinsky”, “Estovar”), whereas the result of the expression on I2 is the empty relation. But I1 I2; hence, the expression is not monotonic. Expressions using the grouping operation of extended relational algebra are also nonmonotonic.

The fixed-point technique does not work on recursive views defined with non monotonic expressions. However, there are instances where such views are useful, particularly for defining aggregates on “part – subpart” relationships. Such relation-ships define what subparts make up each part. Subparts themselves may have further subparts, and so on; hence, the relationships, like the manager relationship, have a natural recursive structure. An example of an aggregate query on such a structure would be to compute the total number of subparts of each part. Writing this query in Datalog or in SQL (without procedural extensions) would require the use of a recursive view on a nonmonotonic expression. The bibliographic notes provide references to research on defining such views.

It is possible to define some kinds of recursive queries without using views. For example, extended relational operations have been proposed to define transitive closure, and extensions to the SQL syntax to specify (generalized) transitive closure have been proposed. However, recursive view definitions provide more expressive power than do the other forms of recursive queries.

Comments

Popular posts from this blog

Database System Architectures:Parallel Systems.

DATABASE DESIGN -2 part2

Database System Architectures:Network Types