XML:Querying and Transformation

Querying and Transformation

Given the increasing number of applications that use XML to exchange, mediate, and store data, tools for effective management of XML data are becoming increasingly important. In particular, tools for querying and transformation of XML data are essential to extract information from large bodies of XML data, and to convert data between different representations (schemas) in XML. Just as the output of a relational query is a relation, the output of an XML query can be an XML document. As a result, querying and transformation can be combined into a single tool.

Several languages provide increasing degrees of querying and transformation capabilities:

• XPath is a language for path expressions, and is actually a building block for the remaining two query languages.

• XSLT was designed to be a transformation language, as part of the XSL style sheet system, which is used to control the formatting of XML data into HTML or other print or display languages. Although designed for formatting, XSLT can generate XML as output, and can express many interesting queries. Furthermore, it is currently the most widely available language for manipulating XML data.

• XQuery has been proposed as a standard for querying of XML data. XQuery combines features from many of the earlier proposals for querying XML, in particular the language Quilt.

A tree model of XML data is used in all these languages. An XML document is modeled as a tree, with nodes corresponding to elements and attributes. Element nodes can have children nodes, which can be subelements or attributes of the element. Correspondingly, each node (whether attribute or element), other than the root element, has a parent node, which is an element. The order of elements and attributes in the XML document is modeled by the ordering of children of nodes of the tree. The terms parent, child, ancestor, descendant, and siblings are interpreted in the tree model of XML data.

The text content of an element can be modeled as a text node child of the element. Elements containing text broken up by intervening subelements can have multiple text node children. For instance, an element containing “this is a <bold> wonderful </bold> book” would have a subelement child corresponding to the element bold and two text node children corresponding to “this is a” and “book”. Since such structures are not commonly used in database data, we shall assume that elements do not contain both text and subelements.

XPath addresses parts of an XML document by means of path expressions. The lan- guage can be viewed as an extension of the simple path expressions in object-oriented and object-relational databases (See Section 9.5.1).

A path expression in XPath is a sequence of location steps separated by “/” (in- stead of the “.” operator that separates steps in SQL:1999). The result of a path ex- pression is a set of values. For instance, on the document in Figure 10.8, the XPath expression

image

would return the same names, but without the enclosing tags.

Like a directory hierarchy, the initial ’/’ indicates the root of the document. (Note that this is an abstract root “above” <bank-2> that is the document tag.) Path expressions are evaluated from left to right. As a path expression is evaluated, the result of the path at any point consists of a set of nodes from the document.

When an element name, such as customer, appears before the next ’/’, it refers to all elements of the specified name that are children of elements in the current element set. Since multiple children can have the same name, the number of nodes in the node set can increase or decrease with each step. Attribute values may also be accessed, using the “@” symbol. For instance, /bank-2/account/@account-number returns a set of all values of account-number attributes of account elements. By default, IDREF links are not followed; we shall see how to deal with IDREFs later.

XPath supports a number of other features:

• Selection predicates may follow any step in a path, and are contained in square brackets. For example,

image

returns the account numbers of those accounts.

We can test the existence of a subelement by listing it without any comparison operation; for instance, if we removed just “> 400” from the above, the expression would return account numbers of all accounts that have a balance subelement, regardless of its value.

• XPath provides several functions that can be used as part of predicates, including testing the position of the current node in the sibling order and counting the number of nodes matched. For example, the path expression

/bank-2/account/[customer/count()> 2]

returns accounts with more than 2 customers. Boolean connectives and and or can be used in predicates, while the function not(.. .) can be used for negation.

• The function id(“foo”) returns the node (if any) with an attribute of type ID and value “foo”. The function id can even be applied on sets of references, or even strings containing multiple references separated by blanks, such as IDREFS. For instance, the path

/bank-2/account/id(@owner)

returns all customers referred to from the owners attribute of account elements.

• The | operator allows expression results to be unioned. For example, if the DTD of bank-2 also contained elements for loans, with attribute borrower of type IDREFS identifying loan borrower, the expression

/bank-2/account/id(@owner) | /bank-2/loan/id(@borrower)

gives customers with either accounts or loans. However, the | operator cannot be nested inside other operators.

• An XPath expression can skip multiple levels of nodes by using “//”. For in- stance, the expression /bank-2//name finds any name element anywhere under the /bank-2 element, regardless of the element in which it is contained. This example illustrates the ability to find required data without full knowledge of the schema.

• Each step in the path need not select from the children of the nodes in the current node set. In fact, this is just one of several directions along which a step in the path may proceed, such as parents, siblings, ancestors and descendants. We omit details, but note that “//”, described above, is a short form for specifying “all descendants,” while “..” specifies the parent.

XSLT

A style sheet is a representation of formatting options for a document, usually stored outside the document itself, so that formatting is separate from content. For example, a style sheet for HTML might specify the font to be used on all headers, and thus

image

replace a large number of font declarations in the HTML page. The XML Stylesheet Language (XSL) was originally designed for generating HTML from XML, and is thus a logical extension of HTML style sheets. The language includes a general-purpose transformation mechanism, called XSL Transformations (XSLT), which can be used to transform one XML document into another XML document, or to other formats such as HTML.1 XSLT transformations are quite powerful, and in fact XSLT can even act as a query language.

XSLT transformations are expressed as a series of recursive rules, called templates.

In their basic form, templates allow selection of nodes in an XML tree by an XPath expression. However, templates can also generate new XML content, so that selection and content generation can be mixed in natural and powerful ways. While XSLT can be used as a query language, its syntax and semantics are quite dissimilar from those of SQL.

A simple template for XSLT consists of a match part and a select part. Consider this XSLT code:

image

The xsl:template match statement contains an XPath expression that selects one or more nodes. The first template matches customer elements that occur as children of the bank-2 root element. The xsl:value-of statement enclosed in the match statement outputs values from the nodes in the result of the XPath expression. The first template outputs the value of the customer-name subelement; note that the value does not contain the element tag.

Note that the second template matches all nodes. This is required because the de- fault behavior of XSLT on subtrees of the input document that do not match any template is to copy the subtrees to the output document.

XSLT copies any tag that is not in the xsl namespace unchanged to the output. Figure 10.10 shows how to use this feature to make each customer name from our example appear as a subelement of a “<customer>” element, by placing the xsl:value-of statement between <customer> and </customer>.

image

Structural recursion is a key part of XSLT. Recall that elements and subelements naturally form a tree structure. The idea of structural recursion is this: When a tem- plate matches an element in the tree structure, XSLT can use structural recursion to apply template rules recursively on subtrees, instead of just outputting a value. It applies rules recursively by the xsl:apply-templates directive, which appears inside other templates.

For example, the results of our previous query can be placed in a surrounding <customers> element by the addition of a rule using xsl:apply-templates, as in Figure 10.11 The new rule matches the outer “bank” tag, and constructs a result document by applying all other templates to the subtrees appearing within the bank element, but wrapping the results in the given <customers> </customers> element. Without recursion forced by the <xsl:apply-templates/> clause, the template would output <customers> </customers>, and then apply the other templates on the subelements.

In fact, the structural recursion is critical to constructing well-formed XML documents, since XML documents must have a single top-level element containing all other elements in the document.

XSLT provides a feature called keys, which permit lookup of elements by using values of subelements or attributes; the goals are similar to that of the id() function I nXPath, but permits attributes other than the ID attributes to be used. Keys are defined by an xsl:key directive, which has three parts, for example:

<xsl:key name=“acctno” match=“account” use=“account-number”/>

The name attribute is used to distinguish different keys. The match attribute specifies which nodes the key applies to. Finally, the use attribute specifies the expression to be used as the value of the key. Note that the expression need not be unique to an element; that is, more than one element may have the same expression value. In the example, the key named acctno specifies that the account-number subelement of account should be used as a key for that account.

Keys can be subsequently used in templates as part of any pattern through the key function. This function takes the name of the key and a value, and returns the

image

set of nodes that match that value. Thus, the XML node for account “A-401” can be referenced as key(“acctno”, “A-401”).

Keys can be used to implement some types of joins, as in Figure 10.12. The code in the figure can be applied to XML data in the format in Figure 10.1. Here, the key function joins the depositor elements with matching customer and account elements. The result of the query consists of pairs of customer and account elements enclosed within cust-acct elements.

XSLT allows nodes to be sorted. A simple example shows how xsl:sort would be used in our style sheet to return customer elements sorted by name:

image

Here, the xsl:apply-template has a select attribute, which constrains it to be applied only on customer subelements. The xsl:sort directive within the xsl:apply-template element causes nodes to be sorted before they are processed by the next set of templates. Options exist to allow sorting on multiple subelements/attributes, by numeric value, and in descending order.

XQuery

The World Wide Web Consortium (W3C) is developing XQuery, a query language for XML. Our discusssion here is based on a draft of the language standard, so the final standard may differ; however we expect the main features we cover here will not change substantially. The XQuery language derives from an XML query language called Quilt; most of the XQuery features we outline here are part of Quilt. Quilt itself includes features from earlier languages such as XPath, discussed in Section 10.4.1, and two other XML query languages, XQL and XML-QL.

Unlike XSLT, XQuery does not represent queries in XML. Instead, they appear more like SQL queries, and are organized into “FLWR” (pronounced “flower”) expressions comprising four sections: for, let, where, and return. The for section gives a series of variables that range over the results of XPath expressions. When more than one variable is specified, the results include the Cartesian product of the possible values the variables can take, making the for clause similar in spirit to the from clause of an SQL query. The let clause simply allows complicated expressions to be assigned to variable names for simplicity of representation. The where section, like the SQL where clause, performs additional tests on the joined tuples from the for section. Finally, the return section allows the construction of results in XML.

A simple FLWR expression that returns the account numbers for checking accounts is based on the XML document of Figure 10.8, which uses ID and IDREFS:

image

Since this query is simple, the let clause is not essential, and the variable $acctno in the return clause could be replaced with $x/@account-number. Note further that, since the for clause uses XPath expressions, selections may occur within the XPath expression. Thus, an equivalent query may have only for and return clauses:

image

However, the let clause simplifies complex queries.

Path expressions in XQuery may return a multiset, with repeated nodes. The function distinct applied on a multiset, returns a set without duplication. The distinct function can be used even within a for clause. XQuery also provides aggregate functions such as sum and count that can be applied on collections such as sets and multisets. While XQuery does not provide a group by construct, aggregate queries can be written by using nested FLWR constructs in place of grouping; we leave details as an exercise for you. Note also that variables assigned by let clauses may be set- or multiset-valued, if the path expression on the right-hand side returns a set or multi set value.

Joins are specified in XQuery much as they are in SQL. The join of depositor, account and customer elements in Figure 10.1, which we wrote in XSLT in Section 10.4.2, can be written in XQuery this way:

image

The query also introduces the syntax $c/*, which refers to all the children of the node, which is bound to the variable $c. Similarly, $c/text() gives the text content of an element, without the tags.

Path expressions in XQuery are based on path expressions in XPath, but XQuery provides some extensions (which may eventually be added to XPath itself). One of the useful syntax extensions is the operator ->, which can be used to dereference IDREFs, just like the function id(). The operator can be applied on a value of type IDREFS to get a set of elements. It can be used, for example, to find all the accounts associated with a customer, with the ID/IDREFS representation of bank information. We leave details to the reader.

Results can be sorted in XQuery if a sortby clause is included at the end of any ex- pression; the clause specifies how the instances of that expression should be sorted. For instance, this query outputs all customer elements sorted by the name subelement:

image

To sort in descending order, we can use sortby(name descending).

Sorting can be done at multiple levels of nesting. For instance, we can get a nested representation of bank information sorted in customer name order, with accounts of each customer sorted by account number, as follows.

image

XQuery provides a variety of built-in functions, and supports user-defined functions. For instance, the built-in function document(name) returns the root of a named document; the root can then be used in a path expression to access the contents of the document. Users can define functions as illustrated by this function, which returns a list of all balances of a customer with a specified name:

image

XQuery uses the type system of XMLSchema. XQuery also provides functions to convert between types. For instance, number(x) converts a string to a number.

XQuery offers a variety of other features, such as if-then-else clauses, which can be used within return clauses, and existential and universal quantification, which can be used in predicates in where clauses. For example, existential quantification can be expressed using some $e in path satisfies P where path is a path expression, and P is a predicate which can use $e. Universal quantification can be expressed by using every in place of some.

The Application Program Interface

With the wide acceptance of XML as a data representation and exchange format, soft- ware tools are widely available for manipulation of XML data. In fact, there are two standard models for programmatic manipulation of XML, each available for use with a wide variety of popular programming languages.

One of the standard APIs for manipulating XML is the document object model (DOM), which treats XML content as a tree, with each element represented by a node, called a DOMNode. Programs may access parts of the document in a navigational fashion, beginning with the root.

DOM libraries are available for most common programming langauges and are even present in Web browsers, where it may be used to manipulate the document displayed to the user. We outline here some of the interfaces and methods in the Java API for DOM, to give a flavor of DOM. The Java DOM API provides an interface called Node, and interfaces Element and Attribute, which inherit from the Node interface. The Node interface provides methods such as get Parent Node(), get First Child(), and get Next Sibling(), to navigate the DOM tree, starting with the root node. Subelements of an element can be accessed by name get Elements By Tag Name(name), which re- turns a list of all child elements with a specified tag name; individual members of the list can be accessed by the method item(i), which returns the ith element in the list. Attribute values of an element can be accessed by name, using the method get At- tribute(name). The text value of an element is modeled as a Text node, which is a child of the element node; an element node with no subelements has only one such child node. The method get Data() on the Text node returns the text contents. DOM also provides a variety of functions for updating the document by adding and deleting attribute and element children of a node, setting node values, and so on.

Many more details are required for writing an actual DOM program; see the bibliographical notes for references to further information.

DOM can be used to access XML data stored in databases, and an XML database can be built using DOM as its primary interface for accessing and modifying data.

However, the DOM interface does not support any form of declarative querying.

The second programming interface we discuss, the Simple API for XML (SAX) is an event model, designed to provide a common interface between parsers and applications. This API is built on the notion of event handlers, which consists of user-specified functions associated with parsing events. Parsing events correspond to the recognition of parts of a document; for example, an event is generated when the start-tag is found for an element, and another event is generated when the end-tag is found. The pieces of a document are always encountered in order from start to finish. SAX is not appropriate for database applications.

Comments

Popular posts from this blog

XML Document Schema

Extended Relational-Algebra Operations.

Distributed Databases:Concurrency Control in Distributed Databases