XML Document Schema

Databases have schemas, which constrain what information can be stored in the database and the data types of the stored information. In contrast, by default, XML documents can be created without any associated schema: an element may then have any subelement or attribute. While such freedom may occasionally be acceptable given the self-describing nature of the data format, it is not generally useful when XML documents must be processed automatically as part of an application, or even when large amounts of related data are to be formatted in XML.

Here, we describe the document-oriented schema mechanism included as part of the XML standard, the Document Type Definition, as well as the more recently defined XMLSchema.

Document Type Definition

The document type definition (DTD) is an optional part of an XML document. The main purpose of a DTD is much like that of a schema: to constrain and type the information present in the document. However, the DTD does not in fact constrain types in the sense of basic types like integer or string. Instead, it only constrains the appearance of subelements and attributes within an element. The DTD is primarily a list of rules for what pattern of subelements may appear within an element. Figure 10.6 shows a part of an example DTD for a bank information document; the XML document in Figure 10.1 conforms to this DTD.

Each declaration is in the form of a regular expression for the subelements of an element. Thus, in the DTD in Figure 10.6, a bank element consists of one or more account, customer, or depositor elements; the | operator specifies “or” while the + operator specifies “one or more.” Although not shown here, the * operator is used to specify “zero or more,” while the ? operator is used to specify an optional element (that is, “zero or one”).

The account element is defined to contain subelements account-number, branch-name, and balance (in that order). Similarly, customer and depositor have the attributes in their schema defined as subelements.

Finally, the elements account-number, branch-name, balance, customer-name, customer-street, and customer-city are all declared to be of type #PCDATA. The keyword #PCDATA indicates text data; it derives its name, historically, from “parsed character data.” Two other special type declarations are EMPTY, which says that the element has no contents, and ANY, which says that there is no constraint on the subelements of the element; that is, any elements, even those not mentioned in the DTD, can occur as subelements of the element. The absence of a declaration for an element is equivalent to explicitly declaring the type as ANY.
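As a rough illustration of the declarations just described, the following sketch places a DTD in the internal subset of a small document. It is reconstructed from the description in the text rather than copied from Figure 10.6, and the sample values (A-101, Downtown, Johnson, and so on) are made up for illustration.

```xml
<?xml version="1.0"?>
<!DOCTYPE bank [
  <!-- bank: one or more account, customer, or depositor elements, in any mix -->
  <!ELEMENT bank ((account | customer | depositor)+)>
  <!-- subelements must appear in exactly the listed order -->
  <!ELEMENT account (account-number, branch-name, balance)>
  <!ELEMENT customer (customer-name, customer-street, customer-city)>
  <!ELEMENT depositor (customer-name, account-number)>
  <!-- leaf elements hold text only -->
  <!ELEMENT account-number (#PCDATA)>
  <!ELEMENT branch-name (#PCDATA)>
  <!ELEMENT balance (#PCDATA)>
  <!ELEMENT customer-name (#PCDATA)>
  <!ELEMENT customer-street (#PCDATA)>
  <!ELEMENT customer-city (#PCDATA)>
]>
<bank>
  <account>
    <account-number>A-101</account-number>
    <branch-name>Downtown</branch-name>
    <balance>500</balance>
  </account>
  <customer>
    <customer-name>Johnson</customer-name>
    <customer-street>Alma</customer-street>
    <customer-city>Palo Alto</customer-city>
  </customer>
  <depositor>
    <customer-name>Johnson</customer-name>
    <account-number>A-101</account-number>
  </depositor>
</bank>
```

A validating parser would reject this document if, say, balance appeared before branch-name, but it would accept any text at all as the content of balance.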

The allowable attributes for each element are also declared in the DTD. Unlike subelements, no order is imposed on attributes. Attributes may be specified to be of type CDATA, ID, IDREF, or IDREFS; the type CDATA simply says that the attribute contains character data, while the other three are not so simple; they are explained in more detail shortly. For instance, the following line from a DTD specifies that element account has an attribute named acct-type, of type CDATA, with default value checking:

<!ATTLIST account acct-type CDATA "checking" >

Attributes must have a type declaration and a default declaration. The default declaration can consist of a default value for the attribute or #REQUIRED, meaning that a value must be specified for the attribute in each element, or #IMPLIED, meaning that no default value has been provided. If an attribute has a default value, for every element that does not specify a value for the attribute, the default value is filled in automatically when the XML document is read. An attribute of type ID provides a unique identifier for the element; a value that occurs in an ID attribute of an element must not occur in any other element in the same document. At most one attribute of an element is permitted to be of type ID.
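The three forms of default declaration can be seen side by side in a small sketch. The account element is declared EMPTY here only for brevity, and the branch-name and remark attributes are hypothetical additions used to show #REQUIRED and #IMPLIED; only acct-type comes from the text.

```xml
<?xml version="1.0"?>
<!DOCTYPE account [
  <!ELEMENT account EMPTY>
  <!ATTLIST account
      acct-type   CDATA "checking"
      branch-name CDATA #REQUIRED
      remark      CDATA #IMPLIED>
  <!-- acct-type: default value supplied when the attribute is omitted -->
  <!-- branch-name: every account element must give a value -->
  <!-- remark: optional, and no default is supplied -->
]>
<account branch-name="Downtown"/>
```

An application reading this document through a parser that processes the DTD would see acct-type with the value checking on the element, even though the document does not spell it out, and would see no remark attribute at all.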

An attribute of type IDREF is a reference to an element; the attribute must contain a value that appears in the ID attribute of some element in the document. The type IDREFS allows a list of references, separated by spaces.

Figure 10.7 shows an example DTD in which customer-account relationships are represented by ID and IDREFS attributes, instead of depositor records. The account elements use account-number as their identifier attribute; to do so, account-number has been made an attribute of account instead of a subelement. The customer elements have a new identifier attribute called customer-id. Additionally, each customer element contains an attribute accounts, of type IDREFS, which is a list of identifiers of accounts that are owned by the customer. Each account element has an attribute owners, of type IDREFS, which is a list of owners of the account.

Figure 10.8 shows an example XML document based on the DTD in Figure 10.7. Note that we use a different set of accounts and customers from our earlier example, in order to illustrate the IDREFS feature better.
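The arrangement described for Figures 10.7 and 10.8 might look roughly like the sketch below. It follows the attribute names given in the text (account-number, owners, customer-id, accounts) but is not a copy of either figure; the root element name, the identifier values, and the customer data are assumptions chosen for illustration.

```xml
<?xml version="1.0"?>
<!DOCTYPE bank [
  <!ELEMENT bank ((account | customer)+)>
  <!ELEMENT account (branch-name, balance)>
  <!ATTLIST account
      account-number ID     #REQUIRED
      owners         IDREFS #REQUIRED>
  <!ELEMENT customer (customer-name, customer-street, customer-city)>
  <!ATTLIST customer
      customer-id ID     #REQUIRED
      accounts    IDREFS #REQUIRED>
  <!ELEMENT branch-name (#PCDATA)>
  <!ELEMENT balance (#PCDATA)>
  <!ELEMENT customer-name (#PCDATA)>
  <!ELEMENT customer-street (#PCDATA)>
  <!ELEMENT customer-city (#PCDATA)>
]>
<bank>
  <!-- each owners value must match some ID attribute in the document -->
  <account account-number="A-401" owners="C100 C102">
    <branch-name>Downtown</branch-name>
    <balance>500</balance>
  </account>
  <account account-number="A-402" owners="C102">
    <branch-name>Perryridge</branch-name>
    <balance>900</balance>
  </account>
  <customer customer-id="C100" accounts="A-401">
    <customer-name>Joe</customer-name>
    <customer-street>Monroe</customer-street>
    <customer-city>Madison</customer-city>
  </customer>
  <customer customer-id="C102" accounts="A-401 A-402">
    <customer-name>Mary</customer-name>
    <customer-street>Erin</customer-street>
    <customer-city>Newark</customer-city>
  </customer>
</bank>
```

Account A-401 is jointly owned, which is why both its owners attribute and the accounts attribute of C102 carry space-separated lists.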

The ID and IDREF attributes serve the same role as reference mechanisms in object-oriented and object-relational databases, permitting the construction of complex data relationships.

Document type definitions are strongly connected to the document formatting heritage of XML. Because of this, they are unsuitable in many ways for serving as the type structure of XML for data processing applications. Nevertheless, a tremendous number of data exchange formats are being defined in terms of DTDs, since they were part of the original standard. Here are some of the limitations of DTDs as a schema mechanism.

• Individual text elements and attributes cannot be further typed. For instance, the element balance cannot be constrained to be a positive number. The lack of such constraints is problematic for data processing and exchange applications, which must then contain code to verify the types of elements and attributes.

• It is difficult to use the DTD mechanism to specify unordered sets of subelements. Order is seldom important for data exchange (unlike document layout, where it is crucial). While the combination of alternation (the | operation) and a repetition operation (* or +), as in Figure 10.6, permits the specification of unordered collections of tags, it is much more difficult to specify that each tag may only appear once.

• There is a lack of typing in IDs and IDREFs. Thus, there is no way to specify the type of element to which an IDREF or IDREFS attribute should refer. As a result, the DTD in Figure 10.7 does not prevent the “owners” attribute of an account element from referring to other accounts, even though this makes no sense.
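To make the second limitation concrete, here is a sketch of what a DTD needs in order to say that account has its three subelements in any order, each exactly once. The content model is an illustration written for this purpose, not something taken from the figures, and the sample values are made up.

```xml
<?xml version="1.0"?>
<!DOCTYPE account [
  <!-- any order, each exactly once: every permutation is spelled out -->
  <!ELEMENT account (
      (account-number, ((branch-name, balance) | (balance, branch-name)))
    | (branch-name, ((account-number, balance) | (balance, account-number)))
    | (balance, ((account-number, branch-name) | (branch-name, account-number)))
  )>
  <!ELEMENT account-number (#PCDATA)>
  <!ELEMENT branch-name (#PCDATA)>
  <!ELEMENT balance (#PCDATA)>
]>
<account>
  <balance>500</balance>
  <account-number>A-101</account-number>
  <branch-name>Downtown</branch-name>
</account>
```

With n subelements the content model has to cover n! orderings, so the declaration grows quickly. The alternatives are grouped by their first element so that a parser can pick a branch without looking ahead.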

XML Schema

An effort to redress many of these DTD deficiencies resulted in a more sophisticated schema language, XMLSchema. We present here an example of XMLSchema, and list some areas in which it improves on DTDs, without giving full details of XMLSchema’s syntax.

Figure 10.9 shows how the DTD in Figure 10.6 can be represented by XMLSchema. The first element is the root element bank, whose type is declared later. The example then defines the types of elements account, customer, and depositor. Observe the use of types xsd:string and xsd:decimal to constrain the types of data elements. Finally, the example defines the type BankType as containing zero or more occurrences of each of account, customer, and depositor. XMLSchema can define the minimum and maximum number of occurrences of subelements by using minOccurs and maxOccurs. The default for both minimum and maximum occurrences is 1, so these have to be explicitly specified to allow zero or more accounts, customers, and depositors.
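For orientation, a document intended to be checked against the schema of Figure 10.9 might look like the sketch below. The file name bank.xsd and the sample values are assumptions; the xsi:noNamespaceSchemaLocation attribute is one common way of pointing a validator at a schema that has no target namespace.

```xml
<?xml version="1.0"?>
<bank xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:noNamespaceSchemaLocation="bank.xsd">
  <account>
    <account-number>A-101</account-number>
    <branch-name>Downtown</branch-name>
    <!-- must be a valid xsd:decimal; arbitrary text would be rejected -->
    <balance>500.00</balance>
  </account>
  <depositor>
    <customer-name>Johnson</customer-name>
    <account-number>A-101</account-number>
  </depositor>
</bank>
```

Because BankType is written with xsd:sequence, all account elements come before any customer elements, which in turn come before any depositor elements. That is somewhat stricter than the DTD pattern of Figure 10.6, which lets the three kinds of element be mixed freely.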

Among the benefits that XMLSchema offers over DTDs are these:

• It allows user-defined types to be created.

• It allows the text that appears in elements to be constrained to specific types, such as numeric types in specific formats or even more complicated types such as lists or unions.


<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="bank" type="BankType"/>
  <xsd:element name="account">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="account-number" type="xsd:string"/>
        <xsd:element name="branch-name" type="xsd:string"/>
        <xsd:element name="balance" type="xsd:decimal"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
  <xsd:element name="customer">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="customer-name" type="xsd:string"/>
        <xsd:element name="customer-street" type="xsd:string"/>
        <xsd:element name="customer-city" type="xsd:string"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
  <xsd:element name="depositor">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="customer-name" type="xsd:string"/>
        <xsd:element name="account-number" type="xsd:string"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
  <xsd:complexType name="BankType">
    <xsd:sequence>
      <xsd:element ref="account" minOccurs="0" maxOccurs="unbounded"/>
      <xsd:element ref="customer" minOccurs="0" maxOccurs="unbounded"/>
      <xsd:element ref="depositor" minOccurs="0" maxOccurs="unbounded"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>

Figure 10.9 XMLSchema version of DTD from Figure 10.6.

• It allows types to be restricted to create specialized types, for instance by specifying minimum and maximum values.

• It allows complex types to be extended by using a form of inheritance.

• It is a superset of DTDs.

• It allows uniqueness and foreign key constraints.

• It is integrated with namespaces to allow different parts of a document to conform to different schemas.

• It is itself specified by XML syntax, as Figure 10.9 shows.


However, the price paid for these features is that XMLSchema is significantly more complicated than DTDs.
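Several of the listed features can be sketched in one small schema: a user-defined type obtained by restriction, a complex type extended by inheritance, and a key with a matching foreign key. This is an illustrative sketch rather than part of Figure 10.9; the names BalanceType, AccountType, SavingsAccountType, DepositorType, interest-rate, accountKey, and depositorAccount are all hypothetical.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- Restriction: a decimal that must be greater than zero -->
  <xsd:simpleType name="BalanceType">
    <xsd:restriction base="xsd:decimal">
      <xsd:minExclusive value="0"/>
    </xsd:restriction>
  </xsd:simpleType>

  <xsd:complexType name="AccountType">
    <xsd:sequence>
      <xsd:element name="account-number" type="xsd:string"/>
      <xsd:element name="branch-name" type="xsd:string"/>
      <xsd:element name="balance" type="BalanceType"/>
    </xsd:sequence>
  </xsd:complexType>

  <!-- Extension: inherits the three subelements above and adds one more -->
  <xsd:complexType name="SavingsAccountType">
    <xsd:complexContent>
      <xsd:extension base="AccountType">
        <xsd:sequence>
          <xsd:element name="interest-rate" type="xsd:decimal"/>
        </xsd:sequence>
      </xsd:extension>
    </xsd:complexContent>
  </xsd:complexType>

  <xsd:complexType name="DepositorType">
    <xsd:sequence>
      <xsd:element name="customer-name" type="xsd:string"/>
      <xsd:element name="account-number" type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>

  <xsd:element name="bank">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="account" type="AccountType"
                     minOccurs="0" maxOccurs="unbounded"/>
        <xsd:element name="depositor" type="DepositorType"
                     minOccurs="0" maxOccurs="unbounded"/>
      </xsd:sequence>
    </xsd:complexType>
    <!-- Uniqueness: no two accounts share an account-number -->
    <xsd:key name="accountKey">
      <xsd:selector xpath="account"/>
      <xsd:field xpath="account-number"/>
    </xsd:key>
    <!-- Foreign key: each depositor must name an existing account -->
    <xsd:keyref name="depositorAccount" refer="accountKey">
      <xsd:selector xpath="depositor"/>
      <xsd:field xpath="account-number"/>
    </xsd:keyref>
  </xsd:element>

</xsd:schema>
```

Unlike an IDREF, the keyref names the key it refers to, so a depositor can only point at an account. The restricted balance type likewise addresses the first DTD limitation listed earlier. The length of even this small sketch also gives a sense of the added complexity.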
