XML Document Schema
Databases have schemas, which are used to constrain what information can be stored in the database and to constrain the data types of the stored information. In contrast, by default, XML documents can be created without any associated schema: an element may then have any subelement or attribute. While such freedom may occasionally be acceptable given the self-describing nature of the data format, it is not generally useful when XML documents must be processed automatically as part of an application, or even when large amounts of related data are to be formatted in XML.
Here, we describe the document-oriented schema mechanism included as part of the XML standard, the Document Type Definition, as well as the more recently defined XMLSchema.
Document Type Definition
The document type definition (DTD) is an optional part of an XML document. The main purpose of a DTD is much like that of a schema: to constrain and type the information present in the document. However, the DTD does not in fact constrain types in the sense of basic types like integer or string. Instead, it only constrains the appearance of subelements and attributes within an element. The DTD is primarily a list of rules for what pattern of subelements appear within an element. Figure 10.6 shows a part of an example DTD for a bank information document; the XML document in Figure 10.1 conforms to this DTD.
Each declaration is in the form of a regular expression for the subelements of an element. Thus, in the DTD in Figure 10.6, a bank element consists of one or more account, customer, or depositor elements; the | operator specifies “or” while the + operator specifies “one or more.” Although not shown here, the * operator is used to specify “zero or more,” while the ? operator is used to specify an optional element (that is, “zero or one”).
The account element is defined to contain subelements account-number, branch-name and balance (in that order). Similarly, customer and depositor have the attributes in their schema defined as subelements.
Finally, the elements account-number, branch-name, balance, customer-name, customer-street, and customer-city are all declared to be of type #PCDATA. The keyword #PCDATA indicates text data; it derives its name, historically, from “parsed character data.” Two other special type declarations are EMPTY, which says that the element has no contents, and ANY, which says that there is no constraint on the subelements of the element; that is, any declared elements can occur as subelements of the element, in any order and number. Note that a validating parser expects every element that appears in the document to be declared in the DTD; an element with no declaration is reported as invalid rather than being treated as ANY, so the unconstrained behavior has to be requested explicitly.
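The declarations described above can be sketched as a small self-contained document with an internal DTD. This is a reconstruction from the prose, not a copy of Figure 10.6, and the data values (A-101, Downtown, and so on) are illustrative assumptions only.

```xml
<?xml version="1.0"?>
<!-- Sketch only: a bank DTD reconstructed from the description in the text. -->
<!DOCTYPE bank [
  <!-- One or more account, customer, or depositor elements, in any mix -->
  <!ELEMENT bank (account | customer | depositor)+>
  <!-- A comma-separated list fixes both the subelements and their order -->
  <!ELEMENT account (account-number, branch-name, balance)>
  <!ELEMENT customer (customer-name, customer-street, customer-city)>
  <!ELEMENT depositor (customer-name, account-number)>
  <!-- Leaf elements hold text only -->
  <!ELEMENT account-number (#PCDATA)>
  <!ELEMENT branch-name (#PCDATA)>
  <!ELEMENT balance (#PCDATA)>
  <!ELEMENT customer-name (#PCDATA)>
  <!ELEMENT customer-street (#PCDATA)>
  <!ELEMENT customer-city (#PCDATA)>
]>
<bank>
  <account>
    <account-number>A-101</account-number>
    <branch-name>Downtown</branch-name>
    <balance>500</balance>
  </account>
  <customer>
    <customer-name>Johnson</customer-name>
    <customer-street>Alma</customer-street>
    <customer-city>Palo Alto</customer-city>
  </customer>
  <depositor>
    <customer-name>Johnson</customer-name>
    <account-number>A-101</account-number>
  </depositor>
</bank>
```

Note that balance is declared only as text; nothing in the DTD stops a document from storing a non-numeric string there.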
The allowable attributes for each element are also declared in the DTD. Unlike subelements, no order is imposed on attributes. Attributes may be specified to be of type CDATA, ID, IDREF, or IDREFS; the type CDATA simply says that the attribute contains character data, while the other three are explained in more detail shortly. For instance, the following line from a DTD specifies that element account has an attribute named acct-type, of type CDATA, with default value checking:
<!ATTLIST account acct-type CDATA "checking">
Attributes must have a type declaration and a default declaration. The default declaration can consist of a default value for the attribute or #REQUIRED, meaning that a value must be specified for the attribute in each element, or #IMPLIED, meaning that no default value has been provided and the attribute may be omitted. If an attribute has a default value, for every element that does not specify a value for the attribute, the default value is filled in automatically when the XML document is read.
An attribute of type ID provides a unique identifier for the element; a value that occurs in an ID attribute of an element must not occur as the ID value of any other element in the same document. At most one attribute of an element is permitted to be of type ID.
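As a minimal sketch of the three kinds of default declaration, consider the following document. The attributes branch-code and nickname are hypothetical names introduced only for this illustration; account is declared EMPTY here purely to keep the example short.

```xml
<?xml version="1.0"?>
<!DOCTYPE bank [
  <!ELEMENT bank (account)*>
  <!ELEMENT account EMPTY>
  <!-- branch-code and nickname are hypothetical attributes for illustration -->
  <!ATTLIST account
      acct-type   CDATA "checking"
      branch-code CDATA #REQUIRED
      nickname    CDATA #IMPLIED>
]>
<bank>
  <!-- acct-type is omitted, so a parser that reads the DTD supplies "checking";
       nickname is omitted and simply has no value -->
  <account branch-code="B1"/>
  <!-- All three attributes given explicitly -->
  <account acct-type="savings" branch-code="B2" nickname="rainy-day"/>
</bank>
```

Leaving out branch-code on either account element would make the document invalid, since that attribute is declared #REQUIRED.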
An attribute of type IDREF is a reference to an element; the attribute must contain a value that appears in the ID attribute of some element in the document. The type IDREFS allows a list of references, separated by spaces.
Figure 10.7 shows an example DTD in which customer-account relationships are represented by ID and IDREFS attributes, instead of depositor records. The account elements use account-number as their identifier attribute; to do so, account-number has been made an attribute of account instead of a subelement. The customer elements have a new identifier attribute called customer-id. Additionally, each customer element contains an attribute accounts, of type IDREFS, which is a list of identifiers of accounts that are owned by the customer. Each account element has an attribute owners, of type IDREFS, which is a list of owners of the account.
Figure 10.8 shows an example XML document based on the DTD in Figure 10.7. Note that we use a different set of accounts and customers from our earlier example, in order to illustrate the IDREFS feature better.
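The following sketch combines a DTD of the kind just described with a small conforming document. It is reconstructed from the prose rather than taken from Figures 10.7 and 10.8; the identifiers (A-401, C100, and so on) and the other data values are assumptions made for illustration.

```xml
<?xml version="1.0"?>
<!DOCTYPE bank [
  <!ELEMENT bank (account | customer)+>
  <!-- account-number is now an ID attribute rather than a subelement -->
  <!ELEMENT account (branch-name, balance)>
  <!ATTLIST account
      account-number ID     #REQUIRED
      owners         IDREFS #REQUIRED>
  <!ELEMENT customer (customer-name, customer-street, customer-city)>
  <!ATTLIST customer
      customer-id ID     #REQUIRED
      accounts    IDREFS #REQUIRED>
  <!ELEMENT branch-name (#PCDATA)>
  <!ELEMENT balance (#PCDATA)>
  <!ELEMENT customer-name (#PCDATA)>
  <!ELEMENT customer-street (#PCDATA)>
  <!ELEMENT customer-city (#PCDATA)>
]>
<bank>
  <!-- owners lists customer IDs, separated by spaces -->
  <account account-number="A-401" owners="C100 C102">
    <branch-name>Downtown</branch-name>
    <balance>500</balance>
  </account>
  <account account-number="A-402" owners="C102">
    <branch-name>Perryridge</branch-name>
    <balance>900</balance>
  </account>
  <!-- accounts lists account IDs, mirroring the owners attribute -->
  <customer customer-id="C100" accounts="A-401">
    <customer-name>Joe</customer-name>
    <customer-street>Monroe</customer-street>
    <customer-city>Madison</customer-city>
  </customer>
  <customer customer-id="C102" accounts="A-401 A-402">
    <customer-name>Mary</customer-name>
    <customer-street>Erin</customer-street>
    <customer-city>Newark</customer-city>
  </customer>
</bank>
```

A validating parser checks that every token in owners and accounts matches some ID value in the document, but not that it matches an ID of the intended kind of element.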
The ID and IDREF attributes serve the same role as reference mechanisms in object-oriented and object-relational databases, permitting the construction of complex data relationships.
Document type definitions are strongly connected to the document formatting heritage of XML. Because of this, they are unsuitable in many ways for serving as the type structure of XML for data processing applications. Nevertheless, a tremendous number of data exchange formats are being defined in terms of DTDs, since they were part of the original standard. Here are some of the limitations of DTDs as a schema mechanism.
• Individual text elements and attributes cannot be further typed. For instance, the element balance cannot be constrained to be a positive number. The lack of such constraints is problematic for data processing and exchange applications, which must then contain code to verify the types of elements and attributes.
• It is difficult to use the DTD mechanism to specify unordered sets of subelements. Order is seldom important for data exchange (unlike document layout, where it is crucial). While the combination of alternation (the | operation) and the * operation as in Figure 10.6 permits the specification of unordered collections of tags, it is much more difficult to specify that each tag may only appear once.
• There is a lack of typing in IDs and IDREFs. Thus, there is no way to specify the type of element to which an IDREF or IDREFS attribute should refer. As a result, the DTD in Figure 10.7 does not prevent the “owners” attribute of an account element from referring to other accounts, even though this makes no sense.
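The difficulty with unordered sets can be seen in a short DTD fragment. The element names account-loose and account-strict are hypothetical, chosen only so that both content models can appear side by side (a DTD may declare each element name only once).

```xml
<!-- Any order is accepted, but so is any number of repetitions,
     including a missing or duplicated balance -->
<!ELEMENT account-loose (account-number | branch-name | balance)*>

<!-- Each subelement exactly once, in any order: every permutation
     must be written out (factored here by the first subelement) -->
<!ELEMENT account-strict (
    (account-number, ((branch-name, balance) | (balance, branch-name)))
  | (branch-name, ((account-number, balance) | (balance, account-number)))
  | (balance, ((account-number, branch-name) | (branch-name, account-number)))
)>
```

With three subelements the strict form already needs six orderings, and the count grows factorially with the number of subelements, which is why this approach is rarely practical.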
XML Schema
An effort to redress many of these DTD deficiencies resulted in a more sophisticated schema language, XMLSchema. We present here an example of XMLSchema, and list some areas in which it improves DTDs, without giving full details of XMLSchema’s syntax.
Figure 10.9 shows how the DTD in Figure 10.6 can be represented by XMLSchema. The first element is the root element bank, whose type is declared later. The example then defines the types of elements account, customer, and depositor. Observe the use of types xsd:string and xsd:decimal to constrain the types of data elements. Finally the example defines the type BankType as containing zero or more occurrences of each of account, customer and depositor. XMLSchema can define the minimum and maximum number of occurrences of subelements by using minOccurs and maxOccurs. The default for both minimum and maximum occurrences is 1, so these have to be explicitly specified to allow zero or more accounts, customers, and depositors.
Among the benefits that XMLSchema offers over DTDs are these:
• It allows user-defined types to be created.
• It allows the text that appears in elements to be constrained to specific types, such as numeric types in specific formats or even more complicated types such as lists or unions.
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="bank" type="BankType"/>
  <xsd:element name="account">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="account-number" type="xsd:string"/>
        <xsd:element name="branch-name" type="xsd:string"/>
        <xsd:element name="balance" type="xsd:decimal"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
  <xsd:element name="customer">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="customer-name" type="xsd:string"/>
        <xsd:element name="customer-street" type="xsd:string"/>
        <xsd:element name="customer-city" type="xsd:string"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
  <xsd:element name="depositor">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="customer-name" type="xsd:string"/>
        <xsd:element name="account-number" type="xsd:string"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
  <xsd:complexType name="BankType">
    <xsd:sequence>
      <xsd:element ref="account" minOccurs="0" maxOccurs="unbounded"/>
      <xsd:element ref="customer" minOccurs="0" maxOccurs="unbounded"/>
      <xsd:element ref="depositor" minOccurs="0" maxOccurs="unbounded"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>
Figure 10.9 XMLSchema version of DTD from Figure 10.6.
• It allows types to be restricted to create specialized types, for instance by specifying minimum and maximum values.
• It allows complex types to be extended by using a form of inheritance.
• It can express essentially everything a DTD can, apart from entity declarations, so it is close to a superset of DTDs.
• It allows uniqueness and foreign key constraints.
• It is integrated with namespaces to allow different parts of a document to conform to different schemas.
• It is itself specified by XML syntax, as Figure 10.9 shows.
However, the price paid for these features is that XMLSchema is significantly more complicated than DTDs.
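Two of the listed features, restricted types and uniqueness and foreign key constraints, can be sketched as follows. This is a minimal illustration under stated assumptions rather than part of Figure 10.9: the names BalanceType, accountKey, and depositorAccountRef are hypothetical, the schema nests the element declarations inside bank instead of using ref, and the choice of a non-negative balance is ours.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- User-defined type created by restriction: a decimal that is at least 0 -->
  <xsd:simpleType name="BalanceType">
    <xsd:restriction base="xsd:decimal">
      <xsd:minInclusive value="0"/>
    </xsd:restriction>
  </xsd:simpleType>

  <xsd:element name="bank">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="account" minOccurs="0" maxOccurs="unbounded">
          <xsd:complexType>
            <xsd:sequence>
              <xsd:element name="account-number" type="xsd:string"/>
              <xsd:element name="branch-name" type="xsd:string"/>
              <xsd:element name="balance" type="BalanceType"/>
            </xsd:sequence>
          </xsd:complexType>
        </xsd:element>
        <xsd:element name="depositor" minOccurs="0" maxOccurs="unbounded">
          <xsd:complexType>
            <xsd:sequence>
              <xsd:element name="customer-name" type="xsd:string"/>
              <xsd:element name="account-number" type="xsd:string"/>
            </xsd:sequence>
          </xsd:complexType>
        </xsd:element>
      </xsd:sequence>
    </xsd:complexType>

    <!-- Uniqueness: no two account elements may share an account-number -->
    <xsd:key name="accountKey">
      <xsd:selector xpath="account"/>
      <xsd:field xpath="account-number"/>
    </xsd:key>

    <!-- Foreign key: each depositor's account-number must match some account -->
    <xsd:keyref name="depositorAccountRef" refer="accountKey">
      <xsd:selector xpath="depositor"/>
      <xsd:field xpath="account-number"/>
    </xsd:keyref>
  </xsd:element>

</xsd:schema>
```

Unlike an IDREF in a DTD, the keyref here names the key it refers to, so a depositor can only point at an account. The sketch is also noticeably longer than the equivalent DTD declarations, which illustrates the added complexity.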