XML: Background
XML
Unlike most of the technologies presented in the preceding chapters, the Extensible Markup Language (XML) was not originally conceived as a database technology. In fact, like the HyperText Markup Language (HTML) on which the World Wide Web is based, XML has its roots in document management, and is derived from a language for structuring large documents known as the Standard Generalized Markup Language (SGML). However, unlike SGML and HTML, XML can represent database data, as well as many other kinds of structured data used in business applications. It is particularly useful as a data format when an application must communicate with another application, or integrate information from several other applications. When XML is used in these contexts, many database issues arise, including how to organize, manipulate, and query the XML data. In this chapter, we introduce XML and discuss both the management of XML data with database techniques and the exchange of data formatted as XML documents.
Background
To understand XML, it is important to understand its roots as a document markup language. The term markup refers to anything in a document that is not intended to be part of the printed output. For example, a writer creating text that will eventually be typeset in a magazine may want to make notes about how the typesetting should be done. These notes must be typed in a way that distinguishes them from the actual content, so that a note like “do not break this paragraph” does not end up printed in the magazine. In electronic document processing, a markup language is a formal description of what part of the document is content, what part is markup, and what the markup means.
Just as database systems evolved from physical file processing to provide a separate logical view, markup languages evolved from specifying instructions for how to print parts of the document to specifying the function of the content. For instance, with functional markup, text representing a section heading (for this section, the word “Background”) would be marked up as being a section heading, instead of being marked up as text to be printed in large size, bold font. Such functional markup allows the document to be formatted differently in different situations. It also helps different parts of a large document, or different pages in a large Web site, to be formatted in a uniform manner. Functional markup also helps automate extraction of key parts of documents.
For the family of markup languages that includes HTML, SGML, and XML, the markup takes the form of tags enclosed in angle brackets, <>. Tags are used in pairs, with <tag> and </tag> delimiting the beginning and the end of the portion of the document to which the tag refers. For example, the title of a document might be marked up as follows:
<title>Database System Concepts</title>
Unlike HTML, XML does not prescribe the set of tags allowed, and the set may be specialized as needed. This feature is the key to XML’s major role in data representation and exchange, whereas HTML is used primarily for document formatting.
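The pairing rule and the freedom to choose tag names can both be checked with an XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the helper is_well_formed and the sample value A-101 are assumptions introduced here for illustration, not part of the original text.

```python
import xml.etree.ElementTree as ET

# A matched pair of tags delimits one element.
title = ET.fromstring("<title>Database System Concepts</title>")
print(title.tag)   # title
print(title.text)  # Database System Concepts


def is_well_formed(text: str) -> bool:
    """Return True if text parses as a single well-formed XML element.

    Hypothetical helper, introduced only for this illustration.
    """
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# XML does not prescribe tag names, so an application-specific tag is accepted.
print(is_well_formed("<account-number>A-101</account-number>"))     # True

# A start tag closed by a different end tag violates the pairing rule.
print(is_well_formed("<title>Database System Concepts</heading>"))  # False
```

The parser accepts any tag name that is a legal XML name; it objects only when the start and end tags do not match.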
For example, in our running banking application, account and customer information can be represented as part of an XML document as in Figure 10.1. Observe the use of tags such as account and account-number. These tags provide context for each value and allow the semantics of the value to be identified.
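As a rough illustration of how such tags give each value its context, the sketch below parses a small banking fragment with Python's standard xml.etree.ElementTree module. The fragment is hypothetical and is not a reproduction of Figure 10.1: only the tag names account and account-number come from the text, while the enclosing bank element, the other element names, and all values are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment in the spirit of the banking example.
# Only <account> and <account-number> are named in the text; the rest is assumed.
BANK_XML = """
<bank>
  <account>
    <account-number>A-101</account-number>
    <branch-name>Downtown</branch-name>
    <balance>500</balance>
  </account>
  <account>
    <account-number>A-102</account-number>
    <branch-name>Perryridge</branch-name>
    <balance>400</balance>
  </account>
  <customer>
    <customer-name>Johnson</customer-name>
    <customer-city>Harrison</customer-city>
  </customer>
</bank>
"""

root = ET.fromstring(BANK_XML)

# The tag around each value says what the value means, so fields can be
# looked up by name rather than by position.
accounts = []
for account in root.findall("account"):
    number = account.findtext("account-number")
    balance = int(account.findtext("balance"))
    accounts.append((number, balance))

print(accounts)  # [('A-101', 500), ('A-102', 400)]
```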
Compared to storage of data in a database, the XML representation may be inefficient, since tag names are repeated throughout the document. However, in spite of this disadvantage, an XML representation has significant advantages when it is used to exchange data, for example, as part of a message:
• First, the presence of the tags makes the message self-documenting; that is, a schema need not be consulted to understand the meaning of the text. We can readily read the fragment in Figure 10.1, for example.
• Second, the format of the document is not rigid. For example, if some sender adds additional information, such as a tag last-accessed noting the last date on which an account was accessed, the recipient of the XML data may simply ignore the tag. The ability to recognize and ignore unexpected tags allows the format of the data to evolve over time, without invalidating existing applications.
• Finally, since the XML format is widely accepted, a wide variety of tools are available to assist in its processing, including browser software and database tools.
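The second advantage, tolerance of unexpected tags, can be sketched as follows. This is one possible way for a recipient to behave, not a prescribed mechanism: the function read_account, the balance element, and the sample date are assumptions made for illustration, while the last-accessed tag is the one mentioned in the list above.

```python
import xml.etree.ElementTree as ET


def read_account(xml_text: str) -> dict:
    """Extract only the fields this recipient knows about.

    Hypothetical recipient: any other child element is simply never looked at.
    """
    account = ET.fromstring(xml_text)
    return {
        "account-number": account.findtext("account-number"),
        "balance": account.findtext("balance"),
    }


# Message in the format the recipient was originally written for.
old_message = (
    "<account>"
    "<account-number>A-101</account-number>"
    "<balance>500</balance>"
    "</account>"
)

# A later sender adds a last-accessed tag (the date is a made-up value).
new_message = (
    "<account>"
    "<account-number>A-101</account-number>"
    "<balance>500</balance>"
    "<last-accessed>2001-03-15</last-accessed>"
    "</account>"
)

# The recipient extracts the same data from both messages.
print(read_account(old_message) == read_account(new_message))  # True
```

Because the recipient selects elements by name instead of relying on a fixed layout, the sender's format can grow without breaking the existing reader.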
Just as SQL is the dominant language for querying relational data, XML is becoming the dominant format for data exchange.