Multimedia Databases
Multimedia data, such as images, audio, and video — an increasingly popular form of data — are today almost always stored outside the database, in file systems. This kind of storage is not a problem when the number of multimedia objects is relatively small, since features provided by databases are usually not important.
However, database features become important when the number of multimedia objects stored is large. Issues such as transactional updates, querying facilities, and indexing then become important. Multimedia objects often have descriptive attributes, such as those indicating when they were created, who created them, and to what category they belong. One approach to building a database for such multimedia objects is to use databases for storing the descriptive attributes and for keeping track of the files in which the multimedia objects are stored.
However, storing multimedia outside the database makes it harder to provide database functionality, such as indexing on the basis of actual multimedia data content. It can also lead to inconsistencies, such as a file that is noted in the database, but whose contents are missing, or vice versa. It is therefore desirable to store the data themselves in the database.
Several issues have to be addressed if multimedia data are to be stored in a database.
• The database must support large objects, since multimedia data such as videos can occupy up to a few gigabytes of storage. Many database systems do not support objects larger than a few gigabytes. Larger objects could be split into smaller pieces and stored in the database. Alternatively, the multimedia object may be stored in a file system, but the database may contain a pointer to the object; the pointer would typically be a file name. The SQL/MED standard (MED stands for Management of External Data), which is under development, allows external data, such as files, to be treated as if they are part of the database. With SQL/MED, the object would appear to be part of the database, but can be stored externally. We discuss multimedia data formats in Section 23.4.1.
• The retrieval of some types of data, such as audio and video, has the requirement that data delivery must proceed at a guaranteed steady rate. Such data are sometimes called isochronous data, or continuous-media data. For example, if audio data are not supplied in time, there will be gaps in the sound. If the data are supplied too fast, system buffers may overflow, resulting in loss of data. We discuss continuous-media data in Section 23.4.2.
• Similarity-based retrieval is needed in many multimedia database applications. For example, in a database that stores fingerprint images, a query fingerprint image is provided, and fingerprints in the database that are similar to the query fingerprint must be retrieved. Index structures such as B+-trees and R-trees cannot be used for this purpose; special index structures need to be created. We discuss similarity-based retrieval in Section 23.4.3.
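The first issue above mentions splitting a large object into smaller pieces. The following is a minimal sketch of that idea, not a description of any particular database system. The piece size, the function names, and the dictionary standing in for a table keyed by (object id, piece number) are all assumptions made for illustration.

```python
from typing import Dict, Iterator, Tuple

PIECE_SIZE = 4  # hypothetical piece size; tiny so the example is easy to follow


def split_into_pieces(data: bytes, piece_size: int = PIECE_SIZE) -> Iterator[Tuple[int, bytes]]:
    """Yield (piece_number, piece) pairs that cover data in order."""
    for offset in range(0, len(data), piece_size):
        yield offset // piece_size, data[offset:offset + piece_size]


def store_object(table: Dict[Tuple[str, int], bytes], object_id: str, data: bytes) -> None:
    """Store each piece under the key (object_id, piece_number)."""
    for number, piece in split_into_pieces(data):
        table[(object_id, number)] = piece


def load_object(table: Dict[Tuple[str, int], bytes], object_id: str) -> bytes:
    """Reassemble an object by concatenating its pieces in piece-number order."""
    keys = sorted(key for key in table if key[0] == object_id)
    return b"".join(table[key] for key in keys)


# The dictionary is a stand-in for a database table (an assumption).
table: Dict[Tuple[str, int], bytes] = {}
store_object(table, "clip-1", b"0123456789")
restored = load_object(table, "clip-1")
```

Keying each piece by its number lets a reader fetch the pieces in order, or fetch only the range it needs.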
Multimedia Data Formats
Because of the large number of bytes required to represent multimedia data, it is essential that multimedia data be stored and transmitted in compressed form. For image data, the most widely used format is JPEG, named after the standards body that created it, the Joint Photographic Experts Group. We can store video data by encoding each frame of video in JPEG format, but such an encoding is wasteful, since successive frames of a video are often nearly the same. The Moving Picture Experts Group has developed the MPEG series of standards for encoding video and audio data; these encodings exploit commonalities among a sequence of frames to achieve a greater degree of compression. The MPEG-1 standard stores a minute of 30-frame-per-second video and audio in approximately 12.5 megabytes (compared to approximately 75 megabytes for video in only JPEG). However, MPEG-1 encoding introduces some loss of video quality, to a level roughly comparable to that of VHS video tape.
The MPEG-2 standard is designed for digital broadcast systems and digital video disks (DVD); it introduces only a negligible loss of video quality. MPEG-2 compresses 1 minute of video and audio to approximately 17 megabytes. Several competing standards are used for audio encoding, including MP3, which stands for MPEG-1 Layer 3, RealAudio, and other formats.
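The per-minute sizes quoted above translate into average data rates as follows. This is only arithmetic on the numbers in the text; it assumes that 1 megabyte means 10^6 bytes, which the text does not specify.

```python
BYTES_PER_MEGABYTE = 1_000_000  # assumption: decimal megabytes
FRAMES_PER_MINUTE = 30 * 60     # 30 frames per second, as stated in the text


def megabits_per_second(megabytes_per_minute: float) -> float:
    """Convert a storage cost in megabytes per minute to megabits per second."""
    bits_per_minute = megabytes_per_minute * BYTES_PER_MEGABYTE * 8
    return bits_per_minute / 60 / 1_000_000


mpeg1_rate = megabits_per_second(12.5)  # roughly 1.67 Mbit/s
mpeg2_rate = megabits_per_second(17)    # roughly 2.27 Mbit/s
jpeg_rate = megabits_per_second(75)     # 10 Mbit/s for frame-by-frame JPEG

# MPEG-1 needs one sixth of the space of frame-by-frame JPEG.
mpeg1_vs_jpeg = 75 / 12.5

# Average size of one JPEG-encoded frame implied by 75 megabytes per minute.
jpeg_bytes_per_frame = 75 * BYTES_PER_MEGABYTE / FRAMES_PER_MINUTE  # about 41,667 bytes
```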
Continuous-Media Data
The most important types of continuous-media data are video and audio data (for example, a database of movies). Continuous-media systems are characterized by their real-time information-delivery requirements:
• Data must be delivered sufficiently fast that no gaps in the audio or video result.
• Data must be delivered at a rate that does not cause overflow of system buffers.
• Synchronization among distinct data streams must be maintained. This need arises, for example, when the video of a person speaking must show lips moving synchronously with the audio of the person speaking.
To supply data predictably at the right time to a large number of consumers of the data, the fetching of data from disk must be carefully coordinated. Usually, data are fetched in periodic cycles. In each cycle, say of n seconds, n seconds worth of data is fetched for each consumer and stored in memory buffers, while the data fetched in the previous cycle is being sent to the consumers from the memory buffers. The cycle period is a compromise: A short period uses less memory but requires more disk arm movement, which is a waste of resources, while a long period reduces disk arm movement but increases memory requirements and may delay initial delivery of data. When a new request arrives, admission control comes into play: That is, the system checks if the request can be satisfied with available resources (in each period); if so, it is admitted; otherwise it is rejected.
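The admission-control check described above can be sketched as follows. This is a simplified model under stated assumptions, not the algorithm of any particular server: it assumes a single disk with a constant transfer rate, one worst-case seek per stream per cycle, and two buffers per stream (one being filled while the other is being sent). The class and field names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Server:
    cycle_seconds: float          # length n of one fetch cycle
    disk_bytes_per_second: float  # assumed constant sustained transfer rate
    seek_seconds: float           # assumed worst-case arm movement per stream per cycle
    buffer_bytes: float           # memory available for stream buffers
    stream_rates: List[float] = field(default_factory=list)  # bytes/s of admitted streams

    def _fits(self, rates: List[float]) -> bool:
        n = self.cycle_seconds
        # Disk time per cycle: one seek plus the transfer of n seconds of data per stream.
        disk_time = sum(self.seek_seconds + r * n / self.disk_bytes_per_second for r in rates)
        # Two buffers per stream: one filled from disk, one drained to the consumer.
        memory = sum(2 * r * n for r in rates)
        return disk_time <= n and memory <= self.buffer_bytes

    def admit(self, rate: float) -> bool:
        """Admit a new stream only if every cycle can still be completed on time."""
        if self._fits(self.stream_rates + [rate]):
            self.stream_rates.append(rate)
            return True
        return False


# Hypothetical numbers: 1-second cycles, 10 MB/s disk, 10 ms seeks, 2 MB of buffers.
server = Server(cycle_seconds=1.0, disk_bytes_per_second=10_000_000,
                seek_seconds=0.01, buffer_bytes=2_000_000)
results = [server.admit(200_000) for _ in range(6)]
# With these numbers memory is the bottleneck: five streams fit, the sixth is rejected.
```

In this model, lengthening the cycle increases the memory each stream needs, while shortening it makes the per-stream seek time a larger share of each cycle, which mirrors the compromise described in the text.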
Extensive research on delivery of continuous media data has dealt with such issues as handling arrays of disks and dealing with disk failure. See the bibliographical references for details.
Several vendors offer video-on-demand servers. Current systems are based on file systems, because existing database systems do not provide the real-time response that these applications need. The basic architecture of a video-on-demand system comprises:
• Video server. Multimedia data are stored on several disks (usually in a RAID configuration). Systems containing a large volume of data may use tertiary storage for less frequently accessed data.
• Terminals. People view multimedia data through various devices, collectively referred to as terminals. Examples are personal computers and televisions attached to a small, inexpensive computer called a set-top box.
• Network. Transmission of multimedia data from a server to multiple terminals requires a high-capacity network.
Video-on-demand service eventually will become ubiquitous, just as cable and broadcast television are now. For the present, the main applications of video-server technology are in offices (for training, viewing recorded talks and presentations, and the like), in hotels, and in video-production facilities.
Similarity-Based Retrieval
In many multimedia applications, data are described only approximately in the database. An example is the fingerprint data in Section 23.4. Other examples are:
• Pictorial data. Two pictures or images that are slightly different as represented in the database may be considered the same by a user. For instance, a database may store trademark designs. When a new trademark is to be registered, the system may need first to identify all similar trademarks that were registered previously.
• Audio data. Speech-based user interfaces are being developed that allow the user to give a command or identify a data item by speaking. The input from the user must then be tested for similarity to those commands or data items stored in the system.
• Handwritten data. Handwritten input can be used to identify a handwritten data item or command stored in the database. Here again, similarity testing is required.
The notion of similarity is often subjective and user specific. However, similarity testing is often more successful than speech or handwriting recognition, because the input can be compared to data already in the system and, thus, the set of choices available to the system is limited.
Several algorithms exist for finding the best matches to a given input by similarity testing. Some systems, including a dial-by-name, voice-activated telephone system, have been deployed commercially. See the bibliographical notes for references.
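One common way to frame similarity testing is to represent each stored item as a numeric feature vector and return the items closest to the query. The sketch below assumes that feature extraction has already been done and uses Euclidean distance with a linear scan; both choices, along with the names and numbers, are illustrative assumptions rather than the method of any system mentioned in the text. A real system would replace the scan with a special-purpose index.

```python
import math
from typing import Dict, List, Sequence, Tuple


def distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def most_similar(query: Sequence[float],
                 stored: Dict[str, Sequence[float]],
                 k: int = 1) -> List[Tuple[str, float]]:
    """Return the k stored items nearest to the query, closest first."""
    ranked = sorted(((name, distance(query, vector)) for name, vector in stored.items()),
                    key=lambda pair: pair[1])
    return ranked[:k]


# Hypothetical feature vectors for registered trademark designs.
trademarks = {
    "acme": [0.9, 0.1, 0.3],
    "apex": [0.8, 0.2, 0.3],
    "zenith": [0.1, 0.9, 0.7],
}
matches = most_similar([0.9, 0.1, 0.35], trademarks, k=2)
names = [name for name, _ in matches]  # "acme" first, then "apex"
```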
Mobility and Personal Databases
Large-scale, commercial databases have traditionally been stored in central computing facilities. In distributed database applications, there has usually been strong central database and network administration. Two technology trends have combined to create applications in which this assumption of central control and administration is not entirely correct:
1. The increasingly widespread use of personal computers, and, more important, of laptop or notebook computers.
2. The development of a relatively low-cost wireless digital communication infrastructure, based on wireless local-area networks, cellular digital packet networks, and other technologies.
Mobile computing has proved useful in many applications. Many business travelers use laptop computers so that they can work and access data en route. Delivery services use mobile computers to assist in package tracking. Emergency-response services use mobile computers at the scene of disasters, medical emergencies, and the like to access information and to enter data pertaining to the situation. New applications of mobile computers continue to emerge.
Wireless computing creates a situation where machines no longer have fixed locations and network addresses. Location-dependent queries are an interesting class of queries that are motivated by mobile computers; in such queries, the location of the user (computer) is a parameter of the query. The value of the location parameter is provided either by the user or, increasingly, by a global positioning system (GPS). An example is a traveler’s information system that provides data on hotels, roadside services, and the like to motorists. Processing of queries about services that are ahead on the current route must be based on knowledge of the user’s location, direction of motion, and speed. Increasingly, navigational aids are being offered as a built-in feature in automobiles.
Energy (battery power) is a scarce resource for most mobile computers. This limitation influences many aspects of system design. Among the more interesting consequences of the need for energy efficiency is the use of scheduled data broadcasts to reduce the need for mobile systems to transmit queries.
Increasing amounts of data may reside on machines administered by users, rather than by database administrators. Furthermore, these machines may, at times, be disconnected from the network. In many cases, there is a conflict between the user’s need to continue to work while disconnected and the need for global data consistency.
In Sections 23.5.1 through 23.5.4, we discuss techniques in use and under development to deal with the problems of mobility and personal computing.
A Model of Mobile Computing
The mobile-computing environment consists of mobile computers, referred to as mobile hosts, and a wired network of computers. Mobile hosts communicate with the wired network via computers referred to as mobile support stations. Each mobile support station manages those mobile hosts within its cell, that is, the geographical area that it covers. Mobile hosts may move between cells, thus necessitating a handoff of control from one mobile support station to another. Since mobile hosts may, at times, be powered down, a host may leave one cell and rematerialize later at some distant cell. Therefore, moves between cells are not necessarily between adjacent cells. Within a small area, such as a building, mobile hosts may be connected by a wireless local-area network (LAN) that provides lower-cost connectivity than would a wide-area cellular network, and that reduces the overhead of handoffs.
It is possible for mobile hosts to communicate directly without the intervention of a mobile support station. However, such communication can occur between only nearby hosts. Such direct forms of communication are becoming more prevalent with the advent of the Bluetooth standard. Bluetooth uses short-range digital radio to allow wireless connectivity within a 10-meter range at high speed (up to 721 kilobits per second). Initially conceived as a replacement for cables, Bluetooth’s greatest promise is in easy ad hoc connection of mobile computers, PDAs, mobile phones, and so-called intelligent appliances.
The network infrastructure for mobile computing consists in large part of two technologies: wireless local-area networks (such as Avaya’s Orinoco wireless LAN), and packet-based cellular telephony networks. Early cellular systems used analog technology and were designed for voice communication. Second-generation digital systems retained the focus on voice applications. Third-generation (3G) and so-called 2.5G systems use packet-based networking and are more suited to data applications. In these networks, voice is just one of many applications (albeit an economically important one).
Bluetooth, wireless LANs, and 2.5G and 3G cellular networks make it possible for a wide variety of devices to communicate at low cost. While such communication itself does not fit the domain of a usual database application, the accounting, monitoring, and management data pertaining to this communication will generate huge databases. The immediacy of wireless communication generates a need for real-time access to many of these databases. This need for timeliness adds another dimension to the constraints on the system — a matter we shall discuss further in Section 24.3.
The size and power limitations of many mobile computers have led to alternative memory hierarchies. Instead of, or in addition to, disk storage, flash memory, which we discussed in Section 11.1, may be included. If the mobile host includes a hard disk, the disk may be allowed to spin down when it is not in use, to save energy. The same considerations of size and energy limit the type and size of the display used in a mobile device. Designers of mobile devices often create special-purpose user interfaces to work within these constraints. However, the need to present Web-based data has necessitated the creation of presentation standards. Wireless application protocol (WAP) is a standard for wireless internet access. WAP-based browsers access special Web pages that use wireless markup language (WML), an XML-based language designed for the constraints of mobile and wireless Web browsing.
Routing and Query Processing
The route between a pair of hosts may change over time if one of the two hosts is mobile. This simple fact has a dramatic effect at the network level, since location-based network addresses are no longer constants within the system.
Mobility also directly affects database query processing. As we saw in Chapter 19, we must consider the communication costs when we choose a distributed query-processing strategy. Mobility results in dynamically changing communication costs, thus complicating the optimization process. Furthermore, there are competing notions of cost to consider:
• User time is a highly valuable commodity in many business applications.
• Connection time is the unit by which monetary charges are assigned in some cellular systems.
• Number of bytes, or packets, transferred is the unit by which charges are computed in some digital cellular systems.
• Time-of-day-based charges vary, depending on whether communication occurs during peak or off-peak periods.
• Energy is limited. Often, battery power is a scarce resource whose use must be optimized. A basic principle of radio communication is that it requires less energy to receive than to transmit radio signals. Thus, transmission and reception of data impose different power demands on the mobile host.
Broadcast Data
It is often desirable for frequently requested data to be broadcast in a continuous cycle by mobile support stations, rather than transmitted to mobile hosts on demand. A typical application of such broadcast data is stock-market price information. There are two reasons for using broadcast data. First, the mobile host avoids the energy cost for transmitting data requests. Second, the broadcast data can be received by a large number of mobile hosts at once, at no extra cost. Thus, the available transmission bandwidth is utilized more effectively.
A mobile host can then receive data as they are transmitted, rather than consuming energy by transmitting a request. The mobile host may have local nonvolatile storage available to cache the broadcast data for possible later use. Given a query, the mobile host may optimize energy costs by determining whether it can process that query with only cached data. If the cached data are insufficient, there are two options: Wait for the data to be broadcast, or transmit a request for data. To make this decision, the mobile host must know when the relevant data will be broadcast.
Broadcast data may be transmitted according to a fixed schedule or a changeable schedule. In the former case, the mobile host uses the known fixed schedule to determine when the relevant data will be transmitted. In the latter case, the broadcast schedule must itself be broadcast at a well-known radio frequency and at well-known time intervals.
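The decision a mobile host faces (use the cache, wait for the broadcast, or transmit a request) can be sketched for the fixed-schedule case. The sketch assumes that the schedule repeats every `cycle_length` seconds and lists, for each item, its offset within the cycle; the function names, the `max_wait` threshold, and the sample schedule are hypothetical.

```python
from typing import Dict, Set


def seconds_until_broadcast(now: int, offset: int, cycle_length: int) -> int:
    """Time until the item that airs at `offset` in each cycle is next broadcast."""
    return (offset - now) % cycle_length


def plan(item: str, now: int, cache: Set[str], schedule: Dict[str, int],
         cycle_length: int, max_wait: int) -> str:
    """Pick the lowest-energy option that still meets the delay the user tolerates."""
    if item in cache:
        return "use cache"           # no radio activity needed at all
    if item in schedule and seconds_until_broadcast(now, schedule[item], cycle_length) <= max_wait:
        return "wait for broadcast"  # receiving costs less energy than transmitting
    return "transmit request"        # not broadcast soon enough, so ask for it


# Hypothetical 60-second cycle of stock prices.
schedule = {"IBM": 5, "AAPL": 40}
cache = {"MSFT"}
decisions = [plan(item, now=62, cache=cache, schedule=schedule,
                  cycle_length=60, max_wait=10)
             for item in ("MSFT", "IBM", "AAPL", "ORCL")]
```

At time 62 the next IBM broadcast is 3 seconds away, so the host waits; AAPL is 38 seconds away and ORCL is not on the schedule, so in both cases the host transmits a request.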
In effect, the broadcast medium can be modeled as a disk with a high latency. Requests for data can be thought of as being serviced when the requested data are broadcast. The transmission schedules behave like indices on the disk. The bibliographical notes list recent research papers in the area of broadcast data management.
Disconnectivity and Consistency
Since wireless communication may be paid for on the basis of connection time, there is an incentive for certain mobile hosts to be disconnected for substantial periods. Mobile computers without wireless connectivity are disconnected most of the time when they are being used, except periodically when they are connected to their host computers, either physically or through a computer network.
During these periods of disconnection, the mobile host may remain in operation. The user of the mobile host may issue queries and updates on data that reside or are cached locally. This situation creates several problems, in particular:
• Recoverability: Updates entered on a disconnected machine may be lost if the mobile host experiences a catastrophic failure. Since the mobile host represents a single point of failure, stable storage cannot be simulated well.
• Consistency: Locally cached data may become out of date, but the mobile host cannot discover this situation until it is reconnected. Likewise, updates occurring in the mobile host cannot be propagated until reconnection occurs.
We explored the consistency problem in Chapter 19, where we discussed network partitioning, and we elaborate on it here. In wired distributed systems, partitioning is considered to be a failure mode; in mobile computing, partitioning via disconnection is part of the normal mode of operation. It is therefore necessary to allow data access to proceed despite partitioning, even at the risk of some loss of consistency.
For data updated by only the mobile host, it is a simple matter to propagate the updates when the mobile host reconnects. However, if the mobile host caches read-only copies of data that may be updated by other computers, the cached data may become inconsistent. When the mobile host is connected, it can be sent invalidation reports that inform it of out-of-date cache entries. However, when the mobile host is disconnected, it may miss an invalidation report. A simple solution to this problem is to invalidate the entire cache on reconnection, but such an extreme solution is highly costly. Several caching schemes are cited in the bibliographical notes.
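A minimal sketch of cache maintenance with invalidation reports follows. It assumes that reports carry consecutive sequence numbers so that a host can tell whether it missed one; that numbering, like the class and method names, is an assumption for illustration and is not taken from the text. When a gap is detected, the sketch falls back on the simple solution described above and discards the whole cache.

```python
from typing import Dict, Iterable


class MobileCache:
    """Read-only cache on a mobile host, kept fresh by invalidation reports."""

    def __init__(self) -> None:
        self.entries: Dict[str, object] = {}
        self.last_report = 0  # sequence number of the last report applied (assumed scheme)

    def apply_report(self, sequence: int, stale_keys: Iterable[str]) -> None:
        if sequence != self.last_report + 1:
            # At least one report was missed while disconnected, so any entry
            # might be out of date: discard everything.
            self.entries.clear()
        else:
            # No gap: drop only the entries named in the report.
            for key in stale_keys:
                self.entries.pop(key, None)
        self.last_report = sequence


cache = MobileCache()
cache.entries = {"a": 1, "b": 2, "c": 3}
cache.apply_report(1, ["a"])      # connected: only "a" is dropped
after_first = sorted(cache.entries)
cache.apply_report(3, ["b"])      # report 2 was missed: the whole cache is dropped
after_gap = sorted(cache.entries)
```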
If updates can occur at both the mobile host and elsewhere, detecting conflicting updates is more difficult. Version-numbering-based schemes allow updates of shared files from disconnected hosts. These schemes do not guarantee that the updates will be consistent. Rather, they guarantee that, if two hosts independently update the same version of a document, the clash will be detected eventually, when the hosts exchange information either directly or through a common host.
The version-vector scheme detects inconsistencies when copies of a document are independently updated. This scheme allows copies of a document to be stored at multiple hosts. Although we use the term document, the scheme can be applied to any other data items, such as tuples of a relation.
The basic idea is for each host i to store, with its copy of each document d, a version vector, that is, a set of version numbers {V_{d,i}[j]}, with one entry for each host j on which the document could potentially be updated. When a host i updates a document d, it increments the version number V_{d,i}[i] by one.
Whenever two hosts i and j connect with each other, they exchange updated documents, so that both obtain new versions of the documents. However, before exchanging documents, the hosts have to discover whether the copies are consistent:
1. If the version vectors are the same on both hosts, that is, if V_{d,i}[k] = V_{d,j}[k] for each k, then the copies of document d are identical.
2. If, for each k, V_{d,i}[k] ≤ V_{d,j}[k] and the version vectors are not identical, then the copy of document d at host i is older than the one at host j. That is, the copy of document d at host j was obtained by one or more modifications of the copy of the document at host i. Host i replaces its copy of d, as well as its copy of the version vector for d, with the copies from host j. The symmetric case, with the roles of i and j exchanged, is handled in the same way.
3. If there is a pair of hosts k and m such that V_{d,i}[k] < V_{d,j}[k] and V_{d,i}[m] > V_{d,j}[m], then the copies are inconsistent, since two or more updates have been performed on d independently: the copy of d at j contains updates performed by host k that have not been propagated to host i, and, similarly, the copy of d at i contains updates performed by host m that have not been propagated to host j. Manual intervention may be required to merge the updates.
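The three cases above can be sketched as a comparison of two version vectors. The sketch represents each vector as a dictionary from host name to version number and treats a missing entry as 0; the representation, the function name, and the returned labels are assumptions for illustration.

```python
from typing import Dict

VersionVector = Dict[str, int]


def compare(first: VersionVector, second: VersionVector) -> str:
    """Classify two copies of a document by comparing their version vectors."""
    hosts = set(first) | set(second)
    # Missing entries are treated as version 0 (an assumption of this sketch).
    first_behind = any(first.get(h, 0) < second.get(h, 0) for h in hosts)
    second_behind = any(first.get(h, 0) > second.get(h, 0) for h in hosts)
    if first_behind and second_behind:
        return "inconsistent"   # case 3: independent updates on both sides
    if first_behind:
        return "first older"    # case 2: first should take second's copy and vector
    if second_behind:
        return "second older"   # case 2 with the roles exchanged
    return "identical"          # case 1


# Hosts A and B both start with the same copy of a document.
a = {"A": 0, "B": 0}
b = {"A": 0, "B": 0}
start = compare(a, b)

a["A"] += 1                  # A updates its copy while B is disconnected
after_a_update = compare(a, b)
b = dict(a)                  # on reconnection B adopts A's copy and vector

a["A"] += 1                  # both hosts now update independently
b["B"] += 1
after_both_update = compare(a, b)
```

After the last two updates the vectors are {A: 2, B: 0} and {A: 1, B: 1}; neither dominates the other, so the clash is detected.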
The version-vector scheme was initially designed to deal with failures in distributed file systems. The scheme gained importance because mobile computers often store copies of files that are also present on server systems, in effect constituting a distributed file system that is often disconnected. Another application of the scheme is in groupware systems, where hosts are connected periodically, rather than continuously, and must exchange updated documents. The version-vector scheme also has applications in replicated databases.
The version-vector scheme, however, fails to address the most difficult and most important issue arising from updates to shared data — the reconciliation of inconsistent copies of data. Many applications can perform reconciliation automatically by executing in each computer those operations that had performed updates on remote computers during the period of disconnection. This solution works if update operations commute — that is, they generate the same result, regardless of the order in which they are executed. Alternative techniques may be available in certain applications; in the worst case, however, it must be left to the users to resolve the inconsistencies. Dealing with such inconsistency automatically, and assisting users in resolving inconsistencies that cannot be handled automatically, remains an area of research.
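The role of commutativity in automatic reconciliation can be illustrated with a small example. Each host logs the operations it performed while disconnected and, on reconnection, replays the other host's log on its own copy. The operations and values below are hypothetical.

```python
from typing import Callable, List

Operation = Callable[[int], int]


def replay(value: int, operations: List[Operation]) -> int:
    """Apply logged operations to a value in order."""
    for operation in operations:
        value = operation(value)
    return value


def deposit_30(balance: int) -> int:
    return balance + 30


def withdraw_10(balance: int) -> int:
    return balance - 10


def double(balance: int) -> int:
    return balance * 2


start = 100

# Commuting operations: each host applies its own log, then the other host's log.
host_a = replay(replay(start, [deposit_30]), [withdraw_10])
host_b = replay(replay(start, [withdraw_10]), [deposit_30])
commuting_converges = host_a == host_b        # both copies end at 120

# Non-commuting operations: the same procedure leaves the copies different.
host_c = replay(replay(start, [deposit_30]), [double])   # (100 + 30) * 2 = 260
host_d = replay(replay(start, [double]), [deposit_30])   # 100 * 2 + 30 = 230
noncommuting_converges = host_c == host_d
```

Deposits and withdrawals commute, so replaying the remote log reconciles the copies; a deposit and a doubling do not, so some other technique, or the user, must resolve the difference.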
Another weakness is that the version-vector scheme requires substantial communication between a reconnecting mobile host and that host’s mobile support station. Consistency checks can be delayed until the data are needed, although this delay may increase the overall inconsistency of the database.
The potential for disconnection and the cost of wireless communication limit the practicality of transaction-processing techniques discussed in Chapter 19 for distributed systems. Often, it is preferable to let users prepare transactions on mobile hosts, but to require that, instead of executing the transactions locally, they submit transactions to a server for execution. Transactions that span more than one computer and that include a mobile host face long-term blocking during transaction commit, unless disconnectivity is rare or predictable.