
Advanced Data Types and New Applications

For most of the history of databases, the types of data stored in databases were relatively simple, and this was reflected in the rather limited support for data types in earlier versions of SQL. In the past few years, however, there has been increasing need for handling new data types in databases, such as temporal data, spatial data, and multimedia data.

Another major trend in the last decade has created its own issues: the growth of mobile computers, starting with laptop computers and pocket organizers, and in more recent years growing to include mobile phones with built-in computers, and a variety of wearable computers that are increasingly used in commercial applications.

In this chapter we study several new data types, and also study database issues dealing with mobile computers.

Motivation

Before we address each of the topics in detail, we summarize the motivation for, and some important issues in dealing with, each of these types of data.

• Temporal data. Most database systems model the current state of the world, for instance, current customers, current students, and courses currently being offered. In many applications, it is very important to store and retrieve information about past states. Historical information can be incorporated manually into a schema design. However, the task is greatly simpliﬁed by database support for temporal data, which we study in Section 23.2.

• Spatial data. Spatial data include geographic data, such as maps and associated information, and computer-aided-design data, such as integrated-circuit designs or building designs. Applications of spatial data initially stored data as ﬁles in a ﬁle system, as did early-generation business applications. But as the complexity and volume of the data, and the number of users, have grown, ad hoc approaches to storing and retrieving data in a ﬁle system have proved insufﬁcient for the needs of many applications that use spatial data.

Spatial-data applications require facilities offered by a database system — in particular, the ability to store and query large amounts of data efﬁciently. Some applications may also require other database features, such as atomic updates to parts of the stored data, durability, and concurrency control. In Section 23.3, we study the extensions needed to traditional database systems to support spatial data.

• Multimedia data. In Section 23.4, we study the features required in database systems that store multimedia data such as image, video, and audio data. The main distinguishing feature of video and audio data is that the display of the data requires retrieval at a steady, predetermined rate; hence, such data are called continuous-media data.

• Mobile databases. In Section 23.5, we study the database requirements of the new generation of mobile computing systems, such as notebook computers and palmtop computing devices, which are connected to base stations via wireless digital communication networks. Such computers need to be able to operate while disconnected from the network, unlike the distributed database systems discussed in Chapter 19. They also have limited storage capacity, and thus require special techniques for memory management.

Time in Databases

A database models the state of some aspect of the real world outside itself. Typically, databases model only one state — the current state — of the real world, and do not store information about past states, except perhaps as audit trails. When the state of the real world changes, the database gets updated, and information about the old state gets lost. However, in many applications, it is important to store and retrieve information about past states. For example, a patient database must store information about the medical history of a patient. A factory monitoring system may store information about current and past readings of sensors in the factory, for analysis. Databases that store information about states of the real world across time are called temporal databases.

When considering the issue of time in database systems, we must distinguish between time as measured by the system and time as observed in the real world. The valid time for a fact is the set of time intervals during which the fact is true in the real world. The transaction time for a fact is the time interval during which the fact is current within the database system. This latter time is based on the transaction serialization order and is generated automatically by the system. Note that valid-time intervals, being a real-world concept, cannot be generated automatically and must be provided to the system.

A temporal relation is one where each tuple has an associated time when it is true; the time may be either valid time or transaction time. Of course, both valid time and transaction time can be stored, in which case the relation is said to be a bitemporal relation. Figure 23.1 shows an example of a temporal relation. To simplify the representation, each tuple has only one time interval associated with it; thus, a tuple is represented once for every disjoint time interval in which it is true. Intervals are shown here as a pair of attributes from and to; an actual implementation would have a structured type, perhaps called Interval, that contains both fields. Note that some of the tuples have a “*” in the to time column; these asterisks indicate that the tuple is true until the value in the to time column is changed; thus, the tuple is true at the current time. Although times are shown in textual form, they are stored internally in a more compact form, such as the number of seconds since some fixed time on a fixed date (such as 12:00 AM, January 1, 1900) that can be translated back to the normal textual form.

Time Speciﬁcation in SQL

The SQL standard deﬁnes the types date, time, and timestamp. The type date contains four digits for the year (1 – 9999), two digits for the month (1 – 12), and two digits for the date (1 – 31). The type time contains two digits for the hour, two digits for the minute, and two digits for the second, plus optional fractional digits. The seconds ﬁeld can go beyond 60, to allow for leap seconds that are added during some years to correct for small variations in the speed of rotation of Earth. The type timestamp contains the ﬁelds of date and time, with six fractional digits for the seconds ﬁeld.

Since different places in the world have different local times, there is often a need for specifying the time zone along with the time. Universal Coordinated Time (UTC) is a standard reference point for specifying time, with local times defined as offsets from UTC. (The standard abbreviation is UTC, rather than UCT, since it is an abbreviation of “Universal Coordinated Time” written in French as universel temps coordonné.) SQL also supports two types, time with time zone and timestamp with time zone, which specify the time as a local time plus the offset of the local time from UTC. For instance, the time could be expressed in terms of U.S. Eastern Standard Time, with an offset of −5:00, since U.S. Eastern Standard Time is 5 hours behind UTC.

SQL supports a type called interval, which allows us to refer to a period of time such as “1 day” or “2 days and 5 hours,” without specifying a particular time when this period starts. This notion differs from the notion of interval we used previously, which refers to an interval of time with specific starting and ending times.
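The difference between a time-zone-qualified timestamp and an unanchored period of time can be illustrated outside SQL. The following is a minimal sketch that uses Python's standard datetime module rather than SQL syntax; the sample date and the variable names are illustrative assumptions, and the fixed offset of −5:00 stands in for U.S. Eastern Standard Time.

```python
from datetime import datetime, timedelta, timezone

# Assumption: model U.S. Eastern Standard Time as a fixed offset of -5:00 from UTC.
EST = timezone(timedelta(hours=-5), "EST")

# A local time plus its offset from UTC, analogous to "timestamp with time zone".
local = datetime(2001, 3, 15, 9, 30, 0, tzinfo=EST)

# The same instant expressed in UTC.
as_utc = local.astimezone(timezone.utc)

# A period of time with no fixed start, analogous to the SQL interval
# "2 days and 5 hours".
span = timedelta(days=2, hours=5)

# Adding the period to a specific timestamp yields another timestamp.
later = local + span

print(as_utc.isoformat())
print(later.isoformat())
```

The period itself carries no starting point; it acquires one only when it is added to a particular timestamp.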

Temporal Query Languages

A database relation without temporal information is sometimes called a snapshot relation, since it reﬂects the state in a snapshot of the real world. Thus, a snapshot of a temporal relation at a point in time t is the set of tuples in the relation that are true at time t, with the time-interval attributes projected out. The snapshot operation on a temporal relation gives the snapshot of the relation at a speciﬁed time (or the current time, if the time is not speciﬁed).
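The snapshot operation can be sketched in a few lines. This is only an illustration under stated assumptions: the account relation and its values are hypothetical, times are plain integers, each interval is treated as half-open, and a to value of None plays the role of the “*” marker.

```python
from typing import List, Optional, Tuple

# Hypothetical temporal relation: (account_number, balance, from_time, to_time).
# Assumptions: times are plain integers, each interval is half-open
# [from_time, to_time), and to_time=None plays the role of the "*" marker.
Row = Tuple[str, int, int, Optional[int]]


def snapshot(relation: List[Row], t: int) -> List[Tuple[str, int]]:
    """Return the tuples true at time t, with the time attributes projected out."""
    result = []
    for account_number, balance, start, end in relation:
        if start <= t and (end is None or t < end):
            result.append((account_number, balance))
    return result


account = [
    ("A-101", 500, 1, 5),
    ("A-101", 700, 5, None),
    ("A-215", 900, 2, 4),
]
print(snapshot(account, 3))
print(snapshot(account, 6))
```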

A temporal selection is a selection that involves the time attributes; a temporal projection is a projection where the tuples in the projection inherit their times from the tuples in the original relation. A temporal join is a join, with the time of a tuple in the result being the intersection of the times of the tuples from which it is derived. If the times do not intersect, the tuple is removed from the result.
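A temporal join can be sketched as a nested loop that keeps a result tuple only when the two time intervals intersect. The relation contents, the tuple layout, and the half-open integer intervals below are assumptions made for illustration.

```python
from typing import List, Tuple

# Assumption: each tuple is (key, value, from_time, to_time) with integer
# times and a half-open interval [from_time, to_time).
Row = Tuple[str, object, int, int]


def temporal_join(r: List[Row], s: List[Row]) -> List[tuple]:
    """Join r and s on key; each result carries the intersection of the times."""
    result = []
    for key_r, val_r, from_r, to_r in r:
        for key_s, val_s, from_s, to_s in s:
            if key_r != key_s:
                continue
            start, end = max(from_r, from_s), min(to_r, to_s)
            if start < end:  # drop the tuple when the times do not intersect
                result.append((key_r, val_r, val_s, start, end))
    return result


balance = [("A-101", 500, 1, 5)]
branch = [("A-101", "Downtown", 3, 9), ("A-101", "Perryridge", 9, 12)]
print(temporal_join(balance, branch))
```

In this example only the first branch tuple survives, because the second one starts after the balance tuple has ended.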

The predicates precedes, overlaps, and contains can be applied on intervals; their meanings should be clear. The intersect operation can be applied on two intervals, to give a single (possibly empty) interval. However, the union of two intervals may or may not be a single interval.
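One possible reading of these predicates and operations is sketched below. The Interval type, the integer time values, and the half-open convention are assumptions; other conventions (for example, closed intervals) would change the boundary cases.

```python
from typing import List, NamedTuple, Optional


class Interval(NamedTuple):
    start: int  # inclusive
    end: int    # exclusive (assumed half-open convention)


def precedes(a: Interval, b: Interval) -> bool:
    """a ends before (or exactly when) b starts."""
    return a.end <= b.start


def overlaps(a: Interval, b: Interval) -> bool:
    """a and b share at least one point in time."""
    return a.start < b.end and b.start < a.end


def contains(a: Interval, b: Interval) -> bool:
    """Every point of b lies within a."""
    return a.start <= b.start and b.end <= a.end


def intersect(a: Interval, b: Interval) -> Optional[Interval]:
    """Return the common interval, or None if it is empty."""
    start, end = max(a.start, b.start), min(a.end, b.end)
    return Interval(start, end) if start < end else None


def union(a: Interval, b: Interval) -> List[Interval]:
    """Return one interval if a and b touch or overlap, otherwise two."""
    if a.end < b.start or b.end < a.start:
        return sorted([a, b])
    return [Interval(min(a.start, b.start), max(a.end, b.end))]
```

The union function returns a list precisely because the result may or may not be a single interval.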

Functional dependencies must be used with care in a temporal relation. Although the account number may functionally determine the balance at any given point in time, obviously the balance can change over time. A temporal functional dependency X →τ Y holds on a relation schema R if, for all legal instances r of R, all snapshots of r satisfy the functional dependency X → Y.

Several proposals have been made for extending SQL to improve its support of temporal data. SQL:1999 Part 7 (SQL/Temporal), which is currently under development, is the proposed standard for temporal extensions to SQL.

Spatial and Geographic Data

Spatial data support in databases is important for efficient storing, indexing, and querying of data based on spatial locations. For example, suppose that we want to store a set of polygons in a database, and to query the database to find all polygons that intersect a given polygon. We cannot use standard index structures, such as B-trees or hash indices, to answer such a query efficiently. Efficient processing of the above query would require special-purpose index structures, such as R-trees (which we study later), for the task.

Two types of spatial data are particularly important:

• Computer-aided-design (CAD) data, which includes spatial information about how objects (such as buildings, cars, or aircraft) are constructed. Other important examples of computer-aided-design databases are integrated-circuit and electronic-device layouts.

• Geographic data such as road maps, land-usage maps, topographic elevation maps, political maps showing boundaries, land ownership maps, and so on. Geographic information systems are special-purpose databases tailored for storing geographic data.

Support for geographic data has been added to many database systems, such as the IBM DB2 Spatial Extender, the Informix Spatial DataBlade, and Oracle Spatial.

Representation of Geometric Information

Figure 23.2 illustrates how various geometric constructs can be represented in a database, in a normalized fashion. We stress here that geometric information can be represented in several different ways, only some of which we describe.

A line segment can be represented by the coordinates of its endpoints. For example, in a map database, the two coordinates of a point would be its latitude and longitude. A polyline (also called a linestring) consists of a connected sequence of line segments, and can be represented by a list containing the coordinates of the endpoints of the segments, in sequence. We can approximately represent an arbitrary curve by polylines, by partitioning the curve into a sequence of segments. This representation is useful for features such as roads on a two-dimensional map; here, the width of the road is small enough relative to the size of the full map that the road can be treated as a line. Some systems also support circular arcs as primitives, allowing curves to be represented as sequences of arcs.

We can represent a polygon by listing its vertices in order, as in Figure 23.2. The list of vertices specifies the boundary of a polygonal region. In an alternative representation, a polygon can be divided into a set of triangles, as shown in Figure 23.2.

This process is called triangulation, and any polygon can be triangulated. The complex polygon can be given an identiﬁer, and each of the triangles into which it is divided carries the identiﬁer of the polygon. Circles and ellipses can be represented by corresponding types, or can be approximated by polygons.

List-based representations of polylines or polygons are often convenient for query processing. Such non-ﬁrst-normal-form representations are used when supported by the underlying database. So that we can use ﬁxed-size tuples (in ﬁrst-normal form) for representing polylines, we can give the polyline or curve an identiﬁer, and can represent each segment as a separate tuple that also carries with it the identiﬁer of the polyline or curve. Similarly, the triangulated representation of polygons allows a ﬁrst-normal-form relational representation of polygons.
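The conversion from a list-based representation to fixed-size tuples can be sketched as follows. The tuple layouts and function names are illustrative assumptions. The polygon routine uses simple fan triangulation, which is valid only for convex polygons; triangulating an arbitrary polygon needs a more general algorithm that is not shown here.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def polyline_to_segments(polyline_id: str, points: List[Point]) -> List[tuple]:
    """One fixed-size tuple per segment, each carrying the polyline identifier."""
    return [
        (polyline_id, seq, x1, y1, x2, y2)
        for seq, ((x1, y1), (x2, y2)) in enumerate(zip(points, points[1:]))
    ]


def convex_polygon_to_triangles(polygon_id: str, vertices: List[Point]) -> List[tuple]:
    """Fan triangulation; valid only for convex polygons (an assumption here)."""
    first = vertices[0]
    return [
        (polygon_id, first, vertices[i], vertices[i + 1])
        for i in range(1, len(vertices) - 1)
    ]


print(polyline_to_segments("road1", [(0, 0), (1, 2), (3, 2)]))
print(convex_polygon_to_triangles("P1", [(0, 0), (1, 0), (1, 1), (0, 1)]))
```

Each output tuple has a fixed number of fields, so the tuples fit a first-normal-form relation, and the shared identifier lets a query reassemble the original polyline or polygon.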

The representation of points and line segments in three-dimensional space is similar to their representation in two-dimensional space, the only difference being that points have an extra z component. Similarly, the representation of planar ﬁgures — such as triangles, rectangles, and other polygons — does not change much when we move to three dimensions. Tetrahedrons and cuboids can be represented in the same way as triangles and rectangles. We can represent arbitrary polyhedra by dividing them into tetrahedrons, just as we triangulate polygons. We can also represent them by listing their faces, each of which is itself a polygon, along with an indication of which side of the face is inside the polyhedron.

Design Databases

Computer-aided-design (CAD) systems traditionally stored data in memory during editing or other processing, and wrote the data back to a file at the end of a session of editing. The drawbacks of such a scheme include the cost (programming complexity, as well as time cost) of transforming data from one form to another, and the need to read in an entire file even if only parts of it are required. For large designs, such as the design of a large-scale integrated circuit, or the design of an entire airplane, it may be impossible to hold the complete design in memory. Designers of object-oriented databases were motivated in large part by the database requirements of CAD systems. Object-oriented databases represent components of the design as objects, and the connections between the objects indicate how the design is structured.

The objects stored in a design database are generally geometric objects. Simple two-dimensional geometric objects include points, lines, triangles, rectangles, and, in general, polygons. Complex two-dimensional objects can be formed from simple objects by means of union, intersection, and difference operations. Similarly, complex three-dimensional objects may be formed from simpler objects such as spheres, cylinders, and cuboids, by union, intersection, and difference operations, as in Figure 23.3. Three-dimensional surfaces may also be represented by wireframe models, which essentially model the surface as a set of simpler objects, such as line segments, triangles, and rectangles.

Design databases also store nonspatial information about objects, such as the material from which the objects are constructed. We can usually model such information by standard data-modeling techniques. We concern ourselves here with only the spatial aspects.

Various spatial operations must be performed on a design. For instance, the designer may want to retrieve that part of the design that corresponds to a particular region of interest. Spatial-index structures, discussed in Section 23.3.5, are useful for such tasks. Spatial-index structures are multidimensional, dealing with two- and three-dimensional data, rather than dealing with just the simple one-dimensional ordering provided by the B+-trees.

Spatial-integrity constraints, such as “two pipes should not be in the same location,” are important in design databases to prevent interference errors. Such errors often occur if the design is performed manually, and are detected only when a prototype is being constructed. As a result, these errors can be expensive to ﬁx. Database support for spatial-integrity constraints helps people to avoid design errors, thereby keeping the design consistent. Implementing such integrity checks again depends on the availability of efﬁcient multidimensional index structures.

Geographic Data

Geographic data are spatial in nature, but differ from design data in certain ways. Maps and satellite images are typical examples of geographic data. Maps may provide not only location information — about boundaries, rivers, and roads, for example — but also much more detailed information associated with locations, such as elevation, soil type, land usage, and annual rainfall.

Geographic data can be categorized into two types:

• Raster data. Such data consist of bit maps or pixel maps, in two or more dimensions. A typical example of a two-dimensional raster image is a satellite image of cloud cover, where each pixel stores the cloud visibility in a particular area. Such data can be three-dimensional — for example, the temperature at different altitudes at different regions, again measured with the help of a satellite. Time could form another dimension — for example, the surface temperature measurements at different points in time. Design databases generally do not store raster data.

• Vector data. Vector data are constructed from basic geometric objects, such as points, line segments, triangles, and other polygons in two dimensions, and cylinders, spheres, cuboids, and other polyhedrons in three dimensions.

Map data are often represented in vector format. Rivers and roads may be represented as unions of multiple line segments. States and countries may be represented as polygons. Topographic information, such as height, may be represented by a surface divided into polygons covering regions of equal height, with a height value associated with each polygon.

Representation of Geographic Data

Geographical features, such as states and large lakes, are represented as complex polygons. Some features, such as rivers, may be represented either as complex curves or as complex polygons, depending on whether their width is relevant.

Geographic information related to regions, such as annual rainfall, can be represented as an array, that is, in raster form. For space efficiency, the array can be stored in a compressed form. In Section 23.3.5, we study an alternative representation of such arrays by a data structure called a quadtree.

As noted in Section 23.3.3, we can represent region information in vector form, using polygons, where each polygon is a region within which the array value is the same. The vector representation is more compact than the raster representation in some applications. It is also more accurate for some tasks, such as depicting roads, where dividing the region into pixels (which may be fairly large) leads to a loss of precision in location information. However, the vector representation is unsuitable for applications where the data are intrinsically raster based, such as satellite images.

Applications of Geographic Data

Geographic databases have a variety of uses, including online map services; vehicle-navigation systems; distribution-network information for public-service utilities such as telephone, electric-power, and water-supply systems; and land-usage information for ecologists and planners.

Web-based road map services form a very widely used application of map data. At the simplest level, these systems can be used to generate online road maps of a desired region. An important benefit of online maps is that it is easy to scale the maps to the desired size, that is, to zoom in and out to locate relevant features. Road map services also store information about roads and services, such as the layout of roads, speed limits on roads, road conditions, connections between roads, and one-way restrictions. With this additional information about roads, the maps can be used for getting directions to go from one place to another and for automatic trip planning. Users can query online information about services to locate, for example, hotels, gas stations, or restaurants with desired offerings and price ranges.

Vehicle-navigation systems are systems mounted in automobiles, which provide road maps and trip planning services. A useful addition to a mobile geographic information system such as a vehicle navigation system is a Global Positioning System (GPS) unit, which uses information broadcast from GPS satellites to find the current location with an accuracy of tens of meters. With such a system, a driver can never get lost: the GPS unit finds the location in terms of latitude, longitude, and elevation, and the navigation system can query the geographic database to find where and on which road the vehicle is currently located.

Geographic databases for public-utility information are becoming increasingly important as the network of buried cables and pipes grows. Without detailed maps, work carried out by one utility may damage the cables of another utility, resulting in large-scale disruption of service. Geographic databases, coupled with accurate location-ﬁnding systems, can help avoid such problems.

So far, we have explained why spatial databases are useful. In the rest of the section, we shall study technical details, such as representation and indexing of spatial information.

Spatial Queries

There are a number of types of queries that involve spatial locations.

• Nearness queries request objects that lie near a speciﬁed location. A query to ﬁnd all restaurants that lie within a given distance of a given point is an example of a nearness query. The nearest-neighbor query requests the object that is nearest to a speciﬁed point. For example, we may want to ﬁnd the nearest gasoline station. Note that this query does not have to specify a limit on the distance, and hence we can ask it even if we have no idea how far the nearest gasoline station lies.

• Region queries deal with spatial regions. Such a query can ask for objects that lie partially or fully inside a speciﬁed region. A query to ﬁnd all retail shops within the geographic boundaries of a given town is an example.

• Queries may also request intersections and unions of regions. For example, given region information, such as annual rainfall and population density, a query may request all regions with a low annual rainfall as well as a high population density.
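The first two kinds of query in the list above can be sketched with a simple linear scan. This is a brute-force illustration rather than an indexed implementation; the data layout, the function names, the rectangular region, and the use of planar Euclidean distance (instead of distances on the surface of the Earth) are all assumptions.

```python
import math
from typing import List, Tuple

# Hypothetical data: (name, (x, y)) pairs in planar coordinates.
Place = Tuple[str, Tuple[float, float]]


def within_distance(places: List[Place], center, radius: float) -> List[str]:
    """Nearness query: all places within a given distance of a point."""
    return [name for name, loc in places if math.dist(loc, center) <= radius]


def nearest_neighbor(places: List[Place], point) -> str:
    """Nearest-neighbor query: no distance limit has to be specified."""
    return min(places, key=lambda place: math.dist(place[1], point))[0]


def in_region(places: List[Place], xmin, ymin, xmax, ymax) -> List[str]:
    """Region query for an axis-parallel rectangular region."""
    return [
        name for name, (x, y) in places
        if xmin <= x <= xmax and ymin <= y <= ymax
    ]


restaurants = [("a", (0, 0)), ("b", (3, 4)), ("c", (10, 10))]
print(within_distance(restaurants, (0, 0), 5))
print(nearest_neighbor(restaurants, (9, 9)))
print(in_region(restaurants, 2, 2, 12, 12))
```

A scan like this examines every object; the index structures discussed later in the section exist to avoid that cost on large data sets.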

Queries that compute intersections of regions can be thought of as computing the spatial join of two spatial relations (for example, one representing rainfall and the other representing population density), with the location playing the role of join attribute. In general, given two relations, each containing spatial objects, the spatial join of the two relations generates either pairs of objects that intersect, or the intersection regions of such pairs.

Several join algorithms efficiently compute spatial joins on vector data. Although nested-loop join and indexed nested-loop join (with spatial indices) can be used, hash joins and sort-merge joins cannot be used on spatial data. Researchers have proposed join techniques based on coordinated traversal of spatial index structures on the two relations. See the bibliographical notes for more information.
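A nested-loop spatial join can be sketched for the simple case where every region is an axis-parallel rectangle. The rectangle encoding (xmin, ymin, xmax, ymax), the relation contents, and the function names are assumptions for illustration; real regions would be general polygons and would need a polygon-intersection routine.

```python
from typing import List, Optional, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def rect_intersection(a: Rect, b: Rect) -> Optional[Rect]:
    """Return the overlapping rectangle, or None if the overlap has no area."""
    xmin, ymin = max(a[0], b[0]), max(a[1], b[1])
    xmax, ymax = min(a[2], b[2]), min(a[3], b[3])
    if xmin < xmax and ymin < ymax:
        return (xmin, ymin, xmax, ymax)
    return None


def spatial_join(r: List[Tuple[str, Rect]], s: List[Tuple[str, Rect]]) -> List[tuple]:
    """Nested-loop join that emits the intersection region of each matching pair."""
    result = []
    for name_r, rect_r in r:
        for name_s, rect_s in s:
            region = rect_intersection(rect_r, rect_s)
            if region is not None:
                result.append((name_r, name_s, region))
    return result


rainfall = [("low", (0, 0, 4, 4)), ("high", (4, 0, 8, 4))]
density = [("dense", (2, 1, 6, 3))]
print(spatial_join(rainfall, density))
```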

In general, queries on spatial data may have a combination of spatial and nonspatial requirements. For instance, we may want to ﬁnd the nearest restaurant that has vegetarian selections, and that charges less than $10 for a meal.

Since spatial data are inherently graphical, we usually query them by using a graphical query language. Results of such queries are also displayed graphically, rather than in tables. The user can invoke various operations on the interface, such as choosing an area to be viewed (for example, by pointing and clicking on suburbs west of Manhattan), zooming in and out, choosing what to display on the basis of selection conditions (for example, houses with more than three bedrooms), overlay of multiple maps (for example, houses with more than three bedrooms overlaid on a map showing areas with low crime rates), and so on. The graphical interface constitutes the front end. Extensions of SQL have been proposed to permit relational databases to store and retrieve spatial information efficiently, and also to allow queries to mix spatial and nonspatial conditions. Extensions include allowing abstract data types, such as lines, polygons, and bit maps, and allowing spatial conditions, such as contains or overlaps.

Indexing of Spatial Data

Indices are required for efficient access to spatial data. Traditional index structures, such as hash indices and B-trees, are not suitable, since they deal only with one-dimensional data, whereas spatial data are typically of two or more dimensions.

k-d Trees

To understand how to index spatial data consisting of two or more dimensions, we consider ﬁrst the indexing of points in one-dimensional data. Tree structures, such as binary trees and B-trees, operate by successively dividing space into smaller parts. For instance, each internal node of a binary tree partitions a one-dimensional interval in two. Points that lie in the left partition go into the left subtree; points that lie in the right partition go into the right subtree. In a balanced binary tree, the partition is chosen so that approximately one-half of the points stored in the subtree fall in each partition. Similarly, each level of a B-tree splits a one-dimensional interval into multiple parts.

We can use that intuition to create tree structures for two-dimensional space, as well as for higher-dimensional spaces. A tree structure called a k-d tree was one of the early structures used for indexing in multiple dimensions. Each level of a k-d tree partitions the space into two. The partitioning is done along one dimension at the node at the top level of the tree, along another dimension in nodes at the next level, and so on, cycling through the dimensions. The partitioning proceeds in such a way that, at each node, approximately one-half of the points stored in the subtree fall on one side, and one-half fall on the other. Partitioning stops when a node has no more than a given maximum number of points. Figure 23.4 shows a set of points in two-dimensional space, and a k-d tree representation of the set of points. Each line in the figure (other than the outside box) corresponds to a node in the k-d tree, and the maximum number of points in a leaf node has been set at 1. The numbering of the lines in the figure indicates the level of the tree at which the corresponding node appears.
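The construction just described can be sketched for two-dimensional points. This is a minimal in-memory illustration, not a disk-oriented implementation: the dictionary-based node layout, the function names, and the choice of the median point as the split position are assumptions. Partitioning stops once a node holds at most max_leaf points, and a range search is included to show how the partitioning is used.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def build_kd(points: List[Point], depth: int = 0, max_leaf: int = 1) -> Dict:
    """Build a k-d tree, cycling through the two dimensions level by level."""
    if len(points) <= max_leaf:
        return {"leaf": True, "points": list(points)}
    axis = depth % 2                       # 0 = split on x, 1 = split on y
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2                # about half the points on each side
    return {
        "leaf": False,
        "axis": axis,
        "split": ordered[mid][axis],
        "left": build_kd(ordered[:mid], depth + 1, max_leaf),
        "right": build_kd(ordered[mid:], depth + 1, max_leaf),
    }


def range_search(node: Dict, lo: Point, hi: Point) -> List[Point]:
    """Return all points p with lo[i] <= p[i] <= hi[i] in both dimensions."""
    if node["leaf"]:
        return [
            p for p in node["points"]
            if all(lo[i] <= p[i] <= hi[i] for i in range(2))
        ]
    found = []
    axis, split = node["axis"], node["split"]
    if lo[axis] <= split:                  # left side holds values <= split
        found += range_search(node["left"], lo, hi)
    if hi[axis] >= split:                  # right side holds values >= split
        found += range_search(node["right"], lo, hi)
    return found


tree = build_kd([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(sorted(range_search(tree, (3, 0), (8, 5))))
```

Because points that tie with the split value may land on either side, the search descends into both subtrees whenever the query range touches the split value.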

The k-d-B tree extends the k-d tree to allow multiple child nodes for each internal node, just as a B-tree extends a binary tree, to reduce the height of the tree. k-d-B trees are better suited for secondary storage than k-d trees.

Quadtrees

An alternative representation for two-dimensional data is a quadtree. An example of the division of space by a quadtree appears in Figure 23.5. The set of points is the same as that in Figure 23.4. Each node of a quadtree is associated with a rectangular region of space. The top node is associated with the entire target space. Each nonleaf node in a quadtree divides its region into four equal-sized quadrants, and correspondingly each such node has four child nodes corresponding to the four quadrants. Leaf nodes have between zero and some fixed maximum number of points. Correspondingly, if the region corresponding to a node has more than the maximum number of points, child nodes are created for that node. In the example in Figure 23.5, the maximum number of points in a leaf node is set to 1.
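A quadtree of this kind can be sketched as a small class. The class name, the square target space given by its lower-left corner and side length, and the rule that boundary points go to the upper or right quadrant are assumptions. The sketch also assumes that all inserted points are distinct, since duplicate points could never be separated by further subdivision.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]


class PointQuadtree:
    """Quadtree over the square [x, x + size) x [y, y + size)."""

    def __init__(self, x: float, y: float, size: float, max_points: int = 1):
        self.x, self.y, self.size = x, y, size
        self.max_points = max_points
        self.points: List[Point] = []
        self.children: Optional[List["PointQuadtree"]] = None  # None for a leaf

    def insert(self, p: Point) -> None:
        if self.children is not None:
            self._child_for(p).insert(p)
            return
        self.points.append(p)
        if len(self.points) > self.max_points:
            self._split()

    def _split(self) -> None:
        """Divide the region into four equal quadrants and push points down."""
        half = self.size / 2
        self.children = [
            PointQuadtree(self.x + dx * half, self.y + dy * half, half,
                          self.max_points)
            for dy in (0, 1) for dx in (0, 1)
        ]
        pending, self.points = self.points, []
        for p in pending:
            self._child_for(p).insert(p)

    def _child_for(self, p: Point) -> "PointQuadtree":
        half = self.size / 2
        col = 1 if p[0] >= self.x + half else 0
        row = 1 if p[1] >= self.y + half else 0
        return self.children[row * 2 + col]


tree = PointQuadtree(0, 0, 8)
for point in [(1, 1), (6, 6), (5, 1)]:
    tree.insert(point)
print([child.points for child in tree.children])
```

Note that the quadrant boundaries depend only on the region, not on where the points happen to fall.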

This type of quadtree is called a PR quadtree, to indicate that it stores points and that the division of space is based on regions, rather than on the actual set of points stored. We can use region quadtrees to store array (raster) information. A node in a region quadtree is a leaf node if all the array values in the region that it covers are the same. Otherwise, it is subdivided further into four children of equal area, and is therefore an internal node. Each node in the region quadtree corresponds to a subarray of values. The subarrays corresponding to leaves either contain just a single array element or have multiple array elements, all of which have the same value.
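A region quadtree for a small raster can be sketched recursively. The sketch assumes a square array whose side length is a power of two, and it encodes nodes as tagged tuples; both choices, like the function name and the sample array, are illustrative assumptions.

```python
from typing import List, Optional


def build_region_quadtree(grid: List[List[int]], row: int = 0, col: int = 0,
                          size: Optional[int] = None):
    """Return ("leaf", value) for a uniform subarray, else ("node", children)."""
    if size is None:
        size = len(grid)
    first = grid[row][col]
    uniform = all(
        grid[r][c] == first
        for r in range(row, row + size)
        for c in range(col, col + size)
    )
    if uniform:                      # a single element is always uniform
        return ("leaf", first)
    half = size // 2
    return ("node", [
        build_region_quadtree(grid, row, col, half),
        build_region_quadtree(grid, row, col + half, half),
        build_region_quadtree(grid, row + half, col, half),
        build_region_quadtree(grid, row + half, col + half, half),
    ])


raster = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 1, 2, 2],
    [0, 0, 2, 2],
]
print(build_region_quadtree(raster))
```

In this example three of the four quadrants are uniform and become leaves immediately; only the lower-left quadrant is subdivided down to single array elements.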

Indexing of line segments and polygons presents new problems. There are extensions of k-d trees and quadtrees for this task. However, a line segment or polygon may cross a partitioning line. If it does, it has to be split and represented in each of the subtrees in which its pieces occur. Multiple occurrences of a line segment or polygon can result in inefﬁciencies in storage, as well as inefﬁciencies in querying.

R-Trees

A storage structure called an R-tree is useful for indexing of rectangles and other polygons. An R-tree is a balanced tree structure with the indexed polygons stored in leaf nodes, much like a B+-tree. However, instead of a range of values, a rectangular bounding box is associated with each tree node. The bounding box of a leaf node is the smallest rectangle parallel to the axes that contains all objects stored in the leaf node. The bounding box of internal nodes is, similarly, the smallest rectangle parallel to the axes that contains the bounding boxes of its child nodes. The bounding box of a polygon is deﬁned, similarly, as the smallest rectangle parallel to the axes that contains the polygon.

Each internal node stores the bounding boxes of the child nodes along with the pointers to the child nodes. Each leaf node stores the indexed polygons, and may optionally store the bounding boxes of the polygons; the bounding boxes help speed up checks for overlaps of the rectangle with the indexed polygons: if a query rectangle does not overlap with the bounding box of a polygon, it cannot overlap with the polygon either. (If the indexed polygons are rectangles, there is of course no need to store bounding boxes since they are identical to the rectangles.)
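The bounding-box computation and the cheap overlap check it enables can be sketched as follows. The box encoding (xmin, ymin, xmax, ymax), the function names, and the sample polygons are assumptions; boxes that merely touch are counted as overlapping here.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def bounding_box(polygon: List[Point]) -> Box:
    """Smallest axis-parallel rectangle that contains all vertices."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return (min(xs), min(ys), max(xs), max(ys))


def boxes_overlap(a: Box, b: Box) -> bool:
    """True if the two rectangles share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def candidates(polygons: Dict[str, List[Point]], query: Box) -> List[str]:
    """Polygons whose bounding boxes overlap the query rectangle.

    A polygon that fails this test cannot overlap the query; one that
    passes still needs an exact polygon test, which is not shown here.
    """
    return [
        name for name, polygon in polygons.items()
        if boxes_overlap(bounding_box(polygon), query)
    ]


shapes = {
    "triangle": [(0, 0), (4, 0), (2, 3)],
    "square": [(10, 10), (12, 10), (12, 12), (10, 12)],
}
print(bounding_box(shapes["triangle"]))
print(candidates(shapes, (3, 2, 5, 5)))
```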

Figure 23.6 shows an example of a set of rectangles (drawn with a solid line) and the bounding boxes (drawn with a dashed line) of the nodes of an R-tree for the set of rectangles. Note that the bounding boxes are shown with extra space inside them, to make them stand out pictorially. In reality, the boxes would be smaller and ﬁt tightly on the objects that they contain; that is, each side of a bounding box B would touch at least one of the objects or bounding boxes that are contained in B.

The R-tree itself is at the right side of Figure 23.6. The figure refers to the coordinates of bounding box i as BBi.

We shall now see how to implement search, insert, and delete operations on an R-tree.

• Search: As the ﬁgure shows, the bounding boxes associated with sibling nodes may overlap; in B+-trees, k-d trees, and quadtrees, in contrast, the ranges do not overlap. A search for polygons containing a point therefore has to follow all child nodes whose associated bounding boxes contain the point; as a result, multiple paths may have to be searched. Similarly, a query to ﬁnd all polygons that intersect a given polygon has to go down every node where the associated rectangle intersects the polygon.

• Insert: When we insert a polygon into an R-tree, we select a leaf node to hold the polygon. Ideally we should pick a leaf node that has space to hold a new entry, and whose bounding box contains the bounding box of the polygon. However, such a node may not exist; even if it did, ﬁnding the node may be very expensive, since it is not possible to ﬁnd it by a single traversal down from the root. At each internal node we may ﬁnd multiple children whose bounding boxes contain the bounding box of the polygon, and each of these children needs to be explored. Therefore, as a heuristic, in a traversal from the root, if any of the child nodes has a bounding box containing the bounding box of the polygon, the R-tree algorithm chooses one of them arbitrarily. If none of the children satisfy this condition, the algorithm chooses a child node whose bounding box has the maximum overlap with the bounding box of the polygon for continuing the traversal.

Once the leaf node has been reached, if the node is already full, the algorithm performs node splitting (and propagates splitting upward if required) in a manner very similar to B+-tree insertion. Just as with B+-tree insertion, the R-tree insertion algorithm ensures that the tree remains balanced. Additionally, it ensures that the bounding boxes of leaf nodes, as well as internal nodes, remain consistent; that is, the bounding box of a leaf contains all the bounding boxes of the polygons stored at the leaf, while the bounding box of an internal node contains all the bounding boxes of its child nodes.

The main difference between the R-tree insertion procedure and the B+-tree insertion procedure lies in how the node is split. In a B+-tree, it is possible to find a value such that half the entries are less than the value and half are greater than it. This property does not generalize beyond one dimension; that is, for more than one dimension, it is not always possible to split the entries into two sets so that their bounding boxes do not overlap. Instead, as a heuristic, the set of entries S can be split into two disjoint sets S1 and S2 so that the bounding boxes of S1 and S2 have the minimum total area; another heuristic would be to split the entries into two sets S1 and S2 in such a way that the bounding boxes of S1 and S2 have minimum overlap. The two nodes resulting from the split would contain the entries in S1 and S2 respectively. The cost of finding splits with minimum total area or overlap can itself be large, so cheaper heuristics, such as the quadratic split heuristic, are used. (The heuristic gets its name from the fact that it takes time quadratic in the number of entries.)

The quadratic split heuristic works this way: First, it picks a pair of entries a and b from S such that putting them in the same node would result in a bounding box with the maximum wasted space; that is, the area of the minimum bounding box of a and b minus the sum of the areas of a and b is the largest. The heuristic places the entries a and b in sets S1 and S2 respectively.

It then iteratively adds the remaining entries, one entry per iteration, to one of the two sets S1 or S2. At each iteration, for each remaining entry e, let i_{e,1} denote the increase in the size of the bounding box of S1 if e is added to S1, and let i_{e,2} denote the corresponding increase for S2. In each iteration, the heuristic chooses one of the entries with the maximum difference between i_{e,1} and i_{e,2}, and adds it to S1 if i_{e,1} is less than i_{e,2}, and to S2 otherwise. That is, an entry with "maximum preference" for one of S1 or S2 is chosen at each iteration. The iteration stops when all entries have been assigned, or when one of the sets S1 or S2 has enough entries that all remaining entries have to be added to the other set so that the nodes constructed from S1 and S2 both have the required minimum occupancy. The heuristic then adds all unassigned entries to the set with fewer entries.

• Deletion: Deletion can be performed like a B+-tree deletion, borrowing entries from sibling nodes, or merging sibling nodes if a node becomes underfull. An alternative approach redistributes all the entries of underfull nodes to sibling nodes, with the aim of improving the clustering of entries in the R-tree.
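The traversal rules in the Search and Insert items above can be sketched as follows. This is a minimal in-memory illustration, not a definitive implementation: the Node class, the obj field, the function names, and the sample tree are all assumptions for this example. The point search filters on bounding boxes only; a real system would follow it with an exact geometric test on each candidate polygon. In choose_child, "arbitrarily" is taken to mean the first qualifying child.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Rect = Tuple[float, float, float, float]  # assumed layout: (xmin, ymin, xmax, ymax)
Point = Tuple[float, float]


@dataclass
class Node:
    bbox: Rect
    children: List["Node"] = field(default_factory=list)
    obj: Optional[str] = None  # set only on data entries (hypothetical object id)


def contains_point(r: Rect, p: Point) -> bool:
    return r[0] <= p[0] <= r[2] and r[1] <= p[1] <= r[3]


def contains_rect(outer: Rect, inner: Rect) -> bool:
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def overlap_area(a: Rect, b: Rect) -> float:
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return w * h if w > 0 and h > 0 else 0.0


def search_point(node: Node, p: Point) -> List[str]:
    """Collect ids of data entries whose bounding box contains p.

    Every child whose box contains p is followed, so several
    root-to-leaf paths may be explored.
    """
    if not contains_point(node.bbox, p):
        return []
    if node.obj is not None:
        return [node.obj]
    hits: List[str] = []
    for child in node.children:
        hits.extend(search_point(child, p))
    return hits


def choose_child(node: Node, rect: Rect) -> Node:
    """Pick the child to descend into when inserting rect."""
    for child in node.children:
        if contains_rect(child.bbox, rect):
            return child  # any containing child would do; take the first
    # No child contains rect: fall back to the maximum-overlap rule.
    return max(node.children, key=lambda c: overlap_area(c.bbox, rect))


# Hypothetical two-leaf tree whose leaf bounding boxes overlap.
leaf_a = Node((0, 0, 5, 5), [Node((0, 0, 2, 2), obj="a1"),
                             Node((3, 3, 5, 5), obj="a2")])
leaf_b = Node((4, 1, 9, 7), [Node((4, 4, 7, 7), obj="b1"),
                             Node((8, 1, 9, 2), obj="b2")])
root = Node((0, 0, 9, 7), [leaf_a, leaf_b])

print(search_point(root, (4.5, 4.5)))  # ['a2', 'b1']: both leaves are visited
```

With this sample tree, the query point lies in the overlap of the two leaf bounding boxes, so the search cannot stop after the first leaf.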

See the bibliographical references for more details on insertion and deletion operations on R-trees, as well as on variants of R-trees, such as R*-trees and R+-trees.
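The quadratic split heuristic described under Insert can be sketched as below. This is an illustrative reading of the description, under stated assumptions: entries are given directly as bounding boxes in (xmin, ymin, xmax, ymax) form, "size" of a bounding box is taken to be its area, ties are broken by list order, and the names quadratic_split, growth, and min_fill are hypothetical.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # assumed layout: (xmin, ymin, xmax, ymax)


def area(r: Rect) -> float:
    return (r[2] - r[0]) * (r[3] - r[1])


def union(a: Rect, b: Rect) -> Rect:
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def growth(box: Rect, r: Rect) -> float:
    """Increase in bounding-box area if r joins a set currently bounded by box."""
    return area(union(box, r)) - area(box)


def quadratic_split(entries: List[Rect], min_fill: int) -> Tuple[List[Rect], List[Rect]]:
    n = len(entries)
    # Seed step: the pair that would waste the most space if kept together.
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    a, b = max(
        pairs,
        key=lambda p: area(union(entries[p[0]], entries[p[1]]))
        - area(entries[p[0]]) - area(entries[p[1]]),
    )
    s1, s2 = [entries[a]], [entries[b]]
    box1, box2 = entries[a], entries[b]
    rest = [e for k, e in enumerate(entries) if k not in (a, b)]

    while rest:
        # Stop early if one set needs every remaining entry to reach min_fill.
        if len(s1) + len(rest) <= min_fill:
            s1.extend(rest)
            break
        if len(s2) + len(rest) <= min_fill:
            s2.extend(rest)
            break
        # Pick the entry with the strongest preference for one of the sets.
        e = max(rest, key=lambda r: abs(growth(box1, r) - growth(box2, r)))
        rest.remove(e)
        if growth(box1, e) < growth(box2, e):
            s1.append(e)
            box1 = union(box1, e)
        else:
            s2.append(e)
            box2 = union(box2, e)
    return s1, s2


# Hypothetical overflowing node with two well-separated clusters.
entries = [(0, 0, 1, 1), (1, 1, 2, 2), (10, 10, 11, 11), (11, 11, 12, 12)]
s1, s2 = quadratic_split(entries, min_fill=2)
print(sorted(s1))  # [(0, 0, 1, 1), (1, 1, 2, 2)]
print(sorted(s2))  # [(10, 10, 11, 11), (11, 11, 12, 12)]
```

The seed search examines every pair of entries and each later iteration rescans the remaining entries, which is where the quadratic cost comes from.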

The storage efficiency of R-trees is better than that of k-d trees or quadtrees, since a polygon is stored only once, and we can ensure easily that each node is at least half full. However, querying may be slower, since multiple paths have to be searched. Spatial joins are simpler with quadtrees than with R-trees, since all quadtrees on a region are partitioned in the same manner. However, because of their better storage efficiency, and their similarity to B-trees, R-trees and their variants have proved popular in database systems that support spatial data.
