Application Development and Administration:Advanced Querying and Information Retrieval.

Advanced Querying and Information Retrieval

Businesses have begun to exploit the burgeoning data online to make better decisions about their activities, such as what items to stock and how best to target customers to increase sales. Many of their queries are rather complicated, however, and certain types of information cannot be extracted even by using SQL.

Several techniques and tools are available to help with decision support. Several tools for data analysis allow analysts to view data in different ways. Other analysis tools precompute summaries of very large amounts of data, in order to give fast responses to queries. The SQL:1999 standard now contains additional constructs to support data analysis. Another approach to getting knowledge from data is to use data mining, which aims at detecting various types of patterns in large volumes of data. Data mining supplements various types of statistical techniques with similar goals.

Textual data, too, has grown explosively. Textual data is unstructured, unlike the rigidly structured data in relational databases. Querying of unstructured textual data is referred to as information retrieval. Information retrieval systems have much in common with database systems—in particular, the storage and retrieval of data on secondary storage. However, the emphasis in the ﬁeld of information systems is different from that in database systems, concentrating on issues such as querying based on keywords; the relevance of documents to the query; and the analysis, classiﬁcation, and indexing of documents.

This chapter covers decision support, including online analytical processing and data mining and information retrieval.

Decision-Support Systems

Database applications can be broadly classiﬁed into transaction processing and decision support, as we have seen earlier in Section 21.3.2. Transaction-processing systems are widely used today, and companies have accumulated a vast amount of in- formation generated by these systems.

For example, company databases often contain enormous quantities of information about customers and transactions. The size of the information storage required may range up to hundreds of gigabytes, or even terabytes, for large retail chains. Transaction information for a retailer may include the name or identiﬁer (such as credit-card number) of the customer, the items purchased, the price paid, and the dates on which the purchases were made. Information about the items purchased may include the name of the item, the manufacturer, the model number, the color, and the size. Customer information may include credit history, annual income, residence, age, and even educational background.

Such large databases can be treasure troves of information for making business decisions, such as what items to stock and what discounts to offer. For instance, a retail company may notice a sudden spurt in purchases of ﬂannel shirts in the Paciﬁc Northwest, may realize that there is a trend, and may start stocking a larger number of such shirts in shops in that area. As another example, a car company may ﬁnd, on querying its database, that most of its small sports cars are bought by young women whose annual incomes are above $50,000. The company may then target its marketing to attract more such women to buy its small sports cars, and may avoid wasting money trying to attract other categories of people to buy those cars. In both cases, the company has identiﬁed patterns in customer behavior, and has used the patterns to make business decisions.

The storage and retrieval of data for decision support raises several issues:

• Although many decision support queries can be written in SQL, others either cannot be expressed in SQL or cannot be expressed easily in SQL. Several SQL extensions have therefore been proposed to make data analysis easier. The area of online analytical processing (OLAP) deals with tools and techniques for data analysis that can give nearly instantaneous answers to queries request- ing summarized data, even though the database may be extremely large. In Section 22.2, we study SQL extensions for data analysis, and techniques for online analytical processing.

• Database query languages are not suited to the performance of detailed statistical analyses of data. There are several packages, such as SAS and S++, that help in statistical analysis. Such packages have been interfaced with databases, to allow large volumes of data to be stored in the database and retrieved efﬁciently for analysis. The ﬁeld of statistical analysis is a large discipline on its own; see the references in the bibliographical notes for more information.

• Knowledge-discovery techniques attempt to discover automatically statistical rules and patterns from data. The ﬁeld of data mining combines knowledge discovery techniques invented by artiﬁcial intelligence researchers and statistical analysts, with efﬁcient implementation techniques that enable them to be used on extremely large databases. Section 22.3 discusses data mining.

• Large companies have diverse sources of data that they need to use for making business decisions. The sources may store the data under different schemas. For performance reasons (as well as for reasons of organization control), the data sources usually will not permit other parts of the company to retrieve data on demand.

To execute queries efﬁciently on such diverse data, companies have built data warehouses. Data warehouses gather data from multiple sources under a uniﬁed schema, at a single site. Thus, they provide the user a single uniform interface to data. We study issues in building and maintaining a data ware- house in Section 22.4.

The area of decision support can be broadly viewed as covering all the above areas, although some people use the term in a narrower sense that excludes statistical analysis and data mining.

Data Analysis and OLAP

Although complex statistical analysis is best left to statistics packages, databases should support simple, commonly used, forms of data analysis. Since the data stored in databases are usually large in volume, they need to be summarized in some fashion if we are to derive information that humans can use.

OLAP tools support interactive analysis of summary information. Several SQL ex- tensions have been developed to support OLAP tools. There are many commonly used tasks that cannot be done using the basic SQL aggregation and grouping facilities. Examples include ﬁnding percentiles, or cumulative distributions, or aggregates over sliding windows on sequentially ordered data. A number of extensions of SQL have been recently proposed to support such tasks, and implemented in products such as Oracle and IBM DB2.

Online Analytical Processing

Statistical analysis often requires grouping on multiple attributes. Consider an application where a shop wants to ﬁnd out what kinds of clothes are popular. Let us suppose that clothes are characterized by their item-name, color, and size, and that we have a relation sales with the schema sales(item-name, color, size, number). Suppose that item-name can take on the values (skirt, dress, shirt, pant), color can take on the values (dark, pastel, white), and size can take on values (small, medium, large).

Given a relation used for data analysis, we can identify some of its attributes as measure attributes, since they measure some value, and can be aggregated upon. For instance, the attribute number of the sales relation is a measure attribute, since it measures the number of units sold. Some (or all) of the other attributes of the relation are identiﬁed as dimension attributes, since they deﬁne the dimensions on which measure attributes, and summaries of measure attributes, are viewed. In the sales relation, item-name, color, and size are dimension attributes. (A more realistic version of the sales relation would have additional dimensions, such as time and sales location, and additional measures such as monetar y value of the sale.)

Data that can be modeled as dimension attributes and measure attributes are called multidimensional data.

To analyze the multidimensional data, a manager may want to see data laid out as shown in the table in Figure 22.1. The table shows total numbers for different combinations of item-name and color. The value of size is speciﬁed to be all, indicating that the displayed values are a summary across all values of size.

The table in Figure 22.1 is an example of a cross-tabulation (or cross-tab, for short), also referred to as a pivot-table. In general, a cross-tab is a table where values for one attribute (say A) form the row headers, values for another attribute (say B) form the column headers, and the values in an individual cell are derived as follows. Each cell can be identiﬁed by (ai, bj ), where ai is a value for A and bj a value for B. If there is at most one tuple with any (ai, bj ) value, the value in the cell is derived from that single tuple (if any); for instance, it could be the value of one or more other attributes of the tuple. If there can be multiple tuples with an (ai, bj ) value, the value in the cell must be derived by aggregation on the tuples with that value. In our example, the aggregation used is the sum of the values for attribute number. In our example, the cross-tab also has an extra column and an extra row storing the totals of the cells in the row/column. Most cross-tabs have such summary rows and columns.

A cross-tab is different from relational tables usually stored in databases, since the number of columns in the cross-tab depends on the actual data. A change in the data values may result in adding more columns, which is not desirable for data storage.

However, a cross-tab view is desirable for display to users. It is straightforward to represent a cross-tab without summary values in a relational form with a ﬁxed number of columns. A cross-tab with summary rows/columns can be represented by introducing a special value all to represent subtotals, as in Figure 22.2. The SQL:1999 standard actually uses the null value in place of all, but to avoid confusion with regular null values, we shall continue to use all.

Consider the tuples (skirt, all, 53) and (dress, all, 35). We have obtained these tuples by eliminating individual tuples with different values for color, and by replacing the value of number by an aggregate—namely, sum. The value all can be thought of as representing the set of all values for an attribute. Tuples with the value all only for the color dimension can be obtained by an SQL query performing a group by on the column item-name. Similarly, a group by on color can be used to get the tuples with the value all for item-name, and a group by with no attributes (which can simply be omitted in SQL) can be used to get the tuple with value all for item-name and color.

The generalization of a cross-tab, which is 2-dimensional, to n dimensions can be visualized as an n-dimensional cube, called the data cube. Figure 22.3 shows a data cube on the sales relation. The data cube has three dimensions, namely item-name, color, and size, and the measure attribute is number. Each cell is identiﬁed by values for these three dimensions. Each cell in the data cube contains a value, just as in a cross-tab. In Figure 22.3, the value contained in a cell is shown on one of the faces of the cell; other faces of the cell are shown blank if they are visible.

The value for a dimension may be all, in which case the cell contains a summary over all values of that dimension, as in the case of cross-tabs. The number of different ways in which the tuples can be grouped for aggregation can be large. In fact, for a table with n dimensions, aggregation can be performed with grouping on each of the 2n subsets of the n dimensions.1 An online analytical processing or OLAP system is an interactive system that permits an analyst to view different summaries of multidimensional data. The word online indicates that the an analyst must be able to request new summaries and get responses online, within a few seconds, and should not be forced to wait for a long time to see the result of a query.

With an OLAP system, a data analyst can look at different cross-tabs on the same data by interactively selecting the attributes in the cross-tab. Each cross-tab is a

two-dimensional view on a multidimensional data cube. For instance the analyst may select a cross-tab on item-name and size, or a cross-tab on color and size. The operation of changing the dimensions used in a cross-tab is called pivoting.

An OLAP system provides other functionality as well. For instance, the analyst may wish to see a cross-tab on item-name and color for a ﬁxed value of size, for ex- ample, large, instead of the sum across all sizes. Such an operation is referred to as slicing, since it can be thought of as viewing a slice of the data cube. The operation is sometimes called dicing, particularly when values for multiple dimensions are ﬁxed.

When a cross-tab is used to view a multidimensional cube, the values of dimension attributes that are not part of the cross-tab are shown above the cross-tab. The value of such an attribute can be all, as shown in Figure 22.1, indicating that data in the crosstab are a summary over all values for the attribute. Slicing/dicing simply consists of selecting speciﬁc values for these attributes, which are then displayed on top of the cross-tab.

OLAP systems permit users to view data at any desired level of granularity. The operation of moving from ﬁner-granularity data to a coarser granularity (by means of aggregation) is called a rollup. In our example, starting from the data cube on the sales table, we got our example cross-tab by rolling up on the attribute size. The opposite operation—that of moving from coarser-granularity data to ﬁner-granularity data—is called a drill down. Clearly, ﬁner-granularity data cannot be generated from coarse-granularity data; they must be generated either from the original data, or from even ﬁner-granularity summary data.

Analysts may wish to view a dimension at different levels of detail. For instance, an attribute of type datetime contains a date and a time of day. Using time precise to a second (or less) may not be meaningful: An analyst who is interested in rough time

of day may look at only the hour value. An analyst who is interested in sales by day of the week may map the date to a day-of-the-week and look only at that. Another analyst may be interested in aggregates over a month, or a quarter, or for an entire year.

The different levels of detail for an attribute can be organized into a hierarchy.

Figure 22.4(a) shows a hierarchy on the datetime attribute. As another example, Figure 22.4(b) shows a hierarchy on location, with the city being at the bottom of the hierarchy, state above it, country at the next level, and region being the top level. In our earlier example, clothes can be grouped by category (for instance, menswear or womenswear); category would then lie above item-name in our hierarchy on clothes.

At the level of actual values, skirts and dresses would fall under the womenswear category and pants and shirts under the menswear category.

An analyst may be interested in viewing sales of clothes divided as menswear and womenswear, and not interested in individual values. After viewing the aggregates at the level of womenswear and menswear, an analyst may drill down the hierarchy to look at individual values. An analyst looking at the detailed level may drill up the hierarchy, and look at coarser-level aggregates. Both levels can be displayed on the same cross-tab, as in Figure 22.5.

OLAP Implementation

The earliest OLAP systems used multidimensional arrays in memory to store data cubes, and are referred to as multidimensional OLAP (MOLAP) systems. Later, OLAP facilities were integrated into relational systems, with data stored in a relational data- base. Such systems are referred to as relational OLAP (ROLAP) systems. Hybrid

systems, which store some summaries in memory and store the base data and other summaries in a relational database, are called hybrid OLAP (HOLAP) systems.

Many OLAP systems are implemented as client–server systems. The server contains the relational database as well as any MOLAP data cubes. Client systems obtain views of the data by communicating with the server.

A naıve way of computing the entire data cube (all groupings) on a relation is to use any standard algorithm for computing aggregate operations, one grouping at a time. The naıve algorithm would require a large number of scans of the relation. A simple optimization is to compute an aggregation on, say, (item-name, color) from an aggregation (item-name, color, size), instead of from the original relation. For the standard SQL aggregate functions, we can compute an aggregate with grouping on a set of attributes A from an aggregate with grouping on a set of attributes B if A ⊆ B; you can do so as an exercise (see Exercise 22.1), but note that to compute avg, we additionally need the count value. (For some non-standard aggregate functions, such as median, aggregates cannot be computed as above; the optimization described here do not apply to such “non-decomposable” aggregate functions.) The amount of data read drops signiﬁcantly by computing an aggregate from another aggregate, instead of from the original relation. Further improvements are possible; for instance, multiple groupings can be computed on a single scan of the data. See the bibliographical notes for references to algorithms for efﬁciently computing data cubes.

Early OLAP implementations precomputed and stored entire data cubes, that is, groupings on all subsets of the dimension attributes. Precomputation allows OLAP queries to be answered within a few seconds, even on datasets that may contain millions of tuples adding up to gigabytes of data. However, there are 2n groupings with n dimension attributes; hierarchies on attributes increase the number further.

As a result, the entire data cube is often larger than the original relation that formed the data cube and in many cases it is not feasible to store the entire data cube.

Instead of precomputing and storing all possible groupings, it makes sense to precompute and store some of the groupings, and to compute others on demand. Instead of computing queries from the original relation, which may take a very long time, we can compute them from other precomputed queries. For instance, suppose a query requires summaries by (item-name, color), which has not been precomputed.

The query result can be computed from summaries by (item-name, color, size), if that has been precomputed. See the bibliographical notes for references on how to select a good set of groupings for precomputation, given limits on the storage available for precomputed results.

The data in a data cube cannot be generated by a single SQL query using the basic group by constructs, since aggregates are computed for several different groupings of the dimension attributes. Section 22.2.3 discusses SQL extensions to support OLAP functionality.

Extended Aggregation

The SQL-92 aggregation functionality is limited, so several extensions were implemented by different databases. The SQL:1999 standard, however, deﬁnes a rich set of aggregate functions, which we outline in this section and in the next two sections. The Oracle and IBM DB2 databases support most of these features, and other databases will no doubt support these features in the near future.

The new aggregate functions on single attributes are standard deviation and variance (stddev and variance). Standard deviation is the square root of variance.2 Some database systems support other aggregate functions such as median and mode. Some database systems even allow users to add new aggregate functions.

SQL:1999 also supports a new class of binary aggregate functions, which can compute statistical results on pairs of attributes; they include correlations, covariances, and regression curves, which give a line approximating the relation between the values of the pair of attributes. Deﬁnitions of these functions may be found in any standard textbook on statistics, such as those referenced in the bibliographical notes.

SQL:1999 also supports generalizations of the group by construct, using the cube and rollup constructs. A representative use of the cube construct is:

select item-name, color, size, sum(number) from sales

group by cube(item-name, color, size)

This query computes the union of eight different groupings of the sales relation:

{ (item-name, color, size), (item-name, color), (item-name, size), (color, size), (item-name), (color), (size), () }

where () denotes an empty group by list.

For each grouping, the result contains the null value for attributes not present in the grouping. For instance, the table in Figure 22.2, with occurrences of all replaced by null, can be computed by the query select item-name, color, sum(number) from sales group by cube(item-name, color)

2. The SQL:1999 standard actually supports two types of variance, called population variance and sample variance, and correspondingly two types of standard deviation. The deﬁnitions of the two types differ slightly; see a statistics textbook for details.

A representative rollup construct is select item-name, color, size, sum(number) from sales group by rollup(item-name, color, size)

Here, only four groupings are generated:

{ (item-name, color, size), (item-name, color), (item-name), () }

Rollup can be used to generate aggregates at multiple levels of a hierarchy on a column. For instance, suppose we have a table itemcategory(item-name, category) giving the category of each item. Then the query select category, item-name, sum(number) from sales, category

where sales.item-name = itemcategory.item-name

group by rollup(category, item-name)

would give a hierarchical summary by item-name and by category.

Multiple rollups and cubes can be used in a single group by clause. For instance, the following query

select item-name, color, size, sum(number) from sales group by rollup(item-name), rollup(color, size)

generates the groupings

{ (item-name, color, size), (item-name, color), (item-name), (color, size), (color), () }

To understand why, note that rollup(item-name) generates two groupings, {(item- name), ()}, and rollup(color, size) generates three groupings, {(color, size), (color), () }.

The cross product of the two gives us the six groupings shown.

As we mentioned in Section 22.2.1, SQL:1999 uses the value null to indicate the usual sense of null as well as all. This dual use of null can cause ambiguity if the attributes used in a rollup or cube clause contain null values. The function grouping can be applied on an attribute; it returns 1 if the value is a null value representing all, and returns 0 in all other cases. Consider the following query:

select item-name, color, size, sum(number), grouping(item-name) as item-name-ﬂag, grouping(color) as color-ﬂag, grouping(size) as size-

The output is the same as in the version of the query without grouping, but with three extra columns called item-name-ﬂag, color-ﬂag, and size-ﬂag. In each tuple, the value of a ﬂag ﬁeld is 1 if the corresponding ﬁeld is a null representing all.

Instead of using tags to indicate nulls that represent all, we can replace the null value by a value of our choice:

decode(grouping(item-name), 1, ’all’, item-name)

This expression returns the value “all” if the value of item-name is a null correspond- ing to all, and returns the actual value of item-name otherwise. This expression can be used in place of item-name in the select clause to get “all” in the output of the query, in place of nulls representing all.

Neither the rollup nor the cube clause gives complete control on the groupings that are generated. For instance, we cannot use them to specify that we want only groupings {(color, size), (size, item-name)}. Such restricted groupings can be generated by using the grouping construct in the having clause; we leave the details as an exercise for you.

Ranking

Finding the position of a value in a larger set is a common operation. For instance, we may wish to assign students a rank in class based on their total marks, with the rank 1 going to the student with the highest marks, the rank 2 to the student with the next highest marks, and so on. While such queries can be expressed in SQL-92, they are difﬁcult to express and inefﬁcient to evaluate. Programmers often resort to writing the query partly in SQL and partly in a programming language. A related type of query is to ﬁnd the percentile in which a value in a (multi)set belongs, for example, the bottom third, middle third, or top third. We study SQL:1999 support for these types of queries here.

Ranking is done in conjunction with an order by speciﬁcation. Suppose we are given a relation student-marks(student-id, marks) which stores the marks obtained by each student. The following query gives the rank of each student.

select student-id, rank() over (order by (marks) desc) as s-rank

from student-marks

Note that the order of tuples in the output is not deﬁned, so they may not be sorted by rank. An extra order by clause is needed to get them in sorted order, as shown below.

select student-id, rank () over (order by (marks) desc) as s-rank

from student-marks order by s-rank

A basic issue with ranking is how to deal with the case of multiple tuples that are the same on the ordering attribute(s). In our example, this means deciding what to do if there are two students with the same marks. The rank function gives the same rank to all tuples that are equal on the order by attributes. For instance, if the highest mark is shared by two students, both would get rank 1. The next rank given would be 3, not 2, so if three students get the next highest mark, they would all get rank 3, and the next student(s) would get rank 5, and so on. There is also a dense rank function that does not create gaps in the ordering. In the above example, the tuples with the second highest value all get rank 2, and tuples with the third highest value get rank 3, and so on.

Ranking can be done within partitions of the data. For instance, suppose we have an additional relation student-section(student-id, section) that stores for each student the section in which the student studies. The following query then gives the rank of

students within each section.

select student-id, section,

rank () over (partition by section order by marks desc) as sec-rank

from student-marks, student-section

where student-marks.student-id = student-section.student-id

order by section, sec-rank

The outer order by clause orders the result tuples by section, and within each section by the rank.

Multiple rank expressions can be used within a single select statement; thus we can obtain the overall rank and the rank within the section by using two rank expressions in the same select clause. An interesting question is what happens when ranking (possibly with partitioning) occurs along with a group by clause. In this case, the group by clause is applied ﬁrst, and partitioning and ranking are done on the results of the group by. Thus aggregate values can then be used for ranking. For example, suppose we had marks for each student for each of several subjects. To rank students by the sum of their marks in different subjects, we can use a group by clause to com- pute the aggregate marks for each student, and then rank students by the aggregate sum. We leave details as an exercise for you.

The ranking functions can be used to ﬁnd the top n tuples by embedding a ranking query within an outer-level query; we leave details as an exercise. Note that bottom n is simply the same as top n with a reverse sorting order. Several database systems provide nonstandard SQL extensions to specify directly that only the top n results are required; such extensions do not require the rank function, and simplify the job of the optimizer, but are (currently) not as general since they do not support partitioning.

SQL:1999 also speciﬁes several other functions that can be used in place of rank. For instance, percent rank of a tuple gives the rank of the tuple as a fraction. If there are n tuples in the partition3 and the rank of the tuple is r, then its percent rank is deﬁned as (r − 1)/(n − 1) (and as null if there is only one tuple in the partition). The function cume dist, short for cumulative distribution, for a tuple is deﬁned as p/n where p is the number of tuples in the partition with ordering values preceding or equal to the ordering value of the tuple, and n is the number of tuples in the partition. The function row number sorts the rows and gives each row a unique number corresponding to its position in the sort order; different rows with the same ordering value would get different row numbers, in a nondeterministic fashion.

Finally, for a given constant n, the ranking function ntile(n) takes the tuples in each partition in the speciﬁed order, and divides them into n buckets with equal numbers of tuples.4 For each tuple, ntile(n) then gives the number of the bucket in which it is placed, with bucket numbers starting with 1. This function is particularly useful for constructing histograms based on percentiles. For instance, we can sort employees by salary, and use ntile(3) to ﬁnd which range (bottom third, middle third, or top third) each employee is in, and compute the total salary earned by employees in each range:

select threetile, sum(salary) from (

select salary, ntile(3) over (order by (salary)) as threetile

from employee) as s

group by threetile.

The presence of null values can complicate the deﬁnition of rank, since it is not clear where they should occur ﬁrst in the sort order. SQL:1999 permits the user to specify where they should occur by using nulls ﬁrst or nulls last, for instance

select student-id, rank () over (order by marks desc nulls last) as s-rank

from student-marks

Windowing

An example of a window query is query that, given sales values for each date, calculates for each date the average of the sales on that day, the previous day, and the next day; such moving average queries are used to smooth out random variations. Another example of a window query is one that ﬁnds the cumulative balance in an account, given a relation specifying the deposits and withdrawals on an account. Such queries are either hard or impossible (depending on the exact query) to express in basic SQL.

SQL:1999 provides a windowing feature to support such queries. In contrast to group by, the same tuple can exist in multiple windows. Suppose we are given a relation transaction(account-number, date-time, value), where value is positive for a de-

posit and negative for a withdrawal. We assume there is at most one transaction per

date-time value.

Consider the query

select account-number, date-time,

sum(value) over

(partition by account-number

order by date-time

rows unbounded preceding)

as balance

from transaction

order by account-number, date-time

The query gives the cumulative balances on each account just before each transaction on the account; the cumulative balance of the account is the sum of values of all earlier transactions on the account.

The partition by clause partitions tuples by account number, so for each row only the tuples in its partition are considered. A window is created for each tuple; the key- words rows unbounded preceding specify that the window for each tuple consists of all tuples in the partition that precede it in the speciﬁed order (here, increasing order of date-time). The aggregate function sum(value) is applied on all the tuples in the window. Observe that the query does not use a group by clause, since there is an output tuple for each tuple in the transaction relation.

While the query could be written without these extended constructs, it would be rather difﬁcult to formulate. Note also that different windows can overlap, that is, a tuple may be present in more than one window.

Other types of windows can be speciﬁed. For instance, to get a window containing the previous 10 rows for each row, we can specify rows 10 preceding. To get a window containing the current, previous, and following row, we can use between rows

1 preceding and 1 following. To get the previous rows and the current row, we can say between rows unbounded preceding and current. Note that if the ordering is on a nonkey attribute, the result is not deterministic, since the order of tuples is not fully deﬁned.

We can even specify windows by ranges of values, instead of numbers of rows. For instance, suppose the ordering value of a tuple is v; then range between 10 preceding and current row would give tuples whose ordering value is between v − 10 and v

(both values inclusive). When dealing with dates, we can use range interval 10 day preceding to get a window containing tuples within the previous 10 days, but not including the date of the tuple.

Clearly, the windowing functionality of SQL:1999 is very rich and can be used to write rather complex queries with a small amount of effort.

Search This Blog

Database Management System course