Application Development and Administration:Data Warehousing

Data Warehousing

Large companies have presences in many places, each of which may generate a large volume of data. For instance, large retail chains have hundreds or thousands of stores, whereas insurance companies may have data from thousands of local branches. Further, large organizations have a complex internal organization structure, and there-

fore different data may be present in different locations, or on different operational systems, or under different schemas. For instance, manufacturing-problem data and customer-complaint data may be stored on different database systems. Corporate decision makers require access to information from all such sources. Setting up queries on individual sources is both cumbersome and inefﬁcient. Moreover, the sources of data may store only current data, whereas decision makers may need access to past data as well; for instance, information about how purchase patterns have changed in the past year could be of great importance. Data warehouses provide a solution to these problems.

A data warehouse is a repository (or archive) of information gathered from multiple sources, stored under a uniﬁed schema, at a single site. Once gathered, the data are stored for a long time, permitting access to historical data. Thus, data warehouses nprovide the user a single consolidated interface to data, making decision-support queries easier to write. Moreover, by accessing information for decision support from a data warehouse, the decision maker ensures that online transaction-processing systems are not affected by the decision-support workload.

Components of a Data Warehouse

Figure 22.8 shows the architecture of a typical data warehouse, and illustrates the gathering of data, the storage of data, and the querying and data-analysis support. Among the issues to be addressed in building a warehouse are the following:

• When and how to gather data. In a source-driven architecture for gathering data, the data sources transmit new information, either continually (as transaction processing takes place), or periodically (nightly, for example). In a destination-driven architecture, the data warehouse periodically sends re- quests for new data to the sources.

Unless updates at the sources are replicated at the warehouse via two-phase commit, the warehouse will never be quite up to date with the sources. Two- phase commit is usually far too expensive to be an option, so data warehouses typically have slightly out-of-date data. That, however, is usually not a problem for decision-support systems.

• What schema to use. Data sources that have been constructed independently are likely to have different schemas. In fact, they may even use different data models. Part of the task of a warehouse is to perform schema integration, and to convert data to the integrated schema before they are stored. As a result, the data stored in the warehouse are not just a copy of the data at the sources. In- stead, they can be thought of as a materialized view of the data at the sources.

• Data cleansing. The task of correcting and preprocessing data is called data cleansing. Data sources often deliver data with numerous minor inconsistencies, that can be corrected. For example, names are often misspelled, and ad- dresses may have street/area/city names misspelled, or zip codes entered in- correctly. These can be corrected to a reasonable extent by consulting a data- base of street names and zip codes in each city. Address lists collected from multiple sources may have duplicates that need to be eliminated in a merge– purge operation. Records for multiple individuals in a house may be grouped together so only one mailing is sent to each house; this operation is called householding.

• How to propagate updates. Updates on relations at the data sources must be propagated to the data warehouse. If the relations at the data warehouse are exactly the same as those at the data source, the propagation is straight- forward. If they are not, the problem of propagating updates is basically the view-maintenance problem, which was discussed in Section 14.5.

• What data to summarize. The raw data generated by a transaction-processing system may be too large to store online. However, we can answer many queries by maintaining just summary data obtained by aggregation on a relation, rather than maintaining the entire relation. For example, instead of storing data about every sale of clothing, we can store total sales of clothing by item- name and category.

Suppose that a relation r has been replaced by a summary relation s. Users may still be permitted to pose queries as though the relation r were available online. If the query requires only summary data, it may be possible to trans- form it into an equivalent one using s instead; see Section 14.5.

Warehouse Schemas

Data warehouses typically have schemas that are designed for data analysis, using tools such as OLAP tools. Thus, the data are usually multidimensional data, with dimension attributes and measure attributes. Tables containing multidimensional data are called fact tables and are usually very large. A table recording sales information for a retail store, with one tuple for each item that is sold, is a typical example of a fact table. The dimensions of the sales table would include what the item is (usually an item identiﬁer such as that used in bar codes), the date when the item is sold, which location (store) the item was sold from, which customer bought the item, and so on. The measure attributes may include the number of items sold and the price of the items.

To minimize storage requirements, dimension attributes are usually short identiﬁers that are foreign keys into other other tables called dimension tables. For instance, a fact table sales would have attributes item-id, store-id, customer-id, and date, and measure attributes number and price. The attribute store-id is a foreign key into a dimension table store, which has other attributes such as store location (city, state, country). The item-id attribute of the sales table would be a foreign key into a dimension table item-info, which would contain information such as the name of the item, the category to which the item belongs, and other item details such as color and size. The customer-id attribute would be a foreign key into a customer table containing attributes such as name and address of the customer. We can also view the date at- tribute as a foreign key into a date-info table giving the month, quarter, and year of each date.

The resultant schema appears in Figure 22.9. Such a schema, with a fact table, multiple dimension tables, and foreign keys from the fact table to the dimension tables, is called a star schema. More complex data warehouse designs may have multiple levels of dimension tables; for instance, the item-info table may have an attribute manufacturer-id that is a foreign key into another table giving details of the manufacturer. Such schemas are called snowﬂake schemas. Complex data warehouse designs may also have more than one fact table.

Information-Retrieval Systems

The ﬁeld of information retrieval has developed in parallel with the ﬁeld of databases. In the traditional model used in the ﬁeld of information retrieval, information is organized into documents, and it is assumed that there is a large number of documents. Data contained in documents is unstructured, without any associated schema. The process of information retrieval consists of locating relevant documents, on the basis of user input, such as keywords or example documents.

The Web provides a convenient way to get to, and to interact with, information sources across the Internet. However, a persistent problem facing the Web is the explosion of stored information, with little guidance to help the user to locate what is interesting. Information retrieval has played a critical role in making the Web a productive and useful tool, especially for researchers.

Traditional examples of information-retrieval systems are online library catalogs and online document-management systems such as those that store newspaper articles. The data in such systems are organized as a collection of documents; a newspaper article or a catalog entry (in a library catalog) are examples of documents. In the context of the Web, usually each HTML page is considered to be a document.

A user of such a system may want to retrieve a particular document or a particular class of documents. The intended documents are typically described by a set of keywords — for example, the keywords “database system” may be used to locate book son database systems, and the keywords “stock” and “scandal” may be used to locate articles about stock-market scandals. Documents have associated with them a set of keywords, and documents whose keywords contain those supplied by the user are retrieved.

Keyword-based information retrieval can be used not only for retrieving textual data, but also for retrieving other types of data, such as video or audio data, that have descriptive keywords associated with them. For instance, a video movie may have associated with it keywords such as its title, director, actors, type, and so on.

There are several differences between this model and the models used in traditional database systems.

• Database systems deal with several operations that are not addressed in information-retrieval systems. For instance, database systems deal with updates and with the associated transactional requirements of concurrency control and durability. These matters are viewed as less important in information systems. Similarly, database systems deal with structured information organized with relatively complex data models (such as the relational model or object- oriented data models), whereas information-retrieval systems traditionally have used a much simpler model, where the information in the database is organized simply as a collection of unstructured documents.

• Information-retrieval systems deal with several issues that have not been ad- dressed adequately in database systems. For instance, the ﬁeld of information retrieval has dealt with the problems of managing unstructured documents, such as approximate searching by keywords, and of ranking of documents on estimated degree of relevance of the documents to the query.

Keyword Search

Information-retrieval systems typically allow query expressions formed using key- words and the logical connectives and, or, and not. For example, a user could ask for all documents that contain the keywords “motorcycle and maintenance,” or documents that contain the keywords “computer or microprocessor,” or even documents that contain the keyword “computer but not database.” A query containing keywords without any of the above connectives is assumed to have ands implicitly connecting the keywords.

In full text retrieval, all the words in each document are considered to be keywords. For unstructured documents, full text retrieval is essential since there may be no information about what words in the document are keywords. We shall use the word term to refer to the words in a document, since all words are keywords.

In its simplest form an information retrieval system locates and returns all documents that contain all the keywords in the query, if the query has no connectives; connectives are handled as you would expect. More sophisticated systems estimate relevance of documents to a query so that the documents can be shown in order of estimated relevance. They use information about term occurrences, as well as hyperlink information, to estimate relevance; Section 22.5.1.1 and 22.5.1.2 outline how to do so. Section 22.5.1.3 outlines how to deﬁne similarity of documents, and use similarity for searching. Some systems also attempt to provide a better set of answers by using the meanings of terms, rather than just the syntactic occurrence of terms, as outlined in Section 22.5.1.4.

Relevance Ranking Using Terms

The set of all documents that satisfy a query expression may be very large; in particular, there are billions of documents on the Web, and most keyword queries on a Web search engine ﬁnd hundreds of thousands of documents containing the key- words. Full text retrieval makes this problem worse: Each document may contain many terms, and even terms that are only mentioned in passing are treated equivalently with documents where the term is indeed relevant. Irrelevant documents may get retrieved as a result.

Information retrieval systems therefore estimate relevance of documents to a query, and return only highly ranked documents as answers. Relevance ranking is not an exact science, but there are some well-accepted approaches.

The ﬁrst question to address is, given a particular term t, how relevant is a particular document d to the term. One approach is to use the the number of occurrences of the term in the document as a measure of its relevance, on the assumption that relevant terms are likely to be mentioned many times in a document. Just counting the number of occurrences of a term is usually not a good indicator: First, the number of occurrences depends on the length of the document, and second, a document containing 10 occurrences of a term may not be 10 times as relevant as a document containing one occurrence.

One way of measuring r(d, t), the relevance of a document d to a term t, is

where n(d) denotes the number of terms in the document and n(d, t) denotes the number of occurrences of term t in the document d. Observe that this metric takes the length of the document into account. The relevance grows with more occurrences of a term in the document, although it is not directly proportional to the number of occurrences.

Many systems reﬁne the above metric by using other information. For instance, if the term occurs in the title, or the author list, or the abstract, the document would be considered more relevant to the term. Similarly, if the ﬁrst occurrence of a term is late in the document, the document may be considered less relevant than if the ﬁrst occurrence is early in the document. The above notions can be formalized by extensions of the formula we have shown for r(d, t). In the information retrieval community, the relevance of a document to a term is referred to as term frequency, regardless of the exact formula used.

A query Q may contain multiple keywords. The relevance of a document to a query with two or more keywords is estimated by combining the relevance measures of the document to each keyword. A simple way of combining the measures is to add them up. However, not all terms used as keywords are equal. Suppose a query uses two terms, one of which occurs frequently, such as “web,” and another that is less frequent, such as “Silberschatz.” A document containing “Silberschatz” but not “web” should be ranked higher than a document containing the term “web” but not “Silberschatz.”

To ﬁx the above problem, weights are assigned to terms using the inverse document frequency, deﬁned as 1/n(t), where n(t) denotes the number of documents (among those indexed by the system) that contain the term t. The relevance of a document d to a set of terms Q is then deﬁned as

This measure can be further reﬁned if the user is permitted to specify weights w(t) for terms in the query, in which case the user-speciﬁed weights are also taken into account by using w(t)/n(t) in place of 1/n(t).

Almost all text documents (in English) contain words such as “and,” “or,” “a,” and so on, and hence these words are useless for querying purposes since their in- verse document frequency is extremely low. Information-retrieval systems deﬁne a set of words, called stop words, containing 100 or so of the most common words, and remove this set from the document when indexing; such words are not used as keywords, and are discarded if present in the keywords supplied by the user.

Another factor taken into account when a query contains multiple terms is the proximity of the term in the document. If the terms occur close to each other in the document, the document would be ranked higher than if they occur far apart. The formula for r(d, Q) can be modiﬁed to take proximity into account.

Given a query Q, the job of an information retrieval system is to return documents in descending order of their relevance to Q. Since there may be a very large number of documents that are relevant, information retrieval systems typically return only the ﬁrst few documents with the highest degree of estimated relevance, and permit users to interactively request further documents.

Relevance Using Hyperlinks

Early Web search engines ranked documents by using only relevance measures similar to those described in Section 22.5.1.1. However, researchers soon realized that Web documents have information that plain text documents do not have, namely hyper- links. And in fact, the relevance ranking of a document is affected more by hyperlinks that point to the document, than by hyperlinks going out of the document.

The basic idea of site ranking is to ﬁnd sites that are popular, and to rank pages from such sites higher than pages from other sites. A site is identiﬁed by the in- ternet address part of the URL, such as www.bell-labs.com in a URL http://www.bell-

labs.com/topic/books/db-book. A site usually contains multiple Web pages. Since most searches are intended to ﬁnd information from popular sites, ranking pages from popular sites higher is generally a good idea. For instance, the term “google” may oc- cur in vast numbers of pages, but the site google.com is the most popular among the sites with pages that contain the term “google”. Documents from google.com containing the term “google” would therefore be ranked as the most relevant to the term “google”.

This raises the question of how to deﬁne the popularity of a site. One way would be to ﬁnd how many times a site is accessed. However, getting such information is impossible without the cooperation of the site, and is infeasible for a Web search engine to implement. A very effective alternative uses hyperlinks; it deﬁnes p(s), the popularity of a site s, as the number of sites that contain at least one page with a link to site s.

Traditional measures of relevance of the page (which we saw in Section 22.5.1.2) can be combined with the popularity of the site containing the page to get an overall measure of the relevance of the page. Pages with high overall relevance value are returned as answers to a query, as before.

Note also that we used the popularity of a site as a measure of relevance of individual pages at the site, not the popularity of individual pages. There are at least two reasons for this. First, most sites contain only links to root pages of other sites, so all other pages would appear to have almost zero popularity, when in fact they may be accessed quite frequently by following links from the root page. Second, there are far fewer sites than pages, so computing and using popularity of sites is cheaper than computing and using popularity of pages.

There are more reﬁned notions of popularity of sites. For instance, a link from a popular site to another site s may be considered to be a better indication of the popularity of s than a link to s from a less popular site.6 This notion of popularity is in fact circular, since the popularity of a site is deﬁned by the popularity of other sites, and there may be cycles of links between sites. However, the popularity of sites can be deﬁned by a system of simultaneous linear equations, which can be solved by matrix manipulation techniques. The linear equations are deﬁned in such a way that they have a unique and well-deﬁned solution.

The popular Web search engine google.com uses the referring-site popularity idea in its deﬁnition page rank, which is a measure of popularity of a page. This approach of ranking of pages gave results so much better than previously used ranking techniques, that google.com became a widely used search engine, in a rather short period of time.

There is another, somewhat similar, approach, derived interestingly from a theory of social networking developed by sociologists in the 1950s. In the social networking context, the goal was to deﬁne the prestige of people. For example, the president of the United States has high prestige since a large number of people know him. If someone is known by multiple prestigious people, then she also has high prestige, even if she is not known by as large a number of people.

The above idea was developed into a notion of hubs and authorities that takes into account the presence of directories that link to pages containing useful information.

A hub is a page that stores links to many pages; it does not in itself contain actual information on a topic, but points to pages that contain actual information. In contrast, an authority is a page that contains actual information on a topic, although it may not be directly pointed to by many pages. Each page then gets a prestige value as a hub (hub-prestige), and another prestige value as an authority (authority-prestige). The deﬁnitions of prestige, as before, are cyclic and are deﬁned by a set of simultaneous linear equations. A page gets higher hub-prestige if it points to many pages with high authority-prestige, while a page gets higher authority-prestige if it is pointed to by many pages with high hub-prestige. Given a query, pages with highest authority-prestige are ranked higher than other pages. See the bibliographical notes for references giving further details.

Similarity-Based Retrieval

Certain information-retrieval systems permit similarity-based retrieval. Here, the user can give the system document A, and ask the system to retrieve documents that are “similar” to A. The similarity of a document to another may be deﬁned, for example, on the basis of common terms. One approach is to ﬁnd k terms in A with highest values of r(d, t), and to use these k terms as a query to ﬁnd relevance of other documents. The terms in the query are themselves weighted by r(d, t).

If the set of documents similar to A is large, the system may present the user a few of the similar documents, allow him to choose the most relevant few, and start a new search based on similarity to A and to the chosen documents. The resultant set of documents is likely to be what the user intended to ﬁnd.

The same idea is also used to help users who ﬁnd many documents that appear to be relevant on the basis of the keywords, but are not. In such a situation, instead of adding further keywords to the query, users may be allowed to identify one or a few of the returned documents as relevant; the system then uses the identiﬁed documents to ﬁnd other similar ones. The resultant set of documents is likely to be what the user intended to ﬁnd.

Synonyms and Homonyms

Consider the problem of locating documents about motorcycle maintenance for the keywords “motorcycle” and “maintenance.” Suppose that the keywords for each document are the words in the title and the names of the authors. The document titled Motorcycle Repair would not be retrieved, since the word “maintenance” does not occur in its title.

We can solve that problem by making use of synonyms. Each word can have a set of synonyms deﬁned, and the occurrence of a word can be replaced by the or of all its synonyms (including the word itself). Thus, the query “motorcycle and repair” can be replaced by “motorcycle and (repair or maintenance).” This query would ﬁnd the desired document.

Keyword-based queries also suffer from the opposite problem, of homonyms, that is single words with multiple meanings. For instance, the word object has different meanings as a noun and as a verb. The word table may refer to a dinner table, or to a relational table. Some keyword query systems attempt to disambiguate the meaning of words in documents, and when a user poses a query, they ﬁnd out the intended meaning by asking the user. The returned documents are those that use the term in the intended meaning of the user. However, disambiguating meanings of words in documents is not an easy task, so not many systems implement this idea.

In fact, a danger even with using synonyms to extend queries is that the synonyms may themselves have different meanings. Documents that use the synonyms with an alternative intended meaning would be retrieved. The user is then left wondering why the system thought that a particular retrieved document is relevant, if it contains neither the keywords the user speciﬁed, nor words whose intended meaning in the document is synonymous with speciﬁed keywords! It is therefore advisable to verify synonyms with the user, before using them to extend a query submitted by the user.

Indexing of Documents

An effective index structure is important for efﬁcient processing of queries in an information-retrieval system. Documents that contain a speciﬁed keyword can be efﬁciently located by using an inverted index, which maps each keyword Ki to the set Si of (identiﬁers of) the documents that contain Ki. To support relevance ranking based on proximity of keywords, such an index may provide not just identiﬁers of documents, but also a list of locations in the document where the keyword appears. Since such indices must be stored on disk, the index organization also attempts to minimize the number of I/O operations to retrieve the set of (identiﬁers of) documents that contain a keyword. Thus, the system may attempt to keep the set of documents for a keyword in consecutive disk pages.

The and operation ﬁnds documents that contain all of a speciﬁed set of keywords K1, K2,... , Kn. We implement the and operation by ﬁrst retrieving the sets of document identiﬁers S1, S2,... , Sn of all documents that contain the respective keywords.

The intersection, S1 ∩ S2 ∩ ··· ∩ Sn, of the sets gives the document identiﬁers of the desired set of documents. The or operation gives the set of all documents that contain at least one of the keywords K1, K2,... , Kn. We implement the or operation by computing the union, S1 ∪S2 ∪···∪Sn, of the sets. The not operation ﬁnds documents that do not contain a speciﬁed keyword Ki. Given a set of document identiﬁers S, we can eliminate documents that contain the speciﬁed keyword Ki by taking the difference S − Si, where Si is the set of identiﬁers of documents that contain the keyword Ki.

Given a set of keywords in a query, many information retrieval systems do not insist that the retrieved documents contain all the keywords (unless an and operation is explicitly used). In this case, all documents containing at least one of the words are retrieved (as in the or operation), but are ranked by their relevance measure.

To use term frequency for ranking, the index structure should additionally maintain the number of times terms occur in each document. To reduce this effort, they may use a compressed representation with only a few bits, which approximates the term frequency. The index should also store the document frequency of each term (that is, the number of documents in which the term appears).

Measuring Retrieval Effectiveness

Each keyword may be contained in a large number of documents; hence, a compact representation is critical to keep space usage of the index low. Thus, the sets of documents for a keyword are maintained in a compressed form. So that storage space is saved, the index is sometimes stored such that the retrieval is approximate; a few relevant documents may not be retrieved (called a false drop or false negative), or a few irrelevant documents may be retrieved (called a false positive). A good index structure will not have any false drops, but may permit a few false positives; the sys- tem can ﬁlter them away later by looking at the keywords that they actually contain. In Web indexing, false positives are not desirable either, since the actual document may not be quickly accessible for ﬁltering.

Two metrics are used to measure how well an information-retrieval system is able to answer queries. The ﬁrst, precision, measures what percentage of the retrieved documents are actually relevant to the query. The second, recall, measures what percentage of the documents relevant to the query were retrieved. Ideally both should be 100 percent.

Precision and recall are also important measures for understanding how well a particular document ranking strategy performs. Ranking strategies can result in false negatives and false positives, but in a more subtle sense.

• False negatives may occur when documents are ranked, because relevant documents get low rankings; if we fetched all documents down to documents with very low ranking there would be very few false negatives. However, humans would rarely look beyond the ﬁrst few tens of returned documents, and may thus miss relevant documents because they are not ranked among the top few. Exactly what is a false negative depends on how many documents are examined.

Therefore instead of having a single number as the measure of recall, we can measure the recall as a function of the number of documents fetched.

• False positives may occur because irrelevant documents get higher rankings than relevant documents. This too depends on how many documents are examined. One option is to measure precision as a function of number of documents fetched.

A better and more intuitive alternative for measuring precision is to measure it as a function of recall. With this combined measure, both precision and recall can be computed as a function of number of documents, if required.

For instance, we can say that with a recall of 50 percent the precision was 75 per- cent, whereas at a recall of 75 percent the precision dropped to 60 percent. In general, we can draw a graph relating precision to recall. These measures can be computed for individual queries, then averaged out across a suite of queries in a query benchmark.

Yet another problem with measuring precision and recall lies in how to deﬁne which documents are really relevant and which are not. In fact, it requires under- standing of natural language, and understanding of the intent of the query, to decide if a document is relevant or not. Researchers therefore have created collections of documents and queries, and have manually tagged documents as relevant or irrelevant to the queries. Different ranking systems can be run on these collections to measure their average precision and recall across multiple queries.

Web Search Engines

Web crawlers are programs that locate and gather information on the Web. They recursively follow hyperlinks present in known documents to ﬁnd other documents. A crawler retrieves the documents and adds information found in the documents to a combined index; the document is generally not stored, although some search engines do cache a copy of the document to give clients faster access to the documents.

Since the number of documents on the Web is very large, it is not possible to crawl the whole Web in a short period of time; and in fact, all search engines cover only some portions of the Web, not all of it, and their crawlers may take weeks or months to perform a single crawl of all the pages they cover. There are usually many pro- cesses, running on multiple machines, involved in crawling. A database stores a set of links (or sites) to be crawled; it assigns links from this set to each crawler process. New links found during a crawl are added to the database, and may be crawled later if they are not crawled immediately. Pages found during a crawl are also handed over to an indexing system, which may be running on a different machine. Pages have to be refetched (that is, links recrawled) periodically to obtain updated information, and to discard sites that no longer exist, so that the information in the search index is kept reasonably up to date.

The indexing system itself runs on multiple machines in parallel. It is not a good idea to add pages to the same index that is being used for queries, since doing so would require concurrency control on the index, and affect query and update performance. Instead, one copy of the index is used to answer queries while another copy is updated with newly crawled pages. At periodic intervals the copies switch over, with the old one being updated while the new copy is being used for queries.

To support very high query rates, the indices may be kept in main memory, and there are multiple machines; the system selectively routes queries to the machines to balance the load among them.

Directories

A typical library user may use a catalog to locate a book for which she is looking. When she retrieves the book from the shelf, however, she is likely to browse through other books that are located nearby. Libraries organize books in such a way that re- lated books are kept close together. Hence, a book that is physically near the desired book may be of interest as well, making it worthwhile for users to browse through such books.

To keep related books close together, libraries use a classiﬁcation hierarchy. Books on science are classiﬁed together. Within this set of books, there is a ﬁner classiﬁcation, with computer-science books organized together, mathematics books organized together, and so on. Since there is a relation between mathematics and computer science, relevant sets of books are stored close to each other physically. At yet another level in the classiﬁcation hierarchy, computer-science books are broken down into subareas, such as operating systems, languages, and algorithms. Figure 22.10 illustrates a classiﬁcation hierarchy that may be used by a library. Because books can be kept at only one place, each book in a library is classiﬁed into exactly one spot in the classiﬁcation hierarchy.

In an information retrieval system, there is no need to store related documents close together. However, such systems need to organize documents logically so as to permit browsing. Thus, such a system could use a classiﬁcation hierarchy similar to

one that libraries use, and, when it displays a particular document, it can also display a brief description of documents that are close in the hierarchy.

In an information retrieval system, there is no need to keep a document in a single spot in the hierarchy. A document that talks of mathematics for computer scientists could be classiﬁed under mathematics as well as under computer science. All that is stored at each spot is an identiﬁer of the document (that is, a pointer to the document), and it is easy to fetch the contents of the document by using the identiﬁer.

As a result of this ﬂexibility, not only can a document be classiﬁed under two locations, but also a subarea in the classiﬁcation hierarchy can itself occur under two areas. The class of “graph algorithm” document can appear both under mathematics and under computer science. Thus, the classiﬁcation hierarchy is now a directed acyclic graph (DAG), as shown in Figure 22.11. A graph-algorithm document may appear in a single location in the DAG, but can be reached via multiple paths.

A directory is simply a classiﬁcation DAG structure. Each leaf of the directory stores links to documents on the topic represented by the leaf. Internal nodes may also contain links, for example to documents that cannot be classiﬁed under any of the child nodes.

To ﬁnd information on a topic, a user would start at the root of the directory and follow paths down the DAG until reaching a node representing the desired topic.

While browsing down the directory, the user can ﬁnd not only documents on the topic he is interested in, but also ﬁnd related documents and related classes in the classiﬁcation hierarchy. The user may learn new information by browsing through documents (or subclasses) within the related classes.

Organizing the enormous amount of information available on the Web into a directory structure is a daunting task.

• The ﬁrst problem is determining what exactly the directory hierarchy should be.

• The second problem is, given a document, deciding which nodes of the directory are categories relevant to the document.

To tackle the ﬁrst problem, portals such as Yahoo have teams of “internet librarians” who come up with the classiﬁcation hierarchy and continually reﬁne it. The Open Directory Project is a large collaborative effort, with different volunteers being responsible for organizing different branches of the directory.

The second problem can also be tackled manually by librarians, or Web site maintainers may be responsible for deciding where their sites should lie in the hierarchy.

There are also techniques for automatically deciding the location of documents based on computing their similarity to documents that have already been classiﬁed.

Search This Blog

Database Management System course