Data Mining

The term data mining refers loosely to the process of semiautomatically analyzing large databases to find useful patterns. Like knowledge discovery in artificial intelligence (also called machine learning), or statistical analysis, data mining attempts to discover rules and patterns from data. However, data mining differs from machine learning and statistics in that it deals with large volumes of data, stored primarily on disk. That is, data mining deals with “knowledge discovery in databases.”

Some types of knowledge discovered from a database can be represented by a set of rules. The following is an example of a rule, stated informally: “Young women with annual incomes greater than $50,000 are the most likely people to buy small sports cars.” Of course such rules are not universally true, and have degrees of “support” and “confidence,” as we shall see. Other types of knowledge are represented by equations relating different variables to each other, or by other mechanisms for predicting outcomes when the values of some variables are known.

There are a variety of possible types of patterns that may be useful, and different techniques are used to find different types of patterns. We shall study a few examples of patterns and see how they may be automatically derived from a database.

Usually there is a manual component to data mining, consisting of preprocessing data to a form acceptable to the algorithms, and postprocessing of discovered patterns to find novel ones that could be useful. There may also be more than one type of pattern that can be discovered from a given database, and manual interaction may be needed to pick useful types of patterns. For this reason, data mining is really a semiautomatic process in real life. However, in our description we concentrate on the automatic aspect of mining.

Applications of Data Mining

The discovered knowledge has numerous applications. The most widely used applications are those that require some sort of prediction. For instance, when a person applies for a credit card, the credit-card company wants to predict if the person is a good credit risk. The prediction is to be based on known attributes of the person, such as age, income, debts, and past debt repayment history. Rules for making the prediction are derived from the same attributes of past and current credit card holders, along with their observed behavior, such as whether they defaulted on their credit-card dues. Other types of prediction include predicting which customers may switch over to a competitor (these customers may be offered special discounts to tempt them not to switch), predicting which people are likely to respond to promotional mail (“junk mail”), or predicting what types of phone calling card usage are likely to be fraudulent.

Another class of applications looks for associations, for instance, books that tend to be bought together. If a customer buys a book, an online bookstore may suggest other associated books. If a person buys a camera, the system may suggest accessories that tend to be bought along with cameras. A good salesperson is aware of such patterns and exploits them to make additional sales. The challenge is to automate the process. Other types of associations may lead to discovery of causation. For instance, discovery of unexpected associations between a newly introduced medicine and cardiac problems led to the finding that the medicine may cause cardiac problems in some people. The medicine was then withdrawn from the market.

Associations are an example of descriptive patterns. Clusters are another example of such patterns. For example, over a century ago a cluster of typhoid cases was found around a well, which led to the discovery that the water in the well was contaminated and was spreading typhoid. Detection of clusters of disease remains important even today.

Classification

As mentioned in Section 22.3.1, prediction is one of the most important types of data mining. We outline what classification is, study techniques for building one type of classifier, called the decision tree classifier, and then study other prediction techniques.

Abstractly, the classification problem is this: Given that items belong to one of several classes, and given past instances (called training instances) of items along with the classes to which they belong, the problem is to predict the class to which a new item belongs. The class of the new instance is not known, so other attributes of the instance must be used to predict the class.

Classification can be done by finding rules that partition the given data into disjoint groups. For instance, suppose that a credit-card company wants to decide whether or not to give a credit card to an applicant. The company has a variety of information about the person, such as her age, educational background, annual income, and current debts, that it can use for making a decision.

Some of this information could be relevant to the credit worthiness of the applicant, whereas some may not be. To make the decision, the company assigns a credit-worthiness level of excellent, good, average, or bad to each of a sample set of current customers according to each customer’s payment history. Then, the company attempts to find rules that classify its current customers into excellent, good, average, or bad, on the basis of the information about the person, other than the actual payment history (which is unavailable for new customers). Let us consider just two attributes: education level (highest degree earned) and income. The rules may be of the following form:

degree = masters and income > 75,000 ⇒ excellent
degree = masters and income ≥ 25,000 and income ≤ 75,000 ⇒ good

Similar rules would also be present for the other credit worthiness levels (average and bad).

The process of building a classifier starts from a sample of data, called a training set. For each tuple in the training set, the class to which the tuple belongs is already known. For instance, the training set for a credit-card application may be the existing customers, with their credit worthiness determined from their payment history. The actual data, or population, may consist of all people, including those who are not existing customers. There are several ways of building a classifier, as we shall see.

Decision Tree Classifiers

The decision tree classifier is a widely used technique for classification. As the name suggests, decision tree classifiers use a tree; each leaf node has an associated class, and each internal node has a predicate (or more generally, a function) associated with it. Figure 22.6 shows an example of a decision tree.

[Figure 22.6: An example decision tree for credit-risk classification, splitting on degree at the root and on income ranges below the masters branch.]

To classify a new instance, we start at the root, and traverse the tree to reach a leaf; at an internal node we evaluate the predicate (or function) on the data instance to find which child to go to. The process continues till we reach a leaf node. For example, if the degree level of a person is masters, and the person’s income is 40K, starting from the root we follow the edge labeled “masters,” and from there the edge labeled “25K to 75K,” to reach a leaf. The class at the leaf is “good,” so we predict that the credit risk of that person is good.
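The traversal just described is easy to express in code. The sketch below is only an illustration of the idea, not an implementation from this chapter; the node classes, the dictionary-based instances, and the class label for the lowest income range are assumptions.

```python
# Minimal sketch of decision-tree classification. The tree mirrors the
# masters branch of the running example.

class Leaf:
    def __init__(self, label):
        self.label = label

class Internal:
    def __init__(self, test, children):
        self.test = test          # function mapping an instance to a child key
        self.children = children  # dict: key -> subtree

def classify(node, instance):
    """Start at the root and follow the test at each internal node to a leaf."""
    while isinstance(node, Internal):
        node = node.children[node.test(instance)]
    return node.label

def income_bucket(instance):
    income = instance["income"]
    if income <= 25_000:
        return "0 to 25K"
    if income <= 75_000:
        return "25K to 75K"
    return "over 75K"

masters_subtree = Internal(income_bucket, {
    "0 to 25K": Leaf("average"),       # assumed label
    "25K to 75K": Leaf("good"),        # from the example in the text
    "over 75K": Leaf("excellent"),     # from the example in the text
})

root = Internal(lambda inst: inst["degree"], {"masters": masters_subtree})

print(classify(root, {"degree": "masters", "income": 40_000}))   # prints: good
```

Running classify on the masters/40K instance follows the same two edges as in the example above and returns “good.”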

Building Decision Tree Classifiers

The question then is how to build a decision tree classifier, given a set of training instances. The most common way of doing so is to use a greedy algorithm, which works recursively, starting at the root and building the tree downward. Initially there is only one node, the root, and all training instances are associated with that node.

At each node, if all, or “almost all,” training instances associated with the node belong to the same class, then the node becomes a leaf node associated with that class. Otherwise, a partitioning attribute and partitioning conditions must be selected to create child nodes. The data associated with each child node is the set of training instances that satisfy the partitioning condition for that child node. In our example, the attribute degree is chosen, and four children, one for each value of degree, are created. The conditions for the four child nodes are degree = none, degree = bachelors, degree = masters, and degree = doctorate, respectively. The data associated with each child consist of training instances satisfying the condition associated with that child. At the node corresponding to masters, the attribute income is chosen, with the range of values partitioned into intervals 0 to 25,000, 25,000 to 50,000, 50,000 to 75,000, and over 75,000. The data associated with each node consist of training instances with the degree attribute being masters, and the income attribute being in each of these ranges, respectively. As an optimization, since the class for the range 25,000 to 50,000 and the range 50,000 to 75,000 is the same under the node degree = masters, the two ranges have been merged into a single range, 25,000 to 75,000.

Best Splits

Intuitively, by choosing a sequence of partitioning attributes, we start with the set of all training instances, which is “impure” in the sense that it contains instances from many classes, and end up with leaves which are “pure” in the sense that at each leaf all training instances belong to only one class. We shall see shortly how to measure purity quantitatively. To judge the benefit of picking a particular attribute and condition for partitioning of the data at a node, we measure the purity of the data at the children resulting from partitioning by that attribute. The attribute and condition that result in the maximum purity are chosen.

The purity of a set S of training instances can be measured quantitatively in several ways. Suppose there are k classes, and of the instances in S the fraction of instances in class i is pi. One measure of purity, the Gini measure, is defined as

Gini(S) = 1 − (p1² + p2² + ··· + pk²)

When all instances are in a single class, the Gini value is 0, while it reaches its maximum when each class has the same number of instances. Another measure of purity is the entropy measure, defined as

Entropy(S) = −(p1 log2 p1 + p2 log2 p2 + ··· + pk log2 pk)

When a set S is split into multiple sets Si, i = 1, 2, ..., r, we can measure the purity of the resulting collection of sets as the weighted average

purity(S1, S2, ..., Sr) = (|S1|/|S|) purity(S1) + (|S2|/|S|) purity(S2) + ··· + (|Sr|/|S|) purity(Sr)

That is, the purity is the weighted average of the purity of the sets Si. The above formula can be used with both the Gini measure and the entropy measure of purity.

The information gain due to a particular split of S into Si, i = 1, 2, ..., r is then

Information-gain(S, {S1, S2, ..., Sr}) = purity(S) − purity(S1, S2, ..., Sr)

Splits into fewer sets are preferable to splits into many sets, since they lead to simpler and more meaningful decision trees. The number of elements in each of the sets Si may also be taken into account; otherwise, whether a set Si has 0 elements or 1 element would make a big difference in the number of sets, although the split is the same for almost all the elements. The information content of a particular split can be defined in terms of entropy as

Information-content(S, {S1, S2, ..., Sr}) = −((|S1|/|S|) log2(|S1|/|S|) + ··· + (|Sr|/|S|) log2(|Sr|/|S|))

The best split for an attribute is then the one that maximizes the information gain ratio, defined as Information-gain(S, {S1, S2, ..., Sr}) / Information-content(S, {S1, S2, ..., Sr}).
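To make these definitions concrete, the sketch below computes the Gini and entropy measures, the weighted purity of a split, the information gain, and the information gain ratio directly from the formulas; the example labels and the split itself are invented.

```python
import math

def gini(labels):
    """Gini measure: 1 minus the sum of squared class fractions (0 when pure)."""
    n = len(labels)
    fractions = [labels.count(c) / n for c in set(labels)]
    return 1 - sum(p * p for p in fractions)

def entropy(labels):
    """Entropy measure: minus the sum of p_i * log2(p_i) (0 when pure)."""
    n = len(labels)
    fractions = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in fractions if p > 0)

def split_purity(parts, measure=gini):
    """Weighted average of the chosen measure over the parts of a split."""
    total = sum(len(p) for p in parts)
    return sum(len(p) / total * measure(p) for p in parts)

def information_gain(labels, parts, measure=gini):
    return measure(labels) - split_purity(parts, measure)

def information_content(parts):
    """Entropy of the split itself, used to normalize the gain."""
    total = sum(len(p) for p in parts)
    return -sum(len(p) / total * math.log2(len(p) / total) for p in parts if p)

def gain_ratio(labels, parts, measure=entropy):
    return information_gain(labels, parts, measure) / information_content(parts)

# Example: eight training labels split into two parts by some attribute.
labels = ["good"] * 4 + ["bad"] * 4
parts = [["good", "good", "good", "bad"], ["good", "bad", "bad", "bad"]]
print(information_gain(labels, parts), gain_ratio(labels, parts))
```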

Finding Best Splits

How do we find the best split for an attribute? How to split an attribute depends on the type of the attribute. Attributes can be either continuous valued, that is, the values can be ordered in a fashion meaningful to classification, such as age or income, or can be categorical, that is, they have no meaningful order, such as department names or country names. We do not expect the sort order of department names or country names to have any significance to classification.

Usually attributes that are numbers (integers/reals) are treated as continuous valued while character string attributes are treated as categorical, but this may be controlled by the user of the system. In our example, we have treated the attribute degree as categorical, and the attribute income as continuous valued.

We first consider how to find best splits for continuous-valued attributes. For simplicity, we shall only consider binary splits of continuous-valued attributes, that is, splits that result in two children. The case of multiway splits is more complicated; see the bibliographical notes for references on the subject.

To find the best binary split of a continuous-valued attribute, we first sort the attribute values in the training instances. We then compute the information gain obtained by splitting at each value. For example, if the training instances have values 1, 10, 15, and 25 for an attribute, the split points considered are 1, 10, and 15; in each case values less than or equal to the split point form one partition and the rest of the values form the other partition. The best binary split for the attribute is the split that gives the maximum information gain.
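A sketch of this search for a continuous-valued attribute follows; it uses a small Gini-based gain rather than entropy, and the class labels attached to the example values 1, 10, 15, and 25 are invented.

```python
# Sketch of finding the best binary split of a continuous-valued attribute.

def gini(labels):
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_binary_split(values, labels):
    """Try each distinct value as a split point (<= goes left) and return
    the point giving the maximum information gain."""
    pairs = sorted(zip(values, labels))
    all_labels = [lab for _, lab in pairs]
    best_point, best_gain = None, -1.0
    candidates = sorted(set(values))[:-1]   # splitting at the largest value is useless
    for point in candidates:
        left = [lab for val, lab in pairs if val <= point]
        right = [lab for val, lab in pairs if val > point]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        gain = gini(all_labels) - weighted
        if gain > best_gain:
            best_point, best_gain = point, gain
    return best_point, best_gain

values = [1, 10, 15, 25]                    # the example values from the text
labels = ["bad", "good", "good", "good"]    # assumed class labels
print(best_binary_split(values, labels))    # considers split points 1, 10, 15
```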

For a categorical attribute, we can have a multiway split, with a child for each value of the attribute. This works fine for categorical attributes with only a few distinct values, such as degree or gender. However, if the attribute has many distinct values, such as department names in a large company, creating a child for each value is not a good idea. In such cases, we would try to combine multiple values into each child, to create a smaller number of children. See the bibliographical notes for references on how to do so.

Decision-Tree Construction Algorithm

The main idea of decision tree construction is to evaluate different attributes and different partitioning conditions, and pick the attribute and partitioning condition that result in the maximum information gain ratio. The same procedure works recursively on each of the sets resulting from the split, thereby recursively constructing a decision tree. If the data can be perfectly classified, the recursion stops when the purity of a set is 0. However, often data are noisy, or a set may be so small that partitioning it further may not be justified statistically. In this case, the recursion stops when the purity of a set is “sufficiently high,” and the class of the resulting leaf is defined as the class of the majority of the elements of the set. In general, different branches of the tree could grow to different levels.

[Figure 22.7: Recursive construction of a decision tree.]

Figure 22.7 shows pseudocode for a recursive tree construction procedure, which takes a set of training instances S as parameter. The recursion stops when the set is sufficiently pure or the set S is too small for further partitioning to be statistically significant. The parameters δp and δs define cutoffs for purity and size; the system may give them default values that may be overridden by users.
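Since the pseudocode figure is not reproduced here, the following is a minimal Python sketch of such a recursive procedure, not the book’s Figure 22.7; the Gini-based purity test, the majority-vote leaf labeling, the find_best_split helper, and the default cutoff values are all assumptions.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def grow_tree(instances, labels, find_best_split, delta_p=0.1, delta_s=5):
    """Recursive construction: stop when the set is pure enough or too small,
    otherwise partition by the best split and recurse on each part."""
    if gini(labels) <= delta_p or len(instances) < delta_s:
        return {"leaf": majority(labels)}
    split = find_best_split(instances, labels)   # assumed helper: returns a
    if split is None:                            # condition and the partition it induces
        return {"leaf": majority(labels)}
    condition, parts = split                     # parts: list of (instances_i, labels_i)
    return {"condition": condition,
            "children": [grow_tree(inst_i, lab_i, find_best_split, delta_p, delta_s)
                         for inst_i, lab_i in parts]}

# Degenerate usage: with no split search supplied, the whole set becomes one leaf.
print(grow_tree(["x1", "x2", "x3"], ["good", "good", "bad"], lambda I, L: None))
```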

There are a wide variety of decision tree construction algorithms, and we outline the distinguishing features of a few of them. See the bibliographical notes for details.

With very large data sets, partitioning may be expensive, since it involves repeated copying. Several algorithms have therefore been developed to minimize the I/O and computation cost when the training data are larger than available memory.

Several of the algorithms also prune subtrees of the generated decision tree to reduce overfitting: A subtree is overfitted if it has been so highly tuned to the specifics of the training data that it makes many classification errors on other data. A subtree is pruned by replacing it with a leaf node. There are different pruning heuristics; one heuristic uses part of the training data to build the tree and another part of the training data to test it. The heuristic prunes a subtree if it finds that misclassification on the test instances would be reduced if the subtree were replaced by a leaf node.

We can generate classification rules from a decision tree, if we so desire. For each leaf we generate a rule as follows: The left-hand side is the conjunction of all the split conditions on the path to the leaf, and the class is the class of the majority of the training instances at the leaf. An example of such a classification rule is

degree = masters and income > 75,000 ⇒ excellent

Other Types of Classifiers

There are several types of classifiers other than decision tree classifiers. Two types that have been quite useful are neural net classifiers and Bayesian classifiers. Neural net classifiers use the training data to train artificial neural nets. There is a large body of literature on neural nets, and we do not consider them further here.

Bayesian classifiers find the distribution of attribute values for each class in the training data; when given a new instance d, they use the distribution information to estimate, for each class cj , the probability that instance d belongs to class cj , denoted by p(cj |d), in a manner outlined here. The class with maximum probability becomes the predicted class for instance d.

To find the probability p(cj |d) of instance d being in class cj , Bayesian classifiers use Bayes’ theorem, which says

p(cj |d) = p(d|cj ) p(cj ) / p(d)

where p(d|cj ) is the probability of generating instance d given class cj , p(cj ) is the probability of occurrence of class cj , and p(d) is the probability of instance d occurring. Of these, p(d) can be ignored since it is the same for all classes. p(cj ) is simply the fraction of training instances that belong to class cj .

Finding p(d|cj ) exactly is difficult, since it requires a complete distribution of instances of cj . To simplify the task, naive Bayesian classifiers assume attributes have independent distributions, and thereby estimate

p(d|cj ) = p(d1|cj ) × p(d2|cj ) × ··· × p(dn|cj )

That is, the probability of the instance d occurring is the product of the probability of occurrence of each of the attribute values di of d, given the class is cj .

The probabilities p(di|cj ) are derived from the distribution of values for each attribute i, for each class cj . This distribution is computed from the training instances that belong to each class cj ; the distribution is usually approximated by a histogram. For instance, we may divide the range of values of attribute i into equal intervals, and store the fraction of instances of class cj that fall in each interval. Given a value di for attribute i, the value of p(di|cj ) is simply the fraction of instances belonging to class cj that fall in the interval to which di belongs.
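The following sketch builds such a naive Bayesian classifier with an equal-width histogram per (class, attribute) pair; the attribute names, the number of intervals, and the tiny training set are invented for illustration.

```python
from collections import defaultdict

def build_histograms(rows, labels, bins=4):
    """For each (class, attribute), store the fraction of that class's
    instances falling in each equal-width interval of the attribute."""
    attrs = rows[0].keys()
    lo = {a: min(r[a] for r in rows) for a in attrs}
    hi = {a: max(r[a] for r in rows) for a in attrs}

    def bucket(a, v):
        if hi[a] == lo[a]:
            return 0
        return min(int((v - lo[a]) / (hi[a] - lo[a]) * bins), bins - 1)

    counts = defaultdict(lambda: [0] * bins)
    class_sizes = defaultdict(int)
    for row, c in zip(rows, labels):
        class_sizes[c] += 1
        for a in attrs:
            counts[(c, a)][bucket(a, row[a])] += 1
    priors = {c: class_sizes[c] / len(rows) for c in class_sizes}
    hists = {k: [n / class_sizes[k[0]] for n in v] for k, v in counts.items()}
    return priors, hists, bucket

def classify(row, priors, hists, bucket):
    """Pick the class maximizing p(c) times the product of per-attribute fractions."""
    best_class, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for a, v in row.items():
            score *= hists[(c, a)][bucket(a, v)]
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Invented training data: income (in thousands) and age for two credit classes.
rows = [{"income": 20, "age": 25}, {"income": 90, "age": 45},
        {"income": 30, "age": 30}, {"income": 80, "age": 50}]
labels = ["bad", "good", "bad", "good"]
priors, hists, bucket = build_histograms(rows, labels)
print(classify({"income": 85, "age": 48}, priors, hists, bucket))  # prints: good
```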

A significant benefit of Bayesian classifiers is that they can classify instances with unknown and null attribute values — unknown or null attributes are just omitted from the probability computation. In contrast, decision tree classifiers cannot meaningfully handle situations where an instance to be classified has a null value for a partitioning attribute used to traverse further down the decision tree.

Regression

Regression deals with the prediction of a value, rather than a class. Given values for a set of variables, X1, X2,... , Xn, we wish to predict the value of a variable Y . For instance, we could treat the level of education as a number and income as another number, and, on the basis of these two variables, we wish to predict the likelihood of default, which could be a percentage chance of defaulting, or the amount involved in the default.

One way is to infer coefficients a0, a1, a2, ..., an such that

Y = a0 + a1 X1 + a2 X2 + ··· + an Xn

Finding such a linear polynomial is called linear regression. In general, we wish to find a curve (defined by a polynomial or other formula) that fits the data; the process is also called curve fitting.

The fit may only be approximate, because of noise in the data or because the relationship is not exactly a polynomial, so regression aims to find coefficients that give the best possible fit. There are standard techniques in statistics for finding regression coefficients. We do not discuss these techniques here, but the bibliographical notes provide references.
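As one concrete illustration (not a technique developed further in this chapter), the least-squares coefficients of such a linear fit can be computed with a standard numerical routine; the education coding and the sample data below are invented.

```python
import numpy as np

# Invented sample: predict Y (say, percent chance of default) from
# X1 (education level coded as a number) and X2 (income in thousands).
X = np.array([[1, 2, 30],   # the leading 1 in each row gives the intercept a0
              [1, 4, 60],
              [1, 3, 45],
              [1, 1, 20],
              [1, 4, 90]], dtype=float)
Y = np.array([40.0, 15.0, 25.0, 55.0, 5.0])

# Least-squares fit of Y = a0 + a1*X1 + a2*X2 (approximate, since the data are noisy).
coeffs, *_ = np.linalg.lstsq(X, Y, rcond=None)
a0, a1, a2 = coeffs
print(f"Y ≈ {a0:.2f} + {a1:.2f}*X1 + {a2:.2f}*X2")
```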

Association Rules

Retail shops are often interested in associations between different items that people buy. Examples of such associations are:

• Someone who buys bread is quite likely also to buy milk.

• A person who bought the book Database System Concepts is quite likely also to buy the book Operating System Concepts.

Association information can be used in several ways. When a customer buys a particular book, an online shop may suggest associated books. A grocery shop may decide to place bread close to milk, since they are often bought together, to help shoppers finish their task faster. Or the shop may place them at opposite ends of a row, and place other associated items in between to tempt people to buy those items as well, as the shoppers walk from one end of the row to the other. A shop that offers discounts on one associated item may not offer a discount on the other, since the customer will probably buy the other anyway.

Association Rules

An example of an association rule is

bread ⇒ milk

In the context of grocery-store purchases, the rule says that customers who buy bread also tend to buy milk with a high probability. An association rule must have an associated population: the population consists of a set of instances. In the grocery-store example, the population may consist of all grocery store purchases; each purchase is an instance. In the case of a bookstore, the population may consist of all people who made purchases, regardless of when they made a purchase. Each customer is an instance. Here, the analyst has decided that when a purchase is made is not significant, whereas for the grocery-store example, the analyst may have decided to concentrate on single purchases, ignoring multiple visits by the same customer.

Rules have an associated support, as well as an associated confidence. These are defined in the context of the population:

• Support is a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule.

For instance, suppose only 0.001 percent of all purchases include milk and screwdrivers. The support for the rule

milk ⇒ screwdrivers

is low. The rule may not even be statistically significant—perhaps there was only a single purchase that included both milk and screwdrivers. Businesses are usually not interested in rules that have low support, since they involve few customers, and are not worth bothering about.

On the other hand, if 50 percent of all purchases involve milk and bread, then support for rules involving bread and milk (and no other item) is relatively high, and such rules may be worth attention. Exactly what minimum degree of support is considered desirable depends on the application.

• Confidence is a measure of how often the consequent is true when the antecedent is true. For instance, the rule

bread ⇒ milk

has a confidence of 80 percent if 80 percent of the purchases that include bread also include milk. A rule with a low confidence is not meaningful. In business applications, rules usually have confidences significantly less than 100 percent, whereas in other domains, such as in physics, rules may have high confidences.

Note that the confidence of bread ⇒ milk may be very different from the confidence of milk ⇒ bread, although both have the same support.

Finding Association Rules

To discover association rules of the form

i1, i2, ..., in ⇒ i0

we first find sets of items with sufficient support, called large itemsets. In our example we find sets of items that are included in a sufficiently large number of instances. We will shortly see how to compute large itemsets.

For each large itemset, we then output all rules with sufficient confidence that involve all and only the elements of the set. For each large itemset S, we output a rule S − s ⇒ s for every proper subset s of S, provided S − s ⇒ s has sufficient confidence; the confidence of the rule is given by the support of S divided by the support of S − s.
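Given the supports of all large itemsets, rule generation is a direct application of this definition. The sketch below assumes the supports have already been computed (the dictionary of supports is invented for illustration); frozensets are used so that itemsets can serve as dictionary keys.

```python
from itertools import combinations

def rules_from_itemset(itemset, support, min_confidence):
    """Output every rule (S - s) => s whose confidence
    support(S) / support(S - s) meets the threshold."""
    S = frozenset(itemset)
    rules = []
    for r in range(1, len(S)):                 # all non-empty proper subsets s
        for s in combinations(S, r):
            s = frozenset(s)
            confidence = support[S] / support[S - s]
            if confidence >= min_confidence:
                rules.append((S - s, s, confidence))
    return rules

# Invented supports (fractions of the population containing each itemset).
support = {frozenset({"bread"}): 0.6, frozenset({"milk"}): 0.55,
           frozenset({"bread", "milk"}): 0.5}
for antecedent, consequent, conf in rules_from_itemset({"bread", "milk"}, support, 0.8):
    print(set(antecedent), "=>", set(consequent), round(conf, 2))
```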

We now consider how to generate all large itemsets. If the number of possible sets of items is small, a single pass over the data suffices to detect the level of support for all the sets. A count, initialized to 0, is maintained for each set of items. When a purchase record is fetched, the count is incremented for each set of items such that all items in the set are contained in the purchase. For instance, if a purchase included items a, b, and c, counts would be incremented for {a}, {b}, {c}, {a, b}, {b, c}, {a, c}, and {a, b, c}. Those sets with a sufficiently high count at the end of the pass correspond to items that have a high degree of association.

The number of sets grows exponentially, making the procedure just described infeasible if the number of items is large. Luckily, almost all the sets would normally have very low support; optimizations have been developed to eliminate most such sets from consideration. These techniques use multiple passes on the database, considering only some sets in each pass.

In the a priori technique for generating large itemsets, only sets with single items are considered in the first pass. In the second pass, sets with two items are considered, and so on.

At the end of a pass all sets with sufficient support are output as large itemsets.

Sets found to have too little support at the end of a pass are eliminated. Once a set is eliminated, none of its supersets needs to be considered. In other words, in pass i we need to count only supports for sets of size i such that all subsets of the set have been found to have sufficiently high support; it suffices to test all subsets of size i − 1 to ensure this property. At the end of some pass i, we would find that no set of size i has sufficient support, so we do not need to consider any set of size i + 1. Computation then terminates.
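A compact sketch of this level-wise idea follows; the transactions and the support threshold are made up, and the candidate-generation step is kept deliberately simple (it checks that every subset of size i − 1 of a candidate was found large in the previous pass).

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise large-itemset generation: pass i counts only those size-i
    candidates whose size-(i-1) subsets were all large in the previous pass."""
    n = len(transactions)
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    large = {}                                   # itemset -> support fraction
    current = [frozenset({i}) for i in items]    # pass 1: single items
    while current:
        counts = {c: sum(c <= t for t in transactions) for c in current}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        large.update(level)
        # Candidates for the next pass: unions of this pass's large sets that are
        # one item bigger, with every subset one item smaller also large.
        size = len(current[0]) + 1
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        current = [c for c in candidates
                   if all(frozenset(sub) in level for sub in combinations(c, size - 1))]
    return large

transactions = [{"bread", "milk"}, {"bread", "milk", "eggs"},
                {"bread", "butter"}, {"milk", "butter"}]
for itemset, sup in sorted(apriori(transactions, 0.5).items(), key=lambda kv: len(kv[0])):
    print(set(itemset), round(sup, 2))
```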

Other Types of Associations

Using plain association rules has several shortcomings. One of the major shortcomings is that many associations are not very interesting, since they can be predicted. For instance, if many people buy cereal and many people buy bread, we can predict that a fairly large number of people would buy both, even if there is no connection between the two purchases. What would be interesting is a deviation from the expected co-occurrence of the two. In statistical terms, we look for correlations between items; correlations can be positive, in that the co-occurrence is higher than would have been expected, or negative, in that the items co-occur less frequently than predicted. See a standard textbook on statistics for more information about correlations.

Another important class of data-mining applications is sequence associations (or correlations). Time-series data, such as stock prices on a sequence of days, form an example of sequence data. Stock-market analysts want to find associations among stock-market price sequences. An example of such an association is the following rule: “Whenever bond rates go up, the stock prices go down within 2 days.” Discovering such associations between sequences can help us to make intelligent investment decisions. See the bibliographical notes for references to research on this topic.

Deviations from temporal patterns are often interesting. For instance, if a company has been growing at a steady rate each year, a deviation from the usual growth rate is surprising. If sales of winter clothes go down in summer, it is not surprising, since we can predict it from past years; a deviation that we could not have predicted from past experience would be considered interesting. Mining techniques can find deviations from what one would have expected on the basis of past temporal/sequential patterns. See the bibliographical notes for references to research on this topic.

Clustering

Intuitively, clustering refers to the problem of finding clusters of points in the given data. The problem of clustering can be formalized from distance metrics in several ways. One way is to phrase it as the problem of grouping points into k sets (for a given k) so that the average distance of points from the centroid of their assigned cluster is minimized. Another way is to group points so that the average distance between every pair of points in each cluster is minimized. There are other definitions too; see the bibliographical notes for details. But the intuition behind all these definitions is to group similar points together in a single set.
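A minimal sketch of the first formulation is the familiar k-means iteration: assign each point to the nearest centroid, recompute each centroid as the mean of its assigned points, and repeat. The points, the choice of k, and the fixed iteration count are assumptions; practical implementations add better initialization and convergence tests.

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(cluster):
    """Centroid of a non-empty list of points."""
    return tuple(sum(coord) / len(cluster) for coord in zip(*cluster))

def kmeans(points, k, iterations=20, seed=0):
    random.seed(seed)
    centroids = random.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return clusters, centroids

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
clusters, centroids = kmeans(points, k=2)
print(clusters, centroids)
```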

Another type of clustering appears in classification systems in biology. (Such classification systems do not attempt to predict classes, rather they attempt to cluster related items together.) For instance, leopards and humans are clustered under the class mammalia, while crocodiles and snakes are clustered under reptilia. Both mammalia and reptilia come under the common class chordata. The clustering of mammalia has further subclusters, such as carnivora and primates. We thus have hierarchical clustering. Given characteristics of different species, biologists have created a complex hierarchical clustering scheme grouping related species together at different levels of the hierarchy.

Hierarchical clustering is also useful in other domains—for clustering documents, for example. Internet directory systems (such as Yahoo’s) cluster related documents in a hierarchical fashion (see Section 22.5.5). Hierarchical clustering algorithms can be classified as agglomerative clustering algorithms, which start by building small clusters and then create higher levels, or divisive clustering algorithms, which first create higher levels of the hierarchical clustering and then refine each resulting cluster into lower-level clusters.

The statistics community has studied clustering extensively. Database research has provided scalable clustering algorithms that can cluster very large data sets (that may not fit in memory). The Birch clustering algorithm is one such algorithm. Intuitively, data points are inserted into a multidimensional tree structure (based on R-trees, described in Section 23.3.5.3), and guided to appropriate leaf nodes based on nearness to representative points in the internal nodes of the tree. Nearby points are thus clustered together in leaf nodes, and summarized if there are more points than fit in memory. Some postprocessing after insertion of all points gives the desired overall clustering. See the bibliographical notes for references to the Birch algorithm, and other techniques for clustering, including algorithms for hierarchical clustering.

An interesting application of clustering is to predict what new movies (or books, or music) a person is likely to be interested in, on the basis of:

1. The person’s past preferences in movies

2. Other people with similar past preferences

3. The preferences of such people for new movies

One approach to this problem is as follows. To find people with similar past preferences we create clusters of people based on their preferences for movies. The accuracy of clustering can be improved by first clustering movies by their similarity, so even if people have not seen the same movies, if they have seen similar movies they will be clustered together. We can repeat the clustering, alternately clustering people, then movies, then people, and so on till we reach an equilibrium. Given a new user, we find a cluster of users most similar to that user, on the basis of the user’s preferences for movies already seen. We then predict movies in movie clusters that are popular with that user’s cluster as likely to be interesting to the new user. In fact, this problem is an instance of collaborative filtering, where users collaborate in the task of filtering information to find information of interest.
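A heavily simplified sketch of this idea appears below: it clusters users by overlap in the movies they liked (Jaccard similarity, one greedy pass) and recommends whatever is popular in the cluster most similar to the new user. The data and both heuristics are assumptions; in particular, the sketch omits the alternating clustering of people and movies described above.

```python
from collections import Counter

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster_users(preferences, threshold=0.3):
    """Greedy single-pass clustering: put each user in the first cluster whose
    representative is similar enough, otherwise start a new cluster."""
    clusters = []   # each cluster is a list of (user, liked-set) pairs
    for user, liked in preferences.items():
        for cluster in clusters:
            if jaccard(liked, cluster[0][1]) >= threshold:
                cluster.append((user, liked))
                break
        else:
            clusters.append([(user, liked)])
    return clusters

def recommend(new_user_likes, clusters, top_n=2):
    """Recommend the most popular unseen movies in the most similar cluster."""
    best = max(clusters, key=lambda c: jaccard(new_user_likes, c[0][1]))
    counts = Counter(m for _, liked in best for m in liked if m not in new_user_likes)
    return [movie for movie, _ in counts.most_common(top_n)]

# Invented viewing preferences.
preferences = {
    "ann":  {"Alien", "Blade Runner", "Solaris"},
    "bob":  {"Alien", "Blade Runner", "Arrival"},
    "cara": {"Amelie", "Chocolat"},
}
clusters = cluster_users(preferences)
print(recommend({"Alien", "Arrival"}, clusters))
```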

Other Types of Mining

Text mining applies data mining techniques to textual documents. For instance, there are tools that form clusters of pages that a user has visited; this helps users find pages they have visited earlier when they browse their browsing history. The distance between pages can be based, for instance, on common words in the pages (see Section 22.5.1.3). Another application is to classify pages into a Web directory automatically, according to their similarity with other pages (see Section 22.5.5).

Data-visualization systems help users to examine large volumes of data, and to detect patterns visually. Visual displays of data—such as maps, charts, and other graphical representations—allow data to be presented compactly to users. A single graphical screen can encode as much information as a far larger number of text screens. For example, if the user wants to find out whether production problems at plants are correlated to the locations of the plants, the problem locations can be encoded in a special color—say, red—on a map. The user can then quickly discover locations where problems are occurring. The user may then form hypotheses about why problems are occurring in those locations, and may verify the hypotheses quantitatively against the database.

As another example, information about values can be encoded as a color, and can be displayed with as little as one pixel of screen area. To detect associations between pairs of items, we can use a two-dimensional pixel matrix, with each row and each column representing an item. The percentage of transactions that buy both items can be encoded by the color intensity of the pixel. Items with high association will show up as bright pixels in the screen—easy to detect against the darker background.

Data visualization systems do not automatically detect patterns, but provide system support for users to detect patterns. Since humans are very good at detecting visual patterns, data visualization is an important component of data mining.
