Application Development and Administration:Data Mining
Data Mining
The term data mining refers loosely to the process of semiautomatically analyzing large databases to find useful patterns. Like knowledge discovery in artificial intelligence (also called machine learning), or statistical analysis, data mining attempts to discover rules and patterns from data. However, data mining differs from machine learning and statistics in that it deals with large volumes of data, stored primarily on disk. That is, data mining deals with “knowledge discovery in databases.”
Some types of knowledge discovered from a database can be represented by a set of rules. The following is an example of a rule, stated informally: “Young women with annual incomes greater than $50,000 are the most likely people to buy small sports cars.” Of course such rules are not universally true, and have degrees of “sup- port” and “confidence,” as we shall see. Other types of knowledge are represented by equations relating different variables to each other, or by other mechanisms for predicting outcomes when the values of some variables are known.
There are a variety of possible types of patterns that may be useful, and different techniques are used to find different types of patterns. We shall study a few examples of patterns and see how they may be automatically derived from a database.
Usually there is a manual component to data mining, consisting of preprocessing data to a form acceptable to the algorithms, and postprocessing of discovered pat- terns to find novel ones that could be useful. There may also be more than one type of pattern that can be discovered from a given database, and manual interaction may be needed to pick useful types of patterns. For this reason, data mining is really a semiautomatic process in real life. However, in our description we concentrate on the automatic aspect of mining.
Applications of Data Mining
The discovered knowledge has numerous applications. The most widely used applications are those that require some sort of prediction. For instance, when a person applies for a credit card, the credit-card company wants to predict if the person is a good credit risk. The prediction is to be based on known attributes of the person, such as age, income, debts, and past debt repayment history. Rules for making the prediction are derived from the same attributes of past and current credit card holders, along with their observed behavior, such as whether they defaulted on their credit- card dues. Other types of prediction include predicting which customers may switch over to a competitor (these customers may be offered special discounts to tempt them not to switch), predicting which people are likely to respond to promotional mail (“junk mail”), or predicting what types of phone calling card usage are likely to be fraudulent.
Another class of applications looks for associations, for instance, books that tend to be bought together. If a customer buys a book, an online bookstore may suggest other associated books. If a person buys a camera, the system may suggest accessories that tend to be bought along with cameras. A good salesperson is aware of such pat- terns and exploits them to make additional sales. The challenge is to automate the process. Other types of associations may lead to discovery of causation. For instance, discovery of unexpected associations between a newly introduced medicine and cardiac problems led to the finding that the medicine may cause cardiac problems in some people. The medicine was then withdrawn from the market.
Associations are an example of descriptive patterns. Clusters are another example of such patterns. For example, over a century ago a cluster of typhoid cases was found around a well, which led to the discovery that the water in the well was contaminated and was spreading typhoid. Detection of clusters of disease remains important even today.
Classification
As mentioned in Section 22.3.1, prediction is one of the most important types of data mining. We outline what is classification, study techniques for building one type of classifiers, called decision tree classifiers, and then study other prediction techniques.
Abstractly, the classification problem is this: Given that items belong to one of several classes, and given past instances (called training instances) of items along with the classes to which they belong, the problem is to predict the class to which a new item belongs. The class of the new instance is not known, so other attributes of the instance must be used to predict the class.
Classification can be done by finding rules that partition the given data into disjoint groups. For instance, suppose that a credit-card company wants to decide whether or not to give a credit card to an applicant. The company has a variety of information about the person, such as her age, educational background, annual in- come, and current debts, that it can use for making a decision.
Some of this information could be relevant to the credit worthiness of the appli- cant, whereas some may not be. To make the decision, the company assigns a credit- worthiness level of excellent, good, average, or bad to each of a sample set of cur-
rent customers according to each customer’s payment history. Then, the company attempts to find rules that classify its current customers into excellent, good, aver- age, or bad, on the basis of the information about the person, other than the actual payment history (which is unavailable for new customers). Let us consider just two attributes: education level (highest degree earned) and income. The rules may be of the following form:
Similar rules would also be present for the other credit worthiness levels (average and bad).
The process of building a classifier starts from a sample of data, called a training set. For each tuple in the training set, the class to which the tuple belongs is already known. For instance, the training set for a credit-card application may be the existing customers, with their credit worthiness determined from their payment history. The actual data, or population, may consist of all people, including those who are not existing customers. There are several ways of building a classifier, as we shall see.
Decision Tree Classifiers
The decision tree classifier is a widely used technique for classification. As the name suggests, decision tree classifiers use a tree; each leaf node has an associated class, and each internal node has a predicate (or more generally, a function) associated with it. Figure 22.6 shows an example of a decision tree.
To classify a new instance, we start at the root, and traverse the tree to reach a leaf; at an internal node we evaluate the predicate (or function) on the data instance,
to find which child to go to. The process continues till we reach a leaf node. For example, if the degree level of a person is masters, and the persons income is 40K, starting from the root we follow the edge labeled “masters,” and from there the edge labeled “25K to 75K,” to reach a leaf. The class at the leaf is “good,” so we predict that the credit risk of that person is good.
Building Decision Tree Classifiers
The question then is how to build a decision tree classifier, given a set of training instances. The most common way of doing so is to use a greedy algorithm, which works recursively, starting at the root and building the tree downward. Initially there is only one node, the root, and all training instances are associated with that node.
At each node, if all, or “almost all” training instances associated with the node be- long to the same class, then the node becomes a leaf node associated with that class. Otherwise, a partitioning attribute and partitioning conditions must be selected to create child nodes. The data associated with each child node is the set of training instances that satisfy the partitioning condition for that child node. In our example, the attribute degree is chosen, and four children, one for each value of degree, are created. The conditions for the four children nodes are degree = none, degree = bachelors, degree = masters, and degree = doctorate, respectively. The data associated with each child consist of training instances satisfying the condition associated with that child. At the node corresponding to masters, the attribute income is chosen, with the range of values partitioned into intervals 0 to 25,000, 25,000 to 50,000, 50,000 to 75,000, and over 75,000. The data associated with each node consist of training instances with the degree attribute being masters, and the income attribute being in each of these ranges respectively. As an optimization, since the class for the range 25,000 to 50,000 and the range 50,000 to 75,000 is the same under the node degree = masters, the two ranges have been merged into a single range 25,000 to 75,000.
Best Splits
Intuitively, by choosing a sequence of partitioning attributes, we start with the set of all training instances, which is “impure” in the sense that it contains instances from many classes, and end up with leaves which are “pure” in the sense that at each leaf all training instances belong to only one class. We shall see shortly how to measure purity quantitatively. To judge the benefit of picking a particular attribute and condition for partitioning of the data at a node, we measure the purity of the data at the children resulting from partitioning by that attribute. The attribute and condition that result in the maximum purity are chosen.
The purity of a set S of training instances can be measured quantitatively in several ways. Suppose there are k classes, and of the instances in S the fraction of instances in class i is pi. One measure of purity, the Gini measure is defined as
That is, the purity is the weighted average of the purity of the sets Si. The above formula can be used with both the Gini measure and the entropy measure of purity.
The information gain due to a particular split of S into Si,i = 1, 2,... ,r is then
Information-gain(S, {S1, S2,... , Sr }) = purity(S) − purity(S1, S2,... , Sr )
Splits into fewer sets are preferable to splits into many sets, since they lead to simpler and more meaningful decision trees. The number of elements in each of the sets Si may also be taken into account; otherwise, whether a set Si has 0 elements or 1 element would make a big difference in the number of sets, although the split is the same for almost all the elements. The information content of a particular split can be
Finding Best Splits
How do we find the best split for an attribute? How to split an attribute depends on the type of the attribute. Attributes can be either continuous valued, that is, the values can be ordered in a fashion meaningful to classification, such as age or income, or can be categorical, that is, they have no meaningful order, such as department names or country names. We do not expect the sort order of department names or country names to have any significance to classification.
Usually attributes that are numbers (integers/reals) are treated as continuous valued while character string attributes are treated as categorical, but this may be con- trolled by the user of the system. In our example, we have treated the attribute degree as categorical, and the attribute income as continuous valued.
We first consider how to find best splits for continuous-valued attributes. For simplicity, we shall only consider binary splits of continuous-valued attributes, that is, splits that result in two children. The case of multiway splits is more complicated; see the bibliographical notes for references on the subject.
To find the best binary split of a continuous-valued attribute, we first sort the at- tribute values in the training instances. We then compute the information gain obtained by splitting at each value. For example, if the training instances have values 1, 10, 15, and 25 for an attribute, the split points considered are 1, 10, and 15; in each case values less than or equal to the split point form one partition and the rest of the values form the other partition. The best binary split for the attribute is the split that gives the maximum information gain.
For a categorical attribute, we can have a multiway split, with a child for each value of the attribute. This works fine for categorical attributes with only a few distinct values, such as degree or gender. However, if the attribute has many distinct values, such as department names in a large company, creating a child for each value is not a good idea. In such cases, we would try to combine multiple values into each child, to create a smaller number of children. See the bibliographical notes for references on how to do so.
Decision-Tree Construction Algorithm
The main idea of decision tree construction is to evaluate different attributes and different partitioning conditions, and pick the attribute and partitioning condition that results in the maximum information gain ratio. The same procedure works recur-
sively on each of the sets resulting from the split, thereby recursively constructing a decision tree. If the data can be perfectly classified, the recursion stops when the purity of a set is 0. However, often data are noisy, or a set may be so small that par- titioning it further may not be justified statistically. In this case, the recursion stops when the purity of a set is “sufficiently high,” and the class of resulting leaf is defined as the class of the majority of the elements of the set. In general, different branches of the tree could grow to different levels.
Figure 22.7 shows pseudocode for a recursive tree construction procedure, which takes a set of training instances S as parameter. The recursion stops when the set is sufficiently pure or the set S is too small for further partitioning to be statistically significant. The parameters δp and δs define cutoffs for purity and size; the system may give them default values, that may be overridden by users.
There are a wide variety of decision tree construction algorithms, and we outline the distinguishing features of a few of them. See the bibliographical notes for details.
With very large data sets, partitioning may be expensive, since it involves repeated copying. Several algorithms have therefore been developed to minimize the I/O and computation cost when the training data are larger than available memory.
Several of the algorithms also prune subtrees of the generated decision tree to reduce overfitting: A subtree is overfitted if it has been so highly tuned to the specifics of the training data that it makes many classification errors on other data. A subtree is pruned by replacing it with a leaf node. There are different pruning heuristics; one heuristic uses part of the training data to build the tree and another part of the training data to test it. The heuristic prunes a subtree if it finds that misclassification on the test instances would be reduced if the subtree were replaced by a leaf node.
We can generate classification rules from a decision tree, if we so desire. For each leaf we generate a rule as follows: The left-hand side is the conjunction of all the split conditions on the path to the leaf, and the class is the class of the majority of the training instances at the leaf. An example of such a classification rule is
degree = masters and income > 75, 000 ⇒ excellent
Other Types of Classifiers
There are several types of classifiers other than decision tree classifiers. Two types that have been quite useful are neural net classifiers and Bayesian classifiers. Neural net classifiers use the training data to train artificial neural nets. There is a large body of literature on neural nets, and we do not consider them further here.
Bayesian classifiers find the distribution of attribute values for each class in the training data; when given a new instance d, they use the distribution information to estimate, for each class cj , the probability that instance d belongs to class cj , denoted by p(cj |d), in a manner outlined here. The class with maximum probability becomes the predicted class for instance d.
To find the probability p(cj |d) of instance d being in class cj , Bayesian classifiers use Bayes’ theorem, which says
where p(d|cj ) is the probability of generating instance d given class cj , p(cj ) is the probability of occurrence of class cj , and p(d) is the probability of instance d occur- ring. Of these, p(d) can be ignored since it is the same for all classes. p(cj ) is simply the fraction of training instances that belong to class cj .
Finding p(d|cj ) exactly is difficult, since it requires a complete distribution of instances of cj . To simplify the task, naive Bayesian classifiers assume attributes have independent distributions, and thereby estimate
That is, the probability of the instance d occurring is the product of the probability of occurrence of each of the attribute values di of d, given the class is cj .
The probabilities p(di|cj ) derive from the distribution of values for each attribute i, for each class class cj . This distribution is computed from the training instances that belong to each class cj ; the distribution is usually approximated by a histogram. For instance, we may divide the range of values of attribute i into equal intervals, and store the fraction of instances of class cj that fall in each interval. Given a value di for attribute i, the value of p(di|cj ) is simply the fraction of instances belonging to class cj that fall in the interval to which di belongs.
A significant benefit of Bayesian classifiers is that they can classify instances with unknown and null attribute values — unknown or null attributes are just omitted from the probability computation. In contrast, decision tree classifiers cannot meaningfully handle situations where an instance to be classified has a null value for a partitioning attribute used to traverse further down the decision tree.
Regression
Regression deals with the prediction of a value, rather than a class. Given values for a set of variables, X1, X2,... , Xn, we wish to predict the value of a variable Y . For instance, we could treat the level of education as a number and income as another number, and, on the basis of these two variables, we wish to predict the likelihood of default, which could be a percentage chance of defaulting, or the amount involved in the default.
One way is to infer coefficients a0, a1, a1,... , an such that
Y = a0 + a1 ∗ X1 + a2 ∗ X2 + ··· + an ∗ Xn
Finding such a linear polynomial is called linear regression. In general, we wish to find a curve (defined by a polynomial or other formula) that fits the data; the process is also called curve fitting.
The fit may only be approximate, because of noise in the data or because the relationship is not exactly a polynomial, so regression aims to find coefficients that give the best possible fit. There are standard techniques in statistics for finding regression coefficients. We do not discuss these techniques here, but the bibliographical notes provide references.
Association Rules
Retail shops are often interested in associations between different items that people buy. Examples of such associations are:
• Someone who buys bread is quite likely also to buy milk
• A person who bought the book Database System Concepts is quite likely also to buy the book Operating System Concepts.
Association information can be used in several ways. When a customer buys a particular book, an online shop may suggest associated books. A grocery shop may decide to place bread close to milk, since they are often bought together, to help shoppers fin- ish their task faster. Or the shop may place them at opposite ends of a row, and place other associated items in between to tempt people to buy those items as well, as the shoppers walk from one end of the row to the other. A shop that offers discounts on one associated item may not offer a discount on the other, since the customer will probably buy the other anyway.
Association Rules
An example of an association rule is
bread ⇒ milk
In the context of grocery-store purchases, the rule says that customers who buy bread also tend to buy milk with a high probability. An association rule must have an associated population: the population consists of a set of instances. In the grocery-store example, the population may consist of all grocery store purchases; each purchase is an instance. In the case of a bookstore, the population may consist of all people who made purchases, regardless of when they made a purchase. Each customer is an in- stance. Here, the analyst has decided that when a purchase is made is not significant, whereas for the grocery-store example, the analyst may have decided to concentrate on single purchases, ignoring multiple visits by the same customer.
Rules have an associated support, as well as an associated confidence. These are defined in the context of the population:
• Support is a measure of what fraction of the population satisfies both the an- tecedent and the consequent of the rule.
For instance, suppose only 0.001 percent of all purchases include milk and screwdrivers. The support for the rule
milk ⇒ screwdrivers
is low. The rule may not even be statistically significant—perhaps there was only a single purchase that included both milk and screwdrivers. Businesses are usually not interested in rules that have low support, since they involve few customers, and are not worth bothering about.
On the other hand, if 50 percent of all purchases involve milk and bread, then support for rules involving bread and milk (and no other item) is relatively high, and such rules may be worth attention. Exactly what minimum degree of support is considered desirable depends on the application.
• Confidence is a measure of how often the consequent is true when the antecedent is true. For instance, the rule
bread ⇒ milk
has a confidence of 80 percent if 80 percent of the purchases that include bread also include milk. A rule with a low confidence is not meaningful. In business applications, rules usually have confidences significantly less than 100 percent, whereas in other domains, such as in physics, rules may have high confidences.
Note that the confidence of bread ⇒ milk may be very different from the confidence of milk ⇒ bread, although both have the same support.
Finding Association Rules
To discover association rules of the form
i1, i2,... , in ⇒ i0
we first find sets of items with sufficient support, called large itemsets. In our example we find sets of items that are included in a sufficiently large number of instances. We will shortly see how to compute large itemsets.
For each large itemset, we then output all rules with sufficient confidence that involve all and only the elements of the set. For each large itemset S, we output a
rule S − s ⇒ s for every subset s ⊂ S, provided S − s ⇒ s has sufficient confidence; the confidence of the rule is given by support of s divided by support of S.
We now consider how to generate all large itemsets. If the number of possible sets of items is small, a single pass over the data suffices to detect the level of support for all the sets. A count, initialized to 0, is maintained for each set of items. When a purchase record is fetched, the count is incremented for each set of items such that all items in the set are contained in the purchase. For instance, if a purchase included items a, b, and c, counts would be incremented for {a}, {b}, {c}, {a, b}, {b, c}, {a, c}, and {a, b, c}. Those sets with a sufficiently high count at the end of the pass correspond to items that have a high degree of association.
The number of sets grows exponentially, making the procedure just described infeasible if the number of items is large. Luckily, almost all the sets would normally have very low support; optimizations have been developed to eliminate most such sets from consideration. These techniques use multiple passes on the database, considering only some sets in each pass.
In the a priori technique for generating large itemsets, only sets with single items are considered in the first pass. In the second pass, sets with two items are considered, and so on.
At the end of a pass all sets with sufficient support are output as large itemsets.
Sets found to have too little support at the end of a pass are eliminated. Once a set is eliminated, none of its supersets needs to be considered. In other words, in pass i we need to count only supports for sets of size i such that all subsets of the set have been found to have sufficiently high support; it suffices to test all subsets of size i − 1 to ensure this property. At the end of some pass i, we would find that no set of size i has sufficient support, so we do not need to consider any set of size i + 1. Computation then terminates.
Other Types of Associations
Using plain association rules has several shortcomings. One of the major shortcomings is that many associations are not very interesting, since they can be predicted. For instance, if many people buy cereal and many people buy bread, we can predict that a fairly large number of people would buy both, even if there is no connection be- tween the two purchases. What would be interesting is a deviation from the expected co-occurrence of the two. In statistical terms, we look for correlations between items; correlations can be positive, in that the co-occurrence is higher than would have been expected, or negative, in that the items co-occur less frequently than predicted. See a standard textbook on statistics for more information about correlations.
Another important class of data-mining applications is sequence associations (or correlations). Time-series data, such as stock prices on a sequence of days, form an example of sequence data. Stock-market analysts want to find associations among stock-market price sequences. An example of such a association is the following rule: “Whenever bond rates go up, the stock prices go down within 2 days.” Discovering such association between sequences can help us to make intelligent investment decisions. See the bibliographical notes for references to research on this topic.
Deviations from temporal patterns are often interesting. For instance, if a company has been growing at a steady rate each year, a deviation from the usual growth rate is surprising. If sales of winter clothes go down in summer, it is not surprising, since we can predict it from past years; a deviation that we could not have predicted from past experience would be considered interesting. Mining techniques can find deviations from what one would have expected on the basis of past temporal/sequential patterns. See the bibliographical notes for references to research on this topic.
Clustering
Intuitively, clustering refers to the problem of finding clusters of points in the given data. The problem of clustering can be formalized from distance metrics in several ways. One way is to phrase it as the problem of grouping points into k sets (for a given k) so that the average distance of points from the centroid of their assigned cluster is minimized.5 Another way is to group points so that the average distance between every pair of points in each cluster is minimized. There are other definitions too; see the bibliographical notes for details. But the intuition behind all these definitions is to group similar points together in a single set.
Another type of clustering appears in classification systems in biology. (Such classification systems do not attempt to predict classes, rather they attempt to cluster related items together.) For instance, leopards and humans are clustered under the class mammalia, while crocodiles and snakes are clustered under reptilia. Both mammalia and reptilia come under the common class chordata. The clustering of mammalia has further subclusters, such as carnivora and primates. We thus have hierarchical clus- tering. Given characteristics of different species, biologists have created a complex hierarchical clustering scheme grouping related species together at different levels of the hierarchy.
Hierarchical clustering is also useful in other domains—for clustering documents, for example. Internet directory systems (such as Yahoo’s) cluster related documents in a hierarchical fashion (see Section 22.5.5). Hierarchical clustering algorithms can
be classified as agglomerative clustering algorithms, which start by building small clusters and then creater higher levels, or divisive clustering algorithms, which first create higher levels of the hierarchical clustering, then refine each resulting cluster into lower level clusters.
The statistics community has studied clustering extensively. Database research has provided scalable clustering algorithms that can cluster very large data sets (that may not fit in memory). The Birch clustering algorithm is one such algorithm. Intuitively, data points are inserted into a multidimensional tree structure (based on R-trees, de- scribed in Section 23.3.5.3), and guided to appropriate leaf nodes based on nearness to representative points in the internal nodes of the tree. Nearby points are thus clus- tered together in leaf nodes, and summarized if there are more points than fit in memory. Some postprocessing after insertion of all points gives the desired overall clustering. See the bibliographical notes for references to the Birch algorithm, and other techniques for clustering, including algorithms for hierarchical clustering.
An interesting application of clustering is to predict what new movies (or books,
or music) a person is likely to be interested in, on the basis of:
1. The person’s past preferences in movies
2. Other people with similar past preferences
3. The preferences of such people for new movies
One approach to this problem is as follows. To find people with similar past prefer- ences we create clusters of people based on their preferences for movies. The accuracy of clustering can be improved by previously clustering movies by their similarity, so even if people have not seen the same movies, if they have seen similar movies they would be clustered together. We can repeat the clustering, alternately clustering people, then movies, then people, and so on till we reache an equilibrium. Given a new user, we find a cluster of users most similar to that user, on the basis of the user’s preferences for movies already seen. We then predict movies in movie clusters that are popular with that user’s cluster as likely to be interesting to the new user. In fact, this problem is an instance of collaborative filtering, where users collaborate in the task of filtering information to find information of interest.
Other Types of Mining
Text mining applies data mining techniques to textual documents. For instance, there are tools that form clusters on pages that a user has visited; this helps users when they browse the history of their browsing to find pages they have visited earlier. The distance between pages can be based, for instance, on common words in the pages (see Section 22.5.1.3). Another application is to classify pages into a Web directory automatically, according to their similarity with other pages (see Section 22.5.5).
Data-visualization systems help users to examine large volumes of data, and to detect patterns visually. Visual displays of data—such as maps, charts, and other graphical representations—allow data to be presented compactly to users. A single graphical screen can encode as much information as a far larger number of text screens. For example, if the user wants to find out whether production problems at plants are correlated to the locations of the plants, the problem locations can be en- coded in a special color—say, red—on a map. The user can then quickly discover locations where problems are occurring. The user may then form hypotheses about why problems are occurring in those locations, and may verify the hypotheses quantitatively against the database.
As another example, information about values can be encoded as a color, and can be displayed with as little as one pixel of screen area. To detect associations between pairs of items, we can use a two-dimensional pixel matrix, with each row and each column representing an item. The percentage of transactions that buy both items can be encoded by the color intensity of the pixel. Items with high association will show up as bright pixels in the screen—easy to detect against the darker background.
Data visualization systems do not automatically detect patterns, but provide system support for users to detect patterns. Since humans are very good at detecting visual patterns, data visualization is an important component of data mining.
Comments
Post a Comment