Contents

Introduction
1.1 Machine Perception
1.2 An Example
1.2.1 Related fields
1.3 The Sub-problems of Pattern Classification
1.3.1 Feature Extraction
1.3.2 Noise
1.3.3 Overfitting
1.3.4 Model Selection
1.3.5 Prior Knowledge
1.3.6 Missing Features
1.3.7 Mereology
1.3.8 Segmentation
1.3.9 Context
1.3.10 Invariances
1.3.11 Evidence Pooling
1.3.12 Costs and Risks
1.3.13 Computational Complexity
1.4 Learning and Adaptation
1.4.1 Supervised Learning
1.4.2 Unsupervised Learning
1.4.3 Reinforcement Learning
1.5 Conclusion
Summary by Chapters
Bibliographical and Historical Remarks
Bibliography
Index

Chapter 1: Introduction

The ease with which we recognize a face, understand spoken words, read handwritten characters, identify our car keys in our pocket by feel, and decide whether an apple is ripe by its smell belies the astoundingly complex processes that underlie these acts of pattern recognition. Pattern recognition — the act of taking in raw data and taking an action based on the "category" of the pattern — has been crucial for our survival, and over the past tens of millions of years we have evolved highly sophisticated neural and cognitive systems for such tasks.

1.1 Machine Perception

It is natural that we should seek to design and build machines that can recognize patterns. From automated speech recognition, fingerprint identification, optical character recognition, DNA sequence identification and much more, it is clear that reliable, accurate pattern recognition by machine would be immensely useful. Moreover, in solving the myriad problems required to build such systems, we gain deeper understanding and appreciation for pattern recognition systems in the natural world — most particularly in humans. For some applications, such as speech and visual recognition, our design efforts may in fact be influenced by knowledge of how these are solved in nature, both in the algorithms we employ and in the design of special-purpose hardware.

1.2 An Example

To illustrate the complexity of some of the types of problems involved, let us consider the following imaginary and somewhat fanciful example. Suppose that a fish packing plant wants to automate the process of sorting incoming fish on a conveyor belt according to species. As a pilot project it is decided to try to separate sea bass from salmon using optical sensing. We set up a camera, take some sample images and begin to note some physical differences between the two types of fish — length, lightness, width, number and shape of fins, position of the mouth, and so on — and these suggest features to explore for use in our classifier. We also notice noise or variations in the images — variations in lighting, position of the fish on the conveyor, even "static" due to the electronics of the camera itself.

Given that there truly are differences between the population of sea bass and that of salmon, we view them as having different models — different descriptions, which are typically mathematical in form. The overarching goal and approach in pattern classification is to hypothesize the class of these models, process the sensed data to eliminate noise (not due to the models), and for any sensed pattern choose the model that corresponds best. Any techniques that further this aim should be in the conceptual toolbox of the designer of pattern recognition systems.

Our prototype system to perform this very specific task might well have the form shown in Fig. 1.1.
First the camera captures an image of the fish. Next, the camera's signals are preprocessed to simplify subsequent operations without losing relevant information. In particular, we might use a segmentation operation in which the images of different fish are somehow isolated from one another and from the background. The information from a single fish is then sent to a feature extractor, whose purpose is to reduce the data by measuring certain "features" or "properties." These features (or, more precisely, the values of these features) are then passed to a classifier that evaluates the evidence presented and makes a final decision as to the species. The preprocessor might automatically adjust for average light level, or threshold the image to remove the background of the conveyor belt, and so forth.

Figure 1.1: The objects to be classified are first sensed by a transducer (camera), whose signals are preprocessed, then the features extracted, and finally the classification emitted (here either "salmon" or "sea bass"). Although the information flow is often chosen to be from the source to the classifier ("bottom-up"), some systems employ "top-down" flow as well, in which earlier levels of processing can be altered based on the tentative or preliminary response in later levels (gray arrows). Yet others combine two or more stages into a unified step, such as simultaneous segmentation and feature extraction.

For the moment let us pass over how the images of the fish might be segmented and consider how the feature extractor and classifier might be designed. Suppose somebody at the fish plant tells us that a sea bass is generally longer than a salmon. These, then, give us our tentative models for the fish: sea bass have some typical length, and this is greater than that for salmon. Then length becomes an obvious feature, and we might attempt to classify the fish merely by seeing whether or not the length l of a fish exceeds some critical value l∗. To choose l∗ we could obtain some design or training samples of the different types of fish, (somehow) make length measurements, and inspect the results.

Suppose that we do this, and obtain the histograms shown in Fig. 1.2. These disappointing histograms bear out the statement that sea bass are somewhat longer than salmon, on average, but it is clear that this single criterion is quite poor; no matter how we choose l∗, we cannot reliably separate sea bass from salmon by length alone.

Figure 1.2: Histograms for the length feature for the two categories. No single threshold value l∗ (decision boundary) will serve to unambiguously discriminate between the two categories; using length alone, we will have some errors. The value l∗ marked will lead to the smallest number of errors, on average.

Discouraged, but undeterred by these unpromising results, we try another feature — the average lightness of the fish scales. Now we are very careful to eliminate variations in illumination, since they can only obscure the models and corrupt our new classifier. The resulting histograms, shown in Fig. 1.3, are much more satisfactory — the classes are much better separated.

Figure 1.3: Histograms for the lightness feature for the two categories. No single threshold value x∗ (decision boundary) will serve to unambiguously discriminate between the two categories; using lightness alone, we will have some errors. The value x∗ marked will lead to the smallest number of errors, on average.

So far we have tacitly assumed that the consequences of our actions are equally costly: deciding the fish was a sea bass when in fact it was a salmon was just as undesirable as the converse. Such a symmetry in the cost is often, but not invariably, the case. For instance, as a fish packing company we may know that our customers easily accept occasional pieces of tasty salmon in their cans labeled "sea bass," but they object vigorously if a piece of sea bass appears in their cans labeled "salmon." If we want to stay in business, we should adjust our decision boundary to avoid antagonizing our customers, even if it means that more salmon makes its way into the cans of sea bass. In this case, then, we should move our decision boundary x∗ to smaller values of lightness, thereby reducing the number of sea bass that are classified as salmon (Fig. 1.3). The more our customers object to getting sea bass with their salmon — i.e., the more costly this type of error — the lower we should set the decision threshold x∗ in Fig. 1.3.

Such considerations suggest that there is an overall single cost associated with our decision, and our true task is to make a decision rule (i.e., set a decision boundary) so as to minimize such a cost. This is the central task of decision theory, of which pattern classification is perhaps the most important subfield.
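To make the choice of such a threshold concrete, here is a minimal sketch (not from the text) that sweeps candidate values of x∗ over training lightness measurements and keeps the one with the lowest average cost; the data, the cost values, and the helper name choose_threshold are all invented for illustration.

```python
import numpy as np

def choose_threshold(salmon, sea_bass, cost_bass_as_salmon=2.0, cost_salmon_as_bass=1.0):
    """Pick a lightness threshold x*: classify as sea bass if x > x*, else salmon.

    Returns the threshold that minimizes total cost on the training samples.
    Making it costlier to label sea bass as salmon pushes x* to smaller values.
    """
    values = np.sort(np.concatenate([salmon, sea_bass]))
    # Candidate thresholds: midpoints between consecutive sorted training values.
    candidates = (values[:-1] + values[1:]) / 2.0
    best_x, best_cost = None, np.inf
    for x in candidates:
        cost = (cost_salmon_as_bass * np.sum(salmon > x) +      # salmon called sea bass
                cost_bass_as_salmon * np.sum(sea_bass <= x))    # sea bass called salmon
        if cost < best_cost:
            best_x, best_cost = x, cost
    return best_x

# Hypothetical lightness measurements for the two classes.
rng = np.random.default_rng(0)
salmon_lightness = rng.normal(3.0, 1.0, 100)
sea_bass_lightness = rng.normal(6.0, 1.0, 100)
print(choose_threshold(salmon_lightness, sea_bass_lightness))
```

With equal costs the chosen threshold sits near the point where the two histograms overlap; raising the cost of labeling sea bass as salmon pushes x∗ toward smaller lightness values, as the passage argues.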
Even if we know the costs associated with our decisions and choose the optimal decision boundary x∗, we may be dissatisfied with the resulting performance. Our first impulse might be to seek yet a different feature on which to separate the fish. Let us assume, though, that no other single visual feature yields better performance than that based on lightness. To improve recognition, then, we must resort to the use of more than one feature at a time.

In our search for other features, we might try to capitalize on the observation that sea bass are typically wider than salmon. Now we have two features for classifying fish — the lightness x1 and the width x2. If we ignore how these features might be measured in practice, we realize that the feature extractor has thus reduced the image of each fish to a point or feature vector x in a two-dimensional feature space, where

\mathbf{x} = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}.

Our problem now is to partition the feature space into two regions, where for all patterns in one region we will call the fish a sea bass, and for all points in the other we will call it a salmon. Suppose that we measure the feature vectors for our samples and obtain the scattering of points shown in Fig. 1.4. This plot suggests the following rule for separating the fish: Classify the fish as sea bass if its feature vector falls above the decision boundary shown, and as salmon otherwise.

Figure 1.4: The two features of lightness and width for sea bass and salmon. The dark line might serve as a decision boundary of our classifier. Overall classification error on the data shown is lower than if we use only one feature as in Fig. 1.3, but there will still be some errors.
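A hedged sketch of such a rule (the weights, offset, and sample feature values below are placeholders, not taken from Fig. 1.4): classify a fish by which side of a straight line its feature vector (lightness, width) falls on.

```python
import numpy as np

def classify(features, w=np.array([0.6, 1.0]), b=-20.0):
    """Linear decision rule on x = (lightness, width).

    Returns "sea bass" when w . x + b > 0, i.e. when the point lies on the
    sea-bass side of the straight boundary line, and "salmon" otherwise.
    The weight vector w and offset b are illustrative placeholders; in
    practice they would be fit to training samples.
    """
    x = np.asarray(features, dtype=float)
    return "sea bass" if float(w @ x) + b > 0 else "salmon"

print(classify([8.0, 17.0]))   # lighter, wider fish
print(classify([3.0, 15.0]))   # darker, narrower fish
```

Fitting w and b to training samples, rather than fixing them by hand, is precisely the classifier-design problem taken up in the following paragraphs.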
This rule appears to do a good job of separating our samples and suggests that perhaps incorporating yet more features would be desirable. Besides the lightness and width of the fish, we might include some shape parameter, such as the vertex angle of the dorsal fin, or the placement of the eyes (as expressed as a proportion of the mouth-to-tail distance), and so on. How do we know beforehand which of these features will work best? Some features might be redundant: for instance, if the eye color of all fish correlated perfectly with width, then classification performance need not be improved if we also include eye color as a feature. Even if the difficulty or computational cost in attaining more features is of no concern, might we ever have too many features?

Suppose that other features are too expensive or too difficult to measure, or provide little improvement (or possibly even degrade the performance) in the approach described above, and that we are forced to make our decision based on the two features in Fig. 1.4. If our models were extremely complicated, our classifier would have a decision boundary more complex than the simple straight line. In that case all the training patterns would be separated perfectly, as shown in Fig. 1.5.

Figure 1.5: Overly complex models for the fish will lead to decision boundaries that are complicated. While such a decision may lead to perfect classification of our training samples, it would lead to poor performance on future patterns. The novel test point marked ? is evidently most likely a salmon, whereas the complex decision boundary shown leads it to be misclassified as a sea bass.

With such a "solution," though, our satisfaction would be premature because the central aim of designing a classifier is to suggest actions when presented with novel patterns, i.e., fish not yet seen. This is the issue of generalization. It is unlikely that the complex decision boundary in Fig. 1.5 would provide good generalization, since it seems to be "tuned" to the particular training samples, rather than some underlying characteristics or true model of all the sea bass and salmon that will have to be separated.

Naturally, one approach would be to get more training samples for obtaining a better estimate of the true underlying characteristics, for instance the probability distributions of the categories. In most pattern recognition problems, however, the amount of such data we can obtain easily is often quite limited. Even with a vast amount of training data in a continuous feature space, though, if we followed the approach in Fig. 1.5 our classifier would give a horrendously complicated decision boundary — one that would be unlikely to do well on novel patterns.

Rather, then, we might seek to "simplify" the recognizer, motivated by a belief that the underlying models will not require a decision boundary that is as complex as that in Fig. 1.5. Indeed, we might be satisfied with slightly poorer performance on the training samples if it means that our classifier will have better performance on novel patterns.∗

∗ The philosophical underpinnings of this approach derive from William of Occam (1284–1347?), who advocated favoring simpler explanations over those that are needlessly complicated — Entia non sunt multiplicanda praeter necessitatem ("Entities are not to be multiplied without necessity"). Decisions based on overly complex models often lead to lower accuracy of the classifier.

But if designing a very complex recognizer is unlikely to give good generalization, precisely how should we quantify and favor simpler classifiers? How would our system automatically determine that the simple curve in Fig. 1.6 is preferable to the manifestly simpler straight line in Fig. 1.4 or the complicated boundary in Fig. 1.5? Assuming that we somehow manage to optimize this tradeoff, can we then predict how well our system will generalize to new patterns? These are some of the central problems in statistical pattern recognition.
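The generalization issue can be seen in a small synthetic experiment, sketched below under assumed Gaussian data (none of this comes from the book's figures): a nearest-mean rule, which yields a simple linear boundary, is compared with a 1-nearest-neighbor rule, which can carve a boundary complex enough to separate the training set perfectly.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    # Two overlapping Gaussian classes in the (lightness, width) plane.
    salmon = rng.normal([3.0, 15.0], 1.5, size=(n, 2))
    bass = rng.normal([6.0, 18.0], 1.5, size=(n, 2))
    return np.vstack([salmon, bass]), np.array([0] * n + [1] * n)

X_train, y_train = make_data(30)
X_test, y_test = make_data(500)

# Simple model: nearest class mean (a straight-line decision boundary).
means = np.array([X_train[y_train == c].mean(axis=0) for c in (0, 1)])
def nearest_mean(X):
    d = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
    return d.argmin(axis=1)

# Complex model: 1-nearest-neighbor (memorizes the training samples).
def one_nn(X):
    d = np.linalg.norm(X[:, None, :] - X_train[None, :, :], axis=2)
    return y_train[d.argmin(axis=1)]

for name, f in [("nearest mean", nearest_mean), ("1-NN", one_nn)]:
    train_err = np.mean(f(X_train) != y_train)
    test_err = np.mean(f(X_test) != y_test)
    print(f"{name:12s} train error {train_err:.2f}  test error {test_err:.2f}")
```

The complex rule drives the training error to zero yet typically does worse on the held-out samples than the simple rule, which is the behavior Fig. 1.5 warns about.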
Figure 1.6: The decision boundary shown might represent the optimal tradeoff between performance on the training set and simplicity of classifier.

For the same incoming patterns, we might need to use a drastically different cost function, and this will lead to different actions altogether. We might, for instance, wish instead to separate the fish based on their sex — all females (of either species) from all males — if we wish to sell roe. Alternatively, we might wish to cull the damaged fish (to prepare separately for cat food), and so on. Different decision tasks may require features and yield boundaries quite different from those useful for our original categorization problem.

This makes it quite clear that our decisions are fundamentally task- or cost-specific, and that creating a single general-purpose artificial pattern recognition device — i.e., one capable of acting accurately based on a wide variety of tasks — is a profoundly difficult challenge. This, too, should give us added appreciation of the ability of humans to switch rapidly and fluidly between pattern recognition tasks.

Since classification is, at base, the task of recovering the model that generated the patterns, different classification techniques are useful depending on the type of candidate models themselves. In statistical pattern recognition we focus on the statistical properties of the patterns (generally expressed in probability densities), and this will command most of our attention in this book. Here the model for a pattern may be a single specific set of features, though the actual pattern sensed has been corrupted by some form of random noise. Occasionally it is claimed that neural pattern recognition (or neural network pattern classification) should be considered its own discipline, but despite its somewhat different intellectual pedigree, we will consider it a close descendant of statistical pattern recognition, for reasons that will become clear. If instead the model consists of some set of crisp logical rules, then we employ the methods of syntactic pattern recognition, where rules or grammars describe our decision. For example, we might wish to classify an English sentence as grammatical or not, and here statistical descriptions (word frequencies, word correlations, etc.) are inappropriate.
It was necessary in our fish example to choose our features carefully, and hence achieve a representation (as in Fig. 1.6) that enabled reasonably successful pattern classification. A central aspect in virtually every pattern recognition problem is that of achieving such a "good" representation, one in which the structural relationships among the components are simply and naturally revealed, and one in which the true (unknown) model of the patterns can be expressed. In some cases patterns should be represented as vectors of real-valued numbers, in others ordered lists of attributes, in yet others descriptions of parts and their relations, and so forth. We seek a represen-

APPENDIX A MATHEMATICAL FOUNDATIONS

Thus, we have verified that the conditional density p_{x2|x1}(x2|x1) is a normal distribution. Moreover, we have explicit formulas for the conditional mean µ_{2|1} and the conditional variance σ²_{2|1}:

\mu_{2|1} = \mu_2 + \rho\,\frac{\sigma_2}{\sigma_1}(x_1 - \mu_1) \qquad \text{and} \qquad \sigma_{2|1}^2 = \sigma_2^2\,(1 - \rho^2), \qquad (116)

as illustrated in Fig. A.4. These formulas provide some insight into the question of how knowledge of the value of x1 helps us to estimate x2. Suppose that we know the value of x1. Then a natural estimate for x2 is the conditional mean, µ_{2|1}. In general, µ_{2|1} is a linear function of x1; if the correlation coefficient ρ is positive, the larger the value of x1, the larger the value of µ_{2|1}. If it happens that x1 is the mean value µ1, then the best we can do is to guess that x2 is equal to µ2. Also, if there is no correlation between x1 and x2, we ignore the value of x1, whatever it is, and we always estimate x2 by µ2. Note that in that case the variance of x2, given that we know x1, is the same as the variance for the marginal distribution, i.e., σ²_{2|1} = σ²_2. If there is correlation, knowledge of the value of x1, whatever the value is, reduces the variance. Indeed, with 100% correlation there is no variance left in x2 when the value of x1 is known.

A.6 Hypothesis testing

Suppose samples are drawn either from distribution D0 or they are not. In pattern classification, we seek to determine which distribution was the source of any sample, and if it is indeed D0, we would classify the point accordingly, into ω1, say. Hypothesis testing addresses a somewhat different but related problem. We assume initially that distribution D0 is the source of the patterns; this is called the null hypothesis, and often denoted H0. Based on the value of any observed sample we ask whether we can reject the null hypothesis, that is, state with some degree of confidence (expressed as a probability) that the sample did not come from D0. For instance, D0 might be a standardized Gaussian, p(x) ∼ N(0, 1), and our null hypothesis is that a sample comes from a Gaussian with mean µ = 0. If the value of a particular sample is small (e.g., x = 0.3), it is likely that it came from D0; after all, 68% of the samples drawn from that distribution have absolute value less than x = 1.0 (cf. Fig. A.1). If a sample's value is large (e.g., x = 5), then we would be more confident that it did not come from D0. In such a situation we merely conclude that (with some probability) the sample was drawn from a distribution with µ ≠ 0. Viewed another way, for any confidence — expressed as a probability — there exists a criterion value such that if the sampled value differs from µ = 0 by more than that criterion, we reject the null hypothesis. (It is traditional to use confidences of .01 or .05.)
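A minimal sketch of this test for a single observation under the standardized Gaussian null hypothesis; the function name and the sample values are ours.

```python
from math import erfc, sqrt

def two_sided_p_value(x, mu0=0.0, sigma=1.0):
    """Probability, under the null hypothesis N(mu0, sigma^2), of drawing a
    value at least as far from mu0 as the observed x (two-sided)."""
    z = abs(x - mu0) / sigma
    return erfc(z / sqrt(2))  # equals 2 * (1 - Phi(z)) for a standard normal

for x in (0.3, 2.576, 5.0):
    p = two_sided_p_value(x)
    verdict = "reject H0 at the .01 level" if p < 0.01 else "cannot reject H0"
    print(f"x = {x:5.3f}  p = {p:.4f}  -> {verdict}")
```

Observations farther than about 2.576 from µ = 0 give a two-sided probability below .01, matching the criterion value discussed next.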
We then say that the difference of the sample from 0 is statistically significant. For instance, if our null hypothesis is a standardized Gaussian, then if our sample differs from the value x = 0 by more than 2.576, we could reject the null hypothesis "at the .01 confidence level," as can be deduced from Table A.1. A more sophisticated analysis could be applied if several samples are all drawn from D0 or if the null hypothesis involved a distribution other than a Gaussian. Of course, this usage of "significance" applies only to the statistical properties of the problem — it implies nothing about whether the results are "important." Hypothesis testing is of great generality, and useful when we seek to know whether something other than the assumed case (the null hypothesis) is the case.

A.6.1 Chi-squared test

Hypothesis testing can be applied to discrete problems too. Suppose we have n patterns — n1 of which are known to be in ω1, and n2 in ω2 — and we are interested in determining whether a particular decision rule is useful or informative. In this case, the null hypothesis is a random decision rule — one that selects a pattern and with some probability P places it in a category which we will call the "left" category, and otherwise in the "right" category. We say that a candidate rule is informative if it differs significantly from such a random decision. What we need is a clear mathematical definition of statistical significance under these conditions.

The random rule (the null hypothesis) would place P n1 patterns from ω1 and P n2 from ω2 independently in the left category, and the remainder in the right category. Our candidate decision rule would differ significantly from the random rule if the proportions differed significantly from those given by the random rule. Formally, we let n_{iL} denote the number of patterns from category ωi placed in the left category by our candidate rule. The so-called chi-squared statistic for this case is

\chi^2 = \sum_{i=1}^{2} \frac{(n_{iL} - n_{ie})^2}{n_{ie}}, \qquad (117)

where, according to the null hypothesis, the number of patterns in category ωi that we expect to be placed in the left category is n_{ie} = P n_i. Clearly χ² is non-negative, and is zero if and only if all the observed numbers match the expected numbers. The higher the χ², the less likely it is that the null hypothesis is true. Thus, for a sufficiently high χ², the difference between the expected and observed distributions is statistically significant, we can reject the null hypothesis, and we can consider our candidate decision rule "informative." For any desired level of significance — such as .01 or .05 — a table gives the critical values of χ² that allow us to reject the null hypothesis (Table A.2).

There is one detail that must be addressed: the number of degrees of freedom. In the situation described above, once the probability P is known, there is only one free variable needed to describe a candidate rule. For instance, once the number of patterns from ω1 placed in the left category is known, all other values are determined uniquely. Hence in this case the number of degrees of freedom is 1. If there were more categories, or if the candidate decision rule had more possible outcomes, then df would be greater than 1. The higher the number of degrees of freedom, the higher must be the computed χ² to meet a desired level of significance. We denote the critical values as, for instance, χ²_{.01}(1) = 6.64, where the subscript denotes the significance, here .01, and the integer in parentheses is the degrees of freedom. (In the Table, we conform to the usage in statistics, where this positive integer is denoted df, despite the possible confusion with calculus, where it denotes an infinitesimal real number.) Thus if we have one degree of freedom, and the observed χ² is greater than 6.64, then we can reject the null hypothesis, and say that, at the .01 confidence level, our results did not come from a (weighted) random decision.
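A short sketch of Eq. 117 (the counts and the probability P are hypothetical): compute χ² from the observed and expected "left" counts and compare it with the df = 1 critical value given in Table A.2.

```python
def chi_squared_left(n_left, n_total, p_left):
    """Chi-squared statistic (following Eq. 117) comparing a candidate rule's
    'left' counts with those expected under a random rule that goes left with
    probability P.

    n_left[i]  : patterns from category i that the candidate rule put on the left
    n_total[i] : total number of patterns in category i
    """
    chi2 = 0.0
    for n_iL, n_i in zip(n_left, n_total):
        n_ie = p_left * n_i                  # expected left count under H0
        chi2 += (n_iL - n_ie) ** 2 / n_ie
    return chi2

# Hypothetical example: 100 patterns per category, random rule with P = 0.5.
chi2 = chi_squared_left(n_left=[70, 35], n_total=[100, 100], p_left=0.5)
critical_01_df1 = 6.64   # from Table A.2, df = 1, significance .01
print(chi2, "-> informative" if chi2 > critical_01_df1 else "-> not significant")
```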
Table A.2: Critical values of chi-square (at two confidence levels) for different degrees of freedom (df).

df   .05    .01   |  df   .05    .01   |  df   .05    .01
 1   3.84   6.64  |  11  19.68  24.72  |  21  32.67  38.93
 2   5.99   9.21  |  12  21.03  26.22  |  22  33.92  40.29
 3   7.82  11.34  |  13  22.36  27.69  |  23  35.17  41.64
 4   9.49  13.28  |  14  23.68  29.14  |  24  36.42  42.98
 5  11.07  15.09  |  15  25.00  30.58  |  25  37.65  44.31
 6  12.59  16.81  |  16  26.30  32.00  |  26  38.88  45.64
 7  14.07  18.48  |  17  27.59  33.41  |  27  40.11  46.96
 8  15.51  20.09  |  18  28.87  34.80  |  28  41.34  48.28
 9  16.92  21.67  |  19  30.14  36.19  |  29  42.56  49.59
10  18.31  23.21  |  20  31.41  37.57  |  30  43.77  50.89

A.7 Information theory

A.7.1 Entropy and information

Assume we have a discrete set of symbols {v1, v2, ..., vm} with associated probabilities Pi. The entropy of the discrete distribution — a measure of the randomness or unpredictability of a sequence of symbols drawn from it — is

H = -\sum_{i=1}^{m} P_i \log_2 P_i, \qquad (118)

where, since we use the logarithm base 2, entropy is measured in bits. In case any of the probabilities vanish, we use the relation 0 log 0 = 0. One bit corresponds to the uncertainty that can be resolved by the answer to a single yes/no question. (For continuous distributions, we often use the logarithm base e, denoted ln, in which case the unit is the nat.) The expectation operator (cf. Eq. 41) can be used to write H = E[log 1/P], where we think of P as being a random variable whose possible values are P1, P2, ..., Pm. The term log₂ 1/P is sometimes called the surprise — if Pi = 0 except for one i, then there is no surprise when the corresponding symbol occurs.

Note that the entropy does not depend on the symbols themselves, just on their probabilities. For a given number of symbols m, the uniform distribution, in which each symbol is equally likely, is the maximum entropy distribution (and H = log₂ m bits) — we have the maximum uncertainty about the identity of each symbol that will be chosen. Clearly, if x is equally likely to take on the integer values 0, 1, ..., 7, we need 3 bits to describe the outcome, and H = log₂ 2³ = 3. Conversely, if all the Pi are 0 except one, we have the minimum entropy distribution (H = 0 bits) — we are certain as to the symbol that will appear.

For a continuous distribution, the entropy is

H = -\int_{-\infty}^{\infty} p(x)\,\ln p(x)\,dx, \qquad (119)

and again H = E[ln 1/p]. It is worth mentioning that among all continuous density functions having a given mean µ and variance σ², it is the Gaussian that has the maximum entropy (H = 1/2 + log₂(√(2π) σ) bits). We can let σ approach zero to find that a probability density in the form of a Dirac delta function, i.e.,

\delta(x - a) = \begin{cases} \infty & \text{if } x = a \\ 0 & \text{if } x \neq a, \end{cases} \qquad \text{with} \qquad \int_{-\infty}^{\infty} \delta(x)\,dx = 1, \qquad (120)

has the minimum entropy (H = −∞ bits). For a Dirac function, we are sure that the value a will be selected each time.
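A small sketch of Eq. 118 (the helper name and example distributions are ours); it reproduces the uniform-distribution and single-symbol cases just described.

```python
from math import log2

def entropy_bits(probabilities):
    """Entropy H = -sum_i P_i log2 P_i of a discrete distribution, in bits.
    Uses the convention 0 log 0 = 0 for vanishing probabilities."""
    return -sum(p * log2(p) for p in probabilities if p > 0)

print(entropy_bits([1/8] * 8))          # uniform over 8 symbols -> 3.0 bits
print(entropy_bits([1, 0, 0, 0]))       # a certain outcome      -> 0.0 bits
print(entropy_bits([0.5, 0.25, 0.25]))  # in between             -> 1.5 bits
```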
Our use of entropy in continuous functions, such as in Eq. 119, belies some subtle issues which are worth pointing out. If x had units, such as meters, then the probability density p(x) would have to have units of 1/x. There would be something fundamentally wrong in taking the logarithm of p(x) — the argument of the logarithm function should be dimensionless. What we should really be dealing with is a dimensionless quantity, say p(x)/p0(x), where p0(x) is some reference density function (cf. Sect. A.7.2).

For a discrete variable x and arbitrary function f(·), we have H(f(x)) ≤ H(x), i.e., processing never increases entropy. For instance, if f(x) = const, the entropy will vanish. Another key property of the entropy of a discrete distribution is that it is invariant to "shuffling" the event labels. The related question with continuous variables concerns what happens when one makes a change of variables. In general, if we make a change of variables, such as y = x³ or even y = 10x, we will get a different value for the integral of q(y) log q(y) dy, where q is the induced density for y. If entropy is supposed to measure the intrinsic disorganization, it doesn't make sense that y would have a different amount of intrinsic disorganization than x, since one is always derivable from the other; only if there were some randomness (e.g., shuffling) incorporated into the mapping could we say that one is more disorganized than the other. Fortunately, in practice these concerns do not present important stumbling blocks, since relative entropy and differences in entropy are more fundamental than H taken by itself. Nevertheless, questions of the foundations of entropy measures for continuous variables are addressed in books listed in the Bibliographical Remarks.

A.7.2 Relative entropy

Suppose we have two discrete distributions over the same variable x, p(x) and q(x). The relative entropy or Kullback-Leibler distance (which is closely related to cross entropy, information divergence and information for discrimination) is a measure of the "distance" between these distributions:

D_{KL}(p(x), q(x)) = \sum_{x} q(x)\,\ln\frac{q(x)}{p(x)}. \qquad (121)

The continuous version is

D_{KL}(p(x), q(x)) = \int_{-\infty}^{\infty} q(x)\,\ln\frac{q(x)}{p(x)}\,dx. \qquad (122)

Although D_{KL}(p(·), q(·)) ≥ 0, and D_{KL}(p(·), q(·)) = 0 if and only if p(·) = q(·), the relative entropy is not a true metric, since D_{KL} is not necessarily symmetric in the interchange p ↔ q, and furthermore the triangle inequality need not be satisfied.

A.7.3 Mutual information

Now suppose we have two distributions over possibly different variables, e.g., p(x) and q(y). The mutual information is the reduction in uncertainty about one variable due to knowledge of the other variable:

I(p; q) = H(p) - H(p|q) = \sum_{x,y} r(x, y)\,\log\frac{r(x, y)}{p(x)\,q(y)}, \qquad (123)

where r(x, y) is the joint distribution of finding value x and y. Mutual information is simply the relative entropy between the joint distribution r(x, y) and the product distribution p(x)q(y), and as such it measures how much the distributions of the variables differ from statistical independence.

Mutual information does not obey all the properties of a metric. In particular, the metric requirement that if p(x) = q(y) then I(x; y) = 0 need not hold in general. As an example, suppose we have two binary random variables with r(0, 0) = r(1, 1) = 1/2, so r(0, 1) = r(1, 0) = 0. According to Eq. 123, the mutual information between p(x) and q(y) is log 2 = 1.

The relationships among the entropy, relative entropy and mutual information are summarized in Fig. A.5. The figure shows, for instance, that the joint entropy H(p, q) is always larger than the individual entropies H(p) and H(q); that H(p) = H(p|q) + I(p; q); and so on.

Figure A.5: The mathematical relationships among the entropy of distributions p and q, mutual information I(p; q), and conditional entropies H(p|q) and H(q|p).

From this figure one can quickly see relationships among the information functions. For instance, we can see immediately that I(p; p) = H(p); that if I(p; q) = 0 then H(q|p) = H(q); that H(p, q) = H(p|q) + H(q); and so forth.
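The following sketch implements Eqs. 121 and 123 for discrete distributions (the helper names are ours) and checks the binary example above, where the mutual information comes out to one bit.

```python
from math import log, log2

def kl_divergence(p, q):
    """Relative entropy D_KL(p, q) = sum_x q(x) ln q(x)/p(x), following the
    argument order of Eq. 121 (result in nats). Terms with q(x) = 0 contribute
    nothing; it is assumed that p(x) > 0 wherever q(x) > 0."""
    return sum(qx * log(qx / px) for px, qx in zip(p, q) if qx > 0)

def mutual_information(r):
    """I = sum_{x,y} r(x,y) log2 [ r(x,y) / (p(x) q(y)) ], in bits,
    for a joint probability table r given as a list of rows."""
    p = [sum(row) for row in r]            # marginal over x
    q = [sum(col) for col in zip(*r)]      # marginal over y
    return sum(r[i][j] * log2(r[i][j] / (p[i] * q[j]))
               for i in range(len(r)) for j in range(len(r[i])) if r[i][j] > 0)

# Binary example from the text: r(0,0) = r(1,1) = 1/2, r(0,1) = r(1,0) = 0.
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))   # -> 1.0 bit
print(kl_divergence([0.5, 0.5], [0.9, 0.1]))          # asymmetric "distance"
```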
A.8 Computational complexity

In order to analyze and describe the difficulty of problems and the algorithms designed to solve such problems, we turn now to the technical notion of computational complexity. For instance, calculating the covariance matrix for a set of samples is somehow "harder" than calculating the mean. Furthermore, some algorithms for computing some function may be faster, or take less memory, than others. We seek to specify such differences, independent of the current computer hardware (which is always changing anyway). To this end we use the concept of the order of a function and the asymptotic notations "big oh," "big omega," and "big theta." The three asymptotic bounds most often used are:

Asymptotic upper bound: O(g(x)) = {f(x): there exist positive constants c and x0 such that 0 ≤ f(x) ≤ c g(x) for all x ≥ x0}.

Asymptotic lower bound: Ω(g(x)) = {f(x): there exist positive constants c and x0 such that 0 ≤ c g(x) ≤ f(x) for all x ≥ x0}.

Asymptotically tight bound: Θ(g(x)) = {f(x): there exist positive constants c1, c2, and x0 such that 0 ≤ c1 g(x) ≤ f(x) ≤ c2 g(x) for all x ≥ x0}.

Figure A.6: Three types of asymptotic bounds: a) f(x) = O(g(x)); b) f(x) = Ω(g(x)); c) f(x) = Θ(g(x)).

Consider the asymptotic upper bound. We say that f(x) is "of order big oh of g(x)" (written f(x) = O(g(x))) if there exist constants c0 and x0 such that f(x) ≤ c0 g(x) for all x > x0. We shall assume that all our functions are positive and dispense with taking absolute values. This means simply that for sufficiently large x, an upper bound on f(x) grows no worse than g(x). For instance, if f(x) = a + bx + cx², then f(x) = O(x²) because for sufficiently large x, the constant, linear and quadratic terms can be "overcome" by proper choice of c0 and x0. The generalization to functions of two or more variables is straightforward. It should be clear that by the definition above, the (big oh) order of a function is not unique: for instance, we can describe our particular f(x) as being O(x²), O(x³), O(x⁴), O(x² ln x), and so forth. We use big omega notation, Ω(·), for lower bounds, and little omega, ω(·), for the tightest lower bound. Of these, the big oh notation has proven to be the most useful, since we generally want an upper bound on the resources needed when solving a problem.

The lower bound on the complexity of the problem is denoted Ω(g(x)), and is therefore the lower bound on any algorithm that solves that problem. Similarly, if the complexity of an algorithm is O(g(x)), it is an upper bound on the complexity of the problem it solves. The complexity of some problems — such as computing the mean of a discrete set — is known, and thus once we have found an algorithm having equal complexity, the only possible improvement could be in lowering the constants of proportionality. The complexity of other problems — such as inverting a matrix — is not yet known, and if fundamental analysis cannot derive it, we must rely on algorithm developers who find ever better algorithms.
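To see why computing the covariance matrix is "harder" than computing the mean, the sketch below counts elementary operations for both on n d-dimensional vectors; the counting scheme is a rough illustration of O(nd) versus O(nd²) growth, not a formal cost model, and the helper names are ours.

```python
import numpy as np

def mean_and_cost(X):
    """Sample mean of n d-dimensional vectors, with a rough operation count."""
    n, d = X.shape
    mean = X.sum(axis=0) / n
    ops = n * d + d                # ~nd additions plus d divisions: O(nd)
    return mean, ops

def covariance_and_cost(X):
    """Sample covariance matrix, with a rough operation count."""
    n, d = X.shape
    mu, _ = mean_and_cost(X)
    Z = X - mu
    cov = Z.T @ Z / (n - 1)
    ops = n * d + n * d * d        # centering plus the d x d accumulations: O(nd^2)
    return cov, ops

rng = np.random.default_rng(0)
for d in (2, 8, 32):
    X = rng.normal(size=(1000, d))
    _, m_ops = mean_and_cost(X)
    _, c_ops = covariance_and_cost(X)
    print(f"d = {d:3d}: mean ~{m_ops} ops, covariance ~{c_ops} ops")
```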
Such a rough analysis does not tell us the constants c and x0. For a finite-size problem it is possible that a particular O(x³) algorithm is simpler than a particular O(x²) algorithm, and it is occasionally necessary for us to determine these constants to find which of several implementations is the simplest. Nevertheless, for our purposes the big oh notation as just described is generally the best way to describe the computational complexity of an algorithm.

Suppose we have a set of n vectors, each of which is d-dimensional, and we want to calculate the mean vector. Clearly, this requires O(nd) operations. Sometimes we stress space and time complexities, which are particularly relevant when contemplating parallel hardware implementations. For instance, the d-dimensional sample mean could be calculated with d separate processors, each adding n sample values. Thus we can describe this implementation as O(d) in space (i.e., the amount of memory or possibly the number of processors) and O(n) in time (i.e., the number of sequential steps). Of course for any particular algorithm there may be a number of time-space tradeoffs.

Bibliographical Remarks

There are several good books on linear systems, such as [14], and matrix computations [8]. Lagrange optimization and related techniques are covered in the definitive book [2]. While [13] and [3] are of foundational and historic interest, readers seeking clear presentations of the central ideas in probability should consult [10, 7, 6, 21]. A handy reference to terms in probability and statistics is [20]. A number of books cover hypothesis testing and statistical significance, some elementary, such as [24], and others more advanced [18, 25]. Shannon's foundational paper [22] should be read by all students of pattern recognition; it, and many other historically important papers on information theory, can be found in [23]. An excellent textbook at the level of this one is [5], and readers seeking a more abstract and formal treatment should consult [9]. The study of time complexity of algorithms began with [12], and space complexity with [11, 19]. The multivolume [15, 16, 17] contains a description of computational complexity, the big oh and other asymptotic notations. Somewhat more accessible treatments can be found in [4] and [1].

Bibliography

[1] Alfred V. Aho, John E. Hopcroft, and Jeffrey D. Ullman. The Design and Analysis of Computer Algorithms. Addison-Wesley, Reading, MA, 1974.
[2] Dimitri P. Bertsekas. Constrained Optimization and Lagrange Multiplier Methods. Athena Scientific, Belmont, MA, 1996.
[3] Patrick Billingsley. Probability and Measure. Wiley, New York, NY, second edition, 1986.
[4] Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest. Introduction to Algorithms. MIT Press, Cambridge, MA, 1990.
[5] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley Interscience, New York, NY, 1991.
[6] Alvin W. Drake. Fundamentals of Applied Probability Theory. McGraw-Hill, New York, NY, 1967.
[7] William Feller. An Introduction to Probability Theory and Its Applications, volume 1. Wiley, New York, NY, 1968.
[8] Gene H. Golub and Charles F. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, MD, third edition, 1996.
[9] Robert M. Gray. Entropy and Information Theory. Springer-Verlag, New York, NY, 1990.
[10] Richard W. Hamming. The Art of Probability for Scientists and Engineers. Addison-Wesley, New York, NY, 1991.
[11] Juris Hartmanis, Philip M. Lewis II, and Richard E. Stearns. Hierarchies of memory limited computations. Proceedings of the Sixth Annual IEEE Symposium on Switching Circuit Theory and Logical Design, pages 179–190, 1965.
[12] Juris Hartmanis and Richard E. Stearns. On the computational complexity of algorithms. Transactions of the American Mathematical Society, 117:285–306, 1965.
[13] Harold Jeffreys. Theory of Probability. Oxford University Press, Oxford, UK, 1961; reprint of the 1939 edition.
[14] Thomas Kailath. Linear Systems. Prentice-Hall, Englewood Cliffs, NJ, 1980.
[15] Donald E. Knuth. The Art of Computer Programming. Addison-Wesley, Reading, MA, 1973.
[16] Donald E. Knuth. The Art of Computer Programming. Addison-Wesley, Reading, MA, 1973.
[17] Donald E. Knuth. The Art of Computer Programming. Addison-Wesley, Reading, MA, 1981.
[18] Erich L. Lehmann. Testing Statistical Hypotheses. Springer, New York, NY, 1997.
[19] Philip M. Lewis II, Richard E. Stearns, and Juris Hartmanis. Memory bounds for recognition of context-free and context-sensitive languages. Proceedings of the Sixth Annual IEEE Symposium on Switching Circuit Theory and Logical Design, pages 191–202, 1965.
[20] Francis H. C. Marriott. A Dictionary of Statistical Terms. Longman Scientific & Technical, Essex, UK, fifth edition, 1990.
[21] Yuri A. Rozanov. Probability Theory: A Concise Course. Dover, New York, NY, 1969.
[22] Claude E. Shannon. A mathematical theory of communication. Bell Systems Technical Journal, 27:379–423, 623–656, 1948.
[23] David Slepian, editor. Key Papers in the Development of Information Theory. IEEE Press, New York, NY, 1974.
[24] Richard C. Sprinthall. Basic Statistical Analysis. Allyn & Bacon, Needham Heights, MA, fifth edition, 1996.
[25] Rand R. Wilcox. Introduction to Robust Estimation and Hypotheses Testing. Academic Press, New York, NY, 1997.