Jürgen Beyerer, Matthias Richter, Matthias Nagel
Pattern Recognition
De Gruyter Graduate

Also of Interest
Dynamic Fuzzy Machine Learning. L. Li, L. Zhang, Z. Zhang, 2018. ISBN 978-3-11-051870-2, e-ISBN 978-3-11-052065-1, e-ISBN (EPUB) 978-3-11-051875-7, Set-ISBN 978-3-11-052066-8
Lie Group Machine Learning. F. Li, L. Zhang, Z. Zhang, 2019. ISBN 978-3-11-050068-4, e-ISBN 978-3-11-049950-6, e-ISBN (EPUB) 978-3-11-049807-3, Set-ISBN 978-3-11-049955-1
Complex Behavior in Evolutionary Robotics. L. König, 2015. ISBN 978-3-11-040854-6, e-ISBN 978-3-11-040855-3, e-ISBN (EPUB) 978-3-11-040918-5, Set-ISBN 978-3-11-040917-8
Pattern Recognition on Oriented Matroids. A. O. Matveev, 2017. ISBN 978-3-11-053071-1, e-ISBN 978-3-11-048106-8, e-ISBN (EPUB) 978-3-11-048030-6, Set-ISBN 978-3-11-053115-2
Graphs for Pattern Recognition. D. Gainanov, 2016. ISBN 978-3-11-048013-9, e-ISBN 978-3-11-052065-1, e-ISBN (EPUB) 978-3-11-051875-7, Set-ISBN 978-3-11-048107-5

Authors
Prof. Dr.-Ing. habil. Jürgen Beyerer
Fraunhofer Institute of Optronics, System Technologies and Image Exploitation IOSB, Fraunhoferstr., 76131 Karlsruhe, juergen.beyerer@iosb.fraunhofer.de
and Institute of Anthropomatics and Robotics, Chair IES, Karlsruhe Institute of Technology, Adenauerring, 76131 Karlsruhe

Matthias Richter
Institute of Anthropomatics and Robotics, Chair IES, Karlsruhe Institute of Technology, Adenauerring, 76131 Karlsruhe, matthias.richter@kit.edu

Matthias Nagel
Institute of Theoretical Informatics, Cryptography and IT Security, Karlsruhe Institute of Technology, Am Fasanengarten, 76131 Karlsruhe, matthias.nagel@kit.edu

ISBN 978-3-11-053793-2
e-ISBN (PDF) 978-3-11-053794-9
e-ISBN (EPUB) 978-3-11-053796-3

Library of Congress Cataloging-in-Publication Data: A CIP catalog record for this book has been applied for at the Library of Congress.

Bibliographic information published by the Deutsche Nationalbibliothek: The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet at http://dnb.dnb.de.

© 2018 Walter de Gruyter GmbH, Berlin/Boston
Cover image: Top Photo Corporation/Top Photo Group/thinkstock
www.degruyter.com

Preface
PATTERN RECOGNITION ⊂ MACHINE LEARNING ⊂ ARTIFICIAL INTELLIGENCE: this relation could give the impression that pattern recognition is only a tiny, very specialized topic. That, however, is misleading. Pattern recognition is a very important field of machine learning and artificial intelligence with its own rich structure and many interesting principles and challenges.

For humans, and also for animals, the natural ability to recognize patterns is essential for navigating the physical world, which they perceive with their naturally given senses. Pattern recognition here performs an important abstraction from sensory signals to categories: on the most basic level, it enables the classification of objects into "eatable" or "not eatable" or, e.g., into "friend" or "foe." These categories (or, synonymously, classes) do not always have a tangible character. Examples of non-material classes are, e.g., "secure situation" or "dangerous situation." Such classes may even shift depending on the context, for example, when deciding whether an action is socially acceptable or not. Therefore, everybody is very much acquainted, at least at an intuitive level, with what pattern recognition means to our daily life. This fact is surely one reason why pattern recognition as a technical subdiscipline is a source of so much inspiration for scientists and engineers.
In order to implement pattern recognition capabilities in technical systems, it is necessary to formalize it in such a way that the designer of a pattern recognition system can systematically engineer the algorithms and devices necessary for a technical realization.

This textbook summarizes a lecture course about pattern recognition that one of the authors (Jürgen Beyerer) has been giving for students of technical and natural sciences at the Karlsruhe Institute of Technology (KIT) since 2005. The aim of this book is to introduce the essential principles, concepts, and challenges of pattern recognition in a comprehensive and illuminating presentation. We will try to explain all aspects of pattern recognition in an understandable, self-contained fashion. Facts are explained with a sufficiently deep mathematical treatment, but without going into the very last technical details of a mathematical proof. The given explanations will help readers to understand the essential ideas and to comprehend their interrelations. Above all, readers will gain the big picture that underlies all of pattern recognition.

The authors would like to thank their peers and colleagues for their support. Special thanks are owed to Dr. Ioana Gheța, who was very engaged during the early phases of the lecture "Pattern Recognition" at the KIT. She prepared most of the many slides and accompanied the course over many lecture periods. Thanks as well to Dr. Martin Grafmüller and to Dr. Miro Taphanel for supporting the lecture with great dedication. Moreover, many thanks to Prof. Michael Heizmann and Prof. Fernando Puente León for inspiring discussions, which have positively influenced the evolution of the lecture. Thanks to Christian Hermann and Lars Sommer for providing additional figures and examples of deep learning. Our gratitude also goes to our friends and colleagues Alexey Pak, Ankush Meshram, Chengchao Qu, Christian Hermann, Ding Luo, Julius Pfrommer, Julius Krause, Johannes Meyer, Lars Sommer, Mahsa Mohammadikaji, Mathias Anneken, Mathias Ziearth, Miro Taphanel, Patrick Philipp, and Zheng Li for providing valuable input and corrections during the preparation of this manuscript. Lastly, we thank De Gruyter for their support and collaboration in this project.

Karlsruhe, Summer 2017
Jürgen Beyerer
Matthias Richter
Matthias Nagel

Contents
Preface
List of Tables
List of Figures
Notation
Introduction
1 Fundamentals and definitions
1.1 Goals of pattern recognition
1.2 Structure of a pattern recognition system
1.3 Abstract view of pattern recognition
1.4 Design of a pattern recognition system
1.5 Exercises
2 Features
2.1 Types of features and their traits
2.1.1 Nominal scale
2.1.2 Ordinal scale
2.1.3 Interval scale
2.1.4 Ratio scale and absolute scale
2.2 Feature space inspection
2.2.1 Projections
2.2.2 Intersections and slices
2.3 Transformations of the feature space
2.4 Measurement of distances in the feature space
2.4.1 Basic definitions
2.4.2 Elementary norms and metrics
2.4.3 A metric for sets
2.4.4 Metrics on the ordinal scale
2.4.5 The Kullback–Leibler divergence
2.4.6 Tangential distance measure
2.5 Normalization
2.5.1 Alignment, elimination of physical dimension, and leveling of proportions
2.5.2 Lighting adjustment of images
2.5.3 Distortion adjustment of images
2.5.4 Dynamic time warping
2.6 Selection and construction of features
2.6.1 Descriptive features
2.6.2 Model-driven features
2.6.3 Construction of invariant features
2.7 Dimensionality reduction of the feature space
2.7.1 Principal component analysis
2.7.2 Kernelized principal component analysis
2.7.3 Independent component analysis
2.7.4 Multiple discriminant analysis
2.7.5 Dimensionality reduction by feature selection
2.7.6 Bag of words
2.8 Exercises
3 Bayesian decision theory
3.1 General considerations
3.2 The maximum a posteriori classifier
3.3 Bayesian classification
3.3.1 The Bayesian optimal classifier
3.3.2 Reference example: Optimal decision regions
3.3.3 The minimax classifier
3.3.4 Normally distributed features
3.3.5 Arbitrarily distributed features
3.4 Exercises
4 Parameter estimation
4.1 Maximum likelihood estimation
4.2 Bayesian estimation of the class-specific distributions
4.3 Bayesian parameter estimation
4.3.1 Least squared estimation error
4.3.2 Constant penalty for failures
4.4 Additional remarks on Bayesian classification
4.5 Exercises
5 Parameter free methods
5.1 The Parzen window method
5.2 The k-nearest neighbor method
5.3 k-nearest neighbor classification
5.4 Exercises
6 General considerations
6.1 Dimensionality of the feature space
6.2 Overfitting
6.3 Exercises

Fig. B.1 Tangential distance measure, reproduced from Figure 2.12.

C Random processes
The following will give a brief overview of random processes. This overview is by no means meant to be comprehensive, but should be sufficient to understand the concepts in Chapter 2.

A random process g is a random variable that is also a function of some arguments. Here, we will focus only on two-dimensional random processes over the real numbers, i.e., random functions of the form

g: ℝ² → ℝ, x ↦ g(x).

From one perspective, g is a random function that is evaluated at any point x ∈ ℝ²: depending on the realization of g, a different result will be obtained. Hence, g can be described by a probability distribution over the space of all possible functions g: ℝ² → ℝ that map from ℝ² to ℝ. In particular, one can find an expectation μ and a variance σ² for g,

μ(x) = E{g}(x) and σ²(x) = Var{g}(x).

Note that both μ and σ² are themselves functions. The covariance is a function of two points x, y ∈ ℝ²,

C(x, y) = E{(g(x) − μ(x)) (g(y) − μ(y))}.

A second interpretation considers g as a (deterministic) function that maps x to a random variable r_x on ℝ. From this perspective, there is an infinite (and uncountable) set {r_x | x ∈ ℝ²} of random variables. Assume that a probability density function p(g(x₁), …, g(x_k)) exists for each finite subset with k arbitrary points x₁, …, x_k ∈ ℝ². The corresponding distributions are called finite-dimensional marginal distributions (fidis). Moreover, we require that

E{g}(x) = E{g(x)}

for all x, and analogously for all other moments, in case they exist. Note the subtle mathematical difference: on the left of the equation, the stochastic moment of a random function is calculated first and then the resulting (non-random) function is evaluated at the point x; on the right, the function is evaluated at x to a random variable first, and then the stochastic moment of that random variable is calculated.

With these notations at hand it is possible to introduce two properties of stochastic processes.

Definition C.1 ((Strictly) stationary process (of order m)). Let g be a random process, let k ∈ ℕ be a finite dimension, let x₁, …, x_k ∈ ℝ² be arbitrary points, and let τ ∈ ℝ² be a translation vector. If

p(g(x₁), …, g(x_k)) = p(g(x₁ + τ), …, g(x_k + τ))

holds for all valid choices of k, x₁, …, x_k, and τ, the process g is called (strictly) stationary. This means that all fidis are invariant under translation. A stochastic process is called (strictly) stationary of order m if the above holds for all k ≤ m.

Definition C.2 (Homogeneity (of order m)). A stochastic process is called homogeneous of order m if

E{gⁿ}(x) = const for all x ∈ ℝ² and all n = 1, …, m.

This means the first m moments do not depend on the point x. Obviously, stationarity is much stronger than homogeneity. A stationary process is always homogeneous (up to the same order), but not vice versa.
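As a minimal numerical sketch of the ensemble view and of homogeneity (assuming NumPy and SciPy are available; the surrogate process, the helper name sample_realization, the chosen points, and all parameter values are illustrative choices, not constructions from the text), the expectation, variance, and covariance can be approximated by averaging over many independent realizations:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(seed=0)

def sample_realization(shape=(64, 64), sigma=2.0):
    # One realization of a surrogate 2D random process: Gaussian white
    # noise smoothed with a Gaussian filter. The smoothing introduces
    # spatial correlation; by construction the process is approximately
    # stationary with expectation 0.
    return gaussian_filter(rng.normal(size=shape), sigma=sigma)

# Ensemble ("random function") view: draw many realizations and evaluate
# each of them at the fixed points x, x', and y.
ensemble = np.stack([sample_realization() for _ in range(5000)])
x, x2, y = (32, 32), (16, 48), (32, 36)
g_x  = ensemble[:, x[0],  x[1]]    # samples of the random variable g(x)
g_x2 = ensemble[:, x2[0], x2[1]]   # samples of g(x')
g_y  = ensemble[:, y[0],  y[1]]    # samples of g(y)

mu_x,  var_x  = g_x.mean(),  g_x.var()    # estimates of E{g}(x),  Var{g}(x)
mu_x2, var_x2 = g_x2.mean(), g_x2.var()   # estimates of E{g}(x'), Var{g}(x')
cov_xy = np.mean((g_x - mu_x) * (g_y - g_y.mean()))  # estimate of C(x, y)

print(f"x : mu={mu_x:+.3f}  var={var_x:.4f}")
print(f"x': mu={mu_x2:+.3f}  var={var_x2:.4f}")  # similar to x (homogeneity)
print(f"cov(x, y) = {cov_xy:.4f}")
```

Because this surrogate process is homogeneous, the estimates at x and x' agree up to sampling error, and the covariance estimate depends only on the displacement between the two evaluation points.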
Definition C.3 ((Two-dimensional) weakly stationary process). Let g denote a two-dimensional random process, x, y ∈ ℝ² two points, and τ ∈ ℝ² a translation vector. The process is a weakly stationary process if, for all x, y, τ,

E{g}(x) = μ = const and Cov{g(x), g(y)} = Cov{g(x + τ), g(y + τ)}.

This means the expectation is constant for every point x and the covariance is also constant in the sense that its value only depends on the relative position of x and y, but is not affected by a translation. Especially for x = y, this implies that Var{g}(x) = σ² is constant for all x ∈ ℝ². The condition of weak stationarity is more restrictive than homogeneity of order two, but less restrictive than stationarity of order two. For a process to be homogeneous, it is only required that its expectation and variance be constant: this does not say anything about its covariance. In contrast, to be a stationary process of order two, it is required that all two-dimensional marginal distributions be identical. The latter is much stronger than having only identical covariances.

Definition C.4 (Expectation-free (two-dimensional) weakly stationary process). A two-dimensional weakly stationary process g is called expectation free if

E{g}(x) = 0 for all x ∈ ℝ².

Note that the term "expectation free" is a bit misleading: an expectation free random process is not free of having an expectation. It has an expectation: it is just equal to zero everywhere.

Definition C.5 ((Two-dimensional) white noise). A two-dimensional random process e_mn is white noise if it is weakly stationary and fulfills the additional requirements

E{e_mn} = 0 and Cov{e_mn, e_kl} = σ² δ_mk δ_nl.

Actually, both requirements already ensure that it is a weakly stationary process, but demand much more. Especially, the last requirement implies that any two states are uncorrelated with each other.

Lastly, we consider a certain assumption about random processes that makes reasoning about them easier in many circumstances: ergodicity. Informally, in an ergodic process, a reasonably large sample from that process is representative of the process as a whole. Formally, let E denote a probability space and let e ∈ E be an elementary event. Moreover, let g(x) = g(x, e) denote the realization of the random process g(x) with respect to the elementary event e.

Definition C.6 (Ergodic process). Let g be a stationary process and let μ(x) = E{g}(x) denote the expectation of g. This means that the expectation μ(x) = μ is constant for all x ∈ ℝ². The process g is said to be ergodic if, for all events e ∈ E and all y ∈ ℝ²,

lim_{A ↗ ℝ²} (1/|A|) ∫_A g(x, e) dx = E{g}(y) = μ,   (C.14)

where the limit is taken over an increasing family of regions A ⊂ ℝ² and |A| denotes the area of A. On the right side of Equation (C.14), one arbitrary point y is fixed and the average over all possible realizations g of g is calculated. On the left, one realization g(x) = g(x, e) is fixed and the average over all points x ∈ ℝ² is calculated. Hence, under the assumption that g is ergodic, one can determine the unknown expectation and variance of g by taking the average over all points of only one single realization.
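The ergodicity assumption can likewise be illustrated with a minimal numerical sketch (again assuming NumPy and SciPy; the surrogate process, the helper sample_realization, and all parameters are illustrative choices): the spatial average over a single realization, corresponding to the left-hand side of Equation (C.14), is compared with the ensemble average at one fixed point over many independent realizations, corresponding to the right-hand side.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(seed=1)

def sample_realization(shape=(256, 256), sigma=2.0, offset=5.0):
    # Surrogate, approximately stationary 2D process: smoothed Gaussian
    # white noise shifted by a constant offset, so that mu = offset.
    return offset + gaussian_filter(rng.normal(size=shape), sigma=sigma)

# Left-hand side of (C.14): average over all points of ONE realization.
g = sample_realization()
spatial_mean, spatial_var = g.mean(), g.var()

# Right-hand side: average over MANY realizations at one fixed point y.
samples_at_y = np.array([sample_realization()[128, 128] for _ in range(3000)])
ensemble_mean, ensemble_var = samples_at_y.mean(), samples_at_y.var()

print(f"spatial  average: mean={spatial_mean:.3f}, var={spatial_var:.4f}")
print(f"ensemble average: mean={ensemble_mean:.3f}, var={ensemble_var:.4f}")
# For an ergodic process both pairs of estimates agree up to sampling
# error, so a single sufficiently large realization suffices to estimate
# the expectation and the variance.
```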
Glossary
A posteriori distribution: The distribution of the classes with respect to a fixed feature.
A priori distribution: The distribution of the classes without knowledge of the features.
Absolute norm: A special type of Minkowski norm.
Absolute scale: Scale of measurement for counting quantities.
AR model: Autoregressive signal model.
Autoregressive signal model: Representation of a type of random process.
Bagging: Bootstrap aggregating.
Bayes' law: Fundamental result in probability theory.
Bayesian classifier: A special classifier that uses all the ingredients of the Bayesian framework and is optimal with respect to the risk.
Bias: In parameter estimation: error of an estimator that is not due to chance.
Binary classifier: A classifier that only decides between two classes.
Boosting: Meta-method that combines several weak classifiers into one strong classifier.
Bootstrap aggregating: Ensemble method in which several classifiers are trained on random subsets of the same training data.
Central limit theorem: A central theorem in probability theory.
Chebyshev norm: A special type of Minkowski norm.
Class: A subset of the world grouping similar objects.
Class-specific feature distribution: The distribution of the features given a class.
Classifier: A (mathematical) method of assigning an object to an equivalence class based on features.
Conditional distribution: The distribution of a random quantity if another quantity from a joint probability space is kept fixed.
Confusion matrix: Table that compares the ground truth with the classifier's prediction on a validation set.
Consistent estimator: An estimator that converges almost surely to the true value.
Convolutional neural network: Type of deep learning architecture suitable for multidimensional data with repeating local structure.
Cost function: A function that describes the costs of assigning a class with respect to the true class.
CR-efficient estimator: A special estimator that has minimum variance.
Cramér–Rao bound: A lower bound on the variance of an unbiased estimator.
CRB: Cramér–Rao bound.
Cross-validation: Technique to estimate the performance of a classifier with a small data set.
Dataset: The set of all objects that were collected to define, validate, and test a pattern recognition system.
Decision boundary: The boundary of a decision region. The entirety of the boundaries is an equivalent description of the classifier.
Decision function: A function that maps a feature vector to one component of the decision space.
Decision region: A partition in the feature space.
Decision space: An intermediate space to unify the mathematical description of the classes.
Decision tree: Tree-structured classifier where the inner nodes correspond to tests, the edges correspond to the outcomes of the tests, and the leaf nodes govern the class decision.
Decision vector: The vector of decision functions of all classes.
Dirac sequence: A sequence of probability distributions that converges to the Dirac distribution.
Discrepancy: A function that quantifies the similarity between two (mathematical) objects but lacks some properties of a metric.
Distance function: Usually a synonym for metric (usage may vary depending on context).
Distribution: Mathematical object that encapsulates the properties of random variables.
Divergence: A discrepancy between probability distributions.
EM: Expectation maximization.
Emission probability: In hidden Markov models: probability of seeing an observable given a chain of states.
Empirical operation: Mathematical operation that corresponds to an experiment, e.g., addition of the masses of two objects by putting both on a scale at the same time.
Empirical relation: Mathematical relations that emerge from experiments, e.g., by comparing the weight of two objects.
Empirical risk minimization: From statistical learning theory: minimization of the average loss on a training set.
Empiricism: Philosophy of science that emphasizes evidence and experiments.
Entropy measure: Impurity measure corresponding to the entropy of the empirical class distribution of that data set.
Estimator: A measurable function from the space of all finite datasets into the parameter space of a parametric distribution assumption.
Euclidean norm: A special type of Minkowski norm.
Expectation maximization: Iterative technique to maximize the likelihood function of an estimator.
Fall-out rate: False-positive rate.
False alarm: The event that a binary classifier incorrectly decides for "positive" although the sample is negative.
False-negative rate: The probability that a binary classifier decides on "negative" although the sample actually belongs to the positive class.
False-positive rate: The probability that a binary classifier decides on "positive" although the sample actually belongs to the negative class.
Feature: A mathematical quantity that describes the characteristics of an object.
Feature space: The set of all possible features.
Fisher information: Variance of the score.
Gaussian mixture: A random variable whose density is a convex combination of Gaussian densities.
Generalization: Ability of a classifier to perform well on unseen data.
Gini impurity: Impurity measure corresponding to the expected error probability of random class assignment on that data set.
Hidden Markov model: Markov model where the states and state transitions are hidden and can only be inferred from observations.
HMM: Hidden Markov model.
Homogeneous process: A random process whose moments do not depend on the point of evaluation.
Hyper parameters: Parameters that govern a classifier but are not estimated from the training set.
Impurity measure: Measure that assesses the class distribution in a data set.
Interval scale: Scale of measurement for measuring intervals but lacking a natural zero.
Joint distribution: The distribution of several random quantities in a joint probability space.
k-nearest neighbor method: A parameter-free technique to define a density given a number of finite samples. See also Parzen window method.
Kullback–Leibler divergence: Measure (but not a metric) of the difference between probability distributions.
Leave-one-out cross-validation: Cross-validation where only one sample is used for evaluation, and the rest are used to train the classifier.
Likelihood function: A function of the parameters of a statistical model for a given data set.
Likelihood ratio: The ratio of two likelihood functions with different models. Used in hypothesis testing.
Linear discriminant: A basic classifier that draws hyperplanes between classes in the feature space.
Log-likelihood function: The logarithm of the likelihood function.
Long short-term memory: Type of deep learning architecture suitable for sequential data.
Mahalanobis norm: Norm of a vector with respect to some positive definite matrix.
Manhattan metric: Metric deduced from the absolute norm; also: taxicab metric.
MAP classifier: Maximum a posteriori classifier.
Marginal distribution: The projection of a joint distribution onto one of the axes.
Markov model: Probabilistic model of states and transitions between states with certain restrictions.
Maximum a posteriori classifier: A classifier that decides on the class with the highest a posteriori probability with respect to a given feature.
Maximum norm: A special type of Minkowski norm.
Maximum-likelihood estimator: An estimator that chooses the parameter that makes the given observation most likely under the model.
Mean squared error: Mean of the squared deviations of an estimator from the target variable.
Median: The middle entry in a sorted list of items.
Metric: A function that defines a distance.
Metric space: A set with a distance measure.
Minimax classifier: A special type of classifier that estimates the class such that the maximal risk with respect to any a priori distribution is minimized. See also classifier.
Minkowski norm: A parametrized norm for real vector spaces.
Misclassification measure: Impurity measure corresponding to the empirical error probability of the dominant class in that data set.
ML estimator: Maximum-likelihood estimator.
Mode: In statistics: the global maximum of a probability mass or probability density, i.e., the most probable value.
Nearest neighbor classifier: A classifier that assigns an object the same class as the nearest (in the feature space) sample of the training set.
Nominal scale: Scale of measurement made up of labels.
Norm: Function to measure the length of a vector.
Ordinal scale: Scale of measurement with an ordering.
Overfitting: Phenomenon where a classifier performs well on the training set, but very poorly on unseen data.
Parameter space: The (vector) space of all quantities that define a classifier.
Parameter vector: A point in the parameter space.
Parzen window method: A parameter-free technique to define a density given a number of finite samples. See also k-nearest neighbor method.
Pattern: The raw data from a sensor.
Pattern space: The set of all possible patterns.
PCA: Principal component analysis.
Permutation metric: A metric for features on the ordinal scale.
Principal component analysis: A method for finding a lower-dimensional subspace such that the projection of the dataset has a minimal squared reconstruction error.
Probability simplex: A subset in the decision space.
Quantile: Summary statistic to describe location within an ordered sample.
Random forest: An ensemble of decision trees.
Random process: Mathematical description of a (time-ordered) series of random events.
Ratio scale: Scale of measurement for measuring ratios.
Recall: The event that a binary classifier correctly decides for "positive." The probability of a recall is the true-positive rate.
Receiver operating characteristics: Plot of the fall-out rate against the sensitivity of a binary classifier.
Rectified linear unit: Activation function used in deep learning.
Risk: The expected cost of the decisions of a classifier. See also cost function.
ROC: Receiver operating characteristic.
Rubber-sheeting: Distortion of a surface to allow seamless joins.
Scale of measurement: Defines certain types of variables and permissible operations on the variables of a given type.
Score: In statistics: measure of how much a parameter influences the density of a random variable.
Sensitivity: True-positive rate.
Slack: The event that a binary classifier incorrectly decides for "negative" although the sample is positive.
Slack variable: In SVMs: variables associated with the training samples to measure the violation of the maximum margin constraint.
Specificity: True-negative rate.
State transition probability: In Markov models: probability to switch between states.
Stationary process: A random process that does not change the joint distribution of a derived time series when shifted in time.
Stochastic gradient descent: Randomized version of the gradient descent optimization algorithm.
Structural risk minimization: From statistical learning theory: joint minimization of the average loss on a training set and the model complexity.
Supervised learning: Learning when the classes of the training samples are known, e.g., classification.
Support vector machine: A linear classifier that maximizes the margin between the decision boundary and the training samples.
SVM: Support vector machine.
Target vector: A unit vector in the decision space and a corner of the probability simplex.
Taxicab metric: Metric deduced from the absolute norm; also: Manhattan metric.
Test set: A special subset of the dataset that is used to test the performance of a classifier.
Training set: A special subset of the dataset that is used to define the parameters of a classifier.
True-negative rate: The probability that a binary classifier decides on "negative" if the sample actually belongs to the negative class.
True-positive rate: The probability that a binary classifier decides on "positive" if the sample actually belongs to the positive class.
Unbiased estimator: A special estimator whose expectation value equals the parameter being estimated, if considered as a random variable on its own.
Unbiasedness: See unbiased estimator.
Unsupervised learning: Learning when the classes of the training samples are not known or not needed, e.g., clustering, density estimation, etc.
Validation set: A special subset of the dataset that is used to define the design parameters of a classifier.
Vapnik–Chervonenkis dimension: Measure of the complexity of a given family of classifiers.
Weak classifier: A classifier that performs only marginally better than random guessing.
Weakly stationary process: A random process whose expectation and covariance are constant at every point.
Window function: A function that is nonzero only in some interval, often used to assign a weight according to some distance, e.g., in the Parzen window method.
Endnotes
See Appendix C for an explanation of the terms "weakly stationary", "white noise", etc.
Maximum a posteriori classification according to Equation (3.23) under the assumption that the features are statistically independent.
Not to be confused with the k-nearest neighbor classifier.