Scaling Up Machine Learning: Parallel and Distributed Approaches

This book comprises a collection of representative approaches for scaling up machine learning and data mining methods on parallel and distributed computing platforms. Demand for parallelizing learning algorithms is highly task-specific: in some settings it is driven by the enormous dataset sizes, in others by model complexity or by real-time performance requirements. Making task-appropriate algorithm and platform choices for large-scale machine learning requires understanding the benefits, trade-offs, and constraints of the available options. Solutions presented in the book cover a range of parallelization platforms from FPGAs and GPUs to multi-core systems and commodity clusters; concurrent programming frameworks that include CUDA, MPI, MapReduce, and DryadLINQ; and various learning settings: supervised, unsupervised, semi-supervised, and online learning. Extensive coverage of parallelization of boosted trees, support vector machines, spectral clustering, belief propagation, and other popular learning algorithms, accompanied by deep dives into several applications, makes the book equally useful for researchers, students, and practitioners.

Dr. Ron Bekkerman is a computer engineer and scientist whose experience spans disciplines from video processing to business intelligence. Currently a senior research scientist at LinkedIn, he previously worked for a number of major companies including Hewlett-Packard and Motorola. Ron's research interests lie primarily in the area of large-scale unsupervised learning. He is the corresponding author of several publications in top-tier venues, such as ICML, KDD, SIGIR, WWW, IJCAI, CVPR, EMNLP, and JMLR.

Dr. Mikhail Bilenko is a researcher in the Machine Learning Group at Microsoft Research. His research interests center on machine learning and data mining tasks that arise in the context of large behavioral and textual datasets. Mikhail's recent work has focused on learning algorithms that leverage user behavior to improve online advertising. His papers have been published in KDD, ICML, SIGIR, and WWW, among other venues, and he has received best paper awards from SIGIR and KDD.

Dr. John Langford is a computer scientist working as a senior researcher at Yahoo! Research.
Previously, he was affiliated with the Toyota Technological Institute and the IBM T. J. Watson Research Center. John's work has been published in conferences and journals including ICML, COLT, NIPS, UAI, KDD, JMLR, and MLJ. He received the Pat Goldberg Memorial Best Paper Award, as well as best paper awards from ACM EC and WSDM. He is also the author of the popular machine learning weblog, hunch.net.

Scaling Up Machine Learning: Parallel and Distributed Approaches
Edited by Ron Bekkerman, Mikhail Bilenko, and John Langford

Cambridge University Press
Cambridge, New York, Melbourne, Madrid, Cape Town, Singapore, São Paulo, Delhi, Tokyo, Mexico City
32 Avenue of the Americas, New York, NY 10013-2473, USA
www.cambridge.org
Information on this title: www.cambridge.org/9780521192248
© Cambridge University Press 2012

This publication is in copyright. Subject to statutory exception and to the provisions of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.

First published 2012. Printed in the United States of America.

A catalog record for this publication is available from the British Library.

Library of Congress Cataloging in Publication data:
Scaling up machine learning: parallel and distributed approaches / [edited by] Ron Bekkerman, Mikhail Bilenko, John Langford. p. cm. Includes index. ISBN 978-0-521-19224-8 (hardback). Machine learning. Data mining. Parallel algorithms. Parallel programs (Computer programs). I. Bekkerman, Ron. II. Bilenko, Mikhail. III. Langford, John. Q325.5.S28 2011. 006.3'1–dc23. 2011016323.

ISBN 978-0-521-19224-8 Hardback

Cambridge University Press has no responsibility for the persistence or accuracy of URLs for external or third-party Internet Web sites referred to in this publication and does not guarantee that any content on such Web sites is, or will remain, accurate or appropriate.

Contents

Contributors
Preface

1. Scaling Up Machine Learning: Introduction
   Ron Bekkerman, Mikhail Bilenko, and John Langford
   1.1 Machine Learning Basics
   1.2 Reasons for Scaling Up Machine Learning
   1.3 Key Concepts in Parallel and Distributed Computing
   1.4 Platform Choices and Trade-Offs
   1.5 Thinking about Performance
   1.6 Organization of the Book
   1.7 Bibliographic Notes
   References

Part One: Frameworks for Scaling Up Machine Learning

2. MapReduce and Its Application to Massively Parallel Learning of Decision Tree Ensembles
   Biswanath Panda, Joshua S. Herbach, Sugato Basu, and Roberto J. Bayardo
   2.1 Preliminaries
   2.2 Example of PLANET
   2.3 Technical Details
   2.4 Learning Ensembles
   2.5 Engineering Issues
   2.6 Experiments
   2.7 Related Work
   2.8 Conclusions
   Acknowledgments
   References

3. Large-Scale Machine Learning Using DryadLINQ
   Mihai Budiu, Dennis Fetterly, Michael Isard, Frank McSherry, and Yuan Yu
   3.1 Manipulating Datasets with LINQ
   3.2 k-Means in LINQ
   3.3 Running LINQ on a Cluster with DryadLINQ
   3.4 Lessons Learned
   References

4. IBM Parallel Machine Learning Toolbox
   Edwin Pednault, Elad Yom-Tov, and Amol Ghoting
   4.1 Data-Parallel Associative-Commutative Computation
   4.2 API and Control Layer
   4.3 API Extensions for Distributed-State Algorithms
   4.4 Control Layer Implementation and Optimizations
   4.5 Parallel Kernel k-Means
   4.6 Parallel Decision Tree
   4.7 Parallel Frequent Pattern Mining
   4.8 Summary
   References
5. Uniformly Fine-Grained Data-Parallel Computing for Machine Learning Algorithms
   Meichun Hsu, Ren Wu, and Bin Zhang
   5.1 Overview of a GP-GPU
   5.2 Uniformly Fine-Grained Data-Parallel Computing on a GPU
   5.3 The k-Means Clustering Algorithm
   5.4 The k-Means Regression Clustering Algorithm
   5.5 Implementations and Performance Comparisons
   5.6 Conclusions
   References

Part Two: Supervised and Unsupervised Learning Algorithms

6. PSVM: Parallel Support Vector Machines with Incomplete Cholesky Factorization
   Edward Y. Chang, Hongjie Bai, Kaihua Zhu, Hao Wang, Jian Li, and Zhihuan Qiu
   6.1 Interior Point Method with Incomplete Cholesky Factorization
   6.2 PSVM Algorithm
   6.3 Experiments
   6.4 Conclusion
   Acknowledgments
   References

7. Massive SVM Parallelization Using Hardware Accelerators
   Igor Durdanovic, Eric Cosatto, Hans Peter Graf, Srihari Cadambi, Venkata Jakkula, Srimat Chakradhar, and Abhinandan Majumdar
   7.1 Problem Formulation
   7.2 Implementation of the SMO Algorithm
   7.3 Micro Parallelization: Related Work
   7.4 Previous Parallelizations on Multicore Systems
   7.5 Micro Parallelization: Revisited
   7.6 Massively Parallel Hardware Accelerator
   7.7 Results
   7.8 Conclusion
   References

8. Large-Scale Learning to Rank Using Boosted Decision Trees
   Krysta M. Svore and Christopher J. C. Burges
   8.1 Related Work
   8.2 LambdaMART
   8.3 Approaches to Distributing LambdaMART
   8.4 Experiments
   8.5 Conclusions and Future Work
   8.6 Acknowledgments
   References

9. The Transform Regression Algorithm
   Ramesh Natarajan and Edwin Pednault
   9.1 Classification, Regression, and Loss Functions
   9.2 Background
   9.3 Motivation and Algorithm Description
   9.4 TReg Expansion: Initialization and Termination
   9.5 Model Accuracy Results
   9.6 Parallel Performance Results
   9.7 Summary
   References

10. Parallel Belief Propagation in Factor Graphs
    Joseph Gonzalez, Yucheng Low, and Carlos Guestrin
    10.1 Belief Propagation in Factor Graphs
    10.2 Shared Memory Parallel Belief Propagation
    10.3 Multicore Performance Comparison
    10.4 Parallel Belief Propagation on Clusters
    10.5 Conclusion
    Acknowledgments
    References

11. Distributed Gibbs Sampling for Latent Variable Models
    Arthur Asuncion, Padhraic Smyth, Max Welling, David Newman, Ian Porteous, and Scott Triglia
    11.1 Latent Variable Models
    11.2 Distributed Inference Algorithms
    11.3 Experimental Analysis of Distributed Topic Modeling
    11.4 Practical Guidelines for Implementation
    11.5 A Foray into Distributed Inference for Bayesian Networks
    11.6 Conclusion
    Acknowledgments
    References

12. Large-Scale Spectral Clustering with MapReduce and MPI
    Wen-Yen Chen, Yangqiu Song, Hongjie Bai, Chih-Jen Lin, and Edward Y. Chang
    12.1 Spectral Clustering
    12.2 Spectral Clustering Using a Sparse Similarity Matrix
    12.3 Parallel Spectral Clustering (PSC) Using a Sparse Similarity Matrix
    12.4 Experiments
    12.5 Conclusions
    References

13. Parallelizing Information-Theoretic Clustering Methods
    Ron Bekkerman and Martin Scholz
    13.1 Information-Theoretic Clustering
    13.2 Parallel Clustering
    13.3 Sequential Co-clustering
    13.4 The DataLoom Algorithm
    13.5 Implementation and Experimentation
    13.6 Conclusion
    References

Part Three: Alternative Learning Settings

14. Parallel Online Learning
    Daniel Hsu, Nikos Karampatziakis, John Langford, and Alex J. Smola
    14.1 Limits Due to Bandwidth and Latency
    14.2 Parallelization Strategies
    14.3 Delayed Update Analysis
    14.4 Parallel Learning Algorithms
    14.5 Global Update Rules
    14.6 Experiments
    14.7 Conclusion
    References

15. Parallel Graph-Based Semi-Supervised Learning
    Jeff Bilmes and Amarnag Subramanya
    15.1 Scaling SSL to Large Datasets
    15.2 Graph-Based SSL
    15.3 Dataset: A 120-Million-Node Graph
    15.4 Large-Scale Parallel Processing
    15.5 Discussion
    References

16. Distributed Transfer Learning via Cooperative Matrix Factorization
    Evan Xiang, Nathan Liu, and Qiang Yang
    16.1 Distributed Coalitional Learning
    16.2 Extension of DisCo to Classification Tasks
    16.3 Conclusion
    References

17. Parallel Large-Scale Feature Selection
    Jeremy Kubica, Sameer Singh, and Daria Sorokina
    17.1 Logistic Regression
    17.2 Feature Selection
    17.3 Parallelizing Feature Selection Algorithms
    17.4 Experimental Results
    17.5 Conclusions
    References

Part Four: Applications

18. Large-Scale Learning for Vision with GPUs
    Adam Coates, Rajat Raina, and Andrew Y. Ng
    18.1 A Standard Pipeline
    18.2 Introduction to GPUs
    18.3 A Standard Approach Scaled Up
    18.4 Feature Learning with Deep Belief Networks
    18.5 Conclusion
    References

19. Large-Scale FPGA-Based Convolutional Networks
    Clément Farabet, Yann LeCun, Koray Kavukcuoglu, Berin Martini, Polina Akselrod, Selcuk Talay, and Eugenio Culurciello
    19.1 Learning Internal Representations
    19.2 A Dedicated Digital Hardware Architecture
    19.3 Summary
    References

20. Mining Tree-Structured Data on Multicore Systems
    Shirish Tatikonda and Srinivasan Parthasarathy
    20.1 The Multicore Challenge
    20.2 Background
    20.3 Memory Optimizations
    20.4 Adaptive Parallelization
    20.5 Empirical Evaluation
    20.6 Discussion
    Acknowledgments
    References

21. Scalable Parallelization of Automatic Speech Recognition
    Jike Chong, Ekaterina Gonina, Kisun You, and Kurt Keutzer
    21.1 Concurrency Identification
    21.2 Software Architecture and Implementation Challenges
    21.3 Multicore and Manycore Parallel Platforms
    21.4 Multicore Infrastructure and Mapping

21 Scalable Parallelization of Automatic Speech Recognition

In this implementation, we offload the entire inference process to the GTX280 platform and take advantage of the efficient hardware-assisted atomic operations to address the challenge of frequent synchronizations. The large data working set is stored in the 1GB of dedicated memory on the GTX280 platform and accessed through a memory bus with 140GB/s peak throughput. We start an iteration by preparing an ActiveSet data structure that gathers the necessary operands into a "coalesced" data structure to maximize communication efficiency and improve the computation-to-communication ratio. We then use an arc-based traversal to handle the irregular data structure and maximize SIMD efficiency in the evaluation of state-to-state transitions. Finally, we leverage the CUDA runtime to efficiently meet the challenge of scheduling the unpredictable workload sizes with variable runtimes onto the 30 parallel multiprocessors of the GTX280.

Following Figure 21.3, after mapping the application to the parallel platform, we need to profile performance and perform sensitivity analysis on the different trade-offs particular to the specific implementation platform. We describe this process for our ASR inference engine application in Section 21.6.
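The traversal just described is given here only in prose; the following CUDA sketch illustrates the general pattern rather than the chapter's actual source code, and every identifier (ActiveSet, traverse_arcs, and so on) is hypothetical. It assumes arc weights are kept as quantized non-negative integer costs (negative log-likelihoods), so that the hardware atomicMin can combine competing scores arriving at the same destination state.

    // Hypothetical sketch: coalesced ActiveSet plus a one-thread-per-arc
    // traversal kernel; not the implementation described in the chapter.
    #include <cuda_runtime.h>

    struct ActiveSet {            // operands gathered into contiguous device arrays
        int*          arc_src;    // source state of each active arc
        int*          arc_dst;    // destination state of each active arc
        unsigned int* arc_cost;   // quantized arc weight (negative log-likelihood)
        int           num_arcs;
    };

    // One thread per active arc: uniform, fine-grained tasks keep SIMD lanes busy;
    // atomicMin keeps the best (lowest-cost) incoming path at each destination.
    __global__ void traverse_arcs(const int* arc_src, const int* arc_dst,
                                  const unsigned int* arc_cost, int num_arcs,
                                  const unsigned int* src_cost,
                                  unsigned int* dst_best_cost)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= num_arcs) return;
        unsigned int c = src_cost[arc_src[i]] + arc_cost[i];   // coalesced loads
        atomicMin(&dst_best_cost[arc_dst[i]], c);              // hardware-assisted sync
    }

    void traversal_step(const ActiveSet& as, const unsigned int* d_src_cost,
                        unsigned int* d_dst_best_cost)
    {
        // The CUDA runtime distributes the resulting blocks across the available
        // multiprocessors, absorbing the unpredictable per-iteration workload size.
        const int threads = 256;
        const int blocks  = (as.num_arcs + threads - 1) / threads;
        traverse_arcs<<<blocks, threads>>>(as.arc_src, as.arc_dst, as.arc_cost,
                                           as.num_arcs, d_src_cost, d_dst_best_cost);
    }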
21.6 Implementation Profiling and Sensitivity Analysis

We have addressed the known performance challenges by examining data, task, and runtime concerns and constructed a functionally correct implementation. Now we can analyze the performance achieved by these implementations.

21.6.1 Speech Models and Test Sets

Our ASR profiling uses speech models from the SRI CALO real-time meeting recognition system (Tur et al., 2008). The front end uses 13-dimensional perceptual linear prediction (PLP) features with first-, second-, and third-order differences, is vocal-tract-length normalized, and is projected to 39 dimensions using heteroscedastic linear discriminant analysis (HLDA). The acoustic model is trained on conversational telephone and meeting speech corpora using the discriminative minimum-phone-error (MPE) criterion. The language model is trained on meeting transcripts, conversational telephone speech, and web and broadcast data (Stolcke et al., 2008). The acoustic model includes 52K triphone states that are clustered into 2,613 mixtures of 128 Gaussian components. The pronunciation model contains 59K words with a total of 80K pronunciations. We use a small back-off bigram language model with 167K bigram transitions. The speech model is an H ◦ C ◦ L ◦ G model compiled using WFST techniques (Mohri et al., 2002) and contains 4.1 million states and 9.8 million arcs.

The test set consisted of excerpts from NIST conference meetings taken from the "individual head-mounted microphone" condition of the 2007 NIST Rich Transcription evaluation. The segmented audio files total 44 minutes in length and comprise 10 speakers. For the experiment, we assumed that the feature extraction is performed offline so that the inference engine can directly access the feature files.

Figure 21.7 Ratio of the computation-intensive phase of the algorithm versus the communication-intensive phase of the algorithm.

21.6.2 Overall Performance

We analyze the performance of our inference engine implementations on both the Core i7 multicore processor and the GTX280 manycore processor. The sequential baseline is implemented on a single core of a Core i7 quad-core processor, utilizing a SIMD-optimized Phase 1 routine and a non-SIMD graph traversal routine for Phase 2. Compared to this highly optimized sequential baseline implementation, we achieve a 3.4× speedup using all cores of the Core i7 and a 10.5× speedup on the GTX280.

The performance gain is best illustrated in Figure 21.7 by highlighting the distinction between the compute-intensive phase (black bar) and the communication-intensive phase (white bar). The compute-intensive phase achieves a 3.6× speedup on the multicore processor and 17.7× on the manycore processor, while the communication-intensive phase achieves only a 2.8× speedup on the multicore processor and 3.7× on the manycore processor.

The speedup numbers indicate that the communication-intensive Phase 2 dominates the runtime as more processors need to be coordinated. In terms of the ratio between the compute- and communication-intensive phases, the pie charts in Figure 21.7 show that 82.7% of the time in the sequential implementation is spent in the compute-intensive phase of the application. As we scale to the manycore implementation, the compute-intensive phase becomes proportionally less dominant, taking only 49.0% of the total runtime. The dominance of the communication-intensive phase motivates further detailed examination of Phase 2 in our inference engine.
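As a back-of-the-envelope consistency check (not an analysis given in the chapter), the phase-level numbers imply the overall result: with an 82.7%/17.3% sequential split, a weighted-harmonic-mean estimate of the manycore speedup is 1 / (0.827/17.7 + 0.173/3.7) ≈ 10.7×, close to the measured 10.5×, and the communication-intensive share of the scaled runtime becomes (0.173/3.7) / (0.827/17.7 + 0.173/3.7) ≈ 50%, matching the reported 49.0%.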
21.6.3 Sensitivity Analysis

In order to determine the sensitivity to different styles of the algorithm in the communication-intensive phase, we constructed a series of experiments for both the multicore and the manycore platform. The trade-offs in both task granularity and core synchronization techniques are examined for both platforms. The design space for our experiments, as well as the performance results, is shown in Figure 21.8. The choice of task granularity has direct implications for load balance and task creation overhead, whereas the choice of traversal technique determines the cost of core-level synchronization. The columns in Figure 21.8 represent different graph traversal techniques and the rows indicate different transition evaluation granularities. The figure provides performance improvement information for Phases 1 and 2 as well as the sequential overhead for all parallel implementation styles. The speedup numbers are reported over our fastest sequential version in the state-based propagation style.

Figure 21.8 Recognition performance normalized for a second of speech for different algorithm styles on the Intel Core i7 and NVIDIA GTX280.

On both of the platforms, the propagation-based style achieved better performance. However, the choice of best-performing task granularity differed for the two platforms. For the manycore implementation, the load-balancing benefits of the arc-based approach were much greater than the overhead of creating the finer-grained tasks. On the multicore architecture, the arc-based approach not only presented more overhead in creating finer-grained tasks but also resulted in a larger working set, thus increasing cache capacity misses. On wider SIMD units in future multicore platforms, however, we expect the arc-based propagation style to be faster than the state-based propagation style.
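To make the granularity trade-off concrete, here is a minimal sketch (hypothetical names, not the chapter's code) of the coarser, state-based alternative to the per-arc kernel shown earlier: one thread owns one active state and serially walks that state's outgoing arcs, so per-thread work grows with the state's out-degree. This is exactly the load imbalance the arc-based style avoids, at the price of creating and gathering many more fine-grained tasks.

    // Hypothetical state-based propagation kernel: one thread per active state,
    // with CSR-style arc storage (arcs of state s occupy arc_offsets[s] to
    // arc_offsets[s+1]). A warp runs only as fast as its highest-fanout state.
    __global__ void propagate_state_based(const int* active_states, int num_active,
                                          const int* arc_offsets,
                                          const int* arc_dst,
                                          const unsigned int* arc_cost,
                                          const unsigned int* src_cost,
                                          unsigned int* dst_best_cost)
    {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t >= num_active) return;
        int s = active_states[t];
        unsigned int base = src_cost[s];
        for (int a = arc_offsets[s]; a < arc_offsets[s + 1]; ++a)
            atomicMin(&dst_best_cost[arc_dst[a]], base + arc_cost[a]);
    }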
The figure also illustrates that the sequential overhead in our implementation is less than 2.5% of the total runtime, even for the fastest implementations. This demonstrates that we have a scalable software architecture that promises greater potential speedups with the additional platform parallelism expected in future generations of processors.

After performing the profiling and sensitivity analysis, we end up with a highly optimized implementation of the application on the parallel platform (see Figure 21.3). We can further optimize the implementation by making application-level decisions and trade-offs subject to the constraints and bottlenecks identified in the parallelization process. Section 21.7 describes an example of such optimizations.

21.7 Application-Level Optimization

An efficient implementation is not the end of the parallelization process. For the inference engine on the GTX280, for example, we observed that, given the challenging algorithm requirements, the dominant kernel has shifted from the compute-intensive Phase 1 to the communication-intensive Phase 2 in the implementation. We have also observed that modifying the inference engine implementation style does not improve the implementation any further. In this situation, we should take the opportunity to reexamine possible application-level transformations to further mitigate parallelization bottlenecks.

21.7.1 Speech Model Alternatives

Phase 2 of the algorithm involves a graph traversal process through an irregular speech model. There are two types of arcs in a WFST-based speech model: arcs with an input label (nonepsilon arcs) and arcs without input labels (epsilon arcs). In order to compute the set of next states in a given time step, we must traverse both the nonepsilon arcs and all levels of epsilon arcs from the current set of active states. This multi-level traversal can impair performance significantly, as each level requires multiple steps of cross-core synchronization. We explore a set of application transformations that modify the speech model to reduce the number of traversal levels required, while maintaining the WFST invariant of accumulating the same weight (likelihood) for the same input after a traversal.

To illustrate this, Figure 21.9 shows a small section of a WFST-based speech model. Each time step starts with a set of currently active states (for example, the two active states in Figure 21.9) representing the alternative interpretations of the input utterances. It proceeds to evaluate all outgoing nonepsilon arcs to reach a set of destination states, such as states 3 and 4. The traversal then extends through epsilon arcs to reach more states, such as state 5, before the next time step. The traversal from the initially active states to states 3, 4, and 5 can be seen as a process of active-state wavefront expansion in a time step. The challenge for data-parallel operations is that this expansion requires multiple levels of traversal: in this case, a three-level expansion is required, with one nonepsilon level and two epsilon levels.

Figure 21.9 Model modification techniques for a data-parallel inference engine.

By flattening the epsilon arcs as shown in Figure 21.9, we arrive at the Two-level WFST model, where by doing one nonepsilon-level expansion and one epsilon-level expansion we can reach all anticipated states. If we flatten the graph further, we can eliminate all epsilon arcs and achieve the same results with one level of nonepsilon arc expansion. Although model flattening can help eliminate the overhead of multiple levels of synchronization, it can also increase the total number of arcs traversed. Depending on the specific topology of the speech model, we may therefore achieve varying amounts of improvement in the final application performance metrics.
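The flattening itself is described only at the level of Figure 21.9; the host-side sketch below is one possible offline construction of the Two-level model under simplifying assumptions (hypothetical data structures, costs combined with min/+ as in Viterbi decoding, and output labels and other WFST bookkeeping omitted). For every state it computes the best cost to each state reachable through a chain of epsilon arcs and adds a single direct epsilon arc for it; building the One-level model would additionally compose each nonepsilon arc with the epsilon closure of its destination state.

    // Hypothetical offline epsilon flattening for the Two-level model: every
    // multi-arc epsilon chain is replaced by one direct epsilon arc carrying the
    // minimum accumulated cost, so decoding needs a single epsilon expansion.
    #include <cstdint>
    #include <functional>
    #include <queue>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    struct EpsArc { int dst; uint32_t cost; };

    // eps[s] lists the epsilon arcs leaving state s; the result lists, for each
    // state, direct epsilon arcs to every state in its epsilon closure.
    std::vector<std::vector<EpsArc>>
    flatten_epsilon(const std::vector<std::vector<EpsArc>>& eps)
    {
        const int n = static_cast<int>(eps.size());
        std::vector<std::vector<EpsArc>> flat(n);

        for (int s = 0; s < n; ++s) {
            // Dijkstra restricted to epsilon arcs: best epsilon-path cost from s.
            std::unordered_map<int, uint32_t> dist;
            using Item = std::pair<uint32_t, int>;          // (cost, state)
            std::priority_queue<Item, std::vector<Item>, std::greater<Item>> pq;
            dist[s] = 0;
            pq.push({0u, s});
            while (!pq.empty()) {
                auto [c, u] = pq.top();
                pq.pop();
                if (c > dist[u]) continue;                  // stale queue entry
                for (const EpsArc& a : eps[u]) {
                    uint32_t nc = c + a.cost;
                    auto it = dist.find(a.dst);
                    if (it == dist.end() || nc < it->second) {
                        dist[a.dst] = nc;
                        pq.push({nc, a.dst});
                    }
                }
            }
            for (const auto& [dst, cost] : dist)
                if (dst != s) flat[s].push_back({dst, cost}); // one direct arc
        }
        return flat;
    }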
21.7.2 Evaluation of Alternatives

We constructed all three variations of the speech model and measured both the number of arcs evaluated and the execution time of the communication-intensive Phase 2. We varied the number of alternative interpretations, which is shown in Figure 21.10 as a percentage of the total states that are active in the speech model. The "L"-shaped curves connect implementations that achieve the same recognition accuracy.

Figure 21.10 Communication-intensive phase runtime in the inference engine (normalized to a second of speech).

At the application level, we are interested in reducing the execution times of the communication-intensive phase. Going from the Original setup to the Two-level setup, we observe a large improvement in execution time, shown as a drop in the execution-time graph of the communication-intensive phase. This execution-time improvement was accompanied by a moderate increase in the number of arcs traversed during decoding, shown as a small shift to the right. Going from the Two-level setup to the One-level setup, we see a relatively smaller improvement in execution time, with a large increase in the number of arcs traversed.

An application domain expert who understands the implications of input formats on the performance of the parallel application can make application-level transformations to further improve application performance. For example, in ASR, for the recognition task that maintains the smallest number of active arcs in this set of experiments, the speech model transformations reduce the execution time of the communication-intensive phase from 97 ms to 75 ms, and further to 53 ms, thus almost doubling the performance of this phase.

21.8 Conclusion and Key Lessons

21.8.1 Process of Parallelization

This chapter describes a process for the scalable parallelization of an inference engine in automatic speech recognition. Looking back at Figure 21.3, we start the parallelization process at the application level and consider the available concurrency sources in an application. The challenge is to identify the richest source of concurrency that improves performance given a particular application constraint such as latency or throughput (see Section 21.1).

With the identified concurrency source, we construct the software architecture for the application using design patterns. Design patterns help create software architectures by composing structural and computational patterns (Keutzer and Mattson, 2009), as shown in Section 21.2. The design patterns help identify the application challenges and bottlenecks in a software architecture that must be addressed by the implementation.

The detailed implementation of the software architecture is performed with consideration of three areas of concern (data, task, and runtime) for each particular platform. The most effective parallel implementation strategy must recognize the architectural characteristics of the implementation platform and leverage the available hardware and software infrastructures. Some of the areas of concern are well taken care of by the infrastructure or the runtime system. In other cases, various styles of implementation strategy must be explicitly constructed as a series of experiments to determine the best implementation choice for a particular trade-off, leading to a performance sensitivity analysis.

The performance of an application can be improved by modifying the algorithm based on application domain knowledge. As illustrated in Section 21.7, the speech domain expert can make application-level decisions about the speech model structure while still preserving logical correctness. By identifying bottlenecks in the current implementation of the application, the domain expert can choose to modify the parameters of the application in order to make the application less sensitive to parallelization bottlenecks.

21.8.2 Enabling Efficient Parallel Application Development Using Frameworks

In order to develop a highly optimized implementation, one needs strong expertise in all areas of the development stack. Strong application domain expertise is required to identify the available application concurrency as well as to propose application-level transformations that can mitigate software architecture challenges. Strong parallel programming expertise is required to develop a parallel implementation, in which one needs to articulate the data, task, and runtime considerations for a software architecture on an implementation platform. This complexity increases the risks in deploying large parallel software projects, as the levels of expertise vary across the domains. Our ongoing work on software design patterns and frameworks at the PALLAS group in the Department of Electrical Engineering and Computer Science at the University of California, Berkeley attempts to address this problem by encapsulating the low-level parallel programming constructs into frameworks for domain experts.
The PALLAS group believes that the key to the design of parallel programs is software architecture, and the key to efficient implementation of the software architecture is frameworks. Borrowed from civil architecture, the term design pattern refers to a solution to a recurring design problem that domain experts learn with time. A software architecture is a hierarchical composition of architectural software design patterns, which can be subsequently refined using implementation design patterns. The software architecture and its refinement, although useful, are entirely conceptual. To implement the software, we rely on frameworks. We define a pattern-oriented software framework as an environment built on top of a software architecture in which customization is allowed only in harmony with the framework's architecture. For example, if the software architecture is based on the Pipe & Filter pattern, the customization involves only modifying pipes or filters.

We see application domain experts being serviced by application frameworks. These application frameworks have two advantages: first, the application programmer works within a familiar environment using concepts drawn from the application domain; second, the frameworks prevent the expression of many notoriously hard problems of parallel programming such as nondeterminism, races, deadlock, and starvation.

Specifically for ASR, we have tested and demonstrated this pattern-oriented approach during the process of designing this implementation. Patterns served as a conceptual tool to aid in the architectural design and implementation of the application. Referring back to Figure 21.3, we can use patterns from the software architecture to define a pattern-oriented framework for a speech recognition inference engine application. The framework will be able to encapsulate many data, task, and runtime considerations as well as profiling capabilities, and will be able to be extended to related applications. Although this framework is our ongoing research, we believe that these software design patterns and the pattern-oriented frameworks will empower ASR domain experts, as well as other machine learning experts, to quickly construct efficient parallel implementations of their applications.

References

Blumofe, R. D., Joerg, C. F., Kuszmaul, B. C., Leiserson, C. E., Randall, K. H., and Zhou, Y. 1995. Cilk: An Efficient Multithreaded Runtime System. Journal of Parallel and Distributed Computing, 207–216.
Butenhof, D. R. 1997. Programming with POSIX Threads. Reading, MA: Addison-Wesley.
Cardinal, P., Dumouchel, P., Boulianne, G., and Comeau, M. 2008. GPU Accelerated Acoustic Likelihood Computations. Pages 964–967 of: Proceedings of the 9th Annual Conference of the International Speech Communication Association (InterSpeech).
Chandra, R., Menon, R., Dagum, L., Kohr, D., Maydan, D., and McDonald, J. 2000. Parallel Programming in OpenMP. San Francisco, CA: Morgan Kaufmann.
Chong, J., Gonina, E., Yi, Y., and Keutzer, K. 2009 (September). A Fully Data-parallel WFST-based Large Vocabulary Continuous Speech Recognition on a Graphics Processing Unit. Pages 1183–1186 of: Proceedings of the 10th Annual Conference of the International Speech Communication Association (InterSpeech).
Chong, J., Gonina, E., You, K., and Keutzer, K. 2010a (September). Exploring Recognition Network Representations for Efficient Speech Inference on Highly Parallel Platforms. In: Proceedings of the 11th Annual Conference of the International Speech Communication Association (InterSpeech).
Chong, J., Friedland, G., Janin, A., Morgan, N., and Oei, C. 2010b (June). Opportunities and Challenges of Parallelizing Speech Recognition. In: 2nd USENIX Workshop on Hot Topics in Parallelism (HotPar'10).
Dixon, P. R., Oonishi, T., and Furui, S. 2009. Harnessing Graphics Processors for the Fast Computation of Acoustic Likelihoods in Speech Recognition. Computer Speech and Language, 23(4), 510–526.
Intel. 2009. Intel 64 and IA-32 Architectures Software Developer's Manuals.
Ishikawa, S., Yamabana, K., Isotani, R., and Okumura, A. 2006 (May). Parallel LVCSR Algorithm for Cellphone-oriented Multicore Processors. Pages 117–180 of: 2006 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2006) Proceedings.
Keutzer, K., and Mattson, T. 2009. A Design Pattern Language for Engineering (Parallel) Software. Intel Technology Journal, Addressing the Challenges of Tera-scale Computing, 13(4), 6–19.
Kumar, S., Hughes, C. J., and Nguyen, A. 2007. Carbon: Architectural Support for Fine-grained Parallelism on Chip Multiprocessors. Pages 162–173 of: ISCA '07: Proceedings of the 34th Annual International Symposium on Computer Architecture. ACM.
Mohri, M., Pereira, F., and Riley, M. 2002. Weighted Finite State Transducers in Speech Recognition. Computer Speech and Language, 16, 69–88.
Ney, H., and Ortmanns, S. 1999. Dynamic Programming Search for Continuous Speech Recognition. IEEE Signal Processing Magazine, 16, 64–83.
NVIDIA. 2009 (May). NVIDIA CUDA Programming Guide. NVIDIA Corporation. Version 2.2.1.
Ravishankar, M. 1993. Parallel Implementation of Fast Beam Search for Speaker-Independent Continuous Speech Recognition. Technical Report, Computer Science and Automation, Indian Institute of Science, Bangalore, India.
Stolcke, A., Anguera, X., Boakye, K., Cetin, O., Janin, A., Magimai-Doss, M., Wooters, C., and Zheng, J. 2008. The SRI-ICSI Spring 2007 Meeting and Lecture Recognition System. Lecture Notes in Computer Science, 4625(2), 450–463.
Tur, G., Stolcke, A., Voss, L., Dowding, J., Favre, B., Fernandez, R., Frampton, M., Frandsen, M., Frederickson, C., Graciarena, M., Hakkani-Tür, D., Kintzing, D., Leveque, K., Mason, S., Niekrasz, J., Peters, S., Purver, M., Riedhammer, K., Shriberg, E., Tien, J., Vergyri, D., and Yang, F. 2008. The CALO Meeting Speech Recognition and Understanding System. Pages 69–72 of: Proceedings of the IEEE Spoken Language Technology Workshop.
You, K., Chong, J., Yi, Y., Gonina, E., Hughes, C., Chen, Y. K., Sung, W., and Keutzer, K. 2009a (November). Parallel Scalability in Speech Recognition: Inference Engine in Large Vocabulary Continuous Speech Recognition. IEEE Signal Processing Magazine, 124–135.
You, K., Lee, Y., and Sung, W. 2009b (April). OpenMP-based Parallel Implementation of a Continuous Speech Recognizer on a Multicore System. Pages 621–624 of: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2009).

Subject Index

ε-neighborhood, 243 λ-gradient, 151 τ-approximation, 197 k-means clustering, 12, 70, 104, 145, 150, 245, 262 Fuzzy, 70 Harmonic, 94 Kernel, 70, 79 k-nearest neighbors, 70, 313 .NET, 49 Belief propagation (BP), 14, 190 Loopy, 190 Round-Robin, 199 Splash, 202 Synchronous, 195 Wildfire, 200 Belief residuals, 200 Bioinformatics, 421 Blue Gene, 77 Boosted decision tree, 12, 14, 25, 149 Boosted regression tree, 152 Data-distributed, 155 Feature-distributed, 154 AdaBoost, 150 Alternating least squares (ALS), 334 Amdahl's law, 453 Application concurrency, 450 Application programming interface (API), 69, 171 Application-specific integrated circuit (ASIC), 16, 411
Apriori algorithm, 83, 424 Area under a curve (AUC), 363 Arithmetic logic unit (ALU), 91 Arnoldi factorization, 244 ARPACK, 244 Asynchronous distributed learning, 222 Atomic operations, 199, 200 Average filling (AF), 340 Cache miss, 285 Candidate generation, 423 Chip multiprocessing (CMP), 421 Cholesky decomposition, 63, 180 Classification, 3, 128 Maximum margin, 128 Classification and regression tree (CART) algorithm, 385 Cloud computing, Clustering, Co-clustering, 263 Information-theoretic, 15, 265 Sequential, 15, 264 Coalesced memory access, 96, 379, 460 Coalitional classification, 343 Collaborative filtering, 332 Cross system, 333 Commutative monoid, 70 Computational complexity, 10 Backpropagation, 299 Backward elimination, 362 Balanced minimum cut, 211 Bandwidth limit, 285 Basic linear algebra subprograms (BLAS), 7, 133 Bayesian network, 14, 231
Compute unified device architecture (CUDA), 12, 90, 377, 400, 459 Computer vision, 16 Conditional probability table (CPT), 232 Conditional random field (CRF), 362 Conjugate gradient, 301 Cooperative matrix factorization (CoMF), 333 Supervised, 345 Cross-correlation, 379, 382 Data parallelism, 6, 377, 434 Coarse-grained, Fine-grained, 7, 13, 89 Data sparsity problem, 331 Data weaving, 264 Data-flow computing, 407 Datacenter, DataLoom, 264 Dataset 20 Newsgroups, 251, 275, 348 AdCorpus, 41 Adult census, 139, 185 California housing, 185 Caltech-101, 403 CoverType, 122 Cslogs, 437 Forest, 139 LabelMe, 375 Medline, 224 MNIST, 138, 405 Netflix, 276, 339 NORB, 138, 389, 416 PASCAL competition, 82 PicasaWeb, 251 RCV1, 122, 251, 276, 285, 365 Spambase, 185 Switchboard I (SWB), 317 Treebank, 427 UCI repository, 224, 364 UMASS, 416 Wikipedia, 224 Decision tree, 12, 59, 70, 384 Bagging, 12, 28 Boosting, 28 Deep belief network (DBN), 16, 373 Delayed updates, 288 Digital signal processor (DSP), 137 Direct memory access (DMA), 413 Distributed coalitional learning (DisCo), 333 Double data computation, 144 Dryad runtime, 49 DryadLINQ, 12, 49 Dynamic scheduling, 194 Eclat algorithm, 84 Eigendecomposition, 243, 311 Parallel, 248 Embarrassingly parallel algorithm, Embedded subtree, 423 Embedding list (EL), 424 Ensemble learning, 28 Entropy, 315 Expectation-maximization (EM), 94, 308 Factor graph, 191 Factorized distribution, 191 Feature class, 354, 356 Feature hashing, 286 Feature pooling layer, 401 Feature selection, 16, 354 Filter, 354 Forward, 16, 355 Wrapper, 354 Feature sharding, 287 Field-programmable gate array (FPGA), 1, 128, 284, 407 Filter bank layer, 400 Forward-backward sampler, 234 Forward-backward schedule, 193, 197 Frequent pattern mining, 16, 70, 83 Subtree, 17, 421 Gaussian mixture model (GMM), 455 Generalized boosted models (GBM), 38 Gentle boost algorithm, 384 Gibbs sampling, 14, 218, 391 Collapsed, 218 Gini coefficient, 385 Global memory, 91 Global training, 298 Gomory-Hu tree, 324 Google file system (GFS), 246 Gradient boosting (GB), 152, 172 Gradient descent, 131, 357 Delayed, 288 Minibatch, 300 Online, 283 Stochastic, 283 Grafting, 357 Parallel, 361 Graph partitioning, 211 Graphical model, 14, 190 Graphics processing unit (GPU), 1, 89, 127, 229, 284, 373, 377, 459 General purpose, 90 Hadoop, 18, 23, 69, 195 Hardware description language (HDL), 411
Hessian matrix, 111 Heteroscedastic linear discriminant analysis (HLDA), 462 Hidden Markov model (HMM), 14, 233, 310, 447 Hierarchical Dirichlet process (HDP), 14, 220 High performance computing (HPC), 1, 77, 186 Histogram of oriented gradient (HoG), 388, 405
Image denoising problem, 192 Incomplete Cholesky factorization (ICF), 13, 112 Independent matrix factorization (IMF), 340 Induced subtree, 423 Induction, Inference, Approximate, 192 Distributed, 220 Information bottleneck (IB), 265 Sequential, 269 Information retrieval, 14, 151 Information-theoretic clustering (ITC), 263 Instance sharding, 286 Interior point method (IPM), 13, 111 Iterative conditional modes (ICM), 266 Jacket software, 105 Job descriptor, 432 Job pool (JP), 432 Karush-Kuhn-Tucker conditions (KKT), 127, 315 kd-tree, 319 Kernel acceleration, 141 Kernel function, 79, 111 Kernel matrix, 111 Kinect, 66 Kolmogorov's superposition theorem, 176 Kullback-Leibler divergence, 312 LambdaMART, 148, 151 Data-distributed, 155 Feature-distributed, 154 LambdaRank, 151 Language integrated queries (LINQ), 49 Laplacian matrix, 240 Normalized, 241 LASSO, 362 Latent Dirichlet allocation (LDA), 14, 19, 218, 276 Approximate distributed, 220
Latent semantic analysis (LSA), 218 Probabilistic, 332 Latent variable model, 217 Learning rate, 283 Linear regression, 14, 70 Linear regression tree (LRT), 178 Load balancing, 17, 190, 422, 431 Over-partitioning, 212 Partitioning, 211 Local linear embedding (LLE), 312 Local training, 294 Logistic regression, 16, 353 Longest common subsequence (LCS), 428 Loss function, 171, 357 Hinge, 172, 311 Smoothed hinge, 345 Squared, 25, 171, 311 Low-level features, 375 Machine translation, Manycore processor, 446 MapReduce, 7, 12, 23, 56, 69, 94, 150, 195, 240, 332, 358, 453 Map phase, 25 Reduce phase, 25 Shuffle phase, 25 Markov blanket, 232 Markov chain Monte Carlo (MCMC), 13, 218 Markov logic network (MLN), 192 Markov random field (MRF), 192 MART, 152, 184 Master-slave communication model, 6, 271 Matlab, 49, 105, 236 Max pooling, 382 Maximum likelihood estimation, 353 Measure propagation, 15, 312 Memory bandwidth, 422 Memory-conscious Trips (MCT), 427 Message locking, 199 Message passing interface (MPI), 7, 77, 158, 186, 213, 229, 240, 275, 335 Micro parallelization, 136 Micro-averaged accuracy, 275 MiLDe, 130, 136 MineBench, 102 Minimum phone error (MPE), 462 Model-based clustering, 99 MPICH2, 77, 213, 246 Multi-modal clustering, 263 Multicore parallelization, 290, 454 Multicore processor, 290, 408, 422, 446
Multimedia extension instructions (MMX), 134 Multinode parallelization, 291 Multiple-data multiple-thread (MDMT), 90 Mutual information (MI), 265 Naïve Bayes, 294 Natural language processing (NLP), 17, 192 Neocognitron, 403 Neural network, 176 Convolutional, 16, 145, 400 Nonlinearity layer, 401 Nonparametric model, 220 Normalized discounted cumulative gain (NDCG), 151, 158 Normalized mutual information (NMI), 252 NVIDIA, 90, 102 Nyström method, 243 Object recognition, 404 Occurrence-based support, 423 Online learning, 15, 283 OpenCL, 92 OpenCV, 387 OpenGL, 92 OpenMP, 229, 451 Optical character recognition (OCR), 402 Overfitting, 2, 44 Parallel machine learning toolbox (PML), 12, 69, 171 Parallel spectral clustering (PSC), 241 Parallel support vector machine (PSVM), 109 PARPACK, 248 Pattern growth mechanism, 424 Peer-to-peer (P2P) networks, 18 Perceptron, 283 Multi-layered, 316 Perceptual linear prediction (PLP), 462 Perplexity, 225 PLANET, 29 Predictive sparse decomposition (PSD), 403 Principal component analysis (PCA), 70, 218 Kernel, 70 ProbE data mining engine, 170 Processing tile (PT), 411 Protein side-chain prediction, 205 Pyramid match kernel (PMK), 405 Quadratic programming (QP), 111, 127 Radial basis function (RBF), 79, 139, 313
Real-time performance, 16 Recommender system, 16, 331 Recursive doubling algorithm, 249 Regression, Regression clustering, 99 Regression k-means, 12 Regression tree, 27, 152, 170 Remote procedure call (RPC), 40 Representation power, 293 Reproducing kernel Hilbert space (RKHS), 110 Residual scheduling, 201 Restricted Boltzmann machine (RBM), 390, 403 Root mean square error (RMSE), 341 Rooted tree, 423 Round-robin tournament, 273 Scale invariant feature transform (SIFT), 375, 399 Self-training, 308 Semi-supervised learning, 3, 15, 307 Graph-based, 15, 308 Sequential clustering, 263 Sequential minimal optimization (SMO), 13, 111, 130 Share no write variables (SNWV), 93 Shared cache memory, 380 Shared memory, 91 Shared memory platform (SMP), 77 Shared nothing architecture, 84 Sherman-Morrison-Woodbury formula (SMW), 114 Single feature optimization (SFO), 16, 355 Single-instruction multiple-data (SIMD), 17, 134, 460 Singular value decomposition (SVD), 62 Sliding window object detection, 383 SLIQ, 44, 81 Sparsification, 240 Spectral clustering, 14, 240 Spectral graph transduction (SGT), 311 Speech recognition, 15, 17, 446 Splash algorithm, 14, 202 Multicore, 206 Sequential, 205 Spread kernel, 134 Stream processor (SP), 378 Streaming SIMD extensions (SSE), 134 Strong scalability, 186 Subtree matching, 428 Sufficient statistics, 97 Supervised learning, Support counting, 423
Support vector machine (SVM), 13, 18, 69, 109, 127 Dual, 129 Laplacian, 311 Primal, 129 Transductive, 308 Switchboard transcription project (STP), 317 Symmetric multiprocessing (SMP), 45, 309 Task parallelism, Adaptive, 433 Task scheduling service, 436 Text classification, 16 Thread blocks, 378 Thread pool (TP), 432 Topic model, 14, 217 Transaction-based support, 423 Transductive learning, 307 Transfer learning, 15, 331, 343 Transform regression, 14, 170 Tree architecture, 292 TreeMiner, 424 Trips algorithm, 425 Unmanned aerial vehicles (UAV), 399 Unsupervised learning, Variational inference, 230 Vector processing, 377 Viterbi algorithm, 448 Vowpal Wabbit software, 285 Weak scalability, 186 Web search ranking, 24, 148 Weighted finite state transducer (WFST), 448 Work sharing, 431 Working set selection, 130 Working set size, 422