Pipelined Processor Farms: Structured Design for Embedded Parallel Systems

Martin Fleury
Andrew Downton

A Wiley-Interscience Publication
JOHN WILEY & SONS, INC.
New York / Chichester / Weinheim / Brisbane / Singapore / Toronto

This text is printed on acid-free paper.

Copyright © 2001 by John Wiley & Sons, Inc. All rights reserved. Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4744. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 605 Third Avenue, New York, NY 10158-0012, (212) 850-601 , fax (212) 850-6008, E-Mail: PERMREQ@WILEY.COM.

For ordering and customer service, call 1-800-CALL-WILEY.

Library of Congress Cataloging in Publication Data is available.
ISBN 0-471-22438-3
This title is also available in print as ISBN 0-471-38860-2.
Printed in the United States of America.

Foreword

Parallel systems are typically difficult to construct, to analyse, and to optimize. One way forward is to focus on stylized forms. This is the approach taken here, for Pipelined Processor Farms (PPF). The target domain is that of embedded systems with continuous flow of data, often with real-time constraints. This volume brings together the results of ten years study and development of the PPF approach and is the first comprehensive treatment beyond the original research papers. The overall methodology is illustrated throughout by a range of examples drawn from real applications. These show both the scope for practical
application and the range of choices for parallelism both in the pipelining and in the processor farms at each pipeline stage. Freedom to choose the numbers of processors for each stage is then a key factor for balancing the system and for optimizing performance characteristics such as system throughput and latency. Designs may also be optimized in other ways, e.g. for cost, or tuned for alternative choices of processor, including future ones, providing a high degree of future-proofing for PPF designs. An important aspect is the ability to do "what if" analysis, assisted in part by a prototype toolkit, and founded on validation of predicted performance against real. As the exposition proceeds, the reader will get an emerging understanding of designs being crafted quantitatively for desired performance characteristics. This in turn feeds in to larger scale issues and trade-offs between requirements, functionality, benefits, performance, and cost. The essence for me is captured by the phrase "engineering in the performance dimension".

CHRIS WADSWORTH
TECHNICAL CO-ORDINATOR, EPSRC PROGRAMME ON PORTABLE SOFTWARE TOOLS FOR PARALLEL ARCHITECTURES

Preface

In the 1980s, the advent of the transputer led to widespread investigation of the potential of parallel computing in embedded applications. Application areas included signal processing, control, robotics, real-time systems, image processing, pattern analysis and computer vision. It quickly became apparent that although the transputer provided an effective parallel hardware component, and its associated language Occam provided useful low-level software tools, there was also a need for higher-level tools together with a systematic design methodology that addressed the additional design parameters introduced by parallelism. Our work at that time was concerned with implementing real-time document processing systems which included significant computer vision problems requiring multiple processors to meet throughput and latency
constraints. Reviews of similar work highlighted the fact that processor farms were often favored as an effective practical parallel implementation architecture, and that many applications embodied an inherent pipeline processing structure. After analysing a number of our own systems and those reported by others we concluded that a combination of the pipeline structure with a generalized processor farm implementation at each pipeline stage offered a flexible general-purpose architecture for soft real-time systems. We embarked upon a major project, PSTESPA (Portable Software Tools for Embedded Signal Processing Applications) to investigate the scope of the Pipeline Processor Farm (PPF) design model, both in terms of its application potential and the supporting software tools it required. Because the project focused mostly upon high-level design issues, its outcome largely remains valid despite seismic changes within the parallel computing industry.

By the end of our PSTESPA project, notwithstanding its successful outcome, the goalposts of parallel systems had moved, and it was becoming apparent that many of the ambitious and idealistic goals of general-purpose parallel computing had been tempered by the pragmatic reality of market forces. Companies such as Inmos, Meiko, Parsys and Parsytec (producing transputer-based machines), and ICL, AMT, MasPar and Thinking Machines (producing SIMD machines), found that the market for parallel applications was too fragmented to support high-volume sales of large-scale parallel machines based upon specialized processing elements, and that application development was slow and difficult with limited supporting software tools. Shared-memory machines produced by major uniprocessor manufacturers such as IBM, DEC, Intel and Silicon Graphics, and distributed Networks of Workstations (NOWs) had however established a foothold in the market, because they are based around high-volume commercial off-the-shelf (COTS) processors, and achieved
penetration in markets such as database and fileserving where parallelism could be supported within the operating system.

In our own application field of embedded systems, NOWs and shared-memory machines have a significant part to play in supporting the parallel logic development process, but implementation is now increasingly geared towards hardware-software co-design. Co-design tools may currently be based around heterogeneous computing elements ranging from conventional RISC and DSP processors at one end of the spectrum, through embedded processor cores such as ARM, to FPGAs and ASICs at the other. Historically, such tools have been developed bottom-up, and therefore currently betray a strong hardware design ethos, and a correspondingly weak high-level software design model. Our current research (also funded by EPSRC) is investigating how to extend the PPF design methodology to address this rapidly developing embedded applications market using a software component-based approach, which we believe can provide a valuable method of unifying current disparate low-level hardware-software co-design models. Such solutions will surely become essential as complex multimedia embedded applications become widespread in consumer, commercial and industrial markets over the next decade.

ANDY DOWNTON
Colchester, Oct 2000

Acknowledgments

Although this book has only two named authors, many others have contributed to its content, both by carrying out experimental work and by collaborating in writing the journal and conference papers from which the book is derived. Early work on real-time handwritten address recognition, which highlighted the problem to be addressed, was funded by the British Post Office, supported by Roger Powell, Robin Birch and Duncan Chapman. Algorithmic developments were carried out by Ehsan Kabir and Hendrawan, and initial investigations of parallel implementations were made by Robert Tregidgo and Aysegul Cuhadar, all of whom received doctorates for their
work. In an effort to generalise the ideas thrown up by Robert's work in particular, further industrial contract work in a different field, image coding, was carried out, funded by BT Laboratories through the support of Mike Whybray. Many people at BT contributed to this work through the provision of H.261 image coding software, and (later) other application codes for speech recognition and microphone beam forming. Other software applications, including those for model-based coding, H.263, and Eigenfaces were also investigated in collaboration with BT. In addition to Mike Whybray, many others at BT laboratories provided valuable support for work there, including Pat Mulroy, Mike Nilsson, Bill Welsh, Mark Shackleton, John Talintyre, Simon Ringland and Alwyn Lewis. BT also donated equipment, including a Meiko CS2 and Texas TMS320C40 DSP systems to support our activities.

As a result of these early studies, funding was obtained from the EPSRC (the UK Engineering and Physical Sciences Research Council) to investigate the emergent PPF design methodology under a directed program on Portable Software Tools for Parallel Architectures (PSTPA). This project - PSTESPA (Parallel Software Tools for Embedded Signal Processing Applications) - enabled us not only to generalise the earlier work, but also to start investigating and prototyping software tools to support the PPF design process. Chris Wadsworth from Rutherford Appleton Laboratories was the technical coordinator of this program, and has our heartfelt thanks for the support and guidance he provided over a period of nearly four years. Adrian Clark, with extensive previous experience of parallel image processing libraries, acted as a consultant on the PSTESPA project, and Martin Fleury was appointed as our first research fellow, distinguishing himself so much that before the end of the project he had been appointed to the Department's academic staff. Several other research fellows also worked alongside Martin during the
project: Herkole Sava, Nilufer Sarvan, Richard Durrant and Graeme Sweeney, and all contributed considerably to its successful outcome, as is evidenced by their co-authorship of many of the publications which were generated.

Publication of this book is possible not only because of the contributions of the many collaborators listed above, but also through the kind permission of the publishers of our journal papers, who have permitted us to revise our original publications to present a complete and coherent picture of our work here. We particularly wish to acknowledge the following sources of tables, figures and text extracts which are reproduced from previous publications:

The Institution of Electrical Engineers (IEE), for permission to reprint: portions of A C Downton, R W S Tregidgo, and A Cuhadar, Top-down structured parallelization of embedded image processing applications, IEE Proceedings Part I (Vision, Image, and Signal Processing), 141(6):438-445, 1994 as text in Chapter 1, as Figure 1.1 and A.1-A.4, and as Table A.1; portions of M Fleury, A C Downton, and A F Clark, Scheduling schemes for data farming, IEE Proceedings Part E (Computers and Digital Techniques), in press at the time of writing, as text in Chapter 6, as Figures 6.1-6.9, and as Tables 6.1 and 6.2; portions of A C Downton, Generalised approach to parallelising image sequence coding algorithms, IEE Proceedings I (Vision, Image, and Signal Processing), 141(6):438-445, 1994 as text in Section 8.1, as Figures A.6-8.12, and as Tables 8.1 and 8.2; portions of H P Sava, M Fleury, A C Downton, and A F Clark, Parallel pipeline implementation of wavelet transforms, IEE Proceedings Part I (Vision, Image, and Signal Processing), 144(6):355-359, 1997 as text in Section 9.2, and as Figures 9.6-9.10; portions of M Fleury, A C Downton, and A F
Clark, Scheduling schemes for data farming, IEE Proceedings Part E (Computers and Digital Techniques), 146(5):227-234, 1994 as text in Section 11.9, as Figures 11.11-11.17, and as Table 11.6; portions of M Fleury, H Sava, A C Downton, and A F Clark, Design of a clock synchronization sub-system for parallel embedded systems, IEE Proceedings Part E (Computers and Digital Techniques), 144(2):65-73, 1997 as text in Chapter 12, as Figures 12.1-12.4, and as Tables 12.1 and 12.2.

Elsevier Science, for inclusion of the following: portions reprinted from Microprocessors and Microsystems, 21, A Cuhadar, A C Downton, and M Fleury, A structured parallel design for embedded vision systems: A case study, 131-141, Copyright 1997, with permission from Elsevier Science, as text in Chapter 3, as Figures 3.1-3.10, and as Tables 3.1 and 3.2; portions reprinted from Image and Vision Computing, M Fleury, A F Clark, and A C Downton, Prototyping optical-flow algorithms on a parallel machine, in press at the time of writing, Copyright 2000, with permission from Elsevier Science, as text in Section 8.4, as Figures 8.19-8.28, and as Tables 8.8-8.12; portions of Signal Processing: Image Communications, 7, A C Downton, Speed-up trend analysis for H.261 and model-based image coding algorithms using a parallel-pipeline model, 489-502, Copyright 1995, with permission from Elsevier Science, as text in Section 10.2, Figures 10.5-10.7, and Table 10.2.

Springer Verlag, for permission to reprint: portions of H P Sava, M Fleury, A C Downton, and A F Clark, A case study in pipeline processor farming: Parallelising the H.263 encoder, in UK Parallel'96, 196-205, 1996, as text in Section 8.2, as Figures 8.13-8.15, and as Tables 8.3-8.5; portions of M Fleury, A C Downton, and A F Clark, Pipelined parallelization of face recognition, Machine Vision Applications, in press at the time of writing, as text in Section 8.3, Figures 5.1 and 5.2, Figures 8.16-8.18, and Tables 8.6 and 8.7; portions of M Fleury, A C
Downton, and A F Clark, Karhunen-Loeve transform: An exercise in simple image-processing parallel pipelines, in Euro-Par'97, 815-819, 1997, as text in Section 9.1, Figures 9.4-9.5; portions of M Fleury, A C Downton, and A F Clark, Parallel structure in an integrated speech-recognition network, in Euro-Par'99, 9951004, 1999, as text in Section 10.1, Figures 10.1-10.4, and Table 10.1 Academic Press, for permission to reprint: portions of A Cuhadar, D G Sxnpson, and A C Downton, A scdable parallel approach to vector quantization, Real-Tame Imaging, 2:241-247, 1995, as text in Section 9.3, Figures 9.11-9.19, and Table 9.2 The Institute of Electrical and Electronic Engineers (IEEE), for permission t o reprint: portions of M Fleury, A C Downton, and A F Clark, performance metrics for embedded parallel pipelines, IEEE Transactions in Parallel and Distributed Systems, in press a t the time of writing, ay text in Chapter 11, Figures 2.2-2.4, Figures 11.1-11.10, and as and Tables 11.111.5 John Wiley & Sons Limited, for inclusion oE portions of Constructiug generic data-farm templates, M Fleury, A C Downton, and A F Clark, C o n c u r ~ n c y :Practice and Ezperience, 11(9):1-20, 1999, @.lohn Wiley & Sons Limited, reproduced with permission, as text in Chapter and Figures 7.1-7.7 The typescript of this book was typeset by the authors using B and WinEdt W ,MikTex A C D and M F 292 REFERENCES 308 S P A Ringland Application of grammar constraints t o ASR using signature functions In Speech Recognition and Coding, pages 260-263 Springer, Berlin, 1995 Volume 147 NATO AS1 Series F 309 Rioul and M Vetterli Wavelet and signal processing IEEE Signal Processing Magazine, pages 14-38, 1991 310 J T Robinson Some analysis techniques for asynchronous multiprocessor algorithms IEEE k n s a c t i o n s on Software Engineering, 5(1):24-31, January 1979 311 A W Roscoe Routing messages through networks: An exercise in deadlock avoidance In T Muntean, editor, a Occam User Group Technical 
Meeting, pages 55-79 IOS, Amsterdam, 1987 312 W W Royce Managing the development of large software systems In WESTCON, CA, 1970 313 Heath S Embedded Systems Desrgn Newnes, Oxford, U K , 1997 314 D G Sampson, da Silva E A B., and M Ghanbari Low bit rate video coding using wavelet vector quantization IEE Proceedings - Vision, Image and Signal Processing, 142(3):141-148,1995 315 D G Sampson and M Ghanbari Fast lattice-based gain-shape vector quantization for image sequence coding IEE Proceedings, Part-I, Communication, Speech and Vision, Special issue on Image processing and its Applications, 140(1):56-66, 1993 316 V Sarkar Determining average program execution times and their m i ance In ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 298-312, 1989 317 H Sava, M Fleury, A C Downton, and A F Clark Parallel pipeline implementation of wavelet transforms In 6th International Conference on Image Processing and its Applications, pages 171-173, 1997 318 J SchaefTer, D Szafron, G Lobe, and I Parsons The Enterprise model for developing distributed applications IEEE ParnIlel and Distributed Technology, pages 85-95, August 1993 319 M S Schlansker and B R Rau EPIC: Explicitly parallel instruction computing IEEE Computing, 33(2):3745, 2000 320 C Seitz A prototype two-level multicomputer, 1996 Project Information in DARPA/ITO project, available as http/www.myri.com/research/darpa/96summary.html 321 J M Shapiro Embedded image coding using zero-trees of wavelet coefficients IEEE Tkansactions on Signal Processing, 4(12):3445-3462,1993 REFERENCES 293 322 M Shensa The discrete wavelet transform: Wedding the A trous IEEE Zbansactions on Signal Processing, and mallat algorithms 40(10):2464-2482,1992 323 N Sherkat, R K Powalka, and R J Whitrow A parallel engine for real-time handwriting and optical character recognition In Jet Poste 93, pages 830-837,1993 324 K G Shin and P Rarnanathan Clock synchronization of a large multiprocessor system in €he presence 
of malicious faults IEEE 1Sansactions on Computers, 36(1):2-12, January 1987 325 R Jr Simar DSP architectures, algorithms, and code-generation: Fission or fusion In IEEE Workshop on Signal Processing Systems, pages 50-59, 1997 326 H D Simon, editor Scientific Applications of the Connection Machine World Scientific, Singapore, 1988 327 E P Simoncelli, E H Adelson, and D J Heeger Probability diutribntionv of optical flow 111 TEE8 Conference on Computer Vision and Pattern Recognition, pages 310-315, 1991 328 D B Skillicorn A taxonomy for computer architectures IEEE Computer, 21(9):46-57, September 1988 329 D B Skillicorn Foundations of Parallel Programming C.U.P., Cambridge, UK, 1994 330 M I Skolnik Introduction to Radar System McGraw-Hill, New York, 1962 331 H C Slusanschi and C Jesshope A FORTRAN 90 t o F-code compiler In UK Parallel '96, pages 40-52 Springer, London, 1996 332 B P Smith Shared memory, vectors, message passing, and scalability Lecture Notes in Computer Science, 295:29-34, 1987 4'h International DFVLR Seminar 333 M .J T Smith and T P Barnwell Exact tree construction techniques for tree-strcutured subband codecs IEEE Tmnsactions on Acoustics, Speech, and Signal Processing, 34(3):434-440,1986 334 L Snyder A taxonomy of synchronous parallel machines In 17ht International Conference on Parallel Processing, volume 1, pages 281-285, 1988 335 I Sommerville Software Engineering Addison-Wesley, Wokingham, UK, 3& edition, 1989 294 REFERENCES 336 L Sousa, J Burrios, A Costa, and M Picdate Parallel image processing for transputer-based systems In IEE International Conference on Image Processing and ils Applicalions, pages 33-36, 1992 337 J Spiers Database management systems for parallel computers Technical report, Oracle Corporation, 1990 338 H V Sreekantaswamy, S Chanson, and A Wagner Performance prediction modelling of multicomputers In 12th International Conference on Distn'buted Computing Systems, pages 2781285 IEEE, 1992 339 T K Srikanth and S Toueg Optimal 
clock synchronization Journal of the ACM, 34(3):626445, July 1987 340 V Srini An architectural comparison of dataflow systems IEEE Computer, 19(3):68-88, March 1986 341 S R Sternberg Pipeline architectures for image processing In K P r e ston Jr and L Uhr, editors, Multicomputers and Image Processing, pages 291-206 Academic Press, New York, 1982 342 W R Stevens Unix Network Programming Prentice Hall, Englewood Cliffs, NJ, 1990 343 W-K Su, R Siviloti, Y Cho, and D Cohen Scalable, networkconnected, reconfigurable hardware accelerator for automatic target recognition Technical report, Myricom Inc., 1998 Available as http://wwu.myri.com/darpa/atr-report.pdf 344 C Szyperski Component Software Addison-Wesley, Harlow, UK, 1998 345 D Tabak Multiprocessors Prentice-Hall, Engelwood Cliffs, NJ, 1990 346 M Tasker, J Allin, J Dixon, M Shackman, and J Forrest Professional Symbian Programming: Mobile Solutions on the EPOC Platform Wrox, London, 2000 347 A M T e m p Digital Video Processing Prentice IIall, Upper Saddle River, NJ, 1995 348 C Temperton Self-sorting mixed-radix Fast Fourier Transforms Journal of Computational Physics, 15:l-23, 1983 349 Texas Instruments Inc TMS380C4x User's Guide, 1992 350 A Thepaut, J Le Drezen, G Ouvradu, and J D Laisne Preporcessing and recognition of handwritten postal codes using a multigrain parallel machine In Jet Poste 93, pages 813-819,1993 REFERENCES 295 351 R Thoma and M Bierling Motion compensating interpolatiorl considering covered and uncovered background Signal Processing: Image Communication, 1:191-212, 1989 352 Transtech Parallel Systems Ltd., 17-19 Manor Court Yard, Hughenden Ave., High Wycombe, UK Paramid User's Guide, 1993 353 R G Tregidgo Parallel Processing and Automatic Postal Address Recognition PhD thesis, Essex University, May 1992 354 R W S Tregidgo and A C Downton Processor farm analysis and simulation for embedded parallel processing systems In S J Turner, editor, Tools and Techniques for lhnsputer Applications (OUG 12) 
IOS, Amsterdam, 1990 355 Tretiak and L Pastor Velocity estimation from image sequences with second order differential operators In IEEE International Conference on Pattern Recognition, volume 1, pages 16-19, 1984 356 A Trew and G Wilson Past, Present, Parallel: a Survey of Available Pavallel Computing Systems Springer, London, 1991 357 E R Tufte The Visual Display of Quantitative Information Graphics Press, Cheshire, CO, 1983 358 M Turk and A Pentland Eigenfaces for recognition Journal for Cognitive Neuroscience, 371-86, 1991 359 T H Tzen and L M Ni Trapezoid self-scheduling: A practical scheduling scheme for parallel compilers IEEE Tkansactions on Parallel and Distributed Systems, 4(1):87-98, January 1993 360 A Uhl and A Bruckman A double-tree wavelet compression on parallel MIMD computcrs In @ International Conference on Image Processing and Its Applications, pages 179-183, 1997 IEE Conference Publication 443 361 R Umbach and H Ney Improvements in beam search for 10,000-word continuous-speech recognition IEEE T+ansactions on Syeeclr and Audio Processing, 2(2):353-356, April 1994 362 P E Undrill Transputers in image processing In M R Jane, R J Fawcett, and T P Mawby, editors, Pansputer Applications - progress and prospects IOS, Amsterdam, 1992 363 L G Valiant A bridging model for parallel computation Communications of the ACM, 33(8):103-111, August 1990 296 REFERENCES 364 L G Valiant General purpose parallel architectures In J van Leeuwen, editor, Handbook of Theoretical Computer Science, volume A, pages 943972 Elsevier, Amsterdam, 1990 365 B van Veen and K M Buckley Beamforming: A versatile approach t o spatial filtering IEEE Acoustics, Speech, and Signal Pmcessing Magazine, pages 4-24, April 1988 366 M Vanneschi Heterogeneous HPC environments In D Pritchard and J Reeve, editors, Eum-Par'S8 Parallel Processing, pages 21-34 Springer, Berlin, 1998 Lecture Notes in Computer Science No 1470 367 E Verhulst Non-sequential processing: bridging the semantic gap left by 
the von Neumann architecture In 1997 IEEE Workshop on Signal Processing Systems, pages IEEE, Piscataway, NJ, 1997 368 P J Vermeulen and D P Casasent Karhunen-LoBve techniques for optimd processing of time-sequential imagery Optical Engineering, 30(4):415-423, April 1991 369 A Verri, F Girosi, and V Torre Differential techniques for optical flow Journal of the Optical Society of America A, 7(5):912-922, 1990 370 M Vetterli and C Herley Wavelets and filter banks: Theory and deuign IEEE Transactions on Signal Processing, 40(9):2207-2232, 1992 371 U Villano Monitoring parallel programs running in transputer networks In G Haring and G Kotsis, editors, Perfonnonce Measurement an Visualization of Parallel Systems, pages 67-96 Elsevier, Amsterdam, 1993 372 Vitruvius De Architectum, volume & Harvard University, 1945 & 1970 Editor: F Granger 373 A S Wagner, H V Skreekantaswamy, and S T Chanson Performance models for the processor farm paradigm IEEE Tmnsactions on Pamllel and Distributed Systems, 8(5):475-489, May 1997 374 K Walrath and M Campione The JFC Swing Tutorial: A Guide to Constructing GUIs Addison-Wesley, h a d i n g , MA, 1999 375 B W Weide Analytical models t o explain anomalous behaviour of parallel algorithms In International Conference on Parallel Processing, pages 183-187,1981 376 A Weiss Large Deviations for Performance Analysis Chapman & Hall, London, 1995 REFERENCES 297 377 P H Welch Graceful termination - graceful resetting In Bakkers A., editor, 10th Occam User Group Technical Meeting IOS, Amsterdam, 1989 378 P H Welch, G R R Justo, and C Willcock High-level paradigms for deadlock-free high-performance systems In Transputer Applications and Systems '93, pages 981-1004 IOS, Amsterdam, 1993 379 P J Welch and D C Wood Higher levels of process synchronisation In A Bakkers, editor, Parallel Programming and Java, pages 104-129 IOS, Amsterdam, 1997 380 M W Whybray Video codec design using dsp devices BT Technology Journal, 10(1):127-140, January 1992 381 B 
Widrow and S D Stearns Adaptive Signal Processing Prentice Hall, London, 1985 382 H Williams Threads and multithreading In Java 1.1 Unleashed, pages 121-163 Sams, Indianapolis, IN, 1997 383 S Williams Approaches to the Determination of Parallelism for Computer Programs PhD thesis, Loughborough University of Technology, 1978 384 S Williams Programming Models for Parallel Sgatem.9 Wiley, Chichester, UK, 1990 385 W J Willimas, H P Zaveri, and J C Sackellares Timefrequency analysis of electrophysiological signals in epilepsy IEEE Engineering in Medicine and Biology Magazine, pages 133-143,1995 386 G V Wilson Practical Parallel Processing MIT, Cambridge, MA, 1995 387 Wind Rivers Systems VxWorks Programmer's Guide 5.1, 1993 Part #: DOC-100000-0002 388 S C Winter Research report - current projects Technical report, University of Westminster, London, 1994 Serialization of parallel programs by G R Ribeiro Justo 389 L Wiskott, J -L Fellous, N Kriiger, and Ch von der Malsburg Face recognition and gender determination In International Workshop on Automatic Face- and Gesture-Recognition, pages 92-97, 1995 390 W H Wolf Hardware-software cc-design of embedded systems Pmceedings of the IEEE, 82(7):967-989, 1994 391 P C Woodland, C J Leggetter, J Odell, V Valtchev, and S J Young The 1994 HTK large vocabulary speech recognition system In ICASSP'S5, volume I, pages 73-76, 1995 298 REFERENCES 392 P H Worley A new PICL trace file format Technical report, Oak Ridge National Laboratory, Oak Ridge, TN, USA, September 1992 Report ORNLJTM-12125 393 Q X Wu A correlation-relaxation-labeling framework for computing optical flow - template matching from a new perspective, IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(8):843-853, September 1995 394 Xilinx Inc., 2100 Logic Drive, San Jose, CA Field Programmable Gate Anays Datasheet, 2000 Virtex 2.51.' 
Available as http://www.xilinx.com/partinfo/d2003.pdf 395 J C Yan Performance tuning with AIMS-an automated instrumentation and monitoring system for ~nulticotnputers.111 27th Annual Hawaii International Conference on System Sciences, 1994 396 R B Yates, N A Thacker, S J Evans, S N Walker, and P A Ivey An array processor for general purpose digital image compression IEEE T!wrsuctions of SSod-State Circuits, 30(3):244-249, March 1995 397 E N Yourdon Structured design: fundamentals of a discipline of computer program and systems design Yourdon, NY, d edition, 1978 Index 68030 microprocessor, 102 Acoustic interference, 202 Acoustic models, 190 Adaptive filters, 20'2 Address tracing, 71 Address verification, 38 ADSP-21160,14 Affine transformation, 140 Affine warp, 141 Agents, 98 Alarm-clack interrupt, 108 Algorithmic parallelism, 7, 20,55 Alg~rithmirsk~letons,27,95 AMD Sl~arc,14 AMD SHARC, 71 AMD SHARC DSPs, 17 Amdahl's law, 4,28, 50,135,198 Anarchic development, 68 AND-parallelism, AND-tree, Animation, 88 Annotated language, 70 Aperture problem, 145 Application-specific integrated circuits, 265 APTT, 81, 85 Arithmetic Coding, 133 ARM, 263 ArMenX 55 ~scetlde;/descender sequences, 47 ASICs, 31 Asymptotic behavior, 216 Asvrn~totic " distributions 216 Asynchronous pipeline, 24 ATEMPT, 59 Autoregressive, 203 R-splines, 197 Bandpass filter, 172 Beam-former, 2OZ Beam-formers, 202 Beam search, 191 Benchmarking, 61 Bernoulli distribution, 224,237 Bernoulli trials, 255 Beta testing, 57 Bi-gram, 191 Big-endian, 77 Bimodal, 235 Bimodal distribution, 42,224 Biorthagonal, 176 BlueTooth, 267 BSP, 211 Buckets, 97 Bulk Synchronous Parallelism, 30 Burke's theory, 227 Bytecode 82 c;+, 27 C40,71 ' Cache flushine - 100 Cactles, 97 Calculus of variations, 214 Categorical data type, 96 Cauchy-Schwartz inequality, 214 Cauchy distribution, 217 Central differencing, 150 Central Limit theorem, 215 Cepstrum, 190 Channels 97 Characteristic loci algorithm, 39 Characteristic maximum, 213, 217, 219 
Chi-square, 220 Class browser, 60 Client-server, 20 Clock drift, 249 Closed-form solution, 236 Clutter, 31 CMU Warp, Coarse-to-fine, 150, 158 Code vector, 179 Codebook, 179, 184 Coefficient of variation, 222 Commercial-off-the-Shelf, 17 Commercial-off-the-shelf, 264 Communicating Sequential Processes, 68, 96 Communication diameter, Compiler optimization, 220 Contputatiorlal intensity, 90 Confidence building, 68 Cotlfiderkce building, - 68 Confidence measure, 151 Connected component labeling, 53 Conservative scheduling, 236 Context-switching, 102 Continuous-flow, Continuousspeech recognition, 190 Continuous wavelet transform, 172 Convergence, 252 Convolution, Convolution patch, 153 Convolution theorern, 177 Convolving, 221 Coprocessor, 264 Correlation methods, 160 COTS, Comriarlce matrix, 167 Critical region, 97 Crusoe, 264 Crystal clocks, 249 CSP, 96, 202 Cytacomputer, Daemon, 99 DAG, 212 1)ARPA SLAAC project, 267 Data-dependent, 61 Data farming, 27 Data parallelism, 40, 180, 197 Data races, 95 Datagrams, 107 Daubechies wavelet, 175 DCOM 96 DCompose, 71 DCT, 133 Deadiock, 20, 103 Decoupled architectures, 97 Delav-cvcle analvsis 227 Derived datatypes, 100' Diagram-based display, 60 Dialect, I91 Directed acyclic graph, 212 Di~cretewent simulator, 81 Discrete wavelet transform, 171 Disparity field, 145 Distributed-memory, 75 Distribution-free, 221 Divide-and-conquer, 20 Doppler effect, 31 Double exponentid distribution, 216, 219 DSP, 71 DSP3 multiprocessor, 33 Dynamic memory allocation, 103 Eigenfaces, 72, 139 Eigenvector set, 166 Eieenvector soread 150 Eigenvectors, 141 Eigenvectors 147 Embarrassingly parallel, 28 Rmbedded system, 67 EMMAZE, 55 EMU, 59 EPOC, 263 Error map, 140 Euclidean space, 179 Event pin, 101 Event trace, 62 Event tracing, 247 Events, 97 Expectation operator, 214 Explicitly parallel instruction computing, Exponential distribution, 222, 224 Extremal statistics, 213 F-code, 70 Face images, 164 Facial geometry, 141 Factoring, 236 Farmer, 18 Fat tree, 
29 Fault-tolerance, 23 Fault-tolerant, 248 Feature detection, 140 Feedback, 21 FERET database, 143 FFT, 20 Pield-programmable gate arrays, 264 Fine-mained 71 " Fine-grained algorithms, 20 FIR, 173 First asymptotic distribution, 219 First character segmentation, 50 Fly-by-wire avionics, 18 Folded-back, 21 Formal methods, 59 Fortran, 70 Fourier derivative theorem, 161 Fourier transform, 24, 140 FPGA, 23 FPGAs, 65 Fl'ame-rate inter-conversions, 146 I i r m e in-betweening, 145 Full velocities, 154 Function inlining, 78, 141 Functional larrguage, 95 Funding agency, 67 Gabor filters, 140, 154 Gamma function, 219 Garbage collection, 71, 83,266 Gaussian density, 226 Gaussian mixture rnodel, 33 Gaussian smoothing, 147 Geometric multiplexing, Geometric parallelism, Global clock, 62 Global error number, 106 Gprof, 73, I34 Gradient-based, 147 Grand Challenge, 28 Granularity, 20, 24, 144 Granularity detectors, 71 Grid-bag, 89 Griffitbs-Jim, 202 Guided self-scheduline, 237 H.263 encoder, 134 ITand-mtled anembly, 79 Handwritten postcode recoenition, 233 Handwritten postcodes, 37 Hannover, 196 Harmonic sinusoids, 165 IIashing function, 74 Heap memory, 77 Heart motion, 146 Heavyweight process, 100 Hebbian learning, 165 Heeger algorithm, 155 Heisenberg's inequality, 178 HeNCE, 60 IIeterogeneaus PEs, 65 Keuristics, 235 IIiddert Markov models, I90 Hierarchical scheme, 225 High Performance Fortran, 70 IIigher management, 67 Histogram equalization, 22 Histogramming, 54 Hot-spot contention, 225 Hotspat compiler, 72 Human vision, Hypercuboid, 58 1860, 52,223 IBM Tangora, 191 Idioms, 99 IFR, 219 111-conditioning, 253 Incoherent sensors, 164 Increasing failure rate, 219 Industrial rnactufacturing control, 18 Inertia tensur, 160 Infinitely divisible, 211 Infinitely divisible distribution, 221 Instrumentation, 95,98 Instrumentation, 247 Integration, 59 Iotel i960, 102 Intensity function, 219 Interconnect technologies, 211 Interleaved memory, 97 Internal concurrency, 99 Interpolation, 248 
Interrupt latency, 110; Irradiance, 145; Irregular data dependencies, 22; Irregularly structured; Iterative server, 111; Jade, 70; Java, 71, 82, 265; Java RMI, 81; JavaBean, 96; JavaSpaces, 72; JIT, 82; Kali, 70; Kernel, 101; KLT, 21, 140, 164; Kolmogorov-Smirnov, 220; Language modeling, 190; Laplace-Stieltjes transform, 227; Laplacian pyramid, 147, 152, 158; Last character segmentation, 50; Latency, 40; Latency hiding, 30; Latin square, 30; Lattice-based VQ, 180; Leasing, 266; Legacy code, 57; Lightweight processes, 100; Linda program builder, 96; Linear algebra, 70; Linear least-squares error, 147; Linear programming; Linear programming, 27; Linguistic analysis, 64; Linked-list, 230; Linked list, 149; Little-endian, 77; Logical clocks, 248; Logistic distribution, 233, 238; LogP model, 30; Long-tailed, 254, 257; Look-and-feel, 85; Loop scheduling, 212; Loop unrolling, 78, 141; Lowly parallel; Lucent Orca FPGAs, 17; MAD, 59; Mahalanobis, 79; Markov order-one, 165; MasPar series; Maximum-likelihood estimate, 203; Maximum latency, 220; Maximum likelihood, 140; Mean-value statistics; Mean latency, 211; Mean of the maximum, 213; Median, 219, 252; Meiko's CM-5, 23; Meiko, 198; Meiko Computing Surface, 39; Mel-frequency, 190; Memcpy, 106; Memory-to-memory copy, 106; Memory debugger, 60; Memory leaks, 60; Message-passing; Message aggregation, 70; Message queue, 111; Message records, 97; Microphone recordings, 202; MIMD; Minsky's law, 28; MIPS R3000, 102; MIT Media Laboratory, 139; MMX; Model-based coders, 197; Model-view-controller, 112; Monitor, 97; Moore's law, 53, 190; Morlet wavelet, 178; MOS circuit, 58; Motion estimation, 35, 133; MPEG, 133; MPEG4, 35; MPI, 99; Multi-spectral analysis, 164; Multi-threaded, 71; Multicast, 20, 97; Multicasts, 266; Multimedia-extension; Multiple Instruction, Multiple Data; Multiplicative noise, 165; Multiresolution pyramid, IW; Mutex, 106; Myrinet, 17, LO>, 267; N-ary trees, 29; Nagle's algorithm, 106; Nameserver, 266; NCube2; Network Time Protocol, 256; Networks of workstations; Networks of workstations, 265;
Neural nets, 165; Neural networks, 55; Noise cancelling, 202; Non-reentrant system calls, 108; Non-uniform memory access, 212; Nondeterministic operator, 99, 107; Nonlinear regression, 220; Normal component, 156; Normal distribution, 218; Normalization, 54; NP-complete, 27; NUMA, 212; Numerical analysis, 70; Numerical differentiation, 147; Nyquist sampling, 206; Object-oriented, 30, 196; Occam, 96; OCR, 38, 54; Optical-flow, 145; Optical-flow equation, 147; Optimization, 77; Optimizing compilers, 78; OR-parallelism; Order statistics, 212; Ordering, 82; Ordering constraint, 230; Orientation, 160; Orthogonal transform, 164; Orthogonal transforms, 90; Outliers, 251; Overhead, 258; P6 family, 264; Pablo, 247; Paging, 149; Par, 70; ParaGraph, 58, 84; Parallel radar system, 68; Parallel slackness, 18; Parallelizing compilers, 70; Paramid, 171, 223, 238; Paramid machine, 52; PARASTT, 59; PARET, 58; Partitioning, 67; Path merging, 191; Pattern collation, 54; Patterns, 112; PB-frames, 133; Pentium, 264; Performance-tuning, 96; Perplexity, 33; PETAL; Petri-nets, 58; Phase-based, 147; Phase-locked loops, 256; Phase component, 154; Phased-array radar, 31; Phosphorescent dot coding, 38; Pipeline architecture, 54; Pipelining; Pisa Parallel Programming Language, 95; Plant growth, 146; Plumbing, 97; Point-to-point, 29; Pollaczek-Khintchine, 22; Pollaczek-Khintchine equation, 228; Polymorphic, 96; Polymorphism, 265; Polynomial distribution, 217; Pop-ups, 89; Population statistics, 219; Porting, 77; Post-processing, 248; PowerPCs, 102; Preemptive context switching, 98; Prediction model, 59; Presentation-abstraction-controller, 112; Principal Components Algorithm, 164; Probability density function, 213; Process algebra, 96; Processor farm, 18; Product-code VQ, 180; Profiler, 60; Profilers, 220; Prototype, 62; Prototyping, 265; Pseudo-parallel, 58; Pseudo-quadrature, 154; Pthreads, 101, 194; Purify, 77; PVM, 99, 193, 198; Quadratic classifier, 39; Quantify, 73, 148, 152; Quantisation, 137; Quantization, 196; Queueing theory, 213; Race conditions, 62; Radar, 267;
Random-number generator, 86; Random variables, 213; Randomized routing, 30; Red-zone protected, 105; Refresh interval, 254; Relational database; Relaxation, 168; Remote Method Invocation, 266; Remote procedure call, 100; Rendezvous, 96, 106; Resources, 97; Reuse, 69; RISC, 71; RISC core, 169; Roll-back, 70; Round-robin, 170; Round-robin context switching, 98; Round-trip, 251; Run-length coding, 133; Run-time executive, 109; Run-time scheduling, 70; Safe self-scheduling, 237; Sample statistics, 219; Satellite images, 164; Scalability, 264; Scalar logical clock, 108; Scatter matrix, 160; Screen flicker, 88; SCSI bus, 157; SCSI link, 100; Second asymptotic distribution, 219; Select, 107; Semantic neural network (SNN), 41; Semaphore, 97; Semi-dynamic process groups, 100; Sequent Symmetry; Sequential overhead, 50; Serializer, 97; Series/parallel graphs, 213; Service-time distribution, 82; Shadowing, 155; SHARC, 264; Shared-memory; Shift invariant, 177; Short-time Fourier transform, 172; Signal-processing; Signal handling, 101; Signatures, 191; SIMD, 2, 169; Simulation, 62; Single Instruction Multiple Data; Sink process, 99; Slant correction, 47; Sobel edge detector 1290, 751; Socket API, 106; Software component, 265; Software pipeline, 4; Software pipelining, 78; Software tools, 57; Solaris o.s., 101; Sonar, 267; Space-time diagram, 69; Sparc 20, 135; Sparc 5, 135; SparcStation 2, 139; SparcStation 20, 132; Sparse graphs, 140; Spatial filtering, 22, 82; Spatiotemporal filtering, 140; Speaker-dependent, 191; Spectrograms, 176; Spectral energy, 161; Spectral estimation, 171; Specularities, 155; Speculative searches, 25; Speech enhancement, 202; Speech recognition, 73; Speedup, 28; Speedup, 50; SPG, 213; SPV, 60; Standard deviation, 214; State-change display, 60; Static-scheduling, 27; Static profiling, 42; Store-and-forward, 21, 27; Stunted projects, 69; Subband filters, 173; Sub-pixel accuracy, 153; Subword parallelism, 264; Sum-of-squared-differences, 151; Superscalar, 14, 78, 220, 264; Symantec Café, 72; Symbian, 263; Symmetric multiprocessor, 195;
Symmetric multiprocessors, 4, 265; Symmetrical distribution, 214; Symmetrical distributions, 238; Synchronization, 253; Synchronous applications; Syntax rules, 54; Synthetic Aperture Radar, 17; Synthetic aperture radar, 165; Systolic, 5, 169; Systolic arrays; T800 transputer, 39; Tachyons, 248; Tag, 71; Tagged array, 147; Taylor expansion, 232; Telenor, 132, 139; Template, 60; Templates, 95; Temporal multiplexing, 47, 197; Temporal parallelism, 168; Texas Instruments TMS320C~0, 23; Throughput, 43, 60; Timestamping, 218; Topology; Topology, 29; Trace file, 258; Trace recorder, 110; Transducer, 202; Transim, 68; Transparency, 155; Transputer, 238; Trapezoidal self-scheduling, 237; Traversal latency, 82; Tree-structured VQ, 180; Triphone, 190; Tri-state models, 193; Triangular distribution, 238; Trie dictionary, 54; Trie search, 41; Triphone model, 33; Two-level computer, 267; Typewritten addresses, 55; Unconstrained least-squares, 202; Uni-directional ring, 22; Uni-ring, 168; Uniform distribution, 217; Unimodal distributions, 213; Uniprocessor, 30; Unitary matrix, 167; Universal Time Coordinated, 250; Unix™, 84; VAP; Variable-length encoder, 36; Variance, 61; Vector quantization, 179; Vectored messages, 107; Vectorization, 78; Venture capital, 67; Very-large instruction word, 2, 264; Video encoder, 72; Videotelephony, 132; Virtual channel system, 101, 212; Virtual memory, 72; Virtual motion, 155; Visual surveillance, 139; Viterbi search, 191; VLSI, 160; VLSI implementation, 169; VLSI implementations, 180; Von Neumann; VxWorks, 71, 101, 110; Waiting-time distribution, 227; Waiting time distributions, 81; WAP, 267; Waterfall, 68; Wavefront processors; Wavelet, 171; Weather images, 145; White noise process, 203; Wide-sense stationary, 166; Window traps, 149; Windows 95/NT, 102; Windows NT, 84; Word case classification, 50; Word extraction, 47; Wormhole routing, 29; X-window; Yosemite Valley, 187; Yourdon dataflow method, 58; Zero padding, 179; Zeta function, 215; Zipcode, 55; Zooming, 60

[...]...
real-time performance requirements can be met exactly. The granularity of parallelism is maximized, thus minimizing the design effort required to move from the sequential to the parallel implementation. Design effort is focused on each performance bottleneck of each pipeline stage in turn, by identifying the throughput, latency, and scalability

1.3 AMDAHL'S LAW AND STRUCTURED PARALLEL DESIGN

Amdahl's...

of processors. Typical low-level image-processing operations such as convolution and filtering can then be carried out independently on each sub-image, requiring reference only to the four nearest neighbor processors for boundary information. To adapt such operations to a processor farm, the required boundary information for each processor can be included in the original data packet sent to the processor...

PPF design process to achieve a scalable parallel implementation of the algorithm with analytically defined performance bounds.

1.5 CONCLUSIONS

The primary requirement in parallelizing embedded applications is to meet a particular specification for throughput and latency. The Pipeline Processor Farm (PPF) design model maps conveniently onto the software structure of many continuous data flow embedded...

effective system parallelization requires a method of minimizing the impact of residual sequential code, as well as of parallelizing the bulk of the application algorithm. In the PPF design methodology, pipelining is used to overlap residual sequential code execution with other forms of parallelism.

1.4 INTRODUCTION TO PPF SYSTEMS

A PPF is a software pipeline intended for recent, accessible, parallel machines...
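The excerpt on processor farms notes that boundary information can travel inside each data packet. The following Python sketch illustrates that idea under stated assumptions: a 3x3 mean filter stands in for the convolution, the image is split into horizontal strips, and the names `box3`, `make_packets`, `worker`, and `farm` are hypothetical rather than taken from the book. Workers run one after another here; in a real farm each packet would be sent to a separate processor.

```python
# Illustrative sketch only: function names and the strip layout are assumptions.

def box3(image):
    """3x3 mean filter with edge replication; returns a new image."""
    h, w = len(image), len(image[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            total = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)  # clamp to image edge
                    xx = min(max(x + dx, 0), w - 1)
                    total += image[yy][xx]
            row.append(total / 9)
        out.append(row)
    return out


def make_packets(image, n_workers):
    """Farmer side: cut the image into strips, each with its boundary rows."""
    h = len(image)
    size = -(-h // n_workers)  # ceiling division: rows per strip
    packets = []
    for start in range(0, h, size):
        stop = min(start + size, h)
        lo, hi = max(start - 1, 0), min(stop + 1, h)  # add one halo row each side
        # (halo rows to skip on top, rows owned by this strip, data rows)
        packets.append((start - lo, stop - start, image[lo:hi]))
    return packets


def worker(packet):
    """Worker side: filter the packet, then discard the halo rows."""
    skip, count, rows = packet
    return box3(rows)[skip:skip + count]


def farm(image, n_workers):
    """Process every packet independently and reassemble in strip order."""
    out = []
    for packet in make_packets(image, n_workers):
        out.extend(worker(packet))
    return out
```

Because each packet carries its own halo rows, a worker never needs to contact a neighbour, at the cost of sending up to two extra rows per packet.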
for example a radar processor which must always monitor air traffic. These systems frequently need to meet a variety of throughput, latency, and output-ordering specifications. It becomes necessary to be able to predict performance, and to provide a structure which permits performance scaling, by incremental addition of processors and/or transfer to higher performance hardware once the initial design...

for discrete cosine transformation (DCT), motion estimation and compensation, various filters, quantizers, variable length coding, and inverse versions of several of these algorithms. Very few papers addressed the issue of parallelizing complete systems, in which individual algorithm parallelization could be exploited as components. Therefore, a clue to an appropriate generic parallel architecture for...

Parallelization of the KLT
9.1.4 PPF parallelization
9.1.5 Implementation
9.2 Case Study 2: 2D-Wavelet Transform
9.2.1 Wavelet Transform
9.2.2 Computational algorithms
9.2.3 Parallel implementation of Discrete Wavelet Transform (DWT)
9.2.4 Parallel implementation of oversampled WT
9.3 Case Study 3: Vector Quantization
9.3.1 Parallelization of VQ
9.3.2 PPF schemes for VQ
9.3.3 VQ implementation...

Contents
Foreword
Preface
Acknowledgments
Acronyms
Part I Introduction and Basic Concepts
1 Introduction
1.1 Overview
1.2 Origins
1.3 Amdahl's Law and Structured Parallel Design
1.4 Introduction to PPF Systems
1.5 Conclusions
Appendix
A.1 Simple Design Example: The H.261 Decoder
2 Basic Concepts
2.1 Pipelined Processing
2.2 2.3 2.4...
experience their latency in parallel. Geometric parallelism (decomposition by some partition of the data) and algorithmic parallelism (decomposition by function) are the two main possibilities available for irregularly structured code on medium-grained processors. After geometric decomposition, data must be multiplexed by a farmer process across the processor farm, which is why in PPF data parallelism is alternatively...

algorithmic parallelism does have a role in certain applications, which is why it is not discounted in PPF. For example, pattern matching may employ a parallel search [202], a form of OR-parallelism, whereby alternative searches take place though only the results of successful searches are retained. Dataflow computers [340] have been proposed as a way of exploiting the parallelism inherent in irregularly structured...

Science, for inclusion of the following: portions reprinted from Microprocessors and Microsystems, 21, A. Cuhadar, A. C. Downton, and M. Fleury, A structured parallel design for embedded vision systems:...

in a performance surface for varying numbers of processors, setup times, and calculation times. For small numbers of processors there is little difference in predicted performance for all forms...
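A farmer that multiplexes data across a farm, and a pipeline built from such farms, can be sketched with threads and queues. This is an illustrative model only: threads stand in for processors, and `farm_stage`, `run_pipeline`, and the sentinel `_DONE` are assumed names, not the book's API. Sequence numbers are attached to each item so that output ordering can be restored after workers finish out of order.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the data stream (assumed convention)


def farm_stage(work, n_workers, inbox, outbox):
    """Start one farm: n_workers threads pull (seq, item) from inbox on demand."""

    def run():
        while True:
            msg = inbox.get()
            if msg is _DONE:
                inbox.put(_DONE)  # pass the sentinel on to sibling workers
                break
            seq, item = msg
            outbox.put((seq, work(item)))

    threads = [threading.Thread(target=run) for _ in range(n_workers)]
    for t in threads:
        t.start()

    def closer():
        # Only when every worker has finished is the next stage told to stop.
        for t in threads:
            t.join()
        outbox.put(_DONE)

    threading.Thread(target=closer).start()


def run_pipeline(items, stages):
    """stages is a list of (function, worker_count); returns results in input order."""
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    for (work, n_workers), inbox, outbox in zip(stages, queues, queues[1:]):
        farm_stage(work, n_workers, inbox, outbox)

    # The farmer role: feed numbered work packets into the first stage.
    for seq, item in enumerate(items):
        queues[0].put((seq, item))
    queues[0].put(_DONE)

    results = []
    while True:
        msg = queues[-1].get()
        if msg is _DONE:
            break
        results.append(msg)
    # Restore the original ordering using the sequence numbers.
    return [value for _, value in sorted(results, key=lambda m: m[0])]
```

For example, `run_pipeline(range(10), [(f, 3), (g, 2)])` models a two-stage pipeline whose first farm has three workers and whose second has two; the worker counts per stage are the tuning knob when balancing stage throughput.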

Format
Pages	328
File size	4.66 MB