
Heterogeneous Multicore Processor Technologies for Embedded Systems


Heterogeneous Multicore Processor Technologies for Embedded Systems

Kunio Uchiyama, Research and Development Group, Hitachi, Ltd., 1-6-1 Marunouchi, Chiyoda-ku, Tokyo 100-8220, Japan
Fumio Arakawa, Renesas Electronics Corp., 5-20-1 Josuihon-cho, Kodaira-shi, Tokyo 187-8588, Japan
Hironori Kasahara, Green Computing Systems R&D Center, Waseda University, 27 Waseda-machi, Shinjuku-ku, Tokyo 162-0042, Japan
Tohru Nojiri, Central Research Lab., Hitachi, Ltd., 1-280 Higashi-koigakubo, Kokubunji-shi, Tokyo 185-8601, Japan
Hideyuki Noda, Renesas Electronics Corp., 4-1-3 Mizuhara, Itami-shi, Hyogo 664-0005, Japan
Yasuhiro Tawara, Renesas Electronics Corp., 5-20-1 Josuihon-cho, Kodaira-shi, Tokyo 187-8588, Japan
Akio Idehara, Nagoya Works, Mitsubishi Electric Corp., 1-14 Yada-minami 5-chome, Higashi-ku, Nagoya 461-8670, Japan
Kenichi Iwata, Renesas Electronics Corp., 5-20-1 Josuihon-cho, Kodaira, Tokyo 187-8588, Japan
Hiroaki Shikano, Central Research Lab., Hitachi, Ltd., 1-280 Higashi-koigakubo, Kokubunji-shi, Tokyo 185-8601, Japan

ISBN 978-1-4614-0283-1
ISBN 978-1-4614-0284-8 (eBook)
DOI 10.1007/978-1-4614-0284-8
Springer New York Heidelberg Dordrecht London
Library of Congress Control Number: 2012932273

© Springer Science+Business Media New York 2012. This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher's location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein. Printed on acid-free paper. Springer is part of Springer Science+Business Media (www.springer.com)

Preface

The expression "Digital Convergence" was coined in the mid-1990s and became a topic of discussion. Now, in the twenty-first century, the "Digital Convergence" era of various embedded systems has begun. This trend is especially noticeable in digital consumer products such as cellular phones, digital cameras, digital players, car navigation systems, and digital TVs. That is, various kinds of digital applications are now converged and executed on a single device. For example, several video standards such as MPEG-2, MPEG-4, H.264, and VC-1 exist, and digital players need to encode and decode these multiple formats. There are even more standards for audio, and newer ones are continually being proposed. In addition, recognition and
synthesis technologies have recently been added. The latest digital TVs and DVD recorders can even extract goal-scoring scenes from soccer matches using audio and image recognition technologies. Therefore, a System-on-a-Chip (SoC) embedded in a digital-convergence system needs to execute countless tasks such as media, recognition, information, and communication processing.

Digital convergence requires, and will continue to require, higher performance in various kinds of applications such as media and recognition processing. The problem is that any improvements in the operating frequency of current embedded CPUs, DSPs, or media processors will not be sufficient in the future because of power-consumption limits. We cannot expect a single processor with an acceptable level of power consumption to run applications at high performance. One solution that achieves high performance at low power consumption is to develop special hardware accelerators for limited applications, such as the processing of standardized formats like MPEG video. However, the hardware-accelerator approach is not efficient enough for processing many of the standardized formats. Furthermore, we need to find a more flexible solution for processing newly developed algorithms, such as those for media recognition.

To satisfy the higher requirements of digitally converged embedded systems, this book proposes heterogeneous multicore technology that uses various kinds of low-power embedded processor cores on a single chip. With this technology, heterogeneous parallelism can be implemented on an SoC, and we can then achieve greater flexibility and superior performance per watt. This book defines the heterogeneous multicore architecture and explains in detail several embedded processor cores, including CPU cores and special-purpose processor cores that achieve high arithmetic-level parallelism. We developed three multicore chips (called RP-1, RP-2, and RP-X) according to the defined architecture with the introduced processor cores. The chip implementations, software environments, and applications running on the chips are also explained in the book. We, the authors, hope that this book is helpful to all readers who are interested in embedded-type multicore chips and the advanced embedded systems that use these chips.

Kokubunji, Japan
Kunio Uchiyama

Acknowledgments

A book like this cannot be written without the help, in one way or another, of many people and organizations. First, part of the research and development on the heterogeneous multicore processor technologies introduced in this book was supported by three NEDO (New Energy and Industrial Technology Development Organization) projects: "Advanced heterogeneous multiprocessor," "Multicore processors for real-time consumer electronics," and "Heterogeneous multicore technology for information appliances." The authors greatly appreciate this support. The R&D process on heterogeneous multicore technologies involved many researchers and engineers from Hitachi, Ltd., Renesas Electronics Corp., Waseda University, Tokyo Institute of Technology, and Mitsubishi Electric Corp. The authors would like to express sincere gratitude to all the members of these organizations associated with the projects. We give special thanks to Prof. Hideo Maejima of Tokyo Institute of Technology, Prof. Keiji Kimura of Waseda University, Dr. Toshihiro Hattori, Mr. Osamu Nishii, Mr. Masayuki Ito, Mr. Yusuke Nitta, Mr. Yutaka Yoshida, Mr. Tatsuya Kamei, Mr. Yasuhiko Saito, and Mr. Atsushi Hasegawa of Renesas Electronics Corp., Mr. Shiro Hosotani of Mitsubishi Electric Corp., and Mr. Toshihiko Odaka, Dr. Naohiko Irie, Dr. Hiroyuki Mizuno, Mr. Masaki Ito, Mr. Koichi Terada, Dr. Makoto Satoh, Dr. Tetsuya Yamada, Dr. Makoto Ishikawa, Mr. Tetsuro Hommura, and Mr. Keisuke Toyama of Hitachi, Ltd. for their efforts in leading the R&D process. Finally, the authors thank Mr. Charles Glaser and the team at Springer for their efforts in publishing this book.

Contents

1 Background  1
1.1 Era of Digital Convergence  1
1.2 Heterogeneous Parallelism Based on Embedded Processors
References

2 Heterogeneous Multicore Architecture  11
2.1 Architecture Model  11
2.2 Address Space  16
References  18

3 Processor Cores  19
3.1 Embedded CPU Cores  19
3.1.1 SuperH™ RISC Engine Family Processor Cores  20
3.1.2 Efficient Parallelization of SH-4  22
3.1.3 Efficient Frequency Enhancement of SH-X  32
3.1.4 Frequency and Efficiency Enhancement of SH-X2  42
3.1.5 Efficient Parallelization of SH-4 FPU  44
3.1.6 Efficient Frequency Enhancement of SH-X FPU  56
3.1.7 Multicore Architecture of SH-X3  67
3.1.8 Efficient ISA and Address-Space Extension of SH-X4  69
3.2 Flexible Engine/Generic ALU Array (FE–GA)  74
3.2.1 Architecture Overview  75
3.2.2 Arithmetic Blocks  77
3.2.3 Memory Blocks and Internal Network  78
3.2.4 Sequence Manager and Configuration Manager  80
3.2.5 Operation Flow of FE–GA  82
3.2.6 Software Development Environment  83
3.2.7 Implementation of Fast Fourier Transform on FE–GA  85
3.3 Matrix Engine (MX)  88
3.3.1 MX-1  89
3.3.2 MX-2  97
3.4 Video Processing Unit  101
3.4.1 Introduction  101
3.4.2 Video Codec Architecture  102
3.4.3 Processor Elements  111
3.4.4 Implementation Results  117
3.4.5 Conclusion  118
References  119

4 Chip Implementations  123
4.1 Multicore SoC with Highly Efficient Cores  123
4.2 RP-1 Prototype Chip  126
4.2.1 RP-1 Specifications  127
4.2.2 SH-X3 Cluster  128
4.2.3 Dynamic Power Management  128
4.2.4 Core Snoop Sequence Optimization  129
4.2.5 SuperHyway Bus  131
4.2.6 Chip Integration  132
4.2.7 Performance Evaluations  134
4.3 RP-2 Prototype Chip  136
4.3.1 RP-2 Specifications  136
4.3.2 Power Domain and Partial Power-Off  137
4.3.3 Synchronization Support Hardware  138
4.3.4 Interrupt Handling for Multicore  140
4.3.5 Chip Integration and Evaluation  141
4.4 RP-X Prototype Chip  143
4.4.1 RP-X Specifications  143
4.4.2 Dynamically Reconfigurable Processor FE–GA  145
4.4.3 Massively Parallel Processor MX-2  146
4.4.4 Programmable Video Processing Core VPU5  146
4.4.5 Global Clock Tree Optimization  147
4.4.6 Memory Interface Optimization  148
4.4.7 Chip Integration and Evaluation  149
References  150

5 Software Environments  153
5.1 Linux® on Multicore Processor  153
5.1.1 Porting SMP Linux  153
5.1.2 Power-Saving Features  157
5.1.3 Physical Address Extension  161
5.2 Domain-Partitioning System  165
5.2.1 Introduction  165
5.2.2 Trends in Embedded Systems  166

...image of the original input image from the USB camera. The size of the input image is 320 × 240 pixels. A smoothed image is shown at the upper right. The lower-left image shows the edge-detection effect, and the lower-right image shows the corner-detection effect. The images are written to the frame buffer in Linux using the X graphics functions of the Xlib library.

6.4 Video Image Search

One example of the systems utilizing the multicore chip is a video image search system. A detailed implementation of the system on the multicore chip RP-X [17] is described in this chapter. It offers video-stream playback with a graphical operation interface, as well as a similar-image search [18] that recognizes faces while playing back video. It makes full use of the heterogeneous cores, using the video processing unit (VPU) to play video streams and the SH-4A cores to perform image recognition.

Figure 6.35 shows a block diagram of the implemented video image search system on the chip. The system operates two different operating systems, uITRON and Linux, over a hypervisor in order to manage the physical resources of the chip. The hypervisor is a programming layer lower than the operating systems [19]. The two operating systems use a common shared memory for their intercommunication.

Fig 6.35 Block diagram of developed video image search system (IPTV: Internet Protocol television; GUI: graphical user interface; VPU: video processing unit; VEU: video engine unit; BEU: blend engine unit; LCDC: display controller; MMU: memory management unit; DVI: digital visual interface; SATA: Serial Advanced Technology Attachment; USB: Universal Serial Bus; HDD: hard disk drive; HID: human interface device)

Programs running on uITRON process the playback of motion pictures by utilizing image-processing cores such as the VPU, the video engine unit (VEU), the blend engine unit (BEU), and the display controller (LCDC) on RP-X. They also perform image synthesis of a graphics plane and the motion pictures, and generate output images for a monitor connected to the digital visual interface (DVI). Programs operating on Linux perform a similar-image search of detected faces, user-interface control, and graphics-plane depiction that is synthesized with an output image of the similar-image search.

Figure 6.36 shows the processing flow of the image synthesis. First, the decoded image of an input video stream is generated, and the size and position of the image are adjusted to create a video plane on uITRON. Then, images used for the similar-image search and a mouse-pointer trail are generated to create a graphics plane on Linux. The synthesis of the video plane and the graphics plane is based on an α plane that specifies the transparent parts of the graphics plane synthesized with the video plane. The α plane is created on Linux, and it is stored in the DDR3 memory shared with Linux.

Fig 6.36 Processing flow of image synthesis

Fig 6.37 Data flow of uITRON system and utilized hardware IP cores
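The α-plane synthesis described above can be sketched as a per-pixel blend in which the α value selects how much of the graphics plane covers the video plane. This is an illustrative sketch only: the flat buffer layout, the 8-bit α convention, and all names are assumptions, not the BEU's actual interface.

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative plane synthesis: alpha = 0 marks a transparent part of the
 * graphics plane (the video plane shows through); alpha = 255 shows the
 * graphics plane opaquely. Rounded 8-bit fixed-point blend. */
void blend_planes(const uint8_t *video, const uint8_t *graphics,
                  const uint8_t *alpha, uint8_t *out, size_t n_pixels)
{
    for (size_t i = 0; i < n_pixels; i++) {
        out[i] = (uint8_t)((graphics[i] * alpha[i] +
                            video[i] * (255 - alpha[i]) + 127) / 255);
    }
}
```

In the real system the planes are frame buffers in the DDR3 memory and the blend runs in the BEU hardware; the loop above only mirrors the arithmetic, one sample at a time.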
6.4.1 Implementation of Main Functions

The system on uITRON plays back motion pictures, carries out the image scaling and synthesis, and outputs the image to a monitor; these are the main functions of the video image search. Figure 6.37 illustrates the data flow of the system on uITRON, together with the utilized hardware IP cores. The VPU that decodes video streams supports multiple video codecs, such as H.264, MPEG-2, and MPEG-4; the codec used by the system is MPEG-2. The VEU reads an image placed in a specified area of the memory, enlarges or reduces the image, and writes it back to a specified area of the memory. The BEU reads three images placed in specified areas of the memory, blends them, and writes the result to a specified area; the implemented system uses the BEU's blending of two images. The LCDC reads an image in a specified area of the memory and transmits it to a display device; the system uses a DVI interface for the transmission.

The implementation details of the five main functions of the uITRON system are described as follows:

1. MPEG-2 decoding
2. Still-image capturing
3. Image scaling
4. Video image and graphics synthesizing
5. Output image controlling

First, the MPEG-2 decoding is processed on the VPU using a frame buffer of decoding data whose size corresponds to four frames of the video image. The VPU starts the decoding frame by frame when one frame of an input data stream is obtained from the memory, and it stores the decoded image to one of the four frames in the frame buffer. The still-image capturing duplicates the decoded image to the frame buffer of the captured image at every decoded frame. The buffer of the captured image is shared between uITRON and Linux in the memory; therefore, a program on Linux can obtain a decoded image at any time.
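Two steps of this uITRON-side flow can be sketched in C: the four-frame decoding buffer with the capture copy shared with Linux, and the derivation of the scaling factors that the VEU needs to bring a decoded frame to the 720 × 480 video plane. All structure, macro, and function names, and the 4:2:0 frame size, are illustrative assumptions, not the book's code.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical buffer layout: the VPU writes each decoded frame into one
 * of four slots, and still-image capture mirrors the frame into a buffer
 * shared with Linux, so the Linux side can read a frame at any time. */
#define NUM_DECODE_FRAMES 4
#define FRAME_BYTES (320 * 240 * 3 / 2)   /* e.g. one YCbCr 4:2:0 frame */

typedef struct {
    uint8_t decode[NUM_DECODE_FRAMES][FRAME_BYTES]; /* uITRON-private */
    uint8_t captured[FRAME_BYTES];                  /* shared with Linux */
    int     next_slot;
} frame_buffers;

/* Store a decoded frame round-robin and mirror it for capture;
 * returns the slot that received the frame. */
int store_decoded_frame(frame_buffers *fb, const uint8_t *frame)
{
    int slot = fb->next_slot;
    memcpy(fb->decode[slot], frame, FRAME_BYTES);
    memcpy(fb->captured, frame, FRAME_BYTES);   /* still-image capture */
    fb->next_slot = (slot + 1) % NUM_DECODE_FRAMES;
    return slot;
}

/* Horizontal/vertical factors that map src_w x src_h onto 720 x 480. */
void scaling_factors(int src_w, int src_h, double *fx, double *fy)
{
    *fx = 720.0 / src_w;
    *fy = 480.0 / src_h;
}
```

For a 960 × 540 source this yields 0.75 horizontally and roughly 0.89 vertically, and for a 320 × 240 source 2.25 and 2.00, matching the values quoted in the text.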
The image scaling also duplicates the decoded image, to the frame buffer of the decoded data, at every decoded frame. Since the adjusted image size for scaling is set to 720 × 480, scaling factors in both the horizontal and vertical directions are calculated from the decoded image size and set in the VEU. For example, when the size of the image is 720 × 480, the scaling factors are set to 1.00 and 1.00 in the horizontal and vertical directions, respectively. In the same manner, when the size is 960 × 540, the scaling factors are set to 0.75 and 0.89, and when the size is 320 × 240, the factors are 2.25 and 2.00. After start-up, the VEU reads an image from the frame buffer of the decoded data, adjusts its size according to the specified scaling factors, and writes the scaled image, whose size is 720 × 480, to the frame buffer of the scaled data.

The video image and graphics synthesizing process uses the image data in the frame buffer of the scaled data, as well as the graphics data in the frame buffer of graphics, and blends them in the BEU. The size of these frame buffers is 1,024 × 768. When a scaled image is stored in the frame buffer of the scaled data, the BEU starts the blending and writes the synthesized image to the frame buffer of the synthesized data. The graphics frame buffer is placed in the memory area shared by both uITRON and Linux and can therefore be updated on Linux at any time.

Finally, the output image control sets up the LCDC and a DVI transmitter to convert the synthesized image stored in the frame buffer into video signals that are transmitted to the monitor via the DVI interface.

Figure 6.38 illustrates the processing flow of the uITRON system. The process is repeated, from supplying the video stream to copying the frame buffer of decoding data to that of a still-captured image.

6.4.2 Implementation of Face Recognition and GUI Controls

The system on Linux performs face recognition by utilizing the similar-image search, pointing
device detection, and GUI controls to create a graphics plane generated by the face recognition. Figure 6.39 depicts a block diagram of the Linux system, which comprises the following five functions:

1. Similar-image search
2. Face detection
3. Event processing
4. Image object management
5. Image processing

Fig 6.38 Processing flow of uITRON system

Fig 6.39 Block diagram of Linux system

Fig 6.40 Processing flow of Linux application programs

The similar-image search consists of feature calculation, in which the feature value of a face image is calculated; registering, in which faces are registered in a database created on a hard disk drive; deletion, in which a face entry in the database is deleted; and image search, in which similar face images are searched for in the database. The face detection utilizes a face-detection function offered by Intel's OpenCV [20], which is a general image-processing library. The event processing consists of mouse event detection, which detects the operation of a pointing device, and internal event generation, which starts the face detection upon a detected mouse event. The image object management manages objects of the still image obtained from uITRON via the shared memory and of the image generated by the face detection. It also manages the depiction of mouse trails detected by the event processing and the generation of the α plane that determines the synthesizing position of the video plane and the graphics plane. Finally, the image processing performs trimming, which trims a specified range of an image; scaling, which enlarges or reduces the size of an image; YUV–RGB conversion, which converts the color format of an image; and frame depiction, which makes it possible to draw a shape on a face-detected area.

Figure 6.40 shows the processing flow of the Linux application programs. First, the image objects displayed on the graphics plane are initialized. Then the operation of a mouse connected via the USB interface is detected by a device driver embedded in the Linux kernel. The device driver outputs on/off values of each mouse button and the distance of mouse movement. The mouse event detection classifies three events of mouse-button operation: PUSH, REPEAT, and RELEASE. Furthermore, it converts the movement distance into coordinate values. The internal event generation is processed in accordance with the values generated by the mouse event detection. The defined internal events include: no event, still-image capturing, face detection, similar-image display, similar-image search, similar-image registering, and similar-image deletion.

Table 6.7 Measured average execution time of Linux system processes

    Process                                                        Time consumed (s)
    Initialization                                                 0.0506
    Still-image capturing                                          0.0088
    Face-region display                                            0.0381
    Face detection                                                 1.6072
    Thumbnail display                                              0.0176
    Similar-image display (top ten images)                         0.0798
    Similar-image database access: registering                     0.6348
    Similar-image database access: feature calculation and search  0.5857
    Similar-image database access: deletion                        0.2530

When a mouse event is detected on a video-plane area, the still-image capturing event is generated, and a still image captured from the decoded video images is obtained as a still-image object. The graphics plane is updated in order to display the newly captured image. Then the area of the image selected by the mouse is trimmed. The trimmed image is treated as a face-region image object, and the graphics plane is updated again. The face detection uses the face-region image object, and a frame shape is drawn on the area of each detected face. When a mouse event is detected on the still-image object or a similar-image object, the face detection is carried out by using these two objects. When a mouse event is detected on a thumbnail image object, the thumbnail image given in the event is displayed as a similar image. When one is detected on a framed face of the face image object, the face-framed part of the image is trimmed. The trimmed image is converted in image format in order to calculate the feature value, and the calculation is performed. Then the similar-image search is carried out by using the calculated feature value, and the top ten similar images are displayed.
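The button-state classification that drives these internal events can be sketched as follows. The PUSH/REPEAT/RELEASE names come from the text; the state structure, the 1,024 × 768 clamping range, and the function names are illustrative assumptions.

```c
/* Hypothetical mouse-event detection: the driver reports an on/off value
 * per button and a relative movement; transitions of the on/off value are
 * classified as PUSH, REPEAT, or RELEASE while the movement is accumulated
 * into absolute coordinates on the 1024 x 768 graphics plane. */
typedef enum { EV_NONE, EV_PUSH, EV_REPEAT, EV_RELEASE } mouse_event;

typedef struct { int prev_on; int x, y; } mouse_state;

static int clamp(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

mouse_event mouse_detect(mouse_state *st, int on, int dx, int dy)
{
    mouse_event ev = EV_NONE;

    /* convert the relative movement into screen coordinates */
    st->x = clamp(st->x + dx, 0, 1023);
    st->y = clamp(st->y + dy, 0, 767);

    if (on && !st->prev_on)      ev = EV_PUSH;     /* off -> on   */
    else if (on && st->prev_on)  ev = EV_REPEAT;   /* button held */
    else if (!on && st->prev_on) ev = EV_RELEASE;  /* on -> off   */
    st->prev_on = on;
    return ev;
}
```

A dispatcher in the style of the text would then map each event, together with the plane area under the coordinates, to one of the internal events (still-image capturing, face detection, similar-image search, and so on).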
When a mouse event is detected on the framed face, the face image is registered in the similar-image database. When one is detected on a thumbnail image, the entry of that image is deleted from the database.

The execution time of each process on the Linux system was measured; Table 6.7 lists the average time for the processes. The face detection required 1.6 s, and access to the similar-image database took more than 0.5 s. The time for such processes depends on parameters related to the detection accuracy of faces, and a subjective evaluation of the system determined the parameters for practical use. Figure 6.41 shows the appearance of the developed video image search system.

Fig 6.41 Appearance of developed video image search system

References

1. ISO/IEC 13818-7:1997 (1997) Information technology—Generic coding of moving pictures and associated audio information—Part 7: Advanced audio coding (AAC), ISO
2. Kodama T, Tsunoda T, Takada M, Tanaka H, Akita Y, Sato M, Ito M (2006) Flexible engine: a dynamic reconfigurable accelerator with high performance and low power consumption. Proc IEEE Symp Low-Power and High-Speed Chips (COOL Chips IX), pp 393–408
3. Yoshida Y, Kamei T, Hayase K, Shibahara S, Nishii O, Hattori T, Hasegawa A, Takada M, Irie N, Uchiyama K, Odaka T, Takada K, Kimura K, Kasahara H (2007) A 4320 MIPS four-processor core SMP/AMP with individually managed clock frequency for low power consumption. IEEE Int Solid-State Circuits Conf (ISSCC) Dig Tech Papers, pp 100–101
4. Shikano H, Ito M, Todaka T, Tsunoda T, Kodama T, Onouchi M, Uchiyama K, Odaka T, Kamei T, Nagahama E, Kusaoke M, Wada Y, Kimura K, Kasahara H (2008) Heterogeneous multicore architecture that enables 54x AAC-LC stereo encoding. IEEE J Solid-State Circuits 43(4)
5. Sugimura T, et al (2008) High performance and low-power FFT on super parallel processor (MX) for mobile multimedia applications. Digest of ISPACS 2008, pp 146–149
6. Sato Y, et al (2009) Integral-image based implementation of U-SURF algorithm for embedded super parallel processor. Digest of ISPACS 2009, pp 485–488
7. Yamazaki H, et al (2010) An energy-efficient massively parallel embedded processor core for real-time image processing SoC. Proc IEEE Symp Low-Power and High-Speed Chips, pp 398–409
8. Kamijo S, et al (2000) Traffic monitoring and accident detection at intersections. IEEE Trans ITS 1(2):108–118
9. Lenharth A (2003) Linux scheduler in kernel 2.4 and 2.5, May 26, 2003
10. Molnar I, http://people.redhat.com/mingo/O(1)-scheduler/README
11. Aas J (2005) Understanding the Linux 2.6.8.1 CPU scheduler, February 17, 2005
12. ALPBench, http://rsim.cs.illinois.edu/alp/alpbench/
13. xosview, http://sourceforge.net/projects/xosview/
14. Smith SM, Brady JM (1997) SUSAN—a new approach to low level image processing. Int J Computer Vision 23(1)
15. Guthaus MR, et al. MiBench: a free, commercially representative embedded benchmark suite
16. luvcview, http://mxhaard.free.fr/spca50x/Investigation/uvc/luvcview-20070512.tar.gz
17. Yuyama Y, Ito M, Kiyoshige Y, Nitta Y, Matsui S, Nishii O, Hasegawa A, Ishikawa M, Yamada T, Miyakoshi J, Terada K, Nojiri T, Satoh M, Mizuno H, Uchiyama K, Wada Y, Kimura K, Kasahara H, Maejima H (2010) A 45 nm 37.3 GOPS/W heterogeneous multi-core SoC. IEEE Int Solid-State Circuits Conf (ISSCC 2010), San Francisco, Feb 2010
18. Matsubara D, Hiroike A (2009) High-speed similarity-based image retrieval with data-alignment optimization using self-organization algorithm. Proc 11th IEEE Int Symp on Multimedia (ISM 2009), pp 312–317
19. Nojiri T, Kondo Y, Irie N, Ito M, Sasaki H, Maejima H (2009) Domain partitioning technology for embedded multicore processors. IEEE Micro 29(6):7–17
20. OpenCV library, http://sourceforge.net/projects/opencvlibrary

Index

A
AAC See Advanced audio codec (AAC)
Access checklist (ACL), 172–175
ACL See Access checklist (ACL)
Address extension, 153, 161–165
Advanced
audio codec (AAC), 1, 179–187
Affine transformation, 54, 63
ALPBench, 200
ALU See Arithmetic logical unit (ALU)
AMP See Asymmetric multiprocessor (AMP)
ANSI/IEEE 754, 46, 57, 62
Area efficiency, 6, 19, 31, 41, 56, 65, 66, 73, 89, 91, 93, 94
Arithmetic logical unit (ALU), 6, 19, 25, 28, 35, 74–88, 90, 97–99, 117, 143, 145–147
Asymmetric multiprocessor (AMP), 22, 67, 69, 127
Atomic operation, 154–157

B
BARR, 138, 139
BARW, 138, 139
BEU See Blend engine unit (BEU)
BHT See Branch history table (BHT)
Blend engine unit (BEU), 210–214
Bourne shell, 197
Branch history table (BHT), 32, 33, 37, 38
Branch prediction, 24, 32, 33, 36–38, 41, 43
Branch target buffer (BTB), 24, 32, 33
BTB See Branch target buffer (BTB)
Butterfly calculation, 85–87

C
CABAC See Context-adaptive binary arithmetic coding (CABAC)
Cache coherency, 68, 127–129, 135–137, 193, 194
CAVLC See Context-adaptive variable-length coding (CAVLC)
Centralized shared memory (CSM), 12–17, 127, 128, 136, 137, 139, 180, 182, 183
CFGM See Configuration manager (CFGM)
CISC See Complex instruction set computer (CISC)
Clock gating, 38, 39, 43, 44, 110, 147, 150
Cluster, 13–16, 67, 69, 127, 128, 132, 133, 136, 139, 141, 142, 198
CMOS technology
CODEC, 7, 15, 21, 101–111, 113, 117–119, 146, 147, 172, 212
Coherency, 68, 127–129, 135–137, 193, 194
Complex instruction set computer (CISC), 23
Configuration manager (CFGM), 6, 74–77, 80–82, 145
Context-adaptive binary arithmetic coding (CABAC), 7, 101, 104, 106, 107, 113, 115
Context-adaptive variable-length coding (CAVLC), 101, 106, 107
Cooley–Tukey algorithm, 85
CPU, 1, 5–7, 14, 16, 57, 74–76, 78, 80–85, 89, 127, 137, 144, 154, 157, 165, 166, 170, 175, 176, 179, 180, 182–190, 194–195, 197, 200–202
CPU core, 4, 5, 7, 11, 13, 14, 16, 17, 19–21, 126, 135–138, 140, 142, 150, 154, 157, 159, 161, 166, 168, 170, 171, 174, 176, 179, 180, 193–195, 198, 200
CPUfreq, 157–159, 199–201
CPU Hot Add, 158, 159, 200, 205
CPU Hot-plug, 157–159, 199–201, 205
CPU Remove, 158–160, 200, 205
CSM See Centralized shared memory (CSM)

D
DAA See Duplicated data array (DAA)
Data flow graph (DFG), 84–86
Data transfer unit (DTU), 11–17, 68, 69, 73–74, 102, 144, 179, 180, 182–185, 187
DDR3, 14, 16, 148, 170, 211
DEB See Deblocking filters (DEB)
Deblocking filters (DEB), 7, 102, 104, 110, 112, 115, 117
Delayed branch, 24
Delayed execution, 24, 33–36, 41, 61
Development efficiency, 20
DFG See Data flow graph (DFG)
3D graphics, 21, 44–47, 54–57, 60, 63–67
Dhrystone, 20, 30, 31, 40, 43, 56, 68, 69, 71, 123, 127, 136, 141, 143, 144, 149, 159
Digital convergence, 1–3
Digital visual interface (DVI), 210–213
Direct memory access controller (DMAC), 11, 13, 16, 20, 73, 74, 76, 113, 128, 147, 168, 171, 172, 174, 179, 180, 184, 185, 189
Display controller (LCDC), 210, 211
Display unit (DU), 127, 168, 171, 195, 196, 203
DMAC See Direct memory access controller (DMAC)
Domain, 102–106, 110, 119, 126, 137–138, 146, 147, 194, 198, 201
Domain partitioning, 165–176
DSP, 1, 6, 89, 90, 101
DTU See Data transfer unit (DTU)
DU See Display unit (DU)
Duplicated data array (DAA), 68, 127, 128, 130, 131
DVD, 1
DVFS See Dynamic voltage and frequency scaling (DVFS)
DVI See Digital visual interface (DVI)
Dynamically reconfigurable processor, 6, 11, 15, 74, 144–146
Dynamic voltage and frequency scaling (DVFS), 194, 198–201

E
Early-stage branch, 24, 25, 29–31, 43
ESI modes, 68

F
Face recognition, 210, 213–217
Fast Fourier transform (FFT), 6, 71, 72, 85–88, 92, 135
FDL See Flexible-Engine Description Language (FDL)
FE See Flexible engine (FE)
FE–GA See Flexible engine/generic ALU array (FE–GA)
FFT See Fast Fourier transform (FFT)
Filter bank, 181, 182, 187
Fine motion estimator/compensator (FME), 7, 104, 112, 115, 117
FIPR See Floating-point inner-product instruction (FIPR)
FIR, 6, 71, 72
Fixed-length ISA, 23–25, 44, 70, 71
Flexible engine (FE), 15, 16, 19, 25–28, 33, 47, 49, 51, 58
Flexible-Engine Description Language (FDL), 83, 84
Flexible engine/generic ALU array (FE–GA), 19, 69, 74–88, 143–146, 149, 150
Floating-point inner-product instruction (FIPR), 26, 45–49, 51, 52, 55, 56, 58–62, 64, 65
Floating-point multiply-accumulate (FMAC) instructions, 26, 44–46, 48–50, 53, 55, 58, 59, 64
Floating-point sine and cosine approximate (FSCA), 57–59, 61, 62, 65
Floating-point square-root reciprocal approximate (FSRRA), 57–62, 64, 65
Floating-point transform vector (FTRV), 26, 46–51, 55–59, 61, 62, 64, 65
Floating-point unit (FPU), 34, 44–62, 68, 69, 127, 136, 144
FMAC See Floating-point multiply-accumulate (FMAC) instructions
FME See Fine motion estimator/compensator (FME)
Forwarding, 22, 28, 34–36, 47, 49, 50, 61, 79
FPU See Floating-point unit (FPU)
Frequency and voltage controller (FVC), 12–14, 102
FSCA See Floating-point sine and cosine approximate (FSCA)
FSRRA See Floating-point square-root reciprocal approximate (FSRRA)
FTRV See Floating-point transform vector (FTRV)
Full HD, 7, 101–103, 105, 106, 109, 110, 118, 119, 147
FVC See Frequency and voltage controller (FVC)

G
Giga operations per second (GOPS), 1, 2, 89, 143, 144, 150
Global history, 32, 33, 38
Golomb, 106, 113, 114
GOPS See Giga operations per second (GOPS)
GUI control, 213–217

H
H.264, 1, 2, 7, 15, 101, 103–106, 108, 113, 117–119, 144, 147, 212
Hardware emulation, 47, 61
Harvard architecture, 24, 28, 31
H-ch See Horizontal channel (H-ch)
Heterogeneous multicore, 3, 4, 7, 8, 11–17, 19, 69, 101–103, 143, 161, 166, 179, 187, 189
Heterogeneous parallelism, 3–8
HEVC See High Efficiency Video Coding (HEVC)
High Efficiency Video Coding (HEVC), 119
HIGHMEM, 161–165
Horizontal channel (H-ch), 89–92, 146
Hypervisor, 169, 170, 175, 210

I
ICIs See Inter-CPU interrupts (ICIs)
Idle reduction, 157–161, 199–201
ILRAM See Instruction local RAM (ILRAM)
Image filtering, 206–210
Image processing, 89, 97, 99, 102, 104, 106, 108, 109, 112, 115–118, 147, 189, 194, 206, 211, 213, 215
In-order, 23, 24, 32
Instruction categorization, 25, 33
Instruction local RAM (ILRAM), 14, 43, 127, 136, 179, 180
Instruction predecoding, 43
Instruction set architecture (ISA), 21, 23–26, 33, 44, 65, 68–74
INTC See Interrupt controller (INTC)
Inter-CPU interrupts (ICIs), 194
Inter-frame parallelism, 185
Interrupt controller (INTC), 20, 67, 140, 141, 171, 175
I/O device, 154, 163, 166
I/O space, 163
IOzone, 164
ISA See Instruction set architecture (ISA)

J
JCT-VC See Joint Collaborative Team on Video Coding (JCT-VC)
Joint Collaborative Team on Video Coding (JCT-VC), 119

L
Latency, 11, 12, 24, 27, 33, 40, 44, 47, 51, 57–59, 61, 69, 102, 107, 108, 130, 132, 148, 165, 166, 176
LCDC See Display controller (LCDC)
LCPG See Local clock pulse generator (LCPG)
Leading nonzero (LNZ) detector, 50, 62
Leakage current, 3, 20, 137, 138
Legacy software, 126
Linux, 134, 135, 141, 142, 153–165, 175, 176, 193–215, 217
Linux kernel, 140, 142, 162, 164, 193, 195, 198, 199, 216
LL/SC instructions, 154–155, 157
LM See Local memory (LM)
LMBench, 155–157, 164, 175, 176
LNZ See Leading nonzero (LNZ) detector
Load balancing, 167, 169, 193–199
Local clock pulse generator (LCPG), 14, 15, 128, 129
Local memory (LM), 6, 11–17, 43, 74–76, 78–80, 83, 84, 86–88, 102, 107, 145, 179, 180, 182–185
Logical partitioning, 169–170

M
Macroblock, 102–104, 106–110, 117–119
Magnetic resonance imaging (MRI), 194, 207
Matrix Engine (MX), 15, 16, 19, 69, 88–100, 187–193
Matrix processor array (MPA), 89, 98, 99, 191
Matrix processor controller (MPC), 89, 90, 98, 99, 191
Memory management unit (MMU), 21, 73, 111, 170, 210
MESI See Modified, Exclusive, Shared, Invalid (MESI) modes
MiBench, 207
Mid-side (M/S) stereo, 181, 182
Million instructions per second (MIPS), 20–23, 31, 40–42, 70–72, 126, 127, 136, 141, 149
MIPS See Million instructions per second (MIPS)
MIPS/W, 4, 5, 21, 22, 32, 41, 42, 68
MMU See Memory management unit (MMU)
Modified, Exclusive, Shared, Invalid (MESI) modes, 68, 194
Motion vector, 106, 107, 190, 191, 193
MP3, 1, 2, 144
MPA See Matrix processor array (MPA)
MPC See Matrix processor controller (MPC)
MPEG-2, 101, 118, 119, 200, 202, 204, 205, 212–214
MPEG-4, 15, 101, 118, 119, 212
MRI See Magnetic resonance imaging (MRI)
M/S stereo See Mid-side (M/S) stereo
Multicore, 3, 4, 7, 8, 11–17, 20, 67–69, 102, 103, 123–126, 136, 137, 140–141, 143, 153–175, 179, 187, 189, 193, 206, 208, 210
Multidomain embedded system, 165, 167, 170
Multimedia, 16, 19–22, 56, 89, 91, 165
Multiprocessor, 67–69, 127, 128, 138, 165, 175, 194
Multithread, 159, 194, 200
MX See Matrix Engine (MX)
MX-1, 88–92, 94, 96–100
MX-2, 69, 88, 97–100, 143–146, 148–150, 193

O
OLRAM, 14, 43, 127, 179
OpenCV, 215
Operating frequency, 3, 28, 32, 38, 41, 57, 65, 67–69, 91, 99, 102–104, 106, 109, 118, 119, 124, 126, 181, 193
Optimizing compiler, 30
Out-of-order, 23, 24, 32–34, 36–38, 41, 52, 55, 57, 58, 65, 114

P
Packet, 131, 132
Page table entry (PTE), 163, 164
Parallel decoding, 23
Parallelization, 22–32, 44–57, 108, 126, 186
Parallel processing, 3, 11, 15, 98, 102, 134, 187, 190, 191
Paravirtualization, 170, 176
PCM See Pulse-code modulation (PCM)
PE See Processing elements (PE)
Physical partitioning, 168–169
Physical partitioning controller (PPC), 165, 170, 172–176
PID See Process ID (PID)
PIPE See Programmable image processing elements (PIPE)
Pipeline hazard, 24, 25, 33
Pitch, 52, 58, 59, 65, 116
PMB See Privileged mapping buffer (PMB)
Pointer controlled pipeline, 38, 39
Pollack’s rule, 20, 123
Power domain, 137–138, 198, 201
Power efficiency, 4, 5, 20, 21, 32, 38, 41, 42, 56, 66, 68–70, 123, 126, 135, 143, 148, 150
Power gating, 198–200
Power management, 68, 110, 128–129, 198–206
Powersave, 158, 160, 199, 202, 207
Power wall, 123
PPC See Physical partitioning controller (PPC)
Prefix, 69–72
Privileged mapping buffer (PMB), 73
Process ID (PID), 197, 198
Processing elements (PE), 6, 7, 89, 95, 115, 116
Processing unit
(PU), 11–14, 16, 17, 103, 115, 117, 212 Programmable image processing elements (PIPE), 7, 112, 115–117, 147 PTE See Page table entry (PTE) PU See Processing unit (PU) Pulse-code modulation (PCM), 181, 184, 186 Q Quantization, 106, 181, 182, 187 R RAYTRACE, 159–161 Real-time image recognition, 187–193 Real-time operating system (RTOS), 165, 166 Reconfigurable processor, 6, 11, 15, 19, 74, 144–146 Reduced instruction set computer (RISC), 19–23, 25, 70, 117, 170 Register conflict, 22, 27, 28, 55, 64 Resource conflict, 22, 24, 25 Resume standby, 67, 68, 137 RISC See Reduced instruction set computer (RISC) Index RP-1 prototype chip, 19, 22, 67, 123, 125–136, 141, 153, 154, 175, 193–198 RP-2 prototype chip, 19, 22, 67, 123, 125, 136–143, 153, 157, 159, 193, 194, 198–206 RP-X prototype chip, 14, 19, 69, 123, 125, 143–150, 153, 161, 162, 164, 165, 193, 194, 206–211 RTOS See Real-time operating system (RTOS) S SAD See Sum of absolute difference (SAD) SEQM See Sequence manager (SEQM) Sequence manager (SEQM), 6, 75–77, 80–81, 83, 145, 146 SH-1, 4, 20, 21 SH-2, 20, 21 SH-3, 21, 31, 32, 40, 41 SH-4, 21–36, 40–42, 44–59, 61–63, 65–67 SH-5, 21 SH-4A, 21, 22, 67, 170, 179, 194, 210 SH core, 19, 67, 70, 179, 184 SH-3E, 44, 56, 58, 59, 65, 66 SH processor, 4, 20, 21, 69 SH-X, 21, 32–43, 56–67 SH-X2, 21, 42–44, 67 SH-X3, 22, 67–70, 126–128, 132, 133, 136, 141 SH-X4, 22, 69–74, 143, 144, 149, 150 SIMD See Single instruction multiple data (SIMD) Single instruction multiple data (SIMD), 6, 7, 16, 19, 45, 58, 60, 88–90, 92, 94, 116, 117, 146, 188 Smallest Univalue Segment Assimilating Nucleus (SUSAN), 207–209 SMP See Symmetric Multiprocessor (SMP) SNC See Snoop controller (SNC) Snoop, 68, 128–131, 135 Snoop controller (SNC), 67, 68, 127, 128, 130, 131, 135 SoC See System on a chip (SoC) Spatiotemporal Markov random field model (S-T MRF), 189–192 Special purpose processor (SPP), 11–17, 101, 102 SPLASH-2, 134, 135, 140–142 Split transaction, 131, 132 SPP See Special purpose processor (SPP) 
SRAM, 6, 89–91, 93, 96, 98, 99, 127, 128, 133, 136, 146
S-T MRF. See Spatiotemporal Markov random field model (S-T MRF)
Store buffer, 33–35, 41
Store with extension (STX), 80, 113, 114
STX. See Store with extension (STX)
Sum of absolute difference (SAD), 94, 191–193
SuperH™, 19–22, 68, 69
SuperHyway, 21, 74, 127, 128, 131–133, 136, 138, 144–146
Superpipeline, 24, 32–36, 38, 41, 43, 65
Superscalar, 22–27, 29–32, 55, 56, 65, 68
SUSAN. See Smallest Univalue Segment Assimilating Nucleus (SUSAN)
Symmetric multiprocessor (SMP), 22, 67, 68, 127–129, 134, 135, 141, 142, 153–156, 193–210
Synchronization, 116, 138–139
System on a chip (SoC), 1, 3, 20, 67, 76, 123–126, 131, 137, 143, 189, 190, 192

T
TAS instruction, 154, 155, 157
Thread, 16, 80–84, 87, 88, 134, 135, 141, 142, 154, 158–160, 190, 191, 194, 200, 208
TLBs. See Translation look-aside buffer (TLBs)
Transformer (TRF), 7, 104, 112, 115, 117
Translation look-aside buffer (TLBs), 28, 154, 161, 164, 170
TRF. See Transformer (TRF)

U
uITRON, 210–215
URAM. See User RAM (URAM)
USB, 171, 202, 203, 208, 210, 215
User RAM (URAM), 14, 127, 137, 138, 179, 180, 184, 185

V
VC-1, 101, 118, 119, 144
V-ch. See Vertical channel (V-ch)
Vertical channel (V-ch), 89, 90, 92–94, 97, 146
VEU. See Video engine unit (VEU)
Video codec, 7, 21, 101–111, 113, 117–119, 212
Video engine unit (VEU), 210–213
Video image search, 210–217
Video processing unit (VPU), 15, 16, 19, 101–119, 143, 144, 146, 147, 149, 210–214
Virtualization, 170
Virtual Socket Interface (VSI), 131
VPU. See Video processing unit (VPU)
VSI. See Virtual Socket Interface (VSI)

W
Way prediction, 43

X
Xeyes, 204
Xosview, 205–206
XREG, 97–98

Z
Zero-cycle transfer, 24, 28, 31, 47